Title: Test results for AVX and AVX-512 needed
Post by: Gunther on December 21, 2017, 11:43:25 AM

The attached file float.zip contains the sources and binaries of the test program. It calculates the sum of all elements of a floating point array. That's useful for the computation of the arithmetic mean, for example.

The program uses 5 different methods:

Simple C implementation with 1 accumulator.
More sophisticated C implementation with 4 accumulators.
SSE2 implementation with 4 accumulators.
AVX implementation with 4 accumulators.
AVX-512F implementation with 4 accumulators.
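
To illustrate the difference between the first two variants, here is a minimal sketch in C. It is not the code from float.zip: the name Sum1AccuC appears later in this thread, but Sum4AccuC, both signatures, and the handling of leftover elements are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every addition has to wait for the previous one. */
float Sum1AccuC(const float *x, size_t n)
{
    assert(x != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Four accumulators: four independent dependency chains that the
   CPU can keep in flight at the same time. (Hypothetical name.) */
float Sum4AccuC(const float *x, size_t n)
{
    assert(x != NULL);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* leftover elements, if n is not a multiple of 4 */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because floating point addition is not associative, the two functions can differ in the last digits for general data; for small integer-valued inputs they agree exactly.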

The application checks via CPU dispatching which instruction sets are supported. So nothing bad will happen if, for example, AVX-512 isn't available. You'll only get a message on the screen and the AVX-512 procedure isn't called. That's all.

Here is a typical output on my Skylake box:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 55.81 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 25.67 Seconds
Performance Boost = 217%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.53 Seconds
Performance Boost = 1583%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.06 Seconds
Performance Boost = 2707%

Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5              = 8390656.00
Elapsed Time      = 1.06 Seconds
Performance Boost = 5250%

As you can see, AVX-512F increases the speed in a dramatic way. There are a lot of powerful new instructions available.

Test results with other environments are very welcome. Have fun.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: jj2007 on December 21, 2017, 12:28:16 PM

Intel Core i5, Win7-64:

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 74.94 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 49.95 Seconds
Performance Boost = 150%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 6.36 Seconds
Performance Boost = 1177%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 3.29 Seconds
Performance Boost = 2277%

Your current CPU doesn't support the AVX-512 instruction set.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Siekmanski on December 21, 2017, 12:39:47 PM

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8.1
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 54.22 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 36.31 Seconds
Performance Boost = 149%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 4.61 Seconds
Performance Boost = 1175%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.41 Seconds
Performance Boost = 2251%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 21, 2017, 12:46:20 PM

Thank you for the results, Jochen. Your CPU doesn't support AVX-512F. But on the other hand, Windows 7 doesn't allow that instruction set. I'm not sure about that behavior of Intel and MS. Is it only the usual sales pitch?

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 21, 2017, 12:55:55 PM

Hi Marinus,

thank you for testing the program. You've a good machine with an interesting environment (6 cores :t). The simple C code is a bit faster on your machine; that's surprising, the other results are as expected.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: sinsi on December 21, 2017, 01:51:14 PM

i7 4790, Windows 10 x64 Pro

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 47.35 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 31.83 Seconds
Performance Boost = 149%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.92 Seconds
Performance Boost = 1208%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 1.98 Seconds
Performance Boost = 2395%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

Title: Re: Test results for AVX and AVX-512 needed
Post by: hutch-- on December 21, 2017, 02:04:14 PM

Haswell E/EP at 3.3 gig.

Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 51.53 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 35.00 Seconds
Performance Boost = 147%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 4.33 Seconds
Performance Boost = 1190%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 2.19 Seconds
Performance Boost = 2356%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 21, 2017, 09:12:56 PM

I built with VS 2017 and got amazing results:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 0.00 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 0.00 Seconds
Performance Boost = -nan(ind)%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 5.78 Seconds
Performance Boost = 0%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 2.79 Seconds
Performance Boost = 0%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

Project attached.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 21, 2017, 09:43:56 PM

Thank you Hutch and Sinsi for testing. Interesting results, indeed.

Quote
I built with VS 2017 and got amazing results:

aw27, indeed, the results are a bit confusing. The C sources contain nothing special, so they should compile without problems under VS. But the results show that there's something wrong with the time measurement. Could you attach the running EXE, please? I have no Visual Studio running here. The funny thing is that the times for the assembly language procedures are realistic. What happens with the C procedures? Very strange.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 21, 2017, 09:52:12 PM

@Gunther
I would say that VS was able to optimize a little better. Nothing new, it happens frequently. I include also the asm listing of floatfunc

Title: Re: Test results for AVX and AVX-512 needed
Post by: HSE on December 22, 2017, 02:20:21 AM

Hi Gunther!
Just tested on a 64-bit notebook (Windows 7-64 SP1); the system reports a problem after I see:

Assembly Language with 4 YMM accumulators:

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 22, 2017, 02:38:38 AM

Hi HSE,

that's a bit strange, because the CPU dispatching mechanism is testing for available instruction sets. I've tested the software under my old box running Win 7-64 with SP1; it works fine under this environment. Do you have an AMD CPU?

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 22, 2017, 04:56:00 AM

So, here I'm back with a few answers. Under the first post of this thread is attached the ZIP archive floatassembly.zip. It contains the sources and binaries of another test suite. Why that? The reason for it is this post. (http://masm32.com/board/index.php?topic=6776.msg72578#msg72578)

Quote from: aw27
I would say that VS was able to optimize a little better.

That's only half the story, and that's the smaller half. Of course, aw27's application shows the same behavior on my machines. And yes, VS uses a very aggressive optimization strategy, which has nothing to do with defensive programming. But it's not aggressive enough to produce a Zero result for the time measurement.

I don't have VS running here, but luckily aw27 sent the assembler source produced by VS with a very high optimization level; you can find it under this post. (http://masm32.com/board/index.php?topic=6776.msg72580#msg72580)

I can only assume some big trickery by the compiler builders. A short rough calculation should illustrate my point. In the simple C implementation, VS uses one accumulator (xmm0) and does 8 scalar additions per loop iteration. The floating point operations are pipelined. On the other hand, we have a long dependency chain, because each operation depends on the previous one. The more sophisticated C implementation uses 4 accumulators and does 4 scalar additions in parallel per loop iteration. That makes a big difference, because we have 4 dependency chains, but each is only a quarter as long.

Now the explicitly vectorized code: the XMM version does 16 additions, the YMM version 32 additions, and the ZMM version 64 additions per loop iteration. Could it be that 8 pipelined scalar additions are faster than 64 vectorized additions? Definitely not. It is striking that VS doesn't touch the assembly language part (it's not C); on the other hand, it seems to me that the original C code doesn't run through the 15000000 function calls with parameter passing, call and return. This leads to the zero time results. But that wouldn't be a super mega optimization; it would be a big scam.

I think that I can prove that. What have I done? I've put the assembly language sources generated by VS into 2 separate assembly language procedures. The C source is compiled without optimizations. Why? There's nothing time critical in it. The procedure to fill the array with a defined number pattern is called once and every C compiler should generate reasonable code for that. Every procedure is called 15000000 times; the loop overhead is the same for each of them. That's exactly what we want. Here are the results on my machine:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 56.00 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 13.86 Seconds
Performance Boost = 404%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.44 Seconds
Performance Boost = 1629%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.05 Seconds
Performance Boost = 2734%

Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5              = 8390656.00
Elapsed Time      = 1.05 Seconds
Performance Boost = 5354%

These are realistic results, I think. But this must be tested with other machines and environments. Thank you for your help.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 22, 2017, 05:14:48 AM

hi Gunther!
I did not apply any particular optimization, just the standard release mode.
However, I took a closer look to find out what sort of trickery VS used, so you can change the test if you want, of course.
The trickery was simple. VS simply noticed that
   for (i = 0; i < FINISH; i++) {
      sum1 = Sum1AccuC(X, N);
   }
never changes the value of sum1, so why bother to repeat the loop if they already got the result. :icon_eek:
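
One common way to keep an optimizer from collapsing such a loop is to make the argument opaque and to consume every result. The following is only a sketch under assumptions: sum_array and run_benchmark are hypothetical names standing in for Sum1AccuC and the timing loop, and whether a given compiler still finds a shortcut should be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the routine under test (name and signature assumed). */
static float sum_array(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Calls sum_array 'repeats' times and returns the total of all results.
   The pointer is re-read from a volatile object on every pass, so the
   compiler cannot assume the argument is the same as last time, and
   every result feeds into the return value instead of being overwritten. */
double run_benchmark(float *x, size_t n, long repeats)
{
    assert(x != NULL && repeats >= 0);
    float *volatile opaque = x;     /* each read yields an "unknown" pointer */
    double total = 0.0;
    for (long i = 0; i < repeats; i++)
        total += sum_array(opaque, n);
    return total;
}
```

In the test program this would correspond to wrapping the clock() calls around run_benchmark(X, N, FINISH) and dividing the returned total by FINISH to recover the sum.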

Title: Re: Test results for AVX and AVX-512 needed
Post by: Siekmanski on December 22, 2017, 05:33:22 AM

Results from floatassembly,

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 53.91 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 18.37 Seconds
Performance Boost = 293%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 4.62 Seconds
Performance Boost = 1167%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.42 Seconds
Performance Boost = 2227%

Your current CPU doesn't support the AVX-512 instruction set.

Title: Re: Test results for AVX and AVX-512 needed
Post by: hutch-- on December 22, 2017, 05:34:35 AM

I have seen this happen before: when the test was designed for manual code, it did not allow for compiler optimisation, which will just short-circuit the test method if it's a simple repeat. Looks flashy but is simply not making the comparison that the author had in mind. The test needs to be designed so that the compiler can no longer simply short-circuit it; once the compiler cannot do this, the test may be valid.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 22, 2017, 07:17:47 AM

Steve,

Quote from: hutch--
Looks flashy but is simply not making the comparison that the author had in mind.

Flashy? That's putting it very mildly. But that's exactly the point.

Quote from: aw27
The trickery was simple. VS simply noticed that
for (i = 0; i < FINISH; i++) {
sum1 = Sum1AccuC(X, N);
}
never changes the value of sum1, so why bother to repeat the loop if they already got the result. :icon_eek:

That was my guess. That wasn't intended and it isn't fair, is it? It amounts to something like disqualifying the programmer. Two things, by the way: What does your machine say now with the new code? And: It's an interesting product line that you offer on your website. Furthermore, take care with VS; it has weaknesses. One weakness of MS libc was discussed in this thread. (http://masm32.com/board/index.php?topic=881.msg7665#msg7665)

Thank you Marinus for testing. Your results are looking very reasonable.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 22, 2017, 08:12:34 AM

You should not ask the compiler to sub-optimize the overall C code when you do your best to optimize the ASM.
I am sure you can produce a better test, that will also prevent the cache effects of a tight loop like this.
And, I am not interested in compiler wars. :t

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 22, 2017, 08:49:03 AM

Hi aw27,

Quote from: aw27
And, I am not interested in compiler wars. :t

No offense, please. You'll have good reasons to use VS. I had a copy of VS running here, but it's outdated. So I have to renew it through my university. But there is no chance to do that between Christmas and New Year.

Quote from: aw27
You should not ask the compiler to sub-optimize the overall C code when you do your best to optimize the ASM.

Good point. On the other hand, the C code isn't the bottleneck.

Quote from: aw27
I am sure you can produce a better test, that will also prevent the cache effects of a tight loop like this.

That may or may not be. It's possible that I didn't think it through. But with 11 MB of L3 cache there shouldn't be many cache misses.

On the other hand, I have a hard time with many low blows behind me and am glad to be able to work reasonably well in the forum again. The last year was really not easy for me. Maybe I'll talk about some things soon inside the Soap Box. Maybe at the moment I lack the sophistication to write reasonably good software. I'll do my best to change the situation, but it was a school of hard knocks. I hope you'll understand that.

Nevertheless many thanks for the help of so many members in the forum, for example: Steve (aka Hutch--), Marinus, Jochen, Habran and, and, and ... Also for your help, aw27, I am very grateful.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: HSE on December 22, 2017, 11:28:36 AM

Quote from: Gunther on December 22, 2017, 02:38:38 AM
Do you have an AMD CPU?

No. Intel Pentium B960@2.2GHZ

I installed ArkDasm, and it shows me:

; --------------------------------------------------------------------------
; sub_00401cc2
; --------------------------------------------------------------------------
sub_00401cc2   proc
main:0000000000401cc2 mov eax, 0x80
main:0000000000401cc7 vmovaps ymm0, ymmword ptr [rcx]   <-- this is highlighted

rcx = 000000000022bd80

Features say:

 MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
 POPCNT, SSE4.2

 featurenumber = 13

Title: Re: Test results for AVX and AVX-512 needed
Post by: hutch-- on December 22, 2017, 11:34:00 AM

:biggrin:

Gunther,

It won't take long to get back up to pace, on the bright side, when you are on the comeback trail you re-emerge with new ideas and a fresh look at things you may have missed before. :t

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 22, 2017, 01:52:18 PM

Steve,

thank you very much for your warm and heartfelt words. I'll do my best to contribute some ideas and code snippets. In general, my contributions are not very valuable; there's enough room for improvements.

HSE,

Quote from: HSE
Features say:

MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
POPCNT, SSE4.2

featurenumber = 13

With featurenumber 13 you should have AVX2 available.

Could you run Isets.exe from the first post of this thread (http://masm32.com/board/index.php?topic=1405.msg14224#msg14224), please? It's in IsetsAVX-512.zip and should do no harm on your machine. The program output should be similar to this, without the AVX-512 section:

        Supported Features by Processor and Operating System
        ====================================================

Vendor String: GenuineIntel
Brand  String: Intel(R)Core(TM)i7-7820XCPU@3.60GHz

        Instruction Sets
        ----------------

MMX  SSE  SSE2  SSE3  SSSE3  SSE4.1  SSE4.2  AVX  AVX2
AVX-512 F  - Fundamental Instructions
AVX-512 DQ - Double and Quad Word Instructions
AVX-512 CD - Conflict Detection Instructions
AVX-512 BW - Byte and Word Support Instructions
AVX-512 ER - Exponential and Reciprocal Instructions


Please, press enter to end the application ...

You'll find the explanation in this post. (http://masm32.com/board/index.php?topic=1405.msg72523#msg72523) That should help.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 22, 2017, 06:53:17 PM

Quote
With featurenumber 13 you should have AVX2 available

I don't think the featurenumber is correctly calculated. I made my tests on a SandyBridge which has no support for AVX2 and it was considering it with featurenumber 12. It did not crash because the tests did not require AVX2. Intel Pentium B960 is a SandyBridge without AVX support.
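
For reference, a detection routine usually has to check both the CPU and the operating system before taking an AVX or AVX-512 code path. The sketch below is not the featurenumber code from the test program; the function and macro names are made up for this example. It only encodes the documented bit tests and takes the raw register values as parameters: ECX from CPUID leaf 1, EBX from CPUID leaf 7 (subleaf 0), and XCR0 as returned by XGETBV with ECX = 0. XGETBV itself may only be executed if the OSXSAVE bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID1_ECX_OSXSAVE  (1u << 27)  /* OS has enabled XSAVE/XGETBV */
#define CPUID1_ECX_AVX      (1u << 28)  /* CPU implements AVX */
#define CPUID7_EBX_AVX512F  (1u << 16)  /* CPU implements AVX-512F */
#define XCR0_SSE_AVX        0x06u       /* OS saves XMM and YMM state */
#define XCR0_AVX512         0xE0u       /* OS saves opmask and ZMM state */

/* AVX is usable only if the CPU has it AND the OS saves the YMM state. */
bool avx_usable(uint32_t cpuid1_ecx, uint64_t xcr0)
{
    uint32_t need = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX;
    if ((cpuid1_ecx & need) != need)
        return false;
    return (xcr0 & XCR0_SSE_AVX) == XCR0_SSE_AVX;
}

/* AVX-512F additionally needs its CPUID bit and the ZMM/opmask state. */
bool avx512f_usable(uint32_t cpuid1_ecx, uint32_t cpuid7_ebx, uint64_t xcr0)
{
    if (!avx_usable(cpuid1_ecx, xcr0))
        return false;
    if ((cpuid7_ebx & CPUID7_EBX_AVX512F) == 0)
        return false;
    return (xcr0 & XCR0_AVX512) == XCR0_AVX512;
}
```

With checks of this kind, a CPU that reports neither the OSXSAVE nor the AVX bit would make avx_usable return false, so the vmovaps ymm0 instruction would never be reached.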

Quote
On the other hand, the C code isn't the bottleneck

The simple fact is that the compiler is particularly good at optimizing traditional C code and particularly bad with vector instructions.
This has been shown in this article https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language .
Although a few people said the ASM was not well optimized, more than 6 months later the author is still sitting and waiting for someone to produce a version able to beat the compiler.

Title: Re: Test results for AVX and AVX-512 needed
Post by: hutch-- on December 22, 2017, 07:21:19 PM

:biggrin:

> more than 6 months later the author is still sitting and waiting for someone to produce a version able to beat the compiler

I wonder with such open ended comparisons if anyone could be bothered responding.

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 22, 2017, 07:35:18 PM

Quote from: hutch-- on December 22, 2017, 07:21:19 PM
:biggrin:

> more than 6 months later the author is still sitting and waiting for someone to produce a version able to beat the compiler

I wonder with such open ended comparisons if anyone could be bothered responding.

With open ended comparisons each response should be seen as a unique opinion, not statistically significant. :biggrin:

Title: Re: Test results for AVX and AVX-512 needed
Post by: HSE on December 23, 2017, 01:07:03 AM

        Supported Features by Processor and Operating System
        ====================================================

Vendor String: GenuineIntel
Brand  String: Intel(R)Pentium(R)CPUB960@2.20GHz

        Instruction Sets
        ----------------

MMX  SSE  SSE2  SSE3  SSSE3  SSE4.1  SSE4.2


Please, press enter to end the application ...

Evidently there is no AVX, but I was still expecting some automatic detection in the program. Thanks.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 23, 2017, 02:07:07 AM

Hi HSE,

Quote from: HSE
Evidently there is no AVX, but I was still expecting some automatic detection in the program. Thanks.

I think that aw27 is right. He wrote:

Quote from: aw27
I don't think the featurenumber is correctly calculated. I made my tests on a SandyBridge which has no support for AVX2 and it was considering it with featurenumber 12. It did not crash because the tests did not require AVX2. Intel Pentium B960 is a SandyBridge without AVX support.

That's the point. On the other hand, the main procedure should make sure that the AVX code path isn't called. It contains the following code snippet:

// Check AVX support

    if (featurenumber >= 12){                 // can we use AVX? 
        printf("\nAssembly Language with 4 YMM accumulators:\n");
        printf("------------------------------------------\n");
        start = clock();                      // yes
        for(i = 0; i < FINISH; i++){
            sum4 = Sum4AccuYMM(X, N);
        }
        stop = clock();
        t4 = (float)(stop-start)/(float)CLOCKS_PER_SEC;
        boost = t1*100/t4;
        printf("sum4              = %.2f\n", sum4);
        printf("Elapsed Time      = %.2f Seconds\n", t4);
        printf("Performance Boost = %.0f%%\n",boost);
    }
    else{                                     // no: print a message
        printf("\nYour current CPU doesn't support the AVX instruction set.\n");
        printf("You'll need at least the Sandy Bridge or Ivy Bridge architecture.\n\n");
        printf("The application terminates now.\n");
    }
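
The boost formula in the snippet above, t1*100/t4, divides by a measured time. When a compiler removes the timed loop, as in the VS 2017 run earlier in this thread, the time is 0.00 and the output becomes -nan(ind)% or 0%. A small guard can make that case explicit. This is only a sketch: elapsed_seconds and performance_boost are hypothetical helpers, and returning -1.0 as the "not measurable" marker is an arbitrary choice for this example.

```c
#include <assert.h>
#include <time.h>

/* Converts two clock() readings into seconds. */
double elapsed_seconds(clock_t start, clock_t stop)
{
    return (double)(stop - start) / (double)CLOCKS_PER_SEC;
}

/* Returns t_ref / t in percent, or -1.0 if either time is not positive,
   which means the measurement cannot be used for a comparison. */
double performance_boost(double t_ref, double t)
{
    if (t_ref <= 0.0 || t <= 0.0)
        return -1.0;
    return t_ref * 100.0 / t;
}
```

A caller could then print a message such as "not measurable" whenever the result is negative instead of printing a percentage.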

I'm not sure if I have posted the latest version in the forum. So, I re-compiled the sources and under the first post of this thread you'll find the latest versions of both applications: float.exe and floatassembly.exe (with sources, of course). I hope that both won't crash on your machine. Could you test that, please? Please excuse the inconvenience.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 23, 2017, 02:31:24 AM

Hi aw27,

Quote from: aw27
The simple fact is that the compiler is particularly good at optimizing traditional C code and particularly bad with vector instructions.
This has been shown in this article https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language .
Although a few people said the ASM was not well optimized, more than 6 months later the author is still sitting and waiting for someone to produce a version able to beat the compiler.

I could only skim the article. But it seems to be very instructive. It could be a great challenge for the members of our forum. I've seen more than once how the compiler's excellent code has been improved and beaten. I'm just a simple lover of assembler code. There are many forum members who can do that better and contribute higher quality posts and applications. But my experience is: At the beginning, let the compiler do the job with the highest level of optimization. That's the right starting point. Then check the resulting assembler code very carefully and start with the improvements. Only then does one have a chance to win. Of course, someone can say: That's not fair. Well, life is not fair.

As I said, that's just my humble point of view.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 23, 2017, 03:16:59 AM

Hi Gunther,

Learning assembly language is important, even on the day compilers are able to do better than humans in every case. That day will arrive. I still remember the days when most people believed it was impossible for a machine to win at chess against a Grand Master, because machines had no global vision, could not recognize patterns, had no sense of position, and were not able to think in strategic terms; they could only use brute force. They were wrong: most chess programs nowadays easily beat every chess Grand Master.

Title: Re: Test results for AVX and AVX-512 needed
Post by: felipe on December 23, 2017, 04:18:36 AM

Haha, and here we go again, right? ;)

:lol:

I would say that if that were entirely true, machines and computers would do everything better some day, but I think that's not correct. It's just a simple generalization. Humans will always be smarter than machines, even if we don't realize it. :biggrin:

Btw, I always question the real importance of chess. Maybe it's a stupid game. Humans have largely given machines the role of doing stupid and brutal things. So they can win a chess game, but a cat can piss on a computer. :lol:

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 23, 2017, 04:50:27 AM

Hi aw27,

Quote from: aw27
Learning assembly language is important, even on the day compilers are able to do better than humans in every case. That day will arrive.

Right. Every programmer should know how computers work internally, how hardware is accessed, etc. We'll see whether that day comes.

Quote
I still remember the days when most people believed it was impossible for a machine to win at chess against a Grand Master, because machines had no global vision, could not recognize patterns, had no sense of position, and were not able to think in strategic terms; they could only use brute force. They were wrong: most chess programs nowadays easily beat every chess Grand Master.

That's more complicated than it seems at first glance. Here (http://www.computerchess.org.uk/ccrl/4040/) is one of the best rating lists of chess engines. It's updated at least weekly and very precise. I think your statement is true for the top scorers: Stockfish (by the way: Asmfish is a Stockfish derivative), Komodo, Houdini, Shredder etc.

I'm not a top correspondence chess player; my cc ELO is around 2300. By comparison, the world's top-ranked player (https://www.iccf.com/RatingList.aspx) has an ELO of 2688, because we don't have the usual ELO inflation. The calculation of our ELO numbers is a bit different, but that has proven itself. I'm using chess engines daily and with a little luck I'll qualify for the semifinals of the European Championship. However, I had to learn a lot the hard way. It's wrong to think: I'm using a chess engine now and will beat everyone else. You have to feed the chess engine with your own strategic ideas and then check that you have not overlooked any tactical finesse. That's the art. The only thing chess engines do is brute force. It has often happened to me that the engine suggests moves that ruin the pawn structure. That's poison for the entire game. So you have to look for an alternative and you often need a second opinion. But that's a big field and if anyone wants to keep discussing these questions, we should do that in a separate thread in the Soap Box. I still have a lot to talk about chess engines.

By the way: What does your machine say now with the updated software in floatasm.zip?

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 23, 2017, 04:59:28 AM

@Felipe
We are in the era of self-driving cars! Any cheap robot can kick that stupid cat out of the window, reducing its many lives by one!

Title: Re: Test results for AVX and AVX-512 needed
Post by: felipe on December 23, 2017, 05:12:33 AM

:biggrin:

Btw:

Quote from: Gunther on December 23, 2017, 04:50:27 AM
But that's a big field and if anyone wants to keep discussing these questions, we should do that in a separate thread in the Soap Box. I still have a lot to talk about chess engines.

Title: Re: Test results for AVX and AVX-512 needed
Post by: six_L on December 23, 2017, 05:18:19 AM

Quote
Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 62.32 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 44.33 Seconds
Performance Boost = 141%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 5.80 Seconds
Performance Boost = 1075%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 2.91 Seconds
Performance Boost = 2145%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

Quote
Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1 = 8390656.00
Elapsed Time = 62.45 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2 = 8390656.00
Elapsed Time = 20.63 Seconds
Performance Boost = 303%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 5.19 Seconds
Performance Boost = 1204%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 2.62 Seconds
Performance Boost = 2381%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 23, 2017, 05:26:07 AM

Thank you six_L for running the software and providing the results. What's your environment? I assume at least Windows 7-64. The processor would be interesting: Intel or AMD?

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 23, 2017, 05:35:46 AM

What I mean is this:

Quote from: felipe
Haha, and here we go again, right? ;)

:lol:

I would say that if that were totally true, machines and computers would do everything better some day, but I think that's not correct. It's just a simple generalization. Humans will always be smarter than machines, even if we don't realize it. :biggrin:

Btw, I always question the real importance of chess. Maybe it's a stupid game. To a large extent, humans have given machines the role of doing stupid and brutal things. So they can win a chess game, but a cat can piss on a computer. :lol:

or that:

Quote from: aw27
@Felipe
We are in the era of self-driving cars! Any cheap robot can kick that stupid cat out of the window, reducing its many lives by one!

I am not the senior teacher here, just a simple forum member in the last row. Wouldn't it be better to discuss such deep philosophical questions in separate threads in the Soap Box or in the Colosseum?

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: felipe on December 23, 2017, 05:38:49 AM

Code Select



Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 75.10 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 25.73 Seconds
Performance Boost = 292%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 6.63 Seconds
Performance Boost = 1132%

Your current CPU doesn't support the AVX instruction set.
You'll need at least the Sandy Bridge or Ivy Bridge architecture.

The application terminates now.

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

Windows 8.1... and an older Bay Trail :redface:

Title: Re: Test results for AVX and AVX-512 needed
Post by: six_L on December 23, 2017, 05:41:10 AM

Quote from: Gunther on December 23, 2017, 05:26:07 AM
What's your environment?
Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: felipe on December 23, 2017, 05:41:53 AM

Yeah, that's what I was saying with this:

Quote from: felipe on December 23, 2017, 05:12:33 AM
Btw:
Quote from: Gunther on December 23, 2017, 04:50:27 AM
But that's a big field and if anyone wants to keep discussing these questions, we should do that in a separate thread in the Soap Box. I still have a lot to talk about chess engines.
:t

Title: Re: Test results for AVX and AVX-512 needed
Post by: nidud on December 23, 2017, 06:00:00 AM

deleted

Title: Re: Test results for AVX and AVX-512 needed
Post by: FORTRANS on December 23, 2017, 08:28:46 AM

Hi Gunther,

i3, Win 8.1, notebook.

Code Select


Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 108.83 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 73.05 Seconds
Performance Boost = 149%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 9.08 Seconds
Performance Boost = 1199%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 4.59 Seconds
Performance Boost = 2369%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 107.92 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 36.27 Seconds
Performance Boost = 298%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 9.08 Seconds
Performance Boost = 1189%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 4.58 Seconds
Performance Boost = 2357%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

HTH,

Steve N.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 23, 2017, 12:04:43 PM

Felipe,

thank you for testing floatasm. It's simply the code of VS which aw27 provided. I think that I should re-arrange the if statement. My fault, excuse me, please.

six_L,

special thanks for your detailed environment information. Where can I find Raistlin's software?

nidud,

Quote from: nidud
I have hardware with support up to AVX-2 but AVX-512 is now implemented in Asmc. Good to see hardware is available for testing.

Wow, impressive link. It seems that you've included the complete instruction set, including the new mask registers. :t

Steve (aka FORTRANS),

I am looking forward to hearing from you again. We had a long break. I very much hope that we will work together in as comradely a way as we used to. With that in mind: thank you for testing. Not bad for a small i3.

To sum up, so far all testers are driving on the Intel rail. Is AMD out of fashion?

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: HSE on December 23, 2017, 12:13:06 PM

Perfect now :t

Float:

Code Select


Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 85.06 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 56.41 Seconds
Performance Boost = 151%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 7.21 Seconds
Performance Boost = 1180%

Your current CPU doesn't support the AVX instruction set.
You'll need at least the Sandy Bridge or Ivy Bridge architecture.

The application terminates now.

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

Floatassembly:

Code Select


Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 84.75 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 28.91 Seconds
Performance Boost = 293%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 7.19 Seconds
Performance Boost = 1178%

Your current CPU doesn't support the AVX instruction set.
You'll need at least the Sandy Bridge or Ivy Bridge architecture.

The application terminates now.

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 23, 2017, 12:36:04 PM

Hi HSE,

good to see that. Please excuse the inconvenience. But where the hell did instruction set number 13 come from? I have no answer, to be honest.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: felipe on December 23, 2017, 02:26:20 PM

Gunther, here you will find that great work from Raistlin:

http://masm32.com/board/index.php?topic=5964.0

:t

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 23, 2017, 07:03:23 PM

Just to see if the ZMM part works (with my modified ASM used in floatasm), I used the Intel Emulator selecting Icelake CPU.

Code Select


Calculating the sum of a float array with Intel Emulator (150000 iterations only)...
Emulating Icelake CPU


Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 0.71 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 0.28 Seconds
Performance Boost = 252%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 0.61 Seconds
Performance Boost = 117%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 0.49 Seconds
Performance Boost = 143%

Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5              = 8390656.00
Elapsed Time      = 2.85 Seconds
Performance Boost = 25%

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 24, 2017, 12:40:27 AM

Thank you for the link, felipe. :t

aw27,

Did you have any doubt that the ZMM code works? Another point: the time for 4 scalar additions per loop cycle is 0.28 seconds, while 16 vectorized additions per loop cycle need 0.61 seconds? That's strange.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 24, 2017, 01:03:01 AM

Not really doubts, I found this more interesting from a programmer's point of view.

Quote
Another point: the time for 4 scalar additions per loop cycle is 0.28 seconds, while 16 vectorized additions per loop cycle need 0.61 seconds?

I have not checked how the emulator works. I believe it was not built with performance in mind, but simply to emulate instructions. This is usually done by single-stepping through the code and replacing unsupported instructions with some routine.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 24, 2017, 02:48:42 AM

Hi aw27,

Quote from: aw27
I have not checked how the emulator works. I believe it was not built with performance in mind, but simply to emulate instructions. This is usually done by single-stepping through the code and replacing unsupported instructions with some routine.

Well, be that as it may, the strangeness remains.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 24, 2017, 03:37:33 AM

I don't find it very strange, because surely well over 90% of the time is not spent on the exercise's sum computation.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 24, 2017, 04:48:06 AM

Hi aw27,

Quote from: aw27 on December 24, 2017, 03:37:33 AM
I don't find it very strange, because surely well over 90% of the time is not spent on the exercise's sum computation.

That's a very pessimistic estimate. I really do not want fruitless discussions here. Furthermore, I won't be penny-wise and pound-foolish. But is it right that the small loop overhead (a very simple for loop), passing 2 parameters plus an appropriate call and return needs more time than the floating point calculation of the array sum of 4096 elements?
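For readers who have not opened the archive, the kind of routine being timed can be sketched in C roughly as follows. This is only a minimal sketch of the "4 accumulators" idea, not the code from float.zip; the names `sum4` and `fill_ramp` are made up here, and filling the array with 1, 2, ..., 4096 is an assumption that merely happens to be consistent with the printed sum.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n floats using four independent accumulators (hypothetical name). */
float sum4(const float *a, size_t n)
{
    assert(a != NULL || n == 0);

    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four additions per iteration with no dependency
       between them, so the CPU can overlap them. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Tail: fewer than four elements left. */
    for (; i < n; ++i)
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}

/* Assumed test data: a[i] = i + 1. The thread does not state how the
   original program initializes its array. */
void fill_ramp(float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = (float)(i + 1);
}
```

With 4096 elements filled by `fill_ramp`, `sum4` returns 8390656, the value shown in the listings. The per-call overhead in question is then one call, two parameters and a return around a loop of 1024 iterations.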

Incidentally, this is not an unusual scenario. I'll give a practical example: For several years, our software for fractal image compression has been running at CERN. The encoder consists of about 150000 lines of code, as well as the decoder. The decoding process is no problem and runs in real time. The encoding is very expensive - hard number crunching. We've profiled the entire encoding process and we found out the following: There are 6 procedures, less than 0.5% of the code. They formed the bottleneck. With a small image of 256x256 pixels, they are called over 33 million times. With the doubling of the image size, the effort increases by a factor of four. Those were the time wasters.

What happened inside the procedures? Very simple things: calculation of the arithmetic mean, the root mean, the scalar product, always of 2 vectors. No witchcraft. So we implemented only these procedures in assembly language with different code paths: classic usage of the good old FPU, SSE2 code, AVX code. The reason is simple: The PC farm at CERN consists of just over 20,000 computers and is very heterogeneous. That has historical reasons. Since these are true color images, of course, the calculation of the individual color planes was parallelized. That was not easy, because there is loose coupling inside the cluster and tight coupling between the cores. But it works well now. To sum up: We hope that we can get close to real time with the AVX-512 code part. That would be a large step forward.
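The "different code paths" idea can be sketched in C as a small dispatch table. This is a hedged illustration only: the names `isa_level`, `dot_fn` and `select_dot` are hypothetical, the detection of the best level (normally done via CPUID) is left out, and the vector kernels are plain-C stand-ins because the actual assembly routines are not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability levels, lowest to highest. A real program
   would derive the best level from CPUID; that part is not shown. */
typedef enum { ISA_FPU, ISA_SSE2, ISA_AVX, ISA_AVX512F, ISA_COUNT } isa_level;

typedef float (*dot_fn)(const float *x, const float *y, size_t n);

/* Reference kernel: scalar product of two vectors. */
static float dot_scalar(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* Stand-ins for the hand-written kernels (assumption: same signature).
   Here they simply forward to the scalar version. */
static float dot_sse2(const float *x, const float *y, size_t n)
{
    return dot_scalar(x, y, n);
}

static float dot_avx(const float *x, const float *y, size_t n)
{
    return dot_scalar(x, y, n);
}

static float dot_avx512f(const float *x, const float *y, size_t n)
{
    return dot_scalar(x, y, n);
}

/* Pick the code path once, then call through the pointer in the hot loop. */
dot_fn select_dot(isa_level best)
{
    static const dot_fn table[ISA_COUNT] = {
        dot_scalar, dot_sse2, dot_avx, dot_avx512f
    };
    assert((unsigned)best < (unsigned)ISA_COUNT);
    return table[best];
}
```

Selecting the pointer once at start-up keeps the feature check out of a routine that is called millions of times.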

That's what I'm doing at the moment. This is my small, modest contribution in the hunt for the Higgs boson and other particles. That's why I designed and wrote the test bed in this form and not differently. Although it looks a bit stupid at first glance, it certainly has a real background. For the longtime members of the forum that was certainly a boring repetition, because they have known for years what I'm doing. I wrote it down only for the newer members and ask for leniency.

All in all, all the testers have helped me a lot, and for that I would like to thank you once again.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: aw27 on December 24, 2017, 06:15:30 AM

Hi Gunther,
Maybe we are talking about different things. This runs under an emulator, and the emulator takes a lot of time to do the emulation because it runs in single-step mode, i.e., it sets the trap flag, which invokes INT 01 after each instruction, which in turn causes a call to an ISR in the kernel. The ISR checks the next instruction to be executed and replaces it with a call to a user-mode routine if necessary. This procedure is repeated until the program ends. This is what I believe happens; I don't know about alternatives, but they may exist. I cannot check against the emulator's source code because it is not available.

Title: Re: Test results for AVX and AVX-512 needed
Post by: hutch-- on December 24, 2017, 07:13:50 AM

Gunther,

I have not really kept up with this conversation but your last post had some interesting stuff in it in relation to the isolation of the main bottleneck in doing very large counts of number crunching calculations. I gather the 20000 computers are x86/64 based rather than Itaniums and that there must be some dedicated hardware to interface them so that the calculation load can be distributed in a useful manner. What I wonder is if the hotspot where the time is being taken can be tuned so that a section of the processing power deals directly with this bottleneck.

It's the usual stuff here: bash the guts of the calculation code in its single-thread form to extract the maximum speed, then find a technique to parallel-process the workload, then multithread the work based on the thread count of each processor to get more hardware working on the main bottleneck.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 24, 2017, 10:42:30 AM

Steve,

Quote from: hutch-- on December 24, 2017, 07:13:50 AM
I gather the 20000 computers are x86/64 based rather than Itaniums

That's the situation. We have machines with 2, 4, 6, 8, 12 and 16 cores. The 16-core machines are not that common; there is a shortage of money. Anyway, the boxes can be interconnected into a large cluster. We're using Open MPI (http://www.open-mpi.de/) to handle all the complicated stuff. It's robust and rock solid.

The situation is, roughly speaking, as follows. Each image is broken down into domains and ranges. We do that recursively; this results in a domain pool and a range pool. Each domain is twice as large as a corresponding range at the appropriate recursion level. This has to do with the Contraction Mapping Theorem by Stefan Banach, which forms the basis of the whole method. The ranges must not overlap (a disjoint set structure). The domains can overlap and they do. We simply move a domain window column by column and row by row over the image. Before we do that, the domain is shrunk down to the appropriate range size. And now comes the bottleneck: We want to find the best fit of a given domain in comparison with each range. If we can't find a fit, a new recursion level is required. If we find a fit, this part of the image is marked and we've found the fractal code for this image part.
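The shrinking step can be illustrated with a short C sketch. It assumes that "twice as large" means twice the edge length and that the shrink is done by averaging each 2x2 pixel group, which is a common choice in fractal coding but is not confirmed by this post; the function name `shrink_domain` and the row-major float layout are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Shrink a square domain block of edge 2*range_size to a block of edge
   range_size by averaging each 2x2 pixel group.
   Both blocks are stored row-major (assumed layout). */
void shrink_domain(const float *domain, size_t range_size, float *range)
{
    assert(domain != NULL && range != NULL && range_size > 0);

    const size_t dw = 2 * range_size;   /* domain row length */

    for (size_t y = 0; y < range_size; ++y) {
        for (size_t x = 0; x < range_size; ++x) {
            /* Top-left pixel of the 2x2 group that maps to (x, y). */
            const float *p = domain + (2 * y) * dw + 2 * x;
            range[y * range_size + x] =
                0.25f * (p[0] + p[1] + p[dw] + p[dw + 1]);
        }
    }
}
```

After this step the shrunk domain and the range have the same number of pixels, so they can be compared element by element.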

To find a fit with our eyes is easy. But the computer can't see. We have to reduce this to a calculation. For this we use techniques from error analysis, least-squares adjustment and regression. We calculate different mean values, the slope of the regression line (that is the contrast) and the intercept of the line (that is the brightness). That must be done for every comparison and for 3 color planes. We don't use RGB but YUV. That has advantages for the parallelization. Y is simply a grey scale image, while U and V are rough color difference signals. So we need only 2 cores per image; the Y encoding runs on the first core and the U and V encoding on the other, in several threads. That's all tested and proven. Several of my students have successfully written their master's theses with me. My task now is to consolidate the different solutions and add the AVX-512 code, because so far we had no hardware.
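The regression step described here can be sketched as an ordinary least-squares line fit between the pixels of a shrunk domain block d and a range block r. The sketch below is an illustration under assumptions, not the production code: the names `block_fit` and `fit_block` are invented, and using the sum of squared residuals as the measure of fit is the textbook choice rather than something stated in the post.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double contrast;    /* slope s of the regression line      */
    double brightness;  /* intercept o of the regression line  */
    double sq_error;    /* sum of squared residuals of the fit */
} block_fit;

/* Fit r[i] ~ s * d[i] + o in the least-squares sense:
     s = (n*sum(d*r) - sum(d)*sum(r)) / (n*sum(d*d) - sum(d)^2)
     o = (sum(r) - s*sum(d)) / n                                  */
block_fit fit_block(const float *d, const float *r, size_t n)
{
    assert(d != NULL && r != NULL && n > 0);

    double sd = 0.0, sr = 0.0, sdd = 0.0, sdr = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sd  += d[i];
        sr  += r[i];
        sdd += (double)d[i] * d[i];
        sdr += (double)d[i] * r[i];
    }

    block_fit f;
    const double denom = (double)n * sdd - sd * sd;
    if (denom == 0.0) {
        /* Flat domain block: no slope can be estimated. */
        f.contrast = 0.0;
        f.brightness = sr / (double)n;
    } else {
        f.contrast = ((double)n * sdr - sd * sr) / denom;
        f.brightness = (sr - f.contrast * sd) / (double)n;
    }

    /* Second pass: accumulate the residuals directly. */
    f.sq_error = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double e = f.contrast * d[i] + f.brightness - r[i];
        f.sq_error += e * e;
    }
    return f;
}
```

The sums in the first loop are exactly the arithmetic-mean and scalar-product building blocks mentioned earlier in the thread, which is why those few routines dominate the encoding time.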

I hope that I have not bored anyone with these many technical details.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: Siekmanski on December 24, 2017, 09:50:18 PM

Hi Gunther,

Sounds like a cool job for the GPU using the pixel shader.

A CPU consists of a few cores optimized for sequential serial processing while a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 24, 2017, 11:20:17 PM

Hi Marinus,

Quote from: Siekmanski on December 24, 2017, 09:50:18 PM
Hi Gunther,

Sounds like a cool job for the GPU using the pixel shader.

Yes and no. We had a long and hard discussion about CUDA some months ago. The tricky point is: the CERN PC farm is very heterogeneous. We have Intel, AMD, Cyrix, Transmeta and Apple boxes running Linux or BSD, often with special hardware and some very exotic graphics cards. Furthermore, with CUDA you are tied to Nvidia; all the ATI (now AMD), VIA and S3 cards won't work for that. Not to forget: the GPU data types are a bit exotic. There a byte is sometimes 9 bits wide, or you get 12-bit fixed-point arithmetic. That doesn't really help for our tasks. But let's wait and see, nothing is set in stone. Maybe OpenCL brings an improvement here.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: FORTRANS on December 25, 2017, 12:20:42 AM

Hi Gunther,

Quote from: Gunther on December 24, 2017, 10:42:30 AM
I hope that I have not bored anyone with these many technical details.

Actually I find it interesting. Given the simple description of the
process you gave, I may be off-base here. Would a Fourier transform
from spatial to frequency space aid in feature classification? A fairly
complex operation, so it would have to help a lot, I guess.

Cheers,

Steve N.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 25, 2017, 01:41:34 AM

Steve,

Quote from: FORTRANS on December 25, 2017, 12:20:42 AM
Would a Fourier transform
from spatial to frequency space aid in feature classification? A fairly
complex operation, so it would have to help a lot, I guess.

You're speaking about JPEG or MPEG, aren't you? Yes, they use the discrete cosine transform. That works fine and one can find a lot of code for that.

But: there is a big disadvantage. These methods are not independent of the resolution of the image. Let me explain this a bit. If you enlarge a given image by a factor of two, the entire image must be encoded in JPEG again. In contrast, the fractal coding is independent of the resolution. For example, you can fractally encode a 512x512 image and decode it at 8192x8192 without any loss of quality. This is because we do not look at the pixels of the image, but store only the generating functions in the encoding. That's the trick: each picture is represented by a set of generating functions. This was proved in 1982 by the Australian mathematician Hutchinson (nomen est omen). All the effort is spent only on finding these generating functions. When decoding we use a technique similar to that in PostScript, because PostScript (or the binary PostScript aka PDF) is resolution independent. We have a picture space and a device space. The complete decoding is done in the picture space. Only at the very end does the transfer to the device space (printer, plotter, screen - whatever you want) take place. For this we use special transformation matrices.
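The last step, going from picture space to device space, can be illustrated with a PostScript-style affine matrix [a b c d tx ty]. This is a sketch under assumptions: the post does not show the actual matrices, the unit-square picture space is chosen here only for illustration, and the names `affine`, `point`, `picture_to_device` and `apply_affine` are hypothetical.

```c
#include <assert.h>

/* PostScript-style matrix [a b c d tx ty]:
     x' = a*x + c*y + tx
     y' = b*x + d*y + ty                      */
typedef struct { double a, b, c, d, tx, ty; } affine;
typedef struct { double x, y; } point;

/* Map the unit picture space [0,1] x [0,1] onto a device of the given
   size (assumption: pure scaling, no rotation or offset). */
affine picture_to_device(double width, double height)
{
    assert(width > 0.0 && height > 0.0);
    affine m = { width, 0.0, 0.0, height, 0.0, 0.0 };
    return m;
}

point apply_affine(affine m, point p)
{
    point q;
    q.x = m.a * p.x + m.c * p.y + m.tx;
    q.y = m.b * p.x + m.d * p.y + m.ty;
    return q;
}
```

The same picture-space point (0.5, 0.25) lands at (256, 128) on a 512x512 device and at (4096, 2048) on an 8192x8192 device; only the final matrix changes, not the decoded picture.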

But as I said, decoding is not the problem at all. Here we already have real time and even the high level language (C++) suffices. We'll probably have to try assembler for 4K movies later. But this is not a fundamental problem.

I still want to make an important comment. I cannot go into the details at this point. Anyone who deals with such things lives dangerously; that is no exaggeration. I also do not suffer from paranoia. I have had very bad experiences myself, which will stay with me until my end. One of my former students came to a tragic end. If you are interested, you can read about this, at least in part, here (https://en.wikipedia.org/wiki/Tron_(hacker)). You will find another dark side of Wikipedia there, because that's not even a quarter of the truth. I am not talking like a blind man about colors, because I had to live through a large part of all this myself.

Normally all master theses are available in our university library. But as a consequence of this case, our Dean has decided that the work of my graduates will be kept in a safe place. Very few people, whom I can absolutely trust, know the entire content of the various master's theses. This practice is very unusual, but I have a responsibility to my graduates, and nobody takes that away from me. These are all young, diligent graduates, some of whom have families and small children; all this is worth protecting.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: GoneFishing on December 25, 2017, 02:48:37 AM

Hi Gunther,
That sounds like a detective story or maybe even a thriller. It's a pity that we can't hear your whole story.

Title: Re: Test results for AVX and AVX-512 needed
Post by: FORTRANS on December 25, 2017, 03:38:20 AM

Hi Gunther,

Quote from: Gunther on December 25, 2017, 01:41:34 AM
Steve,

Quote from: FORTRANS on December 25, 2017, 12:20:42 AM
Would a Fourier transform
from spatial to frequency space aid in feature classification? A fairly
complex operation, so it would have to help a lot, I guess.
You're speaking about JPEG or MPEG, aren't you? Yes, they use the discrete cosine transform. That works fine and one can find a lot of code for that.

But: there is a big disadvantage. These methods are not independent of the resolution of the image. Let me explain this a bit. If you enlarge a given image by a factor of two, the entire image must be encoded in JPEG again. In contrast, the fractal coding is independent of the resolution.

No, I was not thinking of JPEG compression. Back in the old days
I saw people making seekers that would "look" for a target. One
group was trying to use a Fourier transform to look for recognizable
patterns in the frequency domain. Something like "look for a maximum
peak and compare it to the peaks near it, or peaks at integer multiples
of that frequency". They were looking for a straight line or rectangle
shaped areas (I think). They seemed to be having fun, but I never
found out how well they were doing. (Actually, since they, more or
less, went quietly away from my perspective, they probably did not
have much success.)

Some things showed up in the frequency domain as recognizable
that were too difficult to "classify" in the spatial domain. And some
size or orientation estimates could be made quickly as well.

Regards,

Steve N.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 25, 2017, 05:24:01 AM

Steve,

now I understand. My previous answer was based on a misunderstanding. You are talking about the auto-correlation function and the cross-correlation function, right? This is a method of the correlation electronics that lets you distinguish between Nyquist noise and hidden patterns, in the frequency domain of course. Very refined, indeed. It is used in the construction of radar stations that can look far beyond the horizon.

That was one of our first approaches (using such statistical parameters) some years ago, but we rejected that because the unpacked images did not meet the quality standards. We are now looking for the patterns directly in the spatial domain and not via the detour of the frequency domain.
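For readers unfamiliar with the terms, cross-correlation can be sketched in a few lines of C. This is only the direct, one-dimensional form computed without any Fourier transform; it is not the method that was evaluated and rejected, whose details are not given here, and the names `cross_correlation` and `best_lag` are made up for this sketch. Auto-correlation is the same computation with both inputs being the same signal.

```c
#include <assert.h>
#include <stddef.h>

/* Correlation of pattern x against signal y at a given lag:
   sum over i of x[i] * y[i + lag], restricted to valid indices. */
double cross_correlation(const float *x, size_t nx,
                         const float *y, size_t ny, size_t lag)
{
    assert(x != NULL && y != NULL);

    double acc = 0.0;
    for (size_t i = 0; i < nx && i + lag < ny; ++i)
        acc += (double)x[i] * y[i + lag];
    return acc;
}

/* Lag at which the pattern matches the signal best (largest value).
   On ties the smallest lag wins. */
size_t best_lag(const float *x, size_t nx, const float *y, size_t ny)
{
    assert(ny > 0);

    size_t best = 0;
    double best_val = cross_correlation(x, nx, y, ny, 0);
    for (size_t lag = 1; lag < ny; ++lag) {
        const double v = cross_correlation(x, nx, y, ny, lag);
        if (v > best_val) {
            best_val = v;
            best = lag;
        }
    }
    return best;
}
```

A frequency-domain variant would compute the same quantity for all lags at once via an FFT, which is where the Fourier transform enters.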

But your suggestion is very good and has made me think. For very accurate images, where time is of minor importance, this path can be a backup for the main process. This will be the subject of another Master's thesis. Please ask me later about the results. The graduates are young and ambitious and want to achieve good results. But you also have to give them time to get the chance. At least that's my experience.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 25, 2017, 05:42:15 AM

Quote from: GoneFishing on December 25, 2017, 02:48:37 AM
That sounds like a detective story or maybe even a thriller. It's a pity that we can't hear your whole story.

Oh yes, life writes the best stories, but that was also a very sad story. Those who are further interested in such questions and in the very dark pages of Wikipedia itself should read the following book: David Talbot, The Devil's Chessboard: Allen Dulles, the CIA, and the Rise of America's Secret Government, published by HarperCollins in 2015. The content opened my eyes and also has to do with Tron's case. Anyone who gets caught between such millstones has been dealt a very bad hand. But I should open a separate thread in the Colosseum about this, because it is far away from programming, but humanly very close.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: GoneFishing on December 25, 2017, 05:50:58 AM

Thanks, Gunther
I'll try to find that book

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 25, 2017, 06:03:28 AM

Quote from: GoneFishing on December 25, 2017, 05:50:58 AM
Thanks, Gunther
I'll try to find that book

It's worth reading in any case.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: hutch-- on December 25, 2017, 06:18:05 AM

It is unfortunate that information is subject to either commercial or government interests when both have an unhappy history in what they will do to obtain it, either for control or financial reasons so your choice to protect the identities of authors and the contents of their work is in fact a good idea. Both government agencies and private companies protect the identities and content of information they control so it makes sense that you should do the same.

From a quick read of the Wikipedia page you linked to, it sounds like Tron was "assisted" in his suicide because he knew something that someone else did not want him to know. You do have examples of some countries in the middle east killing scientists in other countries in the middle east so conduct like this is not unknown.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 25, 2017, 06:33:35 AM

Steve,

Quote from: hutch-- on December 25, 2017, 06:18:05 AM
From a quick read of the Wikipedia page you linked to, it sounds like Tron was "assisted" in his suicide because he knew something that someone else did not want him to know.

That's exactly the point. When I finally realized that, it was already too late. That will stay with me forever. I should have taken him out of the line of fire much earlier.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: FORTRANS on December 26, 2017, 12:12:23 AM

Hi,

Quote from: Gunther on December 25, 2017, 05:24:01 AM
Steve,

now I understand. My previous answer was based on a misunderstanding. You are talking about the auto-correlation function and the cross-correlation function, right?

Yes. Correlation between your image and a specified feature.

Quote
This is a method of the correlation electronics that lets you distinguish between Nyquist noise and hidden patterns, in the frequency domain of course. Very refined, indeed. It is used in the construction of radar stations that can look far beyond the horizon.

That was one of our first approaches (using such statistical parameters) some years ago, but we rejected that because the unpacked images did not meet the quality standards. We are now looking for the patterns directly in the spatial domain and not via the detour of the frequency domain.

That answers my question. You have already evaluated my
suggested action and are not currently using it. Not due to
calculation cost as I thought, but due to poor performance. Thank
you, very informative.

Quote
But your suggestion is very good and has made me think. For very accurate images, where time is of minor importance, this path can be a backup for the main process. This will be the subject of another Master's thesis. Please ask me later about the results. The graduates are young and ambitious and want to achieve good results. But you also have to give them time to get the chance. At least that's my experience.

Best of luck to you and them. Keep us informed if something
shows up. Thank you for your response.

Regards,

Steve

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 26, 2017, 12:50:32 AM

Steve,

Quote from: FORTRANS on December 26, 2017, 12:12:23 AM
Yes. Correlation between your image and a specified feature.

Of course, the auto- and cross-correlation functions have to do with the Fourier transform; the convolution also plays a role. These can be very interesting things. We found that, for example, the convolution of two Bessel functions gives the harmonic sinusoidal oscillation. You can even verify that by calculation. This is important for the fractal encoding.
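One classical identity that matches this remark is that the convolution of the zeroth-order Bessel function of the first kind, J_0, with itself is the sine. That J_0 is meant is an assumption, since the post does not say which Bessel functions are involved. The identity follows from the Laplace transform of J_0 and the convolution theorem:

```latex
\begin{align*}
(J_0 * J_0)(x) &= \int_0^x J_0(t)\, J_0(x - t)\, dt, \\
\mathcal{L}\{J_0\}(s) &= \frac{1}{\sqrt{s^2 + 1}}, \\
\mathcal{L}\{J_0 * J_0\}(s) &= \left(\frac{1}{\sqrt{s^2 + 1}}\right)^{2}
  = \frac{1}{s^2 + 1} = \mathcal{L}\{\sin\}(s), \\
\Rightarrow\quad (J_0 * J_0)(x) &= \sin x .
\end{align*}
```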

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: HSE on December 26, 2017, 01:49:24 AM

Hi Gunther!

Just reading this from complete ignorance: Are you searching only for continuous patterns? Because if the pattern is discontinuous, you can't split the search across different threads, right?

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 26, 2017, 05:44:20 AM

Quote from: HSE on December 26, 2017, 01:49:24 AM
Just asking out of complete ignorance: are you searching only for continuous patterns? Because if the pattern is discontinuous, you can't split the search across different threads, can you?

That's a really good question. The basis of the described technique is fractal geometry. The emphasis is on the geometry. In vivid terms, we search for self-similarities within a given image. These can be big (the bigger the better, because then the compression gets better) or small. Ultimately, we start from geometric aspects. In this sense, the patterns must already be coherent.

But to come back to your suggestion: at present a graduate is working on various new approaches to parallelization. His interim results look very good. The deadline for submitting his master's thesis is mid-February. Once he has defended his thesis in April next year, we will know more. We are already looking at different processes and sometimes several threads within a process. But with the new results we should be able to take decisive steps forward, because the data volume involved in film sequences or at CERN is gigantic.

Another problem that occurs when decoding movie clips or entire movies is the synchronization of the video and audio tracks. However, we have had that under control for a few days now, which makes me very happy.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: HSE on December 26, 2017, 06:31:02 AM

:t Thanks.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 26, 2017, 06:39:18 AM

Quote from: HSE on December 26, 2017, 06:31:02 AM
:t Thanks.

You're welcome. I just hope my explanation was reasonably clear and understandable.

Gunther

Title: Re: Test results for AVX and AVX-512 needed
Post by: hutch-- on December 26, 2017, 11:55:52 AM

Gunther,

I would be interested to know what video resolution you are working with on a project like CERN. I remember years ago that Silicon Graphics had the hardware to animate a 20 megapixel image at 100 frames a second but JPL were using very big computer grunt for rotating images of the known universe. I wondered if you guys are doing similar stuff.

I have a couple of cameras that will shoot decent 4k and I know there are a few that will shoot 8k but I got the impression that you may be working in much higher resolution than the normal standards offer.

Title: Re: Test results for AVX and AVX-512 needed
Post by: Gunther on December 26, 2017, 03:19:48 PM

Steve,

The CERN images are not the issue, because they have no audio track. There, the only problem is the sheer volume of data.

The real challenge is the 4K movies. This is a parallel project, located at the University. We aim for a resolution of 3840x2160 at 24 frames per second (cinema quality) with a Dolby Surround soundtrack. But that's all easier said than done. At the moment it looks a bit bumpy on my screen, but it gets better every day. The graduates are working at full steam. Well, they are young and ambitious; they want to show what they have learned and what they are capable of.

Although you can use a lot more tricks with movies, we cannot avoid doing the decoding in good old assembly language. We do that with the inline assembler and leave the old C++ code in as a comment, so that a later migration will be easier (UNIX, Mac, PowerPC, who knows). In particular, the iterations and the conversion from YUV to RGB are critical. There is no free register and all cores are running at full load; the cache could well be three times as big.

On the other hand, C++ comes with a lot of unnecessary overhead. In the last few days I've often wondered whether we shouldn't go back to good old C. You know what you have, you do not need constructors and destructors, and you do not have to be careful about bending the this pointer by mistake. But that's difficult in the middle of the work. Some design decisions take their revenge late, but they do take their revenge.

But fine, I let the young people talk me into this crazy project and must now cope with it. What else is left for me? It is my duty to show those young people the right way, to remove difficulties and sometimes to give comfort. Whining does not help; just grit your teeth and carry on. The crazy thing is: sometimes I have bad moments, like every human being. At such moments, I always listen to the old Vince Guaraldi tune. (https://www.youtube.com/watch?v=rTA3aOfrDHA) There's also a guitar version by the great Earl Klugh with a bit of help from Vince Gill (https://www.youtube.com/watch?v=IpY33RZQACY); the music starts at 1:29. Believe it or not, my graduates, who usually only listen to rap and hip-hop, find it really good and have acquired a taste for it. That's a crazy world.

Gunther

The MASM Forum

64 bit assembler => 64 bit assembler. Conceptual Issues => Topic started by: Gunther on December 21, 2017, 11:43:25 AM