Test results for AVX and AVX-512 needed

Gunther · December 21, 2017, 11:43:25 AM

The attached file float.zip contains the sources and binaries of the test program. It calculates the sum of all elements of a floating point array. That's useful for the computation of the arithmetic mean, for example.

The program uses 5 different methods:

Simple C implementation with 1 accumulator.
More sophisticated C implementation with 4 accumulators.
SSE2 implementation with 4 accumulators.
AVX implementation with 4 accumulators.
AVX-512F implementation with 4 accumulators.

The application checks via CPU dispatching which instruction sets are supported. So nothing bad will happen if, for example, AVX-512 isn't available. You'll only get a message on the screen and the AVX-512 procedure isn't called. That's all.

Here is a typical output on my Skylake box:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 55.81 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 25.67 Seconds
Performance Boost = 217%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.53 Seconds
Performance Boost = 1583%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.06 Seconds
Performance Boost = 2707%

Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5              = 8390656.00
Elapsed Time      = 1.06 Seconds
Performance Boost = 5250%

As you can see, AVX-512F increases the speed in a dramatic way. There are a lot of powerful new instructions available.

Test results with other environments are very welcome. Have fun.

Gunther

jj2007 · December 21, 2017, 12:28:16 PM

Intel Core i5, Win7-64:

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 74.94 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 49.95 Seconds
Performance Boost = 150%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 6.36 Seconds
Performance Boost = 1177%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 3.29 Seconds
Performance Boost = 2277%

Your current CPU doesn't support the AVX-512 instruction set.

Siekmanski · December 21, 2017, 12:39:47 PM

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8.1
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 54.22 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 36.31 Seconds
Performance Boost = 149%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 4.61 Seconds
Performance Boost = 1175%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.41 Seconds
Performance Boost = 2251%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

Gunther · December 21, 2017, 12:46:20 PM

Thank you for the results, Jochen. Your CPU doesn't support AVX-512F. On the other hand, Windows 7 wouldn't allow that instruction set anyway. I'm not sure what to make of that behavior by Intel and MS. Is it only the usual sales pitch?

Gunther

Gunther · December 21, 2017, 12:55:55 PM

Hi Marius,

thank you for testing the program. You've a good machine with an interesting environment (6 cores :t). The simple C code is a bit faster on your machine; that's surprising, the other results are as expected.

Gunther

sinsi · December 21, 2017, 01:51:14 PM

i7 4790, Windows 10 x64 Pro

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 47.35 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 31.83 Seconds
Performance Boost = 149%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.92 Seconds
Performance Boost = 1208%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 1.98 Seconds
Performance Boost = 2395%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

hutch-- · December 21, 2017, 02:04:14 PM

Haswell E/EP at 3.3 gig.

Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 51.53 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 35.00 Seconds
Performance Boost = 147%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 4.33 Seconds
Performance Boost = 1190%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 2.19 Seconds
Performance Boost = 2356%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

aw27 · December 21, 2017, 09:12:56 PM

I built with VS 2017 and got amazing results:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 0.00 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 0.00 Seconds
Performance Boost = -nan(ind)%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 5.78 Seconds
Performance Boost = 0%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 2.79 Seconds
Performance Boost = 0%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

Project attached.

Gunther · December 21, 2017, 09:43:56 PM

Thank you Hutch and Sinsi for testing. Interesting results, indeed.

Quote
I built with VS 2017 and got amazing results:

aw27, indeed, the results are a bit confusing. The C sources contain nothing special, so they should compile without problems under VS. But the results show that there's something wrong with the time measurement. Could you attach the running EXE, please? I have no Visual Studio running here. The funny thing is that the times for the assembly language procedures are realistic. What happens with the C procedures? Very strange.

Gunther

aw27 · December 21, 2017, 09:52:12 PM

@Gunther
I would say that VS was able to optimize a little better. Nothing new, it happens frequently. I also include the asm listing of floatfunc.

HSE · December 22, 2017, 02:20:21 AM

Hi gunther!
Just testing on a 64-bit notebook (7-64 SP1); the system reports a problem after I see:

Assembly Language with 4 YMM accumulators:

Gunther · December 22, 2017, 02:38:38 AM

Hi HSE,

that's a bit strange, because the CPU dispatching mechanism is testing for available instruction sets. I've tested the software under my old box running Win 7-64 with SP1; it works fine under this environment. Do you have an AMD CPU?

Gunther

Gunther · December 22, 2017, 04:56:00 AM

So, here I'm back with a few answers. The ZIP archive floatassembly.zip is attached to the first post of this thread. It contains the sources and binaries of another test suite. Why? The reason is this post.

Quote from: aw27
I would say that VS was able to optimize a little better.

That's only half the story, and that's the smaller half. Of course, aw27's application shows the same behavior on my machines. And yes, VS uses a very aggressive optimization strategy, which has nothing to do with defensive programming. But it's not aggressive enough to produce a Zero result for the time measurement.

I don't have VS running here, but luckily aw27 sent the assembler source produced by VS with a very high optimization level; you can find it under this post.

I can only assume a big trickery by the compiler builders. A short rough calculation should illustrate my point. In the simple C implementation, VS uses one accumulator (xmm0) and does 8 scalar additions per loop cycle. The floating point operations are pipelined. On the other hand, we have a long dependency chain, because the next operation depends on the previous one. The more sophisticated C implementation uses 4 accumulators and does 4 scalar additions in parallel per loop cycle. That makes a big difference, because we have 4 dependency chains, but each is only a quarter as long.

Now the explicitly vectorized code: the XMM version performs 16 additions, the YMM version 32 additions, and the ZMM version 64 additions per loop cycle. Could it be that 8 pipelined scalar additions are faster than 64 vectorized additions? Definitely not. It is striking that VS doesn't touch the assembly language part (it's not C); on the other hand, it seems to me that the original C code doesn't actually perform the 15000000 function calls with parameter passing, call and return. This leads to the zero time results. But that isn't a super, mega optimization; it would be a big scam.

I think that I can prove that. What have I done? I've taken the assembly language sources generated by VS and put them into 2 separate assembly language procedures. The C source is compiled without optimizations. Why? There's nothing time critical. The procedure to fill the array with a defined number pattern is called once, and every C compiler should generate reasonable code for that. Every procedure is called 15000000 times; the loop overhead is the same for each of them. That's exactly what we want. Here are the results on my machine:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 56.00 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 13.86 Seconds
Performance Boost = 404%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.44 Seconds
Performance Boost = 1629%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.05 Seconds
Performance Boost = 2734%

Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5              = 8390656.00
Elapsed Time      = 1.05 Seconds
Performance Boost = 5354%

These are realistic results, I think. But this must be tested with other machines and environments. Thank you for your help.

Gunther

aw27 · December 22, 2017, 05:14:48 AM

hi Gunther!
I did not use any particular optimization, just the standard release mode.
However, I had a closer look to find out what sort of trickery was used by VS, so you can change the test, if you want, of course.
The trickery was simple. VS simply noticed that
   for (i = 0; i < FINISH; i++) {
      sum1 = Sum1AccuC(X, N);
   }
never changes the value of sum1, so why bother to repeat the loop if they already got the result. :icon_eek:

Siekmanski · December 22, 2017, 05:33:22 AM

Results from floatassembly,

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 53.91 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 18.37 Seconds
Performance Boost = 293%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 4.62 Seconds
Performance Boost = 1167%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.42 Seconds
Performance Boost = 2227%

Your current CPU doesn't support the AVX-512 instruction set.
