News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Test results for AVX and AVX-512 needed

Started by Gunther, December 21, 2017, 11:43:25 AM

Previous topic - Next topic

Gunther

The attached file float.zip contains the sources and binaries of the test program. It calculates the sum of all elements of a floating point array. That's useful for the computation of the arithmetic mean, for example.

The program uses 5 different methods:

  • Simple C implemetation with 1 accumulator.
  • More sophisticated C implementation with 4 accumulators.
  • SSE2 implementation with 4 accumulators.
  • AVX implementation with 4 accumulators.
  • AVX-512F implementation with 4 accumulators.
The application checks via CPU dispatching which instruction sets are supported. So it'll nothing bad happen if, for example, AVX-512 isn't available. You'll get only a message on the screen and the AVX-512 procedure isn't called. That's all.

Here is a typical output on my Skylake box:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 55.81 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 25.67 Seconds
Performance Boost = 217%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.53 Seconds
Performance Boost = 1583%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.06 Seconds
Performance Boost = 2707%

Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5              = 8390656.00
Elapsed Time      = 1.06 Seconds
Performance Boost = 5250%

As you can see, AVX-512F increases the speed in a dramatic way. There are a lot of powerful new instructions available.

Test results with other environments are very welcome. Have fun.

Gunther
You have to know the facts before you can distort them.

jj2007

Intel Core i5, Win7-64:
Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 74.94 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 49.95 Seconds
Performance Boost = 150%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 6.36 Seconds
Performance Boost = 1177%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 3.29 Seconds
Performance Boost = 2277%

Your current CPU doesn't support the AVX-512 instruction set.

Siekmanski

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8.1
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 54.22 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 36.31 Seconds
Performance Boost = 149%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 4.61 Seconds
Performance Boost = 1175%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.41 Seconds
Performance Boost = 2251%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.
Creative coders use backward thinking techniques as a strategy.

Gunther

Thank you for the results, Jochen. Your CPU doesn't support AVX-512F. But on the other hand, Windows 7 doesn't allow that instruction set. I'm not sure about that behavior of Intel and MS. Is it only the usual sales pitch?

Gunther
You have to know the facts before you can distort them.

Gunther

Hi Marius,

thank you for testing the program. You've a good machine with an interesting environment (6 cores  :t). The simple C code is a bit faster on your machine; that's surprising, the other results are as expected.

Gunther
You have to know the facts before you can distort them.

sinsi

i7 4790, Windows 10 x64 Pro
Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 47.35 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 31.83 Seconds
Performance Boost = 149%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.92 Seconds
Performance Boost = 1208%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 1.98 Seconds
Performance Boost = 2395%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.
Tá fuinneoga a haon déag níos fearr :biggrin:

hutch--

Haswell E/EP at 3.3 gig.

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 51.53 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 35.00 Seconds
Performance Boost = 147%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 4.33 Seconds
Performance Boost = 1190%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.19 Seconds
Performance Boost = 2356%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

The application terminates now.

aw27

I built with VS 2017 and got amazing results:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 0.00 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 0.00 Seconds
Performance Boost = -nan(ind)%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 5.78 Seconds
Performance Boost = 0%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.79 Seconds
Performance Boost = 0%

Your current CPU doesn't support the AVX-512 instruction set.
You'll need at least the Knights Landing or Skylake architecture.

Project attached.

Gunther

Thank you Hutch and Sinsi for testing. Interesting results, indeed.

Quote
I built with VS 2017 and got amazing results:

aw27, indeed, the results are a bit confusing. The C sources contain nothing special, so it should compile without problems under VS. But the results show that there's something wrong with the time measuring. Could you attach the running EXE, please? I have no Visual Studio running here. The funny thing is, that the the times for the assembly language procedures are realistic. What happens with the C procedures? Very strange.

Gunther
You have to know the facts before you can distort them.

aw27

@Gunther
I would say that VS was able to optimize a little better. Nothing new, it happens frequently. I include also the asm listing of floatfunc

HSE

Hi gunther!
Just testing a 64bit notebook (7-64 SP1), system report a problem after I see:Assembly Language with 4 YMM accumulators:

Equations in Assembly: SmplMath

Gunther

Hi HSE,

that's a bit strange, because the CPU dispatching mechanism is testing for available instruction sets. I've tested the software under my old box running Win 7-64 with SP1; it works fine under this environment. Do you have an AMD CPU?

Gunther
You have to know the facts before you can distort them.

Gunther

So, here I'm back with a few answers. Under the first post of this thread is attached the ZIP archive floatassembly.zip. It contains the sources and binaries of another test suite. Why that? The reason for it is this post.
Quote from: aw27
I would say that VS was able to optimize a little better.
That's only half the story, and that's the smaller half. Of course, aw27's application shows the same behavior on my machines. And yes, VS uses a very aggressive optimization strategy, which has nothing to do with defensive programming. But it's not aggressive enough to produce a Zero result for the time measurement.

I haven't VS running here, but luckily did aw27 send the assembler source produced by VS with a very high optimization level; you can find it under this post.

I can only assume a big trickery by the compiler builders.  A short rough calculation should illustrate my point. In the simple C implementation uses VS one accumulator (xmm0) and is doing 8 scalar additions per loop cycle. The floating point operations are pipelined. On the other hand, we've a long dependency chain, because the next operation depends on the previous one. The more sophisticated C implementation uses 4 accumulators and is doing 4 scalar additions in parallel per loop cycle. That makes a big difference, because we've 4 dependency chains, but each is only a quarter as long.

Now the explicit vectorized code: The XMM version uses 16 additions , the YMM version 32 additions, and the ZMM version 64 additions per cycle. Could it be that 8 pipelined scalar additions are faster than 64 vectorized additions? Definitely not. It is striking that VS doesn't touch the assembly language part (it's not C); on the other hand, it seems to me that the original C code doesn't run over the 15000000 times function calls with parameter passing, call and return. This leads to the zero time results. But that isn't a Super, Mega optimization, but would be a big scam.

I think that I can prove that. What have I done? I've the assembly language sources, generated by VS, written in 2 different assembly language procedures. The C source is compiled without optimizations. Why? There's nothing time critical. The procedure to fill the array with a defined number pattern is called once and every C compiler should generate reasonable code for that. Every procedure is called 15000000 times; the loop overhead is the same for each of it. That's exactly what we want. Here are the results on my machine:

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 56.00 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 13.86 Seconds
Performance Boost = 404%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 3.44 Seconds
Performance Boost = 1629%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.05 Seconds
Performance Boost = 2734%

Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5              = 8390656.00
Elapsed Time      = 1.05 Seconds
Performance Boost = 5354%

These are realistic results, I think. But this must be tested with other machines and environments. Thank you for your help.

Gunther
You have to know the facts before you can distort them.

aw27

hi Gunther!
I did not any particular optimization, just the standard release mode.
However, I had a better look to find out what sort of trickery was used by VS, so you may eventually change the test, if you want, of course.
The trickery was simple. VS simply noticed that
   for (i = 0; i < FINISH; i++) {
      sum1 = Sum1AccuC(X, N);
   }
never changes the value of sum1, so why bother to repeat the loop if they already got the result.  :icon_eek:

Siekmanski

Results from floatassembly,

Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...

Simple C with assembly code generated by VS:
--------------------------------------------
sum1              = 8390656.00
Elapsed Time      = 53.91 Seconds

C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2              = 8390656.00
Elapsed Time      = 18.37 Seconds
Performance Boost = 293%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 4.62 Seconds
Performance Boost = 1167%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 2.42 Seconds
Performance Boost = 2227%

Your current CPU doesn't support the AVX-512 instruction set.
Creative coders use backward thinking techniques as a strategy.