So, here I am back with a few answers. Attached under the first post of this thread is the ZIP archive floatassembly.zip. It contains the sources and binaries of another test suite. Why that? The reason for it is this
post: "I would say that VS was able to optimize a little better."
That's only half the story, and the smaller half at that. Of course, aw27's application shows the same behavior on my machines. And yes, VS uses a very aggressive optimization strategy, which has nothing to do with defensive programming. But it's not aggressive enough to produce a zero result for the time measurement.
I don't have VS running here, but luckily aw27 sent the assembler source produced by VS at a very high optimization level; you can find it under
this post. I can only assume a big trick by the compiler builders. A short, rough calculation should illustrate my point. In the simple C implementation, VS uses one accumulator (xmm0) and does 8 scalar additions per loop iteration. The floating-point operations are pipelined, but we have one long dependency chain, because each addition depends on the previous one. The more sophisticated C implementation uses 4 accumulators and does 4 scalar additions in parallel per loop iteration. That makes a big difference, because we have 4 dependency chains, each only a quarter as long.
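To make that concrete, here is a minimal C sketch of the two scalar variants. The function names and the divisibility assumption are mine, not taken from aw27's sources:

#include <stddef.h>

/* One accumulator: a single long dependency chain, every addition
   has to wait for the previous one to finish. */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: four independent dependency chains, each only
   a quarter as long, so the pipelined adder stays busy.
   n is assumed to be divisible by 4. */
float sum_4acc(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}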
Now the explicitly vectorized code: the XMM version does 16 additions, the YMM version 32 additions, and the ZMM version 64 additions per loop iteration. Could it be that 8 pipelined scalar additions are faster than 64 vectorized additions? Definitely not. It is striking that VS doesn't touch the assembly language part (it isn't C); on the other hand, it seems to me that the original C code never actually performs the 15,000,000 function calls with parameter passing, call, and return. That is what leads to the zero-time results. That wouldn't be a super, mega optimization, it would be a big scam.
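For illustration only, this is roughly what the 4-accumulator YMM idea looks like when written with C intrinsics. The real routines in floatassembly.zip are hand-written assembly; the assumptions here (n divisible by 32, unaligned loads) are mine:

#include <immintrin.h>
#include <stddef.h>

/* 4 YMM accumulators: 4 * 8 = 32 float additions per loop iteration. */
float sum_4ymm(const float *a, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    for (size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
    }

    /* Reduce the 4 accumulators to a single scalar. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
    return _mm_cvtss_f32(s);
}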
I think I can prove that suspicion. What have I done? I took the assembly language source generated by VS and turned it into 2 separate assembly language procedures. The C source is compiled without optimizations. Why? There's nothing time critical in it: the procedure that fills the array with a defined number pattern is called only once, and every C compiler should generate reasonable code for that. Every summation procedure is called 15,000,000 times, so the loop overhead is the same for each of them. That's exactly what we want. A rough sketch of the measurement loop follows below, and after that the results on my machine.
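Here is a minimal sketch of the kind of harness I mean. The array size, the fill pattern, and the use of clock() are placeholders of mine; the real details are in the sources inside floatassembly.zip:

#include <stdio.h>
#include <stddef.h>
#include <time.h>

#define ARRAY_SIZE 16384        /* placeholder; the real size is in the sources   */
#define REPEATS    15000000UL   /* every summation procedure is called this often */

static float numbers[ARRAY_SIZE];

/* Stand-in for one of the summation procedures; in the test suite these
   are separate procedures (VS-generated or hand-written assembly). */
static float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Fill the array once with a defined pattern; not time critical,
   so no optimization is needed here. */
static void fill_array(void)
{
    for (size_t i = 0; i < ARRAY_SIZE; i++)
        numbers[i] = (float)(i & 15) + 0.5f;
}

int main(void)
{
    fill_array();

    volatile float sum = 0.0f;   /* volatile result: the calls cannot simply be thrown away */
    clock_t start = clock();
    for (unsigned long r = 0; r < REPEATS; r++)
        sum = sum_simple(numbers, ARRAY_SIZE);
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("sum1 = %.2f\n", (double)sum);
    printf("Elapsed Time = %.2f Seconds\n", elapsed);
    return 0;
}

Assigning every result to a volatile variable is a small safeguard: even an aggressively optimizing compiler is not allowed to elide the 15,000,000 calls then.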
Calculating the sum of a float array in 5 different variants.
That'll take a little while. Please be patient ...
Simple C with assembly code generated by VS:
--------------------------------------------
sum1 = 8390656.00
Elapsed Time = 56.00 Seconds
C and 4 accumulators with assembly code generated by VS:
--------------------------------------------------------
sum2 = 8390656.00
Elapsed Time = 13.86 Seconds
Performance Boost = 404%
Assembly language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 3.44 Seconds
Performance Boost = 1629%
Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 2.05 Seconds
Performance Boost = 2734%
Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5 = 8390656.00
Elapsed Time = 1.05 Seconds
Performance Boost = 5354%
These are realistic results, I think. But this needs to be tested on other machines and in other environments. Thank you for your help.
Gunther