I've attached two archives to this message: features.zip and floatsum.zip. Please read the readme.txt file first (it's included in each archive). The applications should run under Win64 with SP1 (native or in a VM).
The program features.exe checks at run time which instruction sets are available on the underlying machine. A lot of the tests are not really necessary under Win64, but my goal was to develop a technique that is also usable under Win32 (with some minor changes, of course).
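To give an idea of what such a run-time check involves, here is a minimal sketch in C. It is not the code from features.zip; it assumes GCC (or Clang) with its <cpuid.h> helper, and the names read_xcr0 and avx_usable are made up for illustration. For AVX the CPU flag alone isn't enough: the operating system must also save the YMM registers, which is reported through OSXSAVE and XCR0.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

/* Read XCR0 with XGETBV. Only valid if CPUID reports OSXSAVE. */
static uint64_t read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* Returns 1 if the CPU and the OS both support AVX, otherwise 0.
   (Hypothetical helper, not taken from features.zip.) */
int avx_usable(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                      /* CPUID leaf 1 not available */
    if (!(ecx & (1u << 27)))
        return 0;                      /* OSXSAVE: OS has not enabled XGETBV */
    if (!(ecx & (1u << 28)))
        return 0;                      /* AVX not supported by the CPU */

    /* XCR0 bit 1 = XMM state, bit 2 = YMM state; both must be set. */
    return (read_xcr0() & 0x6) == 0x6;
}
#else
/* Not an x86 target: no AVX. */
int avx_usable(void) { return 0; }
#endif
```

A program would call avx_usable() once at startup and pick the YMM or the XMM code path accordingly.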
The program floatsum.exe sums up an array of float (REAL4) numbers in C and assembly language (with SSE2 instructions and the new AVX instructions). The differences are tremendous. Here is the application's output on my machine: Intel Core i7-3770, 3.4 GHz with Win7 (64 bit) and SP1:
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ 47 SSE4.1 (Penryn) Instructions,
+ 7 SSE4.2 (Nehalem) Instructions,
+ AVX (Advanced Vector Extensions).
Calculating the sum of a float array in different ways.
That'll take a little while. Please be patient ...
Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 12.68 Seconds
C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 4.29 Seconds
Performance Boost = 296%
Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 1.08 Seconds
Performance Boost = 1178%
Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 0.55 Seconds
Performance Boost = 2323%
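For readers who don't want to unpack the archive first, here is a rough sketch of the idea behind the "4 accumulators" C variant. It is not the code from floatsum.zip; the function names sum_simple and sum_4acc are made up for illustration. With a single accumulator every addition has to wait for the previous one; with four independent accumulators the additions can overlap in the pipeline.

```c
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add. */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the four adds per iteration
   do not depend on each other. */
float sum_4acc(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Handle the remaining 0..3 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the second version adds the elements in a different order, so for general float data the result can differ from the simple loop in the last digits. The XMM and YMM versions apply the same idea, only with 4 or 8 floats per register.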
For the C sources I used gcc 4.7.2 for Windows. With some minimal changes (especially to the data alignment) they should also work with VC or Pelles C, but that's not tested. The assembly language sources are assembled with yasm 1.2.0 for Windows, but nasm will do the same job (that's tested).
In the next few days I'll upload the same example for Linux and BSD.
The software isn't in its final stage yet. Hints and proposals for improvements are welcome, as well as any feedback.
Gunther