Win64 command line programs with AVX instructions

Started by Gunther, October 15, 2012, 04:29:42 AM


Gunther

I've added 2 archives to this message: features.zip and floatsum.zip. Please read the readme.txt file first (it's included in every archive). The applications should run under Win64, SP1 (native or VM).

The program features.exe checks at run time which instruction sets the underlying machine supports. Many of the tests are not really necessary under Win64, but my goal was to develop a technique that is usable under Win32, too (with some minor changes, of course).
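As a rough illustration of this kind of run-time check (not the code from features.zip), here is a minimal C sketch. It assumes GCC or Clang, whose `<cpuid.h>` header provides `__get_cpuid`. The helper name `avx_usable` is hypothetical. AVX needs both the CPU flag and operating system support for saving the YMM registers, so the sketch tests the OSXSAVE and AVX bits of CPUID leaf 1 and then reads XCR0 with XGETBV.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Read extended control register 0.
   Only valid after CPUID has reported OSXSAVE. */
static uint64_t read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0u));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: returns 1 if AVX instructions may be executed,
   0 otherwise. */
int avx_usable(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                      /* CPUID leaf 1 not available */
    if (!(ecx & (1u << 27)))
        return 0;                      /* OSXSAVE: OS has not enabled XGETBV */
    if (!(ecx & (1u << 28)))
        return 0;                      /* AVX not supported by the CPU */

    /* XCR0 bits 1 and 2: OS saves XMM and YMM state on context switch. */
    return (read_xcr0() & 0x6u) == 0x6u;
}
#else
/* Assumption: on other compilers or architectures, report "no AVX". */
int avx_usable(void) { return 0; }
#endif
```

A caller would test `avx_usable()` once at start-up and select the YMM or the XMM code path accordingly.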

The program floatsum.exe sums up an array of float (REAL4) numbers in C and assembly language (with SSE2 instructions and the new AVX instructions). The differences are tremendous. Here is the application's output on my machine: Intel Core i7-3770, 3.4 GHz with Win7 (64 bit) and SP1:

Supported by Processor and installed Operating System:
------------------------------------------------------

     Pentium 4 Instruction Set,
     + FPU (floating point unit) on chip,
     + support of FXSAVE and FXRSTOR,
     +  57 MMX Instructions,
     +  70 SSE (Katmai) Instructions,
     + 144 SSE2 (Willamette) Instructions,
     +  13 SSE3 (Prescott) Instructions,
     +  47 SSE4.1 (Penryn) Instructions,
     +   7 SSE4.2 (Nehalem) Instructions,
     + AVX (Advanced Vector Extensions).

Calculating the sum of a float array in different ways.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 12.68 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 4.29 Seconds
Performance Boost = 296%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 1.08 Seconds
Performance Boost = 1178%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 0.55 Seconds
Performance Boost = 2323%
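The two C variants named in the output are not shown in this post. The following is only a sketch of what "simple" and "4 accumulators" usually mean, with hypothetical function names. The single-accumulator loop forms one long chain of dependent additions. Splitting the sum across four independent accumulators lets the CPU overlap the additions, which is the same idea the XMM and YMM versions apply to whole vector registers.

```c
#include <stddef.h>

/* One accumulator: every addition waits for the previous one. */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators, combined once at the end. */
float sum_4acc(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                 /* leftover elements, if any */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the two functions can differ in the last digits for general data. They agree here because the test data sums exactly.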


For the C sources I used gcc 4.7.2 for Windows. With some minimal changes (especially to the data alignment) they should work with VC or Pelles C, too, but that's not tested. The assembly language sources are assembled with yasm 1.2.0 for Windows, but nasm will do the same job (that's tested).
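The alignment change mentioned above might look roughly like the sketch below. The macro `ALIGN32` and both helper functions are hypothetical and not taken from the archives. gcc uses `__attribute__((aligned(n)))`, while Microsoft's compiler uses `__declspec(align(n))`. Other compilers may need a different spelling, for example C11 `_Alignas` where it is supported.

```c
#include <stdint.h>

/* Hypothetical portability macro for 32-byte alignment (YMM width). */
#if defined(_MSC_VER)
#define ALIGN32 __declspec(align(32))
#else
#define ALIGN32 __attribute__((aligned(32)))   /* gcc, clang */
#endif

/* Buffer that aligned vector loads can safely read from. */
ALIGN32 static float data[1024];

/* Returns 1 if p lies on a 32-byte boundary. */
int is_aligned32(const void *p)
{
    return ((uintptr_t)p & 31u) == 0;
}

const float *aligned_buffer(void)
{
    return data;
}
```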

In the next few days I'll upload the same example for Linux and BSD.

The software isn't in its final stage yet. Hints and proposals for improvements are welcome, as well as any feedback.

Gunther

You have to know the facts before you can distort them.

Gunther

I've updated the application that sums up an array of REAL4 (float) numbers. It now includes the new instruction detection procedure and an additional procedure with FPU code.

A little feedback would be welcome. The Linux version is coming soon.

Gunther


dedndave

hi Gunther
my "little bit of feedback".....

your instruction-set ID code...
the first instruction is PUSH RBX
yet, you attempt to ID...
;                   0 = 8086
;                   1 = 80186
;                   2 = 80286
;                   3 = 80386
;                   4 = 80486

probably not much use in identifying those, as windows 2k+ won't run on any of them
but, the PUSH RBX would crash if it did   :P
the first logical step might be to identify processor width
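One way to picture the "processor width" check is the small C sketch below. It assumes GCC or Clang with `<cpuid.h>`, and the function name is hypothetical. It asks extended CPUID leaf 0x80000001 whether the CPU supports long mode (EDX bit 29). In a 64-bit build the answer is always yes, so the test is only interesting in 32-bit code, and it does not cover the pre-CPUID processors in the list above.

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Hypothetical helper: 1 if the CPU supports 64-bit long mode, else 0. */
int cpu_has_long_mode(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not available. */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 29) & 1u);    /* LM bit */
}
#else
/* Assumption: on other compilers or architectures, report 0. */
int cpu_has_long_mode(void) { return 0; }
#endif
```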

Gunther

Hi Dave,

That's right and not right. The numbers up to 7 are only for the 32-bit version and especially for the 16-bit version. The 32-bit version is finished, and you've already posted in its thread: http://masm32.com/board/index.php?topic=1418.0. The 16-bit version isn't ready yet, but the numbers are already there.

Gunther

dedndave

ok   :t
i saw those in there and thought i'd mention it

Gunther

Hi Dave,

never mind. Enjoy.  :t

Gunther