AVX for 32-bit Windows applications

Started by Gunther, May 27, 2014, 04:08:26 AM


dedndave

....and, Intel does publish the article    :redface:

qWord

Quote from: dedndave on June 30, 2014, 07:12:37 AM
....and, Intel does publish the article    :redface:
Strange indeed, but the official documentation has the final word.
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

Hi qWord,

Quote from: qWord on June 30, 2014, 07:18:27 AM
Strange indeed, but the official documentation has the final word.

No doubt about that. But wait a little bit. I'll post my new test program here. It produces very strange results.

Gunther
You have to know the facts before you can distort them.

qWord

Quote from: Gunther on July 01, 2014, 02:06:50 AM
But wait a little bit. I'll post my new test program here. It produces very strange results.
No strange results here (requires MASM 10+):
include \masm32\include\masm32rt.inc
.686
.mmx
.xmm

IF @Version GE 1000

print_ps8 macro m256:req
xor esi,esi
.while esi < 8*REAL4
movss xmm0,REAL4 ptr m256[esi]
sub esp,8
cvtss2sd xmm0,xmm0
movsd REAL8 ptr [esp],xmm0
.if esi != 7*REAL4
push chr$("%3.2G, ")
.else
push chr$("%3.2G")
.endif
call crt_printf
add esp,12
add esi,REAL4
.endw
endm

.const
align 16
vpsVector REAL4 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
vpdVector REAL8 2.0, 2.0, 2.0, 2.0
.data?
vpsResult0 YMMWORD ?
vpsResult1 YMMWORD ?
.code

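; Returns: eax = 1 if AVX is usable (CPU and OS support), else 0
;          edx = 1 if FMA is reported as well, else 0
supports_AVX_FMA proc uses ebx      ; cpuid clobbers ebx

    .repeat
        mov eax,1
        cpuid                       ; leaf 1: feature flags in ecx
        push ecx                    ; keep the flags for the FMA test
        and ecx,18000000h           ; bit 27 = OSXSAVE, bit 28 = AVX
        cmp ecx,18000000h
        .break .if !ZERO?
        xor ecx,ecx
        xgetbv                      ; read XCR0 into edx:eax
        and eax,6h                  ; bit 1 = XMM state, bit 2 = YMM state
        cmp eax,6h
        .break .if !ZERO?           ; OS does not save the YMM registers
        pop ecx
        xor edx,edx
        mov eax,1
        test ecx,1000h              ; bit 12 = FMA
        cmovnz edx,eax
        ret
    .until 1
    pop ecx                         ; failure path: rebalance the stack
    xor eax,eax
    xor edx,edx                     ; no AVX means no FMA either
    ret

supports_AVX_FMA endp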

main proc
LOCAL bAVX:BOOL

fnx bAVX = supports_AVX_FMA

.if bAVX
print "AVX supported:",13,10

vmovups ymm0,YMMWORD ptr vpsVector
vaddps ymm0,ymm0,ymm0
vmovups vpsResult0,ymm0
vsqrtps ymm0,ymm0
vmovups vpsResult1,ymm0

fnc crt_printf,   "ymm0            = { "
print_ps8 vpsVector
fnc crt_printf,"}\nymm0+ymm0       = { "
print_ps8 vpsResult0
fnc crt_printf,"}\n"
fnc crt_printf,   "sqrt(ymm0+ymm0) = { "
print_ps8 vpsResult1
fnc crt_printf,"}\n"
.else
print "AVX not supported",13,10
.endif

inkey
exit

main endp
ELSE
.err <MASM version 10+ required>
externdef main:proc
ENDIF
end main




Just an interesting side note: AVX2 introduced a new type of memory addressing that allows a ymm register to be used as a vector index register, scaled by 1, 2, 4 or 8, in SIB addressing.
An example using MASM v11+:
vgatherdps ymm0,[esi+ymm1*4],ymm2   ; ymm1 holds 8 DWORD indices which are used to load up to 8 REAL4 values.
These V[P]GATHERxxx instructions are really powerful, because they allow vectorized addressing, and individual accesses can be masked via the third operand.

BTW: jWasm's current AVX implementation seems to be buggy...

Gunther

Hi qWord,

here is the result of your application under Windows 7-64:

AVX supported:
ymm0            = {   1,   2,   3,   4,   5,   6,   7,   8}
ymm0+ymm0       = {   2,   4,   6,   8,  10,  12,  14,  16}
sqrt(ymm0+ymm0) = { 1.4,   2, 2.4, 2.8, 3.2, 3.5, 3.7,   4}
Press any key to continue ...


Could you post the code via attachment, please?

Quote from: qWord on July 01, 2014, 11:24:04 PM
BTW: jWasm's current AVX implementation seems to be buggy...
So, what is your recommendation instead?

Gunther

qWord

Quote from: Gunther on July 02, 2014, 01:01:23 AM
Could you post the code via attachment, please?
Should I really support your laziness? :biggrin:

Quote from: Gunther on July 02, 2014, 01:01:23 AM
So, what is your recommendation instead?
MASM 10+ of course.


Gunther

Hi qWord,

Quote from: qWord on July 02, 2014, 04:24:46 AM
should I really support your laziness? :biggrin:

Special thanks for that.  :t

Quote from: qWord on July 02, 2014, 04:24:46 AM
Quote from: Gunther on July 02, 2014, 01:01:23 AM
So, what is your recommendation instead?
MASM 10+ of course.

Okay. Is that part of the current MASM32 package?

Gunther

qWord

Quote from: Gunther on July 02, 2014, 07:58:55 PM
Quote from: qWord on July 02, 2014, 04:24:46 AM
Quote from: Gunther on July 02, 2014, 01:01:23 AM
So, what is your recommendation instead?
MASM 10+ of course.

Okay. Is that part of the current MASM32 package?
No, as usual it comes with Visual Studio (2010 or later).

Gunther

Quote from: qWord on July 03, 2014, 12:06:47 AM
No, as usual it comes with Visual Studio (2010 or later).

Thank you for the information. I'll download and install it as soon as possible.

Gunther

Gunther

Hi qWord,

is that part of the Express Edition, too? If not, is there another legal way to download it?

Gunther

Gunther

I've attached the archive fsum.zip to this post. It contains a test program for AVX instructions under 32-bit Windows. It sums up an array of float values and measures the calculation time. Here is the program's output under Windows 7-32, SP 1, running as a virtual machine under VirtualBox:
Quote
Calculating the sum of a float array in different ways.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 13.55 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 6.89 Seconds
Performance Boost = 197%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 1.15 Seconds
Performance Boost = 1176%

Your current CPU doesn't support the AVX instruction set.
The application terminates now.

So no AVX support is available with that configuration. But that's tricky, because CPU-Z indicates AVX.

In compatibility mode under Windows 7-64, SP 1, the same application gives this output:
Quote
Calculating the sum of a float array in different ways.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 13.04 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 6.52 Seconds
Performance Boost = 200%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 1.11 Seconds
Performance Boost = 1173%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 0.77 Seconds
Performance Boost = 1701%

The framework is written in C and compiled with gcc. With a few minor changes it should compile with MSVC, too. The dirty work is done by the assembly language procedures. Those are assembled with jWasm, but ml should work, too. I've provided the full source.

Test results from other members in different environments would be welcome.

Gunther

sinsi

Windows 8.1 32-bit VMware guest on Windows 8.1 64-bit host

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 13.05 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 6.53 Seconds
Performance Boost = 200%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 1.11 Seconds
Performance Boost = 1175%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 0.78 Seconds
Performance Boost = 1670%