AVX for 32-bit Windows applications

dedndave · June 30, 2014, 07:12:37 AM

....and, Intel does publish the article :redface:

qWord · June 30, 2014, 07:18:27 AM

Quote from: dedndave on June 30, 2014, 07:12:37 AM
....and, Intel does publish the article :redface:

Indeed strange, but the official documentation has the final word.

Gunther · July 01, 2014, 02:06:50 AM

Hi qWord,

Quote from: qWord on June 30, 2014, 07:18:27 AM
Indeed strange, but the official documentation has the final word.

No doubt about that. But wait a little bit. I'll post my new test program here. It produces very strange results.

Gunther

qWord · July 01, 2014, 11:24:04 PM

Quote from: Gunther on July 01, 2014, 02:06:50 AM
But wait a little bit. I'll post my new test program here. It produces very strange results.

No strange result here (requires MASM 10+):

include \masm32\include\masm32rt.inc
.686
.mmx
.xmm

IF @Version GE 1000

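; Prints the 8 REAL4 elements of a 256-bit memory operand via crt_printf (clobbers esi and xmm0)
print_ps8 macro m256:req
	xor esi,esi
	.while esi < 8*REAL4
		movss xmm0,REAL4 ptr m256[esi]
		sub esp,8	; room for one REAL8 vararg
		cvtss2sd xmm0,xmm0	; printf takes floating-point varargs as double
		movsd REAL8 ptr [esp],xmm0
		.if esi != 7*REAL4
			push chr$("%3.2G, ")
		.else
			push chr$("%3.2G")
		.endif
		call crt_printf
		add esp,12	; cdecl cleanup: format pointer + REAL8
		add esi,REAL4
	.endw
endm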

.const
	align 16
	vpsVector	REAL4 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
	vpdVector	REAL8 2.0, 2.0, 2.0, 2.0
.data?
	vpsResult0	YMMWORD ?
	vpsResult1	YMMWORD ?
.code

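; return: eax != 0 => AVX usable (CPU and OS), edx != 0 => FMA available
; (edx is only meaningful when eax is nonzero)
supports_AVX_FMA proc uses ebx	; cpuid overwrites ebx

	.repeat
		mov eax, 1
		cpuid	; leaf 1: feature flags in ecx/edx
		push ecx	; keep the flags for the FMA test
		and ecx,18000000h	; bit 27 = OSXSAVE, bit 28 = AVX
		cmp ecx,18000000h
		.break .if !ZERO?
		xor ecx,ecx
		xgetbv	; read XCR0 into edx:eax
		and eax,6h	; bit 1 = XMM state, bit 2 = YMM state enabled by the OS
		cmp eax,6h
		.break .if !ZERO?
		pop ecx
		xor edx,edx
		mov eax,1
		test ecx,1000h	; bit 12 = FMA
		cmovnz edx,eax
		ret
	.until 1
	pop ecx	; failure path: discard the saved flags
	xor eax,eax
	ret
	
supports_AVX_FMA endp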

main proc
LOCAL bAVX:BOOL

	fnx bAVX = supports_AVX_FMA
	
	.if bAVX
		print "AVX supported:",13,10
		
		vmovups ymm0,YMMWORD ptr vpsVector
		vaddps ymm0,ymm0,ymm0
		vmovups vpsResult0,ymm0
		vsqrtps ymm0,ymm0
		vmovups vpsResult1,ymm0
	
		fnc crt_printf,   "ymm0            = { "
		print_ps8 vpsVector
		fnc crt_printf,"}\nymm0+ymm0       = { "
		print_ps8 vpsResult0
		fnc crt_printf,"}\n"
		fnc crt_printf,   "sqrt(ymm0+ymm0) = { "
		print_ps8 vpsResult1
		fnc crt_printf,"}\n"
	.else
		print "AVX not supported",13,10
	.endif
	
	inkey
	exit
	
main endp
ELSE
	.err <MASM version 10+ required>
	externdef main:proc
ENDIF
end main
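
For readers who prefer C, the decision logic of supports_AVX_FMA can be sketched roughly as below. This is only a minimal illustration, not part of the posted program: the helper names avx_usable and fma_usable and the macro names are made up here, and the functions take the raw register values as parameters instead of executing cpuid/xgetbv themselves. The bit positions are the ones the MASM listing tests.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions tested in the MASM listing:
   CPUID leaf 1, ECX register, and the low dword of XCR0. */
#define CPUID1_ECX_FMA     (1u << 12)
#define CPUID1_ECX_OSXSAVE (1u << 27)
#define CPUID1_ECX_AVX     (1u << 28)
#define XCR0_XMM_STATE     (1u << 1)
#define XCR0_YMM_STATE     (1u << 2)

/* Hypothetical helper: AVX is usable only if the CPU reports OSXSAVE and AVX
   and the OS has enabled XMM and YMM state in XCR0.
   xgetbv may only be executed when OSXSAVE is set, so a caller should pass
   xcr0_lo = 0 when that bit is clear. */
static bool avx_usable(uint32_t cpuid1_ecx, uint32_t xcr0_lo)
{
    const uint32_t cpu_bits = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX;
    const uint32_t os_bits  = XCR0_XMM_STATE | XCR0_YMM_STATE;

    if ((cpuid1_ecx & cpu_bits) != cpu_bits)
        return false;                       /* CPU side missing */
    return (xcr0_lo & os_bits) == os_bits;  /* OS side */
}

/* Hypothetical helper: FMA is reported only on top of usable AVX,
   mirroring the order of the checks in the assembly procedure. */
static bool fma_usable(uint32_t cpuid1_ecx, uint32_t xcr0_lo)
{
    return avx_usable(cpuid1_ecx, xcr0_lo)
        && (cpuid1_ecx & CPUID1_ECX_FMA) != 0;
}
```

With gcc, the ECX value could be obtained through __get_cpuid from <cpuid.h>; reading XCR0 still needs the xgetbv instruction, for example via inline assembly. Keeping the decision separate from the register reads makes the logic easy to check with fixed values.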

Just an interesting side note: AVX2 introduced a new type of memory addressing (VSIB) that allows a ymm register to be used as the index register, scaled by 1, 2, 4 or 8, in SIB addressing.
An example using MASM v11+:

vgatherdps ymm0,[esi+ymm1*4],ymm2 ; ymm1 holds 8 DWORD indices which are used to load up to 8 REAL4 values.

These V[P]GATHERxxx instructions are really powerful, because they allow vectorized addressing, and individual accesses can be masked through the third operand.

BTW: jWasm's current AVX implementation seems to be buggy...

Gunther · July 02, 2014, 01:01:23 AM

Hi qWord,

here is the result of your application under Windows 7-64:

AVX supported:
ymm0            = {   1,   2,   3,   4,   5,   6,   7,   8}
ymm0+ymm0       = {   2,   4,   6,   8,  10,  12,  14,  16}
sqrt(ymm0+ymm0) = { 1.4,   2, 2.4, 2.8, 3.2, 3.5, 3.7,   4}
Press any key to continue ...

Could you post the code via attachment, please?

Quote from: qWord on July 01, 2014, 11:24:04 PM
BTW: jWasm's current AVX implementation seems to be buggy...

So, what is your recommendation instead?

Gunther

qWord · July 02, 2014, 04:24:46 AM

Quote from: Gunther on July 02, 2014, 01:01:23 AM
Could you post the code via attachment, please?

should I really support your laziness?

Quote from: Gunther on July 02, 2014, 01:01:23 AM
So, what is your recommendation instead?

MASM 10+ of course.

dedndave · July 02, 2014, 04:31:48 AM

Quote from: qWord on July 02, 2014, 04:24:46 AM
should I really support your laziness?

yes

Gunther · July 02, 2014, 07:58:55 PM

Hi qWord,

Quote from: qWord on July 02, 2014, 04:24:46 AM
should I really support your laziness?

Special thanks for that. :t

Quote from: qWord on July 02, 2014, 04:24:46 AM
Quote from: Gunther on July 02, 2014, 01:01:23 AM
So, what is your recommendation instead?
MASM 10+ of course.

Okay. Is that part of the current MASM32 package?

Gunther

qWord · July 03, 2014, 12:06:47 AM

Quote from: Gunther on July 02, 2014, 07:58:55 PM
Quote from: qWord on July 02, 2014, 04:24:46 AM
Quote from: Gunther on July 02, 2014, 01:01:23 AM
So, what is your recommendation instead?
MASM 10+ of course.

Okay. Is that part of the current MASM32 package?

No, as usual it comes with Visual Studio (2010 or later).

Gunther · July 03, 2014, 01:43:42 AM

Quote from: qWord on July 03, 2014, 12:06:47 AM
No, as usual it comes with Visual Studio (2010 or later).

Thank you for the information. I'll download and install it as soon as possible.

Gunther

Gunther · July 03, 2014, 04:49:59 AM

Hi qWord,

Is that part of the Express Edition, too? If not, is there another legal way to download it?

Gunther

dedndave · July 03, 2014, 08:35:16 AM

sent you a PM, Gunther

Gunther · July 04, 2014, 04:11:04 AM

Quote from: dedndave on July 03, 2014, 08:35:16 AM
sent you a PM, Gunther

Received. :t

Gunther

Gunther · July 15, 2014, 09:35:05 AM

I've attached the archive fsum.zip to this post. It contains a test program for AVX instructions under 32-bit Windows. It sums an array of float values and measures the calculation time. Here is the program's output under Windows 7-32, SP 1, running as a virtual machine under VirtualBox:

Quote
Calculating the sum of a float array in different ways.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 13.55 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 6.89 Seconds
Performance Boost = 197%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 1.15 Seconds
Performance Boost = 1176%

Your current CPU doesn't support the AVX instruction set.
The application terminates now.

No AVX support is available with that configuration. But that's tricky: CPU-Z indicates AVX.

In compatibility mode under Windows 7-64, SP 1, the same application gives this output:

Quote
Calculating the sum of a float array in different ways.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 13.04 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 6.52 Seconds
Performance Boost = 200%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 1.11 Seconds
Performance Boost = 1173%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 0.77 Seconds
Performance Boost = 1701%

The framework is written in C and compiled with gcc. With a few minor changes it should compile with MSVC, too. The dirty work is done by the assembly language procedures. Those are assembled with jWasm, but ml should work, too. I've provided the full source.
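
The source in fsum.zip is not reproduced in this thread, so the following is only a minimal sketch of what the two C variants in the output might look like. The function names are made up for illustration, and the "Performance Boost" formula (reference time divided by measured time, in percent) is an assumption inferred from the printed numbers, for example 13.04 s / 6.52 s = 200%.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every addition has to wait for the previous one. */
static float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Four independent accumulators: the four additions of one iteration do not
   depend on each other, so the CPU can overlap them. */
static float sum_4acc(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}

/* Assumed meaning of "Performance Boost": reference time / measured time, in percent. */
static double boost_percent(double t_ref, double t)
{
    assert(t > 0.0);
    return 100.0 * t_ref / t;
}
```

The XMM and YMM assembly versions presumably follow the same pattern, with each accumulator being a whole vector register that holds 4 or 8 partial sums. Since float addition is not associative, the variants can in general differ in the last digits; the sums above agree exactly, which is what one would expect if all intermediate values are exactly representable.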

Test results from other members in different environments would be welcome.

Gunther

sinsi · July 15, 2014, 10:14:22 AM

Windows 8.1 32-bit VMware guest on Windows 8.1 64-bit host

Simple C implementation:
------------------------
sum1 = 8390656.00
Elapsed Time = 13.05 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2 = 8390656.00
Elapsed Time = 6.53 Seconds
Performance Boost = 200%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3 = 8390656.00
Elapsed Time = 1.11 Seconds
Performance Boost = 1175%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4 = 8390656.00
Elapsed Time = 0.78 Seconds
Performance Boost = 1670%

The MASM Forum
