Author Topic: AVX for 32-bit Windows applications  (Read 35732 times)

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: AVX for 32-bit Windows applications
« Reply #30 on: June 30, 2014, 07:12:37 AM »
....and, Intel does publish the article    :redface:

qWord

  • Member
  • *****
  • Posts: 1473
  • The base type of a type is the type itself
    • SmplMath macros
Re: AVX for 32-bit Windows applications
« Reply #31 on: June 30, 2014, 07:18:27 AM »
Quote from: dedndave
....and, Intel does publish the article    :redface:

Indeed strange, but the official documentation has the final word.
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: AVX for 32-bit Windows applications
« Reply #32 on: July 01, 2014, 02:06:50 AM »
Hi qWord,

Quote from: qWord
Indeed strange, but the official documentation has the final word.

No doubt about that. But wait a little bit. I'll post my new test program here. It gives some very strange results.

Gunther
Get your facts first, and then you can distort them.

qWord

Re: AVX for 32-bit Windows applications
« Reply #33 on: July 01, 2014, 11:24:04 PM »
Quote from: Gunther
But wait a little bit. I'll post my new test program here. It gives some very strange results.

No strange results here (requires MASM 10+):
Code: [Select]
include \masm32\include\masm32rt.inc
.686
.mmx
.xmm

IF @Version GE 1000

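; print the 8 REAL4 elements of m256 via printf (clobbers esi and xmm0)
print_ps8 macro m256:req
    xor esi,esi
    .while esi < 8*REAL4
        movss xmm0,REAL4 ptr m256[esi]
        sub esp,8
        cvtss2sd xmm0,xmm0              ; printf expects a double for %G
        movsd REAL8 ptr [esp],xmm0
        .if esi != 7*REAL4
            push chr$("%3.2G, ")
        .else
            push chr$("%3.2G")
        .endif
        call crt_printf
        add esp,12                      ; cdecl: drop the double and the format pointer
        add esi,REAL4
    .endw
endm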

.const
align 16
vpsVector REAL4 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
vpdVector REAL8 2.0, 2.0, 2.0, 2.0
.data?
vpsResult0 YMMWORD ?
vpsResult1 YMMWORD ?
.code

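; return: eax => AVX?, edx => FMA? (edx is only meaningful if eax != 0)
supports_AVX_FMA proc uses ebx          ; cpuid clobbers ebx

    .repeat
        mov eax, 1
        cpuid
        push ecx                        ; keep the feature flags for the FMA test
        and ecx,18000000h               ; bit 27 = OSXSAVE, bit 28 = AVX
        cmp ecx,18000000h
        .break .if !ZERO?
        xor ecx,ecx                     ; select XCR0
        xgetbv
        and eax,6h                      ; bit 1 = XMM state, bit 2 = YMM state enabled by the OS
        cmp eax,6h
        .break .if !ZERO?
        pop ecx
        xor edx,edx
        mov eax,1
        test ecx,1000h                  ; bit 12 = FMA
        cmovnz edx,eax
        ret
    .until 1
    pop ecx                             ; failure path: rebalance the stack
    xor eax,eax
    xor edx,edx                         ; no AVX, so do not report FMA either
    ret

supports_AVX_FMA endp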

main proc
    LOCAL bAVX:BOOL

    fnx bAVX = supports_AVX_FMA

    .if bAVX
        print "AVX supported:",13,10

        vmovups ymm0,YMMWORD ptr vpsVector      ; load 8 packed REAL4 values
        vaddps ymm0,ymm0,ymm0                   ; ymm0 = ymm0 + ymm0
        vmovups vpsResult0,ymm0
        vsqrtps ymm0,ymm0                       ; element-wise square root
        vmovups vpsResult1,ymm0

        fnc crt_printf,   "ymm0            = { "
        print_ps8 vpsVector
        fnc crt_printf,"}\nymm0+ymm0       = { "
        print_ps8 vpsResult0
        fnc crt_printf,"}\n"
        fnc crt_printf,   "sqrt(ymm0+ymm0) = { "
        print_ps8 vpsResult1
        fnc crt_printf,"}\n"
    .else
        print "AVX not supported",13,10
    .endif

    inkey
    exit

main endp
ELSE
.err <MASM version 10+ required>
externdef main:proc
ENDIF
end main
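
For readers who prefer C, the decision made by supports_AVX_FMA can be restated as a small pure function. This is only a sketch: the struct and function names are made up for illustration, and the raw register values (ECX from CPUID leaf 1 and the low dword of XCR0) are passed in as parameters, so reading them with cpuid/xgetbv is left to the caller.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef struct {
    int avx;   /* nonzero if AVX is usable (CPU and OS agree) */
    int fma;   /* nonzero if FMA is also reported; 0 whenever avx == 0 */
} avx_support;

#define CPUID1_ECX_FMA      (1u << 12)   /* 1000h     */
#define CPUID1_ECX_OSXSAVE  (1u << 27)   /* 08000000h */
#define CPUID1_ECX_AVX      (1u << 28)   /* 10000000h */
#define XCR0_XMM_STATE      (1u << 1)
#define XCR0_YMM_STATE      (1u << 2)

/* cpuid1_ecx: ECX after CPUID with EAX = 1.
 * xcr0_low:   EAX after XGETBV with ECX = 0. XGETBV may only be executed
 *             when the OSXSAVE bit is set; pass 0 here otherwise. */
static avx_support classify_avx(uint32_t cpuid1_ecx, uint32_t xcr0_low)
{
    avx_support r = {0, 0};
    const uint32_t cpu_bits = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX;  /* 18000000h */
    const uint32_t os_bits  = XCR0_XMM_STATE | XCR0_YMM_STATE;      /* 6h */

    if ((cpuid1_ecx & cpu_bits) != cpu_bits)
        return r;                        /* CPU lacks AVX, or OS did not enable XSAVE */
    if ((xcr0_low & os_bits) != os_bits)
        return r;                        /* OS does not save/restore the YMM state */

    r.avx = 1;
    r.fma = (cpuid1_ecx & CPUID1_ECX_FMA) != 0;
    return r;
}
```

The point of the two-step test is that the AVX bit in CPUID alone is not sufficient: the operating system must also have enabled saving of the YMM state, which is what the XCR0 check verifies.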


Just an interesting side note: AVX2 introduced a new type of memory addressing (VSIB) that allows a ymm register to be used as the index register, scaled by 1, 2, 4 or 8, in SIB addressing.
An example using MASM v11+:
Code: [Select]
vgatherdps ymm0,[esi+ymm1*4],ymm2   ; ymm1 holds 8 DWORD indices which are used to load up to 8 REAL4 values.
These V[P]GATHERxxx instructions are really powerful because they allow vectorized addressing, and individual accesses can be masked out via the third operand.
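
To make the masking concrete, here is a scalar C model of what the vgatherdps line above computes. It is a sketch for illustration only: the function name is hypothetical, it ignores faults and element ordering, and it is not how one would issue the instruction from C.

```c
#include <stdint.h>

/* Scalar model of "vgatherdps ymm0,[esi+ymm1*4],ymm2" (hypothetical helper).
 *   dst   plays the role of ymm0 (8 floats)
 *   base  plays the role of esi
 *   index plays the role of ymm1 (8 signed DWORD indices)
 *   mask  plays the role of ymm2 (only the top bit of each DWORD matters) */
static void gather_ps8_model(float dst[8], const float *base,
                             const int32_t index[8], uint32_t mask[8])
{
    for (int i = 0; i < 8; ++i) {
        if (mask[i] & 0x80000000u) {
            /* Indexing a float array scales the index by 4 bytes,
             * which corresponds to the *4 in the addressing form. */
            dst[i] = base[index[i]];
        }
        /* Elements whose mask bit is clear keep their old value in dst.
         * The instruction leaves the mask register zeroed when it completes. */
        mask[i] = 0;
    }
}
```

Because the mask register is cleared by the instruction, it has to be reloaded before every gather.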

BTW: jWasm's current AVX implementation seems to be buggy...

Gunther

Re: AVX for 32-bit Windows applications
« Reply #34 on: July 02, 2014, 01:01:23 AM »
Hi qWord,

here is the result of your application under Windows 7-64:
Code: [Select]
AVX supported:
ymm0            = {   1,   2,   3,   4,   5,   6,   7,   8}
ymm0+ymm0       = {   2,   4,   6,   8,  10,  12,  14,  16}
sqrt(ymm0+ymm0) = { 1.4,   2, 2.4, 2.8, 3.2, 3.5, 3.7,   4}
Press any key to continue ...

Could you post the code via attachment, please?

Quote from: qWord
BTW: jWasm's current AVX implementation seems to be buggy...

So, what is your recommendation instead?

Gunther

qWord

Re: AVX for 32-bit Windows applications
« Reply #35 on: July 02, 2014, 04:24:46 AM »
Quote from: Gunther
Could you post the code via attachment, please?

Should I really support your laziness? :biggrin:

Quote from: Gunther
So, what is your recommendation instead?

MASM 10+ of course.


dedndave

Re: AVX for 32-bit Windows applications
« Reply #36 on: July 02, 2014, 04:31:48 AM »
Quote from: qWord
Should I really support your laziness? :biggrin:

yes   :biggrin:

Gunther

Re: AVX for 32-bit Windows applications
« Reply #37 on: July 02, 2014, 07:58:55 PM »
Hi qWord,

Quote from: qWord
Should I really support your laziness? :biggrin:

Special thanks for that.  :t

Quote from: Gunther
So, what is your recommendation instead?
Quote from: qWord
MASM 10+ of course.

Okay. Is that part of the current MASM32 package?

Gunther

qWord

Re: AVX for 32-bit Windows applications
« Reply #38 on: July 03, 2014, 12:06:47 AM »
Quote from: Gunther
So, what is your recommendation instead?
Quote from: qWord
MASM 10+ of course.
Quote from: Gunther
Okay. Is that part of the current MASM32 package?

No, as usual it comes with Visual Studio (2010 or later).

Gunther

Re: AVX for 32-bit Windows applications
« Reply #39 on: July 03, 2014, 01:43:42 AM »
Quote from: qWord
No, as usual it comes with Visual Studio (2010 or later).

Thank you for the information. I'll download and install it as soon as possible.

Gunther

Gunther

Re: AVX for 32-bit Windows applications
« Reply #40 on: July 03, 2014, 04:49:59 AM »
Hi qWord,

Is that part of the Express Edition, too? If not, is there another legal way to download it?

Gunther

dedndave

Re: AVX for 32-bit Windows applications
« Reply #41 on: July 03, 2014, 08:35:16 AM »
sent you a PM, Gunther

Gunther

Re: AVX for 32-bit Windows applications
« Reply #42 on: July 04, 2014, 04:11:04 AM »
Quote from: dedndave
sent you a PM, Gunther

Received.  :t

Gunther

Gunther

Re: AVX for 32-bit Windows applications
« Reply #43 on: July 15, 2014, 09:35:05 AM »
I've attached the archive fsum.zip to this post. It contains a test program for AVX instructions under 32-bit Windows. It sums up an array of float values and measures the calculation time. Here is the program's output under 32-bit Windows 7 SP1, running as a virtual machine under VirtualBox:
Quote

Calculating the sum of a float array in different ways.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 13.55 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 6.89 Seconds
Performance Boost = 197%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 1.15 Seconds
Performance Boost = 1176%

Your current CPU doesn't support the AVX instruction set.
The application terminates now.

So no AVX support is detected with that configuration. But that's tricky, because CPU-Z does indicate AVX.

In compatibility mode under 64-bit Windows 7 SP1, the same application gives this output:
Quote

Calculating the sum of a float array in different ways.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 13.04 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 6.52 Seconds
Performance Boost = 200%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 1.11 Seconds
Performance Boost = 1173%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 0.77 Seconds
Performance Boost = 1701%

The framework is written in C and compiled with gcc. With a few minor changes it should compile with MSVC, too. The dirty work is done by the assembly language procedures. Those are assembled with jWasm, but ml should work, too. I've provided the full source.
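
For readers without the archive at hand, the difference between the "simple" and the "4 accumulators" C variants can be sketched as follows. This is an assumption about their general shape, not the code from fsum.zip, and the function names are hypothetical.

```c
#include <stddef.h>

/* One accumulator: every addition depends on the result of the previous one. */
static float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Four accumulators: four independent dependency chains that the CPU can
 * overlap. The partial sums are combined once at the end. */
static float sum_4acc(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

The XMM and YMM versions apply the same idea with 4 or 8 floats per register. Note that changing the order of the additions can change the rounding of a float sum in general, so identical results across the variants are not guaranteed for arbitrary data.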

Test results from other members in different environments would be welcome.

Gunther

sinsi

  • Guest
Re: AVX for 32-bit Windows applications
« Reply #44 on: July 15, 2014, 10:14:22 AM »
Windows 8.1 32-bit VMware guest on Windows 8.1 64-bit host

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 13.05 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 6.53 Seconds
Performance Boost = 200%

Assembly Language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 1.11 Seconds
Performance Boost = 1175%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 0.78 Seconds
Performance Boost = 1670%