The MASM Forum

64 bit assembler => 64 bit assembler. Conceptual Issues => Topic started by: Mark44 on February 24, 2014, 06:33:04 AM

Title: Using AVX instructions to add and multiply 8 floats at a time
Post by: Mark44 on February 24, 2014, 06:33:04 AM
Here's another small program that I've put together. It's based on a code example that I ran across in the Intel Optimization Reference Manual (Example 11-3, Direct Polynomial Calculation). There is almost no explanation of what's going on in the three versions of this example (SSE, 128-bit AVX, 256-bit AVX), and there are a couple of typos that make it harder to understand (e.g., the "vmulpsymm1" instruction, which threw me for a bit).

In a nutshell, for each float element x in an array, the routine calculates x^3 + x^2 + x. This is not all that interesting, but what might be of interest is the use of the AVX instructions vmovups, vmulps, and vaddps, all of which have 256-bit YMM registers as operands.

Because this is a 64-bit build, the assembly code is in its own file rather than inline (the 64-bit Visual C++ compiler doesn't support inline assembly).

Here's my assembly routine. (It's also in the attached zip file.) My array is not aligned on a 32-byte boundary, so I'm using vmovups rather than vmovaps.

avxTest PROC C     
; Parameters:
;   inputArray address in RCX.
;   outputArray address in RDX.
;   arr_len - 8  in R8.
; Returns nothing.

loop1:
vmovups ymm0, [rcx+r8*4] ; Load A, starting 8 floats from the end.
vmulps ymm1, ymm0, ymm0 ; Calculate A^2.
vmulps ymm2, ymm1, ymm0 ; Calculate A^3.
vaddps ymm0, ymm0, ymm1 ; Calculate A + A^2.
vaddps ymm0, ymm0, ymm2 ; Calculate A+A^2+A^3.
vmovups [rdx+r8*4], ymm0 ; Store result.
sub r8, 8
jge loop1
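vzeroupper ; Clear the upper halves of the YMM registers. (vzeroall would also zero XMM6-XMM15, which are nonvolatile in the Windows x64 calling convention.)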
RET
avxTest ENDP
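
For anyone who wants to check the indexing without an AVX machine, here is a minimal scalar sketch in C of what the loop above appears to do. The name poly_blocks_of_8 is made up for illustration and is not in the attached zip. The sketch assumes the array length is a multiple of 8 and at least 8, which is what the loop's indexing implies.

```c
#include <stddef.h>

/* Scalar model of the avxTest loop (hypothetical helper, not in the zip).
   'start' plays the role of R8 on entry, i.e. arr_len - 8.
   Assumes start >= 0 and that arr_len is a multiple of 8. */
void poly_blocks_of_8(const float *in, float *out, ptrdiff_t start)
{
    ptrdiff_t base;
    int lane;

    /* Walk the array from the last block of 8 floats down to the first,
       like "sub r8, 8 / jge loop1". */
    for (base = start; base >= 0; base -= 8) {
        /* One 256-bit YMM register holds 8 floats; the AVX instructions
           handle all 8 lanes at once, here they are done one by one. */
        for (lane = 0; lane < 8; lane++) {
            float a  = in[base + lane];       /* vmovups ymm0, [rcx+r8*4] */
            float a2 = a * a;                 /* vmulps  ymm1, ymm0, ymm0 */
            float a3 = a2 * a;                /* vmulps  ymm2, ymm1, ymm0 */
            out[base + lane] = (a + a2) + a3; /* the two vaddps, then the store */
        }
    }
}
```

With 16 floats holding 1.0 through 16.0 and start = 8, the call handles elements 8..15 first and then 0..7, and element 1 (x = 2) comes out as 8 + 4 + 2 = 14.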


I also wrote a simple timing routine that uses the RDTSCP instruction. I don't know how accurate this is. Per the Intel Instruction Set Reference:
Quote: "The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed."

Here's my code.
readTimeStamp PROC C
;; Returns the current time, in clocks, in RAX.
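rdtscp ; EDX:EAX = time-stamp counter. Also overwrites ECX, which is volatile, so no save is needed.
shl rdx, 32 ; Move the high 32 bits into the upper half of RDX.
or rax, rdx ; Combine both halves into a 64-bit count in RAX.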
ret
readTimeStamp ENDP
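
RDTSCP hands back the 64-bit counter split across two registers, with the high half in EDX and the low half in EAX, which is why the routine needs the shl/or pair. Here is a small C sketch of the same arithmetic, plus the subtraction that gives an elapsed count. Both function names are hypothetical and only for illustration.

```c
#include <stdint.h>

/* Mirrors "shl rdx, 32 / or rax, rdx": build one 64-bit value from the
   two 32-bit halves that RDTSCP returns. Hypothetical helper. */
uint64_t combine_tsc(uint32_t eax_low, uint32_t edx_high)
{
    return ((uint64_t)edx_high << 32) | eax_low;
}

/* Elapsed clocks between two readings of the counter. Hypothetical helper. */
uint64_t elapsed_clocks(uint64_t start, uint64_t stop)
{
    return stop - start;
}
```

On the C side, the idea is to call readTimeStamp before and after the code being timed and subtract the first reading from the second.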

I'm aware of timing code on this board, but haven't had a chance to explore it and try it out.

Here's my console output:

C:\Users\Mark\Documents\Visual Studio 2010\Projects\AVX64\amd64\Release>avx64
Value  0 is      3.000
Value  1 is     14.000
Value  2 is     39.000
Value  3 is     84.000
Value  4 is    155.000
Value  5 is    258.000
Value  6 is    399.000
Value  7 is    584.000
Value  8 is     14.000
Value  9 is     84.000
Value 10 is    258.000
Value 11 is    584.000
Value 12 is   1110.000
Value 13 is   1884.000
Value 14 is   2954.000
Value 15 is   4368.000
Value 16 is     39.000
Value 17 is    258.000
Value 18 is    819.000
Value 19 is   1884.000
Value 20 is   3615.000
Value 21 is   6174.000
Value 22 is   9723.000
Value 23 is  14424.000
Elapsed time: 121 clocks
Elapsed time: 163 clocks

C:\Users\Mark\Documents\Visual Studio 2010\Projects\AVX64\amd64\Release>

The first time shown above is from my assembly routine. The second time is from a C for loop in main that does more or less the same thing as my assembly routine. Here is the code that the (VS 10) compiler generated, obtained by disassembling the for loop. Although the compiler is using SSE instructions and XMM registers, it's using the scalar operations (movss, mulss, addss) rather than the packed operations. IOW, each instruction works on a single 32-bit float in the low element of a 128-bit register.
{
temp1 = inputArray[i];
000000013FED3618  movsxd      rax,dword ptr [rsp+11Ch] 
000000013FED3620  movss       xmm0,dword ptr [rsp+rax*4+30h] 
000000013FED3626  movss       dword ptr [rsp+114h],xmm0 
temp2 = temp1 * temp1;
000000013FED362F  movss       xmm0,dword ptr [rsp+114h] 
000000013FED3638  mulss       xmm0,dword ptr [rsp+114h] 
000000013FED3641  movss       dword ptr [rsp+118h],xmm0 
outputArray[i] = temp2 * temp1 + temp2 + temp1;
000000013FED364A  movss       xmm0,dword ptr [rsp+118h] 
000000013FED3653  mulss       xmm0,dword ptr [rsp+114h] 
000000013FED365C  addss       xmm0,dword ptr [rsp+118h] 
000000013FED3665  addss       xmm0,dword ptr [rsp+114h] 
000000013FED366E  movsxd      rax,dword ptr [rsp+11Ch] 
000000013FED3676  movss       dword ptr [rsp+rax*4+0B0h],xmm0 
}

The attached zip file contains the C main function, the two assembly procs, and the exe.
Mark
Title: Re: Using AVX instructions to add and multiply 8 floats at a time
Post by: Gunther on February 25, 2014, 03:56:46 AM
Mark,

here is the output of your program:

Value  0 is      3.000
Value  1 is     14.000
Value  2 is     39.000
Value  3 is     84.000
Value  4 is    155.000
Value  5 is    258.000
Value  6 is    399.000
Value  7 is    584.000
Value  8 is     14.000
Value  9 is     84.000
Value 10 is    258.000
Value 11 is    584.000
Value 12 is   1110.000
Value 13 is   1884.000
Value 14 is   2954.000
Value 15 is   4368.000
Value 16 is     39.000
Value 17 is    258.000
Value 18 is    819.000
Value 19 is   1884.000
Value 20 is   3615.000
Value 21 is   6174.000
Value 22 is   9723.000
Value 23 is  14424.000
Elapsed time: 262 clocks
Elapsed time: 332 clocks


You should check for AVX support. I had a similar problem and solved it this way (http://masm32.com/board/index.php?topic=795.0).

Gunther
Title: Re: Using AVX instructions to add and multiply 8 floats at a time
Post by: Mark44 on February 26, 2014, 05:15:51 AM
I didn't check first for AVX support because I already knew that the computer supported AVX (from the CPUInfo utility). Still, you're right: before attempting to use AVX features, it's prudent to check that they're actually there.
Title: Re: Using AVX instructions to add and multiply 8 floats at a time
Post by: Gunther on February 26, 2014, 07:16:51 AM
Mark,

yes, it's easy to add, and you're on the safe side with your application. You've done another solid piece of work.

Gunther
Title: Re: Using AVX instructions to add and multiply 8 floats at a time
Post by: Mark44 on February 27, 2014, 08:58:29 AM
Thanks!
In any future code I submit, I'll be sure to include code that checks for the presence of whatever feature I'm using. And I'll update the code in this thread to check for CPU and OS support for AVX in the next day or two.
Title: Re: Using AVX instructions to add and multiply 8 floats at a time
Post by: Mark44 on February 27, 2014, 05:18:37 PM
The attached version now checks whether AVX is enabled. My isSupportedAVX is nearly identical to the example in the current Intel Software Developer's Manual.

isSupportedAVX PROC C
; Determines whether AVX is supported by both the processor and the OS.
; Parameters: None.
; Returns 1 in EAX if AVX is supported, and 0 otherwise.

mov r8, rbx ; CPUID overwrites EBX, and RBX is nonvolatile in the Windows x64 calling convention, so keep a copy in a volatile register.
mov eax, 1
cpuid
; Check for OSXSAVE and AVX feature flags.
and ecx, 018000000h ; Bit 27 == OSXSAVE supported, bit 28 == AVX supported
cmp ecx, 018000000h
jne NotSupported

; Processor supports AVX instructions and OS enables XGETBV.
mov ecx, 0  ; Specify 0 for the XCR0 register.
xgetbv ; Result in EDX:EAX.
and eax, 06h
cmp eax, 06H ; Check for support for XMM state and YMM state.
jne NotSupported
mov eax, 1
jmp Done

NotSupported:
mov eax, 0

Done:
mov rbx, r8 ; Restore the caller's RBX.
RET
isSupportedAVX ENDP
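
For reference, the two mask tests in the routine above can be written out in C. This is only a sketch of the bit logic: avx_usable and the macro names are made up for illustration, and the function takes the CPUID leaf 1 ECX value and the low 32 bits of XCR0 as plain arguments instead of executing CPUID or XGETBV itself.

```c
#include <stdint.h>

/* Bit positions used by isSupportedAVX (macro names are hypothetical). */
#define CPUID1_ECX_OSXSAVE (1u << 27) /* OS has enabled XSAVE/XGETBV   */
#define CPUID1_ECX_AVX     (1u << 28) /* processor supports AVX        */
#define XCR0_XMM_STATE     (1u << 1)  /* OS saves/restores XMM state   */
#define XCR0_YMM_STATE     (1u << 2)  /* OS saves/restores YMM state   */

/* Returns 1 if both the processor and the OS support AVX, 0 otherwise.
   cpuid1_ecx: ECX after CPUID with EAX = 1.
   xcr0_low:   EAX after XGETBV with ECX = 0. */
int avx_usable(uint32_t cpuid1_ecx, uint32_t xcr0_low)
{
    const uint32_t cpu_mask = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX; /* 018000000h */
    const uint32_t os_mask  = XCR0_XMM_STATE | XCR0_YMM_STATE;     /* 06h */

    /* Both feature flags must be set before XGETBV may be relied on. */
    if ((cpuid1_ecx & cpu_mask) != cpu_mask)
        return 0;

    /* The OS must have enabled both XMM and YMM state. */
    return (xcr0_low & os_mask) == os_mask;
}
```

Both bits have to be present in each test, which is why the assembly does an and followed by a cmp against the full mask rather than a single test for nonzero.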
Title: Re: Using AVX instructions to add and multiply 8 floats at a time
Post by: Gunther on February 28, 2014, 12:10:02 AM
Thank you, Mark. Works like a charm.

Gunther