Here's another small program that I've put together. It's based on a code example that I ran across in the Intel Optimization Reference Manual (Ex 11-3, Direct Polynomial Calculation). The manual gives almost no explanation of what's going on in the three versions of this example (SSE, 128-bit AVX, 256-bit AVX), and a couple of typos make it harder to understand (e.g., the "vmulpsymm1" instruction, presumably "vmulps ymm1" with the space missing, which threw me for a bit).
In a nutshell, for each float element x in an array, the routine calculates x^3 + x^2 + x. This is not all that interesting, but what might be of interest is the use of the AVX instructions vmovups, vmulps, and vaddps, all of which have 256-bit YMM registers as operands.
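For reference, here is a minimal scalar sketch in C of the same calculation. The function name poly_scalar and its signature are made up for illustration; they are not taken from the attached project.

```c
#include <stddef.h>

/* Scalar reference: out[i] = x^3 + x^2 + x for each element x of in.
   poly_scalar is a hypothetical name used only for this sketch. */
void poly_scalar(const float *in, float *out, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        float x  = in[i];
        float x2 = x * x;          /* x^2 */
        out[i] = x2 * x + x2 + x;  /* x^3 + x^2 + x */
    }
}
```

With inputs 1, 2, 3, 4 this gives 3, 14, 39, 84, which matches the first four values in the console output further down.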
Because this is a 64-bit build, the assembly code is in its own file rather than inline (the Microsoft compiler does not support inline assembly for x64 targets).
Here's my assembly routine. (It's also in the attached zip file.) My array is not aligned on a 32-byte boundary, so I'm using vmovups rather than vmovaps.
avxTest PROC C
; Parameters:
; inputArray address in RCX.
; outputArray address in RDX.
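; (arr_len - 8) in R8, i.e. the index of the last block of 8 floats.
; arr_len is assumed to be a nonzero multiple of 8.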
; Returns nothing.
loop1:
vmovups ymm0, [rcx+r8*4] ; Load A, starting 8 floats from the end.
vmulps ymm1, ymm0, ymm0 ; Calculate A^2.
vmulps ymm2, ymm1, ymm0 ; Calculate A^3.
vaddps ymm0, ymm0, ymm1 ; Calculate A + A^2.
vaddps ymm0, ymm0, ymm2 ; Calculate A+A^2+A^3.
vmovups [rdx+r8*4], ymm0 ; Store result.
sub r8, 8
jge loop1
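vzeroupper ; Clear upper YMM halves. (vzeroall would also zero XMM6-XMM15, which are nonvolatile in the Windows x64 calling convention.)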
RET
avxTest ENDP
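The way the loop walks the arrays is easier to see in a portable C model. This is only a sketch of the indexing, not the code in the zip file: the variable start stands in for R8, the inner loop stands in for the eight lanes of a YMM register, and the name poly_blocks_backward is invented for the example. It assumes the array length is a nonzero multiple of 8.

```c
#include <assert.h>
#include <stddef.h>

/* Portable model of the assembly loop's indexing (hypothetical helper).
   start plays the role of R8: it begins at len - 8 and steps down by 8
   until it goes negative, mirroring "sub r8, 8 / jge loop1". */
void poly_blocks_backward(const float *in, float *out, ptrdiff_t len)
{
    assert(len > 0 && len % 8 == 0);   /* assumption: whole blocks of 8 */
    for (ptrdiff_t start = len - 8; start >= 0; start -= 8) {
        /* One iteration here corresponds to one 256-bit load/store. */
        for (int lane = 0; lane < 8; ++lane) {
            float a  = in[start + lane];
            float a2 = a * a;                   /* A^2 */
            float a3 = a2 * a;                  /* A^3 */
            out[start + lane] = (a + a2) + a3;  /* same order as the vaddps pair */
        }
    }
}
```

If the length were not a multiple of 8, the starting index would not land on 0 and the first few elements would be left unprocessed, which is why the sketch asserts on it.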
There is also a simple timing routine that I wrote, which uses the RDTSCP instruction. I don't know how accurate it is. Per the Intel Instruction Set Reference:
"The RDTSCP instruction waits until all previous instructions have been executed before reading the counter.
However, subsequent instructions may begin execution before the read operation is performed."
Here's my code.
readTimeStamp PROC C
;; Returns the current time, in clocks, in RAX.
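rdtscp ; EDX:EAX = time-stamp counter. Also overwrites ECX (IA32_TSC_AUX).
shl rdx, 32 ; Move the high 32 bits into the upper half of RDX.
or rax, rdx ; Combine high and low halves into a 64-bit count in RAX.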
ret
readTimeStamp ENDP
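The shl/or pair just glues two 32-bit halves into one 64-bit count. Here is a small C sketch of that step, plus the subtraction used to get an elapsed time from two readings. Both function names are hypothetical and are not part of the attached project.

```c
#include <stdint.h>

/* Combine the low half (EAX) and high half (EDX) of a time-stamp reading
   into one 64-bit value, as "shl rdx, 32" followed by "or rax, rdx" does.
   combine_tsc is a hypothetical name for this sketch. */
uint64_t combine_tsc(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Elapsed clocks between two readings (hypothetical helper).
   Unsigned subtraction wraps modulo 2^64, so the difference is still
   meaningful if the counter wrapped once between the readings. */
uint64_t elapsed_clocks(uint64_t start, uint64_t stop)
{
    return stop - start;
}
```

For example, a low half of 0x89ABCDEF and a high half of 0x01234567 combine to 0x0123456789ABCDEF.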
I'm aware of timing code on this board, but haven't had a chance to explore it and try it out.
Here's my console output:
C:\Users\Mark\Documents\Visual Studio 2010\Projects\AVX64\amd64\Release>avx64
Value 0 is 3.000
Value 1 is 14.000
Value 2 is 39.000
Value 3 is 84.000
Value 4 is 155.000
Value 5 is 258.000
Value 6 is 399.000
Value 7 is 584.000
Value 8 is 14.000
Value 9 is 84.000
Value 10 is 258.000
Value 11 is 584.000
Value 12 is 1110.000
Value 13 is 1884.000
Value 14 is 2954.000
Value 15 is 4368.000
Value 16 is 39.000
Value 17 is 258.000
Value 18 is 819.000
Value 19 is 1884.000
Value 20 is 3615.000
Value 21 is 6174.000
Value 22 is 9723.000
Value 23 is 14424.000
Elapsed time: 121 clocks
Elapsed time: 163 clocks
C:\Users\Mark\Documents\Visual Studio 2010\Projects\AVX64\amd64\Release>
The first time shown above is from my assembly routine. The second time is from a C for loop in main that does more or less the same thing as my assembly routine. Here is the code that the (VS 10) compiler generated, obtained by disassembling the for loop. Although the compiler is using SSE instructions and XMM registers, it's using the scalar operations (movss, mulss, addss) rather than the packed ones. In other words, it is putting a single 32-bit float into a 128-bit register and operating on just that one element.
{
temp1 = inputArray[i];
000000013FED3618 movsxd rax,dword ptr [rsp+11Ch]
000000013FED3620 movss xmm0,dword ptr [rsp+rax*4+30h]
000000013FED3626 movss dword ptr [rsp+114h],xmm0
temp2 = temp1 * temp1;
000000013FED362F movss xmm0,dword ptr [rsp+114h]
000000013FED3638 mulss xmm0,dword ptr [rsp+114h]
000000013FED3641 movss dword ptr [rsp+118h],xmm0
outputArray[i] = temp2 * temp1 + temp2 + temp1;
000000013FED364A movss xmm0,dword ptr [rsp+118h]
000000013FED3653 mulss xmm0,dword ptr [rsp+114h]
000000013FED365C addss xmm0,dword ptr [rsp+118h]
000000013FED3665 addss xmm0,dword ptr [rsp+114h]
000000013FED366E movsxd rax,dword ptr [rsp+11Ch]
000000013FED3676 movss dword ptr [rsp+rax*4+0B0h],xmm0
}
The attached zip file contains the C main function, the two assembly procs, and the exe.
Mark