Using AVX instructions to add and multiply 8 floats at a time

Mark44 · February 24, 2014, 06:33:04 AM

Here's another small program that I've put together. It's based on a code example that I ran across in the Intel Optimization Reference manual (Ex 11 - 3. Direct Polynomial Calculation). There is almost no explanation of what's going on in the three versions of this example (SSE, 128-bit AVX, 256-bit AVX), and there are a couple of typos that make it more difficult to understand (e.g., the "vmulpsymm1" instruction, which threw me for a bit).

In a nutshell, for each float element x in an array, the routine calculates x^3 + x^2 + x. This is not all that interesting, but what might be of interest is the use of the AVX instructions vmovups, vmulps, and vaddps, all of which have 256-bit YMM registers as operands.

Because this is a 64-bit build, the assembly code is in its own file rather than inline; the 64-bit Visual C++ compiler does not support inline assembly.

Here's my assembly routine. (It's also in the attached zip file.) My array is not aligned on a 32-byte boundary, so I'm using vmovups rather than vmovaps.

Code Select

avxTest PROC C      
	; Parameters:
	;   inputArray address in RCX.
	;   outputArray address in RDX.
	;   arr_len - 8  in R8.
	; Returns nothing.
			
loop1:
	vmovups ymm0, [rcx+r8*4] ; Load A, starting 8 floats from the end.
	vmulps ymm1, ymm0, ymm0 ; Calculate A^2.
	vmulps ymm2, ymm1, ymm0 ; Calculate A^3.
	vaddps ymm0, ymm0, ymm1 ; Calculate A + A^2.
	vaddps ymm0, ymm0, ymm2 ; Calculate A+A^2+A^3.
	vmovups [rdx+r8*4], ymm0 ; Store result.
	sub r8, 8
	jge loop1
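	vzeroupper ; Clear the upper YMM halves. (vzeroall would also zero XMM6-XMM15, which are nonvolatile in the Windows x64 ABI.)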
	RET
avxTest ENDP
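
The index handling in the loop is easy to misread, so here is a rough scalar C model of it. This is only a sketch, not the routine itself: avx_test_model and LANES are names made up for illustration, and it assumes, as the assembly does, that the array length is a positive multiple of 8. Each pass of the outer loop stands in for one trip through loop1.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* floats per 256-bit YMM register */

/* Hypothetical scalar model of avxTest (illustration only).
   Assumes arr_len is a positive multiple of LANES, as the assembly does. */
void avx_test_model(const float *in, float *out, ptrdiff_t arr_len)
{
    assert(arr_len > 0 && arr_len % LANES == 0);

    /* R8 starts at arr_len - 8; "sub r8, 8 / jge loop1" repeats while R8 >= 0,
       so the blocks are processed from the end of the array toward the start. */
    for (ptrdiff_t r8 = arr_len - LANES; r8 >= 0; r8 -= LANES) {
        for (int lane = 0; lane < LANES; ++lane) {
            float a  = in[r8 + lane];        /* vmovups ymm0, [rcx+r8*4] */
            float a2 = a * a;                /* vmulps  ymm1, ymm0, ymm0 */
            float a3 = a2 * a;               /* vmulps  ymm2, ymm1, ymm0 */
            out[r8 + lane] = (a + a2) + a3;  /* two vaddps, then the vmovups store */
        }
    }
}
```

If the length were not a multiple of 8, the last pass would start at a negative offset and read before the start of the array, which is why the model asserts on it.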

I also wrote a simple timing routine that uses the RDTSCP instruction. I don't know how accurate it is. Per the Intel Instruction Set Reference,

Quote
The RDTSCP instruction waits until all previous instructions have been executed before reading the counter.
However, subsequent instructions may begin execution before the read operation is performed.

Here's my code.

Code Select

readTimeStamp	PROC C
	;; Returns the current time, in clocks, in RAX.
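	rdtscp		; Counter in EDX:EAX. Also overwrites ECX, which is volatile in the Windows x64 ABI.
	shl rdx, 32	; Move the high half into bits 32-63.
	or rax, rdx	; Combine the two halves into RAX.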
	ret
readTimeStamp ENDP
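
For anyone more comfortable reading C, the shift-and-or in readTimeStamp amounts to the following. This is just a sketch: combine_tsc and elapsed_clocks are hypothetical helper names, and the two halves are passed in as plain arguments rather than read from the processor.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors "shl rdx, 32 / or rax, rdx": RDTSCP returns the low 32 bits of
   the counter in EAX and the high 32 bits in EDX. */
uint64_t combine_tsc(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Elapsed clocks between two readings. Assumes the second reading was
   taken after the first one. */
uint64_t elapsed_clocks(uint64_t start, uint64_t stop)
{
    assert(stop >= start);
    return stop - start;
}
```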

I'm aware of timing code on this board, but haven't had a chance to explore it and try it out.

Here's my console output:

Code Select


C:\Users\Mark\Documents\Visual Studio 2010\Projects\AVX64\amd64\Release>avx64
Value  0 is      3.000
Value  1 is     14.000
Value  2 is     39.000
Value  3 is     84.000
Value  4 is    155.000
Value  5 is    258.000
Value  6 is    399.000
Value  7 is    584.000
Value  8 is     14.000
Value  9 is     84.000
Value 10 is    258.000
Value 11 is    584.000
Value 12 is   1110.000
Value 13 is   1884.000
Value 14 is   2954.000
Value 15 is   4368.000
Value 16 is     39.000
Value 17 is    258.000
Value 18 is    819.000
Value 19 is   1884.000
Value 20 is   3615.000
Value 21 is   6174.000
Value 22 is   9723.000
Value 23 is  14424.000
Elapsed time: 121 clocks
Elapsed time: 163 clocks

C:\Users\Mark\Documents\Visual Studio 2010\Projects\AVX64\amd64\Release>

The first time shown above is for my assembly routine. The second is for a C for loop in main that does more or less the same thing. Here is the code that the (VS 10) compiler generated, obtained by disassembling the for loop. Although the compiler uses SSE instructions and XMM registers, it uses the scalar operations (movss, mulss, addss) rather than the packed ones. In other words, each instruction works on a single 32-bit float in the low lane of a 128-bit register, so the loop handles one element per iteration.

Code Select

	{
		temp1 = inputArray[i];
000000013FED3618  movsxd      rax,dword ptr [rsp+11Ch]  
000000013FED3620  movss       xmm0,dword ptr [rsp+rax*4+30h]  
000000013FED3626  movss       dword ptr [rsp+114h],xmm0  
		temp2 = temp1 * temp1;
000000013FED362F  movss       xmm0,dword ptr [rsp+114h]  
000000013FED3638  mulss       xmm0,dword ptr [rsp+114h]  
000000013FED3641  movss       dword ptr [rsp+118h],xmm0  
		outputArray[i] = temp2 * temp1 + temp2 + temp1;
000000013FED364A  movss       xmm0,dword ptr [rsp+118h]  
000000013FED3653  mulss       xmm0,dword ptr [rsp+114h]  
000000013FED365C  addss       xmm0,dword ptr [rsp+118h]  
000000013FED3665  addss       xmm0,dword ptr [rsp+114h]  
000000013FED366E  movsxd      rax,dword ptr [rsp+11Ch]  
000000013FED3676  movss       dword ptr [rsp+rax*4+0B0h],xmm0  
	}
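
As a cross-check on the console output, here is a self-contained sketch of the scalar loop. The three statements in the loop body are the source lines visible in the disassembly; the function names and the input pattern are assumptions. The inputs are inferred from the printed results (1..8, then 2, 4, ..., 16, then 3, 6, ..., 24), since the actual initialization code is only in the zip file.

```c
#include <assert.h>

#define ROWS 3
#define COLS 8

/* Loop body copied from the source lines shown in the disassembly;
   the function wrapper and its name are hypothetical. */
void poly_scalar(const float *inputArray, float *outputArray, int count)
{
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        float temp1 = inputArray[i];
        float temp2 = temp1 * temp1;
        outputArray[i] = temp2 * temp1 + temp2 + temp1;
    }
}

/* Assumed input pattern, inferred from the console output:
   1..8, then 2,4,...,16, then 3,6,...,24. */
void fill_assumed_input(float *inputArray)
{
    for (int row = 0; row < ROWS; ++row)
        for (int col = 0; col < COLS; ++col)
            inputArray[row * COLS + col] = (float)((row + 1) * (col + 1));
}
```

With that input, element 23 is 24, and 24^3 + 24^2 + 24 = 13824 + 576 + 24 = 14424, which matches the last value printed.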

The attached zip file contains the C main function, the two assembly procs, and the exe.
Mark

Gunther · February 25, 2014, 03:56:46 AM

Mark,

here is the output of your program:

Code Select


Value  0 is      3.000
Value  1 is     14.000
Value  2 is     39.000
Value  3 is     84.000
Value  4 is    155.000
Value  5 is    258.000
Value  6 is    399.000
Value  7 is    584.000
Value  8 is     14.000
Value  9 is     84.000
Value 10 is    258.000
Value 11 is    584.000
Value 12 is   1110.000
Value 13 is   1884.000
Value 14 is   2954.000
Value 15 is   4368.000
Value 16 is     39.000
Value 17 is    258.000
Value 18 is    819.000
Value 19 is   1884.000
Value 20 is   3615.000
Value 21 is   6174.000
Value 22 is   9723.000
Value 23 is  14424.000
Elapsed time: 262 clocks
Elapsed time: 332 clocks

You should check for AVX support first. I had a similar problem and solved it that way.

Gunther

Mark44 · February 26, 2014, 05:15:51 AM

I didn't check first for AVX support because I already knew that the computer supported AVX (from my CPUInfo utility). Still, you're right: before attempting to use AVX features, it's prudent to check that they're actually there.

Gunther · February 26, 2014, 07:16:51 AM

Mark,

Quote from: Mark44 on February 26, 2014, 05:15:51 AM
I didn't check first for AVX support because I already knew that the computer supported AVX (from my CPUInfo utility). Still, you're right: before attempting to use AVX features, it's prudent to check that they're actually there.

Yes, it's easy to add, and it puts you on the safe side with your application. Another solid piece of work.

Gunther

Mark44 · February 27, 2014, 08:58:29 AM

Thanks!
In any future code I submit, I'll be sure to include code that checks for the presence of whatever feature I'm using. And I'll update the code in this thread to check for CPU and OS support for AVX in the next day or two.

Mark44 · February 27, 2014, 05:18:37 PM

The attached version now checks whether AVX is supported by the processor and enabled by the OS. My isSupportedAVX is nearly identical to what appears in the Intel documentation, in the current Software Developer's Manual.

Code Select

isSupportedAVX PROC C
	; Determines whether AVX is supported by both the processor and the OS.
	; Parameters: None.
	; Returns 1 in EAX if AVX is supported, and 0 otherwise.

	mov r8, rbx ; CPUID overwrites EBX, and RBX is nonvolatile in the Windows x64 ABI, so save it.
	mov eax, 1
	cpuid
	mov rbx, r8 ; Restore RBX.
	; Check for OSXSAVE and AVX feature flags.
	and ecx, 018000000h ; Bit 27 == OSXSAVE enabled by the OS, bit 28 == AVX supported
	cmp ecx, 018000000h
	jne NotSupported

	; Processor supports AVX instructions and OS has enabled XGETBV.
	mov ecx, 0  ; Specify 0 for the XCR0 register.
	xgetbv		; Result in EDX:EAX.
	and eax, 06h
	cmp eax, 06h ; Bit 1 == XMM state, bit 2 == YMM state; the OS must enable both.
	jne NotSupported
	mov eax, 1
	jmp Done
		
NotSupported:
	mov eax, 0
	
Done:
	RET
isSupportedAVX ENDP
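
The two mask tests are the heart of the check, so here they are restated in C. This is a sketch with made-up names (avx_usable and the bit macros); the register values are passed in as arguments so the logic can be tried without executing CPUID or XGETBV.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID1_ECX_OSXSAVE (1u << 27)  /* OS has enabled XSAVE/XGETBV */
#define CPUID1_ECX_AVX     (1u << 28)  /* processor supports AVX */
#define XCR0_XMM_STATE     (1u << 1)   /* OS saves and restores XMM state */
#define XCR0_YMM_STATE     (1u << 2)   /* OS saves and restores YMM state */

/* Hypothetical helper: cpuid1_ecx is ECX from CPUID leaf 1, xcr0_low is EAX
   from XGETBV with ECX = 0. XGETBV may only be executed when the OSXSAVE bit
   is set, so a caller would pass 0 for xcr0_low otherwise.
   Returns 1 if AVX is usable, 0 if not. */
int avx_usable(uint32_t cpuid1_ecx, uint32_t xcr0_low)
{
    const uint32_t cpu_bits = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX;
    const uint32_t os_bits  = XCR0_XMM_STATE | XCR0_YMM_STATE;

    /* Same constants as the assembly: 018000000h and 06h. */
    assert(cpu_bits == 0x18000000u && os_bits == 0x06u);

    if ((cpuid1_ecx & cpu_bits) != cpu_bits)
        return 0;  /* jne NotSupported after the CPUID test */
    if ((xcr0_low & os_bits) != os_bits)
        return 0;  /* jne NotSupported after the XGETBV test */
    return 1;
}
```

Both tests compare against the full mask rather than testing for nonzero, because every bit in each mask has to be set.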

Gunther · February 28, 2014, 12:10:02 AM

Thank you, Mark. Works like a charm.

Gunther

The MASM Forum
