VMOVQ vs VMOVAPS

johnsa · May 24, 2018, 01:35:27 AM

Hey all,

Ran into an interesting optimisation issue today which I can't figure out.

I have the following:



STATICMETHOD Math, Random, <real4>, <>, v0:DWORD
	_STATICREF r9, Math

	vmovaps xmm0,(Math PTR [r9])._state0                  ;<------ THIS VMOVAPS is 2x SLOWER than the below VMOVQ equivalent...
	;vmovq xmm0, (Math PTR [r9])._state0	; xmm0 = s[0]
	;vmovq xmm1, (Math PTR [r9])._state1	; xmm1 = s[1]

	vpcmpeqw xmm2, xmm2, xmm2
	vpslld xmm2, xmm2, 25
	vpsrld xmm2, xmm2, 2					; xmm2 = 1.0

	vmovdqa xmm5, xmm0					; xmm5 = xmm0 = s[0]
	vpshufd xmm1, xmm0, 01001110b			; xmm1 = s[1]  ---- REMOVE FOR VMOVQ option

	vpaddq xmm4, xmm0, xmm1				; xmm4 = (result) = s[1] + s[0]
	vpsrlq xmm4, xmm4, 12					; xmm4 = (result >> 12)
	mov eax, 0x3ff
	shl rax, 52
	vmovq xmm3, rax
	vpor xmm4, xmm4, xmm3					; xmm4 = (result >> 12) | (0x3ff << 52)
	vcvtsd2ss xmm4, xmm4, xmm4				; xmm4 = (float)result

	vpxor xmm1, xmm1, xmm0				; xmm1 = s[1] = s[1] ^ s[0]
	vmovdqa xmm3, xmm1					; xmm3 = xmm1 = s[1]
	vpsllq xmm3, xmm3, 14					; xmm3 = s[1] << 14
	
	vpsrlq xmm5, xmm5, 9					; xmm5 = s[0] >> 9
	vpsllq xmm0, xmm0, 55					; xmm0 = s[0] << 55
	vpor xmm0, xmm0, xmm5					; xmm0 = (s[0] << 55) | (s[0] >> 9)
	vpxor xmm0, xmm0, xmm1				; xmm0 = ((s[0] << 55) | (s[0] >> 9)) ^ s[1]
	vpxor xmm0, xmm0, xmm3				; xmm0 = ((s[0] << 55) | (s[0] >> 9)) ^ s[1] ^ (s[1] << 14)
	vmovq (Math PTR [r9])._state0, xmm0

	vpsrlq xmm1, xmm1, 28					; xmm1 = s[1] >> 28
	vpsllq xmm3, xmm3, 22					; xmm3 = s[1] << 36
	vpor xmm1, xmm1, xmm3					; xmm1 = (s[1] << 36) | (s[1] >> 28)
	vmovq (Math PTR [r9])._state1, xmm1

	vmovdqa xmm0, xmm4					; xmm0 = (result)
	vsubss xmm0, xmm0, xmm2				; xmm0 = (float)result - 1.0

	ret
ENDMETHOD

I cannot think of any reason why a vmovaps from a 16-byte-aligned variable would perform any differently from a pair of vmovq loads; if anything, I'd expect it to be faster.
I extracted the instructions and tested them independently in a loop (accounting for the LEA that _STATICREF generates), and that test shows exactly the opposite: there the single vmovaps is the faster option.



	;D845 microseconds for 2xvmovq
	;9573 microseconds for 1xvmovaps

	timer_begin 1, HIGH_PRIORITY_CLASS	
	lea rdi,testvar
	mov r12,100000000
	@@:
	lea rdi,testvar
	vmovaps xmm0,[rdi]
	;vmovq xmm0,[rdi]
	;vmovq xmm1,[rdi+8]
	dec r12
	jnz short @B
	timer_end

D845 microseconds for 2xvmovq
9573 microseconds for 1xvmovaps

I'm on an AMD Threadripper 1950X. Can anyone think of any other reason why the vmovaps would stall in this case?
FYI, the above code is a first go at implementing the XoroShiro128+ PRNG. It relates to the other thread about random numbers; in my testing so far it seems to be pretty much the best performance-wise, and statistically very solid.
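
For anyone who doesn't read AVX assembly fluently, here is a rough C model of what the routine above computes. It is only a sketch read off the listing: the names xoro_state, rotl64 and xoro_next_float are made up for illustration and stand in for the Math struct's _state0/_state1 fields and the Random method. The shift constants (55, 14, 36) and the (result >> 12) | (0x3ff << 52) conversion are taken from the assembly comments.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the _state0/_state1 fields of the Math struct. */
typedef struct {
    uint64_t s0;
    uint64_t s1;
} xoro_state;

/* Rotate left by k bits (0 < k < 64); the asm builds this from a
   vpsllq/vpsrlq pair joined with vpor. */
static uint64_t rotl64(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

/* One generator step, mirroring the listing. Returns a float in [0, 1]. */
static float xoro_next_float(xoro_state *st) {
    /* An all-zero state would stay all-zero forever. */
    assert((st->s0 | st->s1) != 0);

    uint64_t s0 = st->s0;
    uint64_t s1 = st->s1;
    uint64_t result = s0 + s1;                 /* vpaddq */

    /* Top 52 bits become the mantissa of a double in [1.0, 2.0). */
    uint64_t bits = (result >> 12) | ((uint64_t)0x3ff << 52);
    double d;
    memcpy(&d, &bits, sizeof d);               /* reinterpret, no conversion */

    /* State update: the part that ends in the two vmovq stores. */
    s1 ^= s0;
    st->s0 = rotl64(s0, 55) ^ s1 ^ (s1 << 14);
    st->s1 = rotl64(s1, 36);

    /* vcvtsd2ss then vsubss 1.0; rounding to float can yield exactly 1.0. */
    return (float)d - 1.0f;
}
```

Seeding the state with s0 = 1, s1 = 2 and calling xoro_next_float once gives a handy reference point for checking the assembly against: the sum is 3, so the returned value is 0.0, and the state words afterwards follow directly from the update rule.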

aw27 · May 24, 2018, 04:15:06 AM

Looks like everybody is supposed to know what STATICMETHOD, _STATICREF and other UASM-specific keywords mean. BTW, they are not even explained in the UASM literature.
Why don't you make a simple full buildable example, with full source code?

johnsa · May 24, 2018, 09:02:44 AM

STATICMETHOD is "effectively" a normal PROC,
_STATICREF is just a lea reg,STRUCTNAME
(They are all covered in the UASM manual under the OO stuff, but I agree it could be fleshed out in more detail :) )

and _state0/_state1 are just fields in the struct.

I don't experience the same behaviour on an Intel chip, and I didn't think anyone here was on a Zen core, which is why I posted it just as a guide, wondering if anyone could think of any reason for a possible stall on the vmovaps, possibly caused further down.

johnsa · May 25, 2018, 04:28:15 AM

Found the problem. FYI, it's not so much the vmovaps load itself that was causing the issue, but rather the two vmovq stores to the same memory location further down the routine. These must have caused some sort of partial stall, with the 16-byte vmovaps read (on the next call) waiting for both 8-byte writes to complete. Now that the two stores are replaced with a single vmovaps store as well, the write and read complete as expected and the performance is where it should be.
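
To make the two store patterns concrete, here is a small C sketch using SSE2 intrinsics. It is an illustration only, not the actual UASM change: state128, store_split, store_whole and load_whole are hypothetical names, and a compiler is free to emit different instructions than the ones named in the comments. As general background, many x86 cores can only forward a pending store to a later load when the load's bytes come from a single store, so a 16-byte load that spans two 8-byte stores typically has to wait for both to retire, which is consistent with the stall described above.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical 16-byte-aligned state block (s[0] = _state0, s[1] = _state1). */
typedef struct {
    _Alignas(16) uint64_t s[2];
} state128;

/* Original pattern: two separate 8-byte stores (like the two vmovq stores). */
static void store_split(state128 *st, __m128i s0, __m128i s1) {
    _mm_storel_epi64((__m128i *)&st->s[0], s0);
    _mm_storel_epi64((__m128i *)&st->s[1], s1);
}

/* Fixed pattern: merge the two low quadwords, then one 16-byte store. */
static void store_whole(state128 *st, __m128i s0, __m128i s1) {
    __m128i both = _mm_unpacklo_epi64(s0, s1);   /* low = s0, high = s1 */
    _mm_store_si128((__m128i *)st->s, both);
}

/* The 16-byte aligned load at the top of the routine (like vmovaps). */
static __m128i load_whole(const state128 *st) {
    assert(((uintptr_t)st->s & 15) == 0);        /* aligned load requirement */
    return _mm_load_si128((const __m128i *)st->s);
}
```

Both store functions leave identical bytes in memory; the only difference is whether the following load_whole reads data written by one store or by two.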

The MASM Forum