Hey all,
Ran into an interesting optimisation issue today which I can't figure out.
I have the following:
STATICMETHOD Math, Random, <real4>, <>, v0:DWORD
_STATICREF r9, Math
vmovaps xmm0,(Math PTR [r9])._state0 ;<------ THIS VMOVAPS is 2x SLOWER than the below VMOVQ equivalent...
;vmovq xmm0, (Math PTR [r9])._state0 ; xmm0 = s[0]
;vmovq xmm1, (Math PTR [r9])._state1 ; xmm1 = s[1]
vpcmpeqw xmm2, xmm2, xmm2
vpslld xmm2, xmm2, 25
vpsrld xmm2, xmm2, 2 ; xmm2 = 1.0f in each dword (0x3F800000)
vmovdqa xmm5, xmm0 ; xmm5 = xmm0 = s[0]
vpshufd xmm1, xmm0, 01001110b ; xmm1 = s[1] ---- REMOVE FOR VMOVQ option
vpaddq xmm4, xmm0, xmm1 ; xmm4 = (result) = s[1] + s[0]
vpsrlq xmm4, xmm4, 12 ; xmm4 = (result >> 12)
mov eax, 0x3ff
shl rax, 52
vmovq xmm3, rax
vpor xmm4, xmm4, xmm3 ; xmm4 = (result >> 12) | (0x3ff << 52)
vcvtsd2ss xmm4, xmm4, xmm4 ; xmm4 = (float)result
vpxor xmm1, xmm1, xmm0 ; xmm1 = s[1] = s[1] ^ s[0]
vmovdqa xmm3, xmm1 ; xmm3 = xmm1 = s[1]
vpsllq xmm3, xmm3, 14 ; xmm3 = s[1] << 14
vpsrlq xmm5, xmm5, 9 ; xmm5 = s[0] >> 9
vpsllq xmm0, xmm0, 55 ; xmm0 = s[0] << 55
vpor xmm0, xmm0, xmm5 ; xmm0 = (s[0] << 55) | (s[0] >> 9)
vpxor xmm0, xmm0, xmm1 ; xmm0 = ((s[0] << 55) | (s[0] >> 9)) ^ s[1]
vpxor xmm0, xmm0, xmm3 ; xmm0 = ((s[0] << 55) | (s[0] >> 9)) ^ s[1] ^ (s[1] << 14)
vmovq (Math PTR [r9])._state0, xmm0
vpsrlq xmm1, xmm1, 28 ; xmm1 = s[1] >> 28
vpsllq xmm3, xmm3, 22 ; xmm3 = s[1] << 36 (already shifted by 14, so 14 + 22)
vpor xmm1, xmm1, xmm3 ; xmm1 = (s[1] << 36) | (s[1] >> 28)
vmovq (Math PTR [r9])._state1, xmm1
vmovdqa xmm0, xmm4 ; xmm0 = (result)
vsubss xmm0, xmm0, xmm2 ; xmm0 = (float)result - 1.0
ret
ENDMETHOD
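For anyone reading along without the macro library, here is a rough scalar C sketch of what the listing above appears to compute per call. The struct and function names (xs128_state, xs128_next, xs128_next_float, one_f32_bits) are made up for illustration and are not part of the original code; the constants 55, 14 and 36 are taken from the shifts and comments in the listing. It says nothing about the timing question, it only pins down the arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the Math object's _state0/_state1 fields. */
typedef struct {
    uint64_t state0;
    uint64_t state1;
} xs128_state;

/* 64-bit rotate left; the listing builds this from a shift pair plus vpor. */
static inline uint64_t rotl64(uint64_t x, int k) {
    assert(k > 0 && k < 64);
    return (x << k) | (x >> (64 - k));
}

/* One generator step using the constants seen in the listing (55, 14, 36). */
static inline uint64_t xs128_next(xs128_state *st) {
    uint64_t s0 = st->state0;
    uint64_t s1 = st->state1;
    uint64_t result = s0 + s1;                      /* vpaddq */

    s1 ^= s0;                                       /* vpxor xmm1, xmm1, xmm0 */
    st->state0 = rotl64(s0, 55) ^ s1 ^ (s1 << 14);  /* stored to _state0 */
    st->state1 = rotl64(s1, 36);                    /* stored to _state1 */
    return result;
}

/* Same float mapping as the listing: keep the top 52 bits as a mantissa,
   OR in the exponent of 1.0 to get a double in [1.0, 2.0), narrow to
   float (vcvtsd2ss) and subtract 1.0f (vsubss). */
static inline float xs128_next_float(xs128_state *st) {
    uint64_t bits = (xs128_next(st) >> 12) | ((uint64_t)0x3ff << 52);
    double d;
    memcpy(&d, &bits, sizeof d);  /* reinterpret the bit pattern as a double */
    return (float)d - 1.0f;
}

/* What vpcmpeqw / vpslld 25 / vpsrld 2 leaves in each 32-bit lane. */
static inline uint32_t one_f32_bits(void) {
    return (0xFFFFFFFFu << 25) >> 2;  /* bit pattern of 1.0f */
}
```

Starting from state0 = 1 and state1 = 2, the first call to xs128_next returns 3; the same seed fed to xs128_next_float gives 0.0f, because 3 >> 12 is zero and only the exponent bits of 1.0 remain.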
I cannot think of any reason why a vmovaps from a 16-byte aligned variable would perform any differently from a pair of vmovq's; if anything I'd expect it to be faster.
I extracted the instructions and tested them independently in a loop, which gives exactly the opposite result, even with the LEA that _STATICREF generates included in the loop.
;D845 microseconds for 2xvmovq
;9573 microseconds for 1xvmovaps
timer_begin 1, HIGH_PRIORITY_CLASS
lea rdi,testvar
mov r12,100000000
@@:
lea rdi,testvar
vmovaps xmm0,[rdi]
;vmovq xmm0,[rdi]
;vmovq xmm1,[rdi+8]
dec r12
jnz short @B
timer_end
D845 microseconds for 2xvmovq
9573 microseconds for 1xvmovaps
I'm on an AMD Threadripper 1950X. Can anyone think of any other reason why the vmovaps would stall in this case?
FYI, the above code is a first go at implementing the XoroShiro128+ PRNG (in relation to the other thread about random numbers). In my testing so far it seems to be pretty much the best performance-wise and statistically very solid.