Hey all,

Ran into an interesting optimisation issue today that I can't figure out.

I have the following:

STATICMETHOD Math, Random, <real4>, <>, v0:DWORD

_STATICREF r9, Math

vmovaps xmm0,(Math PTR [r9])._state0 ;<------ THIS VMOVAPS is 2x SLOWER than the below VMOVQ equivalent...

;vmovq xmm0, (Math PTR [r9])._state0 ; xmm0 = s[0]

;vmovq xmm1, (Math PTR [r9])._state1 ; xmm1 = s[1]

vpcmpeqw xmm2, xmm2, xmm2

vpslld xmm2, xmm2, 25

vpsrld xmm2, xmm2, 2 ; xmm2 = 1.0

vmovdqa xmm5, xmm0 ; xmm5 = xmm0 = s[0]

vpshufd xmm1, xmm0, 01001110b ; xmm1 = s[1] ---- REMOVE FOR VMOVQ option

vpaddq xmm4, xmm0, xmm1 ; xmm4 = (result) = s[1] + s[0]

vpsrlq xmm4, xmm4, 12 ; xmm4 = (result >> 12)

mov eax, 0x3ff

shl rax, 52

vmovq xmm3, rax

vpor xmm4, xmm4, xmm3 ; xmm4 = (result >> 12) | (0x3ff << 52)

vcvtsd2ss xmm4, xmm4, xmm4 ; xmm4 = (float)result

vpxor xmm1, xmm1, xmm0 ; xmm1 = s[1] = s[1] ^ s[0]

vmovdqa xmm3, xmm1 ; xmm3 = xmm1 = s[1]

vpsllq xmm3, xmm3, 14 ; xmm3 = s[1] << 14

vpsrlq xmm5, xmm5, 9 ; xmm5 = s[0] >> 9

vpsllq xmm0, xmm0, 55 ; xmm0 = s[0] << 55

vpor xmm0, xmm0, xmm5 ; xmm0 = (s[0] << 55) | (s[0] >> 9)

vpxor xmm0, xmm0, xmm1 ; xmm0 = ((s[0] << 55) | (s[0] >> 9)) ^ s[1]

vpxor xmm0, xmm0, xmm3 ; xmm0 = ((s[0] << 55) | (s[0] >> 9)) ^ s[1] ^ (s[1] << 14)

vmovq (Math PTR [r9])._state0, xmm0

vpsrlq xmm1, xmm1, 28 ; xmm1 = s[1] >> 28

vpsllq xmm3, xmm3, 22 ; xmm3 = s[1] << 36 (already shifted by 14, so 14 + 22 = 36)

vpor xmm1, xmm1, xmm3 ; xmm1 = (s[1] << 36) | (s[1] >> 28)

vmovq (Math PTR [r9])._state1, xmm1

vmovdqa xmm0, xmm4 ; xmm0 = (result)

vsubss xmm0, xmm0, xmm2 ; xmm0 = (float)result - 1.0

ret

ENDMETHOD

I cannot think of any reason why a vmovaps from a 16-byte aligned variable would perform any differently from a pair of vmovq's; if anything, I'd expect it to be faster.

I extracted the instructions and tested them independently in a loop (including the LEA that _STATICREF generates), and that gives exactly the opposite result.

;D845 microseconds for 2xvmovq

;9573 microseconds for 1xvmovaps

timer_begin 1, HIGH_PRIORITY_CLASS

lea rdi,testvar

mov r12,100000000

@@:

lea rdi,testvar

vmovaps xmm0,[rdi]

;vmovq xmm0,[rdi]

;vmovq xmm1,[rdi+8]

dec r12

jnz short @B

timer_end

D845 microseconds for 2xvmovq

9573 microseconds for 1xvmovaps

I'm on an AMD Threadripper 1950X. Can anyone think of a reason why the vmovaps would stall in this case?

FYI, the above code is a first go at implementing the XoroShiro128+ PRNG (in relation to the other thread about random numbers; in my testing so far it seems to be pretty much the best performance-wise and statistically very solid).
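For anyone who wants to check the assembly against something readable, here is a rough scalar C sketch of what the routine is meant to compute. It assumes the 55/14/36 constants that appear in the shifts above and reuses the same "OR in 0x3ff << 52, convert, subtract 1.0" float trick. The names xoro_state, rotl64, xoro_next and xoro_next_float are made up for illustration and are not part of the Math class. Treat it as a cross-check, not a definitive implementation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical state layout standing in for Math._state0 / Math._state1. */
typedef struct {
    uint64_t s0;
    uint64_t s1;
} xoro_state;

/* Rotate left; k must be in 1..63. */
static inline uint64_t rotl64(uint64_t x, int k)
{
    return (x << k) | (x >> (64 - k));
}

/* One xoroshiro128+ step using the 55/14/36 constants seen in the asm. */
static uint64_t xoro_next(xoro_state *st)
{
    uint64_t s0 = st->s0;
    uint64_t s1 = st->s1;
    uint64_t result = s0 + s1;                  /* vpaddq */

    s1 ^= s0;                                   /* vpxor xmm1, xmm1, xmm0 */
    st->s0 = rotl64(s0, 55) ^ s1 ^ (s1 << 14);  /* stored to _state0 */
    st->s1 = rotl64(s1, 36);                    /* stored to _state1 */
    return result;
}

/* Same conversion as the asm: build a double in [1.0, 2.0) from the top
 * 52 bits, narrow it to float, then subtract 1.0f.
 * Assumption: default round-to-nearest, under which a double just below
 * 2.0 can round up to 2.0f, so the result can be exactly 1.0f. */
static float xoro_next_float(xoro_state *st)
{
    uint64_t bits = (xoro_next(st) >> 12) | ((uint64_t)0x3ff << 52);
    double d;

    memcpy(&d, &bits, sizeof d);                /* reinterpret the bit pattern */
    return (float)d - 1.0f;                     /* vcvtsd2ss + vsubss */
}
```

Seeding the state with s0 = 1, s1 = 2 and stepping once gives a small, hand-checkable case to compare against the register contents in a debugger.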