The MASM Forum

General => The Laboratory => Topic started by: johnsa on May 24, 2018, 01:35:27 AM

Post by: johnsa on May 24, 2018, 01:35:27 AM
Hey all,

Ran into an interesting optimisation issue today which I can't figure out.

I have the following:

Code: [Select]

STATICMETHOD Math, Random, <real4>, <>, v0:DWORD

vmovaps xmm0,(Math PTR [r9])._state0                  ;<------ THIS VMOVAPS is 2x SLOWER than the below VMOVQ equivalent...
;vmovq xmm0, (Math PTR [r9])._state0 ; xmm0 = s[0]
;vmovq xmm1, (Math PTR [r9])._state1 ; xmm1 = s[1]

vpcmpeqw xmm2, xmm2, xmm2
vpslld xmm2, xmm2, 25
vpsrld xmm2, xmm2, 2 ; xmm2 = 1.0

vmovdqa xmm5, xmm0 ; xmm5 = xmm0 = s[0]
vpshufd xmm1, xmm0, 01001110b ; xmm1 = s[1]  ---- REMOVE FOR VMOVQ option

vpaddq xmm4, xmm0, xmm1 ; xmm4 = (result) = s[1] + s[0]
vpsrlq xmm4, xmm4, 12 ; xmm4 = (result >> 12)
mov eax, 0x3ff
shl rax, 52
vmovq xmm3, rax
vpor xmm4, xmm4, xmm3 ; xmm4 = (result >> 12) | (0x3ff << 52)
vcvtsd2ss xmm4, xmm4, xmm4 ; xmm4 = (float)result

vpxor xmm1, xmm1, xmm0 ; xmm1 = s[1] = s[1] ^ s[0]
vmovdqa xmm3, xmm1 ; xmm3 = xmm1 = s[1]
vpsllq xmm3, xmm3, 14 ; xmm3 = s[1] << 14

vpsrlq xmm5, xmm5, 9 ; xmm5 = s[0] >> 9
vpsllq xmm0, xmm0, 55 ; xmm0 = s[0] << 55
vpor xmm0, xmm0, xmm5 ; xmm0 = (s[0] << 55) | (s[0] >> 9)
vpxor xmm0, xmm0, xmm1 ; xmm0 = ((s[0] << 55) | (s[0] >> 9)) ^ s[1]
vpxor xmm0, xmm0, xmm3 ; xmm0 = ((s[0] << 55) | (s[0] >> 9)) ^ s[1] ^ (s[1] << 14)
vmovq (Math PTR [r9])._state0, xmm0

vpsrlq xmm1, xmm1, 28 ; xmm1 = s[1] >> 28
vpsllq xmm3, xmm3, 22 ; xmm3 = s[1] << 36
vpor xmm1, xmm1, xmm3 ; xmm1 = (s[1] << 36) | (s[1] >> 28)
vmovq (Math PTR [r9])._state1, xmm1

vmovdqa xmm0, xmm4 ; xmm0 = (result)
vsubss xmm0, xmm0, xmm2 ; xmm0 = (float)result - 1.0


I cannot think of any reason why a vmovaps of a 16-byte aligned variable should perform any differently to a pair of vmovq's; if anything I'd expect it to be faster.
I extracted the instructions and tested them independently in a loop (accounting for the LEA that STATIC_REF generates), and that produced exactly the opposite result.

Code: [Select]

;D845 microseconds for 2xvmovq
;9573 microseconds for 1xvmovaps

timer_begin 1, HIGH_PRIORITY_CLASS
lea rdi,testvar
mov r12,100000000
@@:
lea rdi,testvar ; LEA inside the loop to account for the one STATIC_REF generates
vmovaps xmm0,[rdi]
;vmovq xmm0,[rdi]
;vmovq xmm1,[rdi+8]
dec r12
jnz short @B


I'm on an AMD Threadripper 1950X.. Can anyone think of any other reason why the vmovaps would stall in this case?
FYI, the above code is a first go at implementing the XoroShiro128+ PRNG (in relation to the other thread about random numbers; in my testing so far it seems to be pretty much the best performance-wise, and statistically very solid).
Post by: AW on May 24, 2018, 04:15:06 AM
Looks like everybody is supposed to know what STATICMETHOD, _STATICREF and other UASM-specific keywords mean. BTW, they are not even explained in the UASM literature.
Why don't you make a simple full buildable example, with full source code?
Post by: johnsa on May 24, 2018, 09:02:44 AM
STATICMETHOD is "effectively" a normal PROC, and
_staticref is just a lea reg,STRUCTNAME.
(They are all covered in the UASM manual under the OO stuff, but I agree it could be fleshed out in more detail :) )

and _state0/_state1 are just fields in the struct.

I don't see the same behaviour on an Intel chip, and I didn't think anyone here was on a Zen core, which is why I left it just as a guide, wondering if anyone could think of a reason for a possible stall on the vmovaps, possibly caused further down.
Post by: johnsa on May 25, 2018, 04:28:15 AM
Found the problem. FYI, it's not the vmovaps read itself that was causing the issue, but rather the subsequent two vmovq writes to the same memory location. This looks like a store-to-load forwarding failure: the 16-byte vmovaps load overlaps two separate 8-byte stores, so it cannot be forwarded from the store buffer and has to wait for both writes to commit. Now that the store is a single vmovaps as well, the write and read widths match, and the performance is where it should be.