SAR/SAL: 32-bit shift
ROR/ROL: 32-bit rotate
SHLD/SHRD: 64-bit shift
Afterwards use CMP followed by Jcc (JC/JNC), or SETcc (SETC/SETNC), or ADC
Left/right shift of the 64 bits in MMX or XMM registers?
Left/right shift of all 128 bits in XMM registers, with byte resolution instead of the bit resolution above?
Followed by PAND with a single bit and MOVD gpreg, xmmreg
Which is fastest?
Which is best?
PSLLDQ -- Packed Shift Left Logical Double Quadword (bytes)
PSLLW/PSLLD/PSLLQ -- Packed Shift Left Logical (bits)
PSRAW/PSRAD -- Packed Shift Right Arithmetic (bits)
PSRLDQ -- Packed Shift Right Logical Double Quadword (bytes)
PSRLW/PSRLD/PSRLQ -- Packed Shift Right Logical (bits)
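To make the XMM variant above a bit more concrete, here is a minimal sketch in ml64 syntax. It steps one byte through a 128-bit bitmap row with PSRLDQ and pulls the lowest bit into a general-purpose register. The procedure name NextRowBit, the Windows x64 register convention and the 16-byte row layout are assumptions for illustration only. The final AND on EAX stands in for the PAND step, so no mask constant is needed in memory. Keep in mind that PSLLQ/PSRLQ shift each 64-bit half separately, while PSLLDQ/PSRLDQ shift the whole register, but only in byte steps.

```asm
; Hypothetical helper, Windows x64 calling convention (assumption):
;   rcx -> 16 bytes of bitmap data (need not be aligned)
;   returns eax = bit 0 of byte 1, i.e. bit 8 of the row
.code
NextRowBit proc
    movdqu xmm0, xmmword ptr [rcx] ; load all 128 bits
    psrldq xmm0, 1                 ; shift the whole register right by 1 byte
    movd   eax, xmm0               ; low 32 bits into a general-purpose register
    and    eax, 1                  ; keep only the lowest bit
    ret
NextRowBit endp
end
```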
The goal is to find the fastest way to read a bitmap and draw isometric tiles that are taller where a bit is set.
I also remember that the Bresenham line algorithm with conditional jumps worked great on old CPUs, before branch misprediction became an issue. I wonder whether CMP followed by SETcc or ADC would be faster on the newest CPUs than a Jcc that pays the branch misprediction penalty each time the "error" threshold is reached?
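For reference, one jump-free form of the Bresenham error step could look like the sketch below. It follows the common integer variant for 0 <= dy <= dx: start with D = 2*dy - dx, and for every x step do "if D > 0 then y++, D -= 2*dx", then "D += 2*dy". The procedure name BresStepNoJump and the register assignments (Windows x64 convention) are assumptions, and whether it actually beats a well-predicted Jcc would have to be measured.

```asm
; Hypothetical helper, Windows x64 calling convention (assumption):
;   rcx -> signed 32-bit decision variable D (updated in place)
;   edx  = 2*dy
;   r8d  = 2*dx
;   returns eax = 1 if y must advance for this x step, else 0
.code
BresStepNoJump proc
    mov   r9d, dword ptr [rcx] ; r9d = D
    xor   eax, eax             ; clear the result before the flags are set
    test  r9d, r9d             ; compare D with 0
    setg  al                   ; eax = 1 if D > 0
    mov   r10d, eax
    neg   r10d                 ; r10d = -1 when stepping, else 0
    and   r10d, r8d            ; r10d = 2*dx when stepping, else 0
    sub   r9d, r10d            ; D -= 2*dx only when y advances
    add   r9d, edx             ; D += 2*dy on every step
    mov   dword ptr [rcx], r9d ; store the updated D
    ret
BresStepNoJump endp
end
```

The caller would simply add the returned EAX to its y coordinate after each x step.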
Magnus,
Since we are in The Lab anyway, I made up a little test comparing a conditional jump against the sete instruction. The result is somewhat surprising - the jne is indeed slower, but only in the first test:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
40 mega iterations, 4096 instructions
137 megacycles for jne eax zero
33 megacycles for sete eax zero
33 megacycles for jne eax nonzero
32 megacycles for sete eax nonzero
134 megacycles for jne eax zero
32 megacycles for sete eax zero
32 megacycles for jne eax nonzero
32 megacycles for sete eax nonzero
133 megacycles for jne eax zero
34 megacycles for sete eax zero
32 megacycles for jne eax nonzero
33 megacycles for sete eax nonzero
115 megacycles for jne eax zero
32 megacycles for sete eax zero
32 megacycles for jne eax nonzero
32 megacycles for sete eax nonzero
101 megacycles for jne eax zero
33 megacycles for sete eax zero
32 megacycles for jne eax nonzero
32 megacycles for sete eax nonzero
110 megacycles for jne eax zero
32 megacycles for sete eax zero
32 megacycles for jne eax nonzero
34 megacycles for sete eax nonzero
112 megacycles for jne eax zero
34 megacycles for sete eax zero
33 megacycles for jne eax nonzero
37 megacycles for sete eax nonzero
Here is the essence of the test (full Masm64 SDK code attached):
repct=0
REPEAT 7
cmpzero=0 ; 0=set the zero flag
tCycles
xor r15, r15
align 4
lbl CATSTR <test>, %repct
lbl: mov eax, 123
REPEAT instructions
xor edx, edx
cmp eax, 123+cmpzero
jne @F
mov dl, 1
@@:
ENDM
inc r15
cmp r15, tests
jnz lbl
tCycles jne eax zero
tCycles
xor r15, r15
align 4
lbl CATSTR <test>, %repct+1
lbl: mov eax, 123
REPEAT instructions
cmp eax, 123+cmpzero
sete dl
ENDM
inc r15
cmp r15, tests
jnz lbl
tCycles sete eax zero ; end of test, print "xx megacycles for sete eax zero"
cmpzero=1 ; continue tests, but clear the zero flag
tCycles
xor r15, r15
align 4
lbl CATSTR <test>, %repct+2
lbl: mov eax, 123
REPEAT instructions
xor edx, edx
cmp eax, 123+cmpzero
jne @F
mov dl, 1
@@:
ENDM
inc r15
cmp r15, tests
jnz lbl
tCycles jne eax nonzero
tCycles
xor r15, r15
align 4
lbl CATSTR <test>, %repct+4
lbl: mov eax, 123
REPEAT instructions
cmp eax, 123+cmpzero+3
sete dl
ENDM
inc r15
cmp r15, tests
jnz lbl
tCycles sete eax nonzero ; end of test, print "xx megacycles for sete eax nonzero"
invoke __imp__cprintf, cfm$("\n")
repct=repct+5
ENDM
So it seems that jxx is slow when the branch is not taken :rolleyes:
Try this, you are in for a surprise:
REPEAT instructions
xor edx, edx
cmp eax, 123+cmpzero
jmp @F
mov dl, 1
@@:
ENDM
Thanks, jochen.
It seems the CPU learns to predict better from run to run.
Jcc has probably been improved by Intel, while SETcc is kept for compatibility.
A backward jump in a loop is predicted as most likely taken, so the CPU pays for a slow reload when that jump is not taken.
My interest in learning jump-free code comes from my interest in SIMD: after several MULPS, ADDPS and SUBPS, four COMISS/Jcc pairs seem slow.
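One common way to avoid four separate COMISS/Jcc pairs is to compare all four lanes at once with CMPPS and collect the results with MOVMSKPS, as in the sketch below. The procedure name LessMask4 and the Windows x64 register convention are assumptions for illustration. Note that CMPPS with the less-than predicate reports false for lanes containing a NaN.

```asm
; Hypothetical helper, Windows x64 calling convention (assumption):
;   rcx -> 4 floats a, rdx -> 4 floats b (need not be aligned)
;   returns eax = 4-bit mask, bit i set where a[i] < b[i]
.code
LessMask4 proc
    movups   xmm0, xmmword ptr [rcx] ; load a[0..3]
    movups   xmm1, xmmword ptr [rdx] ; load b[0..3]
    cmpps    xmm0, xmm1, 1           ; predicate 1 = less-than; each lane becomes all ones or all zeros
    movmskps eax, xmm0               ; gather the 4 lane sign bits into eax bits 0..3
    ret
LessMask4 endp
end
```

The mask can then be tested once, or the all-ones/all-zeros lanes in XMM0 can be used directly with ANDPS/ANDNPS to select values without any jump.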
I have this code somewhere:
ror ebx,1 ; check tilemap
rol eax,4 ; use tile height 16 if carry is set
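A minimal sketch of how the carry from that rotate could be turned into a height of 16 without a jump is shown below. It is not the original code: the procedure name TileHeight and the Windows x64 register convention are assumptions. ROR by 1 leaves the bit that was rotated out in CF, SETC turns it into 0 or 1, and the shift by 4 multiplies that by 16.

```asm
; Hypothetical helper, Windows x64 calling convention (assumption):
;   ecx = 32 tilemap bits
;   returns eax = 16 if bit 0 of the tilemap is set, else 0
.code
TileHeight proc
    xor  eax, eax ; clear the result first: XOR changes the flags, so it must come before the rotate
    ror  ecx, 1   ; bit 0 of the tilemap is rotated out into CF
    setc al       ; eax = 1 if that bit was set, else 0
    shl  eax, 4   ; 1 * 16 = tile height 16, or 0
    ret
TileHeight endp
end
```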
It seems they fixed and improved branch prediction in this later-generation CPU.
Has anyone tested it on a new AMD?
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
40 mega iterations, 4096 instructions
41 megacycles for jne eax zero
36 megacycles for sete eax zero
32 megacycles for jne eax nonzero
36 megacycles for sete eax nonzero
41 megacycles for jne eax zero
36 megacycles for sete eax zero
39 megacycles for jne eax nonzero
40 megacycles for sete eax nonzero
41 megacycles for jne eax zero
35 megacycles for sete eax zero
31 megacycles for jne eax nonzero
35 megacycles for sete eax nonzero
40 megacycles for jne eax zero
35 megacycles for sete eax zero
32 megacycles for jne eax nonzero
36 megacycles for sete eax nonzero
41 megacycles for jne eax zero
36 megacycles for sete eax zero
31 megacycles for jne eax nonzero
35 megacycles for sete eax nonzero
43 megacycles for jne eax zero
36 megacycles for sete eax zero
31 megacycles for jne eax nonzero
36 megacycles for sete eax nonzero
42 megacycles for jne eax zero
36 megacycles for sete eax zero
30 megacycles for jne eax nonzero
35 megacycles for sete eax nonzero