different shift/rotate instructions?

daydreamer · September 12, 2023, 11:30:53 PM

SAR/SAL: 32-bit shift
ROR/ROL: 32-bit rotate
SHLD/SHRD: 64-bit shift
afterwards use cmp followed by Jcc (e.g. jc), or setcc (e.g. setnc), or adc
left/right shift of 64 bits in MMX or XMM regs?
left/right shift of 128 bits in XMM regs, with byte resolution instead of the previous bit resolution?
followed by PAND with one bit and MOVD gpreg, xmmreg

fastest?
best?

jj2007 · September 13, 2023, 04:42:13 AM

PSLLDQ: Packed Shift Left Logical Double Quadword (bytes)
PSLLW/PSLLD/PSLLQ: Packed Shift Left Logical (bits)
PSRAW/PSRAD: Packed Shift Right Arithmetic (bits)
PSRLDQ: Packed Shift Right Logical Double Quadword (bytes)
PSRLW/PSRLD/PSRLQ: Packed Shift Right Logical (bits)
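
The (bytes)/(bits) split matters in practice: PSLLDQ/PSRLDQ move the whole 128-bit register in byte steps and take only an immediate count, while the W/D/Q forms shift each lane by a bit count and never carry bits across lanes. Here is a minimal sketch of reading one bit out of an XMM register with the bit-resolution form, assuming ml64 syntax and the Win64 calling convention. The procedure name GetBit64 and its arguments are hypothetical, and a plain AND on the general-purpose register stands in for a PAND mask:

```asm
.code
; Hypothetical helper (not from any post in this thread):
;   rcx -> 16 bytes of bitmap data (any alignment)
;   edx =  bit index, 0..63, within the low quadword
; Returns eax = 1 if that bit is set, else 0.
GetBit64 proc
    movdqu xmm0, xmmword ptr [rcx]  ; unaligned 128-bit load
    movd   xmm1, edx                ; shift count, zero-extended to 128 bits
    psrlq  xmm0, xmm1               ; shift each 64-bit lane right by edx bits
    movd   eax, xmm0                ; low 32 bits back to a general-purpose reg
    and    eax, 1                   ; keep only the wanted bit
    ret
GetBit64 endp
end
```

For a bit in the upper quadword, a psrldq xmm0, 8 before the psrlq would first bring the high half down by whole bytes.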

daydreamer · September 13, 2023, 01:57:49 PM

I'm looking for the fastest way to read a bitmap and draw isometric tiles that are taller where a bit is set.
I remember that the Bresenham line algorithm with conditional jumps works great on old CPUs, where there is no branch prediction to fail. I wonder whether cmp followed by setcc or adc would be faster on the newest CPUs than paying the j** misprediction penalty when the "error" threshold is reached?
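
One way to take the jump out of the error step, sketched under assumptions: keep the error as a 32-bit fraction, so that its overflow lands in the carry flag and adc applies it to y. This is a fixed-point DDA rather than textbook Bresenham. The name LineY and its register use (Win64 convention, ml64 syntax) are hypothetical, and whether it beats a well-predicted j** on a given CPU would need timing:

```asm
.code
; Hypothetical sketch (not from any post in this thread):
;   ecx = slope dy/dx as a 0.32 fixed-point fraction (dy <= dx assumed)
;   edx = number of x steps, must be > 0
; Returns eax = number of y steps taken over those x steps.
LineY proc
    xor eax, eax        ; y = 0
    xor r8d, r8d        ; fractional error = 0
@@: add r8d, ecx        ; error += slope; CF = 1 when the fraction wraps
    adc eax, 0          ; y += CF, no conditional jump for the error step
    ; a real loop would plot (x, y) here
    dec edx             ; dec leaves CF alone, but CF is already consumed
    jnz @B              ; loop branch only: taken every time but the last
    ret
LineY endp
end
```

With slope = 80000000h (0.5) and 4 steps, the error wraps on every second add, so the result is 2.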

jj2007 · September 13, 2023, 06:11:14 PM

Magnus,

Since we are in The Lab anyway, I made up a little test comparing a conditional jump against the sete instruction. The result is somewhat surprising - the jne is indeed slower, but only in the first test:

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

40 mega iterations, 4096 instructions

137 megacycles for jne eax zero
33 megacycles for sete eax zero
33 megacycles for jne eax nonzero
32 megacycles for sete eax nonzero

134 megacycles for jne eax zero
32 megacycles for sete eax zero
32 megacycles for jne eax nonzero
32 megacycles for sete eax nonzero

133 megacycles for jne eax zero
34 megacycles for sete eax zero
32 megacycles for jne eax nonzero
33 megacycles for sete eax nonzero

115 megacycles for jne eax zero
32 megacycles for sete eax zero
32 megacycles for jne eax nonzero
32 megacycles for sete eax nonzero

101 megacycles for jne eax zero
33 megacycles for sete eax zero
32 megacycles for jne eax nonzero
32 megacycles for sete eax nonzero

110 megacycles for jne eax zero
32 megacycles for sete eax zero
32 megacycles for jne eax nonzero
34 megacycles for sete eax nonzero

112 megacycles for jne eax zero
34 megacycles for sete eax zero
33 megacycles for jne eax nonzero
37 megacycles for sete eax nonzero

Here is the essence of the test (full Masm64 SDK code attached):

Code:

  repct=0
  REPEAT 7
    cmpzero=0    ; 0=set the zero flag
    tCycles
    xor r15, r15
    align 4
    lbl CATSTR <test>, %repct
    lbl:    mov eax, 123
        REPEAT instructions
            xor edx, edx
            cmp eax, 123+cmpzero
            jne @F
            mov dl, 1
            @@:
        ENDM
        inc r15
        cmp r15, tests
        jnz lbl
    tCycles jne eax zero
    tCycles
    xor r15, r15
    align 4
    lbl CATSTR <test>, %repct+1
    lbl:    mov eax, 123
        REPEAT instructions
            cmp eax, 123+cmpzero
            sete dl
        ENDM
        inc r15
        cmp r15, tests
        jnz lbl
    tCycles sete eax zero    ; end of test, print "xx cycles for mov eax, 123"

    cmpzero=1    ; continue tests, but clear the zero flag
    tCycles
    xor r15, r15
    align 4
    lbl CATSTR <test>, %repct+2
    lbl:    mov eax, 123
        REPEAT instructions
            xor edx, edx
            cmp eax, 123+cmpzero
            mov dl, 1
        ENDM
        inc r15
        cmp r15, tests
        jnz lbl
    tCycles jne eax nonzero
    tCycles
    xor r15, r15
    align 4
    lbl CATSTR <test>, %repct+4
    lbl:    mov eax, 123
        REPEAT instructions
            cmp eax, 123+cmpzero+3
            sete dl
        ENDM
        inc r15
        cmp r15, tests
        jnz lbl
    tCycles sete eax nonzero    ; end of test, print "xx cycles for mov eax, 123"
    invoke __imp__cprintf, cfm$("\n")
    repct=repct+5
  ENDM

So it seems that jxx is slow when the branch is not taken.

Try this, you are in for a surprise:

REPEAT instructions
xor edx, edx
cmp eax, 123+cmpzero
jmp @F
mov dl, 1
@@:
ENDM

daydreamer · September 14, 2023, 03:26:33 PM

Thanks, Jochen.
It seems it learns to predict better.
And j** has probably been improved by Intel, while set** is kept for compatibility.
A backward jump in a loop is the most likely case, so when that jump is not taken there is a slow cache reload.
My interest in learning jumpless code comes from my interest in SIMD: after several mulps, addps, subps, 4x comiss/j** seems slow.
I have this code somewhere:
ror ebx, 1 ; check tilemap
rol eax, 4 ; use tile height 16 if carry is set
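
The carry flag left by ror can be turned into the height without a jump. This is a minimal sketch, assuming the intent is height 16 when the rotated-out bit is 1 and 0 otherwise. The procedure name and register use (Win64 convention, ml64 syntax) are hypothetical, and this is not the code referred to above:

```asm
.code
; Hypothetical sketch:
;   ecx = tilemap bits, bit 0 is the tile being tested
; Returns eax = 16 if bit 0 was set, else 0.
TileHeight proc
    ror ecx, 1          ; CF = old bit 0 (also rotated into bit 31)
    sbb eax, eax        ; eax = eax - eax - CF = 0 or 0FFFFFFFFh
    and eax, 16         ; all-ones mask -> 16, zero mask -> 0
    ret
TileHeight endp
end
```

The sbb/and pair yields 0 or 16 directly; setc al followed by movzx and shl eax, 4 would give the same result.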

daydreamer · September 14, 2023, 11:24:16 PM

It seems they fixed and improved branch prediction in this later-generation CPU.

Has anyone tested it on a new AMD?

Code:

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz

40 mega iterations, 4096 instructions

41 megacycles for jne eax zero
36 megacycles for sete eax zero
32 megacycles for jne eax nonzero
36 megacycles for sete eax nonzero

41 megacycles for jne eax zero
36 megacycles for sete eax zero
39 megacycles for jne eax nonzero
40 megacycles for sete eax nonzero

41 megacycles for jne eax zero
35 megacycles for sete eax zero
31 megacycles for jne eax nonzero
35 megacycles for sete eax nonzero

40 megacycles for jne eax zero
35 megacycles for sete eax zero
32 megacycles for jne eax nonzero
36 megacycles for sete eax nonzero

41 megacycles for jne eax zero
36 megacycles for sete eax zero
31 megacycles for jne eax nonzero
35 megacycles for sete eax nonzero

43 megacycles for jne eax zero
36 megacycles for sete eax zero
31 megacycles for jne eax nonzero
36 megacycles for sete eax nonzero

42 megacycles for jne eax zero
36 megacycles for sete eax zero
30 megacycles for jne eax nonzero
35 megacycles for sete eax nonzero

The MASM Forum