Title: SIMD rotate and shift
Post by: jj2007 on August 19, 2023, 04:40:44 AM

I just stumbled over the Rotate Bits SSE (https://masm32.com/board/index.php?topic=8496.0) thread, when looking for a way to do what shr eax, cl does, but for an xmm reg.

It turns out you can't do that, except with utterly slow self-modifying code :sad:

Title: Re: SIMD rotate and shift
Post by: NoCforMe on August 19, 2023, 04:49:28 AM

x86 forever!

Title: Re: SIMD rotate and shift
Post by: jj2007 on August 19, 2023, 11:38:06 AM

Well, a certain dose of SIMD code does no harm, David :biggrin:

Anyway, I created two pseudo instructions shlXmm and shrXmm, equivalent to shl reg32, cl but with a reg32 as counter:

Code Select

include \masm32\MasmBasic\MasmBasic.inc
Otest    OWORD 0123456789ABCDEF0123456789ABCDEFh
  Init
  Cls 9        ; clear screen "lite": move current content 9 lines up
  movups xmm0, Otest
  m2m edx, 4
  deb 4, "start", x:xmm0
  shlXmm xmm0, edx
  deb 4, "shl 4",  x:xmm0
  m2m edx, 3
  shrXmm xmm0, edx
  deb 4, "shr 3",  x:xmm0
  Print 
  movups xmm7, Otest
  m2m ecx, 3
  deb 4, "start", x:xmm7
  shrXmm xmm7, ecx
  deb 4, "shr 3",  x:xmm7
  m2m ecx, 2
  shlXmm xmm7, ecx
  deb 4, "shl 2",  x:xmm7
EndOfCode

Code Select

start   x:xmm0          01234567 89ABCDEF 01234567 89ABCDEF
shl 4   x:xmm0          89ABCDEF 01234567 89ABCDEF 00000000
shr 3   x:xmm0          00000089 ABCDEF01 23456789 ABCDEF00

start   x:xmm7          01234567 89ABCDEF 01234567 89ABCDEF
shr 3   x:xmm7          00000001 23456789 ABCDEF01 23456789
shl 2   x:xmm7          00012345 6789ABCD EF012345 67890000

Source & exe attached, building requires MasmBasic 19.8.23 (http://masm32.com/board/index.php?topic=94.0)

Note these two "instructions" are too slow for a tight innermost loop. This is self-modifying code executed on the stack, and a single call takes several hundred cycles.
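
For comparison, here is a rough sketch of one way to get a run-time byte count without self-modifying code. It is not how shlXmm/shrXmm work internally. Judging by the output above, the count is in bytes, so the sketch stores the value in the middle of a zeroed buffer and reloads it from an address shifted by n bytes. The names shl_bytes, shr_bytes and dword_of are made up for illustration, and the speed has not been measured here.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Shift a 128-bit value left by n bytes (0..16), n known only at run time.
   The value sits at buf[16..31]; loading from n bytes lower moves zeros
   in at the low end. */
static __m128i shl_bytes(__m128i v, unsigned n)
{
    uint8_t buf[48];
    assert(n <= 16);
    memset(buf, 0, sizeof buf);
    _mm_storeu_si128((__m128i *)(buf + 16), v);
    return _mm_loadu_si128((const __m128i *)(buf + 16 - n));
}

/* Shift right by n bytes (0..16): load from n bytes higher instead. */
static __m128i shr_bytes(__m128i v, unsigned n)
{
    uint8_t buf[48];
    assert(n <= 16);
    memset(buf, 0, sizeof buf);
    _mm_storeu_si128((__m128i *)(buf + 16), v);
    return _mm_loadu_si128((const __m128i *)(buf + 16 + n));
}

/* Inspection helper: dword i of v, where i = 0 is the least significant. */
static uint32_t dword_of(__m128i v, int i)
{
    uint32_t d[4];
    _mm_storeu_si128((__m128i *)d, v);
    return d[i];
}
```

Feeding it the Otest value with counts 4 (left) and then 3 (right) should reproduce the first two result lines of the output above. The store followed by an overlapping unaligned load probably costs a store-forwarding stall, so this is a starting point for timing rather than a claim about performance.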

Title: Re: SIMD rotate and shift
Post by: jj2007 on August 19, 2023, 08:32:20 PM

I made some tests with a variant that a) places the instructions on the stack, then b) loops a thousand times with a call esp. So the self-modification happens only once.

No significant speed difference, apparently the cpu simply doesn't like to work on the stack. What strikes me, though, is that Windows has no problem with executing code on the stack. Both Windows 7-64 and Win 10 just allow it...

Title: Re: SIMD rotate and shift
Post by: guga on August 19, 2023, 09:16:23 PM

Hi JJ. I made some tests a while ago on a way to create a SHL with SSE. See if this can help you to speed up a little.

Equates needed

Code Select

[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number

Table/Variable/Other Equates needed

Code Select

;[M_BITS (128-1)] ; 32bit * 4 dwords - 1
[BITS_FMT 32]
[M_BITS <(BITS_FMT*4-1)>] ; = 127, which is the exponent bias of a 32-bit float
[<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
                               M_BITS, M_BITS]

; The functions

SSE_shl_Int4

Code Select

______________________________________________________________________________________________

;;
    SSE_SHL_INT4
    
    This function shifts 4 integers left, each by its own bit count, i.e. multiplies each one by a power of 2.
    It's the same functionality as shl eax 1, shl eax 2, etc.,
    but with 4 integers at once

    Arguments:

        pTblNumber(in): Pointer to a variable containing an array of 4 integers to be shifted.
                        Each dword in the array is shifted by the count at the same position in the
                        bit count table array.

        pTblBitCount(in): Pointer to a bit count table array containing 4 integers used as bit counts.
                           Each bit count works as in shl eax 5, where 5 is the number of bits to be
                           shifted left.
                           The bit count in each dword of the table is limited to an integer from 0 to 31.

    Return Value:
        The resultant 4 integers data will be stored on xmm0 register

    Remarks

        The shl of a number is equal to
            Number*(1 shl y) where y = count of bits to be shifted. Remembering that 1 shl y = 2 ^y
        Ex:
            7 shl 3 = 7 * (1 shl 3) = 7*8 = 7*(2^3) = 56
            35 shl 9 = 35 * (1 shl 9) = 35*512 = 35*(2^9) = 17920
            101 shl 14 = 101 * (1 shl 14) = 101*16384 = 101*(2^14) = 1654784

    Example of usage:
        ; internal usage of the function
        [M_BITS (128-1)] ; = 127, which is the exponent bias of a 32-bit float
        [<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
                                       M_BITS, M_BITS]

        ; our values to be calculated
        [NumberstoShift: D$ 8, 7, 6, 1]
        [BitCountTbl: D$ 3, 14, 2, 24]
        
        call SSE_SHL_INT4 NumberstoShift, BitCountTbl
        
        The example will shift left the numbers from NumberstoShift using the BitCountTbl array.
        The mathematical operation is as follows:
        8 shl 3 = 64
        7 shl 14 = 114688
        6 shl 2 = 24
        1 shl 24 = 16777216

References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2
;;

[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number

Proc SSE_shl_Int4:
    Arguments @pTblNumber, @pTblBitCount
    Uses eax

    ; 1st we calculate the power of 2 (1 shl x) = 2^x
    mov eax D@pTblBitCount
    movdqu xmm0 X$eax | paddd xmm0 X$Tbl_Four_Floats_32Bit
    pslld xmm0 MANTISSA_32_BITS
    mov eax D@pTblNumber | movups xmm1 X$eax

    ; y shl x = y*(2^x)
    mulps xmm0 xmm1
    ;cvtps2dq xmm0 xmm0 ; convert the result to signed integers. No longer needed, since the input is loaded as 4 raw dwords and not as 4 floats

EndP
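
The core of the routine is the identity y shl x = y*(2^x), with 2^x built by writing x + 127 straight into the exponent field of a float. A minimal sketch of that idea in C with SSE2 intrinsics follows. It is an assumption-laden illustration, not a translation of the RosAsm code: unlike the routine above, it converts the integers to floats before the multiply and truncates back afterwards, and the name shl_int4 is made up.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* out[i] = num[i] << cnt[i] for four 32-bit lanes, computed as num * 2^cnt.
   Hypothetical helper. Results are exact only while num fits in 24 bits
   and the product fits in a signed 32-bit integer. */
static void shl_int4(const int32_t num[4], const int32_t cnt[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        assert(cnt[i] >= 0 && cnt[i] < 31);

    __m128i c = _mm_loadu_si128((const __m128i *)cnt);
    __m128i n = _mm_loadu_si128((const __m128i *)num);

    /* Build the float 2^cnt directly: the biased exponent (cnt + 127)
       goes into bits 23..30, mantissa and sign stay zero. */
    __m128i bits = _mm_slli_epi32(_mm_add_epi32(c, _mm_set1_epi32(127)), 23);
    __m128  pow2 = _mm_castsi128_ps(bits);

    /* Multiply in float, then truncate back to integers. */
    __m128 prod = _mm_mul_ps(_mm_cvtepi32_ps(n), pow2);
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(prod));
}
```

With the inputs from the example (8, 7, 6, 1 and counts 3, 14, 2, 24) this gives 64, 114688, 24 and 16777216. The explicit conversions cost two extra instructions but keep the result independent of how denormal floats are handled.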

SSE_shl_Float4

Code Select


;;
    SSE_SHL_FLOAT4
    
    This function "shifts left" 4 floats (32-bit floats), i.e. multiplies each one by a power of 2.
    It's the same functionality as shl eax 1, shl eax 2, etc.,
    but with 4 values at once, and returning 4 packed single-precision floats rather than integers

    Arguments:

        pTblNumber(in): Pointer to a variable containing an array of 4 floats (32-bit floats) to be shifted.
                        Each float in the array is shifted by the count at the same position in the
                        bit count table array.

        pTblBitCount(in): Pointer to a bit count table array containing 4 integers used as bit counts.
                           Each bit count works as in shl eax 5, where 5 is the number of bits to be
                           shifted left.
                           The bit count in each dword of the table is limited to an integer from 0 to 31.

    Return Value:
        The resultant 4 Packed Single precision floats data will be stored on xmm0 register

    Remarks

        The shl of a number is equal to
            Number*(1 shl y) where y = count of bits to be shifted. Remembering that 1 shl y = 2 ^y
        Ex:
            8.1457 shl 3 = 8.1457 * (1 shl 3) = 8.1457*8 = 8.1457*(2^3) = 65.1656
            7.48789 shl 9 = 7.48789 * (1 shl 9) = 7.48789*512 = 7.48789*(2^9) = 3833.79968
            60.125 shl 14 = 60.125 * (1 shl 14) = 60.125*16384 = 60.125*(2^14) = 985088

    Example of usage:
        ; internal usage of the function
        [M_BITS (128-1)] ; = 127, which is the exponent bias of a 32-bit float
        [<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
                                       M_BITS, M_BITS]

        ; our values to be calculated
        [NumberstoShift: F$ 8.1457, 7.48789, 60.125, 1.1458]
        [BitCountTbl: D$ 3, 14, 2, 24]
        
        call SSE_SHL_FLOAT4 NumberstoShift, BitCountTbl
        
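        The example will shift left the numbers from NumberstoShift using the BitCountTbl array.
        The mathematical operation is as follows (exact values; the routine returns them rounded to single precision):
        8.1457 shl 3 = 65.1656
        7.48789 shl 14 = 122681.58976
        60.125 shl 2 = 240.5
        1.1458 shl 24 = 19223334.0928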

References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2
;;

;[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
;[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
;[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number

Proc SSE_shl_Float4:
    Arguments @pTblNumber, @pTblBitCount
    Uses eax

    ; 1st we calculate the power of 2 (1 shl x) = 2^x
    mov eax D@pTblBitCount
    movups xmm0 X$eax | paddd xmm0 X$Tbl_Four_Floats_32Bit
    pslld xmm0 MANTISSA_32_BITS
    mov eax D@pTblNumber | movups xmm1 X$eax

    ; y shl x = y*(2^x)
    mulps xmm0 xmm1

EndP

For doubles

Code Select


[BITS_FMT64 64]
[M_BITS2 1023] ; exponent bias of a 64-bit float (double); 127 is the bias for 32-bit floats only
[<16 Tbl_Four_Floats_64Bit: Q$ M_BITS2, M_BITS2]

Proc SSE_shl_Double:
    Arguments @pTblNumber, @pTblBitCount
    Uses eax

    ; 1st we calculate the power of 2 (1 shl x) = 2^x
    mov eax D@pTblBitCount
    movdqu xmm0 X$eax | paddq xmm0 X$Tbl_Four_Floats_64Bit ; movdqu loads both 64-bit counts; movq would load only the first
    psllq xmm0 MANTISSA_64_BITS
    mov eax D@pTblNumber | movups xmm1 X$eax

    ; y shl x = y*(2^x)
    mulpd xmm0 xmm1
    ; no conversion to integers here: the result stays in xmm0 as 2 packed doubles

EndP
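
For doubles the same trick needs the 64-bit layout: the exponent bias is 1023 and the exponent field starts at bit 52. A small sketch in C with SSE2 intrinsics, under the assumption that both counts are 64-bit integers as in BitCountTbl2 below; the name shl_double2 is made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* out[i] = val[i] * 2^cnt[i] for two doubles. Hypothetical helper. */
static void shl_double2(const double val[2], const int64_t cnt[2], double out[2])
{
    /* The biased exponent must stay within 1..2046 to give a normal double. */
    for (int i = 0; i < 2; i++)
        assert(cnt[i] >= -1022 && cnt[i] <= 1023);

    __m128i c = _mm_loadu_si128((const __m128i *)cnt);  /* both 64-bit counts */

    /* Biased exponent (cnt + 1023) goes into bits 52..62. */
    __m128i bits = _mm_slli_epi64(_mm_add_epi64(c, _mm_set1_epi64x(1023)), 52);
    __m128d pow2 = _mm_castsi128_pd(bits);

    _mm_storeu_pd(out, _mm_mul_pd(_mm_loadu_pd(val), pow2));
}
```

For the values 6 and 1 with counts 2 and 24 this yields 24.0 and 16777216.0.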

Examples of usage:

Shifting Left Integers

Code Select


[NumberstoShift: D$ 8, 7, 6, 1]
[BitCountTbl: D$ 3, 14, 2, 24]

call SSE_SHL_INT4 NumberstoShift, BitCountTbl

Shifting Left Float32

Code Select


[NumberstoShift2: F$ 8, 7, 6, 1]
[BitCountTbl: D$ 3, 14, 2, 24]

call SSE_shl_Float4 NumberstoShift2, BitCountTbl

Shifting Left Doubles

Code Select

        [NumberstoShift3: R$ 6, 1]
        [BitCountTbl2: Q$ 2, 24]

call SSE_shl_Double NumberstoShift3, BitCountTbl2

References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2

Title: Re: SIMD rotate and shift
Post by: InfiniteLoop on August 21, 2023, 05:52:49 AM

For self-modifying code it's 10% faster to place the modified code and modifying function in separate pages.
AVX512 has rotation and I've yet to see a use for it.
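
For readers without AVX-512 (where vprold and vprolvd rotate 32-bit lanes directly), a per-lane rotate is commonly composed from two shifts and an OR. The sketch below assumes SSE2 only and relies on the fact that pslld and psrld accept a run-time bit count in an xmm register; the name rotl32x4 is made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Rotate each of the four 32-bit lanes left by the same run-time count n.
   Hypothetical helper; n must be in 0..31. */
static __m128i rotl32x4(__m128i v, unsigned n)
{
    assert(n < 32);
    /* The shift count is taken from the low 64 bits of an xmm register. */
    __m128i left  = _mm_sll_epi32(v, _mm_cvtsi32_si128((int)n));
    /* For n = 0 this shifts by 32, which yields all zeros, so the OR
       returns v unchanged. */
    __m128i right = _mm_srl_epi32(v, _mm_cvtsi32_si128((int)(32 - n)));
    return _mm_or_si128(left, right);
}
```

Rotating a lane holding 89ABCDEFh by 8 bits gives ABCDEF89h. This works on bits within each 32-bit lane, so it is a different operation from the whole-register byte shifts discussed earlier in the thread.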

Title: Re: SIMD rotate and shift
Post by: jj2007 on August 21, 2023, 07:37:37 AM

Quote from: InfiniteLoop on August 21, 2023, 05:52:49 AM
For self-modifying code it's 10% faster to place the modified code and modifying function in separate pages.
AVX512 has rotation and I've yet to see a use for it.

Thanks, InfiniteLoop, very interesting :thumbsup:

The MASM Forum

General => The Laboratory => Topic started by: jj2007 on August 19, 2023, 04:40:44 AM