SIMD rotate and shift

Started by jj2007, August 19, 2023, 04:40:44 AM

jj2007

I just stumbled over the Rotate Bits SSE thread, when looking for a way to do what shr eax, cl does, but for an xmm reg.

It turns out you can't do that, except with utterly slow self-modifying code :sad:


NoCforMe

Assembly language programming should be fun. That's why I do it.

jj2007

Well, a certain dose of SIMD code does no harm, David :biggrin:

Anyway, I created two pseudo instructions shlXmm and shrXmm, equivalent to shl reg32, cl but with a reg32 as counter:
include \masm32\MasmBasic\MasmBasic.inc
Otest    OWORD 0123456789ABCDEF0123456789ABCDEFh
  Init
  Cls 9        ; clear screen "lite": move current content 9 lines up
  movups xmm0, Otest
  m2m edx, 4
  deb 4, "start", x:xmm0
  shlXmm xmm0, edx
  deb 4, "shl 4",  x:xmm0
  m2m edx, 3
  shrXmm xmm0, edx
  deb 4, "shr 3",  x:xmm0
  Print
  movups xmm7, Otest
  m2m ecx, 3
  deb 4, "start", x:xmm7
  shrXmm xmm7, ecx
  deb 4, "shr 3",  x:xmm7
  m2m ecx, 2
  shlXmm xmm7, ecx
  deb 4, "shl 2",  x:xmm7
EndOfCode

start   x:xmm0          01234567 89ABCDEF 01234567 89ABCDEF
shl 4   x:xmm0          89ABCDEF 01234567 89ABCDEF 00000000
shr 3   x:xmm0          00000089 ABCDEF01 23456789 ABCDEF00

start   x:xmm7          01234567 89ABCDEF 01234567 89ABCDEF
shr 3   x:xmm7          00000001 23456789 ABCDEF01 23456789
shl 2   x:xmm7          00012345 6789ABCD EF012345 67890000

Source & exe attached, building requires MasmBasic 19.8.23

Note these two "instructions" are too slow for a tight innermost loop. This is self-modifying code executed on the stack, and a single call takes several hundred cycles.
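
To make the semantics of the two pseudo instructions concrete, here is a plain C model of what the debug output above suggests they do. Judging by that output, the count is in bytes (like pslldq/psrldq), not in bits; that reading is an assumption, as are the names xmm_t, shl_xmm, shr_xmm, xmm_from_hex and xmm_to_hex, which are made up for this sketch. It is not the MasmBasic implementation and involves no self-modifying code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 128-bit value as 16 bytes; b[0] is the least significant byte (xmm layout). */
typedef struct { uint8_t b[16]; } xmm_t;

/* Shift left by n BYTES with zero fill; n >= 16 clears the value. */
static xmm_t shl_xmm(xmm_t v, unsigned n) {
    xmm_t r;
    memset(&r, 0, sizeof r);
    for (unsigned i = n; i < 16; i++) r.b[i] = v.b[i - n];
    return r;
}

/* Shift right by n BYTES with zero fill; n >= 16 clears the value. */
static xmm_t shr_xmm(xmm_t v, unsigned n) {
    xmm_t r;
    memset(&r, 0, sizeof r);
    for (unsigned i = 0; i + n < 16; i++) r.b[i] = v.b[i + n];
    return r;
}

/* Parse exactly 32 hex digits, most significant digit first. */
static xmm_t xmm_from_hex(const char *hex) {
    xmm_t r;
    assert(strlen(hex) == 32);
    for (int i = 0; i < 16; i++) {
        unsigned byte = 0;
        int ok = sscanf(hex + 2 * i, "%2x", &byte);
        assert(ok == 1);
        (void)ok;
        r.b[15 - i] = (uint8_t)byte;
    }
    return r;
}

/* Format as 32 uppercase hex digits, most significant digit first. */
static void xmm_to_hex(xmm_t v, char out[33]) {
    for (int i = 0; i < 16; i++)
        sprintf(out + 2 * i, "%02X", (unsigned)v.b[15 - i]);
}
```

Feeding it the Otest value, then shl_xmm(v, 4) followed by shr_xmm(v, 3), gives the same digits as the xmm0 lines in the output above.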

jj2007

I ran some tests with a variant that a) places the instructions on the stack, then b) loops a thousand times with a call esp, so the self-modification happens only once.

No significant speed difference; apparently the CPU simply doesn't like to execute code on the stack. What strikes me, though, is that Windows has no problem with executing code on the stack: both Windows 7-64 and Windows 10 just allow it...

guga

Hi JJ. I made some tests a while ago on a way to do a SHL with SSE. See if this helps you speed things up a little.


Equates needed
[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number

Table/Variable/Other Equates needed
;[M_BITS (128-1)] ; 32bit * 4 dwords - 1
[BITS_FMT 32]
[M_BITS <(BITS_FMT*4-1)>] ; 32bit * 4 dwords - 1 = 127, the exponent bias of a 32-bit float
[<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
                               M_BITS, M_BITS]

; The functions

SSE_shl_Int4
______________________________________________________________________________________________

;;
    SSE_SHL_INT4
   
    This function shifts 4 integers left, each by its own bit count (i.e. multiplies each by a power of 2).
    It's the same functionality as shl eax 1, shl eax 2, etc.,
    but with 4 integers at once

    Arguments:

        pTblNumber(in): Pointer to a variable containing an array of 4 integers to be shifted.
                        Each dword in the array is shifted by the bit count at the same position in the
                        Bit Count Table Array

        pTblBitCount(in): Pointer to a Bit Count Table Array containing 4 integers used as bit counts.
                           Each bit count works as in shl eax 5, where 5 is the number of bits to be
                           shifted left.
                           The bit count in each dword of the table is limited to an integer from 0 to 31

    Return Value:
        The 4 resulting integers are returned in the xmm0 register

    Remarks

        The shl of a number is equal to
            Number*(1 shl y) where y = count of bits to be shifted. Remembering that 1 shl y = 2 ^y
        Ex:
            7 shl 3 = 7 * (1 shl 3) = 7*8 = 7*(2^3) = 56
            35 shl 9 = 35 * (1 shl 9) = 35*512 = 35*(2^9) = 17920
            101 shl 14 = 101 * (1 shl 14) = 101*16384 = 101*(2^14) = 1654784

    Example of usage:
        ; internal usage of the function
        [M_BITS (128-1)] ; 32bit * 4 dwords - 1
        [<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
                                       M_BITS, M_BITS]

        ; our values to be calculated
        [NumberstoShift: D$ 8, 7, 6, 1]
        [BitCountTbl: D$ 3, 14, 2, 24]
       
        call SSE_SHL_INT4 NumberstoShift, BitCountTbl
       
        The example will shift left the numbers from NumberstoShift using the BitCountTbl Array.
        The mathematical operation is as follows:
        8 shl 3 = 64
        7 shl 14 = 114688
        6 shl 2 = 24
        1 shl 24 = 16777216

References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2
;;

[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number

Proc SSE_shl_Int4:
    Arguments @pTblNumber, @pTblBitCount
    Uses eax

    ; 1st we calculate the power of 2 (1 shl x) = 2^x
    mov eax D@pTblBitCount
    movdqu xmm0 X$eax | paddd xmm0 X$Tbl_Four_Floats_32Bit
    pslld xmm0 MANTISSA_32_BITS
    mov eax D@pTblNumber | movups xmm1 X$eax

    ; y shl x = y*(2^x)
    mulps xmm0 xmm1
    ;cvtps2dq xmm0 xmm0 ; convert the result to signed integers. No longer needed, since the input is 4 dwords and not 4 floats

EndP
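
The routine above relies on two bit-level facts: adding 127 to a count and shifting it into the exponent field (paddd, then pslld 23) yields the float 2^x, and mulps is then applied to lanes whose bits are still raw integers. The scalar C model below is only a sketch of one lane as I read the code, not guga's implementation; the names bits_to_float, float_to_bits, pow2_from_count and shl_lane_model are invented for illustration. It assumes IEEE 754 single precision and that denormals are not flushed to zero (no FTZ/DAZ, no fast-math).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits without any numeric conversion. */
static float bits_to_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

static uint32_t float_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* paddd 127 + pslld 23: place x + bias in the exponent field, giving 2^x. */
static float pow2_from_count(uint32_t x) {
    assert(x <= 127);   /* larger counts overflow the 8-bit exponent field */
    return bits_to_float((x + 127u) << 23);
}

/* mulps on a lane whose bit pattern is the raw integer n (a denormal float). */
static uint32_t shl_lane_model(uint32_t n, uint32_t x) {
    return float_to_bits(bits_to_float(n) * pow2_from_count(x));
}
```

In this model the four documented examples (8 shl 3, 7 shl 14, 6 shl 2, 1 shl 24) come out as expected. As far as I can tell the result bits only match a true integer shift while the result stays below 2^24 (or is an exact power of two): for example, shl_lane_model(3, 23) does not equal 3 shl 23. Treat that range limit as my reading of the arithmetic, not as a tested property of the original routine.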

SSE_shl_Float4

;;
    SSE_SHL_FLOAT4
   
    This function "shifts left" 4 floats (32-bit floats), i.e. multiplies each by a power of 2.
    It's the same functionality as shl eax 1, shl eax 2, etc.,
    but with 4 values at once, and returning 4 packed single-precision floats rather than integers

    Arguments:

        pTblNumber(in): Pointer to a variable containing an array of 4 floats (32-bit floats) to be shifted.
                        Each float in the array is shifted by the bit count at the same position in the
                        Bit Count Table Array

        pTblBitCount(in): Pointer to a Bit Count Table Array containing 4 integers used as bit counts.
                           Each bit count works as in shl eax 5, where 5 is the number of bits to be
                           shifted left.
                           The bit count in each dword of the table is limited to an integer from 0 to 31

    Return Value:
        The 4 resulting packed single-precision floats are returned in the xmm0 register

    Remarks

        The shl of a number is equal to
            Number*(1 shl y) where y = count of bits to be shifted. Remembering that 1 shl y = 2 ^y
        Ex:
            8.1457 shl 3 = 8.1457 * (1 shl 3) = 8.1457*8 = 8.1457*(2^3) = 65.1656
            7.48789 shl 9 = 7.48789 * (1 shl 9) = 7.48789*512 = 7.48789*(2^9) = 3833.79968
            60.125 shl 14 = 60.125 * (1 shl 14) = 60.125*16384 = 60.125*(2^14) = 985088

    Example of usage:
        ; internal usage of the function
        [M_BITS (128-1)] ; 32bit * 4 dwords - 1
        [<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
                                       M_BITS, M_BITS]

        ; our values to be calculated
        [NumberstoShift: F$ 8.1457, 7.48789, 60.125, 1.1458]
        [BitCountTbl: D$ 3, 14, 2, 24]
       
        call SSE_SHL_FLOAT4 NumberstoShift, BitCountTbl
       
        The example will shift left the numbers from NumberstoShift using the BitCountTbl Array.
        The mathematical operation is as follows (results approximate, single precision):
        8.1457 shl 3 = 65.1656
        7.48789 shl 14 = 122681.59
        60.125 shl 2 = 240.5
        1.1458 shl 24 = 19223334

References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2
;;

;[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
;[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
;[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number

Proc SSE_shl_Float4:
    Arguments @pTblNumber, @pTblBitCount
    Uses eax

    ; 1st we calculate the power of 2 (1 shl x) = 2^x
    mov eax D@pTblBitCount
    movups xmm0 X$eax | paddd xmm0 X$Tbl_Four_Floats_32Bit
    pslld xmm0 MANTISSA_32_BITS
    mov eax D@pTblNumber | movups xmm1 X$eax

    ; y shl x = y*(2^x)
    mulps xmm0 xmm1

EndP
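
For the float version the arithmetic is plain scaling: each lane is multiplied by 2^count, and multiplying by a power of two is exact as long as the result neither overflows nor becomes denormal. The lane-by-lane C model below is a sketch under that reading; pow2_from_count and shl_float4_model are hypothetical names, and the loop stands in for the single mulps of the original.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build 2^x by writing x + 127 into the exponent field of a 32-bit float. */
static float pow2_from_count(uint32_t x) {
    uint32_t bits;
    float f;
    assert(x <= 127);   /* larger counts overflow the 8-bit exponent field */
    bits = (x + 127u) << 23;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Lane-by-lane model of the routine: out[i] = numbers[i] * 2^counts[i]. */
static void shl_float4_model(const float numbers[4],
                             const uint32_t counts[4],
                             float out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = numbers[i] * pow2_from_count(counts[i]);
}
```

Called with the documented inputs (8.1457, 7.48789, 60.125, 1.1458 and counts 3, 14, 2, 24), the third lane is exactly 240.5; the other lanes equal the input float times 8, 16384 and 16777216 respectively.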



For doubles


[BITS_FMT64 64]
[M_BITS2 1023] ; exponent bias of a 64-bit float (2^10-1); the 32-bit value 127 does not apply here
[<16 Tbl_Four_Floats_64Bit: Q$ M_BITS2, M_BITS2]

Proc SSE_shl_Double:
    Arguments @pTblNumber, @pTblBitCount
    Uses eax

    ; 1st we calculate the power of 2 (1 shl x) = 2^x
    mov eax D@pTblBitCount
    ; movdqu loads both qword bit counts (movq would load only the first one)
    movdqu xmm0 X$eax | paddq xmm0 X$Tbl_Four_Floats_64Bit
    psllq xmm0 MANTISSA_64_BITS
    mov eax D@pTblNumber | movups xmm1 X$eax

    ; y shl x = y*(2^x)
    mulpd xmm0 xmm1

EndP
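
For doubles the same exponent trick needs different constants: the exponent field is 11 bits wide, sits above a 52-bit mantissa, and uses a bias of 1023 (with a bias of 127 the constructed factor would be 2^(x-896) instead of 2^x). The C sketch below models the two lanes under those IEEE 754 assumptions; pow2_from_count64 and shl_double2_model are names invented here, not part of the posted code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build 2^x by writing x + 1023 into the exponent field of a 64-bit float. */
static double pow2_from_count64(uint64_t x) {
    uint64_t bits;
    double d;
    assert(x <= 1023);  /* larger counts overflow the 11-bit exponent field */
    bits = (x + 1023u) << 52;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Lane-by-lane model: out[i] = numbers[i] * 2^counts[i] for two doubles. */
static void shl_double2_model(const double numbers[2],
                              const uint64_t counts[2],
                              double out[2]) {
    for (int i = 0; i < 2; i++)
        out[i] = numbers[i] * pow2_from_count64(counts[i]);
}
```

With the inputs from the usage example (6 and 1, counts 2 and 24) the model gives 24 and 16777216.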




Examples of usage:

Shifting Left Integers


[NumberstoShift: D$ 8, 7, 6, 1]
[BitCountTbl: D$ 3, 14, 2, 24]

call SSE_SHL_INT4 NumberstoShift, BitCountTbl

Shifting Left Float32


[NumberstoShift2: F$ 8, 7, 6, 1]
[BitCountTbl: D$ 3, 14, 2, 24]

call SSE_shl_Float4 NumberstoShift2, BitCountTbl

Shifting Left Doubles

        [NumberstoShift3: R$ 6, 1]
        [BitCountTbl2: Q$ 2, 24]

call SSE_shl_Double NumberstoShift3, BitCountTbl2


References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

InfiniteLoop

For self-modifying code it's 10% faster to place the modified code and the modifying function in separate pages.
AVX512 has rotation and I've yet to see a use for it.

jj2007

Quote from: InfiniteLoop on August 21, 2023, 05:52:49 AM
For self-modifying code it's 10% faster to place the modified code and the modifying function in separate pages.
AVX512 has rotation and I've yet to see a use for it.

Thanks, InfiniteLoop, very interesting :thumbsup: