I just stumbled over the Rotate Bits SSE (https://masm32.com/board/index.php?topic=8496.0) thread, when looking for a way to do what shr eax, cl does, but for an xmm reg.
It turns out you can't do that, except with utterly slow self-modifying code :sad:
x86 forever!
Well, a certain dose of SIMD code does no harm, David :biggrin:
Anyway, I created two pseudo-instructions, shlXmm and shrXmm, equivalent to shl reg32, cl but for an xmm register and with a reg32 as the counter (the count is in bytes, as the output further down shows):
include \masm32\MasmBasic\MasmBasic.inc
Otest OWORD 0123456789ABCDEF0123456789ABCDEFh
Init
Cls 9 ; clear screen "lite": move current content 9 lines up
movups xmm0, Otest
m2m edx, 4
deb 4, "start", x:xmm0
shlXmm xmm0, edx
deb 4, "shl 4", x:xmm0
m2m edx, 3
shrXmm xmm0, edx
deb 4, "shr 3", x:xmm0
Print
movups xmm7, Otest
m2m ecx, 3
deb 4, "start", x:xmm7
shrXmm xmm7, ecx
deb 4, "shr 3", x:xmm7
m2m ecx, 2
shlXmm xmm7, ecx
deb 4, "shl 2", x:xmm7
EndOfCode
start x:xmm0 01234567 89ABCDEF 01234567 89ABCDEF
shl 4 x:xmm0 89ABCDEF 01234567 89ABCDEF 00000000
shr 3 x:xmm0 00000089 ABCDEF01 23456789 ABCDEF00
start x:xmm7 01234567 89ABCDEF 01234567 89ABCDEF
shr 3 x:xmm7 00000001 23456789 ABCDEF01 23456789
shl 2 x:xmm7 00012345 6789ABCD EF012345 67890000
Source & exe attached, building requires MasmBasic 19.8.23 (http://masm32.com/board/index.php?topic=94.0)
Note that these two "instructions" are too slow for a tight innermost loop. This is self-modifying code executed on the stack, and a single call takes several hundred cycles.
I made some tests with a variant that a) places the instructions on the stack, then b) loops a thousand times with a call esp, so the code is modified only once.
No significant speed difference; apparently the CPU simply doesn't like executing code on the stack. What strikes me, though, is that Windows has no problem with executing code on the stack. Both Windows 7-64 and Win 10 just allow it...
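For comparison, a variable byte shift can also be done without self-modifying code, assuming SSSE3 is available: pshufb takes its shuffle control from a register or memory, so a mask loaded at a run-time offset from a 48-byte table gives a shift by 0..16 bytes. This is a different technique from the shlXmm/shrXmm macros, not their implementation. The sketch only models the pshufb semantics in Python to check the masks against the output shown earlier; SHIFT_TABLE, pshufb, shl_xmm and shr_xmm are names made up for illustration.

```python
# 48-byte table: 16 "zero this byte" controls, the identity shuffle 0..15,
# then 16 more "zero this byte" controls.
SHIFT_TABLE = bytes([0x80] * 16 + list(range(16)) + [0x80] * 16)


def pshufb(src: bytes, mask: bytes) -> bytes:
    """Model of pshufb: a control byte with bit 7 set yields 0,
    otherwise its low 4 bits select a source byte."""
    return bytes(0 if m & 0x80 else src[m & 0x0F] for m in mask)


def _check_count(n: int) -> None:
    if not 0 <= n <= 16:
        raise ValueError("byte count must be in 0..16")


def shl_xmm(value: int, n: int) -> int:
    """Shift a 128-bit value left by n bytes (hypothetical helper)."""
    _check_count(n)
    mask = SHIFT_TABLE[16 - n:32 - n]   # one unaligned 16-byte load in asm
    src = value.to_bytes(16, "little")
    return int.from_bytes(pshufb(src, mask), "little")


def shr_xmm(value: int, n: int) -> int:
    """Shift a 128-bit value right by n bytes (hypothetical helper)."""
    _check_count(n)
    mask = SHIFT_TABLE[16 + n:32 + n]
    src = value.to_bytes(16, "little")
    return int.from_bytes(pshufb(src, mask), "little")


if __name__ == "__main__":
    otest = 0x0123456789ABCDEF0123456789ABCDEF
    print(f"{shl_xmm(otest, 4):032X}")  # 89ABCDEF0123456789ABCDEF00000000
```

In assembly this would come down to one unaligned 16-byte load from the table plus one pshufb; whether that is fast enough for a tight loop would need to be timed.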
Hi JJ. I made some tests a while ago on a way to create a SHL with SSE. See if this helps you speed things up a little.
Equates needed
[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number
Table/Variable/Other Equates needed
;[M_BITS (128-1)] ; 32bit * 4 dwords - 1
[BITS_FMT 32]
[M_BITS <(BITS_FMT*4-1)>] ; 32bit * 4 dwords - 1
[<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
M_BITS, M_BITS]
; The functions
SSE_shl_Int4
______________________________________________________________________________________________
;;
SSE_SHL_INT4
This function shifts left 4 integers in xmm0, i.e. multiplies each by a power of 2
It's the same functionality as in shl eax 1, shl eax 2 etc.
but with 4 integers at once
Arguments:
pTblNumber(in): Pointer to a Variable containing an array of 4 integers to be shifted.
Each Dword in the array is shifted by the count at the same position in the
Bit Count Table Array
pTblBitCount(in): Pointer to a Bitcount Table Array containing 4 integers used as bitcounts
The bitcount table is the same as if we do shl eax 5, where 5 is the number of bits to be
shifted left.
The bit count in each dword of the array table is limited to an integer from 0 to 31
Return Value:
The resulting 4 integers will be stored in the xmm0 register
Remarks
The shl of a number is equal to
Number*(1 shl y) where y = count of bits to be shifted. Remembering that 1 shl y = 2 ^y
Ex:
7 shl 3 = 7 * (1 shl 3) = 7*8 = 7*(2^3) = 56
35 shl 9 = 35 * (1 shl 9) = 35*512 = 35*(2^9) = 17920
101 shl 14 = 101 * (1 shl 14) = 101*16384 = 101*(2^14) = 1654784
Example of usage:
; internal usage of the function
[M_BITS (128-1)] ; 32bit * 4 dwords - 1
[<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
M_BITS, M_BITS]
; our values to be calculated
[NumberstoShift: D$ 8, 7, 6, 1]
[BitCountTbl: D$ 3, 14, 2, 24]
call SSE_SHL_INT4 NumberstoShift, BitCountTbl
The example will shift left the numbers from NumberstoShift using the BitCountTbl Array.
The mathematical operation is as follows:
8 shl 3 = 64
7 shl 14 = 114688
6 shl 2 = 24
1 shl 24 = 16777216
References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2
;;
[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number
Proc SSE_shl_Int4:
Arguments @pTblNumber, @pTblBitCount
Uses eax
; 1st we calculate the power of 2 (1 shl x) = 2^x
mov eax D@pTblBitCount
movdqu xmm0 X$eax | paddd xmm0 X$Tbl_Four_Floats_32Bit
pslld xmm0 MANTISSA_32_BITS
mov eax D@pTblNumber | movups xmm1 X$eax
; y shl x = y*(2^x)
mulps xmm0 xmm1
;cvtps2dq xmm0 xmm0 ; convert the result to signed integer. No longer needed, since the input is 4 dwords and not 4 floats
EndP
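A note on why the integer version can skip cvtps2dq, as far as I can tell: mulps treats the integer bit patterns as floats, and small integers are denormals whose value is linear in the bit pattern, so multiplying by 2^x shifts the bits. By the same reasoning the result only matches a real shl while the shifted value stays at or below 2^24 (bit patterns above 01000000h are no longer linear), and only if denormals are not flushed (DAZ/FTZ clear in MXCSR). The sketch is a scalar Python model of a single lane, not the RosAsm code; shl_lane and the helper names are hypothetical.

```python
import struct

FLOAT_BIAS = 127      # exponent bias of an IEEE 754 single
MANTISSA_BITS = 23    # explicit mantissa bits of a single


def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def float_to_bits(value: float) -> int:
    """Round to single precision and return the 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def pow2_bits(n: int) -> int:
    """Bit pattern of 2.0**n: the paddd + pslld step."""
    return (n + FLOAT_BIAS) << MANTISSA_BITS


def shl_lane(value: int, n: int) -> int:
    """One lane of the mulps step, with the integer's bits used as a float."""
    product = bits_to_float(value) * bits_to_float(pow2_bits(n))
    return float_to_bits(product)


if __name__ == "__main__":
    for v, n in zip((8, 7, 6, 1), (3, 14, 2, 24)):
        print(v, "shl", n, "=", shl_lane(v, n))
    # Beyond 2^24 the float encoding is no longer linear in the bit pattern:
    print(shl_lane(3, 23), "vs", 3 << 23)
```

So the four example values come out right, but something like 3 shl 23 or 1 shl 31 does not; for the full 0..31 range a different approach would be needed.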
SSE_shl_Float4
;;
SSE_SHL_FLOAT4
This function shifts left 4 Floats (32 bits floats) in xmm0, i.e. multiplies each by a power of 2
It's the same functionality as in shl eax 1, shl eax 2 etc.
but with 4 values at once, and returning 4 Packed Single precision floats rather than integers
Arguments:
pTblNumber(in): Pointer to a Variable containing an array of 4 floats (32 bit float) to be shifted.
Each Dword in the array is shifted by the count at the same position in the
Bit Count Table Array
pTblBitCount(in): Pointer to a Bitcount Table Array containing 4 integers used as bitcounts
The bitcount table is the same as if we do shl eax 5, where 5 is the number of bits to be
shifted left
The bit count in each dword of the array table is limited to an integer from 0 to 31
Return Value:
The resulting 4 Packed Single precision floats will be stored in the xmm0 register
Remarks
The shl of a number is equal to
Number*(1 shl y) where y = count of bits to be shifted. Remembering that 1 shl y = 2 ^y
Ex:
8.1457 shl 3 = 8.1457 * (1 shl 3) = 8.1457*8 = 8.1457*(2^3) = 65.1656
7.48789 shl 9 = 7.48789 * (1 shl 9) = 7.48789*512 = 7.48789*(2^9) = 3833.79968
60.125 shl 14 = 60.125 * (1 shl 14) = 60.125*16384 = 60.125*(2^14) = 985088
Example of usage:
; internal usage of the function
[M_BITS (128-1)] ; 32bit * 4 dwords - 1
[<16 Tbl_Four_Floats_32Bit: D$ M_BITS, M_BITS,
M_BITS, M_BITS]
; our values to be calculated
[NumberstoShift: F$ 8.1457, 7.48789, 60.125, 1.1458]
[BitCountTbl: D$ 3, 14, 2, 24]
call SSE_SHL_FLOAT4 NumberstoShift, BitCountTbl
The example will shift left the numbers from NumberstoShift using the BitCountTbl Array.
The mathematical operation is as follows (results before rounding to single precision):
8.1457 shl 3 = 65.1656
7.48789 shl 14 = 122681.58976
60.125 shl 2 = 240.5
1.1458 shl 24 = 19223334.0928
References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2
;;
;[MANTISSA_32_BITS 23] ; The mantissa on a 32 bits float number
;[MANTISSA_64_BITS 52] ; The mantissa on a 64 bits float number
;[MANTISSA_80_BITS 64] ; The mantissa on a 80 bits float number
Proc SSE_shl_Float4:
Arguments @pTblNumber, @pTblBitCount
Uses eax
; 1st we calculate the power of 2 (1 shl x) = 2^x
mov eax D@pTblBitCount
movups xmm0 X$eax | paddd xmm0 X$Tbl_Four_Floats_32Bit
pslld xmm0 MANTISSA_32_BITS
mov eax D@pTblNumber | movups xmm1 X$eax
; y shl x = y*(2^x)
mulps xmm0 xmm1
EndP
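For the float version the same 2^x construction is an ordinary scale by a power of two. A scalar Python model of one lane, assuming the inputs are first rounded to single precision; to_single, pow2_single and scale_lane are hypothetical names. The result can be checked against math.ldexp.

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def pow2_single(n: int) -> float:
    """2.0**n built from raw bits: biased exponent n + 127, mantissa 0."""
    if not -126 <= n <= 127:
        raise ValueError("exponent outside the normal single-precision range")
    return struct.unpack("<f", struct.pack("<I", (n + 127) << 23))[0]


def scale_lane(x: float, n: int) -> float:
    """One lane of mulps: x * 2**n in single precision."""
    return to_single(to_single(x) * pow2_single(n))


if __name__ == "__main__":
    for x, n in zip((8.1457, 7.48789, 60.125, 1.1458), (3, 14, 2, 24)):
        # Both columns should agree, since scaling by 2**n is exact here.
        print(scale_lane(x, n), math.ldexp(to_single(x), n))
```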
For doubles
[BITS_FMT64 64]
[M_BITS2 1023] ; exponent bias of a 64 bits float number (a 32 bits float uses 127)
[<16 Tbl_Four_Floats_64Bit: Q$ M_BITS2, M_BITS2]
Proc SSE_shl_Double:
Arguments @pTblNumber, @pTblBitCount
Uses eax
; 1st we calculate the power of 2 (1 shl x) = 2^x
mov eax D@pTblBitCount
; movdqu loads both qword bit counts (movq would load only the first one)
movdqu xmm0 X$eax | paddq xmm0 X$Tbl_Four_Floats_64Bit
psllq xmm0 MANTISSA_64_BITS
mov eax D@pTblNumber | movups xmm1 X$eax
; y shl x = y*(2^x)
mulpd xmm0 xmm1
EndP
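For doubles the construction is the same, but the exponent bias has to be 1023 (not the 127 used for singles) and the shift is 52 bits. A scalar Python model of one lane, with pow2_double and shl_double as hypothetical names; the last line of the demo shows what a bias of 127 would produce instead.

```python
import struct

DOUBLE_BIAS = 1023           # exponent bias of an IEEE 754 double
DOUBLE_MANTISSA_BITS = 52    # explicit mantissa bits of a double


def pow2_double(n: int, bias: int = DOUBLE_BIAS) -> float:
    """2.0**n built from raw bits: the paddq + psllq step."""
    bits = (n + bias) << DOUBLE_MANTISSA_BITS
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def shl_double(x: float, n: int) -> float:
    """One lane of mulpd: x * 2**n."""
    return x * pow2_double(n)


if __name__ == "__main__":
    for x, n in zip((6.0, 1.0), (2, 24)):
        print(x, "shl", n, "=", shl_double(x, n))
    # With the single-precision bias the factor is 2**(n - 896), not 2**n:
    print(pow2_double(2, bias=127))
```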
Examples of usage:
Shifting Left Integers
[NumberstoShift: D$ 8, 7, 6, 1]
[BitCountTbl: D$ 3, 14, 2, 24]
call SSE_SHL_INT4 NumberstoShift, BitCountTbl
Shifting Left Float32
[NumberstoShift2: F$ 8, 7, 6, 1]
[BitCountTbl: D$ 3, 14, 2, 24]
call SSE_shl_Float4 NumberstoShift2, BitCountTbl
Shifting Left Doubles
[NumberstoShift3: R$ 6, 1]
[BitCountTbl2: Q$ 2, 24]
call SSE_shl_Double NumberstoShift3, BitCountTbl2
References:
https://www.cin.ufpe.br/~if817/06-FPU.pdf
https://stackoverflow.com/questions/57454416/sse-integer-2n-powers-of-2-for-32-bit-integers-without-avx2
For self-modifying code it's 10% faster to place the modified code and modifying function in separate pages.
AVX512 has rotation and I've yet to see a use for it.
Quote from: InfiniteLoop on August 21, 2023, 05:52:49 AM
For self-modifying code it's 10% faster to place the modified code and modifying function in separate pages.
AVX512 has rotation and I've yet to see a use for it.
Thanks, InfiniteLoop, very interesting :thumbsup:
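On the rotation point: as far as I know the AVX-512 rotates (vprold/vprolq and their variable-count forms) work per 32-bit or 64-bit element, not across the whole register. With plain SSE2 the usual substitute is two shifts and an OR (pslld, psrld, por). A scalar Python model of one 32-bit lane, with rol32 as a made-up name:

```python
def rol32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits (count taken modulo 32)."""
    n %= 32
    x &= 0xFFFFFFFF
    # Left part shifted up, wrapped-around part shifted down, then merged.
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


if __name__ == "__main__":
    print(f"{rol32(0x12345678, 8):08X}")  # 34567812
```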