Hi guys
someone suceeded to create a working 64bit reverse algorithm in SSE2 ? I´m trying to but i´m failing to do so. I mean, for masm32 and not masm64
The goal is reverse the bits of a Qword using SSE2. Ex:
Original Value:
0xC96C5795D7870F42 = 1100 1001 0110 1100 0101 0111 1001 0101 1101 0111 1000 0111 0000 1111 0100 0010
Bit reversed value:
0x42F0E1EBA9EA3693 = 0100 0010 1111 0000 1110 0001 1110 1011 1010 1001 1110 1010 0011 0110 1001 0011
I tried it, but i´m having weird results as:
0x420F87D795576CC9 = 0100 0010 0000 1111 1000 0111 1101 0111 1001 0101 0101 0111 0110 1100 1100 1001
I have no idea what i´m dong wrong.
What i did was:
[<16 SwapMaskA: Q$ 0_FFFFFFFF]
[<16 SwapMaskB: Q$ 0FFFF_0000FFFF]
[<16 SwapMaskC: Q$ 0FF00FF_00FF00FF]
Proc ReverseBits64_SSE2:
; Input: xmm0 = 64-bit value to reverse the bits
; Output: xmm0 = bit-reversed value
movdqu xmm1 xmm0 ; Copy the original value
movdqu xmm2 xmm0
; First step: ((x & 0x00000000FFFFFFFF) << 32) ^ ((x >> 32) & 0x00000000FFFFFFFF)
; ((x & 0x00000000FFFFFFFF) << 32)
pand xmm1 X$SwapMaskA
psllq xmm1 32
; (x >> 32) & 0x00000000ffffffff
psrlq xmm2 32 ; Shift right by 64-bit (equivalent to >> 32 for quadword)
pand xmm2 X$SwapMaskA
; XOR the results
pxor xmm2 xmm1
; Perform the second step ((x & 0x0000FFFF0000FFFF) << 16) ^ ((x >> 16) & 0x0000FFFF0000FFFF)
; (x & 0x0000ffff0000ffff) << 16
movdqu xmm1 xmm2
pand xmm1 X$SwapMaskB
psllq xmm1 16
; (x >> 16) & 0x0000ffff0000ffff
psrlq xmm2 16 ; Shift right by 16-bit (equivalent to >> 16 for quadword)
pand xmm2 X$SwapMaskB
; XOR the results
pxor xmm2 xmm1
; Third step: ((x & 0x00FF00FF00FF00FF) << 8) ^ ((x >> 8) & 0x00FF00FF00FF00FF)
; (x & 0x00ff00ff00ff00ff) << 8
movdqu xmm0 xmm2
pand xmm0 X$SwapMaskC
psllq xmm0 8
; (x >> 8) & 0x00ff00ff_00ff00ff
psrlq xmm2 8 ; Shift right by 8-bit (equivalent to >> 8 for quadword)
pand xmm2 X$SwapMaskC
; XOR the results
pxor xmm0 xmm2
EndP
Looks like the algo is swapping the bits in each byte, rather than reversing the bits on the full qword (as compared to expected results)
On closer examination, it appears to be reversing the byte order ... rather than reversing the bits of the full qword...
The only instructions starting with 'p' I am familiar with are 'push' and 'pop', though :sad: so I cannot offer any help on this. Just giving an observation.
Tks, zedd
IN ssse3 it can be done as this, i suppose
https://www.embeddedrelated.com/showthread/comp.arch.embedded/121818-4.php (https://www.embeddedrelated.com/showthread/comp.arch.embedded/121818-4.php)
> movdqa xmm2, xmm6
> movdqa xmm3, xmm5
>
> pshufb xmm2, xmm0
> pshufb xmm3, xmm1
>
> por xmm2, xmm3
>
> ; output = xmm2
>
> Especially on an i7, that would be very fast. On a Core2 it'd be nice too.
> That gives this routine plenty of potential users.
But, i need it for SSE2. On the link there´s a example but it is for 64bit
Maybe this is useful to you or gives you an idea......
Fast Bit Reversal algorithm
https://masm32.com/board/index.php?topic=5196.0 (https://masm32.com/board/index.php?topic=5196.0)
Tks Marinus
On your code, i´m a bit confused, where is the input value for a qword ? Do i need to input the value here ? BitRevMatrixConstant
Also, i´m taking a look on DrizzNew algo, but it is for 32 bit (dword) at a time, right ? If drizz algo is only for Dword, it´s fast, but for qword i used a table as below (Didn´t benchmarked yet):
[rbits: B$ 0, 080, 040, 0C0, 020, 0A0, 060, 0E0
B$ 010, 090, 050, 0D0, 030, 0B0, 070, 0F0
B$ 08, 088, 048, 0C8, 028, 0A8, 068, 0E8
B$ 018, 098, 058, 0D8, 038, 0B8, 078, 0F8
B$ 04, 084, 044, 0C4, 024, 0A4, 064, 0E4
B$ 014, 094, 054, 0D4, 034, 0B4, 074, 0F4
B$ 0C, 08C, 04C, 0CC, 02C, 0AC, 06C, 0EC
B$ 01C, 09C, 05C, 0DC, 03C, 0BC, 07C, 0FC
B$ 02, 082, 042, 0C2, 022, 0A2, 062, 0E2
B$ 012, 092, 052, 0D2, 032, 0B2, 072, 0F2
B$ 0A, 08A, 04A, 0CA, 02A, 0AA, 06A, 0EA
B$ 01A, 09A, 05A, 0DA, 03A, 0BA, 07A, 0FA
B$ 06, 086, 046, 0C6, 026, 0A6, 066, 0E6
B$ 016, 096, 056, 0D6, 036, 0B6, 076, 0F6
B$ 0E, 08E, 04E, 0CE, 02E, 0AE, 06E, 0EE
B$ 01E, 09E, 05E, 0DE, 03E, 0BE, 07E, 0FE
B$ 01, 081, 041, 0C1, 021, 0A1, 061, 0E1
B$ 011, 091, 051, 0D1, 031, 0B1, 071, 0F1
B$ 09, 089, 049, 0C9, 029, 0A9, 069, 0E9
B$ 019, 099, 059, 0D9, 039, 0B9, 079, 0F9
B$ 05, 085, 045, 0C5, 025, 0A5, 065, 0E5
B$ 015, 095, 055, 0D5, 035, 0B5, 075, 0F5
B$ 0D, 08D, 04D, 0CD, 02D, 0AD, 06D, 0ED
B$ 01D, 09D, 05D, 0DD, 03D, 0BD, 07D, 0FD
B$ 03, 083, 043, 0C3, 023, 0A3, 063, 0E3
B$ 013, 093, 053, 0D3, 033, 0B3, 073, 0F3
B$ 0B, 08B, 04B, 0CB, 02B, 0AB, 06B, 0EB
B$ 01B, 09B, 05B, 0DB, 03B, 0BB, 07B, 0FB
B$ 07, 087, 047, 0C7, 027, 0A7, 067, 0E7
B$ 017, 097, 057, 0D7, 037, 0B7, 077, 0F7
B$ 0F, 08F, 04F, 0CF, 02F, 0AF, 06F, 0EF
B$ 01F, 09F, 05F, 0DF, 03F, 0BF, 07F, 0FF]
; https://www.embeddedrelated.com/showthread/comp.arch.embedded/121818-4.php
; https://github.com/gbraad/pxt-reversebit/blob/master/main.ts
Proc ReverseBits64:
Arguments @pInputQword, @OutputValue
mov esi D@pInputQword
mov edi D@OutputValue
movzx eax B$esi | movzx eax B$rbits+eax | mov B$edi al
movzx eax B$esi+1 | movzx eax B$rbits+eax | mov B$edi+1 al
movzx eax B$esi+2 | movzx eax B$rbits+eax | mov B$edi+2 al
movzx eax B$esi+3 | movzx eax B$rbits+eax | mov B$edi+3 al
movzx eax B$esi+4 | movzx eax B$rbits+eax | mov B$edi+4 al
movzx eax B$esi+5 | movzx eax B$rbits+eax | mov B$edi+5 al
movzx eax B$esi+6 | movzx eax B$rbits+eax | mov B$edi+6 al
movzx eax B$esi+7 | movzx eax B$rbits+eax | mov B$edi+7 al
EndP
Example of usage:
[Guga: Q$ 0] ; output the reversed value
[crc64_poly_Ecma: Q$ 0C96C5795D7870F42] ; CRC64 ecma polynomial
call ReverseBits64 crc64_poly_Ecma, Guga
Divide and conquer.
ShufB BYTE 7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8
And1 BYTE DUP 16 (11110000b)
And2 BYTE DUP 16 (11001100b)
And3 BYTE DUP 16 (10101010b)
vmovdqu xmm0,[inputx2]
vpshufb xmm0,xmm0,xmmword ptr [ShufB] ;reverse bytes
vpsllq xmm1,xmm0,4 ;0000 1111
vpsrlq xmm2,xmm0,4 ;1111 xxxx
vmovdqu xmm3,xmmword ptr [And1]
vpand xmm2,xmm2,xmm3
vpor xmm0,xmm1,xmm2
...
Tks Infiniteloop, but i´m looking a way in SSE2 for 32-bit. I guess i suceeded to convert drizzNew stir-fry algorithm to be used in a Qword.
https://masmforum.com/board/index.php/topic,12722.msg98224/sslRedirect.html?PHPSESSID=d9917beed356c22c4b0ba4c97bc18307#msg98224 (https://masmforum.com/board/index.php/topic,12722.msg98224/sslRedirect.html?PHPSESSID=d9917beed356c22c4b0ba4c97bc18307#msg98224)
This is the code i managed to work so far:
[<16 mask0: D$ 0F0F0F0F0 0F0F0F0F0]
[<16 mask1: D$ 0F0F0F0F 0F0F0F0F]
[<16 mask2: D$ 033333333 033333333]
[<16 mask3: D$ 055555555 055555555]
[<16 mask4: D$ 0CCCCCCCC 0CCCCCCCC]
[<16 mask5: D$ 0AAAAAAAA 0AAAAAAAA]
Proc reverse_qword_sse2:
Arguments @Input
mov eax D@Input
mov ecx D$eax | bswap ecx
mov edx D$eax+4 | bswap edx
movd xmm0 ecx | pslldq xmm0 4
movd xmm1 edx | por xmm0 xmm1
; Step 1: Swap adjacent nibbles
movdqu xmm1 xmm0 ; xmm1 = edx
pand xmm1 X$mask1 ; and edx 0F0F0F0F - mask0 = 0x0F0F0F0F0F0F0F0F
pand xmm0 X$mask0 ; and eax 0F0F0F0F0 - mask0 = 0x0F0F0F0F0F0F0F0F
psllq xmm1 4 ; shl edx 4 xmm0 = high nibbles shifted right
psrlq xmm0 4 ; shr eax 4 xmm0 = high nibbles shifted right
por xmm0 xmm1 ; or eax edx xmm0 = merged nibbles
; xmm1 = edx xmm0 = eax
; movdqu xmm2 xmm0
movdqu xmm1 xmm0
pand xmm1 X$mask2 ; mov edx 033333333 | and edx eax
pand xmm0 X$mask4 ; and eax 0CCCCCCCC
psrlq xmm0 2 ; shr eax 4 xmm0 = high nibbles shifted right
psllq xmm1 2 ; shl edx 4 xmm0 = high nibbles shifted right
por xmm0 xmm1 ; or eax edx xmm0 = merged nibbles
movdqu xmm1 xmm0
pand xmm1 X$mask3 ; mov edx 033333333 | and edx eax
pand xmm0 X$mask5 ; and eax 0CCCCCCCC
psrlq xmm0 1 ; shr eax 4 xmm0 = high nibbles shifted right
psllq xmm1 1 ; shl edx 4 xmm0 = high nibbles shifted right
por xmm0 xmm1 ; or eax edx xmm0 = merged nibbles
EndP
example of usage:
[crc64_poly_Ecma: Q$ 0C96C5795D7870F42]
call reverse_qword_sse2 crc64_poly_Ecma
will output in xmm0 the reverted bits = 0x42F0 E1EB A9EA 3693
I´ll review the code and comments later, and make it export on his own buffer or the proper output buffer. I´ll only clean up the code a little and see if it can be optimized further
Btw...maybe it be a bit faster if i suceed to recreate the bswap for SSE2, this could remve the extra lines at the start that uses 2 bswap