Bit reverse for 64 bit data in SSE2

guga · August 27, 2024, 04:52:20 AM

Hi guys

someone suceeded to create a working 64bit reverse algorithm in SSE2 ? I´m trying to but i´m failing to do so. I mean, for masm32 and not masm64

The goal is reverse the bits of a Qword using SSE2. Ex:

Original Value:
0xC96C5795D7870F42 = 1100 1001 0110 1100 0101 0111 1001 0101 1101 0111 1000 0111 0000 1111 0100 0010

Bit reversed value:
0x42F0E1EBA9EA3693 = 0100 0010 1111 0000 1110 0001 1110 1011 1010 1001 1110 1010 0011 0110 1001 0011

I tried it, but i´m having weird results as:
0x420F87D795576CC9 = 0100 0010 0000 1111 1000 0111 1101 0111 1001 0101 0101 0111 0110 1100 1100 1001

I have no idea what i´m dong wrong.

What i did was:

Code Select


[<16 SwapMaskA: Q$       0_FFFFFFFF]
[<16 SwapMaskB: Q$   0FFFF_0000FFFF]
[<16 SwapMaskC: Q$ 0FF00FF_00FF00FF]


Proc ReverseBits64_SSE2:

    ; Input: xmm0 = 64-bit value to reverse the bits
    ; Output: xmm0 = bit-reversed value
    movdqu xmm1 xmm0            ; Copy the original value
    movdqu xmm2 xmm0

    ; First step: ((x & 0x00000000FFFFFFFF) << 32) ^ ((x >> 32) & 0x00000000FFFFFFFF)
    ; ((x & 0x00000000FFFFFFFF) << 32)
    pand xmm1 X$SwapMaskA
    psllq xmm1 32

    ; (x >> 32) & 0x00000000ffffffff
    psrlq xmm2 32         ; Shift right by 64-bit (equivalent to >> 32 for quadword)
    pand xmm2 X$SwapMaskA

    ; XOR the results
    pxor xmm2 xmm1


    ; Perform the second step ((x & 0x0000FFFF0000FFFF) << 16) ^ ((x >> 16) & 0x0000FFFF0000FFFF)
    ; (x & 0x0000ffff0000ffff) << 16
    movdqu xmm1 xmm2
    pand xmm1 X$SwapMaskB
    psllq xmm1 16

    ; (x >> 16) & 0x0000ffff0000ffff
    psrlq xmm2 16         ; Shift right by 16-bit (equivalent to >> 16 for quadword)
    pand xmm2 X$SwapMaskB

    ; XOR the results
    pxor xmm2 xmm1

    ; Third step: ((x & 0x00FF00FF00FF00FF) << 8) ^ ((x >> 8) & 0x00FF00FF00FF00FF)
    ; (x & 0x00ff00ff00ff00ff) << 8
    movdqu xmm0 xmm2
    pand xmm0 X$SwapMaskC
    psllq xmm0 8

    ; (x >> 8) & 0x00ff00ff_00ff00ff
    psrlq xmm2 8         ; Shift right by 8-bit (equivalent to >> 8 for quadword)
    pand xmm2 X$SwapMaskC

    ; XOR the results
    pxor xmm0 xmm2

EndP

zedd · August 27, 2024, 05:17:36 AM

~~Looks like the algo is swapping the bits in each byte, rather than reversing the bits on the full qword (as compared to expected results)~~

On closer examination, it appears to be reversing the byte order ... rather than reversing the bits of the full qword...

The only instructions starting with 'p' I am familiar with are 'push' and 'pop', though

so I cannot offer any help on this. Just giving an observation.

guga · August 27, 2024, 06:01:59 AM

Tks, zedd

IN ssse3 it can be done as this, i suppose

https://www.embeddedrelated.com/showthread/comp.arch.embedded/121818-4.php

> movdqa xmm2, xmm6
> movdqa xmm3, xmm5
>
> pshufb xmm2, xmm0
> pshufb xmm3, xmm1
>
> por xmm2, xmm3
>
> ; output = xmm2
>
> Especially on an i7, that would be very fast. On a Core2 it'd be nice too.
> That gives this routine plenty of potential users.

But, i need it for SSE2. On the link there´s a example but it is for 64bit

Siekmanski · August 27, 2024, 07:27:59 PM

Maybe this is useful to you or gives you an idea......

Fast Bit Reversal algorithm
https://masm32.com/board/index.php?topic=5196.0

guga · August 28, 2024, 12:15:29 AM

Tks Marinus

On your code, i´m a bit confused, where is the input value for a qword ? Do i need to input the value here ? BitRevMatrixConstant

Also, i´m taking a look on DrizzNew algo, but it is for 32 bit (dword) at a time, right ? If drizz algo is only for Dword, it´s fast, but for qword i used a table as below (Didn´t benchmarked yet):

Code Select

[rbits: B$ 0, 080, 040, 0C0, 020, 0A0, 060, 0E0
        B$ 010, 090, 050, 0D0, 030, 0B0, 070, 0F0
        B$ 08, 088, 048, 0C8, 028, 0A8, 068, 0E8
        B$ 018, 098, 058, 0D8, 038, 0B8, 078, 0F8
        B$ 04, 084, 044, 0C4, 024, 0A4, 064, 0E4
        B$ 014, 094, 054, 0D4, 034, 0B4, 074, 0F4
        B$ 0C, 08C, 04C, 0CC, 02C, 0AC, 06C, 0EC
        B$ 01C, 09C, 05C, 0DC, 03C, 0BC, 07C, 0FC
        B$ 02, 082, 042, 0C2, 022, 0A2, 062, 0E2
        B$ 012, 092, 052, 0D2, 032, 0B2, 072, 0F2
        B$ 0A, 08A, 04A, 0CA, 02A, 0AA, 06A, 0EA
        B$ 01A, 09A, 05A, 0DA, 03A, 0BA, 07A, 0FA
        B$ 06, 086, 046, 0C6, 026, 0A6, 066, 0E6
        B$ 016, 096, 056, 0D6, 036, 0B6, 076, 0F6
        B$ 0E, 08E, 04E, 0CE, 02E, 0AE, 06E, 0EE
        B$ 01E, 09E, 05E, 0DE, 03E, 0BE, 07E, 0FE
        B$ 01, 081, 041, 0C1, 021, 0A1, 061, 0E1
        B$ 011, 091, 051, 0D1, 031, 0B1, 071, 0F1
        B$ 09, 089, 049, 0C9, 029, 0A9, 069, 0E9
        B$ 019, 099, 059, 0D9, 039, 0B9, 079, 0F9
        B$ 05, 085, 045, 0C5, 025, 0A5, 065, 0E5
        B$ 015, 095, 055, 0D5, 035, 0B5, 075, 0F5
        B$ 0D, 08D, 04D, 0CD, 02D, 0AD, 06D, 0ED
        B$ 01D, 09D, 05D, 0DD, 03D, 0BD, 07D, 0FD
        B$ 03, 083, 043, 0C3, 023, 0A3, 063, 0E3
        B$ 013, 093, 053, 0D3, 033, 0B3, 073, 0F3
        B$ 0B, 08B, 04B, 0CB, 02B, 0AB, 06B, 0EB
        B$ 01B, 09B, 05B, 0DB, 03B, 0BB, 07B, 0FB
        B$ 07, 087, 047, 0C7, 027, 0A7, 067, 0E7
        B$ 017, 097, 057, 0D7, 037, 0B7, 077, 0F7
        B$ 0F, 08F, 04F, 0CF, 02F, 0AF, 06F, 0EF
        B$ 01F, 09F, 05F, 0DF, 03F, 0BF, 07F, 0FF]
; https://www.embeddedrelated.com/showthread/comp.arch.embedded/121818-4.php
; https://github.com/gbraad/pxt-reversebit/blob/master/main.ts
Proc ReverseBits64:
    Arguments @pInputQword, @OutputValue

    mov esi D@pInputQword
    mov edi D@OutputValue

    movzx eax B$esi | movzx eax B$rbits+eax | mov B$edi al
    movzx eax B$esi+1 | movzx eax B$rbits+eax | mov B$edi+1 al
    movzx eax B$esi+2 | movzx eax B$rbits+eax | mov B$edi+2 al
    movzx eax B$esi+3 | movzx eax B$rbits+eax | mov B$edi+3 al
    movzx eax B$esi+4 | movzx eax B$rbits+eax | mov B$edi+4 al
    movzx eax B$esi+5 | movzx eax B$rbits+eax | mov B$edi+5 al
    movzx eax B$esi+6 | movzx eax B$rbits+eax | mov B$edi+6 al
    movzx eax B$esi+7 | movzx eax B$rbits+eax | mov B$edi+7 al


EndP

Example of usage:

Code Select

[Guga: Q$ 0] ; output the reversed value
[crc64_poly_Ecma: Q$ 0C96C5795D7870F42] ; CRC64 ecma polynomial

call ReverseBits64 crc64_poly_Ecma, Guga

InfiniteLoop · August 28, 2024, 06:09:04 AM

Divide and conquer.

ShufB BYTE 7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8
And1 BYTE DUP 16 (11110000b)
And2 BYTE DUP 16 (11001100b)
And3 BYTE DUP 16 (10101010b)

vmovdqu xmm0,[inputx2]
vpshufb xmm0,xmm0,xmmword ptr [ShufB] ;reverse bytes
vpsllq xmm1,xmm0,4 ;0000 1111
vpsrlq xmm2,xmm0,4 ;1111 xxxx
vmovdqu xmm3,xmmword ptr [And1]
vpand xmm2,xmm2,xmm3
vpor xmm0,xmm1,xmm2
...

guga · August 28, 2024, 08:04:39 AM

Tks Infiniteloop, but i´m looking a way in SSE2 for 32-bit. I guess i suceeded to convert drizzNew stir-fry algorithm to be used in a Qword.

https://masmforum.com/board/index.php/topic,12722.msg98224/sslRedirect.html?PHPSESSID=d9917beed356c22c4b0ba4c97bc18307#msg98224

This is the code i managed to work so far:

Code Select


[<16 mask0: D$ 0F0F0F0F0 0F0F0F0F0]
[<16 mask1: D$  0F0F0F0F  0F0F0F0F]
[<16 mask2: D$ 033333333 033333333]
[<16 mask3: D$ 055555555 055555555]
[<16 mask4: D$ 0CCCCCCCC 0CCCCCCCC]
[<16 mask5: D$ 0AAAAAAAA 0AAAAAAAA]

Proc reverse_qword_sse2:
    Arguments @Input

    mov eax D@Input
    mov ecx D$eax | bswap ecx
    mov edx D$eax+4 | bswap edx
    movd xmm0 ecx | pslldq xmm0 4
    movd xmm1 edx | por xmm0 xmm1

    ; Step 1: Swap adjacent nibbles
    movdqu xmm1 xmm0 ; xmm1 = edx
    pand xmm1 X$mask1  ; and edx 0F0F0F0F - mask0 = 0x0F0F0F0F0F0F0F0F
    pand xmm0 X$mask0  ; and eax 0F0F0F0F0 - mask0 = 0x0F0F0F0F0F0F0F0F
    psllq xmm1 4                  ; shl edx 4 xmm0 = high nibbles shifted right
    psrlq xmm0 4                  ; shr eax 4 xmm0 = high nibbles shifted right
    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibbles
                                    ; xmm1 = edx xmm0 = eax
;    movdqu xmm2 xmm0

    movdqu xmm1 xmm0
    pand xmm1 X$mask2  ; mov edx 033333333 | and edx eax
    pand xmm0 X$mask4  ; and eax 0CCCCCCCC
    psrlq xmm0 2                  ; shr eax 4 xmm0 = high nibbles shifted right
    psllq xmm1 2                  ; shl edx 4 xmm0 = high nibbles shifted right
    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibbles

    movdqu xmm1 xmm0
    pand xmm1 X$mask3  ; mov edx 033333333 | and edx eax
    pand xmm0 X$mask5  ; and eax 0CCCCCCCC
    psrlq xmm0 1                  ; shr eax 4 xmm0 = high nibbles shifted right
    psllq xmm1 1                  ; shl edx 4 xmm0 = high nibbles shifted right
    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibbles

EndP

example of usage:
[crc64_poly_Ecma: Q$ 0C96C5795D7870F42]
call reverse_qword_sse2 crc64_poly_Ecma

will output in xmm0 the reverted bits = 0x42F0 E1EB A9EA 3693

I´ll review the code and comments later, and make it export on his own buffer or the proper output buffer. I´ll only clean up the code a little and see if it can be optimized further

Btw...maybe it be a bit faster if i suceed to recreate the bswap for SSE2, this could remve the extra lines at the start that uses 2 bswap

The MASM Forum

News:

Bit reverse for 64 bit data in SSE2

guga

zedd

guga

Siekmanski

guga

InfiniteLoop

guga