## News:

Message to All Guests
NB: Posting URL's See here: Posted URL Change

## Bit reverse for 64 bit data in SSE2

Started by guga, August 27, 2024, 04:52:20 AM

#### guga

Hi guys

someone suceeded to create a working 64bit reverse algorithm in SSE2 ? I´m trying to but i´m failing to do so. I mean, for masm32 and not masm64

The goal is reverse the bits of a Qword using SSE2. Ex:

Original Value:
0xC96C5795D7870F42 = 1100 1001 0110 1100 0101 0111 1001 0101 1101 0111 1000 0111 0000 1111 0100 0010

Bit reversed value:
0x42F0E1EBA9EA3693 = 0100 0010 1111 0000 1110 0001 1110 1011 1010 1001 1110 1010 0011 0110 1001 0011

I tried it, but i´m having weird results as:
0x420F87D795576CC9 = 0100 0010 0000 1111 1000 0111 1101 0111 1001 0101 0101 0111 0110 1100 1100 1001

I have no idea what i´m dong wrong.

What i did was:

`[<16 SwapMaskA: Q\$       0_FFFFFFFF][<16 SwapMaskB: Q\$   0FFFF_0000FFFF][<16 SwapMaskC: Q\$ 0FF00FF_00FF00FF]Proc ReverseBits64_SSE2:    ; Input: xmm0 = 64-bit value to reverse the bits    ; Output: xmm0 = bit-reversed value    movdqu xmm1 xmm0            ; Copy the original value    movdqu xmm2 xmm0    ; First step: ((x & 0x00000000FFFFFFFF) << 32) ^ ((x >> 32) & 0x00000000FFFFFFFF)    ; ((x & 0x00000000FFFFFFFF) << 32)    pand xmm1 X\$SwapMaskA    psllq xmm1 32    ; (x >> 32) & 0x00000000ffffffff    psrlq xmm2 32         ; Shift right by 64-bit (equivalent to >> 32 for quadword)    pand xmm2 X\$SwapMaskA    ; XOR the results    pxor xmm2 xmm1    ; Perform the second step ((x & 0x0000FFFF0000FFFF) << 16) ^ ((x >> 16) & 0x0000FFFF0000FFFF)    ; (x & 0x0000ffff0000ffff) << 16    movdqu xmm1 xmm2    pand xmm1 X\$SwapMaskB    psllq xmm1 16    ; (x >> 16) & 0x0000ffff0000ffff    psrlq xmm2 16         ; Shift right by 16-bit (equivalent to >> 16 for quadword)    pand xmm2 X\$SwapMaskB    ; XOR the results    pxor xmm2 xmm1    ; Third step: ((x & 0x00FF00FF00FF00FF) << 8) ^ ((x >> 8) & 0x00FF00FF00FF00FF)    ; (x & 0x00ff00ff00ff00ff) << 8    movdqu xmm0 xmm2    pand xmm0 X\$SwapMaskC    psllq xmm0 8    ; (x >> 8) & 0x00ff00ff_00ff00ff    psrlq xmm2 8         ; Shift right by 8-bit (equivalent to >> 8 for quadword)    pand xmm2 X\$SwapMaskC    ; XOR the results    pxor xmm0 xmm2EndP`
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### zedd151

Looks like the algo is swapping the bits in each byte, rather than reversing the bits on the full qword (as compared to expected results)

On closer examination, it appears to be reversing the byte order ... rather than reversing the bits of the full qword...

The only instructions starting with 'p' I am familiar with are 'push' and 'pop', though    so I cannot offer any help on this. Just giving an observation.

#### guga

#2
Tks, zedd

IN ssse3 it can be done as this, i suppose

> movdqa xmm2, xmm6
> movdqa xmm3, xmm5
>
> pshufb xmm2, xmm0
> pshufb xmm3, xmm1
>
> por xmm2, xmm3
>
> ; output = xmm2
>
> Especially on an i7, that would be very fast. On a Core2 it'd be nice too.
> That gives this routine plenty of potential users.

But, i need it for SSE2. On the link there´s a example but it is for 64bit
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### Siekmanski

Maybe this is useful to you or gives you an idea......

Fast Bit Reversal algorithm
https://masm32.com/board/index.php?topic=5196.0
Creative coders use backward thinking techniques as a strategy.

#### guga

Tks Marinus

On your code, i´m a bit confused, where is the input value for a qword ? Do i need to input the value here ? BitRevMatrixConstant

Also, i´m taking a look on DrizzNew algo, but it is for 32 bit (dword) at a time, right ?  If drizz algo is only for Dword, it´s fast, but for qword i used a table as below (Didn´t benchmarked yet):

`[rbits: B\$ 0, 080, 040, 0C0, 020, 0A0, 060, 0E0        B\$ 010, 090, 050, 0D0, 030, 0B0, 070, 0F0        B\$ 08, 088, 048, 0C8, 028, 0A8, 068, 0E8        B\$ 018, 098, 058, 0D8, 038, 0B8, 078, 0F8        B\$ 04, 084, 044, 0C4, 024, 0A4, 064, 0E4        B\$ 014, 094, 054, 0D4, 034, 0B4, 074, 0F4        B\$ 0C, 08C, 04C, 0CC, 02C, 0AC, 06C, 0EC        B\$ 01C, 09C, 05C, 0DC, 03C, 0BC, 07C, 0FC        B\$ 02, 082, 042, 0C2, 022, 0A2, 062, 0E2        B\$ 012, 092, 052, 0D2, 032, 0B2, 072, 0F2        B\$ 0A, 08A, 04A, 0CA, 02A, 0AA, 06A, 0EA        B\$ 01A, 09A, 05A, 0DA, 03A, 0BA, 07A, 0FA        B\$ 06, 086, 046, 0C6, 026, 0A6, 066, 0E6        B\$ 016, 096, 056, 0D6, 036, 0B6, 076, 0F6        B\$ 0E, 08E, 04E, 0CE, 02E, 0AE, 06E, 0EE        B\$ 01E, 09E, 05E, 0DE, 03E, 0BE, 07E, 0FE        B\$ 01, 081, 041, 0C1, 021, 0A1, 061, 0E1        B\$ 011, 091, 051, 0D1, 031, 0B1, 071, 0F1        B\$ 09, 089, 049, 0C9, 029, 0A9, 069, 0E9        B\$ 019, 099, 059, 0D9, 039, 0B9, 079, 0F9        B\$ 05, 085, 045, 0C5, 025, 0A5, 065, 0E5        B\$ 015, 095, 055, 0D5, 035, 0B5, 075, 0F5        B\$ 0D, 08D, 04D, 0CD, 02D, 0AD, 06D, 0ED        B\$ 01D, 09D, 05D, 0DD, 03D, 0BD, 07D, 0FD        B\$ 03, 083, 043, 0C3, 023, 0A3, 063, 0E3        B\$ 013, 093, 053, 0D3, 033, 0B3, 073, 0F3        B\$ 0B, 08B, 04B, 0CB, 02B, 0AB, 06B, 0EB        B\$ 01B, 09B, 05B, 0DB, 03B, 0BB, 07B, 0FB        B\$ 07, 087, 047, 0C7, 027, 0A7, 067, 0E7        B\$ 017, 097, 057, 0D7, 037, 0B7, 077, 0F7        B\$ 0F, 08F, 04F, 0CF, 02F, 0AF, 06F, 0EF        B\$ 01F, 09F, 05F, 0DF, 03F, 0BF, 07F, 0FF]; https://www.embeddedrelated.com/showthread/comp.arch.embedded/121818-4.php; https://github.com/gbraad/pxt-reversebit/blob/master/main.tsProc ReverseBits64:    Arguments @pInputQword, @OutputValue    mov esi D@pInputQword    mov edi D@OutputValue    movzx eax B\$esi | movzx eax B\$rbits+eax | mov B\$edi al    movzx eax B\$esi+1 | movzx eax B\$rbits+eax | mov B\$edi+1 al    movzx eax B\$esi+2 | movzx eax B\$rbits+eax | mov B\$edi+2 al    movzx eax B\$esi+3 | movzx eax B\$rbits+eax | mov B\$edi+3 al    movzx eax B\$esi+4 | movzx eax B\$rbits+eax | mov B\$edi+4 al    movzx eax B\$esi+5 | movzx eax B\$rbits+eax | mov B\$edi+5 al    movzx eax B\$esi+6 | movzx eax B\$rbits+eax | mov B\$edi+6 al    movzx eax B\$esi+7 | movzx eax B\$rbits+eax | mov B\$edi+7 alEndP`
Example of usage:

`[Guga: Q\$ 0] ; output the reversed value[crc64_poly_Ecma: Q\$ 0C96C5795D7870F42] ; CRC64 ecma polynomialcall ReverseBits64 crc64_poly_Ecma, Guga`
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### InfiniteLoop

Divide and conquer.

ShufB BYTE 7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8
And1 BYTE DUP 16 (11110000b)
And2 BYTE DUP 16 (11001100b)
And3 BYTE DUP 16 (10101010b)

vmovdqu xmm0,[inputx2]
vpshufb xmm0,xmm0,xmmword ptr [ShufB]  ;reverse bytes
vpsllq xmm1,xmm0,4     ;0000 1111
vpsrlq xmm2,xmm0,4     ;1111 xxxx
vmovdqu xmm3,xmmword ptr [And1]
vpand xmm2,xmm2,xmm3
vpor xmm0,xmm1,xmm2
...

#### guga

#6
Tks Infiniteloop, but i´m looking a way in SSE2 for 32-bit. I guess i suceeded to convert drizzNew stir-fry algorithm to be used in a Qword.

https://masmforum.com/board/index.php/topic,12722.msg98224/sslRedirect.html?PHPSESSID=d9917beed356c22c4b0ba4c97bc18307#msg98224

This is the code i managed to work so far:

`[<16 mask0: D\$ 0F0F0F0F0 0F0F0F0F0][<16 mask1: D\$  0F0F0F0F  0F0F0F0F][<16 mask2: D\$ 033333333 033333333][<16 mask3: D\$ 055555555 055555555][<16 mask4: D\$ 0CCCCCCCC 0CCCCCCCC][<16 mask5: D\$ 0AAAAAAAA 0AAAAAAAA]Proc reverse_qword_sse2:    Arguments @Input    mov eax D@Input    mov ecx D\$eax | bswap ecx    mov edx D\$eax+4 | bswap edx    movd xmm0 ecx | pslldq xmm0 4    movd xmm1 edx | por xmm0 xmm1    ; Step 1: Swap adjacent nibbles    movdqu xmm1 xmm0 ; xmm1 = edx    pand xmm1 X\$mask1  ; and edx 0F0F0F0F - mask0 = 0x0F0F0F0F0F0F0F0F    pand xmm0 X\$mask0  ; and eax 0F0F0F0F0 - mask0 = 0x0F0F0F0F0F0F0F0F    psllq xmm1 4                  ; shl edx 4 xmm0 = high nibbles shifted right    psrlq xmm0 4                  ; shr eax 4 xmm0 = high nibbles shifted right    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibbles                                    ; xmm1 = edx xmm0 = eax;    movdqu xmm2 xmm0    movdqu xmm1 xmm0    pand xmm1 X\$mask2  ; mov edx 033333333 | and edx eax    pand xmm0 X\$mask4  ; and eax 0CCCCCCCC    psrlq xmm0 2                  ; shr eax 4 xmm0 = high nibbles shifted right    psllq xmm1 2                  ; shl edx 4 xmm0 = high nibbles shifted right    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibbles    movdqu xmm1 xmm0    pand xmm1 X\$mask3  ; mov edx 033333333 | and edx eax    pand xmm0 X\$mask5  ; and eax 0CCCCCCCC    psrlq xmm0 1                  ; shr eax 4 xmm0 = high nibbles shifted right    psllq xmm1 1                  ; shl edx 4 xmm0 = high nibbles shifted right    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibblesEndP`
example of usage:
[crc64_poly_Ecma: Q\$ 0C96C5795D7870F42]
call reverse_qword_sse2 crc64_poly_Ecma

will output in xmm0 the reverted bits = 0x42F0 E1EB A9EA 3693

I´ll review the code and comments later, and make it export on his own buffer or the proper output buffer. I´ll only clean up the code a little and see if it can be optimized further

Btw...maybe it be a bit faster if i suceed to recreate the bswap for SSE2, this could remve the extra lines at the start that uses 2 bswap
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com