News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Bit reverse for 64 bit data in SSE2

Started by guga, August 27, 2024, 04:52:20 AM

Previous topic - Next topic

guga

Hi guys

someone suceeded to create a working 64bit reverse algorithm in SSE2 ? I´m trying to but i´m failing to do so. I mean, for masm32 and not masm64

The goal is reverse the bits of a Qword using SSE2. Ex:

Original Value:
0xC96C5795D7870F42 = 1100 1001 0110 1100 0101 0111 1001 0101 1101 0111 1000 0111 0000 1111 0100 0010

Bit reversed value:
0x42F0E1EBA9EA3693 = 0100 0010 1111 0000 1110 0001 1110 1011 1010 1001 1110 1010 0011 0110 1001 0011

I tried it, but i´m having weird results as:
0x420F87D795576CC9 = 0100 0010 0000 1111 1000 0111 1101 0111 1001 0101 0101 0111 0110 1100 1100 1001

I have no idea what i´m dong wrong.

What i did was:


[<16 SwapMaskA: Q$       0_FFFFFFFF]
[<16 SwapMaskB: Q$   0FFFF_0000FFFF]
[<16 SwapMaskC: Q$ 0FF00FF_00FF00FF]


Proc ReverseBits64_SSE2:

    ; Input: xmm0 = 64-bit value to reverse the bits
    ; Output: xmm0 = bit-reversed value
    movdqu xmm1 xmm0            ; Copy the original value
    movdqu xmm2 xmm0

    ; First step: ((x & 0x00000000FFFFFFFF) << 32) ^ ((x >> 32) & 0x00000000FFFFFFFF)
    ; ((x & 0x00000000FFFFFFFF) << 32)
    pand xmm1 X$SwapMaskA
    psllq xmm1 32

    ; (x >> 32) & 0x00000000ffffffff
    psrlq xmm2 32         ; Shift right by 64-bit (equivalent to >> 32 for quadword)
    pand xmm2 X$SwapMaskA

    ; XOR the results
    pxor xmm2 xmm1


    ; Perform the second step ((x & 0x0000FFFF0000FFFF) << 16) ^ ((x >> 16) & 0x0000FFFF0000FFFF)
    ; (x & 0x0000ffff0000ffff) << 16
    movdqu xmm1 xmm2
    pand xmm1 X$SwapMaskB
    psllq xmm1 16

    ; (x >> 16) & 0x0000ffff0000ffff
    psrlq xmm2 16         ; Shift right by 16-bit (equivalent to >> 16 for quadword)
    pand xmm2 X$SwapMaskB

    ; XOR the results
    pxor xmm2 xmm1

    ; Third step: ((x & 0x00FF00FF00FF00FF) << 8) ^ ((x >> 8) & 0x00FF00FF00FF00FF)
    ; (x & 0x00ff00ff00ff00ff) << 8
    movdqu xmm0 xmm2
    pand xmm0 X$SwapMaskC
    psllq xmm0 8

    ; (x >> 8) & 0x00ff00ff_00ff00ff
    psrlq xmm2 8         ; Shift right by 8-bit (equivalent to >> 8 for quadword)
    pand xmm2 X$SwapMaskC

    ; XOR the results
    pxor xmm0 xmm2

EndP
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

zedd151

Looks like the algo is swapping the bits in each byte, rather than reversing the bits on the full qword (as compared to expected results)

On closer examination, it appears to be reversing the byte order ... rather than reversing the bits of the full qword...

The only instructions starting with 'p' I am familiar with are 'push' and 'pop', though  :sad:  so I cannot offer any help on this. Just giving an observation.

guga

#2
Tks, zedd

IN ssse3 it can be done as this, i suppose

https://www.embeddedrelated.com/showthread/comp.arch.embedded/121818-4.php

> movdqa xmm2, xmm6
> movdqa xmm3, xmm5
>
> pshufb xmm2, xmm0
> pshufb xmm3, xmm1
>
> por xmm2, xmm3
>
> ; output = xmm2
>
> Especially on an i7, that would be very fast. On a Core2 it'd be nice too.
> That gives this routine plenty of potential users.


But, i need it for SSE2. On the link there´s a example but it is for 64bit
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

Maybe this is useful to you or gives you an idea......

Fast Bit Reversal algorithm
https://masm32.com/board/index.php?topic=5196.0
Creative coders use backward thinking techniques as a strategy.

guga

Tks Marinus

On your code, i´m a bit confused, where is the input value for a qword ? Do i need to input the value here ? BitRevMatrixConstant

Also, i´m taking a look on DrizzNew algo, but it is for 32 bit (dword) at a time, right ?  If drizz algo is only for Dword, it´s fast, but for qword i used a table as below (Didn´t benchmarked yet):

[rbits: B$ 0, 080, 040, 0C0, 020, 0A0, 060, 0E0
        B$ 010, 090, 050, 0D0, 030, 0B0, 070, 0F0
        B$ 08, 088, 048, 0C8, 028, 0A8, 068, 0E8
        B$ 018, 098, 058, 0D8, 038, 0B8, 078, 0F8
        B$ 04, 084, 044, 0C4, 024, 0A4, 064, 0E4
        B$ 014, 094, 054, 0D4, 034, 0B4, 074, 0F4
        B$ 0C, 08C, 04C, 0CC, 02C, 0AC, 06C, 0EC
        B$ 01C, 09C, 05C, 0DC, 03C, 0BC, 07C, 0FC
        B$ 02, 082, 042, 0C2, 022, 0A2, 062, 0E2
        B$ 012, 092, 052, 0D2, 032, 0B2, 072, 0F2
        B$ 0A, 08A, 04A, 0CA, 02A, 0AA, 06A, 0EA
        B$ 01A, 09A, 05A, 0DA, 03A, 0BA, 07A, 0FA
        B$ 06, 086, 046, 0C6, 026, 0A6, 066, 0E6
        B$ 016, 096, 056, 0D6, 036, 0B6, 076, 0F6
        B$ 0E, 08E, 04E, 0CE, 02E, 0AE, 06E, 0EE
        B$ 01E, 09E, 05E, 0DE, 03E, 0BE, 07E, 0FE
        B$ 01, 081, 041, 0C1, 021, 0A1, 061, 0E1
        B$ 011, 091, 051, 0D1, 031, 0B1, 071, 0F1
        B$ 09, 089, 049, 0C9, 029, 0A9, 069, 0E9
        B$ 019, 099, 059, 0D9, 039, 0B9, 079, 0F9
        B$ 05, 085, 045, 0C5, 025, 0A5, 065, 0E5
        B$ 015, 095, 055, 0D5, 035, 0B5, 075, 0F5
        B$ 0D, 08D, 04D, 0CD, 02D, 0AD, 06D, 0ED
        B$ 01D, 09D, 05D, 0DD, 03D, 0BD, 07D, 0FD
        B$ 03, 083, 043, 0C3, 023, 0A3, 063, 0E3
        B$ 013, 093, 053, 0D3, 033, 0B3, 073, 0F3
        B$ 0B, 08B, 04B, 0CB, 02B, 0AB, 06B, 0EB
        B$ 01B, 09B, 05B, 0DB, 03B, 0BB, 07B, 0FB
        B$ 07, 087, 047, 0C7, 027, 0A7, 067, 0E7
        B$ 017, 097, 057, 0D7, 037, 0B7, 077, 0F7
        B$ 0F, 08F, 04F, 0CF, 02F, 0AF, 06F, 0EF
        B$ 01F, 09F, 05F, 0DF, 03F, 0BF, 07F, 0FF]
; https://www.embeddedrelated.com/showthread/comp.arch.embedded/121818-4.php
; https://github.com/gbraad/pxt-reversebit/blob/master/main.ts
Proc ReverseBits64:
    Arguments @pInputQword, @OutputValue

    mov esi D@pInputQword
    mov edi D@OutputValue

    movzx eax B$esi | movzx eax B$rbits+eax | mov B$edi al
    movzx eax B$esi+1 | movzx eax B$rbits+eax | mov B$edi+1 al
    movzx eax B$esi+2 | movzx eax B$rbits+eax | mov B$edi+2 al
    movzx eax B$esi+3 | movzx eax B$rbits+eax | mov B$edi+3 al
    movzx eax B$esi+4 | movzx eax B$rbits+eax | mov B$edi+4 al
    movzx eax B$esi+5 | movzx eax B$rbits+eax | mov B$edi+5 al
    movzx eax B$esi+6 | movzx eax B$rbits+eax | mov B$edi+6 al
    movzx eax B$esi+7 | movzx eax B$rbits+eax | mov B$edi+7 al


EndP

Example of usage:

[Guga: Q$ 0] ; output the reversed value
[crc64_poly_Ecma: Q$ 0C96C5795D7870F42] ; CRC64 ecma polynomial

call ReverseBits64 crc64_poly_Ecma, Guga
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

InfiniteLoop

Divide and conquer.

ShufB BYTE 7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8
And1 BYTE DUP 16 (11110000b)
And2 BYTE DUP 16 (11001100b)
And3 BYTE DUP 16 (10101010b)

vmovdqu xmm0,[inputx2]
vpshufb xmm0,xmm0,xmmword ptr [ShufB]  ;reverse bytes
vpsllq xmm1,xmm0,4     ;0000 1111
vpsrlq xmm2,xmm0,4     ;1111 xxxx
vmovdqu xmm3,xmmword ptr [And1]
vpand xmm2,xmm2,xmm3
vpor xmm0,xmm1,xmm2
...

guga

#6
Tks Infiniteloop, but i´m looking a way in SSE2 for 32-bit. I guess i suceeded to convert drizzNew stir-fry algorithm to be used in a Qword.

https://masmforum.com/board/index.php/topic,12722.msg98224/sslRedirect.html?PHPSESSID=d9917beed356c22c4b0ba4c97bc18307#msg98224

This is the code i managed to work so far:


[<16 mask0: D$ 0F0F0F0F0 0F0F0F0F0]
[<16 mask1: D$  0F0F0F0F  0F0F0F0F]
[<16 mask2: D$ 033333333 033333333]
[<16 mask3: D$ 055555555 055555555]
[<16 mask4: D$ 0CCCCCCCC 0CCCCCCCC]
[<16 mask5: D$ 0AAAAAAAA 0AAAAAAAA]

Proc reverse_qword_sse2:
    Arguments @Input

    mov eax D@Input
    mov ecx D$eax | bswap ecx
    mov edx D$eax+4 | bswap edx
    movd xmm0 ecx | pslldq xmm0 4
    movd xmm1 edx | por xmm0 xmm1

    ; Step 1: Swap adjacent nibbles
    movdqu xmm1 xmm0 ; xmm1 = edx
    pand xmm1 X$mask1  ; and edx 0F0F0F0F - mask0 = 0x0F0F0F0F0F0F0F0F
    pand xmm0 X$mask0  ; and eax 0F0F0F0F0 - mask0 = 0x0F0F0F0F0F0F0F0F
    psllq xmm1 4                  ; shl edx 4 xmm0 = high nibbles shifted right
    psrlq xmm0 4                  ; shr eax 4 xmm0 = high nibbles shifted right
    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibbles
                                    ; xmm1 = edx xmm0 = eax
;    movdqu xmm2 xmm0

    movdqu xmm1 xmm0
    pand xmm1 X$mask2  ; mov edx 033333333 | and edx eax
    pand xmm0 X$mask4  ; and eax 0CCCCCCCC
    psrlq xmm0 2                  ; shr eax 4 xmm0 = high nibbles shifted right
    psllq xmm1 2                  ; shl edx 4 xmm0 = high nibbles shifted right
    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibbles

    movdqu xmm1 xmm0
    pand xmm1 X$mask3  ; mov edx 033333333 | and edx eax
    pand xmm0 X$mask5  ; and eax 0CCCCCCCC
    psrlq xmm0 1                  ; shr eax 4 xmm0 = high nibbles shifted right
    psllq xmm1 1                  ; shl edx 4 xmm0 = high nibbles shifted right
    por xmm0 xmm1                  ; or eax edx xmm0 = merged nibbles

EndP


example of usage:
[crc64_poly_Ecma: Q$ 0C96C5795D7870F42]
call reverse_qword_sse2 crc64_poly_Ecma

will output in xmm0 the reverted bits = 0x42F0 E1EB A9EA 3693

I´ll review the code and comments later, and make it export on his own buffer or the proper output buffer. I´ll only clean up the code a little and see if it can be optimized further



Btw...maybe it be a bit faster if i suceed to recreate the bswap for SSE2, this could remve the extra lines at the start that uses 2 bswap
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com