Author Topic: Fast Bit Reversal algorithm  (Read 4036 times)

Siekmanski

  • Member
  • *****
  • Posts: 1092
Re: Fast Bit Reversal algorithm
« Reply #15 on: March 26, 2016, 08:26:45 PM »
I finished a routine that calculates 16 to 1024 size tables.
The algorithm works but I don't have enough registers to calculate larger then 1024 size tables. ( at least I think so.... )
I may need to store temporary pre calculated column values inside the table memory to calculate larger power of 2 size tables.

After all the goal is, to use this algorithm to calculate very large tables as fast as possible.

Edit: see Reply #26 for the source code
« Last Edit: March 28, 2016, 02:27:42 AM by Siekmanski »

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Fast Bit Reversal algorithm
« Reply #16 on: March 26, 2016, 11:04:57 PM »
It crashes if I run it from the console, but for some reason seems to work inside Olly.
Code: [Select]
CPU Disasm
Address   Hex dump          Command                                  Comments
00401154  |.  33C0          XOR EAX,EAX
00401156  |.  BB 04000000   MOV EBX,4
0040115B  |.  BA 04000000   MOV EDX,4
00401160  |.  B9 20000000   MOV ECX,20
00401165  |.  8DA424 000000 LEA ESP,[LOCAL.1]
0040116C  |.  8D6424 00     LEA ESP,[LOCAL.1]
00401170  |>  660F7F0487    /MOVDQA DQWORD PTR DS:[EAX*4+EDI],XMM0   ; FLOAT 1.697596632914509e-311, 1.154365710300918e-311
00401175  |.  660F7F8C87 00 |MOVDQA DQWORD PTR DS:[EAX*4+EDI+800],XM
0040117E  |.  660FFEC2      |PADDD XMM0,XMM2
00401182  |.  660FFECA      |PADDD XMM1,XMM2
00401186  |.  660F7F8487 00 |MOVDQA DQWORD PTR DS:[EAX*4+EDI+400],XM
0040118F  |.  660F7F8C87 00 |MOVDQA DQWORD PTR DS:[EAX*4+EDI+0C00],X

CPU - main thread, module FastBitReversal

EAX 00150788
ECX 0000001C
EDX 00000003
EBX 00000004
ESP 0012FFB8
EBP 0012FFF0
ESI 01D18756
EDI 00403080 FastBitReversal.00403080
EIP 00401170 FastBitReversal.00401170

C 0  ES 0023 32bit 0(FFFFFFFF)
P 0  CS 001B 32bit 0(FFFFFFFF)
A 0  SS 0023 32bit 0(FFFFFFFF)
Z 0  DS 0023 32bit 0(FFFFFFFF)
S 0  FS 003B 32bit 7FFDF000(FFF)
T 0  GS 0000 NULL
D 0
O 0  LastErr 00000000 ERROR_SUCCESS
EFL 00010202 (NO,NB,NE,A,NS,PO,GE,G)

ST0 empty +UNORM 0061 00730072 00000001
ST1 empty -UNORM B411 7C90D80A 00650078
ST2 empty +UNORM 0024 0013F550 00000017
ST3 empty -UNORM B461 001599C8 00000001
ST4 empty +UNORM 0305 02020202 03030000
ST5 empty 0.0000000000000028100e-4933
ST6 empty 0.0000000000050182150e-4933
ST7 empty 0.0
               3 2 1 0      E S P U O Z D I
FST 0000  Cond 0 0 0 0  Err 0 0 0 0 0 0 0 0 (GT)
FCW 027F  Prec NEAR,53  Mask    1 1 1 1 1 1
Last cmnd 001B:0288EBBD

XMM0 00000320 00000120 00000220 00000020
XMM1 00000321 00000121 00000221 00000021
XMM2 00000002 00000002 00000002 00000002
XMM3 00158B40 00158B20 7C9100AD 00150788
XMM4 00158AD0 00158AF0 7C91005D 00150778
XMM5 00000000 00000060 00000020 00000040
XMM6 00000060 00000020 00000040 00000000
XMM7 7C913F92 7C980620 7C910460 7C90E920
                                P U O Z D I
MXCSR 00001F80  FZ 0 DZ 0  Err  0 0 0 0 0 0
                Rnd NEAR   Mask 1 1 1 1 1 1

jj2007

  • Member
  • *****
  • Posts: 7548
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Bit Reversal algorithm
« Reply #17 on: March 27, 2016, 06:34:16 AM »
Works fine for me on Win7-64, even if launched directly from the archive.

Siekmanski

  • Member
  • *****
  • Posts: 1092
Re: Fast Bit Reversal algorithm
« Reply #18 on: March 27, 2016, 03:04:30 PM »
Hi Nidud,
Can't find what could cause this....

jj2007

  • Member
  • *****
  • Posts: 7548
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Bit Reversal algorithm
« Reply #19 on: March 27, 2016, 07:05:29 PM »
Could be unaligned access. @Nidud: If you configure Olly as just-in-time debugger, where does it crash?7
->Options/debugging/just-in-time

Siekmanski

  • Member
  • *****
  • Posts: 1092
Re: Fast Bit Reversal algorithm
« Reply #20 on: March 27, 2016, 08:39:55 PM »
Memory access is 16 byte aligned.

jj2007

  • Member
  • *****
  • Posts: 7548
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Bit Reversal algorithm
« Reply #21 on: March 27, 2016, 09:03:07 PM »
In fact, I can't produce a crash, whatever I try ::)

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Fast Bit Reversal algorithm
« Reply #22 on: March 27, 2016, 10:16:05 PM »
If you configure Olly as just-in-time debugger, where does it crash?

The dump shows where the crash is: 00401170  |>
EAX = 00150788 --> DS:[EAX*4+EDI]

It could be a hardware problem. The CPU is <= SSE3.

jj2007

  • Member
  • *****
  • Posts: 7548
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Bit Reversal algorithm
« Reply #23 on: March 27, 2016, 10:46:01 PM »
It could be a hardware problem. The CPU is <= SSE3.

movdqa is SSE2. Besides, Olly is powerful but it can't upgrade the hardware :icon_mrgreen:

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Fast Bit Reversal algorithm
« Reply #24 on: March 27, 2016, 11:42:46 PM »
It could be a hardware problem. The CPU is <= SSE3.

movdqa is SSE2. Besides, Olly is powerful but it can't upgrade the hardware :icon_mrgreen:

 :biggrin:

I didn't look into the up-codes used but...

When stepping through the code using MS, EAX (from XMM3) is 40/20/60. Then EBX becomes 0 and EDX 3, so it jumps to NextColumn. At this point XMM4 is undefined.

Code: [Select]
    dec     ebx
    jnz     NextSwap ; Check if we need the next row ?
    mov     ebx,4 ; Row index counter
    dec     edx
    jnz     NextColumn ; Check if we need the next row ?
    mov     edx,4 ; Column index counter
    movdqa  xmm4,xmm5 ; Get Matrix constant
    pshufd  xmm4,xmm4,Shuffle(2,2,2,2)
    psrld   xmm4,4 ; Calculate Matrix step
NextColumn:
    movdqa  xmm3,xmm5 ; First row indices for matrix calculations
    pshufd  xmm5,xmm5,Shuffle(0,3,2,1) ; Prepare next column value
    pshufd  xmm3,xmm3,Shuffle(1,1,1,1) ; Get second value and transpose to next column ( incl. adding values for next row )
    psrld   xmm3,2 ; Calculate Matrix row step
    paddd   xmm3,xmm4
    paddd   xmm3,xmm6 ; New row indices for Table positions
    jmp     NextIndex
NextSwap:
    pshufd  xmm3,xmm3,Shuffle(0,3,2,1) ; Get next index from the row
NextIndex:
    movd    eax,xmm3

As a result EAX is now 7C910232, so I guess it depend on the value of XMM4 in the paddd xmm3,xmm4 above.

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Fast Bit Reversal algorithm
« Reply #25 on: March 27, 2016, 11:46:13 PM »
So, this works:
Code: [Select]
    pxor    xmm4,xmm4
    call    FastBitReversal16_1024

Siekmanski

  • Member
  • *****
  • Posts: 1092
Re: Fast Bit Reversal algorithm
« Reply #26 on: March 28, 2016, 02:26:01 AM »
Nidud.

Thanks, for finding this bug, xmm4 is indeed undefined when it jumps to NextColumn.  :t

jj2007

  • Member
  • *****
  • Posts: 7548
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast Bit Reversal algorithm
« Reply #27 on: May 28, 2016, 07:40:57 PM »
Here is a slow but easy-to-use MirrorBits() macro ruthlessly stolen from The Aggregate at University of Kentucky :biggrin:

Code: [Select]
include \masm32\MasmBasic\MasmBasic.inc ; http://www.webalice.it/jj2006/Masm32_Tips_Tricks_and_Traps.htm

MirrorBits MACRO arg32
  ifdifi <arg32>, <eax>
mov eax, arg32
  endif
  mov edx, eax
  and eax, 0aaaaaaaah
  and edx, 055555555h
  shr eax, 1
  add edx, edx ; shl edx,1
  or eax, edx
  mov edx, eax
  and eax, 0cccccccch
  and edx, 033333333h
  shr eax, 2
  lea edx, [4*edx] ; shl edx, 2
  or eax, edx
  mov edx, eax
  and eax, 0f0f0f0f0h
  and edx, 0f0f0f0fh
  shr eax, 4
  shl edx, 4
  or eax, edx
  mov edx, eax
  and eax, 0ff00ff00h
  and edx, 0ff00ffh
  shr eax, 8
  shl edx, 8
  or eax, edx
  mov edx, eax
  shr eax, 16
  shl edx, 16
  or eax, edx
  EXITM <eax>
ENDM

  Init
  push 29
  .Repeat
mov ecx, Rand(-1)
Print Bin$(ecx), " "
Print Bin$(MirrorBits(ecx)), CrLf$
dec stack
  .Until Sign?
EndOfCode

Code: [Select]
01110011101010011011100010001101 10110001000111011001010111001110
00010010011100011100110110011101 10111001101100111000111001001000
00100010111111101000101011011000 00011011010100010111111101000100
11011101000111011011100110100010 01000101100111011011100010111011
10100111111001110100000001001101 10110010000000101110011111100101
11010010000110000100011110110011 11001101111000100001100001001011
00111000011011011111010011000100 00100011001011111011011000011100
11001001100100010111001100000000 00000000110011101000100110010011
00000101111100101101011011111111 11111111011010110100111110100000
10000100010101101000011100001011 11010000111000010110101000100001
10010000000000000001001101011000 00011010110010000000000000001001
01011101011001101111011010001110 01110001011011110110011010111010
10010011101011101110101110101000 00010101110101110111010111001001
10101101000101100011100110111011 11011101100111000110100010110101
01000000010110111111100100100000 00000100100111111101101000000010
00100101011010000101111100010001 10001000111110100001011010100100
10010110110101010011100001110000 00001110000111001010101101101001
01110101111111110011100000101011 11010100000111001111111110101110
10110000001000010110001101110101 10101110110001101000010000001101
11111010101001110110111110011110 01111001111101101110010101011111
10100011100111111110111110100101 10100101111101111111100111000101
00101010000111001010100011010010 01001011000101010011100001010100
11010111001111101100100000011010 01011000000100110111110011101011
00011111101110101101101000110110 01101100010110110101110111111000
00111011101111100100110101000001 10000010101100100111110111011100
11000110101010111001011101011011 11011010111010011101010101100011
11100000111010001101111101101111 11110110111110110001011100000111
11110100001001100111100101101011 11010110100111100110010000101111
11110000110000101110100111011111 11111011100101110100001100001111
01100100111110010100110110001110 01110001101100101001111100100110