Bit-reversal is needed for a radix 2 FFT so, checked the output again.
When I was looking for repeating patterns in my earlier bit-reversal routine, I noticed this:
I could do bit-reversal without reversing bits.
8: 0 4 2 6 | 1 5 3 7
16: 0 8 4 12 2 10 6 14 | 1 9 5 13 3 11 7 15
32: 0 16 8 24 4 20 12 28 etc. ( every next row is the next power of two )
Second half = first half + 1 and my first thought was, cool, reduce size by 2 and add 1 to get the rest.
To do it this way, you need a lot of memory read\writes so, not an option.
Looked over it a few times and, EUREKA !
I can create a 4*4 matrix to calculate 32 bit-reversals with 1 index ( 16 + 16+1 )
by transposing 0 8 4 12 ( first quarter of length 16 ) to the first column.
Then fill the rest of the columns for a 512 size ( 16 * 32 values from 16 indices )
And it works, the matrix can be used to calculate all powers of two lengths and writes 4 values at once.
Just by only adding the values with 1 and putting them at the right place via the matrix indices I can get all bit-reversals.
It's easy to create a new matrix from the BitreversalMatrix for all other power of two sizes.
By multiplying ( power of two up ) or dividing the BitreversalMatrix ( power of two down ).
Multiplying is of cource done by shifting ( because we are dealing with power of 2 calculations ).
For example size 1024:
Multiply the Matrix by 2 ( 1024 / 512 ) and add 4 for every next 4*4 Matrix step.
For size 2048 then multiply the Matrix by 4 ( 2048 / 512 ) etc.
Why dividing by 512 ?, because the BitreversalMatrix represents length 512.
Next thing to do:
Initialize the BitreversalMatrix for the size you need.
Put the complete BitreversalMatrix into xmm4 - xmm7.
Copy every next row to xmm3 then shuffle each next index to eax ( no memory reads anymore ).
When you have done all 4 rows, add 4 for to every next 4*4 Matrix step.
So you can keep all code in registers and use the complete data cache ( size is also a power of two ) to do fast bit-reversals.
For example if you have a 32 KB L1 data cache, you can do a 8192 size bit-reversal very fast.
I'll code this soon. Decided to post it at this stage because it is easier to explain the algorithm this way.
Some eaxamples to show the logic of creating the 4*4 matrix:
BitreversalMatrix dd 00,32,16,48
dd 08,40,24,56
dd 04,36,20,52
dd 12,44,28,60
1024 dd 000, 064, 032, 096 ; First row = BitreversalMatrix * 2 ( 1024 / 512 = 2 )
dd 016, 080, 048, 112 ; Second row = BitreversalMatrix * 2
dd 008, 072, 040, 104 ; Third row = BitreversalMatrix * 2
dd 024, 088, 056, 120 ; Fourth row = BitreversalMatrix * 2
Next step refresh the Matrix with these new 16 values:
dd 004, 068, 036, 100 ; First row + 4
dd 020, 084, 052, 116 ; Second row + 4
dd 012, 076, 044, 108 ; Third row + 4
dd 028, 092, 060, 124 ; Fourth row + 4
512 == BitreversalMatrix
256 dd 00, 16, 08, 24 ; First row = BitreversalMatrix / 2 ( 256 / 512 = 0.5 )
dd 04, 20, 12, 28 ; Second row = BitreversalMatrix / 2
128 dd 00, 08, 04, 12 ; First row = BitreversalMatrix / 4 ( 128 / 512 = 0.25 )
64 dd 00, 04, 02, 06 ; First row = BitreversalMatrix / 8 ( 64 / 512 = 0.125 )
32 dd 00, 02, 01, 03 ; First row = BitreversalMatrix / 16 ( 32 / 512 = 0.0625 )
16 and 8 are less than 32 and are therefore special cases, but can be calculated like this,
lea esi,BitreversalMatrix
lea edi,BitreversalTable
movdqa xmm0,oword ptr [esi]
movdqa xmm1,oword ptr BitreversalADD
cmp BitreversalSize,16
je size_16
cmp BitreversalSize,8
jne done
psrld xmm0,3
movdqa oword ptr[edi],xmm0
paddd xmm0,xmm1
movdqa oword ptr[edi+BitreversalSize*2],xmm0
done:
ret
size_16:
psrld xmm0,2
movdqa oword ptr[edi],xmm0
paddd xmm0,xmm1
movdqa oword ptr[edi+BitreversalSize*2],xmm0
paddd xmm0,xmm1
movdqa oword ptr[edi+BitreversalSize],xmm0
paddd xmm0,xmm1
movdqa oword ptr[edi+BitreversalSize*2+BitreversalSize],xmm0
ret