Hi guys.
I ported a Fast Fourier Transform used in pHash algo to RosAsm. I hope it can be usefull for someone. It seems slow, though :(
[phcomplex.reDis 0
phcomplex.imDis 8]
[Size_Of_phcomplex 16]
[Float_Two_PI: R$ 6.28318530717958623]
[TmpFloat1: R$ 0]
[Float_One: R$ 1.0]
; InputX: Input array of Real data (Float Double)
; nfactor: The total amount of elements on the array
; XFactor: A buffer that will store the resultant values. (In Double float too.)
Proc fft:
Arguments @InputX, @nfactor, @XFactor
Local @twiddle_factors, @XtStruct, @kCounter, @MaxCount, @CurTwiddle_factors
Uses ecx, edx
mov eax D@nfactor | shr eax 1 | shl eax 4
C_call 'msvcrt.malloc' eax
mov D@twiddle_factors eax
mov D@CurTwiddle_factors eax
mov eax D@nfactor | shl eax 4 ; mul by Size_Of_phcomplex
C_call 'msvcrt.malloc' eax
mov D@XtStruct eax
mov D@kCounter 0
mov ebx D@nfactor | shr ebx 1
.While D@kCounter < ebx
fild F@kCounter | fmul R$Float_Two_PI | fidiv F@nfactor | fstp R$TmpFloat1
call polar_to_complex D@CurTwiddle_factors, Float_One, TmpFloat1
add D@CurTwiddle_factors Size_Of_phcomplex
inc D@kCounter
.End_While
call fft_calc D@nfactor, D@InputX, D@XFactor, D@XtStruct, 1, D@twiddle_factors
C_call 'msvcrt.free' D@twiddle_factors
C_call 'msvcrt.free' D@XtStruct
xor eax eax
EndP
_________________________________
Proc polar_to_complex:
Arguments @Output, @RValue, @Theta
Uses edi, ebx
mov eax D@Output
mov ebx D@Theta
mov edi D@RValue
fld R$ebx | fcos | fmul R$edi | fstp R$eax+phcomplex.reDis
fld R$ebx | fsin | fmul R$edi | fstp R$eax+phcomplex.imDis
EndP
________________________________________________
Proc fft_calc:
Arguments @nfactor, @InputX, @XFactor, @XtStruct, @step, @twids
Local @kCounter, @SComplex, @MaxStep, @MaxNCount, @CurXtStruct, @CurXFactor, @CurSComplex, @CurXtStruct2
Uses ecx, ebx, edi
If D@nfactor = 1
mov edi D@XFactor
mov eax D@InputX
fld R$eax | fstp R$edi+phcomplex.reDis
fldz | fstp R$edi+phcomplex.imDis
ExitP
End_If
mov ecx D@step | shl ecx 1 | mov D@MaxStep ecx
mov eax D@nfactor | shr eax 1 | mov D@MaxNCount eax
shl eax 4 | add eax D@XtStruct | mov D@SComplex eax
call fft_calc D@MaxNCount, D@InputX, D@SComplex, D@XFactor, D@MaxStep, D@twids
mov eax D@step | mov ecx D@InputX | lea ebx D$ecx+eax*8
call fft_calc D@MaxNCount, ebx, D@XtStruct, D@XFactor, D@MaxStep, D@twids
mov eax D@XtStruct | mov D@CurXtStruct eax | mov D@CurXtStruct2 eax
mov eax D@XFactor | mov D@CurXFactor eax
mov eax D@SComplex | mov D@CurSComplex eax
mov D@kCounter 0
mov ebx D@MaxNCount
.While D@kCounter < ebx
mov ecx D@kCounter | imul ecx D@Step | shl ecx 4 | add ecx D@twids
call mult_complex D@CurXtStruct2, D@CurXtStruct, ecx
call add_complex D@CurXFactor, D@CurSComplex, D@CurXtStruct2
mov eax ebx | add eax D@kCounter | shl eax 4 | add eax D@XFactor
call sub_complex eax, D@CurSComplex, D@CurXtStruct2
add D@CurXtStruct Size_Of_phcomplex
add D@CurXFactor Size_Of_phcomplex
add D@CurSComplex Size_Of_phcomplex
inc D@kCounter
.End_While
EndP
_________________________________________________
Proc mult_complex:
Arguments @Result, @AFactor, @BFActor
Uses ebx, esi, edi
mov ebx D@AFactor
mov esi D@BFactor
mov eax D@Result
fld R$ebx+phcomplex.reDis | fmul R$esi+phcomplex.reDis
fld R$ebx+phcomplex.imDis | fmul R$esi+phcomplex.imDis
fsubp ST1 ST0
fstp R$eax+phcomplex.reDis
fld R$ebx+phcomplex.reDis | fmul R$esi+phcomplex.imDis
fld R$ebx+phcomplex.imDis | fmul R$esi+phcomplex.reDis
faddp ST1 ST0
fstp R$eax+phcomplex.imDis
EndP
_________________________________________________
Proc add_complex:
Arguments @Result, @AFactor, @BFActor
Uses ebx, esi
mov ebx D@AFactor
mov esi D@BFactor
mov eax D@Result
fld R$ebx+phcomplex.reDis | fadd R$esi+phcomplex.reDis | fstp R$eax+phcomplex.reDis
fld R$ebx+phcomplex.imDis | fadd R$esi+phcomplex.imDis | fstp R$eax+phcomplex.imDis
EndP
_________________________________________________
Proc sub_complex:
Arguments @Result, @AFactor, @BFActor
Uses ebx, esi
mov ebx D@AFactor
mov esi D@BFactor
mov eax D@Result
fld R$ebx+phcomplex.reDis | fsub R$esi+phcomplex.reDis | fstp R$eax+phcomplex.reDis
fld R$ebx+phcomplex.imDis | fsub R$esi+phcomplex.imDis | fstp R$eax+phcomplex.imDis
EndP
_________________________________________________
Example of usage:
[Data1: R$ 1, 2, 3, 4, 5] ; Input
; Output
[Data2:
Data2.Val1: R$ 0
Data2.Val2: R$ 0
Data2.Val3: R$ 0
Data2.Val4: R$ 0
Data2.Val5: R$ 0]
call fft Data1, 5, Data2
Did we not stump this to death yet :icon_rolleyes:
Siekmanski & R*3 where are you?. My math is not well ,
as in cancerous to put it frankly. A newer algo would
spice things up and therefore I am all ears!
I'm at home. :biggrin:
rrr314159 did find the optimisations to skip all 0 and 90 degree trig calculations in the first 2 log2 loops.
Because before the bitreversal there is a stride of 4, 8 etc. and thus we can calculate 4 values at once. ( or 8 at once with AVX )
I found more optimisations before the bitreversal, starting in de 3th log2 loop for all the 45 degree rotations.
And we can skip even more 0 and 90 degree trig calculations starting from the 3th log2 loop.
Haven't implemented it yet in my fft source code. ( it's under a layer of dust by now :biggrin:)
for example:
(6.0 * 0.707) + (9.0 * -0.707) = 2.1
can be replaced by one of these 2 calculations:
(6.0 - 9.0) * -0.707 = 2.1 ; for -45 degree
(9.0 - 6.0) * 0.707 = 2.1 ; for 45 degree
Now we can skip many more trig calculations.
Made a print out of a fft32, the positions and cos sin values to make it more easier to explain.
Bit-reversal = 32
00 16 08 24 | 04 20 12 28 | 02 18 10 26 | 06 22 14 30
01 17 09 25 | 05 21 13 29 | 03 19 11 27 | 07 23 15 31
FFTsize = 32
LogLoop1 A B Cos Sin
0 1 1.000 0.000
16 17 1.000 0.000
8 9 1.000 0.000
24 25 1.000 0.000
4 5 1.000 0.000
20 21 1.000 0.000
12 13 1.000 0.000
28 29 1.000 0.000
2 3 1.000 0.000
18 19 1.000 0.000
10 11 1.000 0.000
26 27 1.000 0.000
6 7 1.000 0.000
22 23 1.000 0.000
14 15 1.000 0.000
30 31 1.000 0.000
LogLoop2 A B Cos Sin
0 2 1.000 0.000
16 18 1.000 0.000
8 10 1.000 0.000
24 26 1.000 0.000
4 6 1.000 0.000
20 22 1.000 0.000
12 14 1.000 0.000
28 30 1.000 0.000
1 3 0.000 -1.000
17 19 0.000 -1.000
9 11 0.000 -1.000
25 27 0.000 -1.000
5 7 0.000 -1.000
21 23 0.000 -1.000
13 15 0.000 -1.000
29 31 0.000 -1.000
LogLoop3 A B Cos Sin
0 4 1.000 0.000
16 20 1.000 0.000
8 12 1.000 0.000
24 28 1.000 0.000
2 6 0.000 -1.000
18 22 0.000 -1.000
10 14 0.000 -1.000
26 30 0.000 -1.000
1 5 0.707 -0.707
17 21 0.707 -0.707
9 13 0.707 -0.707
25 29 0.707 -0.707
3 7 -0.707 -0.707
19 23 -0.707 -0.707
11 15 -0.707 -0.707
27 31 -0.707 -0.707
LogLoop4 A B Cos Sin
0 8 1.000 0.000
16 24 1.000 0.000
4 12 -0.000 -1.000
20 28 -0.000 -1.000
2 10 0.707 -0.707
18 26 0.707 -0.707
6 14 -0.707 -0.707
22 30 -0.707 -0.707
1 9 0.924 -0.383
17 25 0.924 -0.383
5 13 -0.383 -0.924
21 29 -0.383 -0.924
3 11 0.383 -0.924
19 27 0.383 -0.924
7 15 -0.924 -0.383
23 31 -0.924 -0.383
LogLoop5 A B Cos Sin
0 16 1.000 0.000
1 17 0.981 -0.195
2 18 0.924 -0.383
3 19 0.831 -0.556
4 20 0.707 -0.707
5 21 0.556 -0.831
6 22 0.383 -0.924
7 23 0.195 -0.981
8 24 0.000 -1.000
9 25 -0.195 -0.981
10 26 -0.383 -0.924
11 27 -0.556 -0.831
12 28 -0.707 -0.707
13 29 -0.831 -0.556
14 30 -0.924 -0.383
15 31 -0.981 -0.195
edit: included the missing brackets.
You guys are seriously weird but KEWL. (think bro love)
This does make sense, thanks for dumbing it down :bgrin:
Hi guga,
here's a previous FFT thread, http://masm32.com/board/index.php?topic=4231.0, might be of interest. Has some fast routines, using multiple cores and SIMD instructions. (Not RosAsm)
Hi Siekmanski,
Quote from: Siekmanski on May 06, 2017, 05:39:14 AM
Haven't implemented it yet in my fft source code. ( it's under a layer of dust by now :biggrin:)
Well, get to work! :biggrin:) Like you, I never really finished my old FFT routine. There's no dust on it - but there is a big layer in my brain! Would take a while to figure out what I was doing, and why. Wish there were more comments :icon_confused:
Quote from: Siekmanski
(6.0 - 9.0) * -0.707 = 2.1 ; for -45 degree
(9.0 - 6.0) * 0.707 = 2.1 ; for 45 degree
You left out a couple of brackets, shown above in bold.
Hi Raistlin,
Did you ever get around to finding those pulsars?
Hi rrr314159,
Glad to hear from you again. Thanks for correcting the typo's.
Quote from: rrr314159
Well, get to work! :biggrin:) Like you, I never really finished my old FFT routine. There's no dust on it - but there is a big layer in my brain! Would take a while to figure out what I was doing, and why. Wish there were more comments :icon_confused:
:biggrin:
Luckily i've commented my FFT sources very heavily, or else I would loose track in an instance.
If I get in the mood, i'll try to fold my brain in the FFT mode again and try to implement it and see if it works out or not.
Think I found my latest SSE FFT routine...
With a few comments. :biggrin:
align 16
FFTSSE proc uses ebx esi edi SinCosTable:DWORD,Complex_Data:DWORD
push ebp
; To keep the code size small for the Log2 loops ( within 1 or 2 cache lines ) for all FFTsizes,
; we have to put the addresses of the AB stride positions in registers !
mov esi,Complex_Data ; RealA = Complex_Data
mov edi,FFTsize
mov eax,edi
lea ebx,[esi+edi*4] ; ImagA = Complex_Data + FFTsize*4
lea ecx,[esi+edi*2] ; RealB = Complex_Data + FFTsize*2
lea edx,[esi+edi*4]
lea edx,[edx+edi*2] ; ImagB = Complex_Data + FFTsize*6
; Because we are dealing with 0 degree and 90 degree rotations we can skip trigonometric calculations
; Before time domain decomposition:
; The stride of the first log2 loop = FFTsize/2
; The stride of the second log2 loop = FFTsize/4
; We can take advantage of that fact and process the first 2 log2 loops with 16 values at once
add eax,eax ; Start position = FFTsize*2
align 16
first_log2_LP:
movaps xmm0,oword ptr[esi+eax-16] ; RealA
movaps xmm1,oword ptr[ebx+eax-16] ; ImagA
movaps xmm2,oword ptr[ecx+eax-16] ; RealB
movaps xmm3,oword ptr[edx+eax-16] ; ImagB
movaps xmm4,xmm0 ; tempRealA
movaps xmm5,xmm1 ; tempImagA
addps xmm0,xmm2 ; RealA = RealA + RealB
addps xmm1,xmm3 ; ImagA = ImagA + ImagB
subps xmm4,xmm2 ; RealB = tempRealA - RealB
subps xmm5,xmm3 ; ImagB = tempImagA - ImagB
sub eax,16
movaps oword ptr[esi+eax],xmm0 ; new RealA
movaps oword ptr[ebx+eax],xmm1 ; new ImagA
movaps oword ptr[ecx+eax],xmm4 ; new RealB
movaps oword ptr[edx+eax],xmm5 ; new ImagB
jnz first_log2_LP ; Loop size = 56 bytes ( within 1 cache line )
; mov CodeSize,$ - first_log2_LP
sub ecx,edi ; RealB = Complex_Data + FFTsize
sub edx,edi ; ImagB = Complex_Data + FFTsize*5
align 16
second_log2_Even_LP:
movaps xmm0,oword ptr[esi+edi-16] ; RealA
movaps xmm1,oword ptr[ebx+edi-16] ; ImagA
movaps xmm2,oword ptr[ecx+edi-16] ; RealB
movaps xmm3,oword ptr[edx+edi-16] ; ImagB
movaps xmm4,xmm0 ; tempRealA
movaps xmm5,xmm1 ; tempImagA
addps xmm0,xmm2 ; RealA = RealA + RealB
addps xmm1,xmm3 ; ImagA = ImagA + ImagB
subps xmm4,xmm2 ; RealB = tempRealA - RealB
subps xmm5,xmm3 ; ImagB = tempImagA - ImagB
sub edi,16
movaps oword ptr[esi+edi],xmm0 ; new RealA
movaps oword ptr[ebx+edi],xmm1 ; new ImagA
movaps oword ptr[ecx+edi],xmm4 ; new RealB
movaps oword ptr[edx+edi],xmm5 ; new ImagB
jnz second_log2_Even_LP ; Loop size = 56 bytes ( within 1 cache line )
; mov CodeSize,$-second_log2_Even_LP
mov eax,FFTsize
add esi,FFTsize*2 ; RealA = Complex_Data + FFTsize*2
add ebx,FFTsize*2 ; ImagA = Complex_Data + FFTsize*6
add ecx,FFTsize*2 ; RealB = Complex_Data + FFTsize*3
add edx,FFTsize*2 ; ImagB = Complex_Data + FFTsize*7
align 16
second_log2_Odd_LP:
movaps xmm0,oword ptr[esi+eax-16] ; RealA
movaps xmm1,oword ptr[ebx+eax-16] ; ImagA
movaps xmm2,oword ptr[ecx+eax-16] ; RealB
movaps xmm3,oword ptr[edx+eax-16] ; ImagB
movaps xmm4,xmm0 ; tempRealA
movaps xmm5,xmm1 ; tempImagA
addps xmm0,xmm3 ; RealA = RealA + ImagB
addps xmm5,xmm2 ; ImagB = tempImagA + RealB
subps xmm1,xmm2 ; ImagA = ImagA - RealB
subps xmm4,xmm3 ; RealB = tempRealA - ImagB
sub eax,16
movaps oword ptr[esi+eax],xmm0 ; new RealA
movaps oword ptr[ebx+eax],xmm1 ; new ImagA
movaps oword ptr[ecx+eax],xmm4 ; new RealB
movaps oword ptr[edx+eax],xmm5 ; new ImagB
jnz second_log2_Odd_LP ; Loop size = 56 bytes ( within 1 cache line )
mov esi,Complex_Data
mov eax,FFTsize*2 ; Complex_Data + FFTsize*2
lea edi,ReverseBinaryTable ; reverse-binary reindexing table ( time domain decomposition )
lea edx,[esi+eax*2] ; imaginary data ptr = Complex_Data + FFTsize*4
mov ecx,dword ptr[edi] ; swap count
align 16
RBI_LP:
mov eax,dword ptr[edi+4] ; get the swap positions
mov ebx,dword ptr[edi+8]
movss xmm0,real4 ptr[esi+eax*2] ; get real data
movss xmm1,real4 ptr[esi+ebx*2] ; get real data
movss xmm2,real4 ptr[edx+eax*2] ; get imaginary data
movss xmm3,real4 ptr[edx+ebx*2] ; get imaginary data
movss real4 ptr[esi+eax*2],xmm1 ; swap real data
movss real4 ptr[esi+ebx*2],xmm0 ; swap real data
movss real4 ptr[edx+eax*2],xmm3 ; swap imaginary data
movss real4 ptr[edx+ebx*2],xmm2 ; swap imaginary data
add edi,8
dec ecx
jnz RBI_LP
; mov CodeSize,$-RBI_LP
mov ebp,SinCosTable
mov edx,4 ; AB distance = 4
align 16
Log2_LP: ; Third Log2 loop starts here, loop (log2(FFTsize)-2) times.
mov edi,edx ; AB distance
add edx,edx ; Step == AB distance * 2
xor ecx,ecx ; Next Log2 loop, first A position = 0
Butterfly_LP: ; For strides 4, 8, 16 etc.
movaps xmm6,oword ptr[ebp] ; Load Cos
movaps xmm7,oword ptr[ebp+16] ; Load Sin
mov eax,ecx ; Next A position
Decomposition_LP:
mov ebx,eax ; A position
add ebx,edi ; B position = A + AB distance
movaps xmm0,oword ptr[esi+eax*4] ; RealA
movaps xmm1,oword ptr[esi+eax*4+FFTsize*4] ; ImagA
movaps xmm2,oword ptr[esi+ebx*4] ; RealB
movaps xmm3,oword ptr[esi+ebx*4+FFTsize*4] ; ImagB
movaps xmm4,xmm2 ; RealB copy
movaps xmm5,xmm3 ; ImagB copy
mulps xmm2,xmm6 ; RealB * cos
mulps xmm3,xmm7 ; ImagB * sin
mulps xmm4,xmm7 ; RealB * sin
mulps xmm5,xmm6 ; ImagB * cos
subps xmm2,xmm3 ; tempReal = RealB * cos - ImagB * sin
addps xmm4,xmm5 ; tempImag = RealB * sin + ImagB * cos
movaps xmm3,xmm0 ; RealA copy
movaps xmm5,xmm1 ; ImagA copy
addps xmm0,xmm2 ; RealA = RealA + tempReal
addps xmm1,xmm4 ; ImagA = ImagA + tempImag
subps xmm3,xmm2 ; RealB = RealA - tempReal
subps xmm5,xmm4 ; ImagB = ImagA - tempImag
movaps oword ptr[esi+eax*4],xmm0 ; new RealA
movaps oword ptr[esi+eax*4+FFTsize*4],xmm1 ; new ImagA
movaps oword ptr[esi+ebx*4],xmm3 ; new RealB
movaps oword ptr[esi+ebx*4+FFTsize*4],xmm5 ; new ImagB
add eax,edx ; Next step in FFT data
cmp eax,FFTsize ; Do we need the next 4 Sine and Cosine values ?
jb Decomposition_LP
add ebp,32 ; Next step in SinCosTable
add ecx,4 ; Next A position
cmp ecx,edi ; Is current Log2 loop done ?
jb Butterfly_LP
cmp edx,FFTsize ; Are all Log2 loops done ?
jb Log2_LP ; Loop size = 116 bytes
pop ebp
ret
FFTSSE endp
QuoteHi Raistlin,
Did you ever get around to finding those pulsars?
Hail master rrr314159/Siekmanski, Nope did'nt find those pulsars yet (dreamy look), but I'am on my way...
Writing a paper on a High Performance / High Throughput Computing Hybrid grid that may be
able to implement FFT searches in parallel for the use of binary star pulsar searches and much more...
Still knocking it out though so don't hold you're breath just yet.... takes forever - thesis writing part that is.
Coding part - well on its way - all in >ASM<
Hope you're all still doing well, miss chatting with you (mayhap I'll drop you a PM so you can do some beta testing ??) :bgrin:
Raistlin
@Siekmanski,
You've got the right idea, comment almost every line. When coding, comments seem superfluous (it's all so obvious), but later on I regret leaving them out.
@raistlin,
No doubt there are a lot of pulsars out there waiting to be discovered. It's interesting that shortly after our 2015 FFT thread, LIGO finally detected gravitational waves. That adds considerable legitimacy to Hulse & Taylor. Wouldn't be surprised if it gives a boost to pulsar research, looking for more examples like PSR B1913+16.
Good luck with your thesis! Finishing it will be one of the great moments of your life. Unfortunately health problems make me an unlikely beta tester these days, but I'm with you in spirit :-)
BTW, looks like you guys were right about Trump. I'm not happy with him. Still, better than Hillary