Author Topic: Fast Fourier Transform  (Read 407 times)

guga

• Moderator
• Member
• Posts: 825
• Assembly is a state of art.
Fast Fourier Transform
« on: April 04, 2017, 05:17:53 AM »
Hi guys.

I ported a Fast Fourier Transform used in pHash algo to RosAsm. I hope it can be usefull for someone. It seems slow, though

Code: [Select]
`[phcomplex.reDis 0 phcomplex.imDis 8][Size_Of_phcomplex 16][Float_Two_PI: R\$ 6.28318530717958623][TmpFloat1: R\$ 0][Float_One: R\$ 1.0]; InputX: Input array of Real data (Float Double); nfactor: The total amount of elements on the array; XFactor: A buffer that will store the resultant values. (In Double float too.)Proc fft:    Arguments @InputX, @nfactor, @XFactor    Local @twiddle_factors, @XtStruct, @kCounter, @MaxCount, @CurTwiddle_factors    Uses ecx, edx    mov eax D@nfactor | shr eax 1 | shl eax 4    C_call 'msvcrt.malloc' eax    mov D@twiddle_factors eax    mov D@CurTwiddle_factors eax    mov eax D@nfactor | shl eax 4 ; mul by Size_Of_phcomplex    C_call 'msvcrt.malloc' eax    mov D@XtStruct eax    mov D@kCounter 0    mov ebx D@nfactor | shr ebx 1    .While D@kCounter < ebx        fild F@kCounter | fmul R\$Float_Two_PI | fidiv F@nfactor | fstp R\$TmpFloat1        call polar_to_complex D@CurTwiddle_factors, Float_One, TmpFloat1        add D@CurTwiddle_factors Size_Of_phcomplex        inc D@kCounter    .End_While    call fft_calc D@nfactor, D@InputX, D@XFactor, D@XtStruct, 1, D@twiddle_factors    C_call 'msvcrt.free' D@twiddle_factors    C_call 'msvcrt.free' D@XtStruct    xor eax eaxEndP`_________________________________
Code: [Select]
`Proc polar_to_complex:    Arguments @Output, @RValue, @Theta    Uses edi, ebx    mov eax D@Output    mov ebx D@Theta    mov edi D@RValue    fld R\$ebx | fcos | fmul R\$edi | fstp R\$eax+phcomplex.reDis    fld R\$ebx | fsin | fmul R\$edi | fstp R\$eax+phcomplex.imDisEndP`________________________________________________
Code: [Select]
`Proc fft_calc:    Arguments @nfactor, @InputX, @XFactor, @XtStruct, @step, @twids    Local @kCounter, @SComplex, @MaxStep, @MaxNCount, @CurXtStruct, @CurXFactor, @CurSComplex, @CurXtStruct2    Uses ecx, ebx, edi    If D@nfactor = 1        mov edi D@XFactor        mov eax D@InputX        fld R\$eax | fstp R\$edi+phcomplex.reDis        fldz | fstp R\$edi+phcomplex.imDis        ExitP    End_If    mov ecx D@step | shl ecx 1 | mov D@MaxStep ecx    mov eax D@nfactor | shr eax 1 | mov D@MaxNCount eax    shl eax 4 | add eax D@XtStruct | mov D@SComplex eax    call fft_calc D@MaxNCount, D@InputX, D@SComplex, D@XFactor, D@MaxStep, D@twids    mov eax D@step | mov ecx D@InputX | lea ebx D\$ecx+eax*8    call fft_calc D@MaxNCount, ebx, D@XtStruct, D@XFactor, D@MaxStep, D@twids    mov eax D@XtStruct | mov D@CurXtStruct eax | mov D@CurXtStruct2 eax    mov eax D@XFactor | mov D@CurXFactor eax    mov eax D@SComplex | mov D@CurSComplex eax    mov D@kCounter 0    mov ebx D@MaxNCount    .While D@kCounter < ebx        mov ecx D@kCounter | imul ecx D@Step | shl ecx 4 | add ecx D@twids        call mult_complex D@CurXtStruct2, D@CurXtStruct, ecx        call add_complex D@CurXFactor, D@CurSComplex, D@CurXtStruct2        mov eax ebx | add eax D@kCounter | shl eax 4 | add eax D@XFactor        call sub_complex eax, D@CurSComplex, D@CurXtStruct2        add D@CurXtStruct Size_Of_phcomplex        add D@CurXFactor Size_Of_phcomplex        add D@CurSComplex Size_Of_phcomplex        inc D@kCounter    .End_WhileEndP`_________________________________________________
Code: [Select]
`Proc mult_complex:    Arguments @Result, @AFactor, @BFActor    Uses ebx, esi, edi    mov ebx D@AFactor    mov esi D@BFactor    mov eax D@Result    fld R\$ebx+phcomplex.reDis | fmul R\$esi+phcomplex.reDis    fld R\$ebx+phcomplex.imDis | fmul R\$esi+phcomplex.imDis    fsubp ST1 ST0    fstp R\$eax+phcomplex.reDis    fld R\$ebx+phcomplex.reDis | fmul R\$esi+phcomplex.imDis    fld R\$ebx+phcomplex.imDis | fmul R\$esi+phcomplex.reDis    faddp ST1 ST0    fstp R\$eax+phcomplex.imDisEndP`_________________________________________________

Code: [Select]
`Proc add_complex:    Arguments @Result, @AFactor, @BFActor    Uses ebx, esi    mov ebx D@AFactor    mov esi D@BFactor    mov eax D@Result    fld R\$ebx+phcomplex.reDis | fadd R\$esi+phcomplex.reDis | fstp R\$eax+phcomplex.reDis    fld R\$ebx+phcomplex.imDis | fadd R\$esi+phcomplex.imDis | fstp R\$eax+phcomplex.imDisEndP`_________________________________________________

Code: [Select]
`Proc sub_complex:    Arguments @Result, @AFactor, @BFActor    Uses ebx, esi    mov ebx D@AFactor    mov esi D@BFactor    mov eax D@Result    fld R\$ebx+phcomplex.reDis | fsub R\$esi+phcomplex.reDis | fstp R\$eax+phcomplex.reDis    fld R\$ebx+phcomplex.imDis | fsub R\$esi+phcomplex.imDis | fstp R\$eax+phcomplex.imDisEndP`_________________________________________________

Example of usage:

Code: [Select]
`[Data1: R\$ 1, 2, 3, 4, 5] ; Input; Output[Data2: Data2.Val1: R\$ 0 Data2.Val2: R\$ 0 Data2.Val3: R\$ 0 Data2.Val4: R\$ 0 Data2.Val5: R\$ 0]call fft Data1, 5, Data2`

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Raistlin

• Member
• Posts: 219
Re: Fast Fourier Transform
« Reply #1 on: May 06, 2017, 04:12:52 AM »
Did we not stump this to death yet
Siekmanski & R*3 where are you?. My math is not well ,
as in cancerous to put it frankly. A newer algo would
spice things up and therefore I am all ears!

Siekmanski

• Member
• Posts: 1058
Re: Fast Fourier Transform
« Reply #2 on: May 06, 2017, 05:39:14 AM »
I'm at home.

rrr314159 did find the optimisations to skip all 0 and 90 degree trig calculations in the first 2 log2 loops.
Because before the bitreversal there is a stride of 4, 8 etc. and thus we can calculate 4 values at once. ( or 8 at once with AVX )

I found more optimisations before the bitreversal, starting in de 3th log2 loop for all the 45 degree rotations.
And we can skip even more 0 and 90 degree trig calculations starting from the 3th log2 loop.
Haven't implemented it yet in my fft source code. ( it's under a layer of dust by now  )

for example:

(6.0 * 0.707) + (9.0 * -0.707) = 2.1

can be replaced by one of these 2 calculations:

(6.0 - 9.0) * -0.707 = 2.1 ; for -45 degree
(9.0 - 6.0) *  0.707 = 2.1 ; for  45 degree

Now we can skip many more trig calculations.
Made a print out of a fft32, the positions and cos sin values to make it more easier to explain.

Code: [Select]
`Bit-reversal = 3200 16 08 24 | 04 20 12 28 | 02 18 10 26 | 06 22 14 3001 17 09 25 | 05 21 13 29 | 03 19 11 27 | 07 23 15 31FFTsize = 32LogLoop1  A  B  Cos    Sin          0  1  1.000  0.000         16 17  1.000  0.000          8  9  1.000  0.000         24 25  1.000  0.000          4  5  1.000  0.000         20 21  1.000  0.000         12 13  1.000  0.000         28 29  1.000  0.000          2  3  1.000  0.000         18 19  1.000  0.000         10 11  1.000  0.000         26 27  1.000  0.000          6  7  1.000  0.000         22 23  1.000  0.000         14 15  1.000  0.000         30 31  1.000  0.000LogLoop2  A  B  Cos    Sin          0  2  1.000  0.000         16 18  1.000  0.000          8 10  1.000  0.000         24 26  1.000  0.000          4  6  1.000  0.000         20 22  1.000  0.000         12 14  1.000  0.000         28 30  1.000  0.000          1  3  0.000  -1.000         17 19  0.000  -1.000          9 11  0.000  -1.000         25 27  0.000  -1.000          5  7  0.000  -1.000         21 23  0.000  -1.000         13 15  0.000  -1.000         29 31  0.000  -1.000LogLoop3  A  B  Cos    Sin          0  4  1.000  0.000         16 20  1.000  0.000          8 12  1.000  0.000         24 28  1.000  0.000          2  6  0.000  -1.000         18 22  0.000  -1.000         10 14  0.000  -1.000         26 30  0.000  -1.000          1  5  0.707  -0.707         17 21  0.707  -0.707          9 13  0.707  -0.707         25 29  0.707  -0.707          3  7  -0.707  -0.707         19 23  -0.707  -0.707         11 15  -0.707  -0.707         27 31  -0.707  -0.707LogLoop4  A  B  Cos    Sin          0  8  1.000  0.000         16 24  1.000  0.000          4 12  -0.000  -1.000         20 28  -0.000  -1.000          2 10  0.707  -0.707         18 26  0.707  -0.707          6 14  -0.707  -0.707         22 30  -0.707  -0.707          1  9  0.924  -0.383         17 25  0.924  -0.383          5 13  -0.383  -0.924         21 29  -0.383  -0.924          3 11  0.383  -0.924         19 27  0.383  -0.924          7 15  -0.924  -0.383         23 31  -0.924  -0.383LogLoop5  A  B  Cos    Sin          0 16  1.000  0.000          1 17  0.981  -0.195          2 18  0.924  -0.383          3 19  0.831  -0.556          4 20  0.707  -0.707          5 21  0.556  -0.831          6 22  0.383  -0.924          7 23  0.195  -0.981          8 24  0.000  -1.000          9 25  -0.195  -0.981         10 26  -0.383  -0.924         11 27  -0.556  -0.831         12 28  -0.707  -0.707         13 29  -0.831  -0.556         14 30  -0.924  -0.383         15 31  -0.981  -0.195`
edit: included the missing brackets.
« Last Edit: May 06, 2017, 04:00:56 PM by Siekmanski »

Raistlin

• Member
• Posts: 219
Re: Fast Fourier Transform
« Reply #3 on: May 06, 2017, 05:48:26 AM »
You guys are seriously weird but KEWL. (think bro love)
This does make sense, thanks for dumbing it down

rrr314159

• Member
• Posts: 1364
Re: Fast Fourier Transform
« Reply #4 on: May 06, 2017, 11:08:44 AM »
Hi guga,

here's a previous FFT thread, http://masm32.com/board/index.php?topic=4231.0, might be of interest. Has some fast routines, using multiple cores and SIMD instructions. (Not RosAsm)

Hi Siekmanski,

Haven't implemented it yet in my fft source code. ( it's under a layer of dust by now  )

Well, get to work! ) Like you, I never really finished my old FFT routine. There's no dust on it - but there is a big layer in my brain! Would take a while to figure out what I was doing, and why. Wish there were more comments

Quote from: Siekmanski
(6.0 - 9.0) * -0.707 = 2.1 ; for -45 degree
(9.0 - 6.0) *  0.707 = 2.1 ; for  45 degree

You left out a couple of brackets, shown above in bold.

Hi Raistlin,

Did you ever get around to finding those pulsars?
I am NaN

Siekmanski

• Member
• Posts: 1058
Re: Fast Fourier Transform
« Reply #5 on: May 06, 2017, 01:38:41 PM »
Hi rrr314159,

Glad to hear from you again. Thanks for correcting the typo's.

Quote from: rrr314159
Well, get to work! ) Like you, I never really finished my old FFT routine. There's no dust on it - but there is a big layer in my brain! Would take a while to figure out what I was doing, and why. Wish there were more comments

Luckily i've commented my FFT sources very heavily, or else I would loose track in an instance.
If I get in the mood, i'll try to fold my brain in the FFT mode again and try to implement it and see if it works out or not.

Siekmanski

• Member
• Posts: 1058
Re: Fast Fourier Transform
« Reply #6 on: May 06, 2017, 03:04:54 PM »
Think I found my latest SSE FFT routine...

Code: [Select]
`align 16FFTSSE proc uses ebx esi edi SinCosTable:DWORD,Complex_Data:DWORD    push    ebp                                            ; To keep the code size small for the Log2 loops ( within 1 or 2 cache lines ) for all FFTsizes,                                             ; we have to put the addresses of the AB stride positions in registers !    mov     esi,Complex_Data                ; RealA = Complex_Data    mov     edi,FFTsize    mov     eax,edi    lea     ebx,[esi+edi*4]                 ; ImagA = Complex_Data + FFTsize*4    lea     ecx,[esi+edi*2]                 ; RealB = Complex_Data + FFTsize*2    lea     edx,[esi+edi*4]    lea     edx,[edx+edi*2]                 ; ImagB = Complex_Data + FFTsize*6                                            ; Because we are dealing with 0 degree and 90 degree rotations we can skip trigonometric calculations                                            ; Before time domain decomposition:                                            ; The stride of the first log2 loop = FFTsize/2                                            ; The stride of the second log2 loop = FFTsize/4                                                                                        ; We can take advantage of that fact and process the first 2 log2 loops with 16 values at once    add     eax,eax                         ; Start position = FFTsize*2align 16first_log2_LP:    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB    movaps  xmm4,xmm0                       ; tempRealA    movaps  xmm5,xmm1                       ; tempImagA    addps   xmm0,xmm2                       ; RealA = RealA + RealB    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB    sub     eax,16    movaps  oword ptr[esi+eax],xmm0         ; new RealA    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA    movaps  oword ptr[ecx+eax],xmm4         ; new RealB    movaps  oword ptr[edx+eax],xmm5         ; new ImagB    jnz     first_log2_LP   ; Loop size = 56 bytes ( within 1 cache line );   mov     CodeSize,\$ - first_log2_LP    sub     ecx,edi                         ; RealB = Complex_Data + FFTsize    sub     edx,edi                         ; ImagB = Complex_Data + FFTsize*5align 16second_log2_Even_LP:    movaps  xmm0,oword ptr[esi+edi-16]      ; RealA    movaps  xmm1,oword ptr[ebx+edi-16]      ; ImagA    movaps  xmm2,oword ptr[ecx+edi-16]      ; RealB    movaps  xmm3,oword ptr[edx+edi-16]      ; ImagB    movaps  xmm4,xmm0                       ; tempRealA    movaps  xmm5,xmm1                       ; tempImagA    addps   xmm0,xmm2                       ; RealA = RealA + RealB    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB    sub     edi,16    movaps  oword ptr[esi+edi],xmm0         ; new RealA    movaps  oword ptr[ebx+edi],xmm1         ; new ImagA    movaps  oword ptr[ecx+edi],xmm4         ; new RealB    movaps  oword ptr[edx+edi],xmm5         ; new ImagB    jnz     second_log2_Even_LP ; Loop size = 56 bytes ( within 1 cache line );   mov     CodeSize,\$-second_log2_Even_LP    mov     eax,FFTsize    add     esi,FFTsize*2                   ; RealA = Complex_Data + FFTsize*2    add     ebx,FFTsize*2                   ; ImagA = Complex_Data + FFTsize*6    add     ecx,FFTsize*2                   ; RealB = Complex_Data + FFTsize*3    add     edx,FFTsize*2                   ; ImagB = Complex_Data + FFTsize*7align 16second_log2_Odd_LP:    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB    movaps  xmm4,xmm0                       ; tempRealA    movaps  xmm5,xmm1                       ; tempImagA    addps   xmm0,xmm3                       ; RealA = RealA + ImagB    addps   xmm5,xmm2                       ; ImagB = tempImagA + RealB    subps   xmm1,xmm2                       ; ImagA = ImagA - RealB    subps   xmm4,xmm3                       ; RealB = tempRealA - ImagB    sub     eax,16    movaps  oword ptr[esi+eax],xmm0         ; new RealA    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA    movaps  oword ptr[ecx+eax],xmm4         ; new RealB    movaps  oword ptr[edx+eax],xmm5         ; new ImagB    jnz     second_log2_Odd_LP  ; Loop size = 56 bytes ( within 1 cache line )    mov     esi,Complex_Data    mov     eax,FFTsize*2                   ; Complex_Data + FFTsize*2    lea     edi,ReverseBinaryTable          ; reverse-binary reindexing table ( time domain decomposition )    lea     edx,[esi+eax*2]                 ; imaginary data ptr = Complex_Data + FFTsize*4    mov     ecx,dword ptr[edi]              ; swap countalign 16RBI_LP:    mov     eax,dword ptr[edi+4]            ; get the swap positions    mov     ebx,dword ptr[edi+8]    movss   xmm0,real4 ptr[esi+eax*2]       ; get real data    movss   xmm1,real4 ptr[esi+ebx*2]       ; get real data    movss   xmm2,real4 ptr[edx+eax*2]       ; get imaginary data    movss   xmm3,real4 ptr[edx+ebx*2]       ; get imaginary data    movss   real4 ptr[esi+eax*2],xmm1       ; swap real data    movss   real4 ptr[esi+ebx*2],xmm0       ; swap real data    movss   real4 ptr[edx+eax*2],xmm3       ; swap imaginary data    movss   real4 ptr[edx+ebx*2],xmm2       ; swap imaginary data    add     edi,8    dec     ecx    jnz     RBI_LP;   mov     CodeSize,\$-RBI_LP    mov     ebp,SinCosTable    mov     edx,4                           ; AB distance = 4align 16Log2_LP:                                    ; Third Log2 loop starts here, loop (log2(FFTsize)-2) times.    mov     edi,edx                         ; AB distance    add     edx,edx                         ; Step == AB distance * 2    xor     ecx,ecx                         ; Next Log2 loop, first A position = 0Butterfly_LP:                               ; For strides 4, 8, 16 etc.    movaps  xmm6,oword ptr[ebp]             ; Load Cos    movaps  xmm7,oword ptr[ebp+16]          ; Load Sin    mov     eax,ecx                         ; Next A positionDecomposition_LP:    mov     ebx,eax                         ; A position    add     ebx,edi                         ; B position = A + AB distance                 movaps  xmm0,oword ptr[esi+eax*4]       ; RealA    movaps  xmm1,oword ptr[esi+eax*4+FFTsize*4]     ; ImagA    movaps  xmm2,oword ptr[esi+ebx*4]       ; RealB    movaps  xmm3,oword ptr[esi+ebx*4+FFTsize*4]     ; ImagB    movaps  xmm4,xmm2                       ; RealB copy    movaps  xmm5,xmm3                       ; ImagB copy    mulps   xmm2,xmm6                       ; RealB * cos    mulps   xmm3,xmm7                       ; ImagB * sin    mulps   xmm4,xmm7                       ; RealB * sin    mulps   xmm5,xmm6                       ; ImagB * cos    subps   xmm2,xmm3                       ; tempReal = RealB * cos - ImagB * sin    addps   xmm4,xmm5                       ; tempImag = RealB * sin + ImagB * cos    movaps  xmm3,xmm0                       ; RealA copy    movaps  xmm5,xmm1                       ; ImagA copy    addps   xmm0,xmm2                       ; RealA = RealA + tempReal    addps   xmm1,xmm4                       ; ImagA = ImagA + tempImag    subps   xmm3,xmm2                       ; RealB = RealA - tempReal    subps   xmm5,xmm4                       ; ImagB = ImagA - tempImag    movaps  oword ptr[esi+eax*4],xmm0       ; new RealA    movaps  oword ptr[esi+eax*4+FFTsize*4],xmm1     ; new ImagA    movaps  oword ptr[esi+ebx*4],xmm3       ; new RealB    movaps  oword ptr[esi+ebx*4+FFTsize*4],xmm5     ; new ImagB    add     eax,edx                         ; Next step in FFT data    cmp     eax,FFTsize                     ; Do we need the next 4 Sine and Cosine values ?    jb      Decomposition_LP    add     ebp,32                          ; Next step in SinCosTable    add     ecx,4                           ; Next A position    cmp     ecx,edi                         ; Is current Log2 loop done ?    jb      Butterfly_LP    cmp     edx,FFTsize                     ; Are all Log2 loops done ?    jb      Log2_LP     ; Loop size = 116 bytes    pop     ebp     ret    FFTSSE endp`

Raistlin

• Member
• Posts: 219
Re: Fast Fourier Transform
« Reply #7 on: May 08, 2017, 03:42:35 PM »
Quote
Hi Raistlin,

Did you ever get around to finding those pulsars?

Hail master rrr314159/Siekmanski, Nope did'nt find those pulsars yet (dreamy look), but I'am on my way...

Writing a paper on a High Performance / High Throughput Computing Hybrid grid that may be
able to implement  FFT searches in parallel for the use of binary star pulsar searches and much more...
Still knocking it out though so don't hold you're breath just yet.... takes forever - thesis writing part that is.
Coding part - well on its way - all in >ASM<

Hope you're all still doing well, miss chatting with you (mayhap I'll drop you a PM so you can do some beta testing ??)

Raistlin

rrr314159

• Member
• Posts: 1364
Re: Fast Fourier Transform
« Reply #8 on: May 13, 2017, 11:03:31 AM »
@Siekmanski,

You've got the right idea, comment almost every line. When coding, comments seem superfluous (it's all so obvious), but later on I regret leaving them out.

@raistlin,

No doubt there are a lot of pulsars out there waiting to be discovered. It's interesting that shortly after our 2015 FFT thread, LIGO finally detected gravitational waves. That adds considerable legitimacy to Hulse & Taylor. Wouldn't be surprised if it gives a boost to pulsar research, looking for more examples like PSR B1913+16.

Good luck with your thesis! Finishing it will be one of the great moments of your life. Unfortunately health problems make me an unlikely beta tester these days, but I'm with you in spirit :-)

BTW, looks like you guys were right about Trump. I'm not happy with him. Still, better than Hillary
I am NaN