Fast Fourier Transform

guga · April 04, 2017, 05:17:53 AM

Hi guys.

I ported a Fast Fourier Transform used in pHash algo to RosAsm. I hope it can be usefull for someone. It seems slow, though :(

Code Select



[phcomplex.reDis 0
 phcomplex.imDis 8]

[Size_Of_phcomplex 16]

[Float_Two_PI: R$ 6.28318530717958623]
[TmpFloat1: R$ 0]
[Float_One: R$ 1.0]

; InputX: Input array of Real data (Float Double)
; nfactor: The total amount of elements on the array
; XFactor: A buffer that will store the resultant values. (In Double float too.)

Proc fft:
    Arguments @InputX, @nfactor, @XFactor
    Local @twiddle_factors, @XtStruct, @kCounter, @MaxCount, @CurTwiddle_factors
    Uses ecx, edx

    mov eax D@nfactor | shr eax 1 | shl eax 4
    C_call 'msvcrt.malloc' eax
    mov D@twiddle_factors eax
    mov D@CurTwiddle_factors eax

    mov eax D@nfactor | shl eax 4 ; mul by Size_Of_phcomplex
    C_call 'msvcrt.malloc' eax
    mov D@XtStruct eax

    mov D@kCounter 0

    mov ebx D@nfactor | shr ebx 1

    .While D@kCounter < ebx

        fild F@kCounter | fmul R$Float_Two_PI | fidiv F@nfactor | fstp R$TmpFloat1
        call polar_to_complex D@CurTwiddle_factors, Float_One, TmpFloat1

        add D@CurTwiddle_factors Size_Of_phcomplex
        inc D@kCounter
    .End_While

    call fft_calc D@nfactor, D@InputX, D@XFactor, D@XtStruct, 1, D@twiddle_factors
    C_call 'msvcrt.free' D@twiddle_factors
    C_call 'msvcrt.free' D@XtStruct
    xor eax eax

EndP

_________________________________

Code Select


Proc polar_to_complex:
    Arguments @Output, @RValue, @Theta
    Uses edi, ebx

    mov eax D@Output
    mov ebx D@Theta
    mov edi D@RValue
    fld R$ebx | fcos | fmul R$edi | fstp R$eax+phcomplex.reDis
    fld R$ebx | fsin | fmul R$edi | fstp R$eax+phcomplex.imDis

EndP

________________________________________________

Code Select


Proc fft_calc:
    Arguments @nfactor, @InputX, @XFactor, @XtStruct, @step, @twids
    Local @kCounter, @SComplex, @MaxStep, @MaxNCount, @CurXtStruct, @CurXFactor, @CurSComplex, @CurXtStruct2
    Uses ecx, ebx, edi

    If D@nfactor = 1
        mov edi D@XFactor
        mov eax D@InputX
        fld R$eax | fstp R$edi+phcomplex.reDis
        fldz | fstp R$edi+phcomplex.imDis
        ExitP
    End_If

    mov ecx D@step | shl ecx 1 | mov D@MaxStep ecx
    mov eax D@nfactor | shr eax 1 | mov D@MaxNCount eax
    shl eax 4 | add eax D@XtStruct | mov D@SComplex eax

    call fft_calc D@MaxNCount, D@InputX, D@SComplex, D@XFactor, D@MaxStep, D@twids

    mov eax D@step | mov ecx D@InputX | lea ebx D$ecx+eax*8
    call fft_calc D@MaxNCount, ebx, D@XtStruct, D@XFactor, D@MaxStep, D@twids

    mov eax D@XtStruct | mov D@CurXtStruct eax | mov D@CurXtStruct2 eax
    mov eax D@XFactor | mov D@CurXFactor eax
    mov eax D@SComplex | mov D@CurSComplex eax

    mov D@kCounter 0
    mov ebx D@MaxNCount
    .While D@kCounter < ebx

        mov ecx D@kCounter | imul ecx D@Step | shl ecx 4 | add ecx D@twids
        call mult_complex D@CurXtStruct2, D@CurXtStruct, ecx
        call add_complex D@CurXFactor, D@CurSComplex, D@CurXtStruct2

        mov eax ebx | add eax D@kCounter | shl eax 4 | add eax D@XFactor
        call sub_complex eax, D@CurSComplex, D@CurXtStruct2

        add D@CurXtStruct Size_Of_phcomplex
        add D@CurXFactor Size_Of_phcomplex
        add D@CurSComplex Size_Of_phcomplex

        inc D@kCounter
    .End_While

EndP

_________________________________________________

Code Select


Proc mult_complex:
    Arguments @Result, @AFactor, @BFActor
    Uses ebx, esi, edi

    mov ebx D@AFactor
    mov esi D@BFactor
    mov eax D@Result

    fld R$ebx+phcomplex.reDis | fmul R$esi+phcomplex.reDis
    fld R$ebx+phcomplex.imDis | fmul R$esi+phcomplex.imDis
    fsubp ST1 ST0
    fstp R$eax+phcomplex.reDis

    fld R$ebx+phcomplex.reDis | fmul R$esi+phcomplex.imDis
    fld R$ebx+phcomplex.imDis | fmul R$esi+phcomplex.reDis
    faddp ST1 ST0
    fstp R$eax+phcomplex.imDis

EndP

_________________________________________________

Code Select

Proc add_complex:
    Arguments @Result, @AFactor, @BFActor
    Uses ebx, esi

    mov ebx D@AFactor
    mov esi D@BFactor
    mov eax D@Result

    fld R$ebx+phcomplex.reDis | fadd R$esi+phcomplex.reDis | fstp R$eax+phcomplex.reDis
    fld R$ebx+phcomplex.imDis | fadd R$esi+phcomplex.imDis | fstp R$eax+phcomplex.imDis

EndP

_________________________________________________

Code Select

Proc sub_complex:
    Arguments @Result, @AFactor, @BFActor
    Uses ebx, esi

    mov ebx D@AFactor
    mov esi D@BFactor
    mov eax D@Result

    fld R$ebx+phcomplex.reDis | fsub R$esi+phcomplex.reDis | fstp R$eax+phcomplex.reDis
    fld R$ebx+phcomplex.imDis | fsub R$esi+phcomplex.imDis | fstp R$eax+phcomplex.imDis

EndP

_________________________________________________

Example of usage:

Code Select

[Data1: R$ 1, 2, 3, 4, 5] ; Input

; Output
[Data2:
 Data2.Val1: R$ 0
 Data2.Val2: R$ 0
 Data2.Val3: R$ 0
 Data2.Val4: R$ 0
 Data2.Val5: R$ 0]

call fft Data1, 5, Data2

Raistlin · May 06, 2017, 04:12:52 AM

Did we not stump this to death yet :icon_rolleyes:
Siekmanski & R*3 where are you?. My math is not well ,
as in cancerous to put it frankly. A newer algo would
spice things up and therefore I am all ears!

Siekmanski · May 06, 2017, 05:39:14 AM

I'm at home.

rrr314159 did find the optimisations to skip all 0 and 90 degree trig calculations in the first 2 log2 loops.
Because before the bitreversal there is a stride of 4, 8 etc. and thus we can calculate 4 values at once. ( or 8 at once with AVX )

I found more optimisations before the bitreversal, starting in de 3th log2 loop for all the 45 degree rotations.
And we can skip even more 0 and 90 degree trig calculations starting from the 3th log2 loop.
Haven't implemented it yet in my fft source code. ( it's under a layer of dust by now

)

for example:

(6.0 * 0.707) + (9.0 * -0.707) = 2.1

can be replaced by one of these 2 calculations:

(6.0 - 9.0) * -0.707 = 2.1 ; for -45 degree
(9.0 - 6.0) * 0.707 = 2.1 ; for 45 degree

Now we can skip many more trig calculations.
Made a print out of a fft32, the positions and cos sin values to make it more easier to explain.

Code Select

Bit-reversal = 32
00 16 08 24 | 04 20 12 28 | 02 18 10 26 | 06 22 14 30
01 17 09 25 | 05 21 13 29 | 03 19 11 27 | 07 23 15 31

FFTsize = 32
LogLoop1  A  B  Cos    Sin
          0  1  1.000  0.000
         16 17  1.000  0.000
          8  9  1.000  0.000
         24 25  1.000  0.000

          4  5  1.000  0.000
         20 21  1.000  0.000
         12 13  1.000  0.000
         28 29  1.000  0.000

          2  3  1.000  0.000
         18 19  1.000  0.000
         10 11  1.000  0.000
         26 27  1.000  0.000

          6  7  1.000  0.000
         22 23  1.000  0.000
         14 15  1.000  0.000
         30 31  1.000  0.000

LogLoop2  A  B  Cos    Sin
          0  2  1.000  0.000
         16 18  1.000  0.000
          8 10  1.000  0.000
         24 26  1.000  0.000

          4  6  1.000  0.000
         20 22  1.000  0.000
         12 14  1.000  0.000
         28 30  1.000  0.000

          1  3  0.000  -1.000
         17 19  0.000  -1.000
          9 11  0.000  -1.000
         25 27  0.000  -1.000

          5  7  0.000  -1.000
         21 23  0.000  -1.000
         13 15  0.000  -1.000
         29 31  0.000  -1.000

LogLoop3  A  B  Cos    Sin
          0  4  1.000  0.000
         16 20  1.000  0.000
          8 12  1.000  0.000
         24 28  1.000  0.000

          2  6  0.000  -1.000
         18 22  0.000  -1.000
         10 14  0.000  -1.000
         26 30  0.000  -1.000

          1  5  0.707  -0.707
         17 21  0.707  -0.707
          9 13  0.707  -0.707
         25 29  0.707  -0.707

          3  7  -0.707  -0.707
         19 23  -0.707  -0.707
         11 15  -0.707  -0.707
         27 31  -0.707  -0.707

LogLoop4  A  B  Cos    Sin
          0  8  1.000  0.000
         16 24  1.000  0.000
          4 12  -0.000  -1.000
         20 28  -0.000  -1.000

          2 10  0.707  -0.707
         18 26  0.707  -0.707
          6 14  -0.707  -0.707
         22 30  -0.707  -0.707

          1  9  0.924  -0.383
         17 25  0.924  -0.383
          5 13  -0.383  -0.924
         21 29  -0.383  -0.924

          3 11  0.383  -0.924
         19 27  0.383  -0.924
          7 15  -0.924  -0.383
         23 31  -0.924  -0.383

LogLoop5  A  B  Cos    Sin
          0 16  1.000  0.000
          1 17  0.981  -0.195
          2 18  0.924  -0.383
          3 19  0.831  -0.556
          4 20  0.707  -0.707
          5 21  0.556  -0.831
          6 22  0.383  -0.924
          7 23  0.195  -0.981
          8 24  0.000  -1.000
          9 25  -0.195  -0.981
         10 26  -0.383  -0.924
         11 27  -0.556  -0.831
         12 28  -0.707  -0.707
         13 29  -0.831  -0.556
         14 30  -0.924  -0.383
         15 31  -0.981  -0.195

edit: included the missing brackets.

Raistlin · May 06, 2017, 05:48:26 AM

You guys are seriously weird but KEWL. (think bro love)
This does make sense, thanks for dumbing it down

rrr314159 · May 06, 2017, 11:08:44 AM

Hi guga,

here's a previous FFT thread, http://masm32.com/board/index.php?topic=4231.0, might be of interest. Has some fast routines, using multiple cores and SIMD instructions. (Not RosAsm)

Hi Siekmanski,

Quote from: Siekmanski on May 06, 2017, 05:39:14 AM
Haven't implemented it yet in my fft source code. ( it's under a layer of dust by now )

Well, get to work!

) Like you, I never really finished my old FFT routine. There's no dust on it - but there is a big layer in my brain! Would take a while to figure out what I was doing, and why. Wish there were more comments :icon_confused:

Quote from: Siekmanski
(6.0 - 9.0) * -0.707 = 2.1 ; for -45 degree
(9.0 - 6.0) * 0.707 = 2.1 ; for 45 degree

You left out a couple of brackets, shown above in bold.

Hi Raistlin,

Did you ever get around to finding those pulsars?

Siekmanski · May 06, 2017, 01:38:41 PM

Hi rrr314159,

Glad to hear from you again. Thanks for correcting the typo's.

Quote from: rrr314159
Well, get to work! ) Like you, I never really finished my old FFT routine. There's no dust on it - but there is a big layer in my brain! Would take a while to figure out what I was doing, and why. Wish there were more comments :icon_confused:

Luckily i've commented my FFT sources very heavily, or else I would loose track in an instance.
If I get in the mood, i'll try to fold my brain in the FFT mode again and try to implement it and see if it works out or not.

Siekmanski · May 06, 2017, 03:04:54 PM

Think I found my latest SSE FFT routine...
With a few comments.

Code Select


align 16
FFTSSE proc uses ebx esi edi SinCosTable:DWORD,Complex_Data:DWORD

    push    ebp
                                            ; To keep the code size small for the Log2 loops ( within 1 or 2 cache lines ) for all FFTsizes, 
                                            ; we have to put the addresses of the AB stride positions in registers !
    mov     esi,Complex_Data                ; RealA = Complex_Data
    mov     edi,FFTsize
    mov     eax,edi
    lea     ebx,[esi+edi*4]                 ; ImagA = Complex_Data + FFTsize*4
    lea     ecx,[esi+edi*2]                 ; RealB = Complex_Data + FFTsize*2
    lea     edx,[esi+edi*4]
    lea     edx,[edx+edi*2]                 ; ImagB = Complex_Data + FFTsize*6

                                            ; Because we are dealing with 0 degree and 90 degree rotations we can skip trigonometric calculations
                                            ; Before time domain decomposition:
                                            ; The stride of the first log2 loop = FFTsize/2
                                            ; The stride of the second log2 loop = FFTsize/4                                            
                                            ; We can take advantage of that fact and process the first 2 log2 loops with 16 values at once
    add     eax,eax                         ; Start position = FFTsize*2
align 16
first_log2_LP:
    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB
    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm2                       ; RealA = RealA + RealB
    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB
    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB
    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB

    sub     eax,16

    movaps  oword ptr[esi+eax],xmm0         ; new RealA
    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA
    movaps  oword ptr[ecx+eax],xmm4         ; new RealB
    movaps  oword ptr[edx+eax],xmm5         ; new ImagB
    jnz     first_log2_LP   ; Loop size = 56 bytes ( within 1 cache line )
;   mov     CodeSize,$ - first_log2_LP


    sub     ecx,edi                         ; RealB = Complex_Data + FFTsize
    sub     edx,edi                         ; ImagB = Complex_Data + FFTsize*5
align 16
second_log2_Even_LP:
    movaps  xmm0,oword ptr[esi+edi-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+edi-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+edi-16]      ; RealB
    movaps  xmm3,oword ptr[edx+edi-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm2                       ; RealA = RealA + RealB
    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB
    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB
    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB

    sub     edi,16

    movaps  oword ptr[esi+edi],xmm0         ; new RealA
    movaps  oword ptr[ebx+edi],xmm1         ; new ImagA
    movaps  oword ptr[ecx+edi],xmm4         ; new RealB
    movaps  oword ptr[edx+edi],xmm5         ; new ImagB
    jnz     second_log2_Even_LP ; Loop size = 56 bytes ( within 1 cache line )
;   mov     CodeSize,$-second_log2_Even_LP

    mov     eax,FFTsize
    add     esi,FFTsize*2                   ; RealA = Complex_Data + FFTsize*2
    add     ebx,FFTsize*2                   ; ImagA = Complex_Data + FFTsize*6
    add     ecx,FFTsize*2                   ; RealB = Complex_Data + FFTsize*3
    add     edx,FFTsize*2                   ; ImagB = Complex_Data + FFTsize*7
align 16
second_log2_Odd_LP:
    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB
    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm3                       ; RealA = RealA + ImagB
    addps   xmm5,xmm2                       ; ImagB = tempImagA + RealB
    subps   xmm1,xmm2                       ; ImagA = ImagA - RealB
    subps   xmm4,xmm3                       ; RealB = tempRealA - ImagB

    sub     eax,16

    movaps  oword ptr[esi+eax],xmm0         ; new RealA
    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA
    movaps  oword ptr[ecx+eax],xmm4         ; new RealB
    movaps  oword ptr[edx+eax],xmm5         ; new ImagB
    jnz     second_log2_Odd_LP  ; Loop size = 56 bytes ( within 1 cache line )


    mov     esi,Complex_Data
    mov     eax,FFTsize*2                   ; Complex_Data + FFTsize*2
    lea     edi,ReverseBinaryTable          ; reverse-binary reindexing table ( time domain decomposition )
    lea     edx,[esi+eax*2]                 ; imaginary data ptr = Complex_Data + FFTsize*4
    mov     ecx,dword ptr[edi]              ; swap count
align 16
RBI_LP:
    mov     eax,dword ptr[edi+4]            ; get the swap positions
    mov     ebx,dword ptr[edi+8]
    movss   xmm0,real4 ptr[esi+eax*2]       ; get real data
    movss   xmm1,real4 ptr[esi+ebx*2]       ; get real data
    movss   xmm2,real4 ptr[edx+eax*2]       ; get imaginary data
    movss   xmm3,real4 ptr[edx+ebx*2]       ; get imaginary data
    movss   real4 ptr[esi+eax*2],xmm1       ; swap real data
    movss   real4 ptr[esi+ebx*2],xmm0       ; swap real data
    movss   real4 ptr[edx+eax*2],xmm3       ; swap imaginary data
    movss   real4 ptr[edx+ebx*2],xmm2       ; swap imaginary data
    add     edi,8
    dec     ecx
    jnz     RBI_LP
;   mov     CodeSize,$-RBI_LP

    mov     ebp,SinCosTable
    mov     edx,4                           ; AB distance = 4
align 16
Log2_LP:                                    ; Third Log2 loop starts here, loop (log2(FFTsize)-2) times.
    mov     edi,edx                         ; AB distance
    add     edx,edx                         ; Step == AB distance * 2
    xor     ecx,ecx                         ; Next Log2 loop, first A position = 0
Butterfly_LP:                               ; For strides 4, 8, 16 etc.
    movaps  xmm6,oword ptr[ebp]             ; Load Cos
    movaps  xmm7,oword ptr[ebp+16]          ; Load Sin
    mov     eax,ecx                         ; Next A position
Decomposition_LP:
    mov     ebx,eax                         ; A position
    add     ebx,edi                         ; B position = A + AB distance             

    movaps  xmm0,oword ptr[esi+eax*4]       ; RealA
    movaps  xmm1,oword ptr[esi+eax*4+FFTsize*4]     ; ImagA
    movaps  xmm2,oword ptr[esi+ebx*4]       ; RealB
    movaps  xmm3,oword ptr[esi+ebx*4+FFTsize*4]     ; ImagB
    movaps  xmm4,xmm2                       ; RealB copy
    movaps  xmm5,xmm3                       ; ImagB copy
    mulps   xmm2,xmm6                       ; RealB * cos
    mulps   xmm3,xmm7                       ; ImagB * sin
    mulps   xmm4,xmm7                       ; RealB * sin
    mulps   xmm5,xmm6                       ; ImagB * cos
    subps   xmm2,xmm3                       ; tempReal = RealB * cos - ImagB * sin
    addps   xmm4,xmm5                       ; tempImag = RealB * sin + ImagB * cos
    movaps  xmm3,xmm0                       ; RealA copy
    movaps  xmm5,xmm1                       ; ImagA copy
    addps   xmm0,xmm2                       ; RealA = RealA + tempReal
    addps   xmm1,xmm4                       ; ImagA = ImagA + tempImag
    subps   xmm3,xmm2                       ; RealB = RealA - tempReal
    subps   xmm5,xmm4                       ; ImagB = ImagA - tempImag
    movaps  oword ptr[esi+eax*4],xmm0       ; new RealA
    movaps  oword ptr[esi+eax*4+FFTsize*4],xmm1     ; new ImagA
    movaps  oword ptr[esi+ebx*4],xmm3       ; new RealB
    movaps  oword ptr[esi+ebx*4+FFTsize*4],xmm5     ; new ImagB

    add     eax,edx                         ; Next step in FFT data
    cmp     eax,FFTsize                     ; Do we need the next 4 Sine and Cosine values ?
    jb      Decomposition_LP
    add     ebp,32                          ; Next step in SinCosTable
    add     ecx,4                           ; Next A position
    cmp     ecx,edi                         ; Is current Log2 loop done ?
    jb      Butterfly_LP
    cmp     edx,FFTsize                     ; Are all Log2 loops done ?
    jb      Log2_LP     ; Loop size = 116 bytes
    pop     ebp 
    ret    

FFTSSE endp

Raistlin · May 08, 2017, 03:42:35 PM

QuoteHi Raistlin,

Did you ever get around to finding those pulsars?

Hail master rrr314159/Siekmanski, Nope did'nt find those pulsars yet (dreamy look), but I'am on my way...

Writing a paper on a High Performance / High Throughput Computing Hybrid grid that may be
able to implement FFT searches in parallel for the use of binary star pulsar searches and much more...
Still knocking it out though so don't hold you're breath just yet.... takes forever - thesis writing part that is.
Coding part - well on its way - all in >ASM<

Hope you're all still doing well, miss chatting with you (mayhap I'll drop you a PM so you can do some beta testing ??)

Raistlin

rrr314159 · May 13, 2017, 11:03:31 AM

@Siekmanski,

You've got the right idea, comment almost every line. When coding, comments seem superfluous (it's all so obvious), but later on I regret leaving them out.

@raistlin,

No doubt there are a lot of pulsars out there waiting to be discovered. It's interesting that shortly after our 2015 FFT thread, LIGO finally detected gravitational waves. That adds considerable legitimacy to Hulse & Taylor. Wouldn't be surprised if it gives a boost to pulsar research, looking for more examples like PSR B1913+16.

Good luck with your thesis! Finishing it will be one of the great moments of your life. Unfortunately health problems make me an unlikely beta tester these days, but I'm with you in spirit :-)

BTW, looks like you guys were right about Trump. I'm not happy with him. Still, better than Hillary

The MASM Forum

News: