News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Fast Fourier Transform

Started by guga, April 04, 2017, 05:17:53 AM

Previous topic - Next topic

guga

Hi guys.

I ported a Fast Fourier Transform used in pHash algo to RosAsm. I hope it can be usefull for someone. It seems slow, though :(



[phcomplex.reDis 0
phcomplex.imDis 8]

[Size_Of_phcomplex 16]

[Float_Two_PI: R$ 6.28318530717958623]
[TmpFloat1: R$ 0]
[Float_One: R$ 1.0]

; InputX: Input array of Real data (Float Double)
; nfactor: The total amount of elements on the array
; XFactor: A buffer that will store the resultant values. (In Double float too.)

Proc fft:
    Arguments @InputX, @nfactor, @XFactor
    Local @twiddle_factors, @XtStruct, @kCounter, @MaxCount, @CurTwiddle_factors
    Uses ecx, edx

    mov eax D@nfactor | shr eax 1 | shl eax 4
    C_call 'msvcrt.malloc' eax
    mov D@twiddle_factors eax
    mov D@CurTwiddle_factors eax

    mov eax D@nfactor | shl eax 4 ; mul by Size_Of_phcomplex
    C_call 'msvcrt.malloc' eax
    mov D@XtStruct eax

    mov D@kCounter 0

    mov ebx D@nfactor | shr ebx 1

    .While D@kCounter < ebx

        fild F@kCounter | fmul R$Float_Two_PI | fidiv F@nfactor | fstp R$TmpFloat1
        call polar_to_complex D@CurTwiddle_factors, Float_One, TmpFloat1

        add D@CurTwiddle_factors Size_Of_phcomplex
        inc D@kCounter
    .End_While

    call fft_calc D@nfactor, D@InputX, D@XFactor, D@XtStruct, 1, D@twiddle_factors
    C_call 'msvcrt.free' D@twiddle_factors
    C_call 'msvcrt.free' D@XtStruct
    xor eax eax

EndP

_________________________________

Proc polar_to_complex:
    Arguments @Output, @RValue, @Theta
    Uses edi, ebx

    mov eax D@Output
    mov ebx D@Theta
    mov edi D@RValue
    fld R$ebx | fcos | fmul R$edi | fstp R$eax+phcomplex.reDis
    fld R$ebx | fsin | fmul R$edi | fstp R$eax+phcomplex.imDis

EndP

________________________________________________

Proc fft_calc:
    Arguments @nfactor, @InputX, @XFactor, @XtStruct, @step, @twids
    Local @kCounter, @SComplex, @MaxStep, @MaxNCount, @CurXtStruct, @CurXFactor, @CurSComplex, @CurXtStruct2
    Uses ecx, ebx, edi

    If D@nfactor = 1
        mov edi D@XFactor
        mov eax D@InputX
        fld R$eax | fstp R$edi+phcomplex.reDis
        fldz | fstp R$edi+phcomplex.imDis
        ExitP
    End_If

    mov ecx D@step | shl ecx 1 | mov D@MaxStep ecx
    mov eax D@nfactor | shr eax 1 | mov D@MaxNCount eax
    shl eax 4 | add eax D@XtStruct | mov D@SComplex eax

    call fft_calc D@MaxNCount, D@InputX, D@SComplex, D@XFactor, D@MaxStep, D@twids

    mov eax D@step | mov ecx D@InputX | lea ebx D$ecx+eax*8
    call fft_calc D@MaxNCount, ebx, D@XtStruct, D@XFactor, D@MaxStep, D@twids

    mov eax D@XtStruct | mov D@CurXtStruct eax | mov D@CurXtStruct2 eax
    mov eax D@XFactor | mov D@CurXFactor eax
    mov eax D@SComplex | mov D@CurSComplex eax

    mov D@kCounter 0
    mov ebx D@MaxNCount
    .While D@kCounter < ebx

        mov ecx D@kCounter | imul ecx D@Step | shl ecx 4 | add ecx D@twids
        call mult_complex D@CurXtStruct2, D@CurXtStruct, ecx
        call add_complex D@CurXFactor, D@CurSComplex, D@CurXtStruct2

        mov eax ebx | add eax D@kCounter | shl eax 4 | add eax D@XFactor
        call sub_complex eax, D@CurSComplex, D@CurXtStruct2

        add D@CurXtStruct Size_Of_phcomplex
        add D@CurXFactor Size_Of_phcomplex
        add D@CurSComplex Size_Of_phcomplex

        inc D@kCounter
    .End_While

EndP

_________________________________________________

Proc mult_complex:
    Arguments @Result, @AFactor, @BFActor
    Uses ebx, esi, edi

    mov ebx D@AFactor
    mov esi D@BFactor
    mov eax D@Result

    fld R$ebx+phcomplex.reDis | fmul R$esi+phcomplex.reDis
    fld R$ebx+phcomplex.imDis | fmul R$esi+phcomplex.imDis
    fsubp ST1 ST0
    fstp R$eax+phcomplex.reDis

    fld R$ebx+phcomplex.reDis | fmul R$esi+phcomplex.imDis
    fld R$ebx+phcomplex.imDis | fmul R$esi+phcomplex.reDis
    faddp ST1 ST0
    fstp R$eax+phcomplex.imDis

EndP

_________________________________________________

Proc add_complex:
    Arguments @Result, @AFactor, @BFActor
    Uses ebx, esi

    mov ebx D@AFactor
    mov esi D@BFactor
    mov eax D@Result

    fld R$ebx+phcomplex.reDis | fadd R$esi+phcomplex.reDis | fstp R$eax+phcomplex.reDis
    fld R$ebx+phcomplex.imDis | fadd R$esi+phcomplex.imDis | fstp R$eax+phcomplex.imDis

EndP

_________________________________________________

Proc sub_complex:
    Arguments @Result, @AFactor, @BFActor
    Uses ebx, esi

    mov ebx D@AFactor
    mov esi D@BFactor
    mov eax D@Result

    fld R$ebx+phcomplex.reDis | fsub R$esi+phcomplex.reDis | fstp R$eax+phcomplex.reDis
    fld R$ebx+phcomplex.imDis | fsub R$esi+phcomplex.imDis | fstp R$eax+phcomplex.imDis

EndP

_________________________________________________

Example of usage:

[Data1: R$ 1, 2, 3, 4, 5] ; Input

; Output
[Data2:
Data2.Val1: R$ 0
Data2.Val2: R$ 0
Data2.Val3: R$ 0
Data2.Val4: R$ 0
Data2.Val5: R$ 0]

call fft Data1, 5, Data2





Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Raistlin

Did we not stump this to death yet  :icon_rolleyes:
Siekmanski & R*3 where are you?. My math is not well ,
as in cancerous to put it frankly. A newer algo would
spice things up and therefore I am all ears!
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

Siekmanski

#2
I'm at home.  :biggrin:

rrr314159 did find the optimisations to skip all 0 and 90 degree trig calculations in the first 2 log2 loops.
Because before the bitreversal there is a stride of 4, 8 etc. and thus we can calculate 4 values at once. ( or 8 at once with AVX )

I found more optimisations before the bitreversal, starting in de 3th log2 loop for all the 45 degree rotations.
And we can skip even more 0 and 90 degree trig calculations starting from the 3th log2 loop.
Haven't implemented it yet in my fft source code. ( it's under a layer of dust by now  :biggrin:)

for example:

(6.0 * 0.707) + (9.0 * -0.707) = 2.1

can be replaced by one of these 2 calculations:

(6.0 - 9.0) * -0.707 = 2.1 ; for -45 degree
(9.0 - 6.0) *  0.707 = 2.1 ; for  45 degree

Now we can skip many more trig calculations.
Made a print out of a fft32, the positions and cos sin values to make it more easier to explain.

Bit-reversal = 32
00 16 08 24 | 04 20 12 28 | 02 18 10 26 | 06 22 14 30
01 17 09 25 | 05 21 13 29 | 03 19 11 27 | 07 23 15 31

FFTsize = 32
LogLoop1  A  B  Cos    Sin
          0  1  1.000  0.000
         16 17  1.000  0.000
          8  9  1.000  0.000
         24 25  1.000  0.000

          4  5  1.000  0.000
         20 21  1.000  0.000
         12 13  1.000  0.000
         28 29  1.000  0.000

          2  3  1.000  0.000
         18 19  1.000  0.000
         10 11  1.000  0.000
         26 27  1.000  0.000

          6  7  1.000  0.000
         22 23  1.000  0.000
         14 15  1.000  0.000
         30 31  1.000  0.000

LogLoop2  A  B  Cos    Sin
          0  2  1.000  0.000
         16 18  1.000  0.000
          8 10  1.000  0.000
         24 26  1.000  0.000

          4  6  1.000  0.000
         20 22  1.000  0.000
         12 14  1.000  0.000
         28 30  1.000  0.000

          1  3  0.000  -1.000
         17 19  0.000  -1.000
          9 11  0.000  -1.000
         25 27  0.000  -1.000

          5  7  0.000  -1.000
         21 23  0.000  -1.000
         13 15  0.000  -1.000
         29 31  0.000  -1.000

LogLoop3  A  B  Cos    Sin
          0  4  1.000  0.000
         16 20  1.000  0.000
          8 12  1.000  0.000
         24 28  1.000  0.000

          2  6  0.000  -1.000
         18 22  0.000  -1.000
         10 14  0.000  -1.000
         26 30  0.000  -1.000

          1  5  0.707  -0.707
         17 21  0.707  -0.707
          9 13  0.707  -0.707
         25 29  0.707  -0.707

          3  7  -0.707  -0.707
         19 23  -0.707  -0.707
         11 15  -0.707  -0.707
         27 31  -0.707  -0.707

LogLoop4  A  B  Cos    Sin
          0  8  1.000  0.000
         16 24  1.000  0.000
          4 12  -0.000  -1.000
         20 28  -0.000  -1.000

          2 10  0.707  -0.707
         18 26  0.707  -0.707
          6 14  -0.707  -0.707
         22 30  -0.707  -0.707

          1  9  0.924  -0.383
         17 25  0.924  -0.383
          5 13  -0.383  -0.924
         21 29  -0.383  -0.924

          3 11  0.383  -0.924
         19 27  0.383  -0.924
          7 15  -0.924  -0.383
         23 31  -0.924  -0.383

LogLoop5  A  B  Cos    Sin
          0 16  1.000  0.000
          1 17  0.981  -0.195
          2 18  0.924  -0.383
          3 19  0.831  -0.556
          4 20  0.707  -0.707
          5 21  0.556  -0.831
          6 22  0.383  -0.924
          7 23  0.195  -0.981
          8 24  0.000  -1.000
          9 25  -0.195  -0.981
         10 26  -0.383  -0.924
         11 27  -0.556  -0.831
         12 28  -0.707  -0.707
         13 29  -0.831  -0.556
         14 30  -0.924  -0.383
         15 31  -0.981  -0.195


edit: included the missing brackets.
Creative coders use backward thinking techniques as a strategy.

Raistlin

You guys are seriously weird but KEWL. (think bro love)
This does make sense, thanks for dumbing it down :bgrin:
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

Hi guga,

here's a previous FFT thread, http://masm32.com/board/index.php?topic=4231.0, might be of interest. Has some fast routines, using multiple cores and SIMD instructions. (Not RosAsm)

Hi Siekmanski,

Quote from: Siekmanski on May 06, 2017, 05:39:14 AM
Haven't implemented it yet in my fft source code. ( it's under a layer of dust by now  :biggrin:)

Well, get to work! :biggrin:) Like you, I never really finished my old FFT routine. There's no dust on it - but there is a big layer in my brain! Would take a while to figure out what I was doing, and why. Wish there were more comments :icon_confused:

Quote from: Siekmanski
(6.0 - 9.0) * -0.707 = 2.1 ; for -45 degree
(9.0 - 6.0) *  0.707 = 2.1 ; for  45 degree

You left out a couple of brackets, shown above in bold.

Hi Raistlin,

Did you ever get around to finding those pulsars?
I am NaN ;)

Siekmanski

Hi rrr314159,

Glad to hear from you again. Thanks for correcting the typo's.

Quote from: rrr314159
Well, get to work! :biggrin:) Like you, I never really finished my old FFT routine. There's no dust on it - but there is a big layer in my brain! Would take a while to figure out what I was doing, and why. Wish there were more comments :icon_confused:

:biggrin:
Luckily i've commented my FFT sources very heavily, or else I would loose track in an instance.
If I get in the mood, i'll try to fold my brain in the FFT mode again and try to implement it and see if it works out or not.
Creative coders use backward thinking techniques as a strategy.

Siekmanski

Think I found my latest SSE FFT routine...
With a few comments.  :biggrin:


align 16
FFTSSE proc uses ebx esi edi SinCosTable:DWORD,Complex_Data:DWORD

    push    ebp
                                            ; To keep the code size small for the Log2 loops ( within 1 or 2 cache lines ) for all FFTsizes,
                                            ; we have to put the addresses of the AB stride positions in registers !
    mov     esi,Complex_Data                ; RealA = Complex_Data
    mov     edi,FFTsize
    mov     eax,edi
    lea     ebx,[esi+edi*4]                 ; ImagA = Complex_Data + FFTsize*4
    lea     ecx,[esi+edi*2]                 ; RealB = Complex_Data + FFTsize*2
    lea     edx,[esi+edi*4]
    lea     edx,[edx+edi*2]                 ; ImagB = Complex_Data + FFTsize*6

                                            ; Because we are dealing with 0 degree and 90 degree rotations we can skip trigonometric calculations
                                            ; Before time domain decomposition:
                                            ; The stride of the first log2 loop = FFTsize/2
                                            ; The stride of the second log2 loop = FFTsize/4                                           
                                            ; We can take advantage of that fact and process the first 2 log2 loops with 16 values at once
    add     eax,eax                         ; Start position = FFTsize*2
align 16
first_log2_LP:
    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB
    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm2                       ; RealA = RealA + RealB
    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB
    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB
    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB

    sub     eax,16

    movaps  oword ptr[esi+eax],xmm0         ; new RealA
    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA
    movaps  oword ptr[ecx+eax],xmm4         ; new RealB
    movaps  oword ptr[edx+eax],xmm5         ; new ImagB
    jnz     first_log2_LP   ; Loop size = 56 bytes ( within 1 cache line )
;   mov     CodeSize,$ - first_log2_LP


    sub     ecx,edi                         ; RealB = Complex_Data + FFTsize
    sub     edx,edi                         ; ImagB = Complex_Data + FFTsize*5
align 16
second_log2_Even_LP:
    movaps  xmm0,oword ptr[esi+edi-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+edi-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+edi-16]      ; RealB
    movaps  xmm3,oword ptr[edx+edi-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm2                       ; RealA = RealA + RealB
    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB
    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB
    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB

    sub     edi,16

    movaps  oword ptr[esi+edi],xmm0         ; new RealA
    movaps  oword ptr[ebx+edi],xmm1         ; new ImagA
    movaps  oword ptr[ecx+edi],xmm4         ; new RealB
    movaps  oword ptr[edx+edi],xmm5         ; new ImagB
    jnz     second_log2_Even_LP ; Loop size = 56 bytes ( within 1 cache line )
;   mov     CodeSize,$-second_log2_Even_LP

    mov     eax,FFTsize
    add     esi,FFTsize*2                   ; RealA = Complex_Data + FFTsize*2
    add     ebx,FFTsize*2                   ; ImagA = Complex_Data + FFTsize*6
    add     ecx,FFTsize*2                   ; RealB = Complex_Data + FFTsize*3
    add     edx,FFTsize*2                   ; ImagB = Complex_Data + FFTsize*7
align 16
second_log2_Odd_LP:
    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB
    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm3                       ; RealA = RealA + ImagB
    addps   xmm5,xmm2                       ; ImagB = tempImagA + RealB
    subps   xmm1,xmm2                       ; ImagA = ImagA - RealB
    subps   xmm4,xmm3                       ; RealB = tempRealA - ImagB

    sub     eax,16

    movaps  oword ptr[esi+eax],xmm0         ; new RealA
    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA
    movaps  oword ptr[ecx+eax],xmm4         ; new RealB
    movaps  oword ptr[edx+eax],xmm5         ; new ImagB
    jnz     second_log2_Odd_LP  ; Loop size = 56 bytes ( within 1 cache line )


    mov     esi,Complex_Data
    mov     eax,FFTsize*2                   ; Complex_Data + FFTsize*2
    lea     edi,ReverseBinaryTable          ; reverse-binary reindexing table ( time domain decomposition )
    lea     edx,[esi+eax*2]                 ; imaginary data ptr = Complex_Data + FFTsize*4
    mov     ecx,dword ptr[edi]              ; swap count
align 16
RBI_LP:
    mov     eax,dword ptr[edi+4]            ; get the swap positions
    mov     ebx,dword ptr[edi+8]
    movss   xmm0,real4 ptr[esi+eax*2]       ; get real data
    movss   xmm1,real4 ptr[esi+ebx*2]       ; get real data
    movss   xmm2,real4 ptr[edx+eax*2]       ; get imaginary data
    movss   xmm3,real4 ptr[edx+ebx*2]       ; get imaginary data
    movss   real4 ptr[esi+eax*2],xmm1       ; swap real data
    movss   real4 ptr[esi+ebx*2],xmm0       ; swap real data
    movss   real4 ptr[edx+eax*2],xmm3       ; swap imaginary data
    movss   real4 ptr[edx+ebx*2],xmm2       ; swap imaginary data
    add     edi,8
    dec     ecx
    jnz     RBI_LP
;   mov     CodeSize,$-RBI_LP

    mov     ebp,SinCosTable
    mov     edx,4                           ; AB distance = 4
align 16
Log2_LP:                                    ; Third Log2 loop starts here, loop (log2(FFTsize)-2) times.
    mov     edi,edx                         ; AB distance
    add     edx,edx                         ; Step == AB distance * 2
    xor     ecx,ecx                         ; Next Log2 loop, first A position = 0
Butterfly_LP:                               ; For strides 4, 8, 16 etc.
    movaps  xmm6,oword ptr[ebp]             ; Load Cos
    movaps  xmm7,oword ptr[ebp+16]          ; Load Sin
    mov     eax,ecx                         ; Next A position
Decomposition_LP:
    mov     ebx,eax                         ; A position
    add     ebx,edi                         ; B position = A + AB distance             

    movaps  xmm0,oword ptr[esi+eax*4]       ; RealA
    movaps  xmm1,oword ptr[esi+eax*4+FFTsize*4]     ; ImagA
    movaps  xmm2,oword ptr[esi+ebx*4]       ; RealB
    movaps  xmm3,oword ptr[esi+ebx*4+FFTsize*4]     ; ImagB
    movaps  xmm4,xmm2                       ; RealB copy
    movaps  xmm5,xmm3                       ; ImagB copy
    mulps   xmm2,xmm6                       ; RealB * cos
    mulps   xmm3,xmm7                       ; ImagB * sin
    mulps   xmm4,xmm7                       ; RealB * sin
    mulps   xmm5,xmm6                       ; ImagB * cos
    subps   xmm2,xmm3                       ; tempReal = RealB * cos - ImagB * sin
    addps   xmm4,xmm5                       ; tempImag = RealB * sin + ImagB * cos
    movaps  xmm3,xmm0                       ; RealA copy
    movaps  xmm5,xmm1                       ; ImagA copy
    addps   xmm0,xmm2                       ; RealA = RealA + tempReal
    addps   xmm1,xmm4                       ; ImagA = ImagA + tempImag
    subps   xmm3,xmm2                       ; RealB = RealA - tempReal
    subps   xmm5,xmm4                       ; ImagB = ImagA - tempImag
    movaps  oword ptr[esi+eax*4],xmm0       ; new RealA
    movaps  oword ptr[esi+eax*4+FFTsize*4],xmm1     ; new ImagA
    movaps  oword ptr[esi+ebx*4],xmm3       ; new RealB
    movaps  oword ptr[esi+ebx*4+FFTsize*4],xmm5     ; new ImagB

    add     eax,edx                         ; Next step in FFT data
    cmp     eax,FFTsize                     ; Do we need the next 4 Sine and Cosine values ?
    jb      Decomposition_LP
    add     ebp,32                          ; Next step in SinCosTable
    add     ecx,4                           ; Next A position
    cmp     ecx,edi                         ; Is current Log2 loop done ?
    jb      Butterfly_LP
    cmp     edx,FFTsize                     ; Are all Log2 loops done ?
    jb      Log2_LP     ; Loop size = 116 bytes
    pop     ebp
    ret   

FFTSSE endp

Creative coders use backward thinking techniques as a strategy.

Raistlin

QuoteHi Raistlin,

Did you ever get around to finding those pulsars?

Hail master rrr314159/Siekmanski, Nope did'nt find those pulsars yet (dreamy look), but I'am on my way...

Writing a paper on a High Performance / High Throughput Computing Hybrid grid that may be
able to implement  FFT searches in parallel for the use of binary star pulsar searches and much more...
Still knocking it out though so don't hold you're breath just yet.... takes forever - thesis writing part that is.
Coding part - well on its way - all in >ASM<

Hope you're all still doing well, miss chatting with you (mayhap I'll drop you a PM so you can do some beta testing ??) :bgrin:

Raistlin


Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

@Siekmanski,

You've got the right idea, comment almost every line. When coding, comments seem superfluous (it's all so obvious), but later on I regret leaving them out.

@raistlin,

No doubt there are a lot of pulsars out there waiting to be discovered. It's interesting that shortly after our 2015 FFT thread, LIGO finally detected gravitational waves. That adds considerable legitimacy to Hulse & Taylor. Wouldn't be surprised if it gives a boost to pulsar research, looking for more examples like PSR B1913+16.

Good luck with your thesis! Finishing it will be one of the great moments of your life. Unfortunately health problems make me an unlikely beta tester these days, but I'm with you in spirit :-)

BTW, looks like you guys were right about Trump. I'm not happy with him. Still, better than Hillary
I am NaN ;)