Multithreading with FFT-IFFT

guga · March 26, 2018, 05:59:22 PM

I´ll wait for the code to study this. I´m a bit confused with the results, though.

If i use, for example, only 16 samples the result i´ve got is diferent from some places i computed online. Ex:
On input
[<16 ComplexData: F$ 0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0]

but, on output i´ve got 32 samples and their values are diferent from the ones i calculated here:

http://www.themobilestudio.net/the-fourier-transform-part-15
or from here
http://scistatcalc.blogspot.com.br/2013/12/fft-calculator.html

Siekmanski · March 26, 2018, 09:52:04 PM

Hi Guga,

FFT ( Fast Fourier Transform ) comes in all kind of flavours, complex to complex, real to complex, real to real, interleaved or not interleaved, 1D, 2D etc....

My FFT routine posted here is "complex to complex", interleaved input and interleaved output. ( real,imag,real,imag, etc. )

If you insert these 32 ( = FFTsize 16 ) values to "Mark's FFT Calculator"
http://www.themobilestudio.net/the-fourier-transform-part-15

"Input Samples" set to "Complex Wave"

Copy these values ( = Complex input) as complex wave,

Code Select

0.0
0.0
1.0
0.0
2.0
0.0
3.0
0.0
4.0
0.0
5.0
0.0
6.0
0.0
7.0
0.0
8.0
0.0
9.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

You get this as FFT results:

REAL: 45.000 -25.452 10.364 -9.064 4.000 -1.279 -2.364 3.795
REAL: -5.000 3.795 -2.364 -1.279 4.000 -9.064 10.364 -25.452
IMAG: 0.000 -16.665 3.293 2.328 -5.000 5.642 -4.707 2.648
IMAG: 0.000 -2.649 4.707 -5.642 5.000 -2.328 -3.293 16.665

Why he outputs the results twice I don't know, 0-15 are the actual results, 16-31 is just a copy of 0-15.

So, if you insert the same values in my FFT routine ( FFTsize equ 16, and do not normalize! )
Complex to Complex, Interleaved input:

Code Select

ComplexData 		real4 0.0,0.0,1.0,0.0,2.0,0.0,3.0,0.0,4.0,0.0,5.0,0.0,6.0,0.0,7.0,0.0,8.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ; test data

You get this as output,

Interleaved ouput: ( real, Imaginary, not normalized)

45.000 0.000 -25.452 -16.665 10.364 3.293 -9.064 2.328
4.000 -5.000 -1.279 5.642 -2.364 -4.707 3.795 2.648
-5.000 0.000 3.795 -2.649 -2.364 4.707 -1.279 -5.642
4.000 5.000 -9.064 -2.328 10.364 -3.293 -25.452 16.665

And is exactly the same as calculated in "Mark's FFT Calculator"

I promised to post my latest FFT code.
But there is an error in the calculations which I need to correct first. This can take some time since I'm really busy at the moment. Sorry...

LordAdef · March 27, 2018, 02:43:18 PM

Otimo post!

Nice thread, guys. Replying to follow. Cheers

daydreamer · March 28, 2018, 06:31:15 PM

next step is to have 32threads x the number of PC's in a small home LAN,once I had 7 PC's
it is most effective in memoryintensive 3d anmations,first an initial several gb data transfer and after that if you used 5 computers, it rendered 5 times the speed,computer1did frame1,computer2 did frame2etc

guga · February 06, 2022, 02:39:56 PM

Old thread, but i´m testing the newer Routines from http://masm32.com/board/index.php?topic=6111.msg66562#msg66562

I ported it to RosAsm and i´m analyzing the results, but it seems that the newer version (code below) is producing incorrect values. For some reason it is jumping the 2nd data on each Frequency Index (http://www.themobilestudio.net/the-fourier-transform-part-15)

Code Select

align 16
FFTSSE proc uses ebx esi edi SinCosTable:DWORD,Complex_Data:DWORD

    push    ebp
                                            ; To keep the code size small for the Log2 loops ( within 1 or 2 cache lines ) for all FFTsizes,
                                            ; we have to put the addresses of the AB stride positions in registers !
    mov     esi,Complex_Data                ; RealA = Complex_Data
    mov     edi,FFTsize
    mov     eax,edi
    lea     ebx,[esi+edi*4]                 ; ImagA = Complex_Data + FFTsize*4
    lea     ecx,[esi+edi*2]                 ; RealB = Complex_Data + FFTsize*2
    lea     edx,[esi+edi*4]
    lea     edx,[edx+edi*2]                 ; ImagB = Complex_Data + FFTsize*6

                                            ; Because we are dealing with 0 degree and 90 degree rotations we can skip trigonometric calculations
                                            ; Before time domain decomposition:
                                            ; The stride of the first log2 loop = FFTsize/2
                                            ; The stride of the second log2 loop = FFTsize/4                                           
                                            ; We can take advantage of that fact and process the first 2 log2 loops with 16 values at once
    add     eax,eax                         ; Start position = FFTsize*2
align 16
first_log2_LP:
    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB
    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm2                       ; RealA = RealA + RealB
    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB
    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB
    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB

    sub     eax,16

    movaps  oword ptr[esi+eax],xmm0         ; new RealA
    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA
    movaps  oword ptr[ecx+eax],xmm4         ; new RealB
    movaps  oword ptr[edx+eax],xmm5         ; new ImagB
    jnz     first_log2_LP   ; Loop size = 56 bytes ( within 1 cache line )
;   mov     CodeSize,$ - first_log2_LP


    sub     ecx,edi                         ; RealB = Complex_Data + FFTsize
    sub     edx,edi                         ; ImagB = Complex_Data + FFTsize*5
align 16
second_log2_Even_LP:
    movaps  xmm0,oword ptr[esi+edi-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+edi-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+edi-16]      ; RealB
    movaps  xmm3,oword ptr[edx+edi-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm2                       ; RealA = RealA + RealB
    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB
    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB
    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB

    sub     edi,16

    movaps  oword ptr[esi+edi],xmm0         ; new RealA
    movaps  oword ptr[ebx+edi],xmm1         ; new ImagA
    movaps  oword ptr[ecx+edi],xmm4         ; new RealB
    movaps  oword ptr[edx+edi],xmm5         ; new ImagB
    jnz     second_log2_Even_LP ; Loop size = 56 bytes ( within 1 cache line )
;   mov     CodeSize,$-second_log2_Even_LP

    mov     eax,FFTsize
    add     esi,FFTsize*2                   ; RealA = Complex_Data + FFTsize*2
    add     ebx,FFTsize*2                   ; ImagA = Complex_Data + FFTsize*6
    add     ecx,FFTsize*2                   ; RealB = Complex_Data + FFTsize*3
    add     edx,FFTsize*2                   ; ImagB = Complex_Data + FFTsize*7
align 16
second_log2_Odd_LP:
    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB
    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm3                       ; RealA = RealA + ImagB
    addps   xmm5,xmm2                       ; ImagB = tempImagA + RealB
    subps   xmm1,xmm2                       ; ImagA = ImagA - RealB
    subps   xmm4,xmm3                       ; RealB = tempRealA - ImagB

    sub     eax,16

    movaps  oword ptr[esi+eax],xmm0         ; new RealA
    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA
    movaps  oword ptr[ecx+eax],xmm4         ; new RealB
    movaps  oword ptr[edx+eax],xmm5         ; new ImagB
    jnz     second_log2_Odd_LP  ; Loop size = 56 bytes ( within 1 cache line )


    mov     esi,Complex_Data
    mov     eax,FFTsize*2                   ; Complex_Data + FFTsize*2
    lea     edi,ReverseBinaryTable          ; reverse-binary reindexing table ( time domain decomposition )
    lea     edx,[esi+eax*2]                 ; imaginary data ptr = Complex_Data + FFTsize*4
    mov     ecx,dword ptr[edi]              ; swap count
align 16
RBI_LP:
    mov     eax,dword ptr[edi+4]            ; get the swap positions
    mov     ebx,dword ptr[edi+8]
    movss   xmm0,real4 ptr[esi+eax*2]       ; get real data
    movss   xmm1,real4 ptr[esi+ebx*2]       ; get real data
    movss   xmm2,real4 ptr[edx+eax*2]       ; get imaginary data
    movss   xmm3,real4 ptr[edx+ebx*2]       ; get imaginary data
    movss   real4 ptr[esi+eax*2],xmm1       ; swap real data
    movss   real4 ptr[esi+ebx*2],xmm0       ; swap real data
    movss   real4 ptr[edx+eax*2],xmm3       ; swap imaginary data
    movss   real4 ptr[edx+ebx*2],xmm2       ; swap imaginary data
    add     edi,8
    dec     ecx
    jnz     RBI_LP
;   mov     CodeSize,$-RBI_LP

    mov     ebp,SinCosTable
    mov     edx,4                           ; AB distance = 4
align 16
Log2_LP:                                    ; Third Log2 loop starts here, loop (log2(FFTsize)-2) times.
    mov     edi,edx                         ; AB distance
    add     edx,edx                         ; Step == AB distance * 2
    xor     ecx,ecx                         ; Next Log2 loop, first A position = 0
Butterfly_LP:                               ; For strides 4, 8, 16 etc.
    movaps  xmm6,oword ptr[ebp]             ; Load Cos
    movaps  xmm7,oword ptr[ebp+16]          ; Load Sin
    mov     eax,ecx                         ; Next A position
Decomposition_LP:
    mov     ebx,eax                         ; A position
    add     ebx,edi                         ; B position = A + AB distance             

    movaps  xmm0,oword ptr[esi+eax*4]       ; RealA
    movaps  xmm1,oword ptr[esi+eax*4+FFTsize*4]     ; ImagA
    movaps  xmm2,oword ptr[esi+ebx*4]       ; RealB
    movaps  xmm3,oword ptr[esi+ebx*4+FFTsize*4]     ; ImagB
    movaps  xmm4,xmm2                       ; RealB copy
    movaps  xmm5,xmm3                       ; ImagB copy
    mulps   xmm2,xmm6                       ; RealB * cos
    mulps   xmm3,xmm7                       ; ImagB * sin
    mulps   xmm4,xmm7                       ; RealB * sin
    mulps   xmm5,xmm6                       ; ImagB * cos
    subps   xmm2,xmm3                       ; tempReal = RealB * cos - ImagB * sin
    addps   xmm4,xmm5                       ; tempImag = RealB * sin + ImagB * cos
    movaps  xmm3,xmm0                       ; RealA copy
    movaps  xmm5,xmm1                       ; ImagA copy
    addps   xmm0,xmm2                       ; RealA = RealA + tempReal
    addps   xmm1,xmm4                       ; ImagA = ImagA + tempImag
    subps   xmm3,xmm2                       ; RealB = RealA - tempReal
    subps   xmm5,xmm4                       ; ImagB = ImagA - tempImag
    movaps  oword ptr[esi+eax*4],xmm0       ; new RealA
    movaps  oword ptr[esi+eax*4+FFTsize*4],xmm1     ; new ImagA
    movaps  oword ptr[esi+ebx*4],xmm3       ; new RealB
    movaps  oword ptr[esi+ebx*4+FFTsize*4],xmm5     ; new ImagB

    add     eax,edx                         ; Next step in FFT data
    cmp     eax,FFTsize                     ; Do we need the next 4 Sine and Cosine values ?
    jb      Decomposition_LP
    add     ebp,32                          ; Next step in SinCosTable
    add     ecx,4                           ; Next A position
    cmp     ecx,edi                         ; Is current Log2 loop done ?
    jb      Butterfly_LP
    cmp     edx,FFTsize                     ; Are all Log2 loops done ?
    jb      Log2_LP     ; Loop size = 116 bytes
    pop     ebp
    ret   

FFTSSE endp

For input, i tested with these values FFTSize = 1024:

Quote[ComplexDataNew1: F$ 0.0,0.0,1.0,0.0,2.0,0.0,3.0,0.0,4.0,0.0,5.0,0.0,6.0,0.0,7.0,0.0, ; test data
F$ 8.0,0.0,7.0,0.0,6.0,0.0,5.0,0.0,4.0,0.0,3.0,0.0,2.0,0.0,1.0,0.0,
ComplexDataGuga.Data1: F$ 0 #(FFTsize*4)]

The original version produces the correct values but the newer one is not. I believe it may be due to the changes in InitFFTSinCosTable that is producing strange values rather then the older version. Also, why the size of the table FFT_SinCosTable is FFTSize*1.5 ?

From the tests i´m doing the correct results are displayed as:
forward FFTsse2 first 16 results:

Code Select

Frequency Index: 0 R: 64.000 I: 0.000 R: 63.910 I: -3.140 R: 63.641 I: -6.268 R: 63.195 I: -9.374
Frequency Index: 1 R: -26.274 I: 0.000 R: -25.436 I: 1.250 R: -24.548 I: 2.418 R: -23.615 I: 3.503
Frequency Index: 2 R: 0.000 I: 0.000 R: 0.004 I: -0.000 R: 0.016 I: -0.002 R: 0.035 I: -0.005
Frequency Index: 3 R: -3.240 I: -0.000 R: -3.205 I: 0.157 R: -3.158 I: 0.311 R: -3.102 I: 0.460

And the wrong values (the newer version) is producing this:

Code Select


interleaved FFTsse2 first 16 results:

Frequency Index: 0 Real    (Cosine): 64.000 63.641 62.572 60.810 58.384 55.336 51.716 47.585
Frequency Index: 1 Real    (Cosine): -26.274 -24.548 -22.645 -20.617 -18.513 -16.380 -14.262 -12.198
Frequency Index: 2 Real    (Cosine): 0.000 0.016 0.061 0.129 0.214 0.308 0.404 0.493
Frequency Index: 3 Real    (Cosine): -3.240 -3.158 -3.035 -2.874 -2.680 -2.460 -2.219 -1.964

Frequency Index: 0 Imaginary (Sine): 0.000 -6.268 -12.446 -18.446 -24.183 -29.578 -34.556 -39.052
Frequency Index: 1 Imaginary (Sine): 0.000 2.418 4.504 6.254 7.668 8.755 9.529 10.011
Frequency Index: 2 Imaginary (Sine): 0.000 -0.002 -0.012 -0.039 -0.089 -0.165 -0.270 -0.404
Frequency Index: 3 Imaginary (Sine): 0.000 0.311 0.604 0.872 1.110 1.315 1.483 1.611

Did you succeeded to fix that ? Also, what are the values produces by InitFFTSinCosTable ? I know it seems to be a structure of sins/cos, but using the flag 0-1 it is alfo filling the members of the structure at Pos 04 and pos 08, instead only Pos0 and Pos16 used in FFTSSE2.

I´m trying to understand or fix this, so i can later try to make the data also export amplitude and phase rather then only hthe Real and Imaginary frequyenceis to be used on a spectogram

https://www.gaussianwaves.com/2015/11/interpreting-fft-results-obtaining-magnitude-and-phase-information/?fbclid=IwAR2HHnysT_m_dQIgX94mLA3XGvRAlsjkL5oRg-4JImr5CkcsRYjzs59l4K8

guga · February 07, 2022, 01:44:49 AM

When i´m analysing it throuugh the debugger, i see 2 differences in the ComplexData results at he start of the Butterfly calculations.

The Correct version has 0 values at the beginning, like this:

https://imgur.com/a/mg3zBrf

But the incorrect version (Th newer one), is inserting a sequence of 4 values = 8 when it should be zero

https://imgur.com/pTTm18d

guga · February 07, 2022, 02:05:23 AM

The incorrect values seems to be generated due to an error in the Sin/Cosine table

https://imgur.com/COSRSBK

On the older version of InitFFTSinCosTable the values on xmm6 are 1 0 0 1

But, on the newer version of the InitiFFTSinCosTable it is generating weird values in xmm6 ans xmm7, like:
-7.07106e-1 -2.710505e-20 -7.07106e-1 1

Btw, if i ported correctly your newer version of InitSinCosine it is:

Code Select


; size of sincostable = FFTsize+(FFTsize/2) = FFTsize*1.5 = 1024*1.5 = 1536
Proc InitFFTSinCosTable2:
    Arguments @SinCosTable, @NNType, @FFT_type
    Local @MMax
    Uses esi, edi, ecx

    finit
    fclex
    mov D@MMax 4
    mov edi D@SinCosTable

    .Do
        fldpi
        fidiv D@MMax
        fldz
        mov ecx D@MMax
        mov esi 4

        Do
            fld ST0
            fsincos
            fstp F$edi
            Test_If W@FFT_type IFFT
                fchs
            Test_End
            fstp F$edi+16
            dec esi | jne @OutNotZero
                mov esi 4
                add edi 16
@OutNotZero:
            add edi 4
            fadd ST0 ST1
            dec ecx
        Repeat_Until_Zero
        fstp ST0
        fstp ST0
        mov eax D@MMax | add eax eax | mov D@MMax eax

    .Loop_Until D@NNType = eax

EndP

The error seems to be generated on the Test_If W@FFT_type IFFT the next add edi. It seems to overlap a value ?

guga · February 07, 2022, 06:11:32 AM

I´m trying to see what´s wrong, so i can ty later calculate everything related to the frequencie of inputed samples, informatons such as amplitude, phase, pitch, sound intensity, sound pressure and the last calculate the distance between the origin of the sound and the target microphone (to make easier to we separate/isolate sounds from the background, for example. This distance we can calculate from the amplitude)

Amplitude and phase seems to be more easy to calculate:
Amplitude = sqrt ( Frequency Imaginary^2 + Frequency Real ^2 )
Phase = atan2 (Frequency Imaginary / Frequency Real)

https://www.who.int/occupational_health/publications/noise1.pdf

https://en.wikipedia.org/wiki/Sound_intensity
https://en.wikipedia.org/wiki/Sound_pressure

https://en.wikipedia.org/wiki/Sound_pressure
https://en.wikipedia.org/wiki/Sound_power
https://sound.eti.pg.gda.pl/student/eim/synteza/leszczyna/index_ang.htm

Or also create other methods of correctly resample a audio file such as: https://dsp.stackexchange.com/questions/2962/how-to-resample-audio-using-fft-or-dft

So, from a given sample (or sequence of samples) we can build a set of structures related to each frequencies on intervals of FFTSize.

Ex: FFTSize = 1024 will produce 1024 different frequencies per sample, right ? Each one of these frequencies carries information related to sound amplitude, phase, pressure, intensity, distance.

Therefore, instead building a array of structures containing only this:

FFTSize= 1024. Sample XXXX

Frequency.Real.Data1: F$ 251.5
Frequency.Imaginary.Data1: F$ -215
Frequency.Real.Data2: F$ 25
Frequency.Imaginary.Data2: F$ 5
....
Frequency.Real.Data1024: F$ 77
Frequency.Imaginary.Data1024: F$ 2

Or their interleaved format:

Frequency.Real.Data1: F$ 251.5
Frequency.Real.Data2: F$ 25
....
Frequency.Real.Data1024: F$ 77

Frequency.Imaginary.Data1: F$ -215
Frequency.Imaginary.Data2: F$ 5
Frequency.Imaginary.Data1024: F$ 2

....

We can make a structure (or even other tables) containing all of those information such as:
Frequency.Real.Data1: F$ 251.5
Frequency.Imaginary.Data1: F$ -215
Frequency.Amplitude.Data1: F$ 125
Frequency.Phase.Data1: F$ 12
Frequency.Pitch.Data1: F$ 777
Frequency.SoundIntensity.Data1: F$ 12
Frequency.Soundpressure.Data1: F$ 125
Frequency.Distance.Data1: F$ 2252

Frequency.Real.Data2: F$ 25
Frequency.Imaginary.Data2: F$ 5
Frequency.Amplitude.Data2: F$ 125
Frequency.Phase.Data2: F$ 12
Frequency.Pitch.Data2: F$ 77
Frequency.SoundIntensity.Data2: F$ 12
Frequency.Soundpressure.Data2: F$ 125
Frequency.Distance.Data2: F$ 2252
....
Frequency.Real.Data1024: F$ 77
Frequency.Imaginary.Data1024: F$ 2
Frequency.Amplitude.Data1: F$ 125
Frequency.Phase.Data1024: F$ 12
Frequency.Pitch.Data1024: F$1255
Frequency.SoundIntensity.Data1024: F$ 12
Frequency.Soundpressure.Data1024: F$ 125
Frequency.Distance.Data1024: F$ 2252

So, with a Array of Structures each one of them containing more (or all) information regarded to a certain frequency, we can later build better filters to work with.

The MASM Forum

News:

Multithreading with FFT-IFFT

guga

Siekmanski

LordAdef

daydreamer

guga

guga

guga

guga