News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Multithreading with FFT-IFFT

Started by Siekmanski, May 14, 2015, 07:58:23 PM

Previous topic - Next topic

Siekmanski

Thank you guys for the timing reports.  :t

Here is a Multithreading 1024 point FFT version, on my machine it scores an average of 68223 MegaFlops per FFT.

Would be nice to see timings  from your computers.....

Multithreading FFT routines by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6
ActiveProcessorMask : 0FFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

Timing starts after waking up cores....

0.1889020 seconds for 50000 * '1024 FFT' single thread.
MegaFlops: 13552

Let the OS manage the cores:
0.5070735 seconds for 12 * 50000 * '1024 FFT' with 12 threads.
MegaFlops: 60583

Use AffinityMasks to manage the cores:
0.4502866 seconds for 12 * 50000 * '1024 FFT' with 12 threads.
MegaFlops: 68223

Hyperthreading on board, let's test even and uneven AffinityMask bits....

0.3286985 seconds for 6 * 50000 * '1024 FFT' with 6 threads. ( Even )
MegaFlops: 46730

0.3556083 seconds for 6 * 50000 * '1024 FFT' with 6 threads. ( Uneven )
MegaFlops: 43194


Creative coders use backward thinking techniques as a strategy.

jj2007

You are not respecting the speed limits, Marinus :eusa_naughty:
Processor Name      : Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Operating System    : Windows 7
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

Timing starts after waking up cores....

0.2328926 seconds for 50000 * '1024 FFT' single thread.
MegaFlops: 10992

Let the OS manage the cores:
0.5001578 seconds for 4 * 50000 * '1024 FFT' with 4 threads.
MegaFlops: 20474

Use AffinityMasks to manage the cores:
0.5512171 seconds for 4 * 50000 * '1024 FFT' with 4 threads.
MegaFlops: 18577

Hyperthreading on board, let's test even and uneven AffinityMask bits....

0.3645042 seconds for 2 * 50000 * '1024 FFT' with 2 threads. ( Even )
MegaFlops: 14046

0.3245189 seconds for 2 * 50000 * '1024 FFT' with 2 threads. ( Uneven )
MegaFlops: 15777

Siekmanski

Creative coders use backward thinking techniques as a strategy.

Raistlin

Multithreading FFT routines by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

Timing starts after waking up cores....

0.1614891 seconds for 50000 * '1024 FFT' single thread.
MegaFlops: 15852

Let the OS manage the cores:
0.2624986 seconds for 4 * 50000 * '1024 FFT' with 4 threads.
MegaFlops: 39010

Use AffinityMasks to manage the cores:
0.2984818 seconds for 4 * 50000 * '1024 FFT' with 4 threads.
MegaFlops: 34307

Hyperthreading on board, let's test even and uneven AffinityMask bits....

0.2712594 seconds for 2 * 50000 * '1024 FFT' with 2 threads. ( Even )
MegaFlops: 18875

0.2641408 seconds for 2 * 50000 * '1024 FFT' with 2 threads. ( Uneven )
MegaFlops: 19384


Press any key to continue...
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

guga


Multithreading FFT routines by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 8
Physical Processors : 4
ActiveProcessorMask : 00FFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

Timing starts after waking up cores....

0.2921108 seconds for 50000 * '1024 FFT' single thread.
MegaFlops: 8764

Let the OS manage the cores:
0.6074673 seconds for 8 * 50000 * '1024 FFT' with 8 threads.
MegaFlops: 33714

Use AffinityMasks to manage the cores:
0.6002589 seconds for 8 * 50000 * '1024 FFT' with 8 threads.
MegaFlops: 34119

Hyperthreading on board, let's test even and uneven AffinityMask bits....

0.4320215 seconds for 4 * 50000 * '1024 FFT' with 4 threads. ( Even )
MegaFlops: 23703

0.4249765 seconds for 4 * 50000 * '1024 FFT' with 4 threads. ( Uneven )
MegaFlops: 24095


Press any key to continue...


Old thread, but hope it helps :)

Marinus, can you post the source for FFt and IFFT functions ?

Many thanks in advance :)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

Hi Guga,

I will post my latest working version soon. (give me a few days, have to comment some stuff etc. )

I'm still working on a new version.
Really busy at the moment but if I have finished the code I'll post it.
Creative coders use backward thinking techniques as a strategy.

johnsa

It crashes for me unfortunately..

Multithreading FFT routines by Siekmanski 2015.

Processor Name      : AMD Ryzen Threadripper 1950X 16-Core Processor
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 32
Physical Processors : 16
ActiveProcessorMask : FFFFFFFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

I'm actually on Windows 10 not 8..

Siekmanski

Ooh, I see why.
You have to many Logical Processors.  8)
My routine handles no more than 16 parallel threads in the old source code and didn't check there could be more than 16 Logical Processors.
I'll correct that in the next source.

Incorrect windows version is because I used "GetVersionEx" and it doesn't seem to work well in a console app.
Creative coders use backward thinking techniques as a strategy.

johnsa

I like it.. you have too many cores... muwahaha..

guga

Many tks, Marinus.

I´ll give a try on both Routines (FFT and it´s reversal IFFT). I´m porting some old routines to RosAsm for a audio app i´m trying to build. The routine uses both methods described on the former FFT functions (http://www.netlib.org/fftpack). In C, the functions where ported on the rouziclib library as:

The Ported C code is:

void fftp(double *in, double *out, int n, int method, fft_plan_t *plan) // FFT with a fixed plan
{
/* method :
* 0 = DFT
* 1 = IDFT
*/

double *out2;
int32_t i;
double ratio = 1.0/sqrt((double) n);
int32_t ifac_global[64];

if (n < 2)
{
fprintf_rl(stderr, "fftp(): argument n = %d\n", n);
return ;
}

if (in==out) // if in-place FFT
out2 = calloc (n, sizeof(double)); // allocate and assign the output array
else // if out-of-place FFT
out2 = out;

if (plan->plan_init && plan->alloc_size < 2*n+15) // if alloc_size is too small
{
free_fft_plan(plan); // reallocate to fit n
*plan = alloc_fft_plan(n, -1);
plan->plan_init = 0;
}

if (plan->plan_init && plan->fft_size != n) // reinit plan if wrong size is saved
plan->plan_init = 0;

if (plan->plan_init==0)
{
rffti(n, plan->plan, ifac_global); // initialise the 'plan'
memcpy(plan->plan_backup, plan->plan, (2*n+15) * sizeof(double)); // back it up
memcpy(plan->ifacg, ifac_global, 64 * sizeof(int32_t));
plan->plan_init = 1;
plan->fft_size = n;
}
else
{
memcpy(plan->plan, plan->plan_backup, (2*n+15) * sizeof(double)); // copy from backup
memcpy(ifac_global, plan->ifacg, 64 * sizeof(int32_t));
}

if (method==0)
{
rfftf(n, in, plan->plan, ifac_global); // FFT

out2[0] = in[0]; // DC

for (i=0; i<(n+(n&1)-2)>>1; i++) // all real and imaginary bins except the Nyquist component
{ // i represents the complex bin index, i=0 being both lowest real and imaginary components
out2[i+1] = in[(i<<1)+1]; // real
out2[n-i-1] = in[(i<<1)+2]; // imaginary
}

if ((n&1)==0) // if there's a Nyquist component
out2[n>>1] = in[n-1]; // copy it over
}
if (method==1)
{
out2[0] = in[0]; // DC

for (i=0; i<(n+(n&1)-2)>>1; i++) // all real and imaginary bins except the Nyquist component
{ // i represents the complex bin index, i=0 being both lowest real and imaginary components
out2[(i<<1)+1] = in[i+1]; // real
out2[(i<<1)+2] = in[n-i-1]; // imaginary
}

if ((n&1)==0) // if there's a Nyquist component
out2[n-1] = in[n>>1]; // copy it over

rfftb(n, out2, plan->plan, ifac_global); // IFFT
}

if (in==out)
memcpy (out, out2, n * sizeof(double));

for (i=0; i<n; i++)
out[i] *= ratio;

if (out2!=out)
free(out2);
}


I´m trying to build a similar function to handle the FFT but without using the "plan" structure in order to be called simply :

[DFT 0] ;  FFT fourier routine
[IDFT 1]  ; IFFT inverted fourier

[Data1: D$ 03F6CE6B6 ; Used Dwords for my tests only, since the data is, in fact Floating Points (F$)
Data2: D$ 03EDA40DC
Data3: D$ 03F1F5FB5
Data4: D$ 0BF0995BC
Data5: D$ 0BF6B7289
Data6: D$ 0BEA87156
Data7: D$ 0BDBAAD94
Data8: D$ 0BDF30556
Data9: D$ 03EED3D23
Data10: D$ 0BF40757E]

[Output: D$ 0 #10]

call fftp Data1, Output, 10,  DFT


So, the idea is input a Pointer to a array of Samples (In FPU data), and a pointer to a output buffer, and then choose how many samples (elements of teh array) are used to perform the FFT and finally, choosing the proper methods (FFT or IFFT)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Some other algorithms that maybe interesting to implement.

STFT-FD

https://www.sciencedirect.com/science/article/pii/S2352711017300638

A matlab code can be found here:
https://github.com/ElsevierSoftwareX/SOFTX-D-17-00087

The Stft-FD provides a better way to see the pitch of a sound file (Similar to existent on the spectrogram usedd in Isotope RX), but fixes the window size.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

Creative coders use backward thinking techniques as a strategy.

guga

#132
Hi Marinus

You´re welcome :)

Here is the link for the fftpack.c (The rouziclib library)

https://github.com/Photosounder/rouziclib

The other link about STFT-FD (https://www.sciencedirect.com/science/article/pii/S2352711017300638  and https://github.com/ElsevierSoftwareX/SOFTX-D-17-00087) probably is more effective to  identify picth from what i understood on the article. So perhaps, creating 2 different types of algorithms (FFT/IFFT and STFT-FD/ISTFT-FD) can be used together to properly identify or isolate/separate  voices on a audio file, for example.

Perthaps using FFT 1st and later using  STFT-FD to finish the identification of the voice pitch maybe enough,


A spectrogram using the regular STFT looks like this:



The above have this properties:
a) Uses Hann window
b) Uses Reassigment

With reassignment you can see more clearly the sound picth of the vocal, but i presume that if it used the STFT-FD the accuracy would be way better.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Hi marinus, i analysed your code and there are few things i didn´t understood

Your function   InitFFTSinCosTable have a pointer to the variable FFT_SinCosTable and is called like:

invoke  InitFFTSinCosTable, addr FFT_SinCosTable, FFTsize, FFT

But a problem seems to happens with the size of the variable followed by a array of another variable with a different size (labeled as ComplexDataXXX which is also a virtual data with 2048 dword).

FFT_SinCosTable dd 1536 dup(?)
ComplexData    dd 2048 dup(?)
ComplexData1    dd 2048 dup(?)
ComplexData2    dd 2048 dup(?)
.....


But,since FFTSize = 1024, it means that InitFFTSinCosTable converts only 8160 bytes from FFT_SinCosTable (2040 dwords) which results data to be filled in the address at Complexdata as the image below:




Also, why 8160 bytes ? The total size of ComplexData is 8192 (2048 dwords/Real4). So, it is missing the analysis of 8 dwords ?

Another question...concerning the miltithread.

Why using ComplexData as virtual data and why using multiples variables for it ? Does it means that each  Threadfucntion (ThreadM1, ThreadM2 etc) will use separated data chuncks from ComplexData ? I mean each ComplexData2, ComplexData3 is a copy of COmplexdata or pointers to the total size of ComplexData divided by 16 ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

Because some sin cos pairs are used several times and are only calculated once in the sin cos table.
Multiple complex data tables are used to simulate the real world or else it would be cheating for measuring the multithreaded timing.
It can be used to calculate multiple fft's and matching complex data tables simultaneous.
Creative coders use backward thinking techniques as a strategy.