Multithreading with FFT-IFFT

Started by Siekmanski, May 14, 2015, 07:58:23 PM

Siekmanski

As far as I understand this algorithm, they first remove the zero and near-zero frequencies?
They also say this is a fast way to get specific frequencies from, e.g., pulsars.
Why not use the Goertzel algorithm then?
Creative coders use backward thinking techniques as a strategy.

Raistlin

Probably because of this ???? - just guessing.... ::)

Quote: For covering a full spectrum, the Goertzel algorithm has a higher order of complexity than Fast Fourier Transform (FFT) algorithms; but for computing a small number of selected frequency components, it is more numerically efficient.

<<edit : so not as generalizable ?? >>
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

Siekmanski

That's just what I meant: if you're not interested in the whole frequency spectrum range.
If you sample, let's say, a 1 GHz signal and you want to examine only a few specific frequencies, Goertzel is much faster.  :biggrin:
Creative coders use backward thinking techniques as a strategy.

rrr314159

I'm confused about this new sparse FFT ...

First, the Goertzel algo is related to a digital filter. It's good when you know what frequencies you're looking for, and there aren't too many of them. For instance, recognizing the small number of frequencies phones use to encode the buttons on the keypad (do they still use this technique?).
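For reference, a minimal Goertzel sketch in C for a single frequency bin (the target frequency and sample rate are just placeholder parameters, not values from this thread):

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Minimal Goertzel sketch: squared magnitude of one frequency bin.
   targetFreq and sampleRate are placeholders.                      */
double goertzel_power(const float *x, int n, double targetFreq, double sampleRate)
{
    double coeff = 2.0 * cos(2.0 * M_PI * targetFreq / sampleRate);
    double s0, s1 = 0.0, s2 = 0.0;

    for (int i = 0; i < n; i++)        /* one short recurrence per sample */
    {
        s0 = x[i] + coeff * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    /* power of the selected bin only; no full spectrum is computed */
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}

That's O(N) per bin, so K bins cost O(K*N) against O(N*log N) for a full FFT - cheap when K is a handful of keypad tones, expensive if you want the whole spectrum.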

The sparse FFT referenced seems to require, as with Goertzel, that you know the frequencies you're looking for. If that's true it's not immediately obvious when you'd use either one. Perhaps the sparse FFT is for when you've got a small range of freq's but are interested in anything within that range (unlike phone pad)? But it didn't sound that way in the brief description.

One thing that confuses me - I thought there were FFT algos optimized for sparse frequency sets, when you don't know which ones you're looking for. I don't find anything on the net about that type (didn't look too hard).

The other confusion - I'm quite sure there are FFTs for sparse data sets - i.e., in the time domain. (Again, a little googling didn't turn up anything). In this case you still want to analyze a full range of frequencies - they're not "sparse" - but the incoming data has lots of zeroes.

So there seem to be four types (at least) which could be called "sparse":

1 - few frequencies, and you know what they are
2 - limited, known frequency band, but within that you want all of them
3 - few / limited frequencies, but you don't know what they are
4 - all frequencies, but few time domain points - that is, most of the time domain is zeros

It seems that the sparse FFT - which Raistlin ref'd - is either type 1 or 2. And, it's common enough that I don't easily find stuff on the net about the others. Wouldn't be surprised if they're called something different today.

Raistlin, I believe, is interested in pulsars. This isn't case 1. True, you can be sure they're not going faster than some limit (not sure what it is - 1000 Hz?) nor slower than a certain frequency (not sure of that either - a few seconds per revolution?). But within that frequency band all possibilities exist. It's not quantized, like phone keypad signals. So Goertzel, at least, doesn't seem right. Maybe that's what this "sparse FFT" is for?

But there are two things you can do with pulsars. 1) Once you've found it, analyze its frequency - that ought to be easy. 2) Comb through a lot of observational data covering a broad segment of sky, looking for regularly repeating signals. The first is what I've been mostly thinking about, but the second may be the real job in pulsar hunting. And whether the "sparse" concept applies to it at all, I don't know.

As I said I'm confused! But I hope this might help clarify some of the issues involved, for further intense googling efforts by somebody  :P
I am NaN ;)

Siekmanski

Quote: One thing that confuses me - I thought there were FFT algos optimized for sparse frequency sets, when you don't know which ones you're looking for. I don't find anything on the net about that type.

Yes, this is what I'm confused about too.

How to find those (near) zero frequencies...
I'll have a look at the code again and see if I can find how they trace those frequencies.
Then you can just skip the calculations for those frequency bins in the FFT code.
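One possible first step (just a sketch, assuming the spectrum changes slowly from frame to frame; the threshold and buffers are placeholders): do one full FFT, keep only the bins whose magnitude stays above a threshold, and evaluate just those bins on later frames.

/* Sketch: after one full FFT (not shown), mark the bins whose magnitude
   exceeds a threshold; later frames then only need those bins.          */
int select_active_bins(const float *re, const float *im, int n,
                       float threshold, int *activeBins)
{
    int count = 0;
    for (int k = 0; k < n; k++)
    {
        float mag2 = re[k] * re[k] + im[k] * im[k];   /* squared magnitude of bin k */
        if (mag2 > threshold * threshold)
            activeBins[count++] = k;                  /* keep the significant bins  */
    }
    return count;   /* number of bins worth computing next time */
}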

I'm curious what Raistlin's pulsar data looks like and what he wants to analyze exactly.

My guess is:
2 - limited, known frequency band, but within that you want all of them.

If this is true, we can resample (and band-pass filter) the frequency band down to DC and do the FFT for only that frequency band.
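A rough sketch of that idea (the centre frequency, the decimation factor and the boxcar averaging are only placeholders; a proper band-pass/low-pass design would go in their place):

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Sketch: mix the band of interest down to DC and decimate, so a much
   smaller FFT covers only that band.                                   */
void downmix_decimate(const float *in, int n, float sampleRate,
                      float fCenter, int decim,
                      float *outRe, float *outIm, int *outCount)
{
    int m = 0;
    for (int i = 0; i + decim <= n; i += decim)
    {
        double accRe = 0.0, accIm = 0.0;
        for (int k = 0; k < decim; k++)
        {
            double phase = -2.0 * M_PI * fCenter * (i + k) / sampleRate;
            accRe += in[i + k] * cos(phase);   /* shift the band centre to 0 Hz */
            accIm += in[i + k] * sin(phase);
        }
        outRe[m] = (float)(accRe / decim);     /* crude low-pass + decimation */
        outIm[m] = (float)(accIm / decim);
        m++;
    }
    *outCount = m;   /* this shorter complex buffer feeds a small FFT */
}

The effective sample rate drops to sampleRate/decim, so an FFT of the reduced buffer keeps the same bin spacing over just that band at a fraction of the work.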
Creative coders use backward thinking techniques as a strategy.

Siekmanski

Made some progress,

My new SSE routine is now 6.5 times faster than the old one.
13.5 GigaFlops for one size-1024 FFT on my machine using 1 core.

All loops run on 1 cache line (< 64 bytes), processing 16 values at once.
Except for the main butterfly loop, which needs 2 cache lines.

To make a library for all sizes I need one extra register to get rid of an immediate. ( or else do SMC... )

I have to learn more about code and data prefetching too.
On my machine one "prefetchtx" takes 43 cycles ????

Some optimisations:
Using all registers to keep the code size as small as possible for the loops.
Only using movs, adds, subs and muls in the loops (no sign flipping anymore).
SinCosTable size is now reduced to: real4 (FFTsize/2)*3 dup (?)
Second Log2 loop is done in 2 passes, even pass and odd pass.
25% fewer SinCos memory reads.

I'm curious to see some timings from other machines.

FFT IFFT routines by Siekmanski 2016.

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6
ActiveProcessorMask : 0FFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

0.0249074 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 2056

0.0037942 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 13494


Code size Main loop: 121

Press any key to continue...
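The MegaFlops figures match the usual 5*N*log2(N) operation count for a complex radix-2 FFT; a quick cross-check in C with the timings printed above (the flop-count convention is an assumption on my part, but the numbers line up):

#include <stdio.h>

/* Cross-check of the printed MegaFlops figures, assuming the usual
   5 * N * log2(N) flop count for a complex radix-2 FFT of size 1024. */
int main(void)
{
    double flops = 5.0 * 1024.0 * 10.0;       /* 51200 flops per transform      */
    double tNew  = 0.0037942 / 1000.0;        /* seconds per transform, new FFT */
    double tOld  = 0.0249074 / 1000.0;        /* seconds per transform, old FFT */

    printf("new: %.0f MFlops\n", flops / tNew / 1.0e6);   /* prints ~13494 */
    printf("old: %.0f MFlops\n", flops / tOld / 1.0e6);   /* prints ~2056  */
    return 0;
}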

Creative coders use backward thinking techniques as a strategy.

jj2007

Hi Marinus,
Core i5 results:

FFT IFFT routines by Siekmanski 2016.

Processor Name      : Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Operating System    : Windows 7
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

0.0301108 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 1700

0.0049090 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 10430


Code size Main loop: 121

sinsi

I'll fire up the AMD box later to test.


Windows 10 Home insider preview
Processor Name      : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 8
Physical Processors : 4
ActiveProcessorMask : 00FFh

0.0223553 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 2290

0.0034696 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 14757


Windows 10 Pro
Processor Name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 8
Physical Processors : 4
ActiveProcessorMask : 00FFh

0.0205278 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 2494

0.0029830 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 17164



sinsi

No comment  :icon_eek:

Processor Name      : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
Operating System    : Windows 7
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

0.0431336 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 1187

0.0151598 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 3377


Raistlin


FFT IFFT routines by Siekmanski 2016.

Processor Name      : Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

0.0527319 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 971

0.0031612 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 16196


Code size Main loop: 121

Press any key to continue...


LOL @ sinsi - AMD data
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

Raistlin

I'm interested to see if someone (penguin type, with multi-boot capability :P) would be willing to
compare FFTW under Linux with Siekmanski's output under Windows, running on the same hardware platform.

see: https://www.ll.mit.edu/HPEC/agendas/proc10/Day2/PB10_Pilaud_abstract.pdf  for batch script reference

I imagine the FFTW lib is available in most repositories but also here: http://www.fftw.org/download.html

We need data on single thread and multi-threaded runs.
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

Siekmanski

How is the FFTW benchmarking done?

If I read the docs correctly, they do measurements on 1 core, am I right?
Do they do multiple runs and pick the fastest run to calculate the MegaFlops?
Can I do a multi-threaded run divided by the number of physical processors to compare it with the FFTW benchmarks?
Because hyperthreading will influence the outcome, but it can run on the same core.

BTW, forgot to implement the optimization for the last Log2 loop....  ::)
Creative coders use backward thinking techniques as a strategy.

Raistlin

Quote: How is the FFTW benchmarking done?
Methodology: http://www.fftw.org/accuracy/method.html

Quote: If I read the docs correctly, they do measurements on 1 core, am I right?
Yes.

Quote: Do they do multiple runs and pick the fastest run to calculate the MegaFlops?
Actually the worst run.

Quote: Can I do a multi-threaded run divided by the number of physical processors to compare it with the FFTW benchmarks?
I wanted to see the relationship - or at least infer such - but as we are running a non-concurrent FFT it would be a moot point.
I also wanted to see if they have any cache sensitivity issues.

Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

Siekmanski

To compute the accuracy of an FFT, we only plot the "worst" results.
http://www.fftw.org/accuracy/method.html

I looked at the timing routine in the benchfftw code; as far as I understand the C source code, they plot the fastest run.

void report_verbose(const struct problem *p, double *t, int st)
{
     struct stats s;
     char bmin[64], bmax[64], bavg[64], bmedian[64], btmin[64];
     char bsetup[64];

     mkstat(t, st, &s);

     sprintf_time(s.min, bmin);
     sprintf_time(s.max, bmax);
     sprintf_time(s.avg, bavg);
     sprintf_time(s.median, bmedian);
     sprintf_time(time_min, btmin);
     sprintf_time(p->setup_time, bsetup);

     ovtpvt("Problem: %s, setup: %s, time: %s, ``mflops'': %.5g\n",
    p->pstring, bsetup, bmin, mflops(p, s.min));

     if (verbose) {
  ovtpvt("Took %d measurements for at least %s each.\n", st, btmin);
  ovtpvt("Time: min %s, max %s, avg %s, median %s\n",
bmin, bmax, bavg, bmedian);
     }
}


Cache sensitivity is still something I have to study; at the moment I don't use data or code prefetching.
Creative coders use backward thinking techniques as a strategy.

Raistlin

Maybe they mean mflops(p, s.min)  <-- s.min to be the worst case ?
and s.max to be the best case ?
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...