Multithreading with FFT-IFFT

Started by Siekmanski, May 14, 2015, 07:58:23 PM


Raistlin

I've done some reading on performance characteristics
that can be vetted against research papers for FFTs. For
some strange reason they prefer MFLOPS as an indicator
of implementation speed. A quick tour around the internet
yielded this approximation for a 1024-point FFT:

FORMULA mflops = 5 N log2(N) / (time for one FFT in microseconds)
--------------------------------------------------------------------------
see: http://stackoverflow.com/questions/7957907/how-many-ffts-per-second-can-i-do-on-my-smartphone-for-performing-voice-recogn

see : http://www.fftw.org/speed/

Comparing this to fftw.org's results for library-based FFTs reveals a huge difference.

5 * 1024 * 10 / 2.5 = 20480 MFLOPS

Given rrr's FFT algorithm, the MFLOPS count is likely between 9,000 and 20,000+ on average
for modern CPUs (post-2008).
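
As a sanity check, the formula is trivial to encode (hypothetical helper name, just restating the approximation above):

```python
import math

def fft_mflops(n, time_us):
    """Estimated Mflop/s for one n-point FFT that completes in time_us
    microseconds, using the conventional 5*N*log2(N) flop count."""
    return 5 * n * math.log2(n) / time_us
```

fft_mflops(1024, 2.5) reproduces the 20480 figure above.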

Twice to five times as fast as the best out there.

(I've also seen a formula for power-of-2 FFTs of 4 N log N; I don't know how relevant that is.)

May we celebrate, or am I missing something?  :greenclp:
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

Well, lettuce review the Mflop calculations.

The FFT does log N passes; in this case, N = 1024, that's 10 passes. Each one goes thru all 1024 numbers in pairs, that's 512 pairs. For each pair it does a so-called butterfly calculation. Call the pair a and b, first b is rotated thru an angle. The result, added to a, replaces a; and subtracted from (original) a, replaces b. Each bfly calc then requires one rotation (complex mult) plus two complex additions. That comes to 4 mults and 6 adds, 10 flops per pair, or 5 flops per number. So clearly that's 5 N log2(N) flops. If we divide by time in microseconds we get your formula:

FORMULA mflops / sec = 5 N log2(N) / (time for one FFT in microseconds)

where I've included the missing "/sec" on the left-hand side.

There are plenty of other details to consider but this is the main story ... The fascinating thing is, why this procedure is equivalent to the regular Fourier definition; but that's rather too tricky to go into. Other details include the bit-reverse swap, normalization, and other ways to get a few more micros out of it, like siekmanski's LUTs (if one reads the literature undoubtedly even more have been thought of). Anyway if you take account of some of those you might modify that factor of "5" to a "4" instead, justified in various ways.
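
For illustration only - here is a plain-Python sketch of the procedure just described (bit-reverse swap, then log2(N) passes of butterflies), with none of the LUT or other micro-optimizations; the counter adds 10 real flops per butterfly, which is exactly where 5 N log2(N) comes from:

```python
import cmath

def fft_scalar(x):
    """Iterative radix-2 DIT FFT illustrating the butterfly described above.
    Returns (spectrum, real_flop_count)."""
    a = list(x)
    n = len(a)
    # bit-reverse swap so every pass works on naturally grouped pairs
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    flops = 0
    span = 1
    while span < n:                         # log2(n) passes
        for start in range(0, n, 2 * span):
            for k in range(span):           # n/2 butterflies per pass in total
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                t = w * a[start + span + k]             # rotate b: 4 mults + 2 adds
                a[start + span + k] = a[start + k] - t  # complex sub: 2 adds
                a[start + k] = a[start + k] + t         # complex add: 2 adds
                flops += 10                 # 10 real flops per butterfly
        span *= 2
    return a, flops
```

For N = 1024 the counter lands on 51,200 flops - the numerator of the formula.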

So that all makes sense. Main thing I did was take advantage of SIMD to get 4 operations for the time of one; that's why it's 4 times faster. The refs you're reading haven't done that for a couple of reasons. One, the regular way doesn't allow it, u have to think about it, which most computer scientists don't do because it gives them a splitting headache, and they get paid the same regardless. Of course people have done what I did (no big deal), but you'll have to search to find it, their papers lost among the overwhelming mass of drivel turned out by hacks. The other: your refs are years old, and SIMD has been available only a decade or so.
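
The trick can be sketched in NumPy, standing in for the packed SSE registers: each pass performs all N/2 butterflies with whole-array slices instead of a scalar pair loop, so every arithmetic op works on many lanes at once. The layout and names here are hypothetical, not the actual asm routine:

```python
import numpy as np

def fft_lanes(x):
    """Radix-2 DIT FFT where each pass is expressed as bulk array ops -
    the same data layout that lets SIMD do k butterflies per instruction."""
    a = np.asarray(x, dtype=np.complex128)
    n = a.size
    bits = n.bit_length() - 1
    # bit-reverse permutation, itself computed as a vector
    idx = np.arange(n)
    rev = np.zeros(n, dtype=int)
    for b in range(bits):
        rev |= ((idx >> b) & 1) << (bits - 1 - b)
    a = a[rev]
    span = 1
    while span < n:
        # one twiddle per lane position in this pass
        w = np.exp(-2j * np.pi * np.arange(span) / (2 * span))
        a = a.reshape(-1, 2 * span)
        top, bot = a[:, :span], a[:, span:]
        t = bot * w                          # rotate all the b's at once
        a = np.hstack([top + t, top - t]).reshape(-1)
        span *= 2
    return a
```

The result matches a reference FFT for any power-of-2 length.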

Bottom line: no great cause for celebration, my algo is not real exceptional, except compared to the normal cr*p you find in the classroom. IF we can speed it up another factor of (say) 4 that would be exceptional - and, I bet, possible, tho I don't know how.
I am NaN ;)

Raistlin

Years or decades old = NULL

The above links are in fact contemporary - FFTW 3.3.4, for example, has AVX support and is tested on Xeons etc.
In fact, recent papers still cite the open-source FFTW lib as important in FFT-IFFT-based and other research.
Guess what - they've never bothered writing their own <- so touché about your comp-science inference.

What's wrong rrr, can't take praise when it's given? :t
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

What's wrong rrr, can't take praise when it's given?

- Should be pretty good at it, had a lot of practice. But I can't believe this simple little trick isn't well-known to some people at least, even if it's not commonly known. I'll have to study your refs when I get around to it ...

BTW using AVX you'll get another factor of approx. 2 improvement, but then the minimum size is 32 instead of 16

And would like to look at some other tricks, pretty sure one can get at least another factor of two out of it. Problem is one has to understand that b'fly technique better, makes one feel like you're standing on your head on a merry-go-round (if u know what I mean). I wish Gauss were still alive, it would be so easy for him
I am NaN ;)

rrr314159

Well, I reviewed a bunch of pubs on this FFT vectorizing / parallelizing / SIMDizing topic. Of course there was a bunch of work done which has nothing directly to do with my algo, but at least 10% of these papers are in that same area. My guess is that they finally figured it out in 2011, but I can't get copies of most of these papers; only the abstract. Prior to 2011 I can tell they hadn't got it yet, because of the techniques they did use: they wouldn't have done those things if they'd thought of this simple trick. Techniques include using scatter/gather, lots of shuffle ops, auto-vectorization etc. Also the speed-ups claimed are less than perfect; for instance with 8-vectors they're claiming speed-ups of 6 or so. It should be a perfect "8". Here are some of the most-cited papers (out of hundreds) that don't get it:

1) P. N. Swarztrauber, "Vectorizing the FFTs," in Parallel Computations.

2) David H. Bailey, "A High-Performance FFT Algorithm for Vector Supercomputers," NASA Ames Research Center, Moffett Field, CA 94035.

3) Daniel S. McFarlin (dmcfarli@ece.cmu.edu) and Volodymyr Arbatov (arbatov@ece.cmu.edu), "Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets," Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213.

The well-known shift to parallelism in CPUs is often associated with multicores. However, another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-data (SIMD) vector units. Intel's SSE and IBM's VMX (compatible to AltiVec) both offer 4-way (single precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and often the achievable speed-up by using vectorizing compilers is small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks such as small transpositions and perfect shuffles that are needed in a variety of high performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speedup of 5.5–6.5 for 8-way AVX and 10–12.5 for 16-way LRB. For the latter, instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high performance FFTs for current and future vector architectures.

4) Franz Franchetti, Stefan Kral, Juergen Lorenz, and Christoph W. Ueberhuber, "Efficient Utilization of SIMD Extensions," Proceedings of the IEEE, vol. 93, no. 2, pp. 409-425, 2005. DOI: 10.1109/JPROC.2004.840491

This paper targets automatic performance tuning of numerical kernels in the presence of multilayered memory hierarchies and single-instruction, multiple-data (SIMD) parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine's general-purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and, thus, inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes: 1) symbolic vectorization of digital signal processing transforms; 2) straight-line code vectorization for numerical kernels; and 3) compiler back ends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speedups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.

5) David A. Carlson, "Ultrahigh-performance FFTs for the CRAY-2 and CRAY Y-MP supercomputers," The Journal of Supercomputing, vol. 6, no. 2, pp. 107-116, 1992. DOI: 10.1007/BF00129773

In this paper a set of techniques for improving the performance of the fast Fourier transform (FFT) algorithm on modern vector-oriented supercomputers is presented. Single-processor FFT implementations based on these techniques are developed for the CRAY-2 and the CRAY Y-MP, and it is shown that they achieve higher performance than previously measured on these machines. The techniques include (1) using gather/scatter operations to maintain optimum length vectors throughout all stages of small- to medium-sized FFTs, (2) using efficient radix-8 and radix-16 inner loops, which allow a large number of vector loads/stores to be overlapped, and (3) prefetching twiddle factors as vectors so that on the CRAY-2 they can later be fetched from local memory in parallel with common memory accesses. Performance results for Fortran implementations using these techniques demonstrate that they are faster than Cray's library FFT routine CFFT2. The actual speedups obtained, which depend on the size of the FFT being computed and the supercomputer being used, range from about 5 to over 300%.

- This (5) must be close, but finally in 2011 these guys apparently got it:

6) R. Crandall and J. Klivington, "Supercomputer-style FFT library for Apple G4."

Abstract

Many traditional algorithms for computing the fast Fourier transform (FFT) on conventional computers are unacceptable for advanced vector and parallel computers because they involve nonunit, power-of-two memory strides. This paper presents a practical technique for computing the FFT that avoids all such strides and appears to be near-optimal for a variety of current vector and parallel computers. Performance results of a program based on this technique are presented. Notable among these results is that a Fortran implementation of this algorithm on the CRAY-2 runs up to 77% faster than Cray's assembly-coded library routine.

- They claim to avoid all power-of-2 strides and get "near-optimal" performance. That's got to be the right algo. I wish I could see a copy of this paper. A general truth is: it's approximately impossible to come up with anything (in technical topics) that's actually new. There are just too many very smart people out there working on it. The reason one sometimes thinks an idea is new: there are so many not-so-good people (9 out of 10, say), and their mountains of papers make it hard to find the ones that count.

- Let me emphasize also that earlier work was on many other, more general but related problems and that partly explains why they didn't get this simple trick. For more understanding of how it could be missed, review what people say about things like disposing of nuclear waste. It's clear that about 85-90% of so-called "experts" are a bunch of idiots. Also consider the fact that an author like Wittgenstein is still considered worth reading! (really - I'm not kidding! Impossible to believe, but true!) Let's face it, a good 90% of "experts" couldn't think their way out of a paper bag
I am NaN ;)

jj2007

Quote from: rrr314159 on June 12, 2015, 02:51:39 PM
review what people say about things like disposing of nuclear waste. It's clear that about 85-90% of so-called "experts" are a bunch of idiots.

I am not a nuclear expert, just a humble economist. What distinguishes me from the experts who planned Three Mile Island, Chernobyl and Fukushima, or from those who wrote the Rasmussen Report aka WASH-1400: They were surprised - I was not.

Unfortunately, science is not very scientific. What is most helpful in reading scientific papers is the golden rule: "The bigger the potential profit, the bigger are the lies".

Gunther

Quote from: jj2007 on June 12, 2015, 10:29:58 PM
Unfortunately, science is not very scientific.

Unfortunately, this is true.

Quote from: jj2007 on June 12, 2015, 10:29:58 PM
What is most helpful in reading scientific papers is the golden rule: "The bigger the potential profit, the bigger are the lies".

We're talking about how science operates. I would slightly alter that: the more research funds are up for grabs, the greater the speculations. A good example of this is the various string theories.

Gunther
You have to know the facts before you can distort them.

Siekmanski

Just to say I'm still around here.
I'm having a break at the moment. ( no time to do much programming the next weeks  :( )
I'm still studying the FFT and found some special cases where I think we can reduce calculations. ( have to prove this first... )
I'll be back if it works out, hopefully with some speed improvements.
Creative coders use backward thinking techniques as a strategy.

Raistlin

rrr:
Quote
- They claim to avoid all power-of-2 strides and get "near-optimal" performance. That's got to be the right algo. I wish I could see a copy of this paper

Here you go :  http://www.davidhbailey.com/dhbpapers/fftzp.pdf
and here's the paper for your Apple G4 ref: https://www.apple.com/ca/acg/pdf/g4fft.pdf

It seems a bit old - will read and report back, if I can make sense of it.
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

Gunther

Hi Raistlin,

thank you for the links.  :t

Bailey's paper is very old and treats the Cray architecture, but the G4 paper could be helpful. It includes a kind of pseudo code and a lot of hints. It was written 15 years ago, but that doesn't matter. The only difficulty is translating it from AltiVec to XMM.

Gunther
You have to know the facts before you can distort them.

rrr314159

Thanks, Raistlin

Both these papers are relevant and both may discuss the technique I used, not sure yet - they're in the ballpark anyway. They discuss a lot of other things as well, probably there are ideas that would help in our case; altho a couple of them are already in siekmanski's original version, for instance unit-stride in the precomputed powers of two.
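
For example, the unit-stride idea for the precomputed tables amounts to one sequential twiddle table per pass, so the inner loop never strides through memory - a hypothetical sketch of the layout, not siekmanski's actual code:

```python
import cmath

def twiddle_tables(n):
    """One twiddle LUT per FFT pass of an n-point transform, each stored
    contiguously so a pass streams through it at unit stride."""
    tables = []
    span = 1
    while span < n:
        tables.append([cmath.exp(-2j * cmath.pi * k / (2 * span))
                       for k in range(span)])
        span *= 2
    return tables
```

For n = 1024 that's 10 tables of sizes 1, 2, ..., 512 - the same factors the butterfly passes consume in order.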

My power's been out since a tree fell on the house, so I haven't been around. Should be back online in a couple of days.
I am NaN ;)

Raistlin

After heavy reading, it seems that the overall parallel algorithm against which we should measure any relevance
is that of M. Pease - the originator of the bit-reversal and butterfly idea. Subsequently, very
small but cumulative improvements have been made since the paper's release of 1986.
Confusingly, the G4 Apple paper & Bailey might just be reiterating the point..... @#ck I wish I was better at math!
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

Quote from: Raistlin
...I wish I was better at math!

- Who doesn't? If u have specific questions about these papers - like, an equation or symbol - I can help, but  if the whole thing's baffling you're probably better off finding a good tutorial. Thanks for doing the research ...
I am NaN ;)

Siekmanski

Found this on the WEB,

Optimized Sparse Fast Fourier Transform

http://www.spiral.net/software/sfft.html

A reference implementation of the SFFT on the SFFT website at MIT

http://groups.csail.mit.edu/netmit/sFFT/
Creative coders use backward thinking techniques as a strategy.

Raistlin

Ta for the links, Siekmanski.

A couple of notes on this:
1) Actually, Sparse FFT was slower than or only equal to FFTW on average - leaving out the "optimized" case the author then provides as proof of a 4-to-5x speedup.
2) Note that the standard library FFTW, used in MATLAB etc., is still the de facto standard for scientific comparison.
3) This statement is alarming - it creates the impression of additional speed-up, which I don't believe is quite true (see: http://www.spiral.net/bench.html), and in my humble opinion it casts doubt on the validity or quality of the research:
"However, the reference implementation is not optimized for modern hardware features such as the cache hierarchy, vector instruction sets, or multithreading"
then this...
"Our flagship is the SPIRAL program generation system, which, entirely autonomously, generates platform-tuned implementations of signal processing transforms such as the discrete Fourier transform, discrete cosine transform, and many others."

PS: ...and we've proven ours (this masm forum endeavour) is faster by a factor of 4 to 5 over FFTW anyway, so we should all enroll at MIT and write a paper.  :badgrin:

Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...