Well, I reviewed a bunch of publications on this FFT vectorizing / parallelizing / SIMDizing topic. Of course most of the work has nothing directly to do with my algo, but at least 10% of these papers are in the same area. My guess is that they finally figured it out in 2011, but I can't get copies of most of these papers, only the abstracts. Prior to 2011 I can tell they hadn't got it yet, from the techniques they did use: scatter/gather, lots of shuffle ops, auto-vectorization, and so on. They wouldn't have done those things if they'd thought of this simple trick. The speed-ups claimed are also less than perfect; for instance, with 8-vectors they claim speed-ups of 6 or so, when it should be a perfect "8". Here are some of the most-cited papers (out of hundreds) that don't get it:
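For reference, the only way a W-way speed-up can be literally perfect is if every vector operation does W useful flops with no shuffle or gather overhead. One textbook layout that achieves this (I'm not claiming it is the trick in question) is batching W independent FFTs and running them lane-parallel in struct-of-arrays order. A pure-Python sketch, where each list of W complex values stands in for one SIMD register:

```python
import cmath

W = 8   # simulated SIMD width: 8 complex lanes
N = 16  # FFT length (power of two)

def vadd(a, b): return [x + y for x, y in zip(a, b)]  # one "vector add"
def vsub(a, b): return [x - y for x, y in zip(a, b)]  # one "vector sub"
def vmul(a, s): return [x * s for x in a]             # vector times scalar twiddle

def batched_fft(data):
    """data[j] is a W-lane vector holding sample j of W independent signals.
    Every operation below is elementwise across lanes: no gathers, no
    shuffles, and each simulated instruction performs W useful butterflies'
    worth of arithmetic."""
    n = len(data)
    # Bit-reversal permutation of whole vectors (index math only; no
    # cross-lane data movement).
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            data[i], data[j] = data[j], data[i]
    # Iterative radix-2 decimation-in-time butterflies.
    size = 2
    while size <= n:
        half = size // 2
        wstep = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1.0
            for k in range(half):
                a = data[start + k]
                b = vmul(data[start + k + half], w)
                data[start + k] = vadd(a, b)
                data[start + k + half] = vsub(a, b)
                w *= wstep
        size *= 2
    return data

# W independent ramp signals, each with a different slope.
signals = [[complex((lane + 1) * j) for j in range(N)] for lane in range(W)]
# Transpose to lane-major vectors: data[j][lane] = sample j of signal `lane`.
data = [[signals[lane][j] for lane in range(W)] for j in range(N)]
out = batched_fft(data)
```

After the call, lane `k` of the output holds the ordinary scalar FFT of signal `k`; the instruction count is exactly the scalar count divided by W, which is what a "perfect 8" would require.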
1) P. N. Swarztrauber, Vectorizing the FFTs, Parallel Computations
2) David H. Bailey, A High-Performance FFT Algorithm for Vector Supercomputers, NASA Ames Research Center, Moffett Field, CA 94035
3) Daniel S. McFarlin and Volodymyr Arbatov (Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213; dmcfarli@ece.cmu.edu, arbatov@ece.cmu.edu), Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets
Abstract: The well-known shift to parallelism in CPUs is often associated with multicores. However, another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-data (SIMD) vector units. Intel's SSE and IBM's VMX (compatible with AltiVec) both offer 4-way (single precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and often the achievable speed-up by using vectorizing compilers is small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks such as small transpositions and perfect shuffles that are needed in a variety of high performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speedup of 5.5-6.5 for 8-way AVX and 10-12.5 for 16-way LRB. For the latter, instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high performance FFTs for current and future vector architectures.
4) Franz Franchetti, Stefan Kral, Juergen Lorenz, Christoph W. Ueberhuber, Efficient Utilization of SIMD Extensions, Proceedings of the IEEE, vol. 93, no. 2, pp. 409-425, 2005. DOI: 10.1109/JPROC.2004.840491
Abstract: This paper targets automatic performance tuning of numerical kernels in the presence of multilayered memory hierarchies and single-instruction, multiple-data (SIMD) parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine's general-purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and, thus, inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes: 1) symbolic vectorization of digital signal processing transforms; 2) straight-line code vectorization for numerical kernels; and 3) compiler back ends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speedups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.
5) David A. Carlson, Ultrahigh-performance FFTs for the CRAY-2 and CRAY Y-MP supercomputers, The Journal of Supercomputing, vol. 6, no. 2, pp. 107-116, 1992. DOI: 10.1007/BF00129773
Abstract: In this paper a set of techniques for improving the performance of the fast Fourier transform (FFT) algorithm on modern vector-oriented supercomputers is presented. Single-processor FFT implementations based on these techniques are developed for the CRAY-2 and the CRAY Y-MP, and it is shown that they achieve higher performance than previously measured on these machines. The techniques include (1) using gather/scatter operations to maintain optimum length vectors throughout all stages of small- to medium-sized FFTs, (2) using efficient radix-8 and radix-16 inner loops, which allow a large number of vector loads/stores to be overlapped, and (3) prefetching twiddle factors as vectors so that on the CRAY-2 they can later be fetched from local memory in parallel with common memory accesses. Performance results for Fortran implementations using these techniques demonstrate that they are faster than Cray's library FFT routine CFFT2. The actual speedups obtained, which depend on the size of the FFT being computed and the supercomputer being used, range from about 5 to over 300%.
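To make Carlson's point (1) concrete: in a plain radix-2 FFT, every stage performs N/2 butterflies, but they are split across two loops whose trip counts trade off, so whichever loop you vectorize leaves short vectors at one end of the transform; his gather/scatter repacking keeps the vectors at optimum length instead. A quick sketch of the loop geometry (illustrative only, not Carlson's code):

```python
# For an N-point radix-2 FFT, each stage does N/2 butterflies, arranged as
# `blocks` groups of `per_block` butterflies.  Early stages have many short
# blocks, late stages few long ones, so vectorizing either loop alone gives
# degenerate vector lengths at one end -- the motivation for gathering the
# N/2 butterflies into full-length vectors at every stage.

N = 64

def stage_shapes(n):
    """Return (stage size, outer-loop trips, inner-loop trips) per stage."""
    shapes = []
    size = 2
    while size <= n:
        blocks = n // size      # outer loop: independent butterfly groups
        per_block = size // 2   # inner loop: butterflies sharing a group
        shapes.append((size, blocks, per_block))
        size *= 2
    return shapes

for size, blocks, per_block in stage_shapes(N):
    print(f"stage size={size:3d}: {blocks:2d} blocks x {per_block:2d} "
          f"butterflies (always {blocks * per_block} total)")
```

The printout shows the first stage offering 32 blocks of 1 butterfly and the last stage 1 block of 32: the total work per stage is constant, but the natural vector length is not.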
- This (5) must be close, but finally in 2011 these guys apparently got it:
6) R. Crandall and J. Klivington, Supercomputer-style FFT library for Apple G4
Abstract: Many traditional algorithms for computing the fast Fourier transform (FFT) on conventional computers are unacceptable for advanced vector and parallel computers because they involve nonunit, power-of-two memory strides. This paper presents a practical technique for computing the FFT that avoids all such strides and appears to be near-optimal for a variety of current vector and parallel computers. Performance results of a program based on this technique are presented. Notable among these results is that a Fortran implementation of this algorithm on the CRAY-2 runs up to 77% faster than Cray's assembly-coded library routine.
- They claim to avoid all power-of-two strides and get "near-optimal" performance. That has to be the right algo. I wish I could see a copy of this paper. A general truth: in technical topics, it's approximately impossible to come up with anything that's actually new; there are just too many very smart people out there working on it. The reason one sometimes thinks an idea is new is that there are so many not-so-good people (9 out of 10, say), and their mountains of papers make it hard to find the ones that count.
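For what it's worth, the classic published way to "avoid all power-of-two strides" is the Stockham autosort formulation, which ping-pongs between two buffers so that every stage's innermost loop is unit-stride and no bit-reversal pass is ever needed. I can't verify from the abstract that this is the paper's algorithm, but here is a minimal sketch of that idea:

```python
import cmath

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT.  Each stage sweeps the whole array
    once, reading and writing contiguously in the innermost q-loop and
    alternating between two buffers; the output comes out in natural order
    with no bit-reversal permutation and no power-of-two-strided butterflies
    in the inner loop."""
    n = len(x)  # must be a power of two
    a, b = list(x), [0j] * n
    span, stride = n, 1
    while span > 1:
        half = span // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / span)
            for q in range(stride):  # unit-stride reads and writes
                u = a[q + stride * p]
                v = a[q + stride * (p + half)]
                b[q + stride * (2 * p)] = u + v
                b[q + stride * (2 * p + 1)] = (u - v) * w
        a, b = b, a  # ping-pong buffers
        span = half
        stride *= 2
    return a
```

Note how the two read streams stay a constant n/2 apart and the q-loop grows as the stages progress, so a vector machine always has contiguous, predictable access patterns; the price is the second buffer.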
- Let me also emphasize that the earlier work was aimed at other, more general but related problems, which partly explains why they didn't get this simple trick. For more insight into how it could be missed, look at what people say about things like disposing of nuclear waste: it's clear that about 85-90% of so-called "experts" are idiots. Also consider the fact that an author like Wittgenstein is still considered worth reading (really, I'm not kidding; impossible to believe, but true). Let's face it: a good 90% of "experts" couldn't think their way out of a paper bag.