The MASM Forum

General => The Laboratory => Topic started by: LiaoMi on June 12, 2019, 11:36:37 PM

Title: Fast Fourier transform - AVX and FMA3
Post by: LiaoMi on June 12, 2019, 11:36:37 PM
Hi,

(https://www.nti-audio.com/portals/0/pic/news/FFT-Time-Frequency-View-540.png)

I began to translate one example for masm assembler located here - https://www.nayuki.io/page/fast-fourier-transform-in-x86-assembly (https://www.nayuki.io/page/fast-fourier-transform-in-x86-assembly)

FFT library to assess the effort and speedup of a hand-written SIMD vectorized implementation. The assembly implementation is under 150 lines of clear code; it achieves a speedup of 2× on small inputs, but only slight speedup on large inputs (memory-bound?).
The assembly code requires x86-64 AVX and FMA3, supported by Intel Haswell and later CPUs.

License: MIT (open source)
Source code notes:

The key idea behind the x86-64 AVX implementation is that all processing happens on vectors of four 64-bit floats, rather than individual scalar floats. In theory this could be 4× as fast as scalar code, but is limited by memory bandwidth, FPU execution resources, inherently serial loop control code, etc. Also, the cosine/sine tables are interleaved in blocks of 4 elements, and the table values for each FFT size are packed together so that the memory access stride is always contiguous.

As a bonus, the x86-64 AVX auxiliary C code has an implementation of a more accurate sine function by doing argument reduction to a smaller domain (1/8th of a circle instead of a half circle).


It remains to correct fft-test.asm and fft-x8664-avx-aux.c.asm, I'll finish what I started with small steps, in the meantime, maybe someone will be interested to see the implementation.
Title: Re: Fast Fourier transform - AVX and FMA3
Post by: daydreamer on June 13, 2019, 04:18:11 AM
nice Liaomi
but if you replace sincos table with generate it with AVX on the fly,the code doesnt need to read lots of values from tables it would maybe handle higher input?

Title: Re: Fast Fourier transform - AVX and FMA3
Post by: Siekmanski on June 13, 2019, 10:03:06 AM
Don't think that would be faster.
You can precalculate all sin_cos pairs for each log_loop and read 8 values at once and do 8 multiplications at once.
Title: Re: Fast Fourier transform - AVX and FMA3
Post by: daydreamer on June 13, 2019, 01:54:55 PM
Quote from: Siekmanski on June 13, 2019, 10:03:06 AM
Don't think that would be faster.
You can precalculate all sin_cos pairs for each log_loop and read 8 values at once and do 8 multiplications at once.
Been there,done that and I always a SIMD SSEalot (zealot)  :tongue:
Title: Re: Fast Fourier transform - AVX and FMA3
Post by: Raistlin on June 14, 2019, 03:54:39 PM
@LiaoMi: I do believe Siekmanski and company killed this topic (as in serious SSE skill-sets applied)
numerous times in this thread -> http://masm32.com/board/index.php?topic=4231.msg75007#msg75007

Some interesting outcomes included reduction of transforms [complexity down from 5 N log2(N) to 4 N log(N)] ,
and a genius-level butterfly bit-reversal strategy (ala Siekmanski/rrr), all within multi-threaded framework.
As to your topical interest: AVX was theorized to provide a factor of x2 improvement on calculations but might
potentially incur memory management penalties that seemingly would negate performance gains from its use.
Title: Re: Fast Fourier transform - AVX and FMA3
Post by: daydreamer on June 20, 2019, 09:25:24 PM
Quote from: Raistlin on June 14, 2019, 03:54:39 PM
@LiaoMi: I do believe Siekmanski and company killed this topic (as in serious SSE skill-sets applied)
numerous times in this thread -> http://masm32.com/board/index.php?topic=4231.msg75007#msg75007
there is also some very fast chebyshev sine somewhere on board,that is my opinion to test that code instead of the LUT's in FFT code you posted
Title: Re: Fast Fourier transform - AVX and FMA3
Post by: Raistlin on June 22, 2019, 03:06:08 AM
Very cool Daydreamer will look into what I perceived to be
a passing comment way back when : chebychev thingy