Here is my SIMD-ized FFT routine. This is all original work except for a couple places where I borrowed siekmanski's code (mainly SysInfo.asm). However it never would have happened without his example (FFTroutines.Asm, etc) which I referred to constantly to check my results. The algo is original mathematics, altho no doubt it's already been done.
On my i5 4 core 2.94 Ghz machine it averages about 2.56 microseconds per 1024 pt complex FFT, 7500 cycles. (AMD is a lot slower). Could be better, with better cache management, but hard to see how the basic algo can be improved.
This code compiles as 32-bit or 64-bit without modification. I've gotten into the habit of writing all code for both 32 and 64 for configuration management purposes. Got sick of maintaining 2 versions, 64-bit for me and 32-bit to post here. 64-bit needs just 64-bit linker and kernel32, msvcrt lib's; no includes (and JWasm for both 32 and 64). Ask if interested how it works; or u can just ignore the issue, pretend it's straight 32-bit code, no problem.
.exe's are heavy: 52k 32-bit, and 62k 64-bit. All passes are unwound (of course there are 10 passes for 1024 FFT); also there's 16k of data hard-coded in data statements. This data should, of course, be coming in via a file. So the footprint can be reduced to, I don't know, probably 15k or so for 32-bit, with a 5-10% speed hit.
The main run does 4000 FFT's in memory, consisting of the same 4 repeated. If you have 6 or more threads it will use 6000 FFTs (24 meg), and the LUTs take another 10 meg, so there are no extra cache hits timing multiple runs; very important for good timing, since cache is vital.
I've tested up to 16 million point FFT, works fine but takes minutes. 1 million points about a minute. Should go as high as your memory allows, up to a few hundred millions of points, but take hours. The problem is the init routines: bit-reverse swap table, and sin/cos table. They're negligible for 1024-pt, but at 1 million take 30 seconds. They can be sped up, get down to a second no doubt, if anybody cares.
As usual this code is written in macro-heavy style, almost no proc's. I used to intend to learn better style but am beginning to think you should learn my style; it's so much easier and runs faster. Anyway - you're not going to like it! Oh well at least the algo is worth seeing, u can rewrite it with all the #$%^&*() proc's you want.
Although there are only 1000+ lines of code, it's divided into 7 files for modularity; the source is a bit over 40k. Here are all the files in the zip:
FFT.exe, FFT64.exe: the two exes, 53 and 62 k. BTW they don't use waitkey
doFFT.bat, doFFT64.bat: make files
FFT.asm: 10k, main file, includes thread and timing exec routines
FFTmacros.asm: 12k, the actual FFT routines
SysInfo.asm: 4k, borrowed from siekmanski with mods, gets sys info, num of threads
support_macs.asm: 12k, lots of macros: printing, timing, utility
invoke_macros.asm: 6k, necessary for 64-bit; has worked, no bugs, for 5 months(!)
header3264.inc: 4k, allows code to work 32 or 64. Very simple.
DFTcorrelation.asm: 4k, NOT included in prog, but can be. For test cases and <16 pts.
Here are my timing runs:FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20
Last few Cos's and Sin's, total 2040:
0.195 0.189 0.183 0.177 0.171 0.165 0.159 0.153
0.147 0.141 0.135 0.128 0.122 0.116 0.110 0.104
0.098 0.092 0.086 0.080 0.074 0.067 0.061 0.055
0.049 0.043 0.037 0.031 0.025 0.018 0.012 0.006
Real Data FREQUENCY DOMAIN:
0.161 0.161 0.161 0.160 0.158 0.156 0.153 0.149
0.145 0.140 0.134 0.128 0.122 0.115 0.107 0.099
0.091 0.083 0.074 0.065 0.056 0.047 0.038 0.029
0.020 0.012 0.003 -0.005 -0.014 -0.021 -0.029 -0.036
Imag Data FREQUENCY DOMAIN:
0.118 0.008 -0.002 -0.012 -0.021 -0.031 -0.041 -0.050
-0.059 -0.067 -0.075 -0.083 -0.090 -0.097 -0.103 -0.109
-0.114 -0.118 -0.122 -0.125 -0.127 -0.129 -0.130 -0.130
-0.130 -0.129 -0.127 -0.125 -0.122 -0.119 -0.115 -0.111
Cycles/1000 (1 thread) 27240
MicroSeconds for one FFT 9.32
Cycles/1000 on main thread 30111450
MicroSeconds 4000 FFTs 4 threads 10301.37
MicroSeconds for one FFT (average) 2.5753
====================================
Real Data from Copied FFTs:
1.000 2.000 3.000 4.000 5.000 7.000 9.000 11.000
11.110 12.120 13.130 14.000 15.000 17.000 19.000 21.000
0.000 -0.000 0.000 -0.000 0.000 -0.000 -0.000 -0.000
0.000 -0.000 -0.000 0.000 0.000 -0.000 -0.000 -0.000
Imag Data from Copied FFTs:
1.100 1.200 1.300 1.400 1.500 1.700 1.900 1.110
1.110 1.120 1.130 1.140 1.150 1.170 1.190 1.210
0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
=======================================================================
FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20
Last few Cos's and Sin's, total 2040:
0.195 0.189 0.183 0.177 0.171 0.165 0.159 0.153
0.147 0.141 0.135 0.128 0.122 0.116 0.110 0.104
0.098 0.092 0.086 0.080 0.074 0.067 0.061 0.055
0.049 0.043 0.037 0.031 0.025 0.018 0.012 0.006
Real Data FREQUENCY DOMAIN:
0.161 0.161 0.161 0.160 0.158 0.156 0.153 0.149
0.145 0.140 0.134 0.128 0.122 0.115 0.107 0.099
0.091 0.083 0.074 0.065 0.056 0.047 0.038 0.029
0.020 0.012 0.003 -0.005 -0.014 -0.021 -0.029 -0.036
Imag Data FREQUENCY DOMAIN:
0.118 0.008 -0.002 -0.012 -0.021 -0.031 -0.041 -0.050
-0.059 -0.067 -0.075 -0.083 -0.090 -0.097 -0.103 -0.109
-0.114 -0.118 -0.122 -0.125 -0.127 -0.129 -0.130 -0.130
-0.130 -0.129 -0.127 -0.125 -0.122 -0.119 -0.115 -0.111
Cycles/1000 (1 thread) 27354
MicroSeconds for one FFT 9.36
Cycles/1000 on main thread 29942264
MicroSeconds 4000 FFTs 4 threads 10243.49
MicroSeconds for one FFT (average) 2.5609
====================================
Real Data from Copied FFTs:
1.000 2.000 3.000 4.000 5.000 7.000 9.000 11.000
11.110 12.120 13.130 14.000 15.000 17.000 19.000 21.000
0.000 -0.000 0.000 -0.000 0.000 -0.000 -0.000 -0.000
0.000 -0.000 -0.000 0.000 0.000 -0.000 -0.000 -0.000
Imag Data from Copied FFTs:
1.100 1.200 1.300 1.400 1.500 1.700 1.900 1.110
1.110 1.120 1.130 1.140 1.150 1.170 1.190 1.210
0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
=======================================================================
=======================================================================
=======================================================================
FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
AMD A6-6310 APU with AMD Radeon R4 Graphics; Windows 8; 4 threads
Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20
Last few Cos's and Sin's, total 2040:
0.195 0.189 0.183 0.177 0.171 0.165 0.159 0.153
0.147 0.141 0.135 0.128 0.122 0.116 0.110 0.104
0.098 0.092 0.086 0.080 0.074 0.067 0.061 0.055
0.049 0.043 0.037 0.031 0.025 0.018 0.012 0.006
Real Data FREQUENCY DOMAIN:
0.161 0.161 0.161 0.160 0.158 0.156 0.153 0.149
0.145 0.140 0.134 0.128 0.122 0.115 0.107 0.099
0.091 0.083 0.074 0.065 0.056 0.047 0.038 0.029
0.020 0.012 0.003 -0.005 -0.014 -0.021 -0.029 -0.036
Imag Data FREQUENCY DOMAIN:
0.118 0.008 -0.002 -0.012 -0.021 -0.031 -0.041 -0.050
-0.059 -0.067 -0.075 -0.083 -0.090 -0.097 -0.103 -0.109
-0.114 -0.118 -0.122 -0.125 -0.127 -0.129 -0.130 -0.130
-0.130 -0.129 -0.127 -0.125 -0.122 -0.119 -0.115 -0.111
Cycles/1000 (1 thread) 52253
MicroSeconds for one FFT 29.78
Cycles/1000 on main thread 65106250
MicroSeconds 4000 FFTs 4 threads 37107.64
MicroSeconds for one FFT (average) 9.2769
====================================
Real Data from Copied FFTs:
1.000 2.000 3.000 4.000 5.000 7.000 9.000 11.000
11.110 12.120 13.130 14.000 15.000 17.000 19.000 21.000
0.000 -0.000 0.000 -0.000 0.000 -0.000 -0.000 -0.000
0.000 -0.000 -0.000 0.000 0.000 -0.000 -0.000 -0.000
Imag Data from Copied FFTs:
1.100 1.200 1.300 1.400 1.500 1.700 1.900 1.110
1.110 1.120 1.130 1.140 1.150 1.170 1.190 1.210
0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
=======================================================================
FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
AMD A6-6310 APU with AMD Radeon R4 Graphics ; Windows 8; 4 threads
Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20
Last few Cos's and Sin's, total 2040:
0.195 0.189 0.183 0.177 0.171 0.165 0.159 0.153
0.147 0.141 0.135 0.128 0.122 0.116 0.110 0.104
0.098 0.092 0.086 0.080 0.074 0.067 0.061 0.055
0.049 0.043 0.037 0.031 0.025 0.018 0.012 0.006
Real Data FREQUENCY DOMAIN:
0.161 0.161 0.161 0.160 0.158 0.156 0.153 0.149
0.145 0.140 0.134 0.128 0.122 0.115 0.107 0.099
0.091 0.083 0.074 0.065 0.056 0.047 0.038 0.029
0.020 0.012 0.003 -0.005 -0.014 -0.021 -0.029 -0.036
Imag Data FREQUENCY DOMAIN:
0.118 0.008 -0.002 -0.012 -0.021 -0.031 -0.041 -0.050
-0.059 -0.067 -0.075 -0.083 -0.090 -0.097 -0.103 -0.109
-0.114 -0.118 -0.122 -0.125 -0.127 -0.129 -0.130 -0.130
-0.130 -0.129 -0.127 -0.125 -0.122 -0.119 -0.115 -0.111
Cycles/1000 (1 thread) 56231
MicroSeconds for one FFT 32.05
Cycles/1000 on main thread 65542234
MicroSeconds 4000 FFTs 4 threads 37356.13
MicroSeconds for one FFT (average) 9.3390
====================================
Real Data from Copied FFTs:
1.000 2.000 3.000 4.000 5.000 7.000 9.000 11.000
11.110 12.120 13.130 14.000 15.000 17.000 19.000 21.000
0.000 -0.000 0.000 -0.000 0.000 -0.000 -0.000 -0.000
0.000 -0.000 -0.000 0.000 0.000 -0.000 -0.000 -0.000
Imag Data from Copied FFTs:
1.100 1.200 1.300 1.400 1.500 1.700 1.900 1.110
1.110 1.120 1.130 1.140 1.150 1.170 1.190 1.210
0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
Note: this is tested on only two computers. I hope it works on yours. Good luck!