News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Multithreading with FFT-IFFT

Started by Siekmanski, May 14, 2015, 07:58:23 PM

Previous topic - Next topic

rrr314159

Quote from: FORTRANSI know people have used FFT to deconvolve larger images.

- thanks for info! hate to ask another dumb q., but what does one do with a 2-d FFT of an image? Is it used for pattern recognition, compression, ... ?
I am NaN ;)

dedndave

images aren't the only use for FFT   :biggrin:

a while back, Marinus posted a nice little spectrum analyzer demo   :t

rrr314159

using a 2-d (not regular 1-d) FFT?
I am NaN ;)

Siekmanski

Same FFT routine as in this topic.

http://masm32.com/board/index.php?topic=3665.msg38908#msg38908
Creative coders use backward thinking techniques as a strategy.

FORTRANS

Hi,

   Turns out I did do 1024x1024 FFT's for image processing.

Quote from: rrr314159 on June 04, 2015, 01:27:54 PM
Quote from: FORTRANSI know people have used FFT to deconvolve larger images.

- thanks for info! hate to ask another dumb q., but what does one do with a 2-d FFT of an image? Is it used for pattern recognition, compression, ... ?

   The short answer is yes.  Pattern recognition using correlation
against known images.  Similar to feature extraction?  Clean-up
by masking specific frequencies.  Think of removing ~60 Hz from
a TV image.  Convolution and deconvolution, think of sharpening
and removal of complex artifacts.  Compression, sort of.  Think of
the cosine transform in JPEG images.  You can suppress frequencies
in an FFT of an image in a similar manner to reduce the data set.
Low pass, high pass, and band pass filtering.  Probably left some
stuff out.

   I have a book here "Deconvolution of Images and Spectra",
second edition, edited by Peter A. Jansson, which discusses
sharpening Hubble Space Telescope data (among other things).
It even mentions 3D FFT's to characterize a PSF (point spread
function) (image blur) and OTF (optical transfer function) (FFT
of the PSF).  However many other methods, other than FFT, are
discussed.

Regards,

Steve N.

rrr314159

Here is my SIMD-ized FFT routine. This is all original work except for a couple places where I borrowed siekmanski's code (mainly SysInfo.asm). However it never would have happened without his example (FFTroutines.Asm, etc) which I referred to constantly to check my results. The algo is original mathematics, altho no doubt it's already been done.

On my i5 4 core 2.94 Ghz machine it averages about 2.56 microseconds per 1024 pt complex FFT, 7500 cycles. (AMD is a lot slower). Could be better, with better cache management, but hard to see how the basic algo can be improved.

This code compiles as 32-bit or 64-bit without modification. I've gotten into the habit of writing all code for both 32 and 64 for configuration management purposes. Got sick of maintaining 2 versions, 64-bit for me and 32-bit to post here. 64-bit needs just 64-bit linker and kernel32, msvcrt lib's; no includes (and JWasm for both 32 and 64). Ask if interested how it works; or u can just ignore the issue, pretend it's straight 32-bit code, no problem.

.exe's are heavy: 52k 32-bit, and 62k 64-bit. All passes are unwound (of course there are 10 passes for 1024 FFT); also there's 16k of data hard-coded in data statements. This data should, of course, be coming in via a file. So the footprint can be reduced to, I don't know, probably 15k or so for 32-bit, with a 5-10% speed hit.

The main run does 4000 FFT's in memory, consisting of the same 4 repeated. If you have 6 or more threads it will use 6000 FFTs (24 meg), and the LUTs take another 10 meg, so there are no extra cache hits timing multiple runs; very important for good timing, since cache is vital.

I've tested up to 16 million point FFT, works fine but takes minutes. 1 million points about a minute. Should go as high as your memory allows, up to a few hundred millions of points, but take hours. The problem is the init routines: bit-reverse swap table, and sin/cos table. They're negligible for 1024-pt, but at 1 million take 30 seconds. They can be sped up, get down to a second no doubt, if anybody cares.

As usual this code is written in macro-heavy style, almost no proc's. I used to intend to learn better style but am beginning to think you should learn my style; it's so much easier and runs faster. Anyway - you're not going to like it! Oh well at least the algo is worth seeing, u can rewrite it with all the #$%^&*() proc's you want.

Although there are only 1000+ lines of code, it's divided into 7 files for modularity; the source is a bit over 40k. Here are all the files in the zip:

FFT.exe, FFT64.exe: the two exes, 53 and 62 k. BTW they don't use waitkey
doFFT.bat, doFFT64.bat: make files

FFT.asm: 10k, main file, includes thread and timing exec routines
FFTmacros.asm: 12k, the actual FFT routines
SysInfo.asm: 4k, borrowed from siekmanski with mods, gets sys info, num of threads
support_macs.asm: 12k, lots of macros: printing, timing, utility
invoke_macros.asm: 6k, necessary for 64-bit; has worked, no bugs, for 5 months(!)
header3264.inc: 4k, allows code to work 32 or 64. Very simple.
DFTcorrelation.asm: 4k, NOT included in prog, but can be. For test cases and <16 pts.

Here are my timing runs:FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000 (1 thread)                  27240
MicroSeconds for one FFT                9.32

Cycles/1000 on main thread              30111450
MicroSeconds 4000 FFTs 4 threads        10301.37
MicroSeconds for one FFT (average)      2.5753

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100

=======================================================================
FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000 (1 thread)                  27354
MicroSeconds for one FFT                9.36

Cycles/1000 on main thread              29942264
MicroSeconds 4000 FFTs 4 threads        10243.49
MicroSeconds for one FFT (average)      2.5609

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100

=======================================================================
=======================================================================
=======================================================================

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
AMD A6-6310 APU with AMD Radeon R4 Graphics; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000 (1 thread)                  52253
MicroSeconds for one FFT                29.78

Cycles/1000 on main thread              65106250
MicroSeconds 4000 FFTs 4 threads        37107.64
MicroSeconds for one FFT (average)      9.2769

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100

=======================================================================
FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
AMD A6-6310 APU with AMD Radeon R4 Graphics    ; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000 (1 thread)                  56231
MicroSeconds for one FFT                32.05

Cycles/1000 on main thread              65542234
MicroSeconds 4000 FFTs 4 threads        37356.13
MicroSeconds for one FFT (average)      9.3390

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100


Note: this is tested on only two computers. I hope it works on yours. Good luck!
I am NaN ;)

Siekmanski

Hi rrr314159,

QuoteAs usual this code is written in macro-heavy style, almost no proc's. I used to intend to learn better style but am beginning to think you should learn my style; it's so much easier and runs faster.
Anyway - you're not going to like it! Oh well at least the algo is worth seeing, u can rewrite it with all the #$%^&*() proc's you want.

:biggrin:

Just had a quick look, but i'll have to run it thru an interpreter first to understand it.
I'll study it this weekend.  :t

Here are the timings: (64 bit version crashed...)

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             27430
MicroSeconds for one FFT                1.92

Cycles/1000 on main thread              28895221
MicroSeconds 6000 FFTs 12 threads       2018.08
MicroSeconds for one FFT (average)      0.3363

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100



FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             27312
MicroSeconds for one FFT                1.91

---- here it crashed ----


Creative coders use backward thinking techniques as a strategy.

dedndave

prescott w/htt, xp sp3

i get this far, then it crashes with access violation
can't tell where - the SEH sets the EIP back to 401000

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Pentium(R) 4 CPU 3.00GHz; Windows XP; 2 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             194207
MicroSeconds for one FFT                0.06

Cycles/1000 on main thread              366340742
MicroSeconds 2000 FFTs 2 threads        122.11
MicroSeconds for one FFT (average)      0.0611

====================================

Real Data from Copied FFTs:


rrr314159

Thanks,

Not too amazed it crashes in some cases, I'm doing a lot of unusual stuff.

siekmanski,

at least it works for 32-bit. Clearly the CYCLES are correct - almost identical to mine - but the TIME computation is wrong by a factor of about 4! So mult the time by about 4, should be right answer. Dunno why, probably something simple, I could only test it with my 4 threads, u have 12. Division is wrong somewhere probably

dedndave,

this is the same problem your old machine has had with my software for months, never did find out what it was, don't really know what to do about it. If I had an old Prescott I could debug it ... maybe I'm using an SSEx instruction u don't have ... ? Happens without the 64 bit stuff also

well hopefully I'll learn more from others. Would expect 32 to work for any other machine (except perhaps nidud's old one). Don't know what to expect from 64-bit tho - Works fine for me!, but ... not 2 important. If there's a prob with it I can just take that out, post a regular 32-bit version.

I am NaN ;)

dedndave

offhand, i'd guess it's something to do with the prntFromGPR macro
but - the code is confusing   :lol:

FORTRANS

Hi,

   Here are some results from two 32-bit and one 64-bit computers.
I didn't expect it to run on the P-III, but it died with a different error
message than usual.  Also I could not redirect the program output
to a file.

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
????; Windows 2000; 1 threads

The exception unknown software exception (0xc000001e) occured in the application at
location 004028c6

================================

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Pentium(R) M processor 1.70GHz; Windows XP; 1 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             85845
MicroSeconds for one FFT                23.98

================================


FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             25930
MicroSeconds for one FFT                15.66

Cycles/1000 on main thread              50343928
MicroSeconds 4000 FFTs 4 threads        30395.03
MicroSeconds for one FFT (average)      7.5988

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100


============================================


FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             25191
MicroSeconds for one FFT                15.21

Cycles/1000 on main thread              49399295
MicroSeconds 4000 FFTs 4 threads        29824.71
MicroSeconds for one FFT (average)      7.4562

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100


HTH,

Steve

rrr314159

Thanks FORTRANS and dedndave,

Boy this is annoying, I shouldn't have computed the Time, but left it at Cycles. Apparently something's wrong with time computation but don't know what since works right on my computers. I'll be able to try it on others' machines soon, may get better idea.

FORTRANS all the runs are good on your i3, not surprising it's similar to my i5 but slower. In both cases the 64-bit is a little faster, not enough to be significant tho.

However the 1.7 Pentium (XP) is wrong, I'm reading the clock speed at 3.4 instead of 1.7. Apparently something wrong with QueryPerformanceFrequency.

Similar with siekmanski, but his clock speed is 4 times higher than it should be. At the moment have no idea why I get wrong numbers, can't be that QueryPerformanceFrequency is just "wrong" on these computers, rather I'm doing something wrong.

Like I say, wish I'd just left it alone! This could also be why siek's 64-bit run crashed. So intend to make new version without the "extras" so we can concentrate on the FFT - which after all is the important part. It gives right numbers on all the computers, not surprisingly

dedndave, prntFromGPR leads to the same print routines I've used before, which I suspected crashed your machine b4, so yes that could be it. Problem is, I need to use that technique to make it work with both 32 and 64. Could make a 32-only which would have much better chance of working on Prescott - but was trying to avoid multiple versions ...

BTW you're right those print routines have become a mess as I tack on new functions, in this case printing FFT output. Need to clean it up, but was hoping there would be no reason for anyone to look closely at it ...

Thanks for looking at it everyone. Right now I'm taking a well-deserved break from FFT'ing, will make a less ambitious version in couple of days
I am NaN ;)

Gunther

Quote from: rrr314159 on June 06, 2015, 07:14:31 PM
Right now I'm taking a well-deserved break from FFT'ing, will make a less ambitious version in couple of days

Which won't crash. That's a good plan.

Gunther
You have to know the facts before you can distort them.

rrr314159

Which won't crash

- It's guaranteed not to crash on MY computers. Apart from that, no guarantees!
I am NaN ;)

rrr314159

#89
Timing comparison -

Apparently this FFT is performing as it should, using SIMD registers. As far as I can tell, with siekmanksi's version my Intel does 4 thread's worth of FFT's in 43.2746 microseconds, which means 10.81 micros per 1024 FFT average. Mine, using SSE SIMD to do 4 computations at once, averages 2.56 microseconds - a little better than 4 times. That may be due to a couple of improvements, for instance I was able to eliminate the normalization stage by incorporating into pass 1. In fact since I'm not FFT-ing the same data again and again, but moving thru a large table, the improvement may be even better. But the basic factor of 4x is exactly as expected.

Quote from: Raistlin on June 01, 2015, 05:41:00 PMTo truely make progress we need the following:
1) A timing framework that is accurate, or at least we agree on. :bgrin:
2) A test dataset that is generic through all test cases, sufficiently large to be seen as "real world".

- good ideas!

@dedndave, I got my old 1-core computer out of the dustbin (has HTT, for 2 threads, like yours) and found the bug. In printFromCopiedFFTs I'm incorrectly assuming a minimum of 4 threads. If u want to fix it, in FFT.asm, line 193, it says

mov esi, row_ptr_table[NUMBER_ROWS*2]

change to

mov esi, row_ptr_table[500*2]

With only 2 threads there are only 2000 rows whereas NUMBER_ROWS is the max, 6000. Should have tested it on this old computer ... OTOH the problem with QueryPerformanceFrequency did NOT pop up; works perfectly on this old Intel Atom.

Or u can wait until I get around to posting a new version, which I'll do if there's any interest. The current version is good enough if anyone actually wants an improved FFT algo; there are no bugs in that.
I am NaN ;)