The MASM Forum

General => The Laboratory => Topic started by: Siekmanski on May 14, 2015, 07:58:23 PM

Title: Multithreading with FFT-IFFT
Post by: Siekmanski on May 14, 2015, 07:58:23 PM
This is an attempt to do multithreading with FFT-IFFT. (SSE2 and SSE3)
I'm not sure if the multithreading part is written well.
I see some strange timing values.....
Timing code needs to be checked as well the multithreading code.

Marinus

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 12

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor9:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor10:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor11:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor12:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


7172218 cycles for 1024 FFT & IFFT with 12 cores.

Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 14, 2015, 08:03:25 PM
prescott w/htt
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 2

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

1226085 cycles for 1024 FFT & IFFT with 2 cores.
Title: Re: Multithreading with FFT-IFFT
Post by: Biterider on May 14, 2015, 08:10:35 PM
Hi,
i7 4770K CPU

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 8

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


1233527 cycles for 1024 FFT & IFFT with 8 cores.

Press any key to continue...


Biterider
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on May 14, 2015, 08:12:51 PM
HP Intel Core i5

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


1025270 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...
Title: Re: Multithreading with FFT-IFFT
Post by: sinsi on May 14, 2015, 08:16:49 PM
AMD A10-7850K

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


1273912 cycles for 1024 FFT & IFFT with 4 cores.

Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on May 14, 2015, 09:25:31 PM
Theoretically this would then allow for 2,5 million or so / FFT per second on a 4 core 3 Ghz CPU.
That can't be right can it ?
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on May 15, 2015, 12:36:13 AM
Hi,

   Results for i3 processor.

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


782169 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...


   Seems some zeroes are minus in some results and plus in others?
For example the first zero with Processor1 and Processor2.  Does
that matter?

Regards,

Steve N.
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on May 15, 2015, 12:42:26 AM
Thank you all.

Hi Raistlin,
The fft routines are OK.
The timings are not correct as I mentioned in my first post. I got wrong time values.
Also I'm not sure about the multithreading code.
I'm not sure if the timing code is valid when running all cores together.
That we have to tackle.

Hi Steve,

rounding errors I guess. 0 is the same as -0.
Title: Re: Multithreading with FFT-IFFT
Post by: LiaoMi on May 15, 2015, 03:21:01 AM
Intel i7 4810MQ

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 8

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


2023572 cycles for 1024 FFT & IFFT with 8 cores.

Press any key to continue...
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 15, 2015, 09:04:43 AM
Hi siekmanski, (and all),

You have 6 cores, I guess it's an i7 4960? Cool ...

Minor point, but related to a major point, your printout says (for instance)

7172218 cycles for 1024 FFT & IFFT with 12 cores

But no, you have 6 cores. There's 12 threads, using hyper-threading. The rest of us peasants have 4 or 2 cores, either with hyper-threading (doubling what you mis-call the core count) or not. So for instance my i5 3330 has "4 cores" while biterider's i7 4770K has "8 cores" - actually only 4 cores, *2 for two threads per core.

That's a minor point, but the important point is: AFAIK you shouldn't use hyper-threading for this type of CPU intensive algo. No doubt you know this stuff but not everyone does ... hyper-threading allows two threads to share 1 core. Most software spends a lot of time waiting for disk access and stuff like that. So when a thread is in such a wait state the core is given over to the other ("hyper-") thread. That thread, though idle, has all its data preserved in L1-type memory (high-speed, right next to the core - not L1 itself but similar) so it can hop into the core very quickly. As opposed to the normal situation, where the new thread needs to be restored from a far-off slow location which takes forever. Thus for normal software the two threads can each get almost full performance from this one core, almost doubling over-all performance for very few extra transistors.

BUT (there's always a "but") this doesn't work for CPU-intensive algos. Hyper-threading actually reduces overall performance in that case. A CPU-bound algo never waits for disk or anything, always utilizes the CPU maximally. So the two hyper-threads just fight each other for CPU time.

Therefore I think (based on naive understanding) don't use hyper-thread for this CPU-bound FFT app. Your machine shld just use 6 cores, and LiaoMi and biterider, just 4. All the other machines reported here are not hyper-thread so their performance is already better.

If you know better I won't be surprised, like I say I'm no expert.

Now, there's also a problem with timing. With so little data in each FFT it's gonna be done in no time, then the core comes back for more work. So (for instance) your timing shows how long it takes to launch 12 threads, 1000 times. Whereas dedndave's shows how long it takes to launch 2 threads, 1000 times. That's why the performance of his lesser machine apparently beats yours by about 6 times!

Basic idea, need to time how long it takes to go through a large data set, regardless of cores. So should have one large data set, maybe 1024 * 1024 * 1024, and put all cores available to work on that one set. Begin at beginning when get to end you're done. That way your 6 cores will do it about 6 times as fast as dedndave (since, altho he's got 2 cores, I'd expect your newer machine about twice as fast per core).

In other words you're doing a lot more work right now than dedndave - 6 times as much, the way it's set up. Dunno if I'm explaining this well, or even if I'm right, so will leave it at that and wait for comments.

Here are all the times in order from "slowest" (you, the fastest machine!) down to fastest, which happens to be mine. That's probably because I have nothing else running (not even hooked to internet) and no hyper-threading so my machine launches threads a bit quicker. I'm actually doing twice as much work during those cycles than dedndave, as explained above, and you're doing 3 times more than that, plus an awful lot of context switching, which explains these topsy-turvy results.

siek, 4960:   7172218 cycles for 1024 FFT & IFFT with 12 cores.
Liao, 4810MQ: 2023572 cycles for 1024 FFT & IFFT with 8 cores.
sinsi, 7850K: 1273912 cycles for 1024 FFT & IFFT with 4 cores.
bite, 4770K:  1233527 cycles for 1024 FFT & IFFT with 8 cores.
dedn, presco: 1226085 cycles for 1024 FFT & IFFT with 2 cores.
Rais, HP i5:  1025270 cycles for 1024 FFT & IFFT with 4 cores.
FORTRANS, i3:  782169 cycles for 1024 FFT & IFFT with 4 cores.
rrr, i5 3330:  608289 cycles for 1024 FFT & IFFT with 4 cores.

Finally there's a problem with the "master" routine, the one doing the timing: same core is also doing FFT calcs. U actually have 13 threads going, not just 12! And I've got 5, dedndave 3. So there's even more contention.

As for the math, Danielson-Lanczos is good, in some cases it's about the best AFAIK. But it depends a lot on the specific type of FFT and how it's going to be used. As for your implementation, looks good but perhaps can be speeded up a bit, I think (without having looked closely) some of that shuffling can be reduced.

The important thing at the moment is to clean up the multi-threading and timing, so I'll wait for comments on what I've said. If this seems right I know more or less what needs to be done, if I'm wrong that's another story!

Here is my run:

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


608289 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...


Thanks a lot (from everyone I'm sure) for getting this ball rolling, looks like a lot of fun!
Title: Re: Multithreading with FFT-IFFT
Post by: jj2007 on May 15, 2015, 10:39:55 AM
Quote from: rrr314159 on May 15, 2015, 09:04:43 AMhyper-threading allows two threads to share 1 core. Most software spends a lot of time waiting for disk access and stuff like that. So when a thread is in such a wait state the core is given over to the other ("hyper-") thread. That thread, though idle, has all its data preserved in L1-type memory (high-speed, right next to the core - not L1 itself but similar) so it can hop into the core very quickly. As opposed to the normal situation, where the new thread needs to be restored from a far-off slow location which takes forever. Thus for normal software the two threads can each get almost full performance from this one core, almost doubling over-all performance for very few extra transistors.

BUT (there's always a "but") this doesn't work for CPU-intensive algos. Hyper-threading actually reduces overall performance in that case. A CPU-bound algo never waits for disk or anything, always utilizes the CPU maximally. So the two hyper-threads just fight each other for CPU time.

Thanks, that is a very nice explanation of the pros and cons of hyperthreading :t

@Marinus: afaics FFTSSE2 and FFTSSE3 both use scalar instructions. What about "parallelising" through using packed instructions?
Title: Re: Multithreading with FFT-IFFT
Post by: nidud on May 15, 2015, 10:59:02 AM
deleted
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 15, 2015, 11:07:41 AM
Hi jj2007,

thanks,

Actually two main loops in FFTSSE2 and 3 are using SIMD, loops called Istep_LP and Normalize_LP. Some other SIMD instructions also in less important loop RBI_LP and subsequent. There are lots of scalar instructions throughout but mainly initialization, prob could use SIMD but not involved with timing. I haven't studied it yet could be misunderstanding but, as I say, appears to parallelize with packed instructions in these places, movlps, mulps etc.

BTW there is a diff between parallelize and multi-thread, (altho in practice the words are pretty interchangeable), parallelizing the algo includes breaking into pieces ("jobs", "tasks", word like that) which can run independent, then assigning to different threads. Here that's been done by explicitly writing out separate data sets (ComplexData1, ComplexData2 ... ComplexData16) which certainly works. But think it needs more flexible scheme.

Another topic entirely, seems there's been a lot of homework q's asked lately. Ever since the "Visual Masm" incident I've figured you're our resident sleuth, glad to see u getting onto these errant students, keep up good work. I noticed that your handle is very similar to James Bond 007 I assume that's no coincidence :biggrin:

another post just happened, am posting this b4 looking, hope it still makes sense
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 15, 2015, 11:47:06 AM
hi nidud,

I thought they each have their own L1, regardless, you're right they're synched so that if one writes data which another one uses, it also must be changed, invalidating previous work. Anything like that is very bad, of course, if it happens often.

That's why parallelizing the algo ideally separates data completely, so each core works on its own data set. Some algos that's difficult or impossible, but FFT should be easy enough. I have multi-threaded only algos that can be thoroughly parallelized, so I never studied details of cache synchronizing - just completely avoid the issues with separated data. Mainly I've worked with pixel data, each pixel independent of the others, and it's easy to assign different screen areas (rows usually) to different cores. For FFT I believe similar can be done by partitioning the dataset.

Also didn't think about the LUT, you're raising a bunch of issues here. Perhaps the easiest approach is 1) clean up the timing and basic parallelization then 2) try it out and see how well it works, experiment to determine best way to handle LUT etc, as opposed to trying to figure it out beforehand. That's 2 much like work, for me anyway, since I don't know a lot of these details!

Another approach, tell Antariy that Lingo has figured out the fastest way to do it, nobody could ever beat him - wait a day or two - Antariy will work around the clock to show him up :biggrin:
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 15, 2015, 12:29:26 PM
you can use CPUID to determine whether or not the CPU supports HTT
and, you can control the thread affinity
the tricky part is - figuring out which affinity bit corresponds to which physical core   :lol:
it's very likely that you could use every other bit (on an HTT machine) and get seperate physical cores
no guarantees, though
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 15, 2015, 01:30:01 PM
@dedndave,

U mean u can't simply "turn off" hyper-threading? (I don't have such machines so don't know about them). U might have to trick it into putting (e.g.) 4 threads on the 4 separate cores? In my case, with almost nothing else running on the machine, I've found I don't need thread affinity. If I launch 4 threads Windows automatically makes the best allocation, they each wind up staying on the same physical core most of the time. OS doesn't move them if it doesn't have to, since obviously that's less efficient. Can't u do the same on HTT machine: just launch 4 threads with highest priority, keep everyone else away (for the most part), so OS will automatically assign each a core, and leave them there? Of course they do switch around a few times a second, maybe, but that's not enough to reduce efficiency significantly.

On another topic, I've been reading a bit about FFT implementation. There are many issues. As nidud mentions, cache is important (use "cache-oblivious" algo?), there's bit-reversal, question whether output overwrites input or not, special algos for use with SIMD ... ! The math is by far the easiest part, and only a fraction of the whole story. And these core-assignment issues are not even mentioned; fortunately that's easy (at least, w/o HTT)

I hope someone (siekmanski most likely) knows how to do this already! I'll be content to applaud ...
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on May 15, 2015, 07:03:01 PM
@Siekmanski

Ever had a look at the modified split-radix FFT by Johnson and Frigo (2007) said to reduce
standard Turnkey-Cooley (Danielson-Lanczos Lemma) by 6% fewer floating point operations ?

OOPS [later edited] :"Implementing FFTs in Practice" Johnson and Frigo (2012) : propose a cache strategy for FFT.
Cant make sense of it yet, going off to do background reading first  :icon_eek:
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 15, 2015, 07:33:54 PM
My 2 cents:

I looked at it. My overall impression is that 6% FP ops is small beer compared to other issues such as cache. All that obsession about FP comes from an earlier time when FP hardware multiplication was very slow, if it was available at all; even addition was very slow compared to integer. True, by 2007 that was no longer the case, but academia adjusts slowly; like the proverbial General, they're always fighting the last war. Today, especially with SIMD, FP mult is not much slower than add. In fact both are faster than integer when dealing with AVX (w/o AVX2). A week ago I came up with a clever way to avoid a mult (in a particular situation) using 3 int ops (xor, shift-left, and) - my "clever" way took twice as long!

FFT, u know, saves N/logN ops compared to Fourier Transform from-the-definition - that comes to many thousands of percent - that IS worth it. 6%, as I say, - perhaps worth it as a minor refinement after everything else is in place. If u have to go to even minimal trouble to achieve it, it's probably going to be slower.
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 15, 2015, 11:13:06 PM
no - can't turn off hyperthreading
i have an older machine, so if i did, i'd have only one core to play with - lol

i guess you do this....
figure out how many physical cores there are
total cores, divide by 2 if HTT bit set in CPUID (leaf 80000001:EDX[28])
create that many threads and let the OS manage cores
Title: Re: Multithreading with FFT-IFFT
Post by: nidud on May 15, 2015, 11:17:42 PM
deleted
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on May 16, 2015, 12:23:32 AM
Hi guys,

That's a lot of things things to take care of...
I'll try to code different schemes and see what it does.

In the meanwhile I wrote a timing routine that displays the measured time in 100 nanosecond units instead of cycle counts.

Quote from: dedndave on May 15, 2015, 11:13:06 PM
no - can't turn off hyperthreading
i have an older machine, so if i did, i'd have only one core to play with - lol

i guess you do this....
figure out how many physical cores there are
total cores, divide by 2 if HTT bit set in CPUID (leaf 80000001:EDX[28])
create that many threads and let the OS manage cores

Thanks Dave,
I'll also implement this.

Marinus
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on May 16, 2015, 08:15:56 PM
Next attempt, test for hyper threading and timing in 100-nanosecond units.
The speed is not what I thought it should be...  :(
It seems that starting the threads takes a lot of time.

Is there something I missed or not doing the right way ?
Maybe I should set up the threads and then trigger them with an event ?

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6
ActiveProcessorMask : 0FFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread9:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread10:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread11:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread12:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.0307719 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.6469146 seconds for 12 * 1000 * '1024 FFT & IFFT' with 12 threads.

Use AffinityMasks to manage the cores:
0.7050401 seconds for 12 * 1000 * '1024 FFT & IFFT' with 12 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

0.3317011 seconds for 6 * 1000 * '1024 FFT & IFFT' with 6 threads. ( Even )
0.3423527 seconds for 6 * 1000 * '1024 FFT & IFFT' with 6 threads. ( Uneven )

Press any key to continue...


Title: Re: Multithreading with FFT-IFFT
Post by: LiaoMi on May 16, 2015, 11:59:59 PM
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 8
Physical Processors : 4
ActiveProcessorMask : 00FFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.0504474 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.3644088 seconds for 8 * 1000 * '1024 FFT & IFFT' with 8 threads.

Use AffinityMasks to manage the cores:
0.4224370 seconds for 8 * 1000 * '1024 FFT & IFFT' with 8 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

0.2059342 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads. ( Even )
0.2245983 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads. ( Uneven )

Press any key to continue...
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 17, 2015, 04:49:33 AM
rather than trying to sync all the threads....
maybe timing them individually ?
i think the timing macros might not be "thread safe", but you could write your own version that uses local vars
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on May 17, 2015, 08:47:23 AM
Thanks LiaoMi.

Quote from: dedndave on May 17, 2015, 04:49:33 AM
rather than trying to sync all the threads....
maybe timing them individually ?
i think the timing macros might not be "thread safe", but you could write your own version that uses local vars

Hi Dave,

Yes you are right that the timings are not "thread safe", I noticed that also.

In some cases you want to launch lets say 2 threads and need to wait till they both are finished before you can go on in your app.
For example: calculate the FFT of a stereo sound stream every 10 ms.
In this case I want to know how fast the routine is calculating these 2 threads. (hopefully running simultaneously each on a different core.)
The timer routine is started and stopped on the same core as the app runs I assume.

In other cases where you don't have to wait for the threads to finish in your app, then you could measure the time for each thread separately.
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 17, 2015, 12:39:45 PM
Timing routines usually "thread safe" when used only one at a time, as you're doing. Don't need to use local vars in that case, global is OK. Still, if u use local vars the timer can be used in multiple threads, simultaneously. That's not true of MichaelW's (he stores beginning time in 2 dwords in memory), so u could use mine (included in my previous Lab work).

But in this case, with just one timer going at a time (no pun intended) MW's global var is fine. Other problem is, the timer must stay on the same core. If the timing thread changes cores in the middle of timing, it will use a different RDTSC, no good. U can tell because it will usually be sensible but then in one case be way off. The way you're doing it at the moment is particularly vulnerable because you're launching enough threads (5, in the case of my 4-core machine) to make it more likely that timer will, indeed, switch cores.

The right thing is to use Windows Performance Monitoring, "guaranteed" to synchronize across cores. I made a timer routine like that, posted it somewhere, I'll find it again. Probably best for this exercise. BTW Gunther also gave a reference in the Trig thread showing the really right way to do it, with Perf Mon and other precautions as well, maybe he has a routine written to that spec.

dedndave, for this exercise no good to time one thread at a time, we want to determine how effective the multi-threading is. It's not at all true that u can time for one thread, then assume that 4 threads will simply do it 4 times faster.

Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 17, 2015, 01:52:09 PM
Hi siekmanski,

Couple fixes needed to Consolestuff.asm

line 46 (this is just a warning, initializing in data? segment)

   szProcessorName             db 52 dup (0)

change to

   szProcessorName             db 52 dup (?)

line 287 (eax cpuid leaf must have bit 31 set)

   mov     eax,1

change to

   mov     eax,80000001h

Without this it incorrectly reports my machine uses HTT. The run given below has this fixed. On this point it's worth mentioning that you ignored AXIOM 1 of masm3.com, always a no-no:

**************** AXIOM 1 of masm32.com: dedndave is always right ******************

The timings seem to indicate, correctly no doubt, that

- launching 1 thread and doing one FFT takes .05 secs
- launching 4 threads, with 4 FFT's, takes a bit less than 4 times as long
- AffinityMask makes little difference

I intend to put together a better way (I think) to do the threads, so that using 4 will go close to 4 times quicker. Gonna get on it now hope to have it ready soon.

BTW I've tested the FFT routines a bit, with different inputs, everything correct so far, looks good

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
Operating System    : Windows 8
Hyperthreading      : NO
Logical Processors  : 4
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.0547565 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.1736391 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Use AffinityMasks to manage the cores:
0.1761987 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Press any key to continue...
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on May 17, 2015, 06:33:33 PM
Thanks rrr314159,

Corrected the stupid typos. Must drink more coffee when writing code.  :biggrin:
And yes, you are right. Dedndave is always right.  :t

I'm also testing now another way to do the threads. ( using events )
Have it ready soon.
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 17, 2015, 06:58:44 PM
Here is my attempt at the modified FFT routine. As promised, with 4 threads it goes something like 4 times faster.

Unfortunately it took longer than expected, (4 hours or so) so it's not as clean as it should be. But I won't have an opportunity to fix anything for 8 hours so wanted to get this to you now.

The only real problem, I accidentally removed SSE2 support. Didn't notice until I'd buttoned it up, and can't test it anyway, so here it is. Can easily be put back in, I'll do it tomorrow (or day after :) if desired.

Also more-or-less deliberately removed ThreadAffinity stuff. Intended to put it back, but ran out of time. Don't think it's necessary anyway?

This may be 100% from your second version, or there may be some stuff crept in from the first. Should be no big deal either way.

The actual FFT routines (which I didn't touch) are in an include file "FFTstuff.asm" and my new routines are in the main asm file, now called "FFT.asm". I did this to make my approach clearer. Shifted a couple data statements into FFTstuff and Consolestuff for modularity.

I also made the data a bit more interesting by making 4 different "rows" (as I call them) of FFT data, which are not zeroed out in the last 1024*2 - 16 as the previous data was. These rows are reproduced 250 times for a total of 1000 "rows" and these are what the threads work on.

In the run included here, 1000 * 1024 reps with one thread takes 0.0471801, while 1000 * (1000 * 1024) reps (1000 reps of the 1000 rows) takes 9.2460124 seconds for 4 threads. On the face of it that's quite a bit better than 4 times as fast, since 47/4 = 10.2 seconds or so. But of course that's because it gets better the more reps you do, due to cache utilization no doubt. In fact the four threads are maybe 3.5 times better than one; which is about as good as you can expect.

The zip includes

FFT.exe      ; can be run by anyone with SSE3 to check timing
FFT.asm      ; heavily modified main routine
Consolestuff   ; unchanged except 2 data items moved here
FFTstuff.asm   ; FFT routines, unchanged, with accompanying data
doFFT.bat   ; "make file" batch file

Here are my timing results:

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.
Multi-threading parallelization changed rrr314159 2015/05/17.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

11.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse3 first 16 results:

0.073 0.000 0.073 -0.003 0.073 -0.006 0.072 -0.009
0.072 -0.012 0.071 -0.015 0.070 -0.018 0.069 -0.021
0.068 -0.024 0.066 -0.026 0.065 -0.029 0.063 -0.031
0.061 -0.034 0.059 -0.036 0.057 -0.038 0.055 -0.040

inverse FFTsse3 first 16 results:

11.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Processor3:
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000

Processor4:
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000

Timing starts after 2 seconds....

0.0471801 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
9.2460124 seconds for 1000 * 1000 * '1024 FFT & IFFT' with 4 threads.

Press any key to continue...


[edit] AFAIK this routine works correctly but, small correction, as siekmanski pointed out, in the routine "procarow" simply remove the two statements "push edi" and "pop edi", they're harmless but useless. Also don't forget it doesn't work for SSE2 u must have SSE3 at least. Finally a slower machine like dedndave's should take as much as 2 minutes after the statement "Let the OS manage the cores:", utilizing 100% CPU, and will appear to be hung up
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 17, 2015, 07:09:08 PM
Quote from: rrr314159 on May 17, 2015, 12:39:45 PM
dedndave, for this exercise no good to time one thread at a time, we want to determine how effective the multi-threading is. It's not at all true that u can time for one thread, then assume that 4 threads will simply do it 4 times faster.

perhaps i wasn't very clear on what i had in mind....

create N threads, timing each thread seperately
do not attempt to control thread affinity - let the OS handle it
when done, add N times together for total time

i understand that all threads must complete before the operation is done
but, we are interested in total algo time, for the sake of comparison
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on May 17, 2015, 11:04:47 PM
Hi,

Quote from: rrr314159 on May 15, 2015, 01:30:01 PM
U mean u can't simply "turn off" hyper-threading?

   That should be an option in the BIOS setup.  It was in the first
system I saw with hyper-threading, way, way back when.

Regards,

Steve N.
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on May 18, 2015, 07:30:18 AM
Here is my version with event driven threads.
It sets up all the threads and leaves them in "idle" state.
They are fired up when sending an event, then does the FFT and IFFT and goes back in "idle" state when all threads are done.
It's fast and flexible to use.

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6
ActiveProcessorMask : 0FFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread9:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread10:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread11:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread12:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.0499732 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.0619334 seconds for 12 * 1000 * '1024 FFT & IFFT' with 12 threads.

Use AffinityMasks to manage the cores:
0.1338572 seconds for 12 * 1000 * '1024 FFT & IFFT' with 12 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

0.0898157 seconds for 6 * 1000 * '1024 FFT & IFFT' with 6 threads. ( Even )
0.1392474 seconds for 6 * 1000 * '1024 FFT & IFFT' with 6 threads. ( Uneven )

Press any key to continue...


Hi rrr31459,

I studied your thread routines and put a counter in "procarow" to count the FFT's and noticed that its total number is changing each time I start your app.
Also saw, you push and pop edi in "procarow". I assume it has to be esi.

Here are your results:

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.
Multi-threading parallelization changed rrr314159 2015/05/17.

Number Of Processors: 12

First 16 Complex (real, imaginary) samples for testing:

11.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse3 first 16 results:

0.073 0.000 0.073 -0.003 0.073 -0.006 0.072 -0.009
0.072 -0.012 0.071 -0.015 0.070 -0.018 0.069 -0.021
0.068 -0.024 0.066 -0.026 0.065 -0.029 0.063 -0.031
0.061 -0.034 0.059 -0.036 0.057 -0.038 0.055 -0.040

inverse FFTsse3 first 16 results:

11.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Processor3:
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000

Processor4:
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000

Processor5:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Processor7:
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000

Processor8:
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000

Processor9:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor10:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Processor11:
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000

Processor12:
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000

Timing starts after 2 seconds....

0.0505246 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
4.3393800 seconds for 1000 * 1000 * '1024 FFT & IFFT' with 12 threads.

Press any key to continue...


Marinus
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 18, 2015, 10:42:39 AM
Hi siekmanski,

that's right, should be esi not edi (made a last minute change). For a moment couldn't understand how it could pass testing: it's because your FFTSSE3 "uses" esi (and for that matter edi) so I don't need to push/pop either! I typically eschew procs but they sure can save your bacon now and then.

As for procarow counter different each time I'll look into it, but since results appear right (with a fair amount of testing) I don't know it's necessarily wrong ...

[edit] (thought about it) that routine is used by all threads, re-entrantly, so depending how you set up a counter, it certainly will vary depending on a number of factors ... the threads will do different numbers of rows depending on other processes active on the different cores. guess that must be the explanation

The timings u get look right, with approx. 12 times speed-up using multiple threads, similar to results with your event-using routine.

With your routine it looks like you've fixed previous timing anomalies by keeping the threads alive between FFT's (as mine does also); it was taking too long re-launching them every time. And it appears to prove that letting OS manage cores is best - twice as good, in fact, as using affinitymask. That accords with my previous experience.

However all these results indicate I was wrong about HTT, it is effective with FFT routine. Either it's not CPU-bound as I thought, or it is but manages to be effective anyway, or else these results are misleading (but I don't see how that can be). A few more runs like these will prove the point beyond doubt.
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on May 18, 2015, 11:28:36 AM
Hi rrr314159,

QuoteWith your routine it looks like you've fixed previous timing anomalies by keeping the threads alive between FFT's (as mine does also);

And a very important advantage is, creating the threads only once at the startup of my app and then when I need the FFT, ( in my apps 60 times a second )
I only have to send an event to get the FFT data. Close the threads when my app exits. This is very efficient.
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 18, 2015, 11:40:02 AM
rrr's routine - hangs with 100% CPU - i had to terminate it
P4 prescott w/htt
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.
Multi-threading parallelization changed rrr314159 2015/05/17.

Number Of Processors: 2

First 16 Complex (real, imaginary) samples for testing:

11.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse3 first 16 results:

0.073 0.000 0.073 -0.003 0.073 -0.006 0.072 -0.009
0.072 -0.012 0.071 -0.015 0.070 -0.018 0.069 -0.021
0.068 -0.024 0.066 -0.026 0.065 -0.029 0.063 -0.031
0.061 -0.034 0.059 -0.036 0.057 -0.038 0.055 -0.040

inverse FFTsse3 first 16 results:

11.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Timing starts after 2 seconds....

0.1103082 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 18, 2015, 11:41:46 AM
Marinus' routine...
correctly detects htt
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
Operating System    : Windows XP
Hyperthreading      : YES
Logical Processors  : 2
Physical Processors : 1
ActiveProcessorMask : 0003h

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.1111278 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.2135991 seconds for 2 * 1000 * '1024 FFT & IFFT' with 2 threads.

Use AffinityMasks to manage the cores:
0.2115078 seconds for 2 * 1000 * '1024 FFT & IFFT' with 2 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

0.2065949 seconds for 1 * 1000 * '1024 FFT & IFFT' with 1 threads. ( Even )
0.2112603 seconds for 1 * 1000 * '1024 FFT & IFFT' with 1 threads. ( Uneven )

Press any key to continue...
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on May 18, 2015, 11:54:25 AM
Thanks Dave,

FFT multithreading doesn't seem to pay off on your machine ...
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 18, 2015, 01:12:30 PM
it's 2 logical cores, 1 physical core
not much of a test - lol
see how it does on some newer CPU's   :t
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 18, 2015, 02:11:12 PM
@dedndave,

I just made an edit to my post (the one with the FFT routine zipped) (http://masm32.com/board/index.php?topic=4231.msg45139#msg45139) says, among other things,:

Quote from: rrrFinally a slower machine like dedndave's should take as much as 2 minutes after the statement "Let the OS manage the cores:", utilizing 100% CPU, and will appear to be hung up

Quote from: dedndavei understand that all threads must complete before the operation is done but, we are interested in total algo time, for the sake of comparison

- now here's something that fascinates me (as hutch might put it), we don't really know what we're interested in! Raistlin has to tell us! He started the idea of parallelizing FFT. I'm interested in both topics (multi-threading, and FFT) so have been playing / working along; I think siekmanski has a clearer idea of his goal; but, u know, it's never really been stated.

- However IMHO it's not what u say. We (well, I) want to know the elapsed real-world time (from launching to terminating all threads) to do a particular amount of FFT'ing; NOT the total of the individual times of the separate threads. Obviously the 2 are closely related, but. That's why we don't need thread-safe timers, instead just one overall timer which needn't be thread-safe

@siekmanski,

Here's my results from your routine3, some of it excised,

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Timing starts after 2 seconds....

0.0481794 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.0482061 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Use AffinityMasks to manage the cores:
0.0826355 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.


- Without HTT each of the 4 cores is going exactly as fast as the one core, doing 4 times the work in the same time. Whereas with your machine the HTT is 20% slower, with dave's it's 100% slower. Need more timings but, as one would expect, two HTT routines on one core is not as good as 2 separate cores. In your case it's certainly worth it anyway, not in dedndave's, perhaps we need more results to be clear on the point.

- Note you still don't believe dedndave about the cpuid leaf 80000001h :), I don't have HTT. But if it's decided that HTT does work well, and therefore should be treated like real cores, it doesn't matter and u don't have to test for HTT anyway

- It seems pretty definite that AffinityMask shouldn't be used

A final point, on the net when they talk about parallelizing FFT they usually mean multiple threads working on one single FFT. Neither of us is doing that, we're doing one FFT per core, but doing a lot of FFT's. That's perfectly legit, just thot I'd point out it's different than common usage. Wouldn't be at all surprised if it's this other parallelization Raistlin was thinking of
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 18, 2015, 02:55:06 PM
my mistake on the 80000001h
it's one of those bits that shows up under both "feature bits" (EAX=1) and "extended feature bits" (EAX=80000001h)
so, using EAX=1 should also work   :P

notice that HTT is for Intel processors only (proprietary)
so, AMD may use those bits for other things
let me look it up....
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 18, 2015, 03:02:48 PM
for AMD...

EAX=1, EDX[28] = HTT (even though i don't think any AMD processors use HTT)
EAX=80000001h, EDX[28] = Reserved

for Intel...

EAX=1, EDX[28] = HTT
EAX=80000001h, EDX[28] = HTT
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 18, 2015, 03:28:18 PM
dunno where u got that from, but it's not right

On my Intel, 80000001h correctly reports no HTT, (as MSDN says it should) but 1 gives wrong answer, = yes.
As I said before.

On my AMD, ... both seem to give the wrong answer! Both say "yes", HTT
So evidently doesn't apply to AMD

U had it right the first time, for Intel!

Reminds me of a friend's favorite joke, when somebody accuses him of thinking himself perfect, "I'm not perfect. Once I thought I'd made a mistake, but I was wrong."
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on May 18, 2015, 05:25:42 PM
Quote
- now here's something that fascinates me (as hutch might put it), we don't really know what we're interested in! Raistlin has to tell us! He started the idea of parallelizing FFT. I'm interested in both topics (multi-threading, and FFT) so have been playing / working along; I think siekmanski has a clearer idea of his goal; but, u know, it's never really been stated.

@rrr314159
Astute as always, The goal (my side) is to do real-time FFT on radio telescope data streams to find binary star system pulsars (dreamy look). That's
were the meta-heuristics will come into play during differential searches that update a probability table for hits on inferred actual pulsars. Please
read http://www.ska.ac.za/researchprojects/s13.php for more information. Note I'am note affiliated with these institutions and obviously
would not be able to even physically be involved in any of such research. However that said the application of such technology would be essentially
endless (realistic look), at the very least proposing a theoretical model that could solve such problems is already a step closer. I'd need 10 petaflop system to
actually test... anyone got one of these laying about?

QuoteA final point, on the net when they talk about parallelizing FFT they usually mean multiple threads working on one single FFT. Neither of us is doing that, we're doing one FFT per core, but doing a lot of FFT's. That's perfectly legit, just thot I'd point out it's different than common usage. Wouldn't be at all surprised if it's this other parallelization Raistlin was thinking of

No, the common single FFT will not work for any real-world problems as efficiently as this and as you put it - The efforts here are much more legit! (awesome work)
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on May 18, 2015, 10:44:56 PM
Hi,

   Apparently my notebook BIOS does not affect Hyper-Threading...

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.0598133 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.0910814 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Use AffinityMasks to manage the cores:
0.1557536 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

0.1355365 seconds for 2 * 1000 * '1024 FFT & IFFT' with 2 threads. ( Even )
0.1617971 seconds for 2 * 1000 * '1024 FFT & IFFT' with 2 threads. ( Uneven )

Press any key to continue...


HTH,

Steve N.
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on May 18, 2015, 10:55:57 PM
rrr,
i got that information from the CPUID documents

AMD # 25481-234

http://dedndave.x10.mx/files/Amd-25481-234-CPUID.zip (http://dedndave.x10.mx/files/Amd-25481-234-CPUID.zip)

Intel # 241618-039

http://dedndave.x10.mx/files/Intel-241618-039-CPUID.zip (http://dedndave.x10.mx/files/Intel-241618-039-CPUID.zip)

of course, with newer CPU's, the documents could be out of date
i think both are pre-AVX dates
and - the bit definitions may very well have been changed to accommodate all the AVX stuff   :t


Steve, i've never seen it in the BIOS settings
and, i suppose there is a bit to control it, someplace (CR)
but, the operating system is going to set that bit, if there is   :P
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on May 19, 2015, 12:49:11 AM
Hi Dave,

Quote from: dedndave on May 18, 2015, 10:55:57 PM
Steve, i've never seen it in the BIOS settings
and, i suppose there is a bit to control it, someplace (CR)
but, the operating system is going to set that bit, if there is   :P

   Well, as I said with the first computer that I saw with HTT there
was a BIOS setting to enable it.  I think the default was off.  But
it was a long time ago (and not my computer).  As newer machines
don't have that option, one can suppose that newer operating
systems don't have problems with it any more.

Cheers,

Steve N.
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 21, 2015, 10:51:27 AM
Hello Raistlin,

I'd like to see some of the actual pulsar data just for the interest of it and give something to shoot for re. FFT algo. I went to CSIRO Data Access Portal (PSRFITS archived data) but u need to be a registered user. Looked at PRESTO and TEMPO2 would be worthwhile to see how they do FFT but 2 much work. Scott Ransom mentions it's "very long period" pulsar data but that's a bit vague to say the least. Mentions data comes in various fixed point formats but (more recently) avail as single precision floating that would be best. maybe u could get at Maura Mc's NANOGrav pulsar search data?

If u can't get a bit of such data, do u know the data description? How big is FFT? Is it real not complex (probably), multi-dimensional? What's the data rate, how long is a typical run to be analyzed, what is the range of periods to search? Of course they can go very fast, up to 33 revolutions / sec (?), but also much slower. Usually pulsar rps can be estimated, so I wonder if they typically allow the user to input the range he'd be looking to analyze? I'll bet there are secular variations need to be taken into account, not even sure how to do such FFT's, that would be interesting. There are of course other periods associated with the (typical) binary pulsar system, mainly orbiting (a few hours up to much longer) and occlusion, who knows what else. Bottom line I'm wondering what is the typical problem you (and / or Jeandrew Brink) is trying to solve here.

Also very curious, whatever happened to GR "proof" based on PSR B1913+16? Do u know, or can u find out, about that? As we all know Hulse & Taylor "proved" GR 40 years ago with data from that system, showing the period's slowdown could be explained by contraction of orbit due to gravity waves carrying away energy. Supposedly this scheme was accurate to "14 decimal places" and won them Nobel Prize. That number "14" is very misleading, 6 (at the most) could be said to be associated with the GR gravity wave computations

At the time I was very suspicious and today even more so. Key question, 100s more systems have since been discovered, why don't any of them provide corroboration? In these 40 years has the change in period tracked their predictions correctly? Somehow I doubt it. I notice PSR B1913+16 is now a "recycled pulsar" (unheard of back then) meaning its evolution is completely different than they based their analysis on. Suspect this new explanation came about precisely because later data didn't fit prediction.

The problem I (and all laypersons) have is, they're not going to admit publicly when their Nobel Prize awards turn out to be a mistake. That's why I'm asking u, if you're in the community you could find out these things. If u tell me, I promise not to go to New York Times and create a scandal! It's just for my own curiosity.

It would be great, and not surprising, if H & T's analysis ties in with this FFT project. I'd like to do same analysis to newer systems and see if it's good to "14 decimal places" (more likely -14 I bet).

Any info appreciated c u later

[edit] made a mistake, Raistlin, your goal was "to find binary star system pulsars" and I morphed that into goal I'm more interested in, to analyze them once they're found. Re-read your post and realized this ... oh well maybe above will stimulate something anyway
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on May 22, 2015, 05:23:56 PM

http://astronomyonline.org/Stars/Pulsars.asp - for light background reading around current methods for detection (discussed method uses Complex DFT)

https://data.csiro.au/dap/landingpage?execution=e2s2 - for pulsar sample data requests
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on May 23, 2015, 02:34:43 AM
The first ref is great just what I wanted seems to support my suspicion that Hulse & Taylor didn't deserve that Nobel but I need to study more

Second ref is no good, u need to be registered user. Probably need affiliation to register
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 01, 2015, 02:42:54 PM
Quote from: jj2007 on May 15, 2015, 10:39:55 AM@Marinus: afaics FFTSSE2 and FFTSSE3 both use scalar instructions. What about "parallelising" through using packed instructions?

Quote from: rrr314159 on May 15, 2015, 11:07:41 AMActually two main loops in FFTSSE2 and 3 are using SIMD, loops called Istep_LP and Normalize_LP. ... I haven't studied it yet could be misunderstanding but, as I say, appears to parallelize with packed instructions in these places, movlps, mulps etc.

jj2007 turns out you're essentially right we're not using packed instructions with any real benefit. Using for complex multiplication (with, e.g., addsubps) is better than nothing but scratching the surface

I finally figured out how to SIMD-ize FFT (obvious once u see it, but I had to review the math). It helps that pulsar-search data is very large, can easily use 1024 points and many per second coming in (multi-threads easily also); also that individual data points are small, no need for real8. Very important, there's no prob separating real from imaginary data for pulsar problem (necessary for the algo)

When u do it the current way, with reverse-bit swapping at beginning, first 2 or 3 passes can't use SIMD (data not lined up right). If u do it the other standard way, swapping at end, then last 2 or 3 passes can't use SIMD. BUT (!!) u can swap somewhere in the middle and get full benefit for all passes! Requires pretty large FFT but that's no problem for pulsar data. Can also do 2 or 3 swappings, might be useful with AVX512 etc, but not currently necessary.

I wonder if anyone has seen this technique worked out anywhere? Somebody must already have done it? It only makes sense with SIMD instructions, because it's a PITA and there's no other reason to go to this trouble. So has anyone seen a recent FFT algo, specially for SIMD, with bit-reversal swapping in the middle? Googled it a bit but soon gave up. (sounds like a job for jj2007!) It's always easier to do it yourself than find somebody else's work, figure out what they've done, fix their mistakes etc (re-usable high-speed math software is an oxymoron). But I'd love to see an existing algo for comparison. This FFT stuff can be puzzling, sometimes feel like I'm standing on my head on a merry-go-round

Anyway I'll go ahead and put this together, will take a while, prob start a new thread (assuming it works of course) because it will be totally different. Should go quite a lot faster, especially AVX. If I get sick of it I'll at least detail how to do it, some hardier soul can tackle it
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 01, 2015, 05:07:22 PM
Hi rrr314159,

It would be cool to optimize the FFT in all it's facets.
It's difficult to understand all those different FFT examples you can read up on the internet.
I'll start all over and see what comes up. ( probably inventing the wheel again... )
I have some ideas too, have to work it out and see if they will work.
As you mentioned, this FFT stuff can be puzzling.

Are there any members on this forum who want to participate and/or give there thoughts about this subject ?
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on June 01, 2015, 05:41:00 PM
I'll be posting my modularized efforts towards
such and encompassing modern parallel FFT, as
I believe it could benefit all, no matter the application.

Please be kind.... :icon_rolleyes:

To truely make progress we need the following:
1) A timing framework that is accurate, or at least we agree on. :bgrin:
2) A test dataset that is generic through all test cases, sufficiently large to be seen as "real world".
3) An exchange of ideas for discussion - ex. bit-reversal grey code looks promising, packed instructions etc.
4a) A dynamically intelligent CPUID routine that allocates best possible thread count per hardware platform detected.
4b) A cache savvy coding effort that aligns (pun), takes heed of best practise.



Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 01, 2015, 07:01:35 PM
I don't doubt for a second I'm reinventing a wheel here, but it's a lot easier than figuring out someone else's wheel! But a problem in my great scheme has already come up: cache! Doing it the way I describe is (I think) right from a SIMD point of view, but having problem thrashing the cache (at 1024). Oh well, maybe one of you guys will figure a way out, when I post my efforts ... or else I can wait a decade until we get bigger caches :biggrin:

Meanwhile I'll be happy to help on your other points but am 2 busy at the moment trying to get it to work ... u know if we decided a fast FFT that gave wrong answers was acceptable, that would make it a lot easier

BTW what does "grey" code signify?
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on June 01, 2015, 07:15:23 PM
Developing Multithreaded Applications: A Platform Consistent Approach

Direct copy and paste text document follows relevant to our discussion.
Copyright Intel Corp. 2005 <- yes I know it's old but still is legit.
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on June 01, 2015, 07:17:24 PM
@rrr
radix 2 - grey code - see : www.katjaas.nl/bitreversal/bitreversal.html
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 01, 2015, 08:15:30 PM
Hi Raistlin,

Have you seen my bitreversal routine ?
It precalculates all bitreversals for a given FFT size, puts all needed memory swaps in a table.
For a FFT size of 1024 you end up with only 528 memory swaps.
No need to do the bitreversal each time you calculate the FFT.
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on June 01, 2015, 08:56:06 PM
@Siekmanski

Yes - pseudo-code or link please to the algo, if you have it please?
Yes - I knew it was going to be static per FFT.

see also ex. http://graphics.stanford.edu/~seander/bithacks.html#BitReverseTable  for some ideas.
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 01, 2015, 09:17:02 PM
raistlin, I see you've been working the problem! That's great.

re bit-reversal, the problem with the "gray code" and similar algos is, as siekmanski indicates, we only need one bit-reversal table per run; so no point in speeding it up.

Quote from: your bit-reversal refWhile optimizing the bitreversal, memory access becomes more and more the factor determining processing time. After all, these samples have to be swapped no matter how.

Very true! siekmanski makes an array of necesssary swaps - just once - then uses it again and again. The swaps can't be SIMD-ized, and they don't overlap in any way; there's not much u can do but, just go ahead and swap 'em. (Perhaps u could use gather/scatter, but we don't have AVX2). So I didn't really study how well the "gray code" optimizes computing bit-reversals, I suppose it must do something or they wouldn't go to all that trouble.

siekmanski, two things not quite right about your array, first, you're including unnecessary "null" swaps. In fact you only need ... well, rather less than 512 (my algo which I'll post later has the minimum, don't remember exactly). Also probably best to arrange them in ascending order to help with cache, altho might not make any difference.

The paper "advice on multi-threading" is good, more or less correct, and some of it applies to this problem.

We have (as I understand it) a whole lot of 1024-FFTs and they make perfect tasks, with excellent granularity, evenly balanced, so that's the end of that. I'm using dynamic assignment altho at first glance u'd think it not necessary, for various good reasons.

API spin-waiting or blocking, and thread pools, are extremely unnecessary here. We have about 4 processors versus thousands of jobs! And of course the OS handles other details; this paper is written by an OS writer-type person, we're not involved in that. We don't care about flush order because we're using in-place memory FFT.

I don't use Windows API synchronization, altho siekmanski does; this is assembler! Why should I go thru 1000 lines of Windows code to execute an "xchg" when I can do it directly?

This is a good write-up - excellent, even. You should read more from this source, guy knows what he's talking about. But if you want to know more about the *much* simpler job of multi-threading FFT's ...

Just noticed you posted a ref for more bit-reversing. Like that algo for reversing a byte, pretty cool, gotta admit doubt I'd think of it. Unfortunately we've got 10 bits. Reading along I see the best they've got for that appears to be the way we're already doing it. Looks like interesting work, with a lot of algos. But you know, for any one particular algo, it's an awful lot easier to figure it out myself. This is not rocket science ... except the actual FFT which is not simple at all. Would like to see a ref for that: SIMD - aware

[edit] another thing would be very useful, test cases for 1024 (or other) Complex (or real) FFTs, with the frequency domain numbers

[edit] actual pulsar data
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on June 01, 2015, 09:45:30 PM
the famous "stir-fry" bit reversal technique   :biggrin:

http://www.masmforum.com/board/index.php?topic=12722.msg98224#msg98224 (http://www.masmforum.com/board/index.php?topic=12722.msg98224#msg98224)
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 01, 2015, 10:19:09 PM
Hi Raistlin,

Pseudo-code of the bit reversal routine,

allocate 2 memory buffers,

fftdata = FFT size (real,imaginary)
bitreversalindextable = FFT size * 2 ; just to be sure allocate enough memory.

; create the table

     nn = FFT size
     n = nn + nn
     totalswaps = 0
     j=1
     for (i=1; i<n; i+=2)
     {
if (j>i)
         {
             bitreversalindextable[i-1]
             bitreversalindextable + 1 [j-1]
             bitreversalindextable = bitreversalindextable + 2
             totalswaps = totalswaps + 1
         }
         k = nn
         while (k>=2 && j>k)
         {
             j -= k
             k >>= 1
         }
         j += k
      }
; end

; each FFT calculation read the swaps from the bitreversalindextable

      i=0
      for (i=0; i=totalswaps; i+=2)
      {
           j = bitreversalindextable[i]
           k = bitreversalindextable[i+1]
           swap(fftdata[j],fftdata[k]) ; swap complex pairs
      }



Hi rrr314159,

Quotesiekmanski, two things not quite right about your array, first, you're including unnecessary "null" swaps. In fact you only need ... well, rather less than 512 (my algo which I'll post later has the minimum, don't remember exactly).

OK, i'll  check my code.
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 01, 2015, 10:49:53 PM
Hi rrr314159,

Thanks, Found it.  :t

rb_index_LP:
    cmp     ebx,eax
    jae      swap_skip      ; this was ja, must be jae

total swap count for FFT 1024 is now 496.

Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 02, 2015, 03:35:17 AM
@siekmanski,

that's it, needed to skip when equal also. 496, right.

I never studied your algo, since I had to tackle FFT from the ground up (40 years since I studied it) just wrote my own very simple bit-reverser table. But I was glad u posted some pseudo-code. That's cheating (should get it from the code) but figured great, this ought to make it easy! But,

             bitreversalindextable[i-1]
             bitreversalindextable + 1 [j-1]


aren't there a couple assignment operators missing here?

Quote from: rrr314159But you know, for any one particular algo, it's an awful lot easier to figure it out myself.

- Read more of these bit-reversal techniques, it's not all that easy to figure them out. This statement makes me sound like an egotistical a*****e ... bad enough being one, u don't have to sound like one! :icon_rolleyes: My excuse, bit-reversal speed just isn't important here. But admittedly there are some cool tricks available ...
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 02, 2015, 05:04:26 AM
It's not real code, i and j are both 1 from the start.

             bitreversalindextable[i-1] ; entry
             bitreversalindextable + 1 [j-1] ; next entry

Never understood the hassle about speeding up bitreversal routines.
It's faster to get it from memory then to calculate them in real time.
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 02, 2015, 07:11:02 AM
Quote from: siekmanskiNever understood the hassle about speeding up bitreversal routines.

U probably don't expect an answer! But I do have an opinion. First, consider that people do 65536 FFT's, that's a pretty big LUT. (I don't know if bit-reversal is used anywhere but FFTs, assume not). More important are these two facts:

1) In 40 years, processor speed has increased by about 1000, while storage (RAM and disk) has increased 1,000,000! Storage has kept up with "Moore's Law", doubling every 2 years, but processor speed only about 4 years. So LUTs have become more and more practical. Not that long ago, even 2000 byte LUT (for 1024 FFT) was pretty big.

2) Computer Science (CS) is always a good 10 years behind the times. Academia is conservative. Doesn't matter with History or even Math, which are pretty static. But CS is almost always obsolete. It's no coincidence that the most successful coders (Bill Gates, etc) dropped out of school. And, note that most of these ref's we've seen are already a good ten years old (e.g. "Bit Twiddling Hacks", which is an entertaining read). So that ref reflects state-of-the-art considerably more than 20 years ago! Back then storage mattered.

If you work in industry you're always pretty up-to-date, easy to forget (or maybe u didn't know) how out of it academics are
Title: Re: Multithreading with FFT-IFFT
Post by: hutch-- on June 02, 2015, 01:56:48 PM
 :biggrin:

> If you work in industry you're always pretty up-to-date, easy to forget (or maybe u didn't know) how out of it academics are   :icon14:
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on June 02, 2015, 10:04:40 PM
Hi,

Quote from: rrr314159 on June 02, 2015, 07:11:02 AM
(I don't know if bit-reversal is used anywhere but FFTs, assume not).

   Well, I thought I remembered that it was used to generate a
pseudo-random sequence for Monte Carlo programs.  Less
uniform than a uniform grid and more regular than a random
sequence.  And that it was used in an algorithm to create
something like a Bayer Matrix for ordered dithering.  If so,
Google search is being uncooperative.  I found some references
pertaining to pseudo-random sequences, but nothing nifty or
really coherent to a quick scan.  It may be a deprecated method?

   Time to run a test I guess.

Cheers,

Steve N.
Title: Re: Multithreading with FFT-IFFT
Post by: Gunther on June 02, 2015, 10:13:37 PM
Hi Steve,

Quote from: FORTRANS on June 02, 2015, 10:04:40 PM
It may be a deprecated method?

no it isn't. This is a very proven technology with reliable dithering results.

Gunther
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on June 02, 2015, 11:10:41 PM
Hi Gunther,

Quote from: Gunther on June 02, 2015, 10:13:37 PM
Quote from: FORTRANS on June 02, 2015, 10:04:40 PM
It may be a deprecated method?

no it isn't. This is a very proven technology with reliable dithering results.

   Thank you, that was what I thought.  I will try Google again with
differing terms later.  They seem to be declining in usefulness with
distracting search results and poor suggestions.

Regards,

Steve N.
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on June 03, 2015, 04:48:45 AM
anyways.....
i was thinking you could look at drizz's 32-bit swap and adapt something similar using SSE for 1024 bits
that routine swaps the bits in EAX in about 10 clock cycles   :biggrin:
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 03, 2015, 08:19:25 AM
Turns out I may be interested in fast bit-reversing after all! Not just for one dword, but for the whole FFT swapping. Was playing around with FFT at 16 million points and my crude swapper gets awful slow at that level. Even 1 million points not good.  Maybe drizz has already produced best algo for one dword, but not the whole swapping job. But first I'm finishing the FFT for 1024 bit level - bit reverse algo is plenty fast enough down there

Would like to know more about other applications like dithering, how many bits would they do - 32? And how fast would it need to be? Didn't bother to google since u say it doesn't seem to be there. Maybe new thread about this topic would be sensible ...

BTW does anyone use FFT of million (1048576) points?
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 03, 2015, 04:59:38 PM
Could be a nice challenge trying to get fast bit-reversal routines.
But for FFT you still have to do the data swapping. Calculating just one FFT than you can benefit of a fast bit-reversal.
I think, if you do multiple FFTs of the same size ( for example in an audio program ) then it's faster to use a precalculated bit-reversal index-swap table.
At this moment I'm busy to speed up my FFT routine, trying to line it up for better use of SIMD instructions. ( lost a few hairs already...  :biggrin:)
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 03, 2015, 05:49:00 PM
lost a few hairs already...  :biggrin:

- Fortunately bald is "in" these days :biggrin:

- I seem to be getting about 3.4 microseconds, 10,000 cycles, per 1024 FFT, with SSE and 4 threads ... b4 2 long I'll give up trying to improve that and post it (need to clean it up a lot also that'll take a while). Apparently does 16 million FFT in something like a minute but I haven't tested that or checked it much. Suppose it could do much larger FFTs, but doubt anybody needs such?
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on June 03, 2015, 10:17:14 PM
Hi,

Quote from: rrr314159 on June 03, 2015, 08:19:25 AM
Would like to know more about other applications like dithering, how many bits would they do - 32?

   If I remember correctly (ha), bit-reversal does not scale well.
So six to twelve bits were used for pseudo-random sequences.

QuoteAnd how fast would it need to be? Didn't bother to google since u say it doesn't seem to be there. Maybe new thread about this topic would be sensible ...

   My Google skills seem to be sub par these days.  Used to get
reasonable answers fairly quickly.  Not so much now.

QuoteBTW does anyone use FFT of million (1048576) points?

   When I was using FFT for image processing, I know I did 256x256
and 512x512 images.  65,536 and 262,144 points respectively.
I may have done 1024x1024, but that would have been painfully
slow.  So probably not.  Done on a now dead computer, but I will
look at some backups just to refresh my own memory.

   I know people have used FFT to deconvolve larger images.  But
the deconvolution kernels tend to be smaller.  So they could be
processing smaller sub-images rather than the whole thing at
once.

Cheers,

Steve N.
Title: Re: Multithreading with FFT-IFFT
Post by: Gunther on June 04, 2015, 01:21:01 AM
Hi Steve,

Quote from: FORTRANS on June 02, 2015, 11:10:41 PM
   Thank you, that was what I thought.  I will try Google again with
differing terms later.  They seem to be declining in usefulness with
distracting search results and poor suggestions.

that's the wretchedness with Google. I'm using other search engines for years (different reasons) with better search results.

Gunther
Title: Re: Multithreading with FFT-IFFT
Post by: xanatose on June 04, 2015, 12:05:22 PM
Quote from: Gunther on June 04, 2015, 01:21:01 AM
Hi Steve,

Quote from: FORTRANS on June 02, 2015, 11:10:41 PM
   Thank you, that was what I thought.  I will try Google again with
differing terms later.  They seem to be declining in usefulness with
distracting search results and poor suggestions.

that's the wretchedness with Google. I'm using other search engines for years (different reasons) with better search results.

Gunther

To be fair. Google is an advertisement (and spying) business made to make money. If you are not the client, then you are the product. I suspect that any search engine without a subscription fee will eventually turn wretched.
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 04, 2015, 01:27:54 PM
Quote from: FORTRANSI know people have used FFT to deconvolve larger images.

- thanks for info! hate to ask another dumb q., but what does one do with a 2-d FFT of an image? Is it used for pattern recognition, compression, ... ?
Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on June 04, 2015, 02:05:36 PM
images aren't the only use for FFT   :biggrin:

a while back, Marinus posted a nice little spectrum analyzer demo   :t
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 04, 2015, 03:38:19 PM
using a 2-d (not regular 1-d) FFT?
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 04, 2015, 06:57:50 PM
Same FFT routine as in this topic.

http://masm32.com/board/index.php?topic=3665.msg38908#msg38908
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on June 04, 2015, 11:47:10 PM
Hi,

   Turns out I did do 1024x1024 FFT's for image processing.

Quote from: rrr314159 on June 04, 2015, 01:27:54 PM
Quote from: FORTRANSI know people have used FFT to deconvolve larger images.

- thanks for info! hate to ask another dumb q., but what does one do with a 2-d FFT of an image? Is it used for pattern recognition, compression, ... ?

   The short answer is yes.  Pattern recognition using correlation
against known images.  Similar to feature extraction?  Clean-up
by masking specific frequencies.  Think of removing ~60 Hz from
a TV image.  Convolution and deconvolution, think of sharpening
and removal of complex artifacts.  Compression, sort of.  Think of
the cosine transform in JPEG images.  You can suppress frequencies
in an FFT of an image in a similar manner to reduce the data set.
Low pass, high pass, and band pass filtering.  Probably left some
stuff out.

   I have a book here "Deconvolution of Images and Spectra",
second edition, edited by Peter A. Jansson, which discusses
sharpening Hubble Space Telescope data (among other things).
It even mentions 3D FFT's to characterize a PSF (point spread
function) (image blur) and OTF (optical transfer function) (FFT
of the PSF).  However many other methods, other than FFT, are
discussed.

Regards,

Steve N.
Title: rrr FFT SIMD-ized 32/64 bit
Post by: rrr314159 on June 05, 2015, 06:01:44 PM
Here is my SIMD-ized FFT routine. This is all original work except for a couple places where I borrowed siekmanski's code (mainly SysInfo.asm). However it never would have happened without his example (FFTroutines.Asm, etc) which I referred to constantly to check my results. The algo is original mathematics, altho no doubt it's already been done.

On my i5 4 core 2.94 Ghz machine it averages about 2.56 microseconds per 1024 pt complex FFT, 7500 cycles. (AMD is a lot slower). Could be better, with better cache management, but hard to see how the basic algo can be improved.

This code compiles as 32-bit or 64-bit without modification. I've gotten into the habit of writing all code for both 32 and 64 for configuration management purposes. Got sick of maintaining 2 versions, 64-bit for me and 32-bit to post here. 64-bit needs just 64-bit linker and kernel32, msvcrt lib's; no includes (and JWasm for both 32 and 64). Ask if interested how it works; or u can just ignore the issue, pretend it's straight 32-bit code, no problem.

.exe's are heavy: 52k 32-bit, and 62k 64-bit. All passes are unwound (of course there are 10 passes for 1024 FFT); also there's 16k of data hard-coded in data statements. This data should, of course, be coming in via a file. So the footprint can be reduced to, I don't know, probably 15k or so for 32-bit, with a 5-10% speed hit.

The main run does 4000 FFT's in memory, consisting of the same 4 repeated. If you have 6 or more threads it will use 6000 FFTs (24 meg), and the LUTs take another 10 meg, so there are no extra cache hits timing multiple runs; very important for good timing, since cache is vital.

I've tested up to 16 million point FFT, works fine but takes minutes. 1 million points about a minute. Should go as high as your memory allows, up to a few hundred millions of points, but take hours. The problem is the init routines: bit-reverse swap table, and sin/cos table. They're negligible for 1024-pt, but at 1 million take 30 seconds. They can be sped up, get down to a second no doubt, if anybody cares.

As usual this code is written in macro-heavy style, almost no proc's. I used to intend to learn better style but am beginning to think you should learn my style; it's so much easier and runs faster. Anyway - you're not going to like it! Oh well at least the algo is worth seeing, u can rewrite it with all the #$%^&*() proc's you want.

Although there are only 1000+ lines of code, it's divided into 7 files for modularity; the source is a bit over 40k. Here are all the files in the zip:

FFT.exe, FFT64.exe: the two exes, 53 and 62 k. BTW they don't use waitkey
doFFT.bat, doFFT64.bat: make files

FFT.asm: 10k, main file, includes thread and timing exec routines
FFTmacros.asm: 12k, the actual FFT routines
SysInfo.asm: 4k, borrowed from siekmanski with mods, gets sys info, num of threads
support_macs.asm: 12k, lots of macros: printing, timing, utility
invoke_macros.asm: 6k, necessary for 64-bit; has worked, no bugs, for 5 months(!)
header3264.inc: 4k, allows code to work 32 or 64. Very simple.
DFTcorrelation.asm: 4k, NOT included in prog, but can be. For test cases and <16 pts.

Here are my timing runs:FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000 (1 thread)                  27240
MicroSeconds for one FFT                9.32

Cycles/1000 on main thread              30111450
MicroSeconds 4000 FFTs 4 threads        10301.37
MicroSeconds for one FFT (average)      2.5753

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100

=======================================================================
FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000 (1 thread)                  27354
MicroSeconds for one FFT                9.36

Cycles/1000 on main thread              29942264
MicroSeconds 4000 FFTs 4 threads        10243.49
MicroSeconds for one FFT (average)      2.5609

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100

=======================================================================
=======================================================================
=======================================================================

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
AMD A6-6310 APU with AMD Radeon R4 Graphics; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000 (1 thread)                  52253
MicroSeconds for one FFT                29.78

Cycles/1000 on main thread              65106250
MicroSeconds 4000 FFTs 4 threads        37107.64
MicroSeconds for one FFT (average)      9.2769

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100

=======================================================================
FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
AMD A6-6310 APU with AMD Radeon R4 Graphics    ; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000 (1 thread)                  56231
MicroSeconds for one FFT                32.05

Cycles/1000 on main thread              65542234
MicroSeconds 4000 FFTs 4 threads        37356.13
MicroSeconds for one FFT (average)      9.3390

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100


Note: this is tested on only two computers. I hope it works on yours. Good luck!
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 06, 2015, 12:44:18 AM
Hi rrr314159,

QuoteAs usual this code is written in macro-heavy style, almost no proc's. I used to intend to learn better style but am beginning to think you should learn my style; it's so much easier and runs faster.
Anyway - you're not going to like it! Oh well at least the algo is worth seeing, u can rewrite it with all the #$%^&*() proc's you want.

:biggrin:

Just had a quick look, but i'll have to run it thru an interpreter first to understand it.
I'll study it this weekend.  :t

Here are the timings: (64 bit version crashed...)

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             27430
MicroSeconds for one FFT                1.92

Cycles/1000 on main thread              28895221
MicroSeconds 6000 FFTs 12 threads       2018.08
MicroSeconds for one FFT (average)      0.3363

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100



FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             27312
MicroSeconds for one FFT                1.91

---- here it crashed ----


Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on June 06, 2015, 01:27:31 AM
prescott w/htt, xp sp3

i get this far, then it crashes with access violation
can't tell where - the SEH sets the EIP back to 401000

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Pentium(R) 4 CPU 3.00GHz; Windows XP; 2 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             194207
MicroSeconds for one FFT                0.06

Cycles/1000 on main thread              366340742
MicroSeconds 2000 FFTs 2 threads        122.11
MicroSeconds for one FFT (average)      0.0611

====================================

Real Data from Copied FFTs:

Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 06, 2015, 04:17:48 AM
Thanks,

Not too amazed it crashes in some cases, I'm doing a lot of unusual stuff.

siekmanski,

at least it works for 32-bit. Clearly the CYCLES are correct - almost identical to mine - but the TIME computation is wrong by a factor of about 4! So mult the time by about 4, should be right answer. Dunno why, probably something simple, I could only test it with my 4 threads, u have 12. Division is wrong somewhere probably

dedndave,

this is the same problem your old machine has had with my software for months, never did find out what it was, don't really know what to do about it. If I had an old Prescott I could debug it ... maybe I'm using an SSEx instruction u don't have ... ? Happens without the 64 bit stuff also

well hopefully I'll learn more from others. Would expect 32 to work for any other machine (except perhaps nidud's old one). Don't know what to expect from 64-bit tho - Works fine for me!, but ... not 2 important. If there's a prob with it I can just take that out, post a regular 32-bit version.

Title: Re: Multithreading with FFT-IFFT
Post by: dedndave on June 06, 2015, 05:50:37 AM
offhand, i'd guess it's something to do with the prntFromGPR macro
but - the code is confusing   :lol:
Title: Re: Multithreading with FFT-IFFT
Post by: FORTRANS on June 06, 2015, 06:16:21 AM
Hi,

   Here are some results from two 32-bit and one 64-bit computers.
I didn't expect it to run on the P-III, but it died with a different error
message than usual.  Also I could not redirect the program output
to a file.

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
????; Windows 2000; 1 threads

The exception unknown software exception (0xc000001e) occured in the application at
location 004028c6

================================

FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Pentium(R) M processor 1.70GHz; Windows XP; 1 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             85845
MicroSeconds for one FFT                23.98

================================


FFT by rrr314159 2015/06/03. Size: 1024 points. 32 bit version
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             25930
MicroSeconds for one FFT                15.66

Cycles/1000 on main thread              50343928
MicroSeconds 4000 FFTs 4 threads        30395.03
MicroSeconds for one FFT (average)      7.5988

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100


============================================


FFT by rrr314159 2015/06/03. Size: 1024 points. 64 bit version
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz; Windows 8; 4 threads

Beginning of RB Table, 496 swaps:
1 200 2 100 3 300 4 80
5 280 6 180 7 380 8 40
9 240 a 140 b 340 c c0
d 2c0 e 1c0 f 3c0 10 20

Last few Cos's and Sin's, total 2040:
   0.195    0.189    0.183    0.177    0.171    0.165    0.159    0.153
   0.147    0.141    0.135    0.128    0.122    0.116    0.110    0.104
   0.098    0.092    0.086    0.080    0.074    0.067    0.061    0.055
   0.049    0.043    0.037    0.031    0.025    0.018    0.012    0.006

Real Data FREQUENCY DOMAIN:
   0.161    0.161    0.161    0.160    0.158    0.156    0.153    0.149
   0.145    0.140    0.134    0.128    0.122    0.115    0.107    0.099
   0.091    0.083    0.074    0.065    0.056    0.047    0.038    0.029
   0.020    0.012    0.003   -0.005   -0.014   -0.021   -0.029   -0.036

Imag Data FREQUENCY DOMAIN:
   0.118    0.008   -0.002   -0.012   -0.021   -0.031   -0.041   -0.050
  -0.059   -0.067   -0.075   -0.083   -0.090   -0.097   -0.103   -0.109
  -0.114   -0.118   -0.122   -0.125   -0.127   -0.129   -0.130   -0.130
  -0.130   -0.129   -0.127   -0.125   -0.122   -0.119   -0.115   -0.111

Cycles/1000                             25191
MicroSeconds for one FFT                15.21

Cycles/1000 on main thread              49399295
MicroSeconds 4000 FFTs 4 threads        29824.71
MicroSeconds for one FFT (average)      7.4562

====================================

Real Data from Copied FFTs:
   1.000    2.000    3.000    4.000    5.000    7.000    9.000   11.000
  11.110   12.120   13.130   14.000   15.000   17.000   19.000   21.000
   0.000   -0.000    0.000   -0.000    0.000   -0.000   -0.000   -0.000
   0.000   -0.000   -0.000    0.000    0.000   -0.000   -0.000   -0.000

Imag Data from Copied FFTs:
   1.100    1.200    1.300    1.400    1.500    1.700    1.900    1.110
   1.110    1.120    1.130    1.140    1.150    1.170    1.190    1.210
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100
   0.100    0.100    0.100    0.100    0.100    0.100    0.100    0.100


HTH,

Steve
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 06, 2015, 07:14:31 PM
Thanks FORTRANS and dedndave,

Boy this is annoying, I shouldn't have computed the Time, but left it at Cycles. Apparently something's wrong with time computation but don't know what since works right on my computers. I'll be able to try it on others' machines soon, may get better idea.

FORTRANS all the runs are good on your i3, not surprising it's similar to my i5 but slower. In both cases the 64-bit is a little faster, not enough to be significant tho.

However the 1.7 Pentium (XP) is wrong, I'm reading the clock speed at 3.4 instead of 1.7. Apparently something wrong with QueryPerformanceFrequency.

Similar with siekmanski, but his clock speed is 4 times higher than it should be. At the moment have no idea why I get wrong numbers, can't be that QueryPerformanceFrequency is just "wrong" on these computers, rather I'm doing something wrong.

Like I say, wish I'd just left it alone! This could also be why siek's 64-bit run crashed. So intend to make new version without the "extras" so we can concentrate on the FFT - which after all is the important part. It gives right numbers on all the computers, not surprisingly

dedndave, prntFromGPR leads to the same print routines I've used before, which I suspected crashed your machine b4, so yes that could be it. Problem is, I need to use that technique to make it work with both 32 and 64. Could make a 32-only which would have much better chance of working on Prescott - but was trying to avoid multiple versions ...

BTW you're right those print routines have become a mess as I tack on new functions, in this case printing FFT output. Need to clean it up, but was hoping there would be no reason for anyone to look closely at it ...

Thanks for looking at it everyone. Right now I'm taking a well-deserved break from FFT'ing, will make a less ambitious version in couple of days
Title: Re: Multithreading with FFT-IFFT
Post by: Gunther on June 07, 2015, 02:05:54 AM
Quote from: rrr314159 on June 06, 2015, 07:14:31 PM
Right now I'm taking a well-deserved break from FFT'ing, will make a less ambitious version in couple of days

Which won't crash. That's a good plan.

Gunther
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 07, 2015, 03:26:50 AM
Which won't crash

- It's guaranteed not to crash on MY computers. Apart from that, no guarantees!
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 07, 2015, 08:44:58 AM
Timing comparison -

Apparently this FFT is performing as it should, using SIMD registers. As far as I can tell, with siekmanksi's version my Intel does 4 thread's worth of FFT's in 43.2746 microseconds, which means 10.81 micros per 1024 FFT average. Mine, using SSE SIMD to do 4 computations at once, averages 2.56 microseconds - a little better than 4 times. That may be due to a couple of improvements, for instance I was able to eliminate the normalization stage by incorporating into pass 1. In fact since I'm not FFT-ing the same data again and again, but moving thru a large table, the improvement may be even better. But the basic factor of 4x is exactly as expected.

Quote from: Raistlin on June 01, 2015, 05:41:00 PMTo truely make progress we need the following:
1) A timing framework that is accurate, or at least we agree on. :bgrin:
2) A test dataset that is generic through all test cases, sufficiently large to be seen as "real world".

- good ideas!

@dedndave, I got my old 1-core computer out of the dustbin (has HTT, for 2 threads, like yours) and found the bug. In printFromCopiedFFTs I'm incorrectly assuming a minimum of 4 threads. If u want to fix it, in FFT.asm, line 193, it says

mov esi, row_ptr_table[NUMBER_ROWS*2]

change to

mov esi, row_ptr_table[500*2]

With only 2 threads there are only 2000 rows whereas NUMBER_ROWS is the max, 6000. Should have tested it on this old computer ... OTOH the problem with QueryPerformanceFrequency did NOT pop up; works perfectly on this old Intel Atom.

Or u can wait until I get around to posting a new version, which I'll do if there's any interest. The current version is good enough if anyone actually wants an improved FFT algo; there are no bugs in that.
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on June 09, 2015, 07:29:16 PM
I've done some reading on performance characteristics
that can be vetted against research papers for FFT's. For
some strange reason they prefer MFlops as an indicator
of implementation speed. A quick tour around the internet
begot this approximation for a 1024 FFT :

FORMULA mflops = 5 N log2(N) / (time for one FFT in microseconds)
--------------------------------------------------------------------------
see: http://stackoverflow.com/questions/7957907/how-many-ffts-per-second-can-i-do-on-my-smartphone-for-performing-voice-recogn

see : http://www.fftw.org/speed/

Comparing this to fftw.org 's results on library based FFT's reveals a huge difference.

5 * 1024 * 10 / 2.5 = 20480 MFLOPS

Given rrr's FFT algorithm the MFLOP count is likely between 9000 and 20 000 MFLOPS++ on average
for modern CPU's (post 2008).

Twice to five times as fast as the best out there.

(Iv'e also seen a formula for Power of 2 FFT as = 4 n log n  / don't know how relevant that is )

May we celebrate or am I missing something.  :greenclp:
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 10, 2015, 10:06:35 AM
Well, lettuce review the Mflop calculations.

The FFT does log N passes; in this case, N = 1024, that's 10 passes. Each one goes thru all 1024 numbers in pairs, that's 512 pairs. For each pair it does a so-called butterfly calculation. Call the pair a and b, first b is rotated thru an angle. The result, added to a, replaces a; and subtracted from (original) a, replaces b. Each bfly calc then requires one rotation (complex mult) plus two complex additions. That comes to 4 mults and 6 adds, 10 flops per pair, or 5 flops per number. So clearly that's 5 N log2(N) flops. If we divide by time in microseconds we get your formula:

FORMULA mflops / sec = 5 N log2(N) / (time for one FFT in microseconds)

where I've included the missing "/sec" on the left-hand side.

There are plenty of other details to consider but this is the main story ... The fascinating thing is, why this procedure is equivalent to the regular Fourier definition; but that's rather too tricky to go into. Other details include the bit-reverse swap, normalization, and other ways to get a few more micros out of it, like siekmanski's LUTs (if one reads the literature undoubtedly even more have been thought of). Anyway if you take account of some of those you might modify that factor of "5" to a "4" instead, justified in various ways.

So that all makes sense. Main thing I did was take advantage of SIMD to get 4 operations for the time of one, that's why it's 4 times faster. The ref's you're reading haven't done that for couple reasons. One, the regular way doesn't allow it, u have to think about it, which most computer scientists don't do because it gives them a splitting headache, and they get paid the same regardless. Of course people have done what I did (no big deal), but you'll have to search to find it, their papers lost among the overwhelming mass of drivel turned out by hacks. The other, your ref's are years old, and SIMD has been available only a decade or so.

Bottom line: no great cause for celebration, my algo is not real exceptional, except compared to the normal cr*p you find in the classroom. IF we can speed it up another factor of (say) 4 that would be exceptional - and, I bet, possible, tho I don't know how.
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on June 11, 2015, 01:49:21 AM
Years or Decades old = NULL

The above links are in fact contemporary - fftw 3.3.4 for example has up to AVX support and tested on XEON's etc.
In-fact recent papers still quote the open-source fftw lib as important in  FFT-IFFT based or other research.
Guess what - they've never bothered writing their own <- so touche'  about your com.science inference.

What's wrong rrr, ca'nt take praise when it's given? :t
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 11, 2015, 04:09:08 AM
What's wrong rrr, ca'nt take praise when it's given?

- Should be pretty good at it, had a lot of practice. But can't believe this simple little trick isn't well-known to some people at least, even it it's not commonly known. I'll have to study your ref's when get around to it ...

BTW using AVX you'll get another factor of approx. 2 improvement, but then the minimum size is 32 instead of 16

And would like to look at some other tricks, pretty sure one can get at least another factor of two out of it. Problem is one has to understand that b'fly technique better, makes one feel like you're standing on your head on a merry-go-round (if u know what I mean). I wish Gauss were still alive, it would be so easy for him
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on June 12, 2015, 02:51:39 PM
Well I reviewed a bunch of pubs on this FFT vectorizing / parralelizing / SIMDizing topic. Of course there was a bunch of work done which has nothing directly to do with my algo, but at least 10% of these papers is in that same area. My guess is that they finally figured it out in 2011, but I can't get copies of most of these papers; only the abstract. Prior to 2011 I can tell they hadn't got it yet, because of the techniques they did use: they wouldn't have done those things if they'd thought of this simple trick. Techniques include using scatter / gather, lots of shuffle ops, auto-vectorization etc. Also the speed-ups claimed are less than perfect; for instance with 8-vectors they're claiming speed-ups of 6 or so. It should be a perfect "8". Here are some of the most-cited papers (out of hundreds) that don't get it:

1) P. N. Swarztrauber, Vectorizing the FFTs, Parallel Computations

2) A High-Performance FFT Algorithm for Vector Supercomputers
David H. Bailey NASA AMES RESEARCH CENTER MOFFETT FIELD, CALIFORNIA 94035 

3) Automatic SIMD Vectorization of Fast Fourier Transforms
for the Larrabee and AVX Instruction Sets
Daniel S. McFarlin
Department of Electrical and
Computer Engineering
Carnegie Mellon University
Pittsburgh, PA USA 15213
dmcfarli@ece.cmu.edu

Volodymyr Arbatov
Department of Electrical and
Computer Engineering
Carnegie Mellon University
Pittsburgh, PA USA 15213
arbatov@ece.cmu.edu

The well-known shift to parallelism in CPUs is often associated
with multicores. However another trend is equally salient: the
increasing parallelism in per-core single-instruction multiple-date
(SIMD) vector units. Intel's SSE and IBM's VMX (compatible to
AltiVec) both offer 4-way (single precision) floating point, but the
recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way
and 16-way, respectively. Compilation and optimization for vector
extensions is hard, and often the achievable speed-up by using vectorizing
compilers is small compared to hand-optimization using
intrinsic function interfaces. Unfortunately, the complexity of these
intrinsics interfaces increases considerably with the vector length,
making hand-optimization a nightmare. In this paper, we present a
peephole-based vectorization system that takes as input the vector
instruction semantics and outputs a library of basic data reorganization
blocks such as small transpositions and perfect shuffles that
are needed in a variety of high performance computing applications.
We evaluate the system by generating the blocks needed by
the program generator Spiral for vectorized fast Fourier transforms
(FFTs). With the generated FFTs we achieve a vectorization speedup
of 5.5–6.5 for 8-way AVX and 10–12.5 for 16-way LRB. For
the latter instruction counts are used since no timing information is
available. The combination of the proposed system and Spiral thus
automates the production of high performance FFTs for current and
future vector architectures.

4) Efficient Utilization of SIMD Extensions,10.1109/JPROC.2004.840491,Proceedings of The IEEE
Franz Franchetti, Stefan Kral, Juergen Lorenz, Christoph W. Ueberhuber

This paper targets automatic performance tuning of numerical kernels in the presence of multilayered memory hierarchies and single-instruction, multiple-data (SIMD) parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine's general-purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and, thus, inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes: 1) symbolic vectorization of digital signal processing transforms; 2) straight-line code vectorization for numerical kernels; and 3) compiler back ends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speedups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.
Journal: Proceedings of The IEEE - PIEEE , vol. 93, no. 2, pp. 409-425, 2005

5) Ultrahigh-performance FFTs for the CRAY2 and CRAY Y-MP supercomputers,10.1007/BF00129773,The Journal of Supercomputing,David A. Carlson 

In this paper a set of techniques for improving the performance of the fast Fourier transform (FFT) algorithm on modern vector-oriented supercomputers is presented. Single-processor FFT implementations based on these techniques are developed for the CRAY-2 and the CRAY Y-MP, and it is shown that they achieve higher performance than previously measured on these machines. The techniques include (1) using gather/scatter operations to maintain optimum length vectors throughout all stages of small-to medium-sized FFTs, (2) using efficient radix-8 and radix-16 inner loops, which allow a large number of vector loads/stores to be overlapped, and (3) prefetching twiddle factors as vectors so that on the CRAY-2 they can later be fetched from local memory in parallel with common memory accesses. Performance results for Fortran implementations using these techniques demonstrate that they are faster than Cray's library FFT routine CFFT2. The actual speedups obtained, which depend on the size of the FFT being computed and the supercomputer being used, range from about 5 to over 300%.
Journal: The Journal of Supercomputing - TJS , vol. 6, no. 2, pp. 107-116, 1992
DOI: 10.1007/BF00129773

- This (5) must be close, but finally in 2011 these guys apparently got it:

6) R. Crandall and J. Klivington, Supercomputer-style FFT library for Apple G4

Abstract

Many traditional algorithms for computing the fast Fourier transform (FFT) on conventional computers are unacceptable for advanced vector and parallel computers because they involve nonunit, power-of-two memory strides. This paper presents a practical technique for computing the FFT that avoids all such strides and appears to be near-optimal for a variety of current vector and parallel computers. Performance results of a program based on this technique are presented. Notable among these results is that a Fortran implementation of this algorithm on the CRAY-2 runs up to 77% faster than Cray's assembly-coded library routine.

- They claim to avoid all power-of-2 strides and get "near-optimal" performance. That's got to be the right algo. I wish I could see a copy of this paper. A general truth is: it's approximately impossible to come up with anything (in technical topics) that's actually new. There are just too many very smart people out there working on it. The reason one sometimes thinks an idea is new: there are so many not-so-good people (9 out of 10, say), and their mountains of papers make it hard to find the ones that count.

- Let me emphasize also that earlier work was on many other, more general but related problems and that partly explains why they didn't get this simple trick. For more understanding of how it could be missed, review what people say about things like disposing of nuclear waste. It's clear that about 85-90% of so-called "experts" are a bunch of idiots. Also consider the fact that an author like Wittgenstein is still considered worth reading! (really - I'm not kidding! Impossible to believe, but true!) Let's face it, a good 90% of "experts" couldn't think their way out of a paper bag
Title: Re: Multithreading with FFT-IFFT
Post by: jj2007 on June 12, 2015, 10:29:58 PM
Quote from: rrr314159 on June 12, 2015, 02:51:39 PMreview what people say about things like disposing of nuclear waste. It's clear that about 85-90% of so-called "experts" are a bunch of idiots.

I am not a nuclear expert, just a humble economist. What distinguishes me from the experts who planned Three Mile Island, Chernobyl and Fukushima, or from those who wrote the Rasmussen Report aka WASH-1400: They were surprised - I was not.

Unfortunately, science is not very scientific. What is most helpful in reading scientific papers is the golden rule: "The bigger the potential profit, the bigger are the lies".
Title: Re: Multithreading with FFT-IFFT
Post by: Gunther on June 13, 2015, 01:19:28 AM
Quote from: jj2007 on June 12, 2015, 10:29:58 PM
Unfortunately, science is not very scientific.

Unfortunately, this is true.

Quote from: jj2007 on June 12, 2015, 10:29:58 PM
What is most helpful in reading scientific papers is the golden rule: "The bigger the potential profit, the bigger are the lies".

We're talking about the science operations. I would slightly alter that. The more research funds are up for grabs, the greater the speculations. A good example of this are the various string theories.

Gunther
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on June 15, 2015, 12:31:49 AM
Just to say I'm still around here.
I'm having a break at this moment. ( no time to do much programming the next weeks  :( )
I'm still studying the FFT and found some special cases where I think, we can reduce calculations. ( have to proof this first... )
I'll be back if it works out, hopefully with some speed improvements.
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on August 05, 2015, 03:56:59 PM
rrr:
Quote- They claim to avoid all power-of-2 strides and get "near-optimal" performance. That's got to be the right algo. I wish I could see a copy of this paper

Here you go :  http://www.davidhbailey.com/dhbpapers/fftzp.pdf
and here's the paper for your Apple G4 ref: https://www.apple.com/ca/acg/pdf/g4fft.pdf

It seems a bit old - will read and report back, if I can make sense of it.
Title: Re: Multithreading with FFT-IFFT
Post by: Gunther on August 05, 2015, 07:05:32 PM
Hi Raistlin,

thank you for the links.  :t

Bailey's paper is very old and treats the Cray architecture. But the G4 paper could be helpful. It includes a kind of pseudo code and a lot of hints. It's written 15 years ago, but it doesn't matter. The only difficulty is, to translate it from Altivec to XMM.

Gunther
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on August 06, 2015, 03:28:01 AM
Thanks raistlin

Both these papers are relevant and both may discuss the technique I used, not sure yet - they're in the ballpark anyway. They discuss a lot of other things as well, probably there are ideas that would help in our case; altho a couple of them are already in siekmanski's original version, for instance unit-stride in the precomputed powers of two.

My power's been out since a tree fell on the house, so I haven't been around. Should be back on line in couple of days
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on August 07, 2015, 05:14:55 AM
After heavy reading it seems that the overall parallel algorithm, to which we should add any relevance,
is that of Pease, M - the originator of the bit-reversal and butterfly idea. Subsequently very
small but cumulative improvements have been made since the paper's release of 1986.
Confusingly that of the G4 Apple paper & Bailey might just be reiterating the point.....@#ck I wish I was better at math!
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on August 07, 2015, 07:42:42 AM
Quote from: raistlin...I wish I was better at math!

- Who doesn't? If u have specific questions about these papers - like, an equation or symbol - I can help, but  if the whole thing's baffling you're probably better off finding a good tutorial. Thanks for doing the research ...
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 02, 2016, 01:11:58 AM
Found this on the WEB,

Optimized Sparse Fast Fourier Transform

http://www.spiral.net/software/sfft.html

A reference implementation of the SFFT on the SFFT website at MIT

http://groups.csail.mit.edu/netmit/sFFT/
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on April 04, 2016, 03:54:32 PM
Ta - for links Siekmanski,

A couple of notes on this:
1) Actually Sparse FFT was slower or only equal to FFTW on average - leaving out the "optimized" n volume case the author then provides as proof for 4 to 5 times speedup.
2) Note that the standard library FFTW used in mathlab etc. is still the defacto standard for scientific comparison.
3) This statement is alarming - it creates the impression of additional speed-up - which I don't believe is  quite true - see: http://www.spiral.net/bench.html
     and actually then in my humble opinion casts doubts as the validity or quality of research.
However, the reference implementation is not optimized for modern hardware features such as the cache hierarchy, vector instruction sets, or multithreading
then this...
Our flagship is the SPIRAL program generation system, which, entirely autonomously, generates platform-tuned implementations of signal processing transform such as the discrete Fourier transform, discrete cosine transform, and many others.

PS: ...and we've proven our's (masm forum endeavor) is faster by a factor of 4 to 5 over FFTW anyway, thus we should all enroll in MIT and write a paper.  :badgrin:

Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 04, 2016, 07:11:16 PM
As far as I understand this algorithm, they first remove zero and near zero frequencies?
Further they speak about that this is a fast way to get specific frequencies from e.g. pulsars etc.
Why not using the Goertzel algorithm then?
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on April 04, 2016, 08:10:01 PM
Probably because of this ???? - just guessing.... ::)

QuoteFor covering a full spectrum, the Goertzel algorithm has a higher order of complexity than Fast Fourier Transform (FFT) algorithms; but for computing a small number of selected frequency components, it is more numerically efficient.

<<edit : so not as generalizable ?? >>
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 04, 2016, 08:29:05 PM
That's just what I meant, if your not interested in the whole frequency spectrum range.
If you sample let's say, a 1 GHz signal and you want to examine only a few specific frequencies,  Goertzel is much faster.  :biggrin:
Title: Re: Multithreading with FFT-IFFT
Post by: rrr314159 on April 05, 2016, 02:22:08 AM
I'm confused about this new sparse FFT ...

First, the Goertzel algo is related to a digital filter. It's good when you know what frequencies you're looking for, and there aren't too many of them. For instance recognizing the small number of frequencies phones use to encode the buttons on the keypad (do they still use this technique?).

The sparse FFT referenced seems to require, as with Goertzel, that you know the frequencies you're looking for. If that's true it's not immediately obvious when you'd use either one. Perhaps the sparse FFT is for when you've got a small range of freq's but are interested in anything within that range (unlike phone pad)? But it didn't sound that way in the brief description.

One thing that confuses me - I thought there were FFT algos optimized for sparse frequency sets, when you don't know which ones you're looking for. I don't find anything on the net about that type (didn't look too hard).

The other confusion - I'm quite sure there are FFT's for sparse data sets - i.e., in the time domain. (Again, a little googling didn't turn up anything). In this case you still want to analyze a full range of frequencies - they're not "sparse" - but the incoming data has lots of zeroes.

So there seem to be four types (at least) which could be called "sparse":

1 - few frequencies, and you know what they are
2 - limited, known frequency band, but within that you want all of them
3 - few / limited frequencies, but you don't know what they are
4 - all frequencies, but few time domain points - that is, most of the time domain is zero's

It seems that the sparse FFT - which Raistlin ref'd - is either type 1 or 2. And, it's common enough that I don't easily find stuff on the net about the others. Wouldn't be surprised if they're called something different today.

Raistlin, I believe, is interested in pulsars. This isn't a case 1. True, you can be sure they're not going faster than (not sure what the limit is - 1000 hz?) nor slower than a certain frequency (not sure of that either - a few seconds per revolution?). But within that frequency band all possibilities exist. It's not quantized, like phone keypad signals. So Goertzel, at least, doesn't seem right. Maybe that's what this "sparse FFT" is for?

But there are two things you can do with pulsars. 1) once you've found it, analyze its frequency - that ought to be easy. 2) comb through a lot of observational data covering broad segment of sky, looking for regularly repeating signals. The first is what I've been mostly thinking about but the second may be the real job in pulsar hunting. And whether the "sparse" concept applies to it at all, I don't know.

As I said I'm confused! But I hope this might help clarify some of the issues involved, for further intense googling efforts by somebody  :P
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 05, 2016, 05:19:43 AM
QuoteOne thing that confuses me - I thought there were FFT algos optimized for sparse frequency sets, when you don't know which ones you're looking for. I don't find anything on the net about that type.

Yes, this is what I'm confused about too.

How to find those (near) zero frequencies...
I'll have a look at the code again and see if I can find how they trace those frequencies.
Then you can just skip the calculations for those frequency bins in the FFT code.

I'm curious what Raistlin's pulsar data looks like and what he wants to analyze exactly.

My guess is:
2 - limited, known frequency band, but within that you want all of them.

If this is true, we can resample ( and band pass filter ) the frequency band to DC and do the FFT for only that frequency band.
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 13, 2016, 08:19:26 AM
Made some progress,

My new SSE routine is now 6.5 times faster as the old one.
13.5 Giga Flops for 1 size 1024 FFT on my machine using 1 core.

All loops run on 1 cache line (< 64 bytes) processing 16 values at once.
Except for the main butterfly loop which need 2 cache lines.

To make a library for all sizes I need one extra register to get rid of an immediate. ( or else do SMC... )

I have to learn more about code and data prefetching too.
On my machine one "prefetchtx" takes 43 cycles ????

Some optimisations:
Using all registers to keep the code size as small as possible for the loops.
Only using movs adds subs and muls in the loops. ( no sign flipping anymore )
SinCosTable size is now reduced to: real4 (FFTsize/2)*3 dup (?)
Second Log2 loop is done in 2 passes, even pass and odd pass.
25% less SinCos memory reads.

I'm curious to see some timings from other machines.

FFT IFFT routines by Siekmanski 2016.

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6
ActiveProcessorMask : 0FFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

0.0249074 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 2056

0.0037942 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 13494


Code size Main loop: 121

Press any key to continue...

Title: Re: Multithreading with FFT-IFFT
Post by: jj2007 on April 13, 2016, 09:01:23 AM
Hi Marinus,
Core i5 results:FFT IFFT routines by Siekmanski 2016.

Processor Name      : Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Operating System    : Windows 7
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

0.0301108 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 1700

0.0049090 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 10430


Code size Main loop: 121
Title: Re: Multithreading with FFT-IFFT
Post by: sinsi on April 13, 2016, 09:49:20 AM
I'll fire up the AMD box later to test.


Windows 10 Home insider preview
Processor Name      : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 8
Physical Processors : 4
ActiveProcessorMask : 00FFh

0.0223553 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 2290

0.0034696 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 14757


Windows 10 Pro
Processor Name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 8
Physical Processors : 4
ActiveProcessorMask : 00FFh

0.0205278 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 2494

0.0029830 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 17164


Title: Re: Multithreading with FFT-IFFT
Post by: sinsi on April 13, 2016, 09:58:25 AM
No comment  :icon_eek:

Processor Name      : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
Operating System    : Windows 7
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

0.0431336 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 1187

0.0151598 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 3377

Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on April 13, 2016, 05:18:57 PM

FFT IFFT routines by Siekmanski 2016.

Processor Name      : Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

0.0527319 seconds for 1000 * '1024 FFT Old' single thread.
MegaFlops: 971

0.0031612 seconds for 1000 * '1024 FFT New' single thread.
MegaFlops: 16196


Code size Main loop: 121

Press any key to continue...


LOL @ sinsi - AMD data
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on April 13, 2016, 07:16:52 PM
I'am interested to see if someone (penguin type, with multi-boot capability  :P) would be willing to
compare FFTW under LINUX with Siekmanski's output under Windows - running the same hardware platform.

see: https://www.ll.mit.edu/HPEC/agendas/proc10/Day2/PB10_Pilaud_abstract.pdf  for batch script reference

I imagine the FFTW lib is available in most repositories but also here: http://www.fftw.org/download.html

We need data on single thread and multi-threaded runs.
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 13, 2016, 08:22:57 PM
How is the FFTW benchmarking done?

If I read the docs they do measurements on 1 core, am I right?
Do multiple runs and pick the fastest run to calculate the MegaFlops.
Can I do a multi-threaded run divided by the number of Physical Processors to compare it with the FFTW benchmarks?
Because, hyperthreading will influence the outcome but can be done on the same core.

BTW, forgot to implement the optimization for the last Log2 loop....  ::)
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on April 13, 2016, 09:42:36 PM
How is the FFTW benchmarking done? 
Methodology : http://www.fftw.org/accuracy/method.html
If I read the docs they do measurements on 1 core, am I right?
Yes
Do multiple runs and pick the fastest run to calculate the MegaFlops
Actually worst run
Can I do a multi-threaded run divided by the number of Physical Processors to compare it with the FFTW benchmarks?
I wanted to see the relationship - or at-least infer such - but as we are running non-concurrent FFT - such would be a moot point
wanted to see if they have any cache sensitivity issues

Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 13, 2016, 10:19:21 PM
To compute the accuracy of an FFT, we only plot the "worst" results.
http://www.fftw.org/accuracy/method.html

I looked at the timing routine in benchfftw code, as far as I understand the C source-code they plot the fastest run.

void report_verbose(const struct problem *p, double *t, int st)
{
     struct stats s;
     char bmin[64], bmax[64], bavg[64], bmedian[64], btmin[64];
     char bsetup[64];

     mkstat(t, st, &s);

     sprintf_time(s.min, bmin);
     sprintf_time(s.max, bmax);
     sprintf_time(s.avg, bavg);
     sprintf_time(s.median, bmedian);
     sprintf_time(time_min, btmin);
     sprintf_time(p->setup_time, bsetup);

     ovtpvt("Problem: %s, setup: %s, time: %s, ``mflops'': %.5g\n",
    p->pstring, bsetup, bmin, mflops(p, s.min));

     if (verbose) {
  ovtpvt("Took %d measurements for at least %s each.\n", st, btmin);
  ovtpvt("Time: min %s, max %s, avg %s, median %s\n",
bmin, bmax, bavg, bmedian);
     }
}


Cache sensitivity is still a thing I have to study, at this moment I don't use prefetching data or code.
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on April 13, 2016, 11:46:18 PM
Maybe they mean mflops(p, s.min)  <-- s.min to be the worst case ?
and s.max to be the best case ?
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 14, 2016, 10:12:12 AM
Thank you guys for the timing reports.  :t

Here is a Multithreading 1024 point FFT version, on my machine it scores an average of 68223 MegaFlops per FFT.

Would be nice to see timings  from your computers.....

Multithreading FFT routines by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6
ActiveProcessorMask : 0FFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

Timing starts after waking up cores....

0.1889020 seconds for 50000 * '1024 FFT' single thread.
MegaFlops: 13552

Let the OS manage the cores:
0.5070735 seconds for 12 * 50000 * '1024 FFT' with 12 threads.
MegaFlops: 60583

Use AffinityMasks to manage the cores:
0.4502866 seconds for 12 * 50000 * '1024 FFT' with 12 threads.
MegaFlops: 68223

Hyperthreading on board, let's test even and uneven AffinityMask bits....

0.3286985 seconds for 6 * 50000 * '1024 FFT' with 6 threads. ( Even )
MegaFlops: 46730

0.3556083 seconds for 6 * 50000 * '1024 FFT' with 6 threads. ( Uneven )
MegaFlops: 43194


Title: Re: Multithreading with FFT-IFFT
Post by: jj2007 on April 14, 2016, 10:14:38 AM
You are not respecting the speed limits, Marinus :eusa_naughty:
Processor Name      : Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Operating System    : Windows 7
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

Timing starts after waking up cores....

0.2328926 seconds for 50000 * '1024 FFT' single thread.
MegaFlops: 10992

Let the OS manage the cores:
0.5001578 seconds for 4 * 50000 * '1024 FFT' with 4 threads.
MegaFlops: 20474

Use AffinityMasks to manage the cores:
0.5512171 seconds for 4 * 50000 * '1024 FFT' with 4 threads.
MegaFlops: 18577

Hyperthreading on board, let's test even and uneven AffinityMask bits....

0.3645042 seconds for 2 * 50000 * '1024 FFT' with 2 threads. ( Even )
MegaFlops: 14046

0.3245189 seconds for 2 * 50000 * '1024 FFT' with 2 threads. ( Uneven )
MegaFlops: 15777
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on April 14, 2016, 10:16:49 AM
Thanks Jochen.  :eusa_dance:
Title: Re: Multithreading with FFT-IFFT
Post by: Raistlin on April 14, 2016, 04:00:14 PM
Multithreading FFT routines by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

Timing starts after waking up cores....

0.1614891 seconds for 50000 * '1024 FFT' single thread.
MegaFlops: 15852

Let the OS manage the cores:
0.2624986 seconds for 4 * 50000 * '1024 FFT' with 4 threads.
MegaFlops: 39010

Use AffinityMasks to manage the cores:
0.2984818 seconds for 4 * 50000 * '1024 FFT' with 4 threads.
MegaFlops: 34307

Hyperthreading on board, let's test even and uneven AffinityMask bits....

0.2712594 seconds for 2 * 50000 * '1024 FFT' with 2 threads. ( Even )
MegaFlops: 18875

0.2641408 seconds for 2 * 50000 * '1024 FFT' with 2 threads. ( Uneven )
MegaFlops: 19384


Press any key to continue...
Title: Re: Multithreading with FFT-IFFT
Post by: guga on March 22, 2018, 11:57:43 AM

Multithreading FFT routines by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 8
Physical Processors : 4
ActiveProcessorMask : 00FFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

Timing starts after waking up cores....

0.2921108 seconds for 50000 * '1024 FFT' single thread.
MegaFlops: 8764

Let the OS manage the cores:
0.6074673 seconds for 8 * 50000 * '1024 FFT' with 8 threads.
MegaFlops: 33714

Use AffinityMasks to manage the cores:
0.6002589 seconds for 8 * 50000 * '1024 FFT' with 8 threads.
MegaFlops: 34119

Hyperthreading on board, let's test even and uneven AffinityMask bits....

0.4320215 seconds for 4 * 50000 * '1024 FFT' with 4 threads. ( Even )
MegaFlops: 23703

0.4249765 seconds for 4 * 50000 * '1024 FFT' with 4 threads. ( Uneven )
MegaFlops: 24095


Press any key to continue...


Old thread, but hope it helps :)

Marinus, can you post the source for FFt and IFFT functions ?

Many thanks in advance :)
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on March 22, 2018, 07:00:16 PM
Hi Guga,

I will post my latest working version soon. (give me a few days, have to comment some stuff etc. )

I'm still working on a new version.
Really busy at the moment but if I have finished the code I'll post it.
Title: Re: Multithreading with FFT-IFFT
Post by: johnsa on March 22, 2018, 08:07:26 PM
It crashes for me unfortunately..

Multithreading FFT routines by Siekmanski 2015.

Processor Name      : AMD Ryzen Threadripper 1950X 16-Core Processor
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 32
Physical Processors : 16
ActiveProcessorMask : FFFFFFFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

I'm actually on Windows 10 not 8..
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on March 22, 2018, 08:46:14 PM
Ooh, I see why.
You have to many Logical Processors.  8)
My routine handles no more than 16 parallel threads in the old source code and didn't check there could be more than 16 Logical Processors.
I'll correct that in the next source.

Incorrect windows version is because I used "GetVersionEx" and it doesn't seem to work well in a console app.
Title: Re: Multithreading with FFT-IFFT
Post by: johnsa on March 23, 2018, 12:19:17 AM
I like it.. you have too many cores... muwahaha..
Title: Re: Multithreading with FFT-IFFT
Post by: guga on March 23, 2018, 11:36:16 AM
Many tks, Marinus.

I´ll give a try on both Routines (FFT and it´s reversal IFFT). I´m porting some old routines to RosAsm for a audio app i´m trying to build. The routine uses both methods described on the former FFT functions (http://www.netlib.org/fftpack). In C, the functions where ported on the rouziclib library as:

The Ported C code is:

void fftp(double *in, double *out, int n, int method, fft_plan_t *plan) // FFT with a fixed plan
{
/* method :
* 0 = DFT
* 1 = IDFT
*/

double *out2;
int32_t i;
double ratio = 1.0/sqrt((double) n);
int32_t ifac_global[64];

if (n < 2)
{
fprintf_rl(stderr, "fftp(): argument n = %d\n", n);
return ;
}

if (in==out) // if in-place FFT
out2 = calloc (n, sizeof(double)); // allocate and assign the output array
else // if out-of-place FFT
out2 = out;

if (plan->plan_init && plan->alloc_size < 2*n+15) // if alloc_size is too small
{
free_fft_plan(plan); // reallocate to fit n
*plan = alloc_fft_plan(n, -1);
plan->plan_init = 0;
}

if (plan->plan_init && plan->fft_size != n) // reinit plan if wrong size is saved
plan->plan_init = 0;

if (plan->plan_init==0)
{
rffti(n, plan->plan, ifac_global); // initialise the 'plan'
memcpy(plan->plan_backup, plan->plan, (2*n+15) * sizeof(double)); // back it up
memcpy(plan->ifacg, ifac_global, 64 * sizeof(int32_t));
plan->plan_init = 1;
plan->fft_size = n;
}
else
{
memcpy(plan->plan, plan->plan_backup, (2*n+15) * sizeof(double)); // copy from backup
memcpy(ifac_global, plan->ifacg, 64 * sizeof(int32_t));
}

if (method==0)
{
rfftf(n, in, plan->plan, ifac_global); // FFT

out2[0] = in[0]; // DC

for (i=0; i<(n+(n&1)-2)>>1; i++) // all real and imaginary bins except the Nyquist component
{ // i represents the complex bin index, i=0 being both lowest real and imaginary components
out2[i+1] = in[(i<<1)+1]; // real
out2[n-i-1] = in[(i<<1)+2]; // imaginary
}

if ((n&1)==0) // if there's a Nyquist component
out2[n>>1] = in[n-1]; // copy it over
}
if (method==1)
{
out2[0] = in[0]; // DC

for (i=0; i<(n+(n&1)-2)>>1; i++) // all real and imaginary bins except the Nyquist component
{ // i represents the complex bin index, i=0 being both lowest real and imaginary components
out2[(i<<1)+1] = in[i+1]; // real
out2[(i<<1)+2] = in[n-i-1]; // imaginary
}

if ((n&1)==0) // if there's a Nyquist component
out2[n-1] = in[n>>1]; // copy it over

rfftb(n, out2, plan->plan, ifac_global); // IFFT
}

if (in==out)
memcpy (out, out2, n * sizeof(double));

for (i=0; i<n; i++)
out[i] *= ratio;

if (out2!=out)
free(out2);
}


I´m trying to build a similar function to handle the FFT but without using the "plan" structure in order to be called simply :

[DFT 0] ;  FFT fourier routine
[IDFT 1]  ; IFFT inverted fourier

[Data1: D$ 03F6CE6B6 ; Used Dwords for my tests only, since the data is, in fact Floating Points (F$)
Data2: D$ 03EDA40DC
Data3: D$ 03F1F5FB5
Data4: D$ 0BF0995BC
Data5: D$ 0BF6B7289
Data6: D$ 0BEA87156
Data7: D$ 0BDBAAD94
Data8: D$ 0BDF30556
Data9: D$ 03EED3D23
Data10: D$ 0BF40757E]

[Output: D$ 0 #10]

call fftp Data1, Output, 10,  DFT


So, the idea is input a Pointer to a array of Samples (In FPU data), and a pointer to a output buffer, and then choose how many samples (elements of teh array) are used to perform the FFT and finally, choosing the proper methods (FFT or IFFT)
Title: Re: Multithreading with FFT-IFFT
Post by: guga on March 23, 2018, 11:27:04 PM
Some other algorithms that maybe interesting to implement.

STFT-FD

https://www.sciencedirect.com/science/article/pii/S2352711017300638

A matlab code can be found here:
https://github.com/ElsevierSoftwareX/SOFTX-D-17-00087

The Stft-FD provides a better way to see the pitch of a sound file (Similar to existent on the spectrogram usedd in Isotope RX), but fixes the window size.
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on March 24, 2018, 01:43:20 AM
Thanks for the links.  :t
Title: Re: Multithreading with FFT-IFFT
Post by: guga on March 25, 2018, 05:59:00 AM
Hi Marinus

You´re welcome :)

Here is the link for the fftpack.c (The rouziclib library)

https://github.com/Photosounder/rouziclib

The other link about STFT-FD (https://www.sciencedirect.com/science/article/pii/S2352711017300638  and https://github.com/ElsevierSoftwareX/SOFTX-D-17-00087) probably is more effective to  identify picth from what i understood on the article. So perhaps, creating 2 different types of algorithms (FFT/IFFT and STFT-FD/ISTFT-FD) can be used together to properly identify or isolate/separate  voices on a audio file, for example.

Perthaps using FFT 1st and later using  STFT-FD to finish the identification of the voice pitch maybe enough,


A spectrogram using the regular STFT looks like this:

(https://preview.ibb.co/gXKc77/1Image1.jpg) (https://ibb.co/ipTFun)

The above have this properties:
a) Uses Hann window
b) Uses Reassigment

With reassignment you can see more clearly the sound picth of the vocal, but i presume that if it used the STFT-FD the accuracy would be way better.
Title: Re: Multithreading with FFT-IFFT
Post by: guga on March 26, 2018, 10:51:09 AM
Hi marinus, i analysed your code and there are few things i didn´t understood

Your function   InitFFTSinCosTable have a pointer to the variable FFT_SinCosTable and is called like:

invoke  InitFFTSinCosTable, addr FFT_SinCosTable, FFTsize, FFT

But a problem seems to happens with the size of the variable followed by a array of another variable with a different size (labeled as ComplexDataXXX which is also a virtual data with 2048 dword).

FFT_SinCosTable dd 1536 dup(?)
ComplexData    dd 2048 dup(?)
ComplexData1    dd 2048 dup(?)
ComplexData2    dd 2048 dup(?)
.....


But,since FFTSize = 1024, it means that InitFFTSinCosTable converts only 8160 bytes from FFT_SinCosTable (2040 dwords) which results data to be filled in the address at Complexdata as the image below:

(http://preview.ibb.co/m9rcs7/1Image1.jpg) (http://ibb.co/hQ5gKn)


Also, why 8160 bytes ? The total size of ComplexData is 8192 (2048 dwords/Real4). So, it is missing the analysis of 8 dwords ?

Another question...concerning the miltithread.

Why using ComplexData as virtual data and why using multiples variables for it ? Does it means that each  Threadfucntion (ThreadM1, ThreadM2 etc) will use separated data chuncks from ComplexData ? I mean each ComplexData2, ComplexData3 is a copy of COmplexdata or pointers to the total size of ComplexData divided by 16 ?
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on March 26, 2018, 11:22:13 AM
Because some sin cos pairs are used several times and are only calculated once in the sin cos table.
Multiple complex data tables are used to simulate the real world or else it would be cheating for measuring the multithreaded timing.
It can be used to calculate multiple fft's and matching complex data tables simultaneous.
Title: Re: Multithreading with FFT-IFFT
Post by: guga on March 26, 2018, 05:59:22 PM
I´ll wait for the code to study this. I´m a bit confused with the results, though.

If i use, for example, only 16 samples the result i´ve got is diferent from some places i computed online. Ex:
On input
[<16 ComplexData: F$ 0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0]

but, on output i´ve got  32 samples and their values are diferent from the ones i calculated here:

http://www.themobilestudio.net/the-fourier-transform-part-15
or from here
http://scistatcalc.blogspot.com.br/2013/12/fft-calculator.html
Title: Re: Multithreading with FFT-IFFT
Post by: Siekmanski on March 26, 2018, 09:52:04 PM
Hi Guga,

FFT ( Fast Fourier Transform ) comes in all kind of flavours, complex to complex, real to complex, real to real, interleaved or not interleaved, 1D, 2D etc....

My FFT routine posted here is "complex to complex", interleaved input and interleaved output. ( real,imag,real,imag, etc. )

If you insert these 32 ( = FFTsize 16 ) values to "Mark's FFT Calculator"
http://www.themobilestudio.net/the-fourier-transform-part-15

"Input Samples" set to "Complex Wave"

Copy these values ( = Complex input) as complex wave,
0.0
0.0
1.0
0.0
2.0
0.0
3.0
0.0
4.0
0.0
5.0
0.0
6.0
0.0
7.0
0.0
8.0
0.0
9.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0


You get this as FFT results:

REAL: 45.000 -25.452 10.364 -9.064 4.000 -1.279 -2.364 3.795
REAL: -5.000 3.795 -2.364 -1.279 4.000 -9.064 10.364 -25.452
IMAG: 0.000 -16.665 3.293 2.328 -5.000 5.642 -4.707 2.648
IMAG: 0.000 -2.649 4.707 -5.642 5.000 -2.328 -3.293 16.665

Why he outputs the results twice I don't know, 0-15 are the actual results, 16-31 is just a copy of 0-15.

So, if you insert the same values in my FFT routine ( FFTsize equ 16, and do not normalize! )
Complex to Complex, Interleaved input:

ComplexData real4 0.0,0.0,1.0,0.0,2.0,0.0,3.0,0.0,4.0,0.0,5.0,0.0,6.0,0.0,7.0,0.0,8.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ; test data


You get this as output,

Interleaved ouput: ( real, Imaginary, not normalized)

45.000 0.000 -25.452 -16.665 10.364 3.293 -9.064 2.328
4.000 -5.000 -1.279 5.642 -2.364 -4.707 3.795 2.648
-5.000 0.000 3.795 -2.649 -2.364 4.707 -1.279 -5.642
4.000 5.000 -9.064 -2.328 10.364 -3.293 -25.452 16.665

And is exactly the same as calculated in "Mark's FFT Calculator"

I promised to post my latest FFT code.
But there is an error in the calculations which I need to correct first. This can take some time since I'm really busy at the moment. Sorry...
Title: Re: Multithreading with FFT-IFFT
Post by: LordAdef on March 27, 2018, 02:43:18 PM
Otimo post!

Nice thread, guys. Replying to follow. Cheers

Title: Re: Multithreading with FFT-IFFT
Post by: daydreamer on March 28, 2018, 06:31:15 PM
next step is to have 32threads x the number of PC's in a small home LAN,once I had 7 PC's
it is most effective in memoryintensive 3d anmations,first an initial several gb data transfer and after that if you used 5 computers, it rendered 5 times the speed,computer1did frame1,computer2 did frame2etc
Title: Re: Multithreading with FFT-IFFT
Post by: guga on February 06, 2022, 02:39:56 PM
Old thread, but i´m testing the newer Routines from http://masm32.com/board/index.php?topic=6111.msg66562#msg66562

I ported it to RosAsm and i´m analyzing the results, but it seems that the newer version (code below) is producing incorrect values. For some reason it is jumping the 2nd data on each Frequency Index (http://www.themobilestudio.net/the-fourier-transform-part-15)

align 16
FFTSSE proc uses ebx esi edi SinCosTable:DWORD,Complex_Data:DWORD

    push    ebp
                                            ; To keep the code size small for the Log2 loops ( within 1 or 2 cache lines ) for all FFTsizes,
                                            ; we have to put the addresses of the AB stride positions in registers !
    mov     esi,Complex_Data                ; RealA = Complex_Data
    mov     edi,FFTsize
    mov     eax,edi
    lea     ebx,[esi+edi*4]                 ; ImagA = Complex_Data + FFTsize*4
    lea     ecx,[esi+edi*2]                 ; RealB = Complex_Data + FFTsize*2
    lea     edx,[esi+edi*4]
    lea     edx,[edx+edi*2]                 ; ImagB = Complex_Data + FFTsize*6

                                            ; Because we are dealing with 0 degree and 90 degree rotations we can skip trigonometric calculations
                                            ; Before time domain decomposition:
                                            ; The stride of the first log2 loop = FFTsize/2
                                            ; The stride of the second log2 loop = FFTsize/4                                           
                                            ; We can take advantage of that fact and process the first 2 log2 loops with 16 values at once
    add     eax,eax                         ; Start position = FFTsize*2
align 16
first_log2_LP:
    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB
    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm2                       ; RealA = RealA + RealB
    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB
    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB
    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB

    sub     eax,16

    movaps  oword ptr[esi+eax],xmm0         ; new RealA
    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA
    movaps  oword ptr[ecx+eax],xmm4         ; new RealB
    movaps  oword ptr[edx+eax],xmm5         ; new ImagB
    jnz     first_log2_LP   ; Loop size = 56 bytes ( within 1 cache line )
;   mov     CodeSize,$ - first_log2_LP


    sub     ecx,edi                         ; RealB = Complex_Data + FFTsize
    sub     edx,edi                         ; ImagB = Complex_Data + FFTsize*5
align 16
second_log2_Even_LP:
    movaps  xmm0,oword ptr[esi+edi-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+edi-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+edi-16]      ; RealB
    movaps  xmm3,oword ptr[edx+edi-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm2                       ; RealA = RealA + RealB
    addps   xmm1,xmm3                       ; ImagA = ImagA + ImagB
    subps   xmm4,xmm2                       ; RealB = tempRealA - RealB
    subps   xmm5,xmm3                       ; ImagB = tempImagA - ImagB

    sub     edi,16

    movaps  oword ptr[esi+edi],xmm0         ; new RealA
    movaps  oword ptr[ebx+edi],xmm1         ; new ImagA
    movaps  oword ptr[ecx+edi],xmm4         ; new RealB
    movaps  oword ptr[edx+edi],xmm5         ; new ImagB
    jnz     second_log2_Even_LP ; Loop size = 56 bytes ( within 1 cache line )
;   mov     CodeSize,$-second_log2_Even_LP

    mov     eax,FFTsize
    add     esi,FFTsize*2                   ; RealA = Complex_Data + FFTsize*2
    add     ebx,FFTsize*2                   ; ImagA = Complex_Data + FFTsize*6
    add     ecx,FFTsize*2                   ; RealB = Complex_Data + FFTsize*3
    add     edx,FFTsize*2                   ; ImagB = Complex_Data + FFTsize*7
align 16
second_log2_Odd_LP:
    movaps  xmm0,oword ptr[esi+eax-16]      ; RealA
    movaps  xmm1,oword ptr[ebx+eax-16]      ; ImagA
    movaps  xmm2,oword ptr[ecx+eax-16]      ; RealB
    movaps  xmm3,oword ptr[edx+eax-16]      ; ImagB

    movaps  xmm4,xmm0                       ; tempRealA
    movaps  xmm5,xmm1                       ; tempImagA
    addps   xmm0,xmm3                       ; RealA = RealA + ImagB
    addps   xmm5,xmm2                       ; ImagB = tempImagA + RealB
    subps   xmm1,xmm2                       ; ImagA = ImagA - RealB
    subps   xmm4,xmm3                       ; RealB = tempRealA - ImagB

    sub     eax,16

    movaps  oword ptr[esi+eax],xmm0         ; new RealA
    movaps  oword ptr[ebx+eax],xmm1         ; new ImagA
    movaps  oword ptr[ecx+eax],xmm4         ; new RealB
    movaps  oword ptr[edx+eax],xmm5         ; new ImagB
    jnz     second_log2_Odd_LP  ; Loop size = 56 bytes ( within 1 cache line )


    mov     esi,Complex_Data
    mov     eax,FFTsize*2                   ; Complex_Data + FFTsize*2
    lea     edi,ReverseBinaryTable          ; reverse-binary reindexing table ( time domain decomposition )
    lea     edx,[esi+eax*2]                 ; imaginary data ptr = Complex_Data + FFTsize*4
    mov     ecx,dword ptr[edi]              ; swap count
align 16
RBI_LP:
    mov     eax,dword ptr[edi+4]            ; get the swap positions
    mov     ebx,dword ptr[edi+8]
    movss   xmm0,real4 ptr[esi+eax*2]       ; get real data
    movss   xmm1,real4 ptr[esi+ebx*2]       ; get real data
    movss   xmm2,real4 ptr[edx+eax*2]       ; get imaginary data
    movss   xmm3,real4 ptr[edx+ebx*2]       ; get imaginary data
    movss   real4 ptr[esi+eax*2],xmm1       ; swap real data
    movss   real4 ptr[esi+ebx*2],xmm0       ; swap real data
    movss   real4 ptr[edx+eax*2],xmm3       ; swap imaginary data
    movss   real4 ptr[edx+ebx*2],xmm2       ; swap imaginary data
    add     edi,8
    dec     ecx
    jnz     RBI_LP
;   mov     CodeSize,$-RBI_LP

    mov     ebp,SinCosTable
    mov     edx,4                           ; AB distance = 4
align 16
Log2_LP:                                    ; Third Log2 loop starts here, loop (log2(FFTsize)-2) times.
    mov     edi,edx                         ; AB distance
    add     edx,edx                         ; Step == AB distance * 2
    xor     ecx,ecx                         ; Next Log2 loop, first A position = 0
Butterfly_LP:                               ; For strides 4, 8, 16 etc.
    movaps  xmm6,oword ptr[ebp]             ; Load Cos
    movaps  xmm7,oword ptr[ebp+16]          ; Load Sin
    mov     eax,ecx                         ; Next A position
Decomposition_LP:
    mov     ebx,eax                         ; A position
    add     ebx,edi                         ; B position = A + AB distance             

    movaps  xmm0,oword ptr[esi+eax*4]       ; RealA
    movaps  xmm1,oword ptr[esi+eax*4+FFTsize*4]     ; ImagA
    movaps  xmm2,oword ptr[esi+ebx*4]       ; RealB
    movaps  xmm3,oword ptr[esi+ebx*4+FFTsize*4]     ; ImagB
    movaps  xmm4,xmm2                       ; RealB copy
    movaps  xmm5,xmm3                       ; ImagB copy
    mulps   xmm2,xmm6                       ; RealB * cos
    mulps   xmm3,xmm7                       ; ImagB * sin
    mulps   xmm4,xmm7                       ; RealB * sin
    mulps   xmm5,xmm6                       ; ImagB * cos
    subps   xmm2,xmm3                       ; tempReal = RealB * cos - ImagB * sin
    addps   xmm4,xmm5                       ; tempImag = RealB * sin + ImagB * cos
    movaps  xmm3,xmm0                       ; RealA copy
    movaps  xmm5,xmm1                       ; ImagA copy
    addps   xmm0,xmm2                       ; RealA = RealA + tempReal
    addps   xmm1,xmm4                       ; ImagA = ImagA + tempImag
    subps   xmm3,xmm2                       ; RealB = RealA - tempReal
    subps   xmm5,xmm4                       ; ImagB = ImagA - tempImag
    movaps  oword ptr[esi+eax*4],xmm0       ; new RealA
    movaps  oword ptr[esi+eax*4+FFTsize*4],xmm1     ; new ImagA
    movaps  oword ptr[esi+ebx*4],xmm3       ; new RealB
    movaps  oword ptr[esi+ebx*4+FFTsize*4],xmm5     ; new ImagB

    add     eax,edx                         ; Next step in FFT data
    cmp     eax,FFTsize                     ; Do we need the next 4 Sine and Cosine values ?
    jb      Decomposition_LP
    add     ebp,32                          ; Next step in SinCosTable
    add     ecx,4                           ; Next A position
    cmp     ecx,edi                         ; Is current Log2 loop done ?
    jb      Butterfly_LP
    cmp     edx,FFTsize                     ; Are all Log2 loops done ?
    jb      Log2_LP     ; Loop size = 116 bytes
    pop     ebp
    ret   

FFTSSE endp


For input, i tested with these values FFTSize = 1024:
Quote[ComplexDataNew1: F$ 0.0,0.0,1.0,0.0,2.0,0.0,3.0,0.0,4.0,0.0,5.0,0.0,6.0,0.0,7.0,0.0, ; test data
              F$ 8.0,0.0,7.0,0.0,6.0,0.0,5.0,0.0,4.0,0.0,3.0,0.0,2.0,0.0,1.0,0.0,
ComplexDataGuga.Data1: F$ 0 #(FFTsize*4)]


The original version produces the correct values but the newer one is not. I believe it may be due to the changes in InitFFTSinCosTable that is producing strange values rather then the older version. Also, why the size of the table FFT_SinCosTable is FFTSize*1.5 ?

From the tests i´m doing the correct results are displayed as:
forward FFTsse2 first 16 results:

Frequency Index: 0 R: 64.000 I: 0.000 R: 63.910 I: -3.140 R: 63.641 I: -6.268 R: 63.195 I: -9.374
Frequency Index: 1 R: -26.274 I: 0.000 R: -25.436 I: 1.250 R: -24.548 I: 2.418 R: -23.615 I: 3.503
Frequency Index: 2 R: 0.000 I: 0.000 R: 0.004 I: -0.000 R: 0.016 I: -0.002 R: 0.035 I: -0.005
Frequency Index: 3 R: -3.240 I: -0.000 R: -3.205 I: 0.157 R: -3.158 I: 0.311 R: -3.102 I: 0.460


And the wrong values (the newer version) is producing this:

interleaved FFTsse2 first 16 results:

Frequency Index: 0 Real    (Cosine): 64.000 63.641 62.572 60.810 58.384 55.336 51.716 47.585
Frequency Index: 1 Real    (Cosine): -26.274 -24.548 -22.645 -20.617 -18.513 -16.380 -14.262 -12.198
Frequency Index: 2 Real    (Cosine): 0.000 0.016 0.061 0.129 0.214 0.308 0.404 0.493
Frequency Index: 3 Real    (Cosine): -3.240 -3.158 -3.035 -2.874 -2.680 -2.460 -2.219 -1.964

Frequency Index: 0 Imaginary (Sine): 0.000 -6.268 -12.446 -18.446 -24.183 -29.578 -34.556 -39.052
Frequency Index: 1 Imaginary (Sine): 0.000 2.418 4.504 6.254 7.668 8.755 9.529 10.011
Frequency Index: 2 Imaginary (Sine): 0.000 -0.002 -0.012 -0.039 -0.089 -0.165 -0.270 -0.404
Frequency Index: 3 Imaginary (Sine): 0.000 0.311 0.604 0.872 1.110 1.315 1.483 1.611



Did you succeeded to fix that ? Also, what are the values produces by InitFFTSinCosTable ? I know it seems to be a structure of sins/cos, but using the flag 0-1 it is alfo filling the members of the structure at Pos 04 and pos 08, instead only Pos0 and Pos16 used in FFTSSE2.

I´m trying to understand or fix this, so i can later try to make the data also export amplitude and phase rather then only hthe Real and Imaginary frequyenceis to be used on a spectogram

https://www.gaussianwaves.com/2015/11/interpreting-fft-results-obtaining-magnitude-and-phase-information/?fbclid=IwAR2HHnysT_m_dQIgX94mLA3XGvRAlsjkL5oRg-4JImr5CkcsRYjzs59l4K8
Title: Re: Multithreading with FFT-IFFT
Post by: guga on February 07, 2022, 01:44:49 AM
When i´m analysing it throuugh the debugger, i see 2 differences in the ComplexData results at he start of the Butterfly calculations.

The Correct version has 0 values at the beginning, like this:

https://imgur.com/a/mg3zBrf


But the incorrect version (Th newer one), is inserting a sequence of 4 values = 8 when it should be zero

https://imgur.com/pTTm18d
Title: Re: Multithreading with FFT-IFFT
Post by: guga on February 07, 2022, 02:05:23 AM
The incorrect values seems to be generated due to an error in the Sin/Cosine table

https://imgur.com/COSRSBK

On the older version of InitFFTSinCosTable the values on xmm6 are 1 0 0 1

But, on the newer version of the InitiFFTSinCosTable it is generating weird values in xmm6 ans xmm7, like:
-7.07106e-1 -2.710505e-20 -7.07106e-1 1

Btw, if i ported correctly your newer version of InitSinCosine it is:


; size of sincostable = FFTsize+(FFTsize/2) = FFTsize*1.5 = 1024*1.5 = 1536
Proc InitFFTSinCosTable2:
    Arguments @SinCosTable, @NNType, @FFT_type
    Local @MMax
    Uses esi, edi, ecx

    finit
    fclex
    mov D@MMax 4
    mov edi D@SinCosTable

    .Do
        fldpi
        fidiv D@MMax
        fldz
        mov ecx D@MMax
        mov esi 4

        Do
            fld ST0
            fsincos
            fstp F$edi
            Test_If W@FFT_type IFFT
                fchs
            Test_End
            fstp F$edi+16
            dec esi | jne @OutNotZero
                mov esi 4
                add edi 16
@OutNotZero:
            add edi 4
            fadd ST0 ST1
            dec ecx
        Repeat_Until_Zero
        fstp ST0
        fstp ST0
        mov eax D@MMax | add eax eax | mov D@MMax eax

    .Loop_Until D@NNType = eax

EndP



The error seems to be generated on the Test_If W@FFT_type IFFT the next add edi. It seems to overlap a value ?
Title: Re: Multithreading with FFT-IFFT
Post by: guga on February 07, 2022, 06:11:32 AM
I´m trying to see what´s wrong, so i can ty later calculate everything related to the frequencie of inputed samples, informatons such as amplitude, phase, pitch, sound intensity, sound pressure and the last calculate the distance between the origin of the sound and the target microphone (to make easier to we separate/isolate sounds from the background, for example. This distance we can calculate from the amplitude)

Amplitude and phase seems to be more easy to calculate:
Amplitude = sqrt ( Frequency Imaginary^2 + Frequency Real ^2 )
Phase = atan2 (Frequency Imaginary / Frequency Real)

https://www.who.int/occupational_health/publications/noise1.pdf

https://en.wikipedia.org/wiki/Sound_intensity
https://en.wikipedia.org/wiki/Sound_pressure

https://en.wikipedia.org/wiki/Sound_pressure
https://en.wikipedia.org/wiki/Sound_power
https://sound.eti.pg.gda.pl/student/eim/synteza/leszczyna/index_ang.htm

Or also create other methods of correctly resample a audio file such as: https://dsp.stackexchange.com/questions/2962/how-to-resample-audio-using-fft-or-dft

So, from a given sample (or sequence of samples) we can build a set of structures related to each frequencies on intervals of FFTSize.

Ex: FFTSize = 1024 will produce 1024 different frequencies per sample, right ? Each one of these frequencies carries information related to sound amplitude, phase, pressure, intensity, distance.

Therefore, instead building a array of structures containing only this:

FFTSize= 1024. Sample XXXX

Frequency.Real.Data1: F$ 251.5
Frequency.Imaginary.Data1: F$ -215
Frequency.Real.Data2: F$ 25
Frequency.Imaginary.Data2: F$ 5
....
Frequency.Real.Data1024: F$ 77
Frequency.Imaginary.Data1024: F$ 2

Or their interleaved format:

Frequency.Real.Data1: F$ 251.5
Frequency.Real.Data2: F$ 25
....
Frequency.Real.Data1024: F$ 77

Frequency.Imaginary.Data1: F$ -215
Frequency.Imaginary.Data2: F$ 5
Frequency.Imaginary.Data1024: F$ 2

....

We can make a structure (or even other tables) containing all of those information such as:
Frequency.Real.Data1: F$ 251.5
Frequency.Imaginary.Data1: F$ -215
Frequency.Amplitude.Data1: F$ 125
Frequency.Phase.Data1: F$ 12
Frequency.Pitch.Data1: F$ 777
Frequency.SoundIntensity.Data1: F$ 12
Frequency.Soundpressure.Data1: F$ 125
Frequency.Distance.Data1: F$ 2252

Frequency.Real.Data2: F$ 25
Frequency.Imaginary.Data2: F$ 5
Frequency.Amplitude.Data2: F$ 125
Frequency.Phase.Data2: F$ 12
Frequency.Pitch.Data2: F$ 77
Frequency.SoundIntensity.Data2: F$ 12
Frequency.Soundpressure.Data2: F$ 125
Frequency.Distance.Data2: F$ 2252
....
Frequency.Real.Data1024: F$ 77
Frequency.Imaginary.Data1024: F$ 2
Frequency.Amplitude.Data1: F$ 125
Frequency.Phase.Data1024: F$ 12
Frequency.Pitch.Data1024: F$1255
Frequency.SoundIntensity.Data1024: F$ 12
Frequency.Soundpressure.Data1024: F$ 125
Frequency.Distance.Data1024: F$ 2252


So, with a Array of Structures each one of them containing more (or all) information regarded to a certain frequency, we can later build better filters to work with.