Multithreading with FFT-IFFT

Siekmanski · May 14, 2015, 07:58:23 PM

This is an attempt to do multithreading with FFT-IFFT. (SSE2 and SSE3)
I'm not sure if the multithreading part is written well.
I see some strange timing values.....
Timing code needs to be checked as well the multithreading code.

Marinus

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 12

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor9:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor10:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor11:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor12:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


7172218 cycles for 1024 FFT & IFFT with 12 cores.

dedndave · May 14, 2015, 08:03:25 PM

prescott w/htt

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 2

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

1226085 cycles for 1024 FFT & IFFT with 2 cores.

Biterider · May 14, 2015, 08:10:35 PM

Hi,
i7 4770K CPU

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 8

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

1233527 cycles for 1024 FFT & IFFT with 8 cores.

Press any key to continue...

Biterider

Raistlin · May 14, 2015, 08:12:51 PM

HP Intel Core i5

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

1025270 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...

sinsi · May 14, 2015, 08:16:49 PM

AMD A10-7850K

Code Select


FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


1273912 cycles for 1024 FFT & IFFT with 4 cores.

Raistlin · May 14, 2015, 09:25:31 PM

Theoretically this would then allow for 2,5 million or so / FFT per second on a 4 core 3 Ghz CPU.
That can't be right can it ?

FORTRANS · May 15, 2015, 12:36:13 AM

Hi,

Results for i3 processor.

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


782169 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...

Seems some zeroes are minus in some results and plus in others?
For example the first zero with Processor1 and Processor2. Does
that matter?

Regards,

Steve N.

Siekmanski · May 15, 2015, 12:42:26 AM

Thank you all.

Hi Raistlin,
The fft routines are OK.
The timings are not correct as I mentioned in my first post. I got wrong time values.
Also I'm not sure about the multithreading code.
I'm not sure if the timing code is valid when running all cores together.
That we have to tackle.

Hi Steve,

rounding errors I guess. 0 is the same as -0.

LiaoMi · May 15, 2015, 03:21:01 AM

Intel i7 4810MQ

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 8

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


2023572 cycles for 1024 FFT & IFFT with 8 cores.

Press any key to continue...

rrr314159 · May 15, 2015, 09:04:43 AM

Hi siekmanski, (and all),

You have 6 cores, I guess it's an i7 4960? Cool ...

Minor point, but related to a major point, your printout says (for instance)

7172218 cycles for 1024 FFT & IFFT with 12 cores

But no, you have 6 cores. There's 12 threads, using hyper-threading. The rest of us peasants have 4 or 2 cores, either with hyper-threading (doubling what you mis-call the core count) or not. So for instance my i5 3330 has "4 cores" while biterider's i7 4770K has "8 cores" - actually only 4 cores, *2 for two threads per core.

That's a minor point, but the important point is: AFAIK you shouldn't use hyper-threading for this type of CPU intensive algo. No doubt you know this stuff but not everyone does ... hyper-threading allows two threads to share 1 core. Most software spends a lot of time waiting for disk access and stuff like that. So when a thread is in such a wait state the core is given over to the other ("hyper-") thread. That thread, though idle, has all its data preserved in L1-type memory (high-speed, right next to the core - not L1 itself but similar) so it can hop into the core very quickly. As opposed to the normal situation, where the new thread needs to be restored from a far-off slow location which takes forever. Thus for normal software the two threads can each get almost full performance from this one core, almost doubling over-all performance for very few extra transistors.

BUT (there's always a "but") this doesn't work for CPU-intensive algos. Hyper-threading actually reduces overall performance in that case. A CPU-bound algo never waits for disk or anything, always utilizes the CPU maximally. So the two hyper-threads just fight each other for CPU time.

Therefore I think (based on naive understanding) don't use hyper-thread for this CPU-bound FFT app. Your machine shld just use 6 cores, and LiaoMi and biterider, just 4. All the other machines reported here are not hyper-thread so their performance is already better.

If you know better I won't be surprised, like I say I'm no expert.

Now, there's also a problem with timing. With so little data in each FFT it's gonna be done in no time, then the core comes back for more work. So (for instance) your timing shows how long it takes to launch 12 threads, 1000 times. Whereas dedndave's shows how long it takes to launch 2 threads, 1000 times. That's why the performance of his lesser machine apparently beats yours by about 6 times!

Basic idea, need to time how long it takes to go through a large data set, regardless of cores. So should have one large data set, maybe 1024 * 1024 * 1024, and put all cores available to work on that one set. Begin at beginning when get to end you're done. That way your 6 cores will do it about 6 times as fast as dedndave (since, altho he's got 2 cores, I'd expect your newer machine about twice as fast per core).

In other words you're doing a lot more work right now than dedndave - 6 times as much, the way it's set up. Dunno if I'm explaining this well, or even if I'm right, so will leave it at that and wait for comments.

Here are all the times in order from "slowest" (you, the fastest machine!) down to fastest, which happens to be mine. That's probably because I have nothing else running (not even hooked to internet) and no hyper-threading so my machine launches threads a bit quicker. I'm actually doing twice as much work during those cycles than dedndave, as explained above, and you're doing 3 times more than that, plus an awful lot of context switching, which explains these topsy-turvy results.

siek, 4960: 7172218 cycles for 1024 FFT & IFFT with 12 cores.
Liao, 4810MQ: 2023572 cycles for 1024 FFT & IFFT with 8 cores.
sinsi, 7850K: 1273912 cycles for 1024 FFT & IFFT with 4 cores.
bite, 4770K: 1233527 cycles for 1024 FFT & IFFT with 8 cores.
dedn, presco: 1226085 cycles for 1024 FFT & IFFT with 2 cores.
Rais, HP i5: 1025270 cycles for 1024 FFT & IFFT with 4 cores.
FORTRANS, i3: 782169 cycles for 1024 FFT & IFFT with 4 cores.
rrr, i5 3330: 608289 cycles for 1024 FFT & IFFT with 4 cores.

Finally there's a problem with the "master" routine, the one doing the timing: same core is also doing FFT calcs. U actually have 13 threads going, not just 12! And I've got 5, dedndave 3. So there's even more contention.

As for the math, Danielson-Lanczos is good, in some cases it's about the best AFAIK. But it depends a lot on the specific type of FFT and how it's going to be used. As for your implementation, looks good but perhaps can be speeded up a bit, I think (without having looked closely) some of that shuffling can be reduced.

The important thing at the moment is to clean up the multi-threading and timing, so I'll wait for comments on what I've said. If this seems right I know more or less what needs to be done, if I'm wrong that's another story!

Here is my run:

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


608289 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...

Thanks a lot (from everyone I'm sure) for getting this ball rolling, looks like a lot of fun!

jj2007 · May 15, 2015, 10:39:55 AM

Quote from: rrr314159 on May 15, 2015, 09:04:43 AMhyper-threading allows two threads to share 1 core. Most software spends a lot of time waiting for disk access and stuff like that. So when a thread is in such a wait state the core is given over to the other ("hyper-") thread. That thread, though idle, has all its data preserved in L1-type memory (high-speed, right next to the core - not L1 itself but similar) so it can hop into the core very quickly. As opposed to the normal situation, where the new thread needs to be restored from a far-off slow location which takes forever. Thus for normal software the two threads can each get almost full performance from this one core, almost doubling over-all performance for very few extra transistors.

BUT (there's always a "but") this doesn't work for CPU-intensive algos. Hyper-threading actually reduces overall performance in that case. A CPU-bound algo never waits for disk or anything, always utilizes the CPU maximally. So the two hyper-threads just fight each other for CPU time.

Thanks, that is a very nice explanation of the pros and cons of hyperthreading :t

@Marinus: afaics FFTSSE2 and FFTSSE3 both use scalar instructions. What about "parallelising" through using packed instructions?

nidud · May 15, 2015, 10:59:02 AM

deleted

rrr314159 · May 15, 2015, 11:07:41 AM

Hi jj2007,

thanks,

Actually two main loops in FFTSSE2 and 3 are using SIMD, loops called Istep_LP and Normalize_LP. Some other SIMD instructions also in less important loop RBI_LP and subsequent. There are lots of scalar instructions throughout but mainly initialization, prob could use SIMD but not involved with timing. I haven't studied it yet could be misunderstanding but, as I say, appears to parallelize with packed instructions in these places, movlps, mulps etc.

BTW there is a diff between parallelize and multi-thread, (altho in practice the words are pretty interchangeable), parallelizing the algo includes breaking into pieces ("jobs", "tasks", word like that) which can run independent, then assigning to different threads. Here that's been done by explicitly writing out separate data sets (ComplexData1, ComplexData2 ... ComplexData16) which certainly works. But think it needs more flexible scheme.

Another topic entirely, seems there's been a lot of homework q's asked lately. Ever since the "Visual Masm" incident I've figured you're our resident sleuth, glad to see u getting onto these errant students, keep up good work. I noticed that your handle is very similar to James Bond 007 I assume that's no coincidence

another post just happened, am posting this b4 looking, hope it still makes sense

rrr314159 · May 15, 2015, 11:47:06 AM

hi nidud,

I thought they each have their own L1, regardless, you're right they're synched so that if one writes data which another one uses, it also must be changed, invalidating previous work. Anything like that is very bad, of course, if it happens often.

That's why parallelizing the algo ideally separates data completely, so each core works on its own data set. Some algos that's difficult or impossible, but FFT should be easy enough. I have multi-threaded only algos that can be thoroughly parallelized, so I never studied details of cache synchronizing - just completely avoid the issues with separated data. Mainly I've worked with pixel data, each pixel independent of the others, and it's easy to assign different screen areas (rows usually) to different cores. For FFT I believe similar can be done by partitioning the dataset.

Also didn't think about the LUT, you're raising a bunch of issues here. Perhaps the easiest approach is 1) clean up the timing and basic parallelization then 2) try it out and see how well it works, experiment to determine best way to handle LUT etc, as opposed to trying to figure it out beforehand. That's 2 much like work, for me anyway, since I don't know a lot of these details!

Another approach, tell Antariy that Lingo has figured out the fastest way to do it, nobody could ever beat him - wait a day or two - Antariy will work around the clock to show him up

dedndave · May 15, 2015, 12:29:26 PM

you can use CPUID to determine whether or not the CPU supports HTT
and, you can control the thread affinity
the tricky part is - figuring out which affinity bit corresponds to which physical core :lol:
it's very likely that you could use every other bit (on an HTT machine) and get seperate physical cores
no guarantees, though

The MASM Forum

News:

Multithreading with FFT-IFFT

Siekmanski

dedndave

Biterider

Raistlin

sinsi

Raistlin

FORTRANS

Siekmanski

LiaoMi

rrr314159

jj2007

nidud

rrr314159

rrr314159

dedndave