Author Topic: Multithreading with FFT-IFFT  (Read 20688 times)

Siekmanski

  • Member
  • *****
  • Posts: 1139
Multithreading with FFT-IFFT
« on: May 14, 2015, 07:58:23 PM »
This is an attempt to do multithreading with FFT-IFFT. (SSE2 and SSE3)
I'm not sure if the multithreading part is written well.
I see some strange timing values.....
Timing code needs to be checked as well the multithreading code.

Marinus

Code: [Select]
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 12

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor9:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor10:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor11:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor12:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


7172218 cycles for 1024 FFT & IFFT with 12 cores.

dedndave

  • Member
  • *****
  • Posts: 8748
  • Still using Abacus 2.0
    • DednDave
Re: Multithreading with FFT-IFFT
« Reply #1 on: May 14, 2015, 08:03:25 PM »
prescott w/htt
Code: [Select]
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 2

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

1226085 cycles for 1024 FFT & IFFT with 2 cores.

Biterider

  • Member
  • **
  • Posts: 126
  • ObjAsm32
    • ObjAsm32
Re: Multithreading with FFT-IFFT
« Reply #2 on: May 14, 2015, 08:10:35 PM »
Hi,
i7 4770K CPU

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 8

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


1233527 cycles for 1024 FFT & IFFT with 8 cores.

Press any key to continue...


Biterider

Raistlin

  • Member
  • ***
  • Posts: 250
Re: Multithreading with FFT-IFFT
« Reply #3 on: May 14, 2015, 08:12:51 PM »
HP Intel Core i5

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


1025270 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...

sinsi

  • Member
  • *****
  • Posts: 1003
Re: Multithreading with FFT-IFFT
« Reply #4 on: May 14, 2015, 08:16:49 PM »
AMD A10-7850K
Code: [Select]
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


1273912 cycles for 1024 FFT & IFFT with 4 cores.
I can walk on water but stagger on beer.

Raistlin

  • Member
  • ***
  • Posts: 250
Re: Multithreading with FFT-IFFT
« Reply #5 on: May 14, 2015, 09:25:31 PM »
Theoretically this would then allow for 2,5 million or so / FFT per second on a 4 core 3 Ghz CPU.
That can't be right can it ?

FORTRANS

  • Member
  • ****
  • Posts: 945
Re: Multithreading with FFT-IFFT
« Reply #6 on: May 15, 2015, 12:36:13 AM »
Hi,

   Results for i3 processor.

Code: [Select]
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


782169 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...

   Seems some zeroes are minus in some results and plus in others?
For example the first zero with Processor1 and Processor2.  Does
that matter?

Regards,

Steve N.

Siekmanski

  • Member
  • *****
  • Posts: 1139
Re: Multithreading with FFT-IFFT
« Reply #7 on: May 15, 2015, 12:42:26 AM »
Thank you all.

Hi Raistlin,
The fft routines are OK.
The timings are not correct as I mentioned in my first post. I got wrong time values.
Also I'm not sure about the multithreading code.
I'm not sure if the timing code is valid when running all cores together.
That we have to tackle.

Hi Steve,

rounding errors I guess. 0 is the same as -0.

LiaoMi

  • Member
  • **
  • Posts: 162
Re: Multithreading with FFT-IFFT
« Reply #8 on: May 15, 2015, 03:21:01 AM »
Intel i7 4810MQ

Code: [Select]
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 8

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


2023572 cycles for 1024 FFT & IFFT with 8 cores.

Press any key to continue...

rrr314159

  • Member
  • *****
  • Posts: 1383
Re: Multithreading with FFT-IFFT
« Reply #9 on: May 15, 2015, 09:04:43 AM »
Hi siekmanski, (and all),

You have 6 cores, I guess it's an i7 4960? Cool ...

Minor point, but related to a major point, your printout says (for instance)

7172218 cycles for 1024 FFT & IFFT with 12 cores

But no, you have 6 cores. There's 12 threads, using hyper-threading. The rest of us peasants have 4 or 2 cores, either with hyper-threading (doubling what you mis-call the core count) or not. So for instance my i5 3330 has "4 cores" while biterider's i7 4770K has "8 cores" - actually only 4 cores, *2 for two threads per core.

That's a minor point, but the important point is: AFAIK you shouldn't use hyper-threading for this type of CPU intensive algo. No doubt you know this stuff but not everyone does ... hyper-threading allows two threads to share 1 core. Most software spends a lot of time waiting for disk access and stuff like that. So when a thread is in such a wait state the core is given over to the other ("hyper-") thread. That thread, though idle, has all its data preserved in L1-type memory (high-speed, right next to the core - not L1 itself but similar) so it can hop into the core very quickly. As opposed to the normal situation, where the new thread needs to be restored from a far-off slow location which takes forever. Thus for normal software the two threads can each get almost full performance from this one core, almost doubling over-all performance for very few extra transistors.

BUT (there's always a "but") this doesn't work for CPU-intensive algos. Hyper-threading actually reduces overall performance in that case. A CPU-bound algo never waits for disk or anything, always utilizes the CPU maximally. So the two hyper-threads just fight each other for CPU time.

Therefore I think (based on naive understanding) don't use hyper-thread for this CPU-bound FFT app. Your machine shld just use 6 cores, and LiaoMi and biterider, just 4. All the other machines reported here are not hyper-thread so their performance is already better.

If you know better I won't be surprised, like I say I'm no expert.

Now, there's also a problem with timing. With so little data in each FFT it's gonna be done in no time, then the core comes back for more work. So (for instance) your timing shows how long it takes to launch 12 threads, 1000 times. Whereas dedndave's shows how long it takes to launch 2 threads, 1000 times. That's why the performance of his lesser machine apparently beats yours by about 6 times!

Basic idea, need to time how long it takes to go through a large data set, regardless of cores. So should have one large data set, maybe 1024 * 1024 * 1024, and put all cores available to work on that one set. Begin at beginning when get to end you're done. That way your 6 cores will do it about 6 times as fast as dedndave (since, altho he's got 2 cores, I'd expect your newer machine about twice as fast per core).

In other words you're doing a lot more work right now than dedndave - 6 times as much, the way it's set up. Dunno if I'm explaining this well, or even if I'm right, so will leave it at that and wait for comments.

Here are all the times in order from "slowest" (you, the fastest machine!) down to fastest, which happens to be mine. That's probably because I have nothing else running (not even hooked to internet) and no hyper-threading so my machine launches threads a bit quicker. I'm actually doing twice as much work during those cycles than dedndave, as explained above, and you're doing 3 times more than that, plus an awful lot of context switching, which explains these topsy-turvy results.

siek, 4960:   7172218 cycles for 1024 FFT & IFFT with 12 cores.
Liao, 4810MQ: 2023572 cycles for 1024 FFT & IFFT with 8 cores.
sinsi, 7850K: 1273912 cycles for 1024 FFT & IFFT with 4 cores.
bite, 4770K:  1233527 cycles for 1024 FFT & IFFT with 8 cores.
dedn, presco: 1226085 cycles for 1024 FFT & IFFT with 2 cores.
Rais, HP i5:  1025270 cycles for 1024 FFT & IFFT with 4 cores.
FORTRANS, i3:  782169 cycles for 1024 FFT & IFFT with 4 cores.
rrr, i5 3330:  608289 cycles for 1024 FFT & IFFT with 4 cores.

Finally there's a problem with the "master" routine, the one doing the timing: same core is also doing FFT calcs. U actually have 13 threads going, not just 12! And I've got 5, dedndave 3. So there's even more contention.

As for the math, Danielson-Lanczos is good, in some cases it's about the best AFAIK. But it depends a lot on the specific type of FFT and how it's going to be used. As for your implementation, looks good but perhaps can be speeded up a bit, I think (without having looked closely) some of that shuffling can be reduced.

The important thing at the moment is to clean up the multi-threading and timing, so I'll wait for comments on what I've said. If this seems right I know more or less what needs to be done, if I'm wrong that's another story!

Here is my run:

Code: [Select]
FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000


608289 cycles for 1024 FFT & IFFT with 4 cores.

Press any key to continue...

Thanks a lot (from everyone I'm sure) for getting this ball rolling, looks like a lot of fun!
« Last Edit: May 16, 2015, 12:46:21 PM by rrr314159 »
I am NaN ;)

jj2007

  • Member
  • *****
  • Posts: 7728
  • Assembler is fun ;-)
    • MasmBasic
Re: Multithreading with FFT-IFFT
« Reply #10 on: May 15, 2015, 10:39:55 AM »
hyper-threading allows two threads to share 1 core. Most software spends a lot of time waiting for disk access and stuff like that. So when a thread is in such a wait state the core is given over to the other ("hyper-") thread. That thread, though idle, has all its data preserved in L1-type memory (high-speed, right next to the core - not L1 itself but similar) so it can hop into the core very quickly. As opposed to the normal situation, where the new thread needs to be restored from a far-off slow location which takes forever. Thus for normal software the two threads can each get almost full performance from this one core, almost doubling over-all performance for very few extra transistors.

BUT (there's always a "but") this doesn't work for CPU-intensive algos. Hyper-threading actually reduces overall performance in that case. A CPU-bound algo never waits for disk or anything, always utilizes the CPU maximally. So the two hyper-threads just fight each other for CPU time.

Thanks, that is a very nice explanation of the pros and cons of hyperthreading :t

@Marinus: afaics FFTSSE2 and FFTSSE3 both use scalar instructions. What about "parallelising" through using packed instructions?

nidud

  • Member
  • *****
  • Posts: 1408
    • https://github.com/nidud/asmc
Re: Multithreading with FFT-IFFT
« Reply #11 on: May 15, 2015, 10:59:02 AM »
That thread, though idle, has all its data preserved in L1-type memory (high-speed, right next to the core - not L1 itself but similar) so it can hop into the core very quickly. As opposed to the normal situation, where the new thread needs to be restored from a far-off slow location which takes forever.

The hardware closest to the CPU (the L0 cache if you will) is the fastest and most expensive part and shares more or less the same privileges as the registers in the CPU. This is not a byte buffer as in software but a cache-line buffer. In newer hardware this will be 128 bytes in size and older 64 or 32 bytes. This means that all memory moves is done in chunk size.

Another aspect of this is that it’s capable of moving 6M at the same speed as 64 byte (not actually true). When you move one byte from memory (mov al,[esi]) a chunk-size is copied to this fast lane. However, before this happens the CPU needs to save whatever is there to the cache. This is done by "pushing" the cache one line down. The last line in the L3 cache is then overwritten in this process and deleted.

As a result of this shuffle the cache will be chronological in time where the most resent data accessed will always be closest to the CPU.

This means that all memory past through the CPU is pushed into the L1 cache and then to L2. Each CPU (or core if you will) has a separate cache but share the same L1 cache (not sure this is correct so rrr may expand on this, but they all (at least) have access to L1 but not to the L2 and L3 cache).

The problem is that multiple CPU’s may access the same memory region. If this is overwritten by one of the treads all of them needs to invalidate this data in their own cache, which may be located in L2 or L3. This is why all data must be past through the cache system this way to synchronize the content of each CPU.

If you then look at the LUT, the size of it may not be that important since only the address you actually use will end up in the cache. In this case the LUT is created (in one CPU) and this will move the whole table into the cache on creation, so a static table may be better in some cases, at least for large ones.

rrr314159

  • Member
  • *****
  • Posts: 1383
Re: Multithreading with FFT-IFFT
« Reply #12 on: May 15, 2015, 11:07:41 AM »
Hi jj2007,

thanks,

Actually two main loops in FFTSSE2 and 3 are using SIMD, loops called Istep_LP and Normalize_LP. Some other SIMD instructions also in less important loop RBI_LP and subsequent. There are lots of scalar instructions throughout but mainly initialization, prob could use SIMD but not involved with timing. I haven't studied it yet could be misunderstanding but, as I say, appears to parallelize with packed instructions in these places, movlps, mulps etc.

BTW there is a diff between parallelize and multi-thread, (altho in practice the words are pretty interchangeable), parallelizing the algo includes breaking into pieces ("jobs", "tasks", word like that) which can run independent, then assigning to different threads. Here that's been done by explicitly writing out separate data sets (ComplexData1, ComplexData2 ... ComplexData16) which certainly works. But think it needs more flexible scheme.

Another topic entirely, seems there's been a lot of homework q's asked lately. Ever since the "Visual Masm" incident I've figured you're our resident sleuth, glad to see u getting onto these errant students, keep up good work. I noticed that your handle is very similar to James Bond 007 I assume that's no coincidence :biggrin:

another post just happened, am posting this b4 looking, hope it still makes sense
I am NaN ;)

rrr314159

  • Member
  • *****
  • Posts: 1383
Re: Multithreading with FFT-IFFT
« Reply #13 on: May 15, 2015, 11:47:06 AM »
hi nidud,

I thought they each have their own L1, regardless, you're right they're synched so that if one writes data which another one uses, it also must be changed, invalidating previous work. Anything like that is very bad, of course, if it happens often.

That's why parallelizing the algo ideally separates data completely, so each core works on its own data set. Some algos that's difficult or impossible, but FFT should be easy enough. I have multi-threaded only algos that can be thoroughly parallelized, so I never studied details of cache synchronizing - just completely avoid the issues with separated data. Mainly I've worked with pixel data, each pixel independent of the others, and it's easy to assign different screen areas (rows usually) to different cores. For FFT I believe similar can be done by partitioning the dataset.

Also didn't think about the LUT, you're raising a bunch of issues here. Perhaps the easiest approach is 1) clean up the timing and basic parallelization then 2) try it out and see how well it works, experiment to determine best way to handle LUT etc, as opposed to trying to figure it out beforehand. That's 2 much like work, for me anyway, since I don't know a lot of these details!

Another approach, tell Antariy that Lingo has figured out the fastest way to do it, nobody could ever beat him - wait a day or two - Antariy will work around the clock to show him up :biggrin:
I am NaN ;)

dedndave

  • Member
  • *****
  • Posts: 8748
  • Still using Abacus 2.0
    • DednDave
Re: Multithreading with FFT-IFFT
« Reply #14 on: May 15, 2015, 12:29:26 PM »
you can use CPUID to determine whether or not the CPU supports HTT
and, you can control the thread affinity
the tricky part is - figuring out which affinity bit corresponds to which physical core   :lol:
it's very likely that you could use every other bit (on an HTT machine) and get seperate physical cores
no guarantees, though