Multithreading with FFT-IFFT

FORTRANS · May 17, 2015, 11:04:47 PM

Hi,

Quote from: rrr314159 on May 15, 2015, 01:30:01 PM
U mean u can't simply "turn off" hyper-threading?

That should be an option in the BIOS setup. It was in the first
system I saw with hyper-threading, way, way back when.

Regards,

Steve N.

Siekmanski · May 18, 2015, 07:30:18 AM

Here is my version with event driven threads.
It sets up all the threads and leaves them in "idle" state.
They are fired up when sending an event, then does the FFT and IFFT and goes back in "idle" state when all threads are done.
It's fast and flexible to use.

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6
ActiveProcessorMask : 0FFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread9:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread10:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread11:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread12:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

 0.0499732 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
 0.0619334 seconds for 12 * 1000 * '1024 FFT & IFFT' with 12 threads.

Use AffinityMasks to manage the cores:
 0.1338572 seconds for 12 * 1000 * '1024 FFT & IFFT' with 12 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

 0.0898157 seconds for 6 * 1000 * '1024 FFT & IFFT' with 6 threads. ( Even )
 0.1392474 seconds for 6 * 1000 * '1024 FFT & IFFT' with 6 threads. ( Uneven )

Press any key to continue...

Hi rrr31459,

I studied your thread routines and put a counter in "procarow" to count the FFT's and noticed that its total number is changing each time I start your app.
Also saw, you push and pop edi in "procarow". I assume it has to be esi.

Here are your results:

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.
Multi-threading parallelization changed rrr314159 2015/05/17.

Number Of Processors: 12

First 16 Complex (real, imaginary) samples for testing:

11.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse3 first 16 results:

0.073 0.000 0.073 -0.003 0.073 -0.006 0.072 -0.009
0.072 -0.012 0.071 -0.015 0.070 -0.018 0.069 -0.021
0.068 -0.024 0.066 -0.026 0.065 -0.029 0.063 -0.031
0.061 -0.034 0.059 -0.036 0.057 -0.038 0.055 -0.040

inverse FFTsse3 first 16 results:

11.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Processor3:
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000

Processor4:
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000

Processor5:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor6:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Processor7:
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000

Processor8:
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000

Processor9:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor10:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Processor11:
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000

Processor12:
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000

Timing starts after 2 seconds....

 0.0505246 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
 4.3393800 seconds for 1000 * 1000 * '1024 FFT & IFFT' with 12 threads.

Press any key to continue...

Marinus

rrr314159 · May 18, 2015, 10:42:39 AM

Hi siekmanski,

that's right, should be esi not edi (made a last minute change). For a moment couldn't understand how it could pass testing: it's because your FFTSSE3 "uses" esi (and for that matter edi) so I don't need to push/pop either! I typically eschew procs but they sure can save your bacon now and then.

As for procarow counter different each time I'll look into it, but since results appear right (with a fair amount of testing) I don't know it's necessarily wrong ...

[edit] (thought about it) that routine is used by all threads, re-entrantly, so depending how you set up a counter, it certainly will vary depending on a number of factors ... the threads will do different numbers of rows depending on other processes active on the different cores. guess that must be the explanation

The timings u get look right, with approx. 12 times speed-up using multiple threads, similar to results with your event-using routine.

With your routine it looks like you've fixed previous timing anomalies by keeping the threads alive between FFT's (as mine does also); it was taking too long re-launching them every time. And it appears to prove that letting OS manage cores is best - twice as good, in fact, as using affinitymask. That accords with my previous experience.

However all these results indicate I was wrong about HTT, it is effective with FFT routine. Either it's not CPU-bound as I thought, or it is but manages to be effective anyway, or else these results are misleading (but I don't see how that can be). A few more runs like these will prove the point beyond doubt.

Siekmanski · May 18, 2015, 11:28:36 AM

Hi rrr314159,

QuoteWith your routine it looks like you've fixed previous timing anomalies by keeping the threads alive between FFT's (as mine does also);

And a very important advantage is, creating the threads only once at the startup of my app and then when I need the FFT, ( in my apps 60 times a second )
I only have to send an event to get the FFT data. Close the threads when my app exits. This is very efficient.

dedndave · May 18, 2015, 11:40:02 AM

rrr's routine - hangs with 100% CPU - i had to terminate it
P4 prescott w/htt

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.
Multi-threading parallelization changed rrr314159 2015/05/17.

Number Of Processors: 2

First 16 Complex (real, imaginary) samples for testing:

11.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse3 first 16 results:

0.073 0.000 0.073 -0.003 0.073 -0.006 0.072 -0.009
0.072 -0.012 0.071 -0.015 0.070 -0.018 0.069 -0.021
0.068 -0.024 0.066 -0.026 0.065 -0.029 0.063 -0.031
0.061 -0.034 0.059 -0.036 0.057 -0.038 0.055 -0.040

inverse FFTsse3 first 16 results:

11.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Timing starts after 2 seconds....

 0.1103082 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:

dedndave · May 18, 2015, 11:41:46 AM

Marinus' routine...
correctly detects htt

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
Operating System    : Windows XP
Hyperthreading      : YES
Logical Processors  : 2
Physical Processors : 1
ActiveProcessorMask : 0003h

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

 0.1111278 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
 0.2135991 seconds for 2 * 1000 * '1024 FFT & IFFT' with 2 threads.

Use AffinityMasks to manage the cores:
 0.2115078 seconds for 2 * 1000 * '1024 FFT & IFFT' with 2 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

 0.2065949 seconds for 1 * 1000 * '1024 FFT & IFFT' with 1 threads. ( Even )
 0.2112603 seconds for 1 * 1000 * '1024 FFT & IFFT' with 1 threads. ( Uneven )

Press any key to continue...

Siekmanski · May 18, 2015, 11:54:25 AM

Thanks Dave,

FFT multithreading doesn't seem to pay off on your machine ...

dedndave · May 18, 2015, 01:12:30 PM

it's 2 logical cores, 1 physical core
not much of a test - lol
see how it does on some newer CPU's :t

rrr314159 · May 18, 2015, 02:11:12 PM

@dedndave,

I just made an edit to my post (the one with the FFT routine zipped) says, among other things,:

Quote from: rrrFinally a slower machine like dedndave's should take as much as 2 minutes after the statement "Let the OS manage the cores:", utilizing 100% CPU, and will appear to be hung up

Quote from: dedndavei understand that all threads must complete before the operation is done but, we are interested in total algo time, for the sake of comparison

- now here's something that fascinates me (as hutch might put it), we don't really know what we're interested in! Raistlin has to tell us! He started the idea of parallelizing FFT. I'm interested in both topics (multi-threading, and FFT) so have been playing / working along; I think siekmanski has a clearer idea of his goal; but, u know, it's never really been stated.

- However IMHO it's not what u say. We (well, I) want to know the elapsed real-world time (from launching to terminating all threads) to do a particular amount of FFT'ing; NOT the total of the individual times of the separate threads. Obviously the 2 are closely related, but. That's why we don't need thread-safe timers, instead just one overall timer which needn't be thread-safe

@siekmanski,

Here's my results from your routine3, some of it excised,

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Timing starts after 2 seconds....

 0.0481794 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
 0.0482061 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Use AffinityMasks to manage the cores:
 0.0826355 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

- Without HTT each of the 4 cores is going exactly as fast as the one core, doing 4 times the work in the same time. Whereas with your machine the HTT is 20% slower, with dave's it's 100% slower. Need more timings but, as one would expect, two HTT routines on one core is not as good as 2 separate cores. In your case it's certainly worth it anyway, not in dedndave's, perhaps we need more results to be clear on the point.

- Note you still don't believe dedndave about the cpuid leaf 80000001h :), I don't have HTT. But if it's decided that HTT does work well, and therefore should be treated like real cores, it doesn't matter and u don't have to test for HTT anyway

- It seems pretty definite that AffinityMask shouldn't be used

A final point, on the net when they talk about parallelizing FFT they usually mean multiple threads working on one single FFT. Neither of us is doing that, we're doing one FFT per core, but doing a lot of FFT's. That's perfectly legit, just thot I'd point out it's different than common usage. Wouldn't be at all surprised if it's this other parallelization Raistlin was thinking of

dedndave · May 18, 2015, 02:55:06 PM

my mistake on the 80000001h
it's one of those bits that shows up under both "feature bits" (EAX=1) and "extended feature bits" (EAX=80000001h)
so, using EAX=1 should also work :P

notice that HTT is for Intel processors only (proprietary)
so, AMD may use those bits for other things
let me look it up....

dedndave · May 18, 2015, 03:02:48 PM

for AMD...

EAX=1, EDX[28] = HTT (even though i don't think any AMD processors use HTT)
EAX=80000001h, EDX[28] = Reserved

for Intel...

EAX=1, EDX[28] = HTT
EAX=80000001h, EDX[28] = HTT

rrr314159 · May 18, 2015, 03:28:18 PM

dunno where u got that from, but it's not right

On my Intel, 80000001h correctly reports no HTT, (as MSDN says it should) but 1 gives wrong answer, = yes.
As I said before.

On my AMD, ... both seem to give the wrong answer! Both say "yes", HTT
So evidently doesn't apply to AMD

U had it right the first time, for Intel!

Reminds me of a friend's favorite joke, when somebody accuses him of thinking himself perfect, "I'm not perfect. Once I thought I'd made a mistake, but I was wrong."

Raistlin · May 18, 2015, 05:25:42 PM

Quote
- now here's something that fascinates me (as hutch might put it), we don't really know what we're interested in! Raistlin has to tell us! He started the idea of parallelizing FFT. I'm interested in both topics (multi-threading, and FFT) so have been playing / working along; I think siekmanski has a clearer idea of his goal; but, u know, it's never really been stated.

@rrr314159
Astute as always, The goal (my side) is to do real-time FFT on radio telescope data streams to find binary star system pulsars (dreamy look). That's
were the meta-heuristics will come into play during differential searches that update a probability table for hits on inferred actual pulsars. Please
read http://www.ska.ac.za/researchprojects/s13.php for more information. Note I'am note affiliated with these institutions and obviously
would not be able to even physically be involved in any of such research. However that said the application of such technology would be essentially
endless (realistic look), at the very least proposing a theoretical model that could solve such problems is already a step closer. I'd need 10 petaflop system to
actually test... anyone got one of these laying about?

QuoteA final point, on the net when they talk about parallelizing FFT they usually mean multiple threads working on one single FFT. Neither of us is doing that, we're doing one FFT per core, but doing a lot of FFT's. That's perfectly legit, just thot I'd point out it's different than common usage. Wouldn't be at all surprised if it's this other parallelization Raistlin was thinking of

No, the common single FFT will not work for any real-world problems as efficiently as this and as you put it - The efforts here are much more legit! (awesome work)

FORTRANS · May 18, 2015, 10:44:56 PM

Hi,

Apparently my notebook BIOS does not affect Hyper-Threading...

Code Select

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 4
Physical Processors : 2
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

 0.0598133 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
 0.0910814 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Use AffinityMasks to manage the cores:
 0.1557536 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

 0.1355365 seconds for 2 * 1000 * '1024 FFT & IFFT' with 2 threads. ( Even )
 0.1617971 seconds for 2 * 1000 * '1024 FFT & IFFT' with 2 threads. ( Uneven )

Press any key to continue...

HTH,

Steve N.

dedndave · May 18, 2015, 10:55:57 PM

rrr,
i got that information from the CPUID documents

AMD # 25481-234

http://dedndave.x10.mx/files/Amd-25481-234-CPUID.zip

Intel # 241618-039

http://dedndave.x10.mx/files/Intel-241618-039-CPUID.zip

of course, with newer CPU's, the documents could be out of date
i think both are pre-AVX dates
and - the bit definitions may very well have been changed to accommodate all the AVX stuff :t

Steve, i've never seen it in the BIOS settings
and, i suppose there is a bit to control it, someplace (CR)
but, the operating system is going to set that bit, if there is :P

The MASM Forum

News:

Multithreading with FFT-IFFT

FORTRANS

Siekmanski

rrr314159

Siekmanski

dedndave

dedndave

Siekmanski

dedndave

rrr314159

dedndave

dedndave

rrr314159

Raistlin

FORTRANS

dedndave