Multithreading with FFT-IFFT

Started by Siekmanski, May 14, 2015, 07:58:23 PM


rrr314159

@dedndave,

You mean you can't simply "turn off" hyper-threading? (I don't have such machines, so I don't know about them.) You might have to trick it into putting (e.g.) 4 threads on the 4 separate cores? In my case, with almost nothing else running on the machine, I've found I don't need thread affinity. If I launch 4 threads, Windows automatically makes the best allocation: they each wind up staying on the same physical core most of the time. The OS doesn't move them if it doesn't have to, since obviously that's less efficient. Can't you do the same on an HTT machine: just launch 4 threads with highest priority, keep everyone else away (for the most part), so the OS will automatically assign each a core and leave them there? Of course they do switch around a few times a second, maybe, but that's not enough to reduce efficiency significantly.
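
In masm32 terms that scheme needs only a few lines; here is a minimal sketch (names are invented: FFTworker stands for whatever proc does one thread's share of the work, and the usual masm32 includes are assumed):

    NUMTHREADS  equ 4

    .data?
    hThread     dd NUMTHREADS dup (?)

    .code
    ; placeholder worker: param says which slice of the job this thread owns
    FFTworker proc param:DWORD
        ; ... this thread's FFT/IFFT passes go here ...
        xor eax, eax
        ret
    FFTworker endp

    RunWorkers proc
        LOCAL tid:DWORD
        push ebx                    ; ebx survives the API calls below
        xor ebx, ebx
      @@:
        invoke CreateThread, NULL, 0, offset FFTworker, ebx, 0, addr tid
        mov hThread[ebx*4], eax
        ; raise priority so other processes (mostly) stay off these cores
        invoke SetThreadPriority, eax, THREAD_PRIORITY_ABOVE_NORMAL
        inc ebx
        cmp ebx, NUMTHREADS
        jb @B
        ; block until every worker is finished
        invoke WaitForMultipleObjects, NUMTHREADS, offset hThread, TRUE, INFINITE
        pop ebx
        ret
    RunWorkers endp

No affinity masks anywhere - the point being that the OS parks each thread on its own core as long as the machine is otherwise quiet.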

On another topic, I've been reading a bit about FFT implementation. There are many issues. As nidud mentions, cache is important (use a "cache-oblivious" algo?); there's bit-reversal, the question of whether the output overwrites the input or not, special algos for use with SIMD ... ! The math is by far the easiest part, and only a fraction of the whole story. And these core-assignment issues are not even mentioned; fortunately that part is easy (at least, without HTT).

I hope someone (siekmanski most likely) knows how to do this already! I'll be content to applaud ...
I am NaN ;)

Raistlin

@Siekmanski

Ever had a look at the modified split-radix FFT by Johnson and Frigo (2007), said to need
about 6% fewer floating-point operations than standard Cooley-Tukey (Danielson-Lanczos Lemma)?

OOPS [later edited]: "Implementing FFTs in Practice", Johnson and Frigo (2012), proposes a cache strategy for FFT.
Can't make sense of it yet; going off to do background reading first  :icon_eek:
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

My 2 cents:

I looked at it. My overall impression is that 6% fewer FP ops is small beer compared to other issues such as cache. All that obsession with FP op counts comes from an earlier time when FP hardware multiplication was very slow, if it was available at all; even FP addition was very slow compared to integer. True, by 2007 that was no longer the case, but academia adjusts slowly; like the proverbial General, they're always fighting the last war. Today, especially with SIMD, FP multiplication is not much slower than addition. In fact both are faster than integer when dealing with AVX (without AVX2). A week ago I came up with a clever way to avoid a mult (in a particular situation) using 3 integer ops (xor, shift-left, and) - my "clever" way took twice as long!

The FFT, you know, saves a factor of N/log N ops compared to the Fourier Transform computed from the definition - that comes to many thousands of percent - that IS worth it. 6%, as I say, is perhaps worth it as a minor refinement after everything else is in place. If you have to go to even minimal trouble to achieve it, it's probably going to be slower.
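
To put rough numbers on that, for the 1024-point size used later in this thread (my arithmetic):

    direct DFT : N*N       = 1024 * 1024 = 1,048,576 complex ops
    FFT        : N*log2(N) = 1024 * 10   =    10,240 complex ops

roughly a factor of 100, next to which a further 6% is a rounding error.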
I am NaN ;)

dedndave

no - can't turn off hyperthreading
i have an older machine, so if i did, i'd have only one core to play with - lol

i guess you do this....
figure out how many physical cores there are
total cores, divide by 2 if HTT bit set in CPUID (leaf 80000001:EDX[28])
create that many threads and let the OS manage cores
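
In masm32 terms dave's recipe might look like this (a sketch - GetThreadCount is a made-up name, and the HTT test uses the 80000001h leaf named above):

    GetThreadCount proc
        LOCAL sysinfo:SYSTEM_INFO
        LOCAL htt:DWORD
        push ebx                    ; cpuid overwrites ebx
        mov eax, 80000001h          ; extended leaf, per the note above
        cpuid
        pop ebx
        and edx, 1 shl 28           ; isolate the HTT bit, EDX[28]
        mov htt, edx
        invoke GetSystemInfo, addr sysinfo
        mov eax, sysinfo.dwNumberOfProcessors  ; logical processor count
        .if htt != 0
            shr eax, 1              ; two logical processors per physical core
        .endif
        ret                         ; eax = number of threads to create
    GetThreadCount endp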

nidud

#19
deleted

Siekmanski

Hi guys,

That's a lot of things to take care of...
I'll try to code different schemes and see what it does.

In the meantime I wrote a timing routine that displays the measured time in 100-nanosecond units instead of cycle counts.
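
(For reference, one way such a timer can be built on QueryPerformanceCounter - a sketch with invented names, not necessarily how Marinus did it; the tick delta is scaled by 10,000,000 / frequency to land in 100-nanosecond units:)

    .data
    ten7        REAL8 10000000.0    ; 100-ns units per second

    .data?
    qpcFreq     dq ?
    qpcStart    dq ?

    .code
    TimerStart proc
        invoke QueryPerformanceFrequency, offset qpcFreq
        invoke QueryPerformanceCounter, offset qpcStart
        ret
    TimerStart endp

    TimerStop proc                  ; returns elapsed 100-ns units in edx:eax
        LOCAL qpcStop:QWORD
        invoke QueryPerformanceCounter, addr qpcStop
        mov eax, dword ptr qpcStop  ; 64-bit subtract: stop - start
        mov edx, dword ptr qpcStop+4
        sub eax, dword ptr qpcStart
        sbb edx, dword ptr qpcStart+4
        mov dword ptr qpcStop, eax
        mov dword ptr qpcStop+4, edx
        fild qpcStop                ; elapsed ticks
        fmul ten7                   ; * 10,000,000
        fild qpcFreq
        fdivp st(1), st             ; / ticks per second
        fistp qpcStop
        mov eax, dword ptr qpcStop
        mov edx, dword ptr qpcStop+4
        ret
    TimerStop endp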

Quote from: dedndave on May 15, 2015, 11:13:06 PM
no - can't turn off hyperthreading
i have an older machine, so if i did, i'd have only one core to play with - lol

i guess you do this....
figure out how many physical cores there are
total cores, divide by 2 if HTT bit set in CPUID (leaf 80000001:EDX[28])
create that many threads and let the OS manage cores

Thanks Dave,
I'll also implement this.

Marinus
Creative coders use backward thinking techniques as a strategy.

Siekmanski

#21
Next attempt: test for hyper-threading, and timing in 100-nanosecond units.
The speed is not what I thought it should be...  :(
It seems that starting the threads takes a lot of time.

Is there something I missed, or am I not doing it the right way?
Maybe I should set up the threads and then trigger them with an event?
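
(Roughly what that event scheme looks like - a sketch with invented names, shown for a single worker; with several workers, one auto-reset "go" event per thread, or a semaphore released N times, keeps one SetEvent from waking the wrong number of waiters:)

    .data?
    hGoEvent    dd ?
    hDoneEvent  dd ?

    .code
    ; the worker is created once and parks on the event between jobs
    EventWorker proc param:DWORD
      @@:
        invoke WaitForSingleObject, hGoEvent, INFINITE
        ; ... one FFT/IFFT pass over this thread's data ...
        invoke SetEvent, hDoneEvent ; report back to the main thread
        jmp @B
    EventWorker endp

    ; startup, once:
    ;   invoke CreateEvent, NULL, FALSE, FALSE, NULL   ; auto-reset "go"
    ;   mov hGoEvent, eax
    ;   invoke CreateEvent, NULL, FALSE, FALSE, NULL   ; auto-reset "done"
    ;   mov hDoneEvent, eax
    ;   invoke CreateThread, NULL, 0, offset EventWorker, 0, 0, addr tid
    ;
    ; per measurement: start the timer, SetEvent hGoEvent, wait on
    ; hDoneEvent, stop the timer - thread creation stays out of the loop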

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 12
Physical Processors : 6
ActiveProcessorMask : 0FFFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread9:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread10:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread11:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread12:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.0307719 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.6469146 seconds for 12 * 1000 * '1024 FFT & IFFT' with 12 threads.

Use AffinityMasks to manage the cores:
0.7050401 seconds for 12 * 1000 * '1024 FFT & IFFT' with 12 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

0.3317011 seconds for 6 * 1000 * '1024 FFT & IFFT' with 6 threads. ( Even )
0.3423527 seconds for 6 * 1000 * '1024 FFT & IFFT' with 6 threads. ( Uneven )

Press any key to continue...


Creative coders use backward thinking techniques as a strategy.

LiaoMi

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Operating System    : Windows 8
Hyperthreading      : YES
Logical Processors  : 8
Physical Processors : 4
ActiveProcessorMask : 00FFh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread5:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread6:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread7:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread8:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.0504474 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.3644088 seconds for 8 * 1000 * '1024 FFT & IFFT' with 8 threads.

Use AffinityMasks to manage the cores:
0.4224370 seconds for 8 * 1000 * '1024 FFT & IFFT' with 8 threads.

Hyperthreading on board, lets test even and uneven AffinityMask bits....

0.2059342 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads. ( Even )
0.2245983 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads. ( Uneven )

Press any key to continue...

dedndave

rather than trying to sync all the threads....
maybe timing them individually ?
i think the timing macros might not be "thread safe", but you could write your own version that uses local vars

Siekmanski

Thanks LiaoMi.

Quote from: dedndave on May 17, 2015, 04:49:33 AM
rather than trying to sync all the threads....
maybe timing them individually ?
i think the timing macros might not be "thread safe", but you could write your own version that uses local vars

Hi Dave,

Yes, you're right that the timings are not "thread safe"; I noticed that too.

In some cases you want to launch, let's say, 2 threads and need to wait until they are both finished before you can go on in your app.
For example: calculate the FFT of a stereo sound stream every 10 ms.
In this case I want to know how fast the routine calculates these 2 threads (hopefully running simultaneously, each on a different core).
The timer routine is started and stopped on the same core as the app runs on, I assume.

In other cases, where you don't have to wait for the threads to finish, you could measure the time for each thread separately.
Creative coders use backward thinking techniques as a strategy.

rrr314159

Timing routines are usually "thread safe" when used only one at a time, as you're doing. You don't need local vars in that case; a global is OK. Still, if you use local vars the timer can be used in multiple threads simultaneously. That's not true of MichaelW's (he stores the beginning time in 2 dwords in memory), so you could use mine (included in my previous Lab work).

But in this case, with just one timer going at a time (no pun intended), MW's global var is fine. The other problem is that the timer must stay on the same core. If the timing thread changes cores in the middle of timing, it will read a different RDTSC - no good. You can tell because it will usually be sensible but then in one case be way off. The way you're doing it at the moment is particularly vulnerable, because you're launching enough threads (5, in the case of my 4-core machine) to make it more likely that the timer will, indeed, switch cores.

The right thing is to use Windows Performance Monitoring, "guaranteed" to synchronize across cores. I made a timer routine like that, posted it somewhere; I'll find it again. Probably best for this exercise. BTW Gunther also gave a reference in the Trig thread showing the really right way to do it, with Perf Mon and other precautions as well; maybe he has a routine written to that spec.
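
(A sketch of such a timer, assuming the "Performance Monitoring" counter meant here is the QueryPerformanceCounter API: with the start count in a LOCAL rather than a global, any number of threads can time themselves at once, and QPC reads consistently across cores, unlike a raw RDTSC:)

    TimeOnePass proc
        LOCAL t0:QWORD
        LOCAL t1:QWORD
        invoke QueryPerformanceCounter, addr t0
        ; ... the work being measured ...
        invoke QueryPerformanceCounter, addr t1
        mov eax, dword ptr t1       ; elapsed ticks -> edx:eax
        mov edx, dword ptr t1+4
        sub eax, dword ptr t0
        sbb edx, dword ptr t0+4
        ret
    TimeOnePass endp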

dedndave, for this exercise it's no good to time one thread at a time; we want to determine how effective the multithreading is. It's not at all true that you can time one thread and then assume that 4 threads will simply do it 4 times faster.

I am NaN ;)

rrr314159

Hi siekmanski,

A couple of fixes needed in Consolestuff.asm

line 46 (this one is just a warning: initialized data in the .data? segment)

   szProcessorName             db 52 dup (0)

change to

   szProcessorName             db 52 dup (?)

line 287 (the cpuid leaf in eax must have bit 31 set, i.e. an extended leaf)

   mov     eax,1

change to

   mov     eax,80000001h

Without this it incorrectly reports that my machine uses HTT. The run given below has this fixed. On this point it's worth mentioning that you ignored AXIOM 1 of masm32.com, always a no-no:

**************** AXIOM 1 of masm32.com: dedndave is always right ******************

The timings seem to indicate, correctly no doubt, that

- launching 1 thread and doing one FFT takes .05 secs
- launching 4 threads, with 4 FFT's, takes a bit less than 4 times as long
- AffinityMask makes little difference

I intend to put together a better way (I think) to do the threads, so that using 4 will go close to 4 times quicker. Gonna get on it now; hope to have it ready soon.

BTW I've tested the FFT routines a bit, with different inputs; everything correct so far, looks good.

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.

Processor Name      : Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
Operating System    : Windows 8
Hyperthreading      : NO
Logical Processors  : 4
ActiveProcessorMask : 000Fh

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

First 16 Complex (real, imaginary) samples for testing:

0.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

forward FFTsse2 first 16 results:

0.063 0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse2 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

forward FFTsse3 first 16 results:

0.063 -0.000 0.062 -0.003 0.062 -0.006 0.062 -0.009
0.061 -0.012 0.060 -0.015 0.059 -0.018 0.058 -0.021
0.057 -0.024 0.056 -0.026 0.054 -0.029 0.052 -0.031
0.051 -0.034 0.049 -0.036 0.046 -0.038 0.044 -0.040

inverse FFTsse3 first 16 results:

0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Thread1:
-0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread2:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread3:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Thread4:
0.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Timing starts after 2 seconds....

0.0547565 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
0.1736391 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Use AffinityMasks to manage the cores:
0.1761987 seconds for 4 * 1000 * '1024 FFT & IFFT' with 4 threads.

Press any key to continue...
I am NaN ;)

Siekmanski

Thanks rrr314159,

Corrected the stupid typos. Must drink more coffee when writing code.  :biggrin:
And yes, you are right. Dedndave is always right.  :t

I'm also now testing another way to do the threads (using events).
Have it ready soon.
Creative coders use backward thinking techniques as a strategy.

rrr314159

#28
Here is my attempt at the modified FFT routine. As promised, with 4 threads it goes something like 4 times faster.

Unfortunately it took longer than expected (4 hours or so), so it's not as clean as it should be. But I won't have an opportunity to fix anything for 8 hours, so I wanted to get this to you now.

The only real problem: I accidentally removed SSE2 support. I didn't notice until I'd buttoned it up, and can't test it anyway, so here it is. It can easily be put back in; I'll do it tomorrow (or the day after :) if desired.

Also, I more-or-less deliberately removed the ThreadAffinity stuff. I intended to put it back, but ran out of time. Don't think it's necessary anyway?

This may be 100% from your second version, or some stuff may have crept in from the first. Should be no big deal either way.

The actual FFT routines (which I didn't touch) are in an include file, "FFTstuff.asm", and my new routines are in the main asm file, now called "FFT.asm". I did this to make my approach clearer. I shifted a couple of data statements into FFTstuff and Consolestuff for modularity.

I also made the data a bit more interesting by making 4 different "rows" (as I call them) of FFT data, which are not zeroed out in the last 1024*2 - 16 entries as the previous data was. These rows are reproduced 250 times for a total of 1000 "rows", and these are what the threads work on.

In the run included here, 1000 * 1024 reps with one thread takes 0.0471801 seconds, while 1000 * (1000 * 1024) reps (1000 reps of the 1000 rows) takes 9.2460124 seconds with 4 threads. On the face of it that's quite a bit better than 4 times as fast, since 47.2/4 ≈ 11.8 seconds. But of course that's because it gets better the more reps you do, due to cache utilization no doubt. In fact the four threads are maybe 3.5 times better than one, which is about as good as you can expect.

The zip includes

FFT.exe      ; can be run by anyone with SSE3 to check timing
FFT.asm      ; heavily modified main routine
Consolestuff.asm   ; unchanged except 2 data items moved here
FFTstuff.asm   ; FFT routines, unchanged, with accompanying data
doFFT.bat   ; "make file" batch file

Here are my timing results:

FFT IFFT routines using Danielson-Lanczos Lemma by Siekmanski 2015.
Multi-threading parallelization changed rrr314159 2015/05/17.

Number Of Processors: 4

First 16 Complex (real, imaginary) samples for testing:

11.000 0.000 1.000 0.000 2.000 0.000 3.000 0.000
4.000 0.000 5.000 0.000 6.000 0.000 7.000 0.000
8.000 0.000 7.000 0.000 6.000 0.000 5.000 0.000
4.000 0.000 3.000 0.000 2.000 0.000 1.000 0.000

Initializing Reverse-Binary Table, FFT and IFFT Sine-Cosine Tables.......

forward FFTsse3 first 16 results:

0.073 0.000 0.073 -0.003 0.073 -0.006 0.072 -0.009
0.072 -0.012 0.071 -0.015 0.070 -0.018 0.069 -0.021
0.068 -0.024 0.066 -0.026 0.065 -0.029 0.063 -0.031
0.061 -0.034 0.059 -0.036 0.057 -0.038 0.055 -0.040

inverse FFTsse3 first 16 results:

11.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Print multithreaded results:

Processor1:
1.000 -0.000 1.000 -0.000 2.000 -0.000 3.000 -0.000
4.000 0.000 5.000 -0.000 6.000 -0.000 7.000 -0.000
8.000 -0.000 7.000 -0.000 6.000 -0.000 5.000 -0.000
4.000 -0.000 3.000 -0.000 2.000 -0.000 1.000 -0.000

Processor2:
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000
11.000 -11.000 9.000 -9.000 7.000 -7.000 6.000 -6.000
4.000 -4.000 3.000 -3.000 2.000 -2.000 1.000 -1.000

Processor3:
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
20.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000
0.000 0.000 1.000 1.000 0.000 0.000 -1.000 -1.000

Processor4:
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000
5.000 -3.000 22.000 -22.000 -1.000 0.000 1.000 1.000
5.000 -3.000 22.000 -22.000 -0.000 -0.000 6.000 2.000

Timing starts after 2 seconds....

0.0471801 seconds for 1000 * '1024 FFT & IFFT' without threads.

Let the OS manage the cores:
9.2460124 seconds for 1000 * 1000 * '1024 FFT & IFFT' with 4 threads.

Press any key to continue...


[edit] AFAIK this routine works correctly, but one small correction: as siekmanski pointed out, in the routine "procarow" simply remove the two statements "push edi" and "pop edi"; they're harmless but useless. Also don't forget it doesn't work on SSE2; you must have SSE3 at least. Finally, a slower machine like dedndave's may take as much as 2 minutes after the statement "Let the OS manage the cores:", utilizing 100% CPU, and will appear to be hung up.
I am NaN ;)

dedndave

Quote from: rrr314159 on May 17, 2015, 12:39:45 PM
dedndave, for this exercise it's no good to time one thread at a time; we want to determine how effective the multithreading is. It's not at all true that you can time one thread and then assume that 4 threads will simply do it 4 times faster.

perhaps i wasn't very clear on what i had in mind....

create N threads, timing each thread separately
do not attempt to control thread affinity - let the OS handle it
when done, add the N times together for the total time

i understand that all threads must complete before the operation is done
but, we are interested in total algo time, for the sake of comparison
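
(A sketch of that bookkeeping - MAXTHREADS and all names are placeholders; each worker stamps its own slot using only locals, and the main thread sums the slots after the wait:)

    MAXTHREADS  equ 16

    .data?
    threadTicks dq MAXTHREADS dup (?)   ; one slot per worker, nothing shared

    .code
    ; body of each worker; param is the worker's index
    TimedWorker proc param:DWORD
        LOCAL t0:QWORD
        LOCAL t1:QWORD
        invoke QueryPerformanceCounter, addr t0
        ; ... this thread's FFT work ...
        invoke QueryPerformanceCounter, addr t1
        mov ecx, param
        mov eax, dword ptr t1           ; elapsed ticks = t1 - t0
        mov edx, dword ptr t1+4
        sub eax, dword ptr t0
        sbb edx, dword ptr t0+4
        mov dword ptr threadTicks[ecx*8], eax
        mov dword ptr threadTicks[ecx*8+4], edx
        xor eax, eax
        ret
    TimedWorker endp

    ; main thread: WaitForMultipleObjects on all the workers, then add
    ; the N slots together for dave's total-algo-time comparison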