raistlin, I see you've been working the problem! That's great.
re bit-reversal, the problem with the "gray code" and similar algos is, as siekmanski indicates, we only need one bit-reversal table per run; so no point in speeding it up.
While optimizing the bitreversal, memory access becomes more and more the factor determining processing time. After all, these samples have to be swapped no matter how.
Very true! siekmanski makes an array of necesssary swaps - just once - then uses it again and again. The swaps can't be SIMD-ized, and they don't overlap in any way; there's not much u can do but, just go ahead and swap 'em. (Perhaps u could use gather/scatter, but we don't have AVX2). So I didn't really study how well the "gray code" optimizes computing bit-reversals, I suppose it must do something or they wouldn't go to all that trouble.
siekmanski, two things not quite right about your array, first, you're including unnecessary "null" swaps. In fact you only need ... well, rather less than 512 (my algo which I'll post later has the minimum, don't remember exactly). Also probably best to arrange them in ascending order to help with cache, altho might not make any difference.
The paper "advice on multi-threading" is good, more or less correct, and some of it applies to this problem.
We have (as I understand it) a whole lot of 1024-FFTs and they make perfect tasks, with excellent granularity, evenly balanced, so that's the end of that. I'm using dynamic assignment altho at first glance u'd think it not necessary, for various good reasons.
API spin-waiting or blocking, and thread pools, are extremely unnecessary here. We have about 4 processors versus thousands of jobs! And of course the OS handles other details; this paper is written by an OS writer-type person, we're not involved in that. We don't care about flush order because we're using in-place memory FFT.
I don't use Windows API synchronization, altho siekmanski does; this is assembler! Why should I go thru 1000 lines of Windows code to execute an "xchg" when I can do it directly?
This is a good write-up - excellent, even. You should read more from this source, guy knows what he's talking about. But if you want to know more about the *much* simpler job of multi-threading FFT's ...
Just noticed you posted a ref for more bit-reversing. Like that algo for reversing a byte, pretty cool, gotta admit doubt I'd think of it. Unfortunately we've got 10 bits. Reading along I see the best they've got for that appears to be the way we're already doing it. Looks like interesting work, with a lot of algos. But you know, for any one particular algo, it's an awful lot easier to figure it out myself. This is not rocket science ... except the actual FFT which is not simple at all. Would like to see a ref for that: SIMD - aware
[edit] another thing would be very useful, test cases for 1024 (or other) Complex (or real) FFTs, with the frequency domain numbers
[edit] actual pulsar data