News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Multithreading with FFT-IFFT

Started by Siekmanski, May 14, 2015, 07:58:23 PM

Previous topic - Next topic

FORTRANS

Hi Dave,

Quote from: dedndave on May 18, 2015, 10:55:57 PM
Steve, i've never seen it in the BIOS settings
and, i suppose there is a bit to control it, someplace (CR)
but, the operating system is going to set that bit, if there is   :P

   Well, as I said with the first computer that I saw with HTT there
was a BIOS setting to enable it.  I think the default was off.  But
it was a long time ago (and not my computer).  As newer machines
don't have that option, one can suppose that newer operating
systems don't have problems with it any more.

Cheers,

Steve N.

rrr314159

Hello Raistlin,

I'd like to see some of the actual pulsar data just for the interest of it and give something to shoot for re. FFT algo. I went to CSIRO Data Access Portal (PSRFITS archived data) but u need to be a registered user. Looked at PRESTO and TEMPO2 would be worthwhile to see how they do FFT but 2 much work. Scott Ransom mentions it's "very long period" pulsar data but that's a bit vague to say the least. Mentions data comes in various fixed point formats but (more recently) avail as single precision floating that would be best. maybe u could get at Maura Mc's NANOGrav pulsar search data?

If u can't get a bit of such data, do u know the data description? How big is FFT? Is it real not complex (probably), multi-dimensional? What's the data rate, how long is a typical run to be analyzed, what is the range of periods to search? Of course they can go very fast, up to 33 revolutions / sec (?), but also much slower. Usually pulsar rps can be estimated, so I wonder if they typically allow the user to input the range he'd be looking to analyze? I'll bet there are secular variations need to be taken into account, not even sure how to do such FFT's, that would be interesting. There are of course other periods associated with the (typical) binary pulsar system, mainly orbiting (a few hours up to much longer) and occlusion, who knows what else. Bottom line I'm wondering what is the typical problem you (and / or Jeandrew Brink) is trying to solve here.

Also very curious, whatever happened to GR "proof" based on PSR B1913+16? Do u know, or can u find out, about that? As we all know Hulse & Taylor "proved" GR 40 years ago with data from that system, showing the period's slowdown could be explained by contraction of orbit due to gravity waves carrying away energy. Supposedly this scheme was accurate to "14 decimal places" and won them Nobel Prize. That number "14" is very misleading, 6 (at the most) could be said to be associated with the GR gravity wave computations

At the time I was very suspicious and today even more so. Key question, 100s more systems have since been discovered, why don't any of them provide corroboration? In these 40 years has the change in period tracked their predictions correctly? Somehow I doubt it. I notice PSR B1913+16 is now a "recycled pulsar" (unheard of back then) meaning its evolution is completely different than they based their analysis on. Suspect this new explanation came about precisely because later data didn't fit prediction.

The problem I (and all laypersons) have is, they're not going to admit publicly when their Nobel Prize awards turn out to be a mistake. That's why I'm asking u, if you're in the community you could find out these things. If u tell me, I promise not to go to New York Times and create a scandal! It's just for my own curiosity.

It would be great, and not surprising, if H & T's analysis ties in with this FFT project. I'd like to do same analysis to newer systems and see if it's good to "14 decimal places" (more likely -14 I bet).

Any info appreciated c u later

[edit] made a mistake, Raistlin, your goal was "to find binary star system pulsars" and I morphed that into goal I'm more interested in, to analyze them once they're found. Re-read your post and realized this ... oh well maybe above will stimulate something anyway
I am NaN ;)

Raistlin


http://astronomyonline.org/Stars/Pulsars.asp - for light background reading around current methods for detection (discussed method uses Complex DFT)

https://data.csiro.au/dap/landingpage?execution=e2s2 - for pulsar sample data requests
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

The first ref is great just what I wanted seems to support my suspicion that Hulse & Taylor didn't deserve that Nobel but I need to study more

Second ref is no good, u need to be registered user. Probably need affiliation to register
I am NaN ;)

rrr314159

Quote from: jj2007 on May 15, 2015, 10:39:55 AM@Marinus: afaics FFTSSE2 and FFTSSE3 both use scalar instructions. What about "parallelising" through using packed instructions?

Quote from: rrr314159 on May 15, 2015, 11:07:41 AMActually two main loops in FFTSSE2 and 3 are using SIMD, loops called Istep_LP and Normalize_LP. ... I haven't studied it yet could be misunderstanding but, as I say, appears to parallelize with packed instructions in these places, movlps, mulps etc.

jj2007 turns out you're essentially right we're not using packed instructions with any real benefit. Using for complex multiplication (with, e.g., addsubps) is better than nothing but scratching the surface

I finally figured out how to SIMD-ize FFT (obvious once u see it, but I had to review the math). It helps that pulsar-search data is very large, can easily use 1024 points and many per second coming in (multi-threads easily also); also that individual data points are small, no need for real8. Very important, there's no prob separating real from imaginary data for pulsar problem (necessary for the algo)

When u do it the current way, with reverse-bit swapping at beginning, first 2 or 3 passes can't use SIMD (data not lined up right). If u do it the other standard way, swapping at end, then last 2 or 3 passes can't use SIMD. BUT (!!) u can swap somewhere in the middle and get full benefit for all passes! Requires pretty large FFT but that's no problem for pulsar data. Can also do 2 or 3 swappings, might be useful with AVX512 etc, but not currently necessary.

I wonder if anyone has seen this technique worked out anywhere? Somebody must already have done it? It only makes sense with SIMD instructions, because it's a PITA and there's no other reason to go to this trouble. So has anyone seen a recent FFT algo, specially for SIMD, with bit-reversal swapping in the middle? Googled it a bit but soon gave up. (sounds like a job for jj2007!) It's always easier to do it yourself than find somebody else's work, figure out what they've done, fix their mistakes etc (re-usable high-speed math software is an oxymoron). But I'd love to see an existing algo for comparison. This FFT stuff can be puzzling, sometimes feel like I'm standing on my head on a merry-go-round

Anyway I'll go ahead and put this together, will take a while, prob start a new thread (assuming it works of course) because it will be totally different. Should go quite a lot faster, especially AVX. If I get sick of it I'll at least detail how to do it, some hardier soul can tackle it
I am NaN ;)

Siekmanski

Hi rrr314159,

It would be cool to optimize the FFT in all it's facets.
It's difficult to understand all those different FFT examples you can read up on the internet.
I'll start all over and see what comes up. ( probably inventing the wheel again... )
I have some ideas too, have to work it out and see if they will work.
As you mentioned, this FFT stuff can be puzzling.

Are there any members on this forum who want to participate and/or give there thoughts about this subject ?
Creative coders use backward thinking techniques as a strategy.

Raistlin

I'll be posting my modularized efforts towards
such and encompassing modern parallel FFT, as
I believe it could benefit all, no matter the application.

Please be kind.... :icon_rolleyes:

To truely make progress we need the following:
1) A timing framework that is accurate, or at least we agree on. :bgrin:
2) A test dataset that is generic through all test cases, sufficiently large to be seen as "real world".
3) An exchange of ideas for discussion - ex. bit-reversal grey code looks promising, packed instructions etc.
4a) A dynamically intelligent CPUID routine that allocates best possible thread count per hardware platform detected.
4b) A cache savvy coding effort that aligns (pun), takes heed of best practise.



Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

I don't doubt for a second I'm reinventing a wheel here, but it's a lot easier than figuring out someone else's wheel! But a problem in my great scheme has already come up: cache! Doing it the way I describe is (I think) right from a SIMD point of view, but having problem thrashing the cache (at 1024). Oh well, maybe one of you guys will figure a way out, when I post my efforts ... or else I can wait a decade until we get bigger caches :biggrin:

Meanwhile I'll be happy to help on your other points but am 2 busy at the moment trying to get it to work ... u know if we decided a fast FFT that gave wrong answers was acceptable, that would make it a lot easier

BTW what does "grey" code signify?
I am NaN ;)

Raistlin

Developing Multithreaded Applications: A Platform Consistent Approach

Direct copy and paste text document follows relevant to our discussion.
Copyright Intel Corp. 2005 <- yes I know it's old but still is legit.
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

Raistlin

@rrr
radix 2 - grey code - see : www.katjaas.nl/bitreversal/bitreversal.html
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

Siekmanski

Hi Raistlin,

Have you seen my bitreversal routine ?
It precalculates all bitreversals for a given FFT size, puts all needed memory swaps in a table.
For a FFT size of 1024 you end up with only 528 memory swaps.
No need to do the bitreversal each time you calculate the FFT.
Creative coders use backward thinking techniques as a strategy.

Raistlin

@Siekmanski

Yes - pseudo-code or link please to the algo, if you have it please?
Yes - I knew it was going to be static per FFT.

see also ex. http://graphics.stanford.edu/~seander/bithacks.html#BitReverseTable  for some ideas.
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

rrr314159

raistlin, I see you've been working the problem! That's great.

re bit-reversal, the problem with the "gray code" and similar algos is, as siekmanski indicates, we only need one bit-reversal table per run; so no point in speeding it up.

Quote from: your bit-reversal refWhile optimizing the bitreversal, memory access becomes more and more the factor determining processing time. After all, these samples have to be swapped no matter how.

Very true! siekmanski makes an array of necesssary swaps - just once - then uses it again and again. The swaps can't be SIMD-ized, and they don't overlap in any way; there's not much u can do but, just go ahead and swap 'em. (Perhaps u could use gather/scatter, but we don't have AVX2). So I didn't really study how well the "gray code" optimizes computing bit-reversals, I suppose it must do something or they wouldn't go to all that trouble.

siekmanski, two things not quite right about your array, first, you're including unnecessary "null" swaps. In fact you only need ... well, rather less than 512 (my algo which I'll post later has the minimum, don't remember exactly). Also probably best to arrange them in ascending order to help with cache, altho might not make any difference.

The paper "advice on multi-threading" is good, more or less correct, and some of it applies to this problem.

We have (as I understand it) a whole lot of 1024-FFTs and they make perfect tasks, with excellent granularity, evenly balanced, so that's the end of that. I'm using dynamic assignment altho at first glance u'd think it not necessary, for various good reasons.

API spin-waiting or blocking, and thread pools, are extremely unnecessary here. We have about 4 processors versus thousands of jobs! And of course the OS handles other details; this paper is written by an OS writer-type person, we're not involved in that. We don't care about flush order because we're using in-place memory FFT.

I don't use Windows API synchronization, altho siekmanski does; this is assembler! Why should I go thru 1000 lines of Windows code to execute an "xchg" when I can do it directly?

This is a good write-up - excellent, even. You should read more from this source, guy knows what he's talking about. But if you want to know more about the *much* simpler job of multi-threading FFT's ...

Just noticed you posted a ref for more bit-reversing. Like that algo for reversing a byte, pretty cool, gotta admit doubt I'd think of it. Unfortunately we've got 10 bits. Reading along I see the best they've got for that appears to be the way we're already doing it. Looks like interesting work, with a lot of algos. But you know, for any one particular algo, it's an awful lot easier to figure it out myself. This is not rocket science ... except the actual FFT which is not simple at all. Would like to see a ref for that: SIMD - aware

[edit] another thing would be very useful, test cases for 1024 (or other) Complex (or real) FFTs, with the frequency domain numbers

[edit] actual pulsar data
I am NaN ;)

dedndave


Siekmanski

Hi Raistlin,

Pseudo-code of the bit reversal routine,

allocate 2 memory buffers,

fftdata = FFT size (real,imaginary)
bitreversalindextable = FFT size * 2 ; just to be sure allocate enough memory.

; create the table

     nn = FFT size
     n = nn + nn
     totalswaps = 0
     j=1
     for (i=1; i<n; i+=2)
     {
if (j>i)
         {
             bitreversalindextable[i-1]
             bitreversalindextable + 1 [j-1]
             bitreversalindextable = bitreversalindextable + 2
             totalswaps = totalswaps + 1
         }
         k = nn
         while (k>=2 && j>k)
         {
             j -= k
             k >>= 1
         }
         j += k
      }
; end

; each FFT calculation read the swaps from the bitreversalindextable

      i=0
      for (i=0; i=totalswaps; i+=2)
      {
           j = bitreversalindextable[i]
           k = bitreversalindextable[i+1]
           swap(fftdata[j],fftdata[k]) ; swap complex pairs
      }



Hi rrr314159,

Quotesiekmanski, two things not quite right about your array, first, you're including unnecessary "null" swaps. In fact you only need ... well, rather less than 512 (my algo which I'll post later has the minimum, don't remember exactly).

OK, i'll  check my code.
Creative coders use backward thinking techniques as a strategy.