Multithreading and the FPU

Biterider · November 09, 2024, 06:22:42 AM

Hi
I've been wondering about a technology question.
Most current CPUs have a dedicated FPU per physical core, which means we can theoretically run each FPU from different threads in parallel and speed up floating point calculations significantly.
There are still some limits, like memory access, caching, bus bandwidth, etc., but there should still be a noticeable effect.
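
A minimal C++ sketch of the idea, not taken from the thread: the function name `parallel_sqrt_sum` and the square-root workload are illustrative assumptions. Each thread sums its own contiguous chunk into a private accumulator, and the partial sums are combined once all threads have been joined.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical workload: sum of sqrt(x) over the data, split into one
// contiguous chunk per thread so each thread works on its own range.
double parallel_sqrt_sum(const std::vector<double>& data, unsigned thread_count) {
    thread_count = std::max(1u, thread_count);
    std::vector<double> partial(thread_count, 0.0);
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + thread_count - 1) / thread_count;

    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        threads.emplace_back([&data, &partial, t, begin, end] {
            double sum = 0.0;  // thread-local accumulator
            for (std::size_t i = begin; i < end; ++i) sum += std::sqrt(data[i]);
            partial[t] = sum;  // a single write to shared memory per thread
        });
    }
    for (auto& th : threads) th.join();  // wait for every chunk to finish

    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

Whether this beats a plain sequential loop depends on how much work each chunk contains; for small inputs the cost of starting and joining the threads can easily dominate.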

I've never seen any publication, article or post doing something like that. Does anyone know anything about it, or has anyone looked into it?
I can do my own research, but I thought I'd check it first, just to make sure I'm on the right track.

Biterider

NoCforMe · November 09, 2024, 07:14:33 AM

Hmm; I can't say for certain, but wouldn't the burden of proof here be on showing that one wouldn't be able to use an FPU in a different thread? If there are multiple FPUs, then it seems reasonable to assume that each one should be able to be used in a separate thread. Why would you be forced to use all of them in a single thread?

I could be wrong about that ...

HSE · November 10, 2024, 12:16:41 AM

Hi Biterider,

Quote from: Biterider on November 09, 2024, 06:22:42 AM
"There are still some limits, like memory access, caching, bus bandwidth, etc., but there should still be a noticeable effect."

Yes, but you need enough cores, because there are overheads.

I played with this a while ago on a 4-core machine from UEFI, but it turned out that only 2 cores were really available, so there was no benefit: one core controlled the process and the other ran the threads.

Anyway, you must transform the problem, and some problems may not be well suited to that.

Quote from: Biterider on November 09, 2024, 06:22:42 AM
"I've never seen any publication, article or post"

Historically, distribution came first, because threads were going to run on different machines. So work specific to the multicore FPUs of a single processor is probably not of much interest for big problems.

You can ask Gunther, who works on the CERN computer farm with 30000 cores; that is interesting.

HSE

Biterider · November 10, 2024, 02:56:49 AM

Hi HSE

Quote from: HSE on November 10, 2024, 12:16:41 AM
"... on the CERN computer farm with 30000 cores; that is interesting"

My goals are much more modest than that.
I thought I would give it a run with a 3D animation application that moves some objects around sequentially using exclusively the CPU and FPU, like in the old days.
Calculating the movements, the scene and the camera is really FPU intensive.
I created 2 threads, one for each movement (2 objects moving), but the result was disappointing.

The new version was slightly slower than the sequential one.

My guess is that the cost of creating, destroying and synchronising the threads outweighs the speed gain.
I can see a ThreadPool being the way to go, but I need to code it first.
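
A thread pool avoids the per-frame creation and destruction cost by keeping the workers alive and handing them tasks. The following is only a sketch in portable C++ under stated assumptions: `WorkerPool`, `submit`, `wait_idle` and `run_frames` are hypothetical names, and this is not the Windows ThreadPool API.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(unsigned count) {
        for (unsigned i = 0; i < count; ++i)
            workers_.emplace_back([this] { run(); });  // threads are created once
    }
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_.notify_one();
    }
    // Blocks until every submitted task has finished (the per-frame barrier).
    void wait_idle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so workers do not serialise
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (--pending_ == 0) idle_.notify_all();
            }
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_, idle_;
    std::queue<std::function<void()>> tasks_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Hypothetical usage: two objects advanced in parallel, frame by frame.
double run_frames(int frames) {
    WorkerPool pool(2);
    double x = 0.0, y = 0.0;
    for (int f = 0; f < frames; ++f) {
        pool.submit([&x] { x += 0.5; });
        pool.submit([&y] { y += 0.25; });
        pool.wait_idle();  // both movements are done before the next frame
    }
    return x + y;
}
```

The synchronisation cost per frame is still there (one wake-up per task and one wait), so the work per task has to be large enough to pay for it.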

Biterider

HSE · November 10, 2024, 04:48:30 AM

Quote from: Biterider on November 10, 2024, 02:56:49 AM
"My goals are much more modest than that."

How many cores do you have?

Quote from: Biterider on November 10, 2024, 02:56:49 AM
"My guess is that the creation, destruction and synchronisation outweighs the speed gain."

I think it can also happen that the OS runs the threads on the same core, so you end up with the same code plus context switching.

It is not so easy to know what the OS is doing.

Quote from: Biterider on November 10, 2024, 02:56:49 AM
"I can see a ThreadPool being the way to go, but I need to code it first."

NoCforMe · November 10, 2024, 05:28:08 AM

Found this on Raymond Chen's blog. Dunno if it's of any help here. There may be more there, but this was all I could find in an initial search.

Biterider · November 10, 2024, 08:37:30 PM

@HSE

Quote from: HSE on November 10, 2024, 04:48:30 AM
"How many cores do you have?"

At home, where I do my programming stuff, I have a 10 year old i7-4770K (Haswell).
It has 4 physical cores and 8 logical cores (https://www.intel.com/.../specifications.html)
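
For sizing a worker pool, the processor count can be queried at run time. A small sketch in C++; `suggested_worker_count` is a hypothetical helper name. Note that `std::thread::hardware_concurrency()` typically reports logical processors, and two logical processors on the same physical core share that core's floating-point execution units, so for FPU-bound work the physical core count may be the more realistic upper bound.

```cpp
#include <thread>

// Number of worker threads to start: one per reported processor, with a
// fallback because hardware_concurrency() may return 0 when the value
// cannot be determined.
unsigned suggested_worker_count() {
    const unsigned reported = std::thread::hardware_concurrency();
    return reported == 0 ? 1u : reported;
}
```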

Quote from: HSE on November 10, 2024, 04:48:30 AM
"Not so easy to know what OS is doing"

Absolutely.

@NoCforMe

Quote from: NoCforMe on November 10, 2024, 05:28:08 AM
"Found this on Raymond Chen's blog."

Not directly related, but still interesting reading

Biterider

daydreamer · November 10, 2024, 09:02:25 PM

I tried some SIMT. For 3D transforms, the SSE instruction set came first as a speed-up, and after that graphics cards offered hardware-accelerated 3D transforms, but I went back to SIMD.
The simplest SIMT, applied to huge Fibonacci numbers, worked for me:
The main thread takes care of printing the Fibonacci numbers.
The worker thread calculates the Fibonacci numbers and sometimes copies a number to another memory area, which the main thread prints from.
This takes the slow printing, which costs milliseconds, out of the addition loop and moves it to the main thread, while the adding itself takes only clock cycles.
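
The split described above can be sketched in C++ roughly as follows. This is an illustration only: `fibonacci_with_printer` and `stride` are assumed names, and 64-bit integers stand in for the huge numbers mentioned in the post, which would need a big-number representation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

// Computes F(last_index) on a worker thread while the calling thread
// prints whichever snapshot the worker published most recently.
std::uint64_t fibonacci_with_printer(int last_index, int stride) {
    assert(stride > 0 && last_index <= 93);  // F(93) is the largest 64-bit value
    std::mutex mutex;
    int shared_index = 0;            // guarded by mutex
    std::uint64_t shared_value = 0;  // guarded by mutex
    std::atomic<bool> done{false};
    std::uint64_t result = 0;

    std::thread worker([&] {
        std::uint64_t a = 0, b = 1;  // a == F(0), b == F(1)
        for (int i = 1; i <= last_index; ++i) {
            const std::uint64_t next = a + b;
            a = b;
            b = next;                // now a == F(i)
            if (i % stride == 0) {   // occasionally copy to the shared area
                std::lock_guard<std::mutex> lock(mutex);
                shared_index = i;
                shared_value = a;
            }
        }
        result = a;
        done.store(true);
    });

    int printed_index = 0;
    while (!done.load()) {
        int index;
        std::uint64_t value;
        {
            std::lock_guard<std::mutex> lock(mutex);
            index = shared_index;
            value = shared_value;
        }
        if (index != printed_index) {  // slow I/O stays off the worker
            std::cout << "F(" << index << ") = " << value << '\n';
            printed_index = index;
        }
        std::this_thread::yield();
    }
    worker.join();  // makes the worker's write to result visible here
    return result;
}
```

The printing thread may skip snapshots when the worker is faster than the output, which is the point of the design: the addition loop never waits for I/O.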

Biterider · November 17, 2024, 06:27:43 AM

Hi all
I have spent some time researching this topic and have come up with some results that I would like to share.

My first step was to check what the current state of the art is regarding multithreading and synchronisation.
I saw that some new things have been implemented recently, especially in the multimedia area for Win10 and Win11, in the kernel and userland. The idea was to try them out and compare them with the classic techniques.

First I tried the ThreadPool object (Win7, reworked version for Win10) and started to translate an MS example from the official help page and found it a "bit" complicated to set up the environment and all required callbacks.
In the end, I discovered that it did not quite do what I needed when it came to synchronising the threads when they were finished doing the calculations. This required an additional event object to be signalled to allow the rest of the application to continue.
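
The "additional event object" pattern can be illustrated in portable C++ with a countdown that the last worker signals. This is a sketch under assumptions, not the Windows API: `CompletionEvent`, `arrive`, `wait` and `run_workers_and_wait` are hypothetical names, and the squared index stands in for a real calculation.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Signalled once the expected number of workers has reported completion.
class CompletionEvent {
public:
    explicit CompletionEvent(int expected) : remaining_(expected) {}
    void arrive() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--remaining_ == 0) signalled_.notify_all();  // last worker signals
    }
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        signalled_.wait(lock, [this] { return remaining_ <= 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable signalled_;
    int remaining_;
};

// Starts the workers, waits for the event, then combines the results.
int run_workers_and_wait(int worker_count) {
    CompletionEvent all_done(worker_count);
    std::vector<int> results(worker_count, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < worker_count; ++i) {
        workers.emplace_back([&results, &all_done, i] {
            results[i] = i * i;  // placeholder for the real calculation
            all_done.arrive();
        });
    }
    all_done.wait();  // the rest of the application continues from here
    int sum = 0;
    for (int r : results) sum += r;
    for (auto& w : workers) w.join();
    return sum;
}
```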

At this stage of my experimentation I was using a timer (WM_TIMER) as the heartbeat of the application.
Interestingly, I was not able to get above ~30 fps no matter what I did. That was when I ditched the timer and let the application free-run, still synchronised but triggering one frame directly after the other. Depending on the test machine, I achieved frame rates of up to 300 fps!
Only after switching back to serial computation did I start to see the difference from parallel execution: in that case the maximum frame rate was around 250 fps. Not what I expected, but still a noticeable difference.
In practice the regular timer seems to bottom out at a period of ~30 ms, which roughly matches the initial performance of 30 fps.

The next logical step was to see if Windows had something better. I decided to use the WaitableTimer. This sync object is much easier to handle, and on all the machines I tested (all Win10) it reached a minimum, stable period of ~15 ms. The framerate I achieved was ~64 fps, which is on the order of what my monitor is capable of.
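
The relation between timer period and frame rate is simply fps = 1000 / period in ms. A tiny C++ sketch (`fps_from_period_ms` is an assumed helper name): a period of 15.625 ms gives exactly 64 fps, which would fit the ~15 ms and ~64 fps reported above if the timer is following a 64 Hz system tick. That link is an assumption, not something measured in the post.

```cpp
// Frame rate, in frames per second, implied by a timer period in milliseconds.
constexpr double fps_from_period_ms(double period_ms) {
    return 1000.0 / period_ms;
}
```

By the same formula a 5 ms period corresponds to 200 fps, and a period of about 33 ms to about 30 fps.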

I'm fully aware that there are better gaming monitors on the market capable of much higher framerates, but for the moment I'm happy to reach this limit, knowing that I have plenty of reserve for more intensive FPU calculations.

Regards, Biterider

sinsi · November 17, 2024, 08:10:54 AM

Have you looked at timeBeginPeriod to up the timer resolution?

daydreamer · November 17, 2024, 07:33:36 PM

First of all, starting threads costs milliseconds; compare, for example, 4 SIMT threads with 4-wide packed SIMD, which starts immediately.
All synchronisation API calls cost milliseconds.
The API calls that stop threads cost time as well.

Putting the code directly in a timer that does all the calculations every 1/fps takes the least system resources, I find; use a high-performance timer if WM_TIMER is too slow.
I think you should use the strength SIMT has over SIMD, which is several threads each taking care of different things.
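
A fixed-rate loop that does all calculations once per 1/fps can be sketched portably in C++ with `std::chrono::steady_clock`. The names `run_paced` and `frame` are assumptions, and this is neither WM_TIMER nor a Windows high-resolution timer; the achievable precision depends on the OS scheduler.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Calls frame(i) for i in [0, count), aiming for one call per period.
// Returns the total elapsed time.
std::chrono::steady_clock::duration run_paced(
        int count,
        std::chrono::milliseconds period,
        const std::function<void(int)>& frame) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto next = start;
    for (int i = 0; i < count; ++i) {
        frame(i);                             // all per-frame calculations
        next += period;                       // absolute deadline, so delays do not accumulate
        std::this_thread::sleep_until(next);  // returns at once if the deadline has passed
    }
    return clock::now() - start;
}
```

Stepping an absolute deadline, rather than sleeping for a fixed interval after each frame, keeps the average rate at 1/period even when an individual frame wakes up late.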

Biterider · November 17, 2024, 09:03:00 PM

Hi Sinsi
That was a good hint. timeBeginPeriod brings the timer period down to ~5 ms, almost 200 fps.

I had not tried it earlier because of this note (multimedia-timer-functions):

Quote: Multimedia Timer Functions (Article, 06/20/2023)
[The feature associated with this page, Multimedia Timers, is a legacy feature. It has been superseded by Multimedia Class Scheduler Service. Multimedia Class Scheduler Service has been optimized for Windows 10 and Windows 11. Microsoft strongly recommends that new code use Multimedia Class Scheduler Service instead of Multimedia Timers, when possible. Microsoft suggests that existing code that uses the legacy APIs be rewritten to use the new APIs if possible.]

Trying to use the "Multimedia Class Scheduler Service" was not so easy, so I put it aside.
Maybe I have to try harder

Biterider

Biterider · November 17, 2024, 09:28:49 PM

Hi DayDreamer
I'm not aware of any SIMT implementation on an x86 platform, only on GPUs, which is not exactly what I want to achieve with this.
It is also not clear to me how this could work here, since it must inevitably come to a synchronisation when the worker threads have done their work to then calculate the subsequent scene and camera.

SIMD is another story and can be used to accelerate the workers.

Regards, Biterider

daydreamer · November 17, 2024, 11:45:03 PM

Biterider, check the masm32 multithread example.
There is something like an invoke of waitformultipleevent that waits until all worker threads have stopped.
Might be what you are looking for

Biterider · November 18, 2024, 02:51:15 AM

Thanks DayDreamer

You're probably referring to "\MASM32\examples\exampl10\threads\mprocasm\mprocasm.asm".
It uses the WaitForMultipleObjects API, which is one of the preferred ways of synchronising different events.

Biterider
