
Hyperthreading - Revisited

Started by aw27, September 22, 2018, 12:51:21 AM


aw27

This is a follow-up to this message, where I mentioned that the detection method I was using failed with AMD Ryzen processors and that I would try another approach.

This new approach is API-based, so it should work for both Intel and AMD, because the OS handles the nitty-gritty details on our behalf.

The results do not explicitly mention Hyperthreading because this is an Intel-specific term and there are other possibilities. However, when the number of logical processors exceeds the number of processor cores, you may in general consider the reason to be Hyperthreading.


Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Vendor: GenuineIntel
Family: 6
Number of NUMA nodes: 1
Number of Physical Processor Packages: 1
Number of Processor Cores: 6
Number of Logical Processors: 12

daydreamer

When hyperthreading was new, I read advice to turn off hyperthreading for best performance. It would be nice to make a test for this, to see whether it's true, false, or true in some code and false in other code.
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

aw27

Quote from: daydreamer on September 22, 2018, 01:48:04 AM
When hyperthreading was new, I read advice to turn off hyperthreading for best performance. It would be nice to make a test for this, to see whether it's true, false, or true in some code and false in other code.
I believe most games and applications scale well with a thread pool up to the number of logical cores. Personally, I have doubts that switching off hyperthreading will improve performance, but some people may know a lot more about this than I do.

hutch--

From memory, it used to be a limitation at the operating-system level: if you had a multi-threading processor back then, the OS could not properly use the capacity. With QE I use a similar approach to Jose's output, taken directly from the operating system, as it's easy enough to handle and reliable.

Funnily enough, I am writing some CPUID code at the moment, mainly so it can find the instruction-set family that can be used with each processor, so you can dial in code to a processor version without it crashing on older hardware. It handles up to AVX OK, but I have to decipher Intel's current manual for AVX2.

Siekmanski

Quote from: AW on September 22, 2018, 03:04:36 AM
Quote from: daydreamer on September 22, 2018, 01:48:04 AM
When hyperthreading was new, I read advice to turn off hyperthreading for best performance. It would be nice to make a test for this, to see whether it's true, false, or true in some code and false in other code.
I believe most games and applications scale well with a thread pool up to the number of logical cores. Personally, I have doubts that switching off hyperthreading will improve performance, but some people may know a lot more about this than I do.

You can try "semi" turning off hyperthreading by using SetThreadAffinityMask to put your threads only on even or odd logical processors.
Creative coders use backward thinking techniques as a strategy.

johnsa


From my testing, I found that the gain from hyper-threading really depends on A) your workload and B) how well optimised your code is.

The basic idea is that execution units are shared, although they're exposed as 2 separate logical processors.

If you are under normal user conditions, running several different applications simultaneously, the HT'ed core can yield up to 60% of a real core in optimal conditions.

If you're writing your own multi-threaded code, for example breaking a job into N threads where all threads are mapped to separate logical cores and are basically executing the same code, it comes down to instruction-level pairing and how many execution units you can max out. If, for example, your code is purely SIMD, or purely general-purpose, and you've reached a point where all the execution units are constantly saturated, then HT can actually degrade performance. One way to help HT, if you're keeping it on during intense multi-threaded loops, is to balance the code between different execution units where possible, and to rearrange it so that any stalls are grouped; that way core A can focus on part 1 and then stall while core B executes the other independent portion.

In 90% of cases I've not found having HT on to cause me any problems, and I usually get about 30-40% extra from having the HT cores included.
For example, I get N from using 4 real cores, then another 0.3*N from the remaining 4 HT cores.

Raistlin

Agreed, many criteria are at play. However, the real concern is that the performance of an application should be predictable. The near-certain way of achieving this is via application profiling using the major bottleneck attributes. But these actualized optimizations are never a guarantee of performance extents; thus runtime profiling is needed.
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

hutch--

Over time I have found that with the fundamental design of multi-core processing you get about an 80% improvement per core, as task switching is a component of the overhead. What can mess this up rather badly is the ratio of task switching to task length: if the task length is in milliseconds, the task switching will kill it, whereas if a thread is started and allowed to finish without interference, and the task is of reasonable length, the speed gain can be very good.

With Intel hyperthreading the gain is not all that big; it's better than nothing, but your best performance comes from matching the thread count to the core count in intensive processing. With less intensive processing, something like multiple internet connections, a higher thread count at times really improves the throughput, but that is only because the individual thread throughput is low intensity.

Raistlin

Good take on the general frame of reference, hutch. Surely, again, matching the application's features to the wanted performance is key, and usually pre-testable. But what about that other eventuality: the customer/user system that doesn't conform to the optimistic expectations? And then finding out that the user platform is the norm, and your testing platform the exception.  :(
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

hutch--

 :biggrin:

If there was an easy test platform, someone would sell it. What you are always stuck with is how the basic multi-threading system works and how it uses multiple cores on different hardware. If you start with an early Core2 Duo (2 cores and no hyperthreading), you limit the threads to 2; with a Core2 Quad you use 4. Once you get to the "i" series of hardware, you have the core count plus the matching hyperthreading count (4 cores = 8 threads).

Depending on the task, you run 4 threads for highly intensive work or 8 threads for lower-intensity tasks. If it's a very low-intensity task, you can use a higher thread count, as much of the absolute processing power is being wasted per thread. Long ago I downloaded a download speed-up app that let you select the number of threads it would use. Because internet data is, in most instances, dribbled in at a rate far lower than the core will process, much of that thread's processing power is wasted. Crude though it may be, you just keep upping the thread count until it does not get any faster.

johnsa


For my implementation of multi-core support I basically took the following approach:

1) Create a thread-pool of size N (N == logical core count).
2) The threads themselves run a custom user-space scheduler, so that when I need a thread I don't have the overhead of creating one; I just assign a task and it wakes up and does it.
    This has had huge performance gains for me, especially when you want to quickly fire off a bunch of parallel tasks at a high rate (i.e. not long-running threads).
3) The threads are assigned to every even processor first, then every odd, so that if you have 4 real cores + 4 HT cores and you request 4 parallel tasks, they will be allocated to real cores; the HT ones only come into play when the number of tasks exceeds the number of physical cores.
4) The locking and synchronization setup between these threads and the custom scheduler is implemented primarily with a user-mode spinlock and atomic operations, as I've found these to be considerably faster than using the kernel ones.

One way to use this setup dynamically, for example during a rendering loop, is to run some tasks with say 4 threads, then every 5 iterations up the thread count until the frame time in ms is no longer dropping.
That way you know you've reached the limit for the particular machine, given RAM, cores, latency and whatever else may be at play. It's basically run-time profiling to determine the optimal thread count.


hutch--

Quote
4) The locking and synchronization setup between these threads and the custom scheduler is implemented primarily with a user-mode spinlock and atomic operations, as I've found these to be considerably faster than using the kernel ones.
I agree with this approach; the Intel instruction is also very slow, whereas a simple spinlock may cost locking a core for some picoseconds, but it's close enough to instant when it unlocks.

daydreamer

Wouldn't it be best to try with and without HT in a CG application that makes use of either 4 physical cores or 2 physical cores + 2 HT?
But if you load it with gigabytes of photorealistic textures plus very high-poly 3D models, it's memory speed and cache performance, and maybe luck whether all threads are working with the same texture part at the same time or with 4 different tiles with different textures.
Maybe knowledge about the low-level hardware could help me pick the perfect tile/bucket size in a 3D renderer.
Hutch, the percentage performance of an extra core drops when using a memory-intensive 3D scene, but if you only use procedural generation for most things it probably works fast.

If you want to code your own procedural generation, would it maybe be best to perform randomizing in one thread and cosine in another thread, or when you need many registers without needing to turn to masm64?