TSX/HLE

Started by aw27, May 15, 2018, 12:50:29 AM

aw27

This is a TSX/HLE demo. It can run on CPUs without HLE support, because HLE falls back to legacy (non-elided) locking when there is no hardware support. Of course, it can also run on CPUs with HLE hardware support or in the Intel Emulator.
The theory behind this is not easy at all, so I will not even attempt to explain it. Yes, I have lots to learn too.
This is also the first full-blown TSX demo in ASM I have seen so far.
Have a look at all the comments in the source code.
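
For readers who want to reproduce the "HLE supported" / "RTM supported" lines outside the demo, here is a minimal C++ sketch. It is not taken from the demo source. It assumes GCC or Clang on x86 (the `<cpuid.h>` helper `__get_cpuid_count`), and the names `TsxSupport`, `decode_tsx_bits`, `query_tsx_support` and `print_tsx_support` are made up for illustration. The bit positions are the documented ones: CPUID leaf 7, sub-leaf 0, EBX bit 4 for HLE and bit 11 for RTM.

```cpp
#include <cstdint>
#include <cstdio>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang helper; MSVC would use __cpuidex from <intrin.h>
#endif

// Hypothetical result type for this sketch.
struct TsxSupport {
    bool hle;
    bool rtm;
};

// Decode the EBX value returned by CPUID leaf 7, sub-leaf 0.
// Bit 4 reports HLE, bit 11 reports RTM.
constexpr TsxSupport decode_tsx_bits(std::uint32_t ebx) {
    return TsxSupport{(ebx & (1u << 4)) != 0, (ebx & (1u << 11)) != 0};
}

// Query the running CPU. On non-x86 builds, or when leaf 7 is not
// available, both features are reported as unsupported.
TsxSupport query_tsx_support() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return decode_tsx_bits(ebx);
    }
#endif
    return TsxSupport{false, false};
}

// Print the two status lines in the same wording as the demo output.
void print_tsx_support() {
    const TsxSupport s = query_tsx_support();
    std::printf("HLE %ssupported\n", s.hle ? "" : "not ");
    std::printf("RTM %ssupported\n", s.rtm ? "" : "not ");
}
```

Calling `print_tsx_support()` at program start gives the two status lines. Because the HLE prefixes are ignored on older hardware, the check is informational for HLE, whereas RTM instructions must not be executed at all unless the RTM bit is set.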

Results with the Intel Emulator

HLE supported
RTM supported
Thread: 0 amount: 100
Thread: 3 amount: 250
Thread: 1 amount: 150
Thread: 2 amount: 200
Thread: 4 amount: 300
Thread: 5 amount: 350
Thread: 6 amount: 400
Thread: 8 amount: 500
Thread: 9 amount: 550
Thread: 27 amount: 1450
Thread: 11 amount: 650

......

Thread: 2786 amount: 139400
Thread: 2032 amount: 101700
Thread: 2793 amount: 139750
Thread: 2802 amount: 140200
Thread: 2981 amount: 149150
Thread: 2983 amount: 149250
Thread: 2650 amount: 132600
Thread: 2995 amount: 149850
Thread: 2870 amount: 143600
Thread: 2886 amount: 144400
All 3000 threads terminated. Transactionals: 2886 Sums: 225225000
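
The reported sum can be sanity-checked by hand. Judging only from the printed lines, thread i appears to handle the amount 100 + 50*i; that formula is inferred from the output above, not read from the source. Under that assumption, a small C++ sketch (function names are hypothetical) confirms the total for 3000 threads:

```cpp
#include <cstdint>

// Amount handled by thread i, as inferred from the printed output
// (an assumption, not taken from the demo source).
constexpr std::int64_t amount_for_thread(std::int64_t i) {
    return 100 + 50 * i;
}

// Closed form of the sum of (100 + 50*i) for i = 0 .. n-1.
constexpr std::int64_t expected_sum(std::int64_t n) {
    return 100 * n + 50 * (n * (n - 1) / 2);
}

// Both checks are evaluated at compile time.
static_assert(amount_for_thread(2886) == 144400,
              "matches the line 'Thread: 2886 amount: 144400'");
static_assert(expected_sum(3000) == 225225000,
              "matches the reported 'Sums: 225225000'");
```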

Siekmanski

Creative coders use backward thinking techniques as a strategy.

aw27

Quote from: Siekmanski on May 15, 2018, 05:27:41 AM
Cool example.
I will review it tomorrow because I forgot that printf is not thread safe and results appear garbled sometimes.  :redface:

zedd151

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   1.60 GHz

HLE not supported
RTM not supported

:(  I feel left out


HLE not supported
RTM not supported
Thread: 0 amount: 100
Thread: 1 amount: 150
Thread: 2 amount: 200
Thread: 3 amount: 250
Thread: 4 amount: 300
Thread: 5 amount: 350
Thread: 6 amount: 400
Thread: 7 amount: 450
Thread: 8 amount: 500
Thread: 9 amount: 550
Thread: 10 amount: 600
Thread: 11 amount: 650
Thread: 12 amount: 700
Thread: 13 amount: 750
Thread: 14 amount: 800
Thread: 15 amount: 850
Thread: 16 amount: 900
Thread: 17 amount: 950
Thread: 18 amount: 1000
Thread: 19 amount: 1050
Thread: 20 amount: 1100
Thread: 21 amount: 1150
Thread: 22 amount: 1200
Thread: 23 amount: 1250
Thread: 24 amount: 1300
Thread: 25 amount: 1350
Thread: 26 amount: 1400
Thread: 28 amount: 1500
Thread: 29 amount: 1550
Thread: 27 amount: 1450
Thread: 30 amount: 1600
Thread: 31 amount: 1650
Thread: 915 amount: 45850     <-------------- noticed some strange anomalies
Thread: 33 amount: 1750
Thread: 917 amount: 45950     <-------------- noticed some strange anomalies
Thread: 35 amount: 1850
Thread: 36 amount: 1900




I noticed some strange anomalies, indicated above. There were numerous others, but the full text was too large to post.

After I posted, I noticed:

Quote from: aw27 on May 15, 2018, 08:12:25 AM

..... results appear garbled sometimes.  :redface:

which probably explains a little bit.

P.S. This is the very first code I have (re)assembled with ml64.    :bgrin:
I have to learn the new quirks of assembling in 64-bit.  :shock:

hutch--

This is the result on this Haswell I am using.

HLE not supported
RTM not supported
Thread: 1 amount: 150
Thread: 8 amount: 500
Thread: 13 amount: 750
Thread: 2 amount: 200
... Many more threads ...

Raistlin

@hutch: How strange - a Haswell is supposed to have HLE at least (2014 release)   :icon_eek:

@aw27 : This is not how I expected HLE to work - maybe I'm looking at it wrong (I don't see the opcodes) - seeing as you do state it's TSX  :P

DESCRIPTION:
https://www.felixcloutier.com/x86/XACQUIRE:XRELEASE.html
http://mcg.cs.tau.ac.il/papers/amir-levy-msc.pdf

FROM: https://brooker.co.za/blog/2013/12/14/intel-hle.html
Quote
HLE is based on two new instruction prefixes (rather than new instructions): XACQUIRE (F2) and XRELEASE (F3).
You basically put the XACQUIRE prefix on the instruction that starts your critical section, and XRELEASE on the instruction that ends it.

FROM: https://software.intel.com/en-us/blogs/2013/07/25/fun-with-intel-transactional-synchronization-extensions
#define PREFIX_XACQUIRE ".byte 0xF2; "
#define PREFIX_XRELEASE ".byte 0xF3; "

class mutex_elided {
  uint8_t flag;

  inline bool try_lock_elided() {
    uint8_t value = 1;
    __asm__ volatile (PREFIX_XACQUIRE "lock; xchgl %0, %1"
            : "=r"(value),"=m"(flag):"0"(value),"m"(flag):"memory" );
    return uint8_t(value^1);
  }
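
Since the quoted snippet stops after the acquire side, here is a self-contained sketch of a complete lock/unlock pair built on the same prefix idea. It is an illustration only, not the Intel blog's class and not aw27's demo. It assumes GCC or Clang inline assembly on x86 and uses a 32-bit flag so that `xchgl`/`movl` match the operand size; on other compilers or architectures it falls back to a plain `std::atomic` spin lock without elision. `ElidedSpinLock` and `count_with_elided_lock` are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_HLE_ASM 1
#else
#define HAVE_HLE_ASM 0
#endif

// Spin lock whose acquire/release instructions carry the HLE prefixes.
// CPUs without HLE ignore the prefixes and run it as an ordinary spin lock.
class ElidedSpinLock {
#if HAVE_HLE_ASM
    volatile std::uint32_t flag_ = 0;      // 0 = free, 1 = held
#else
    std::atomic<std::uint32_t> flag_{0};   // portable fallback, no elision
#endif

public:
    void lock() {
#if HAVE_HLE_ASM
        for (;;) {
            std::uint32_t value = 1;
            // F2 = XACQUIRE prefix in front of XCHG (XCHG with memory is
            // always locked, so no explicit LOCK prefix is needed).
            __asm__ volatile(".byte 0xF2; xchgl %0, %1"
                             : "+r"(value), "+m"(flag_)
                             :
                             : "memory");
            if (value == 0) return;  // flag was free, lock is ours
            // Wait with plain reads until the holder releases the flag.
            while (flag_ != 0) __asm__ volatile("pause" ::: "memory");
        }
#else
        while (flag_.exchange(1, std::memory_order_acquire) != 0)
            std::this_thread::yield();
#endif
    }

    void unlock() {
#if HAVE_HLE_ASM
        // F3 = XRELEASE prefix in front of a MOV that restores the flag
        // to its value before the acquire.
        __asm__ volatile(".byte 0xF3; movl $0, %0"
                         : "+m"(flag_)
                         :
                         : "memory");
#else
        flag_.store(0, std::memory_order_release);
#endif
    }
};

// Small stress helper: every thread increments a shared counter under the lock.
long count_with_elided_lock(int threads, int iterations) {
    ElidedSpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

For example, `count_with_elided_lock(4, 20000)` should return 80000 whether or not the CPU actually elides the lock, which is the point of the legacy fallback.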




Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

hutch--

Bios date is 2016 but they may have crippled it.

aw27

Quote
Bios date is 2016 but they may have crippled it.
My current understanding is that you have to look at the ark.intel.com page for the CPU and check under Advanced Technologies whether Intel® TSX-NI is supported. For example:
https://ark.intel.com/products/126686/Intel-Core-i7-8700-Processor-12M-Cache-up-to-4_60-GHz

Quote
This is not how I expected HLE to work - maybe I'm looking at it wrong (I don't see the opcodes) - seeing as you do state it's TSX
But the XACQUIRE and XRELEASE are there! I swear!  :biggrin:
(hint: Look in the Include file.)
All is according to Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1, Chapter 16.2.1
Quote
:(  I feel left out
AMD does not support TSX (yet?), and most Intel CPUs don't either. Even on CPUs that do support TSX, the processor may abort transactional execution for a number of reasons. Hence the need for fallback paths, according to Intel's documentation.


Raistlin

@aw27: Kewl - found it !

aw27

Actually, there is no problem with the printf in my demo.

The problem is with WaitForMultipleObjects, which can't wait for more than MAXIMUM_WAIT_OBJECTS (64) objects, and I am creating 3000 threads. The side effect is that the program may exit before all threads have completed their job.
As a quick fix that avoids complicating things further, all we need to do is replace WaitForMultipleObjects with a Sleep of, say, 2 or 3 seconds. For 3000 threads, the correct sum reported at the end should be 225225000.
I will not change the program now; I am going to work on the RTM part and will eventually come up with an integrated solution.
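
An alternative to the Sleep workaround, sketched here in C++ rather than MASM and not part of the demo, is to wait on the thread handles in chunks of at most 64. The helper names `batch_sizes` and `wait_for_all_threads` are hypothetical; only `WaitForMultipleObjects`, `MAXIMUM_WAIT_OBJECTS`, `WAIT_FAILED` and `INFINITE` are real Win32 names, and the Windows part is compiled only when `_WIN32` is defined.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Documented value of MAXIMUM_WAIT_OBJECTS on Windows.
constexpr std::size_t kMaxWaitObjects = 64;

// Split `total` handles into consecutive chunk sizes of at most
// kMaxWaitObjects each.
std::vector<std::size_t> batch_sizes(std::size_t total) {
    std::vector<std::size_t> sizes;
    for (std::size_t done = 0; done < total;) {
        const std::size_t n = std::min(kMaxWaitObjects, total - done);
        sizes.push_back(n);
        done += n;
    }
    return sizes;
}

#ifdef _WIN32
static_assert(kMaxWaitObjects == MAXIMUM_WAIT_OBJECTS, "limit mismatch");

// Wait until every thread handle is signaled, one chunk at a time.
// Returns false if any wait call fails.
bool wait_for_all_threads(const std::vector<HANDLE>& threads) {
    std::size_t offset = 0;
    for (std::size_t n : batch_sizes(threads.size())) {
        const DWORD r = WaitForMultipleObjects(static_cast<DWORD>(n),
                                               threads.data() + offset,
                                               TRUE, INFINITE);
        if (r == WAIT_FAILED) return false;
        offset += n;
    }
    return true;
}
#endif
```

For 3000 handles this makes 47 wait calls: 46 full chunks of 64 and a final chunk of 56. Waiting chunk by chunk is enough here because the goal is only that all threads have finished before the final sum is printed.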

BTW, it is normal for threads to appear out of order; there is no FIFO guarantee.

aw27

This is the revised TSX/HLE demo (it also runs without HLE), and it uses a different metaphor. Actually, it is an extension of the initial metaphor.

Consider a very high-traffic bank account, with deposits and withdrawals occurring all the time.
When a withdrawal would take the account balance below zero, the bank refuses to pay it. However, since deposits are occurring all the time for this special client, our bank retries the withdrawal as many times as necessary, because enough funds are expected to become available later on.

Since there is no FIFO order, 3000 operations will become tens or hundreds of thousands, or even millions, of operations. This really stresses the algorithm quite a bit. The expected final result is a positive bank balance of 3345. The total number of operations is the sum of Transactionals + Fallbacks. When testing without HLE, there will be no Transactionals.
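
The retry logic of the metaphor can be sketched in a few lines of portable C++. This is only an illustration of the control flow: a plain `std::mutex` stands in for the HLE lock, so nothing is elided here, and the amounts, the pairing of one depositor with one withdrawer, and the names `BankResult` and `run_bank_demo` are all made up rather than taken from the demo.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical result of one run of the sketch.
struct BankResult {
    long balance;   // final account balance
    long attempts;  // deposits plus all withdrawal attempts, refused ones included
};

// Starts `pairs` withdrawer threads and `pairs` depositor threads.
// A withdrawal that would push the balance below zero is refused and retried.
BankResult run_bank_demo(int pairs, long deposit, long withdrawal) {
    // With deposit >= withdrawal every withdrawer is eventually paid,
    // so the retry loops are guaranteed to terminate.
    assert(pairs >= 0 && withdrawal >= 0 && deposit >= withdrawal);

    std::mutex account_lock;  // stand-in for the HLE-elided lock
    long balance = 0;         // protected by account_lock
    long attempts = 0;        // protected by account_lock

    auto deposit_once = [&] {
        std::lock_guard<std::mutex> guard(account_lock);
        balance += deposit;
        ++attempts;
    };

    auto withdraw_until_paid = [&] {
        for (;;) {
            {
                std::lock_guard<std::mutex> guard(account_lock);
                ++attempts;
                if (balance >= withdrawal) {  // balance would stay >= 0
                    balance -= withdrawal;
                    return;
                }
            }
            std::this_thread::yield();  // refused: let depositors run, then retry
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < pairs; ++i) {
        pool.emplace_back(withdraw_until_paid);
        pool.emplace_back(deposit_once);
    }
    for (auto& t : pool) t.join();
    return BankResult{balance, attempts};
}
```

With `run_bank_demo(4, 100, 75)` the final balance is always 100 (400 deposited, 300 withdrawn), while the number of attempts varies from run to run because refused withdrawals are retried. That mirrors why the operation count in the demo can grow far beyond the number of threads while the final balance stays fixed.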