This is a spin-off of this thread (http://masm32.com/board/index.php?topic=7509.0) and I reused the structure of its code (thank you Alex, Marinus and the Masm32 Library).
According to Intel:
"When tuning, note that all Intel 64 and IA-32 processors usually have very high branch
prediction rates. Consistently mispredicted branches are generally rare."
Given that, I am testing how much of a performance boost eliminating mispredicted branches produces. For unpredictability I am using the rdrand instruction to keep the code small; you can use your favorite random algo in its place :t. Branch elimination can be done with SETCC or CMOV instructions, and I am testing both. There appears to be a slight improvement from eliminating branches in my setup, but results may differ on other systems or setups. I am showing esi values, but they will only be statistically close between tests, not identical.
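For readers who do not want to open the attachment, here is a minimal C sketch of the three variants being compared. It is not the MASM code from the test: the function names and the "low bit set" condition are assumptions made for illustration only. A C compiler is also free to turn any of these into a branch, a SETcc or a CMOVcc, so check the generated assembly before drawing conclusions from timings.

```c
#include <stdint.h>

/* Hypothetical condition for all three variants: the low bit of the value is set. */

/* Variant A: a conditional branch, hard to predict when the input is random. */
uint32_t count_branching(const uint32_t *v, uint32_t n) {
    uint32_t count = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (v[i] & 1u)
            count++;
    }
    return count;
}

/* Variant B: SETcc style, turn the condition into 0 or 1 and add it. */
uint32_t count_setcc_style(const uint32_t *v, uint32_t n) {
    uint32_t count = 0;
    for (uint32_t i = 0; i < n; i++)
        count += (uint32_t)((v[i] & 1u) != 0);
    return count;
}

/* Variant C: CMOVcc style, compute both outcomes and select one of them. */
uint32_t count_cmov_style(const uint32_t *v, uint32_t n) {
    uint32_t count = 0;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t incremented = count + 1;
        count = (v[i] & 1u) ? incremented : count;
    }
    return count;
}
```

All three return the same count for the same input; only the way the condition reaches the counter differs.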
Unpredicatable Branching Performance Test begins:
Test A (Branching): 5310.804483 esi: 25001015
Test B (No Branching/Using SETCC): 5154.628838 esi: 25000277
Test C (No Branching/Using CMOV): 5102.413384 esi: 24996680
Press any key to continue ...
Unpredicatable Branching Performance Test begins:
Test A (Branching): 6147.469874 esi: 25006397
Test B (No Branching/Using SETCC): 5842.821538 esi: 24998669
Test C (No Branching/Using CMOV): 5737.716296 esi: 25005668
Press any key to continue ...
Unpredicatable Branching Performance Test begins:
Test A (Branching): 17905.004104 esi: 24999384
Test B (No Branching/Using SETCC): 17769.385169 esi: 25000978
Test C (No Branching/Using CMOV): 17706.415355 esi: 24995715
Press any key to continue ...
cpu: AMD Threadripper 1950X
I wonder why these values are so large compared to yours?
Win 8.1 i7-4930K
Unpredicatable Branching Performance Test begins:
Test A (Branching): 4847.646489 esi: 24995192
Test B (No Branching/Using SETCC): 4715.992466 esi: 24998433
Test C (No Branching/Using CMOV): 4792.383529 esi: 25007685
Doesn't look like there is enough difference to worry about.
Unpredicatable Branching Performance Test begins:
Test A (Branching): 4714.086700 esi: 24999738
Test B (No Branching/Using SETCC): 4347.706700 esi: 24992511
Test C (No Branching/Using CMOV): 4286.387600 esi: 25000474
Press any key to continue ...
Quote from: johnsa on November 09, 2018, 09:14:49 PM
cpu: AMD Threadripper 1950X
I wonder why these values are so large compared to yours?
The reason could be the rdrand instruction. It is a slow instruction and possibly much slower on AMD. You can infer how slow it is by commenting out:
;rdrand ax
;jnc @b
I also believe that rdrand is slower on my system than on Siekmanski's and Hutch's, because my system is otherwise generally faster.
Since rdrand is so slow, it accounts for most of the measured time, which is not what we were trying to measure. So I decided to replace it with xorshift32 and use 4 times more iterations.
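For reference, a minimal C sketch of xorshift32 follows. It uses Marsaglia's common (13, 17, 5) shift triple; whether the attached MASM code uses the same constants is an assumption, and `count_odd_draws` with its "low bit set" condition is a hypothetical helper, not part of the test program.

```c
#include <assert.h>
#include <stdint.h>

/* One step of xorshift32 with the (13, 17, 5) shift triple (assumed constants). */
uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    assert(x != 0);   /* zero is a fixed point: the generator would stay at 0 forever */
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: count how many of n draws have the low bit set,
   adding the bit directly so no conditional branch depends on the random value. */
uint32_t count_odd_draws(uint32_t seed, uint32_t n) {
    uint32_t state = seed;
    uint32_t count = 0;
    for (uint32_t i = 0; i < n; i++)
        count += xorshift32(&state) & 1u;
    return count;
}
```

The generator is only a few shifts and XORs per value, so its cost no longer swamps the branch being measured, unlike rdrand.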
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 1079.229572 esi: 99999910
Test B (No Branching/Using SETCC): 560.419929 esi: 99992311
Test C (No Branching/Using CMOV): 508.874674 esi: 100000643
Press any key to continue ...
Now we see a clear difference. :t
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 1707.725816 esi: 100003249
Test B (No Branching/Using SETCC): 814.433475 esi: 99999228
Test C (No Branching/Using CMOV): 815.539814 esi: 100012874
Press any key to continue ...
Haswell E/EP
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 1390.336600 esi: 99996111
Test B (No Branching/Using SETCC): 753.261600 esi: 100000008
Test C (No Branching/Using CMOV): 717.176900 esi: 100000849
Press any key to continue ...
Try more than 1 run with a SleepEx,100,0 followed by a CPUID instruction. Isolates one test from the other. May help with a change in priority class.
Core i5, roughly a factor of 2 for branching vs CMOV:
Test A (Branching): 1839.689614 esi: 99994305
Test B (No Branching/Using SETCC): 1014.606289 esi: 100011883
Test C (No Branching/Using CMOV): 916.680932 esi: 100005125
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 1474.770048 esi: 99998649
Test B (No Branching/Using SETCC): 776.358238 esi: 99996812
Test C (No Branching/Using CMOV): 740.065497 esi: 100008227
Press any key to continue ...
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 7489.981548 esi: 99994224
Test B (No Branching/Using SETCC): 6022.567114 esi: 100000788
Test C (No Branching/Using CMOV): 6892.027059 esi: 100007146
Press any key to continue ...
Quote from: hutch-- on November 10, 2018, 12:46:43 AM
Try more than 1 run with a SleepEx,100,0 followed by a CPUID instruction. Isolates one test from the other. May help with a change in priority class.
I usually follow the RDTSC route; today I followed the method inherited from the other thread, which also has some advantages, in my opinion.
i7-4810mq
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 1348.180566 esi: 100011227
Test B (No Branching/Using SETCC): 718.488105 esi: 100001925
Test C (No Branching/Using CMOV): 706.041224 esi: 100002620
Press any key to continue ...
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 1346.779943 esi: 100012082
Test B (No Branching/Using SETCC): 681.084780 esi: 99999300
Test C (No Branching/Using CMOV): 683.320791 esi: 100016696
Press any key to continue ...
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 1354.122122 esi: 100000595
Test B (No Branching/Using SETCC): 682.919775 esi: 99994066
Test C (No Branching/Using CMOV): 686.241900 esi: 99991064
Press any key to continue ...
Quote from: AW on November 10, 2018, 03:36:23 AM
I usually follow the RDTSC route; today I followed the method inherited from the other thread, which also has some advantages, in my opinion.
:t
The timers are not meant for measuring routine execution times.
These timers are meant to run simultaneously in real time to control multimedia events in games or demos.
For example:
- timer 1 controls the time to switch to the next scene.
- timer 2 controls when a flock of birds flies over.
- timer 3 controls the duration of a bullet salvo.
- timer 4 controls when a UFO enters Earth orbit.
etc.
Hi Jose, thanks for carrying the torch!
Unpredictable Branching Performance Test using xorshift32 begins:
Test A (Branching): 1452.756573 esi: 100007857
Test B (No Branching/Using SETCC): 679.610210 esi: 99997780
Test C (No Branching/Using CMOV): 588.437494 esi: 100011413
Press any key to continue ...
Quote from: Siekmanski on November 10, 2018, 05:13:53 AM
The timers are not meant for measuring routine execution times.
These timers are meant to run simultaneously in real time to control multimedia events in games or demos.
For example:
- timer 1 controls the time to switch to the next scene.
- timer 2 controls when a flock of birds flies over.
- timer 3 controls the duration of a bullet salvo.
- timer 4 controls when a UFO enters Earth orbit.
etc.
I look forward to testing timers for synchronizing several threads working together.
Maybe the LOCK prefix needs to be used?
It would be interesting to test rdrand against the simplest random generator used in Perlin noise.
I think we should add a Test D (no branching, using SIMD comparisons) so we know whether the time spent getting branchless code right is worth it, and how much is gained or lost.