rdtsc

Started by jj2007, January 23, 2023, 07:30:56 PM


jj2007

For our timing fans: we have been fighting with this problem for a while. Bulat Ziganshin knows his stuff; he is the author of FreeArc, the best archiver ever:

Quote
It measures execution times using the RDTSC instruction, which is very accurate, counting the exact number of CPU cycles between two RDTSC commands. Unfortunately, it has severe drawbacks.

When RDTSC was first implemented, CPU frequencies were fixed, and RDTSC really measured the number of CPU cycles. Later, Turbo Boost was introduced, increasing the frequency under load, but RDTSC continued to count the actual number of CPU cycles executed, i.e. its frequency changed over time. On modern Intel CPUs, however, RDTSC counts base-frequency cycles, i.e. it has gone back to measuring at a fixed frequency.

E.g. on my i7-4770, the base frequency is 3.4 GHz, and RDTSC returns time*3.4e9, irrespective of the real CPU frequency at the time of measurement. Since the real frequency may go up to 3.9 GHz, RDTSC-based SMHasher measurements, converted to bytes/cycle, may be up to 15% more optimistic than the real value. For example, XXH32 can't be faster than 2 bytes/cycle due to the use of 2 MUL operations per 4 input bytes, but SMHasher measures bulk speed up to ~2.2 bytes/cycle.

Another problem is that the RDTSC instruction is executed out of order, like most other instructions. This isn't a problem when measuring longer times, but a single call to a hash like XXH32, with 1-32 bytes of input data, executes only a few dozen instructions, while the CPU's instruction reorder buffer holds more than 100 instructions. This means that an attempt to measure the time of a single small-input hash operation with RDTSC may give absolutely meaningless results - just compare the results produced by two successive runs.

There is a standard solution to this problem - serializing RDTSC operations with CPUID - but for our needs it's much better to just measure the time of 1000 hashing calls. I made this change to SMHasher, and now the "Small key speed test" results are reproducible to within 5%. Of course, they are still 10-15% more optimistic than real speeds due to the above-mentioned RDTSC behaviour.

Finally, we can measure two small-input speeds: latency (time to get a result) and throughput (how many hashes can be computed in one second). Latency is more interesting for use cases where the hash value is used immediately, for example to address a hash table or to find a value in a tree. Since the further operation depends on the concrete hash value, we have to wait until it has been computed.

OTOH, when all we need is to store the hash value somewhere in memory, the further operation doesn't depend on the concrete value of the hash and may be overlapped with the hash computation - for such scenarios, bulk throughput is more representative. So I measured both values, but optimized for latency, believing that it's the more common use case for small-key hashing.
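
For reference, the CPUID-serialized pattern Bulat mentions looks roughly like this - a minimal 64-bit MASM-style sketch, where the register choices and the code under test are placeholders (note that CPUID clobbers RBX, so preserve it in a real procedure):

xor  eax,eax
cpuid                 ; serialize: nothing earlier can leak into the timed region
rdtsc                 ; EDX:EAX = start timestamp
mov  r8d,eax
mov  r9d,edx

; ... code under test ...

rdtscp                ; EDX:EAX = end timestamp, waits for the code above to finish
mov  r10d,eax
mov  r11d,edx
xor  eax,eax
cpuid                 ; keep later instructions out of the timed region

shl  r9,32
or   r8,r9            ; R8  = 64-bit start
shl  r11,32
or   r11,r10          ; R11 = 64-bit end
sub  r11,r8           ; R11 = elapsed TSC ticks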

mineiro

Quote
It depends a lot on the operating system and its tuning. On Linux, the cpufreq subsystem takes into account the latency to switch between frequencies to decide when to switch, and I'm pretty sure Windows proceeds similarly, that's important to preserve laptop batteries. A PLL takes time to stabilize so it's never very fast, and 1ms seems quite short to me. The latency I'm seeing is generally around 20-50ms on Linux, which is why 100ms are enough in my tests.
Quote
Probably 1 ms is the time required to warm up the AVX engine. So, combining both suggestions together, the CPU should be warmed up for 100 ms, including at least 1 ms of the actual hash code (to enable AVX if required).

Quote
When I was using Windows, 18 years ago, motherboards were shipped with a CD to manipulate a lot of stuff without having to reboot into the BIOS. Some of them even started to provide overclocking utilities which allowed you to precisely control the upper and lower frequencies. I'm sure these programs still exist. It may be enough for you to simply start the utility and reload your benchmark profile.
Quote
My ASRock Z87 mobo has the overclocking software, and I just tried setting a 34x multiplier - equal to the RDTSC frequency. This immediately gave me the results I expected:
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

hutch--

 :biggrin:

The pursuit of low-millisecond timing is like looking for the end of the rainbow; any reasonable grasp of hardware over the last 20 years tells you that the range of variables and priorities makes the old cycle-count technology useless. With the OS running at ring 0 priority, anything of lower priority gets disrupted on a regular basis, and that turns cycle-based timings to garbage.

The only method I know of that can deliver reasonable speed comparisons is real-time duration combined with a measure of the work done by the code being tested. With durations over 1 second you start to get highly reliable results, well under 5% variation, and a longer duration of a few extra seconds reduces the tolerance even further.
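
As a rough sketch of that approach (assuming the usual masm64 includes; TestProc is a placeholder for the algorithm being measured, and RBX/RSI need to be preserved in a real procedure):

    invoke GetTickCount
    mov    ebx,eax            ; start time in milliseconds
    xor    rsi,rsi            ; work counter
  @@:
    call   TestProc           ; the code being tested
    inc    rsi                ; one unit of work done
    invoke GetTickCount
    sub    eax,ebx
    cmp    eax,1000           ; keep going for at least one second
    jb     @B
                              ; RSI = units of work completed in ~1000 ms

Comparing two algorithms is then just a matter of comparing the work counts over the same interval.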

daydreamer

Hutch, when I measured in milliseconds the result varied by a random extra 15 ms - probably the OS stepping in and taking control of the CPU.
Is that better than clock cycles?
In a Windows GUI program, is it better to put the clock-cycle testing in a worker thread?

my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

mineiro

Maybe I don't get your point of view; to me, I suppose it's more a personal preference.
Some members of this forum prefer units of time, others prefer clock cycles.

Mains electricity in Brazil comes from the utility at 60 hertz; in Europe I suppose it is 50 hertz. The period is the inverse (reciprocal) of the frequency, so in 1 second there are 60 cycles (or 50, or N). An incandescent lamp flickers 60 times per second in Brazil, but to our eyes it looks constant; we do not see it flicker 60 times (clock cycles) in 1 second - or, if you prefer, 1/60 = 0.016666667 seconds per flicker.

We can never measure something exactly, only approximately - be it the weight of a person, a ruler marked in centimeters but not millimeters, or the bee flying back and forth between two people walking toward each other.

A clock pulse comes from the vibration of a quartz crystal, which is multiplied by a given operating factor.
The processor is always doing something, always.
By interference in the results I mean internal reordering of the instructions to be executed, misses when reading data that is not present in the cache, instructions executed out of order, instructions queued for execution, serialization, synchronization, instructions per cycle, ...

If we take the clock cycles and divide them by the operating frequency of the CPU, we get the elapsed time.

So, what's wrong with this instruction (rdtsc)? Where does the time measured in computers come from?
Why do some users here prefer time measurements instead of clock cycles?


lscpu
CPU MHz: 800,404
Max CPU MHz: 4000,0000 (4,000,000,000 cycles per second)
Min CPU MHz: 800.0000

The CPU operating at its low frequency, 800 MHz:
time: 0.000000175 seconds
130 cycles 146 cycles 140 cycles 128 cycles 144 cycles

The CPU operating at 4 GHz:
cpufreq-set -c 0 -g performance
cpufreq-set -c 0 -d 4000000

time: 0.000000007 seconds
28 cycles 28 cycles 26 cycles 28 cycles 28 cycles
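
Which matches the cycles-divided-by-frequency relation above: 140 cycles / 800 MHz ≈ 0.000000175 s, and 28 cycles / 4 GHz = 0.000000007 s.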



;mfence     ;lfence     ;sfence         ; fences tried as alternatives
;lea rax,@F
;prefetcht0 [rax]
;prefetcht1 [rax]
;prefetcht2 [rax]
;prefetchnta [rax]
;prefetchw [rax]
;prefetchwt1 [rax]

;xor eax,eax
;cpuid                                  ; classic CPUID serialization before rdtsc
;rdtsc

rdtscp                                  ; first timestamp in EDX:EAX (waits for preceding instructions)
mov r12d,eax
mov r13d,edx
rdtscp                                  ; second timestamp - only the two movs in between
sub eax,r12d                            ; low 32 bits of the difference
sbb edx,r13d                            ; high 32 bits, with borrow

shl rdx,32
or rdx,rax                              ; RDX = full 64-bit difference in TSC ticks
invoke printf,CStr("%llu cycles",10),rdx

edited - changed code
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

daydreamer

To make an asm SIMD version and compare it to the HLL result in seconds, together with fractions of a second.
But comparing scalar code to SIMD (packed) or SIMT optimizations might need a performance measure other than clock cycles - gigaflops (floating-point operations per second)?
Scalar: x flops, SIMD (SSE): 4*x flops, SIMT: number_of_cores*x flops?
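
As a back-of-the-envelope illustration (my own round numbers, not measurements): a scalar loop sustaining one ADDSS per clock on a single 3 GHz core would be 3 GFLOPS; the packed ADDPS form works on 4 floats at a time, so the same loop rate gives 12 GFLOPS; and spreading that across 8 cores would, in the ideal case, scale to 96 GFLOPS.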
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

hutch--

Hi mineiro,

You have two different mechanisms here: modern computers have a real-time clock and various methods of measuring frequencies; both are useful, but they do different things. With very old fixed-frequency hardware you could calculate the time from the cycle count, but with multicore CPUs, turbo boost and temperature-based speed reduction, that style of calculation produces garbage.

A useful toy to have on a computer is "Core Temp", as it shows you the load distribution across all of the CPU cores - yet another variable in what you are trying to time, since different cores perform differently in terms of speed.

Real time has its own problem - poor granularity - and the only way around it is to use far longer timing intervals and measure the speed of an algorithm by the amount of work it does within a specified time. Design a speed test based on what the algorithm is required to do, then run it for a second or two and measure its output in terms of work done.

The granularity is about one time slice (15 ms), so you need to run it for at least 500 ms; 1000 ms is better, and by going higher you reduce the error percentage to below 1%, which is usually good enough in most instances.
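
Putting numbers on that: one 15 ms slice against 500 ms of run time is a 3% worst-case error, against 1000 ms it is 1.5%, and at 2000 ms it is about 0.75%.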

NoCforMe

So just to throw another cat in here amongst the canaries, how would y'all rate QueryPerformanceCounter() as a way of tallying time? They say it's good down to a μsec. (Not accurate to that unit, obviously.)

That's the only time measurement I've ever used in the Win32 world. (Outside of using milliseconds for SetTimer().)
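
For what it's worth, the usual pattern is something like this - a minimal sketch inside a PROC, assuming the masm64 includes (overflow of the intermediate product is ignored, which is fine for short intervals):

LOCAL freq:QWORD, t1:QWORD, t2:QWORD

    invoke QueryPerformanceFrequency, addr freq   ; ticks per second
    invoke QueryPerformanceCounter, addr t1

    ; ... code under test ...

    invoke QueryPerformanceCounter, addr t2
    mov    rax,t2
    sub    rax,t1             ; elapsed ticks
    mov    rcx,1000000
    mul    rcx                ; RDX:RAX = ticks * 1,000,000
    div    freq               ; RAX = elapsed microseconds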
Assembly language programming should be fun. That's why I do it.

hutch--

 :biggrin:

If you can call it from ring 3, then it gets whacked by the OS in ring 0.  :tongue:

NoCforMe

OK, so it gets whacked, but how badly? Order of magnitude?
Assembly language programming should be fun. That's why I do it.

hutch--

 :biggrin:

Feel free to calibrate this yourself and see how consistent the results are. I can only go from 30 years of practice.  :tongue:

NoCforMe

Don't want to do that. That's why I'm asking you, since you have 30 years of experience.
Assembly language programming should be fun. That's why I do it.

hutch--

I also have a code mountain in front of me bigger than Mount Everest.

jj2007

QueryPerformanceCounter is what Microsoft recommends. Is that a good or a bad thing?  :rolleyes:

hutch--

 :biggrin:

Test it and find out; consistency is the criterion. If you get consistent results, it may do the job for you. I will stick to real time.