News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

QueryPerformanceCounter: Sleep slows down my code

Started by jj2007, November 30, 2017, 10:48:25 PM

Previous topic - Next topic

jj2007

A simple loop using NanoTimer aka QueryPerformanceCounter:  For_ ecx=0 To loops-1
push ecx
  invoke Sleep, 0    ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  pop ecx
NanoTimer() ; start the QueryPerformanceCounter timer
void GetHash("Just a little test of the hash function")
add total, NanoTimer(us) ; add time for calculating the average, end QueryPerformanceCounter
mov ms(ecx), eax ; mov time into array
  Next


One would expect that most timings are near the physical limit, but a few are much higher because of interrupts. Here is the distribution:


What I don't understand is the result for invoke Sleep, 1:


Suddenly, timings increase drastically, with a few exceptions to the left. We always used Sleep to improve the accuracy of timings, but here it seems to have the opposite effect...

Executables and source attached.

Siekmanski

Timers in windows are really a pain in the ass..
Especially when you need to poll a message at a fixed interval with a timer interrupt within 1 millisecond.  :(


Sleep,0


Sleep,1
Creative coders use backward thinking techniques as a strategy.

aw27

Probably you should not consider the first measurements, they will take longer because there is nothing in the cache.

jj2007

Quote from: aw27 on December 01, 2017, 05:55:48 AM
Probably you should not consider the first measurements, they will take longer because there is nothing in the cache.

Interesting idea: ; ArraySort ms() ; no sorting, show timings in their original sequence

Doesn't solve the Sleep(0) vs Sleep(1) mystery, but you are absolutely right about the first timings being much higher :t

aw27

Quote
Doesn't solve the Sleep(0) vs Sleep(1) mystery,
I think it solves. The longer the thread sleeps the more chances has the same CPU thread of being reused elsewhere, either in the same process or in another process.  This means that the cache will be invalidated shortly after that.

jj2007


hutch--

You know roughly what SleepEx does, it suspends the thread, waits a duration then returns in the next time slice. It will garrantee slow down your code which is close to what its designed to do.

jj2007

Quote from: hutch-- on December 01, 2017, 11:52:30 AM
You know roughly what SleepEx does, it suspends the thread, waits a duration then returns in the next time slice. It will garrantee slow down your code which is close to what its designed to do.

The timing starts after the return from Sleep. Check José's argument, it's more convincing.

hutch--

What Jose said is fine, what I gather is you are running timings and in that context I would imagine the you would want to avoid cache effects. Make the SleepEx duration longer so that you DO lose any cache effects and the accuracy of the timing should improve. You are still at the mercy of ring0 interference and there is likely to be some degree of wandering in the next time slice available after SleepEx gives control back to the timing thread. This is why I avoid short intense timing methods, there are too many variables that are either difficult or impossible to control.

aw27

You can also increase the Process Priority and the Thread Priority (The normal Thread Priority is the Process Priority). No thread is interrupted by a lower priority thread and a higher priority interrupts lower priority. This may have side effects in the Sleep behaviour, though.  :icon_rolleyes:

jj2007

Quote from: hutch-- on December 01, 2017, 11:33:42 PMMake the SleepEx duration longer so that you DO lose any cache effects and the accuracy of the timing should improve.

That is what I would expect, too. But the evidence tells a different story - this is Sleep(100) before the timing starts. Much worse than before, and a number of "negative" outliers. And the median is about 5x higher than it should be.

Compare that to the lower graph: Sleep(0), realtimepriority, unsorted = sequential. It's the cache and nothing else. Which opens the question whether we want to time algos with or without cache ::)

Siekmanski

Quote from: aw27 on December 01, 2017, 11:48:50 PM
You can also increase the Process Priority and the Thread Priority (The normal Thread Priority is the Process Priority). No thread is interrupted by a lower priority thread and a higher priority interrupts lower priority. This may have side effects in the Sleep behaviour, though.  :icon_rolleyes:

It works very well.

I have used "Sleep,1" in a multimedia interrupt timer thread with THREAD_PRIORITY_TIME_CRITICAL.
The timer is running 200 nanoseconds slower as the interrupt time I want.
When it slowly drifts below the interrupt time, I pause it with "Sleep,1" and reset the timer.
It stays "stable" within 1 millisecond and it keeps the processor load very small.

I needed this because I had to synchronize with the per 10 millisecond write/play cursor message from DirectSound which is jittery as hell and would not spend another 10 milliseconds waiting for the next message and lose track of the buffer positions. Now I have the lowest latency possible with DirectSound streaming buffers.
Creative coders use backward thinking techniques as a strategy.

aw27

Quote from: Siekmanski on December 02, 2017, 02:15:55 AM
I have used "Sleep,1" in a multimedia interrupt timer thread with THREAD_PRIORITY_TIME_CRITICAL.
For very short runs it may work very well as you confirmed.  :t

hutch--

Something else that helps to stabilise the timings is to use a serialised instruction before it, CPUID is one that will do that, clear the cache before the timing. As you would be aware, modern x86/64 hardware is a lot messier when it comes to timing code. This Haswell E/EP I am using sleeps in "noddy mode" at about 1.2 gig until you load it where it comes up close to the 3.3 gig it tops out at.

Power saving, variable clock rates, out of order instruction sequencing etc etc etc .... are all designed to make your life more miserable so I personally test with long duration test pieces to try and get some idea if an algo is faster than another. To test attack (how quick and algo is started and run) you make a short test and run it many times so you get a duration of a second or so. For long linear tests I use the massive amount of memory available in win64 to test linear speed so if you have to process massive log files or large videos you can clock how fast a long single read/write takes.

Just a note with SleepEx, the granularity is about 20ms so SleepEx,1,0 is no faster than SleepEx,20,0. From memory it is a hardware based interrupt.

jj2007

Quote from: hutch-- on December 02, 2017, 08:59:45 AMthe granularity is about 20ms so SleepEx,1,0 is no faster than SleepEx,20,0

Right. Under the hood, it calls NtDelayExecution, exactly like Sleep(delay), which saves two bytes compared to the identical SleepEx, delay, 0:SleepEx         Ú$  8BFF              mov edi, edi                            ; STATUS_X kernel32.SleepEx(Time,Alertable)
75811217        ³.  55                push ebp
75811218        ³.  8BEC              mov ebp, esp
7581121A        ³.  5D                pop ebp
7581121B        À. EB ED             jmp short <jmp.&API-MS-Win-Core-Synch-L ; Jump to KERNELBASE.SleepEx
... ->NtDelayExecution


Guess what are the values of edi, ebp and esp after the first four lines have been executed - C compilers at work :P

Below the distributions (serial, not sorted) for:
- Sleep(100), no CPUID
- Sleep(100), with CPUID
- Sleep(0), no CPUID
- Sleep(100), with CPUID

Note the average of 780 ms for the upper 2 graphs: Remember the Sleep(100) comes before the timings starts.
In contrast, the lower 2 Sleep(0) runs (with 500 elements) show spikes at regular intervals; but even those, at about 200 microsecs, are far below the 780ms average of the Sleep(100) runs.