A simple loop using NanoTimer (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1171) aka QueryPerformanceCounter: For_ ecx=0 To loops-1
push ecx
invoke Sleep, 0 ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
pop ecx
NanoTimer() ; start the QueryPerformanceCounter timer
void GetHash("Just a little test of the hash function")
add total, NanoTimer(us) ; add time for calculating the average, end QueryPerformanceCounter
mov ms(ecx), eax ; mov time into array
Next
One would expect that most timings are near the physical limit, but a few are much higher because of interrupts. Here is the distribution:
(http://www.webalice.it/jj2006/PlotHashSleep0.png)
What I don't understand is the result for invoke Sleep, 1:
(http://www.webalice.it/jj2006/PlotHashSleep1.png)
Suddenly, timings increase drastically, with a few exceptions to the left. We always used Sleep to improve the accuracy of timings, but here it seems to have the opposite effect...
Executables and source attached.
Timers in windows are really a pain in the ass..
Especially when you need to poll a message at a fixed interval with a timer interrupt within 1 millisecond. :(
Sleep,0
(http://members.home.nl/siekmanski/Sleep0.png)
Sleep,1
(http://members.home.nl/siekmanski/Sleep1.png)
Probably you should not consider the first measurements, they will take longer because there is nothing in the cache.
Quote from: aw27 on December 01, 2017, 05:55:48 AM
Probably you should not consider the first measurements, they will take longer because there is nothing in the cache.
Interesting idea:
; ArraySort ms() ; no sorting, show timings in their original sequence
Doesn't solve the Sleep(0) vs Sleep(1) mystery, but you are absolutely right about the first timings being much higher :t
Quote
Doesn't solve the Sleep(0) vs Sleep(1) mystery,
I think it solves. The longer the thread sleeps the more chances has the same
CPU thread of being reused elsewhere, either in the same process or in another process. This means that the cache will be invalidated shortly after that.
Good point, José :t
You know roughly what SleepEx does, it suspends the thread, waits a duration then returns in the next time slice. It will garrantee slow down your code which is close to what its designed to do.
Quote from: hutch-- on December 01, 2017, 11:52:30 AM
You know roughly what SleepEx does, it suspends the thread, waits a duration then returns in the next time slice. It will garrantee slow down your code which is close to what its designed to do.
The timing starts
after the return from Sleep. Check José's argument, it's more convincing.
What Jose said is fine, what I gather is you are running timings and in that context I would imagine the you would want to avoid cache effects. Make the SleepEx duration longer so that you DO lose any cache effects and the accuracy of the timing should improve. You are still at the mercy of ring0 interference and there is likely to be some degree of wandering in the next time slice available after SleepEx gives control back to the timing thread. This is why I avoid short intense timing methods, there are too many variables that are either difficult or impossible to control.
You can also increase the Process Priority and the Thread Priority (The normal Thread Priority is the Process Priority). No thread is interrupted by a lower priority thread and a higher priority interrupts lower priority. This may have side effects in the Sleep behaviour, though. :icon_rolleyes:
Quote from: hutch-- on December 01, 2017, 11:33:42 PMMake the SleepEx duration longer so that you DO lose any cache effects and the accuracy of the timing should improve.
That is what I would expect, too. But the evidence tells a different story - this is Sleep(100) before the timing starts. Much worse than before, and a number of "negative" outliers. And the median is about 5x higher than it should be.
Compare that to the lower graph: Sleep(0), realtimepriority, unsorted = sequential. It's the cache and nothing else. Which opens the question whether we want to time algos with or without cache ::)
Quote from: aw27 on December 01, 2017, 11:48:50 PM
You can also increase the Process Priority and the Thread Priority (The normal Thread Priority is the Process Priority). No thread is interrupted by a lower priority thread and a higher priority interrupts lower priority. This may have side effects in the Sleep behaviour, though. :icon_rolleyes:
It works very well.
I have used "Sleep,1" in a multimedia interrupt timer thread with THREAD_PRIORITY_TIME_CRITICAL.
The timer is running 200 nanoseconds slower as the interrupt time I want.
When it slowly drifts below the interrupt time, I pause it with "Sleep,1" and reset the timer.
It stays "stable" within 1 millisecond and it keeps the processor load very small.
I needed this because I had to synchronize with the per 10 millisecond write/play cursor message from DirectSound which is jittery as hell and would not spend another 10 milliseconds waiting for the next message and lose track of the buffer positions. Now I have the lowest latency possible with DirectSound streaming buffers.
Quote from: Siekmanski on December 02, 2017, 02:15:55 AM
I have used "Sleep,1" in a multimedia interrupt timer thread with THREAD_PRIORITY_TIME_CRITICAL.
For very short runs it may work very well as you confirmed. :t
Something else that helps to stabilise the timings is to use a serialised instruction before it, CPUID is one that will do that, clear the cache before the timing. As you would be aware, modern x86/64 hardware is a lot messier when it comes to timing code. This Haswell E/EP I am using sleeps in "noddy mode" at about 1.2 gig until you load it where it comes up close to the 3.3 gig it tops out at.
Power saving, variable clock rates, out of order instruction sequencing etc etc etc .... are all designed to make your life more miserable so I personally test with long duration test pieces to try and get some idea if an algo is faster than another. To test attack (how quick and algo is started and run) you make a short test and run it many times so you get a duration of a second or so. For long linear tests I use the massive amount of memory available in win64 to test linear speed so if you have to process massive log files or large videos you can clock how fast a long single read/write takes.
Just a note with SleepEx, the granularity is about 20ms so SleepEx,1,0 is no faster than SleepEx,20,0. From memory it is a hardware based interrupt.
Quote from: hutch-- on December 02, 2017, 08:59:45 AMthe granularity is about 20ms so SleepEx,1,0 is no faster than SleepEx,20,0
Right. Under the hood, it calls NtDelayExecution, exactly like Sleep(delay), which saves two bytes compared to the identical SleepEx, delay,
0:
SleepEx Ú$ 8BFF mov edi, edi ; STATUS_X kernel32.SleepEx(Time,Alertable)
75811217 ³. 55 push ebp
75811218 ³. 8BEC mov ebp, esp
7581121A ³. 5D pop ebp
7581121B À. EB ED jmp short <jmp.&API-MS-Win-Core-Synch-L ; Jump to KERNELBASE.SleepEx
... ->NtDelayExecution
Guess what are the values of edi, ebp and esp after the first four lines have been executed - C compilers at work :P
Below the distributions (serial, not sorted) for:
- Sleep(100), no CPUID
- Sleep(100), with CPUID
- Sleep(0), no CPUID
- Sleep(100), with CPUID
Note the average of 780 ms for the upper 2 graphs: Remember the Sleep(100) comes
before the timings starts.
In contrast, the lower 2 Sleep(0) runs (with 500 elements) show spikes at regular intervals; but even those, at about 200 microsecs, are far below the 780ms average of the Sleep(100) runs.
deleted
Finally, a graph showing the results for the optimal timing strategy:
- discard first x timings to load the cache
- sort the results
- eliminate the highest 20%, i.e. the spikes
Note that the average is calculated before the top 20% got eliminated. The real average is 101.5 microseconds.
@nidud: yes, of course, mov edi, edi is the hot-patch instruction that you find at the entry of most API functions. But the rest is funny ;)
From the ntdll.dll these functions are interesting,
ZwQueryTimerResolution
ZwSetTimerResolution
ZwDelayExecution
Maybe we can create our own Sleep function with a 100 nanoseconds granularity and get below 1 millisecond?
more info: http://www.google.nl/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&cad=rja&uact=8&ved=0ahUKEwj0ioejiuvXAhUSpKQKHXRPBaUQFgg-MAM&url=http%3A%2F%2Fread.pudn.com%2Fdownloads137%2Fdoc%2F584096%2F%25E6%259C%25AA%25E6%2596%2587%25E6%25A1%25A3%25E5%258C%2596%25E5%2587%25BD%25E6%2595%25B0%2F10%2520Time.pdf&usg=AOvVaw2OY09pfo_iU8iIPP8gE0vM
Quote from: Siekmanski on December 02, 2017, 09:29:29 PM
From the ntdll.dll these functions are interesting,
ZwQueryTimerResolution
ZwSetTimerResolution
ZwDelayExecution
These are for drivers; the equivalent Ntxx functions do the same.
QuoteMaybe we can create our own Sleep function with a 100 nanoseconds granularity and get below 1 millisecond?
In theory, yes, but it would increase power consumption.
I have another little project: a Delay2AbsoluteTime macro. Here are first results:
13:29:42.066 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:44.123 2057 ms passed
13:29:44.273 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:46.123 1850 ms passed
13:29:46.146 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:48.123 1977 ms passed
13:29:48.168 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:50.123 1955 ms passed
13:29:50.411 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:52.123 1712 ms passed
13:29:52.341 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:54.123 1782 ms passed
13:29:54.396 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:56.123 1727 ms passed
13:29:56.196 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:58.123 1927 ms passed
13:29:58.385 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:58.123 1 ms passed
13:29:58.393 - end should be two seconds later, at hh:mm:??.123 millisecs
13:29:58.123 1 ms passed
Interestingly enough, the granularity is not the usual 16 or so millisecs (1/64) of Sleep & friends but rather 1 ms.
P.S. - results from
powercfg -energy duration 5 (google translated), it seems that Pale Moon is also greedy:
Platform timer resolution: Platform timer resolution
The default platform timer resolution is 15.6 ms (15625000 ns) and must be used whenever the system is idle. If the timer resolution increases, the processor's energy-saving technologies may not be effective. The timer resolution may increase in the case of multimedia playback or graphic animations.
Current timer resolution (100 ns unit) 10000
Max timer interval (100 ns unit) 156001
Platform timer resolution: Pending timer request
A program or service has requested a resolution of the timer that is less than the maximum resolution of the platform timer.
Requested range 10000
Request process ID 6028
Request process location \ Device \ HarddiskVolume3 \ Program Files (x86) \ Pale Moon \ palemoon.exe
So my system is running at 1 ms resolution if the browser is open :(
Quote from: jj2007 on December 02, 2017, 11:33:47 PM
Quote from: Siekmanski on December 02, 2017, 09:29:29 PM
From the ntdll.dll these functions are interesting,
ZwQueryTimerResolution
ZwSetTimerResolution
ZwDelayExecution
These are for drivers; the equivalent Ntxx functions do the same.
Forgot to mention that...
Did a quick test with NtQueryTimerResolution:
CurrentResolution = 10003
MaximumResolution = 5000
MinimumResolution = 156250
Seems we can have a "Sleep" function of 0.5 milliseconds.
I'll have to check first if it is useful or not in a timer interrupt, and how it behaves on different computers. ( cpu load etc. )
QuoteI have another little project: a Delay2AbsoluteTime macro.
Timing in windows is a real challenge.
> NtQueryTimerResolution
Nice - I used powercfg -energy duration 5 which is much clumsier :t
Further analysis shows that my browser Pale Moon is indeed guilty of setting the resolution to 1ms, thus causing increased CPU load and power consumption, BUT: it doesn't so automatically. If I open the browser with a static page, granularity is as usual 1/64; only if I visit Google's baby called "YouTube", the resolution suddenly jumps up, and granularity goes to 1ms. Good to know :bgrin:
P.S.: Can I get some info, please?This is Windows version 6.1, build 7601
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
initial res: 100000
current res
esi 0
edi 5000
current res
esi 10000
edi 10000
current res
esi 12500
edi 12500
current res
esi 25000
edi 25000
current res
esi 50000
edi 50000
current res
esi 100000
edi 100000
This is Windows version 5.1, build 2600
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
initial res: 100144
current res
esi 0
edi 10032
Weird results for Win10, it seems to be stuck at 10000:This is Windows version 10.0, build 15063
Intel(R) Celeron(R) CPU N2840 @ 2.16GHz
initial res: 10014
current res
esi 0
edi 5007
lowest, highes, current res
eax 5000
edx 5007
ecx 156250
current res
esi 10000
edi 10014
lowest, highes, current res
eax 5000
edx 10014
ecx 156250
lowest, highes, current res
eax 5000
edx 10014
ecx 156250
I've implemented the 0.5 ms "Sleep" routine in my timer interrupt.
It works perfect.
Now I can keep track with the write/play position message with more accuracy and precision. ( within 4 samples at a samplerate of 44100Hz )
Without losing a message or producing clicks in the sound, I'm happy ! :biggrin:
And the cpu load is very small 0.0 - 0.2 % on my computer.
Always thought that it was impossible without a big cpu load.
Jochen, thanks for the inspiring thread.
(http://members.home.nl/siekmanski/Dsound%20timer.png)
Quote from: Siekmanski on December 03, 2017, 07:37:14 AMAnd the cpu load is very small 0.0 - 0.2 % on my computer.
For the program, or the whole computer? The setting is global...
QuoteJochen, thanks for the inspiring thread.
Still struggling to understand everything. For my Win7-64 machine, it is pretty clear, but the XP VM and the Win10 notebook show an entirely different behaviour for NtQueryTimerResolution :(
For the program only 0.1 % ( see task manager "AudioDSwav.exe" in the top right of the posted image )
I have to test it on other computers, and see how it behaves.