News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

SwitchToThread vs Sleep(0)

Started by jj2007, May 11, 2013, 08:17:48 AM

Previous topic - Next topic

jj2007

Still playing...
The assumption is that timings should be measured inside one timeslice. To detect if a timeslice has been left, it is further assumed that the cycle count for the test loop would jump over a certain value, say: 10000 cycles. Using the attached TestMe macro, it turns out that the slice is typically about 15ms or 35Mio cycles/2.3GHz (LiTs=Loops inside timeslice, CiTs=Cycles inside timeslice):

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
cycles  LiTs    CiTs
Loopct 0/0
23      6250034 143810186
23      1515100 34848007
23      4656881 107434202
23      1518678 34930235
23      3099625 71535622
23      1515225 34854130
23      1519461 34948261
23      1535141 35320238
23      6285264 145841188

Loopct 10/0
100     718797  71884004
100     348853  34885990
100     23890   2390032
100     323494  32349986
100     349989  34999560
100     352675  35268098
100     350274  35028107
100     351695  35200207
100     34103   3417695

Loopct 100/0
645     55709   35934108
645     54053   34865016
645     54667   35260579
645     54418   35100028
645     54795   35351193
653     1532    1000792
646     9954    6430330
651     378     246137
647     778     503492

Loopct 1000/0
8948    225     2013349
9030    3611    32609769
9030    3899    35211161
8842    225     1989386
9028    3497    31572751
9030    3886    35092325
8988    196     1761638
9029    3568    32218893
9030    3910    35310563

Loopct 1000/1
5836    6146    35872323
5836    180     1050405
5836    5650    32974841
5836    6060    35367503
5836    5997    35000922
5836    6053    35327492
5943    164     974610
5836    5463    31884485
5837    6078    35479713


One odd observation regarding the test loop:
                push loopct
                .Repeat        ; fake activity here
                        fild dword ptr [esp]
                        if usefldpi
                                fldpi        ; include these two...
                                fmul        ; ... and the code gets faster (!), see "1000/1" above.
                        endif
                        movd xmm0, dword ptr [esp]        ; ML 6.14 encodes this as movd mm0, ...
                        fstp st
                        dec dword ptr [esp]
                .Until Sign?
                pop eax

dedndave

prescott w/htt, xp mce2005 sp3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
cycles  LiTs    CiTs
Loopct 0/0
200     28947   5808113
200     12169   2436278
201     12265   2473065
200     12687   2539852
201     85225   17205435
202     12547   2537310
201     41483   8374245
202     12561   2542358
200     12623   2527148


Loopct 10/0
232     1       105
225     1       98
232     1       105
232     1       105
232     1       105
232     1       105
232     1       105
232     1       105
232     1       105


Loopct 100/0
1399    2052    2872170
1667    1432    2388367
1631    1565    2552565
1417    1568    2222145
1403    1809    2538772
1417    1668    2364413
1667    1533    2556015
1499    1703    2553690
1641    1552    2546775


Loopct 1000/0
232     1       105
5092    2       10057
5043    2       9960
5043    2       9960
5043    2       9960
5024    2       9922
5025    2       9923
5021    2       9915
5024    2       9922


Loopct 1000/1
240     1       113
7209    330     2378948
7210    353     2545200
7210    354     2552317
7208    354     2551800
7211    351     2530950
7208    354     2551830
7209    354     2552212
7208    351     2530140


probably not what you expected to see   :P

jj2007

Quote from: dedndave on May 13, 2013, 09:50:28 PM
prescott w/htt, xp mce2005 sp3
..
probably not what you expected to see   :P

The P4 apparently needs higher MaxCycles values to stabilise.

dedndave

well - not sure it's just the P4
the HAL is probably a little different for each OS/CPU combination

one thing about my P4 is that it uses hyper-threading for the second logical core
that surely changes how time slices are alotted

i don't think context switches between logical cores is handled the same as physical cores
i.e. - the OS can't change switching between hyper-threaded contexts - it's "hard-wired"

and - the fact that i am using mce2005 probably makes my machine unique against other members - lol

dedndave

i found this in the registry
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex]
"MaxQueryTimeslice"=dword:00000032

i think that's a timeslice quantity, though (i.e. 50 time slices)

a possibility that we may have overlooked is that the length of a time-slice could be dynamic
so - the length changes with conditions   :P
that would certainly explain why there is so little official documentation on the subject

hutch--

This is a bit crude but its an attempt to get some idea of the duration of task switching intervals using SleepEx().



IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL du    :DWORD
    LOCAL tc    :DWORD
    LOCAL cn    :DWORD

    push esi

    mov du, 0                       ; start duration at 0

    fn QueryPerformanceCounter,ADDR cn
    mov eax, cn
    mov esi, eax

  @@:
    fn SleepEx,du,0
    add du, 1
    fn QueryPerformanceCounter,ADDR cn
    mov eax, cn
    mov tc, eax

    sub esi, tc
    neg esi

    print ustr$(du)," = "
    print ustr$(esi),13,10

    mov esi, tc

    cmp du, 64
    jbe @B

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

Tedd

There's so many variables that you won't get any consistent results.

Time slices last up to a selected number of milliseconds (they may be shorter due to various interruptions, or yielding while waiting for I/O operations.)
The number of milliseconds any thread gets depends on: its current priority, whether the program window currently 'has focus', what other processes are running and their priorities, whether scheduling is set to favour programs or services (user modifiable), ...
And then the number of clock cycles you actually get within that time depends on a number of things, namely CPU clock frequency and throttling.



More on how 'simple' it is: http://support.microsoft.com/kb/259025
Potato2

jj2007

Quote from: Tedd on May 14, 2013, 02:19:00 AM
More on how 'simple' it is: http://support.microsoft.com/kb/259025

Good link, Tedd, thanks. For our main purpose, timing, much of this complexity is irrelevant: We have a high priority foreground process with no I/O.

The code I posted above shows that some cycle counts occur more often. For my Celeron, it's 1.445 Mio cycles - that is 0.9 milliseconds. For timing purposes, context switches should be avoided because they distort the results. So the goal is to get the highest possible cycle count within one time slice. Strange that Windows does not have a GetLenOfQuantum API...

dedndave

i would think you could derive such a value from the registry, if you knew how to manipulate which values   :P
learn all you can about the HAL

jj2007

Quote from: dedndave on May 14, 2013, 04:15:42 AM
i would think you could derive such a value from the registry

Win32PrioritySeparation ... not very useful  :(

qWord

There is a description in the book "Windows Internals", which shows how determine the clock cycles per quantum - It can be derived from the current timer resolution:
include \masm32\include\masm32rt.inc
include \masm32\include\advapi32.inc
includelib \masm32\lib\advapi32.lib

NTQUERYTIMERRESOLUTION typedef proto MinimumResolution:PULONG,MaximumResolution:PULONG,ActualResolution:PULONG
PNTQUERYTIMERRESOLUTION typedef ptr NTQUERYTIMERRESOLUTION

UULONG union
    qw QWORD ?
    ul ULONG ?
UULONG ends

.const
    c0 REAL8 1.0E6
    c1 REAL8 100.0E-9
    c3 REAL8 0.33333333333333333
.data?
    NtQueryTimerResolution PNTQUERYTIMERRESOLUTION ?
.code
main proc uses esi
LOCAL hKey:HKEY
LOCAL ulMHz:UULONG,_size:DWORD
LOCAL ulMinimumResolution:UULONG,ulMaximumResolution:UULONG,ulActualResolution:UULONG
LOCAL MinimumResolution:REAL8,MaximumResolution:REAL8,ActualResolution:REAL8,MHz:REAL8
LOCAL r8:REAL8
    .repeat
    ; get undocumented function NtQueryTimerResolution :-|
        .if !rvx(NtQueryTimerResolution = GetProcAddress,rv(GetModuleHandle,"Ntdll.dll"),"NtQueryTimerResolution")
            print "error: can't locate NtQueryTimerResolution",13,10
            .break
        .endif
       
        mov _size,4
        .if rvx(esi = RegOpenKeyEx,HKEY_LOCAL_MACHINE,"HARDWARE\DESCRIPTION\System\CentralProcessor\0",0,KEY_READ,&hKey) != ERROR_SUCCESS || \
            rvx(RegQueryValueEx,hKey,"~MHz",0,0,&ulMHz, &_size) != ERROR_SUCCESS
            print "error: can't read CPU frequency from registry",13,10
            .break         
        .endif
       
        mov ulMHz.ul[4],0
        fild ulMHz.qw
        fmul c0
        fstp MHz
       
        fnc crt_printf,"CPU frequency: %u [MHz]\n",ulMHz.ul
       
        fn NtQueryTimerResolution,&ulMinimumResolution,&ulMaximumResolution,&ulActualResolution
       
        FOR var,<MinimumResolution,MaximumResolution,ActualResolution>
            mov ul&var&.ul[4],0
            fild ul&var&.qw
            fmul c1
            fst var     ; -> seconds
           
            fmul MHz
            fmul c3     ; var*(1/3) -> Windows Internal: "each quantum unit is one-third of a clock interval"
            fstp r8     ; -> clocks per quantum
            fnc crt_printf,"Timer: &var : %.6G [s] -\r clocks/quantum: %G\n",var,r8
        ENDM

    .until 1
   
    inkey
    exit
main endp
end main

They also mention that there is a variable that can be read out in kernel mode: KiCyclesPerClockQuantum.

for my i7QM:
CPU frequency: 2294 [MHz]
Timer: MinimumResolution : 0.0156001 [s] -> clocks/quantum: 1.19289E+007
Timer: MaximumResolution : 0.0005 [s] -> clocks/quantum: 382333
Timer: ActualResolution : 0.001 [s] -> clocks/quantum: 764667
Press any key to continue ...

AFAICS this result only applies when the CPU is not throttled.
MREAL macros - when you need floating point arithmetic while assembling!

dedndave

prescott w/htt, xp mce2005 sp3
CPU frequency: 3000 [MHz]
Timer: MinimumResolution : 0.015625 [s] -> clocks/quantum: 1.5625E+007
Timer: MaximumResolution : 0.001 [s] -> clocks/quantum: 1E+006
Timer: ActualResolution : 0.0009766 [s] -> clocks/quantum: 976600

Antariy

I have some doubts that the timer resolution has any relation to the scheduler work - there are too much things which have influence to the timeslice (just like Tedd said). I suspect, these values we get from an native api - timer resolution - are for the multimedia timers, not for timeslice.

There is also the function:

NTSYSAPI
NTSTATUS
NTAPI
NtSetTimerResolution(

  IN ULONG                DesiredResolution,
  IN BOOLEAN              SetResolution,
  OUT PULONG              CurrentResolution );


But obviously that doesn't relate to the timeslice as well.
BTW, we can probably try to change timer resolution via MM APIs and then query it with native API - just to check.

Anyway, results for the code:

CPU frequency: 2133 [MHz]
Timer: MinimumResolution : 0.015625 [s] -> clocks/quantum: 1.11094E+007
Timer: MaximumResolution : 0.001 [s] -> clocks/quantum: 711000
Timer: ActualResolution : 0.0009766 [s] -> clocks/quantum: 694363
Press any key to continue ...


BTW Minimal resolution here (15,625 milliseconds) is the default resolution of Sleep(Ex), GetTickCount, timeGetTime under NTs.

Antariy

After a short playing with these functions (as well as MM APIs...) they all did not seem to change anything on my system at all :greensml:


Min res: 156250, Max res: 10000, Actual res: 9766

Sequental calls to GetTickCount/Sleep,20/GetTickCount difference: 16

NtSTR NTSTATUS: 00000000, actual: 9766

Min res: 156250, Max res: 10000, Actual res: 9766

Sequental calls to GetTickCount/Sleep,20/GetTickCount difference: 15

Press any key to continue ...



include \masm32\include\masm32rt.inc
include \masm32\include\ntdll.inc
includelib \masm32\lib\ntdll.lib
include \masm32\include\winmm.inc
includelib \masm32\lib\winmm.lib

;NTQUERYTIMERRESOLUTION typedef proto MinimumResolution:PULONG,MaximumResolution:PULONG,ActualResolution:PULONG
;PNTQUERYTIMERRESOLUTION typedef ptr NTQUERYTIMERRESOLUTION

UULONG union
    qw QWORD ?
    ul ULONG ?
UULONG ends

.const
    c0 REAL8 1.0E6
    c1 REAL8 100.0E-9
    c3 REAL8 0.33333333333333333
.data?
;    NtQueryTimerResolution PNTQUERYTIMERRESOLUTION ?
.code
main proc uses esi
LOCAL hKey:HKEY
LOCAL ulMHz:UULONG,_size:DWORD
LOCAL ulMinimumResolution:ULONG,ulMaximumResolution:ULONG,ulActualResolution:ULONG
LOCAL MinimumResolution:REAL8,MaximumResolution:REAL8,ActualResolution:REAL8,MHz:REAL8
LOCAL r8:REAL8


;invoke timeBeginPeriod,20
;invoke Sleep,0



    .repeat

        fn NtQueryTimerResolution,addr ulMinimumResolution,addr ulMaximumResolution,addr ulActualResolution

invoke crt_printf,CTXT("Min res: %lu, Max res: %lu, Actual res: %lu",10,10),ulMinimumResolution,ulMaximumResolution,ulActualResolution

invoke GetTickCount
mov ebx,eax
invoke Sleep,20
invoke GetTickCount
sub eax,ebx
invoke crt_printf,CTXT("Sequental calls to GetTickCount/Sleep,20/GetTickCount difference: %d",10,10),eax

     mov ulMinimumResolution,50000
     invoke NtSetTimerResolution,addr ulMinimumResolution,1,addr ulActualResolution
     invoke crt_printf,CTXT("NtSTR NTSTATUS: %p, actual: %lu",10,10),eax,ulActualResolution
     invoke Sleep,0

        fn NtQueryTimerResolution,addr ulMinimumResolution,addr ulMaximumResolution,addr ulActualResolution

invoke crt_printf,CTXT("Min res: %lu, Max res: %lu, Actual res: %lu",10,10),ulMinimumResolution,ulMaximumResolution,ulActualResolution

invoke GetTickCount
mov ebx,eax
invoke Sleep,20
invoke GetTickCount
sub eax,ebx
invoke crt_printf,CTXT("Sequental calls to GetTickCount/Sleep,20/GetTickCount difference: %d",10,10),eax

    .until 1

;invoke timeEndPeriod,1

    inkey
    exit
main endp
end main





Obviously the Michael's experimental code is the way to check the timeslice length in real time.

Having changed it a bit, I get, for instance, these results:

Commiting the memory loops... 3 2 1
Starting threads...
Flagging threads to start...
2       -2146197512
1       -2128130376
2       -2061490392
1       -1994868136
2       -1928309152
1       -1861226840
2       -1794693936
1       -1728372904
2       -1661278560
1       -1594765592
2       -1528019176
1       -1461520648
2       -1461783008
1       -1395112944
2       -1328442808
1       -1261772736
2       -1125095944
1       -1061480520
2       -1061762488
1       -927819416
2       -928423880
1       -795053216
2       -727995480
1       -728413624
2       -661743592
1       -595073432
2       -528403400
1       -395034176
2       -328250360
1       -261685496
2       -194617560
1       -128355360
2       -53439128
-1      -1
1       4988000
2       72073936   \
1       138326792  / timeslice ~31 ms
2       138294296
1       204964432
2       338715304
1       405007400
2       404974600
1       538372152
2       605409656  \ timeslice 199 microseconds
1       604984904  /\
2       738475176   / timeslice ~62 ms
1       738323552
2       872103120
1       938366456
2       938333784
1       1071705408
2       1143388808
1       1205660120
2       1205014048
1       1338386880
2       1405433584
1       1471723520
2       1538516608
1       1605088400
2       1672120992
1       1671703120
2       1738373224
1       1871743168
2       1871711800
1       2005082920
2       2079923832
1       2138425592
2       2138392064
1       -2023205624
2       -1956412152
1       -1889864432
2       -1822804288
1       -1756522312
2       -1756556240
1       -1689886128
2       -1556117160
1       -1489845864
2       -1423056504
1       -1423205784
2       -1289432992
1       -1223167360
2       -1223197168
1       -1145165320
Press any key to continue ...


And much results are fluctuate - obviously the timeslice is not "hardcoded".

Slightly modified code:

;==============================================================================
include \masm32\include\masm32rt.inc
.686
;==============================================================================
.data
    hThread1 dd 0
    hThread2 dd 0
    pBuff    dd 0
    count    dd 0
    lok      dd 1
.code
;==============================================================================
ThreadProc1 proc uses ebx lpParameter:LPVOID

.while lok
.endw

    .WHILE count < 10000000
        xor eax, eax
        cpuid
        rdtsc
        mov ecx, count
        mov ebx, pBuff
        mov DWORD PTR [ebx+ecx*8], 1
        mov DWORD PTR [ebx+ecx*8+4], eax
        inc count
    .ENDW
    ret
ThreadProc1 endp
;==============================================================================
ThreadProc2 proc uses ebx lpParameter:LPVOID

.while lok
.endw

    .WHILE count < 10000000
        xor eax, eax
        cpuid
        rdtsc
        mov ecx, count
        mov ebx, pBuff
        mov DWORD PTR [ebx+ecx*8], 2
        mov DWORD PTR [ebx+ecx*8+4], eax
        inc count
    .ENDW
    ret
ThreadProc2 endp
;==============================================================================
start:
;==============================================================================

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    ;--------------------------------------------------------------------------
    ; Minimize interruptions (THESE PRIORITY LEVELS NOT SAFE FOR SINGLE CORE).
    ;--------------------------------------------------------------------------

    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke GetCurrentThread
    invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

    ;-----------------------------------------------------------
    ; Buffer entries will be a dword thread identifier (1 or 2)
    ; followed by the low-order dword of the current TSC.
    ;-----------------------------------------------------------

    mov ebx,10000000*8+1000

    mov pBuff, alloc(ebx)

    invoke crt_printf,CTXT("Commiting the memory loops...")

    mov esi,3
    @@:
    invoke crt_printf,CTXT(" %lu"),esi
    mov edi,pBuff
    mov ecx,ebx
    shr ecx,2
    or eax,-1
    rep stosd
    dec esi
    jnz @B

    invoke crt_printf,CTXT(10,"Starting threads...",10)
   

    invoke CreateThread, NULL, 0, ThreadProc1, 0, 0, NULL
    mov hThread1, eax
    invoke CreateThread, NULL, 0, ThreadProc2, 0, 0, NULL
    mov hThread2, eax

    invoke crt_printf,CTXT("Flagging threads to start...",10)
   
    and lok,0

    invoke WaitForMultipleObjects, 2, ADDR hThread1, TRUE, INFINITE

    ;--------------------------------------------------------------
    ; Display only buffer entries where thread identifier changed.
    ;--------------------------------------------------------------

    mov ebx, -1
    xor edi, edi
    mov esi, pBuff
  @@:
    add ebx, 1
    cmp ebx, 10000000
    ja  @F
    mov eax, [esi+ebx*8]
    cmp eax, edi
    je  @B
    mov edi, eax
    invoke crt_printf,CTXT("%d",9), eax
    mov eax, [esi+ebx*8+4]
    invoke crt_printf,CTXT("%d",10), eax
    jmp @B
  @@:
    inkey
    exit
;==============================================================================
end start

jj2007

Celeron M, CPU frequency: 1600 [MHz]
Timer: MinimumResolution : 0.015625 s -> clocks/quantum: 8.33333E+006
Timer: MaximumResolution : 0.001 s -> clocks/quantum: 533333
Timer: ActualResolution : 0.015625 s -> clocks/quantum: 8.33333E+006


AMD Athlon(tm) Dual Core Processor 4450B, CPU frequency: 2304 [MHz]
Timer: MinimumResolution : 0.0156001 s -> clocks/quantum: 1.19809E+007
Timer: MaximumResolution : 0.0005 s -> clocks/quantum: 384000
Timer: ActualResolution : 0.0156 s -> clocks/quantum: 1.19808E+007


With my method I get 1.44E+6 clocks/slice for the Celeron and 3.5E+7 for the AMD...