SwitchToThread vs Sleep(0)

Started by jj2007, May 11, 2013, 08:17:48 AM


Antariy

Here is a further development of your idea, Michael :t

It clearly shows that for two concurrent CPU-greedy threads the timeslice is ~31 ms. For 6 threads I get inconsistent timeslices - obviously there is not enough CPU time, and the OS tries to give a bigger timeslice to some threads at random.
The last two...six "microseconds" values mean nothing: for the first thread the value is smaller because the thread terminates early (so it did not use its full timeslice), and for the last thread there is no next value to subtract from, so its final microseconds value is garbage.
If you set THREADS_COUNT to a higher value and sometimes get a negative microseconds value, that is just TSC overflow - the code saves only a DWORD of the TSC difference. That's OK.

To get the timeslice in clocks, just subtract the required_thread TSC value from the required_thread+1 TSC value.

You may play with the THREADS_COUNT equate - for example, set it to 8, comment out the process affinity setup, and run the prog on a 4-core machine :biggrin:

2 threads

Commiting the memory loops... 3 2 1
Starting threads...
Thread #1 started and waiting for synchronization flag...
Thread #2 started and waiting for synchronization flag...
All threads started and ready to work, flagging them...
ThrId: 1, Tsc: 1641942104 (31111.889036 microseconds)
ThrId: 2, Tsc: 1708313408 (31307.562426 microseconds)
ThrId: 1, Tsc: 1775102144 (31197.697475 microseconds)
ThrId: 2, Tsc: 1841656504 (31278.417108 microseconds)
ThrId: 1, Tsc: 1908383064 (31223.175253 microseconds)
ThrId: 2, Tsc: 1974991776 (32716.561587 microseconds)
ThrId: 1, Tsc: 2044786344 (29788.938317 microseconds)
ThrId: 2, Tsc: 2108335384 (33655.721859 microseconds)
ThrId: 1, Tsc: 2180133472 (28849.669293 microseconds)
ThrId: 2, Tsc: 2241678760 (31300.099845 microseconds)
ThrId: 1, Tsc: 2308451576 (31212.060132 microseconds)
ThrId: 2, Tsc: 2375036576 (31276.298335 microseconds)
ThrId: 1, Tsc: 2441758616 (31215.536420 microseconds)
ThrId: 2, Tsc: 2508351032 (31267.174485 microseconds)
ThrId: 1, Tsc: 2575053608 (31237.901665 microseconds)
ThrId: 2, Tsc: 2641693736 (31255.853111 microseconds)
ThrId: 1, Tsc: 2708372160 (31245.889252 microseconds)
ThrId: 2, Tsc: 2775029328 (31276.189583 microseconds)
ThrId: 1, Tsc: 2841751136 (18778.881644 microseconds)
ThrId: 2, Tsc: 2881812312 (-1350864.296328 microseconds)
Press any key to continue ...


3 threads

Commiting the memory loops... 3 2 1
Starting threads...
Thread #1 started and waiting for synchronization flag...
Thread #2 started and waiting for synchronization flag...
Thread #3 started and waiting for synchronization flag...
All threads started and ready to work, flagging them...
ThrId: 1, Tsc: 1901983592 (31258.440639 microseconds)
ThrId: 2, Tsc: 1968667536 (31243.706728 microseconds)
ThrId: 3, Tsc: 2035320048 (31251.514313 microseconds)
ThrId: 1, Tsc: 2101989216 (31254.746849 microseconds)
ThrId: 2, Tsc: 2168665280 (31248.443030 microseconds)
ThrId: 3, Tsc: 2235327896 (31251.956818 microseconds)
ThrId: 1, Tsc: 2301998008 (31254.038091 microseconds)
ThrId: 2, Tsc: 2368672560 (31251.694315 microseconds)
ThrId: 3, Tsc: 2435342112 (31251.371812 microseconds)
ThrId: 1, Tsc: 2502010976 (31252.346823 microseconds)
ThrId: 2, Tsc: 2568681920 (31250.186799 microseconds)
ThrId: 3, Tsc: 2635348256 (31252.196821 microseconds)
ThrId: 1, Tsc: 2702018880 (31251.971818 microseconds)
ThrId: 2, Tsc: 2768689024 (31250.891807 microseconds)
ThrId: 3, Tsc: 2835356864 (31252.076820 microseconds)
ThrId: 1, Tsc: 2902027232 (31253.385584 microseconds)
ThrId: 2, Tsc: 2968700392 (32028.237809 microseconds)
ThrId: 3, Tsc: 3037026552 (30476.185833 microseconds)
ThrId: 1, Tsc: 3102041704 (31300.549850 microseconds)
ThrId: 2, Tsc: 3168815480 (31437.782601 microseconds)
ThrId: 3, Tsc: 3235882016 (31016.742996 microseconds)
ThrId: 1, Tsc: 3302050344 (31252.950579 microseconds)
ThrId: 2, Tsc: 3368722576 (31482.340588 microseconds)
ThrId: 3, Tsc: 3435884168 (31020.080532 microseconds)
ThrId: 1, Tsc: 3502059616 (31252.913079 microseconds)
ThrId: 2, Tsc: 3568731768 (31303.572383 microseconds)
ThrId: 3, Tsc: 3635511992 (31198.237481 microseconds)
ThrId: 1, Tsc: 3702067504 (90.942245 microseconds)
ThrId: 2, Tsc: 3702261512 (231.943787 microseconds)
ThrId: 3, Tsc: 3702756320 (-1735686.009066 microseconds)
Press any key to continue ...


6 threads

Commiting the memory loops... 3 2 1
Starting threads...
Thread #1 started and waiting for synchronization flag...
Thread #2 started and waiting for synchronization flag...
Thread #3 started and waiting for synchronization flag...
Thread #4 started and waiting for synchronization flag...
Thread #5 started and waiting for synchronization flag...
Thread #6 started and waiting for synchronization flag...
All threads started and ready to work, flagging them...
ThrId: 1, Tsc: 3286510888 (156269.421697 microseconds)
ThrId: 2, Tsc: 3619882008 (124958.236731 microseconds)
ThrId: 3, Tsc: 3886456664 (156258.749080 microseconds)
ThrId: 4, Tsc: 4219805016 (-1950782.957939 microseconds)
ThrId: 5, Tsc: 58180224 (93757.269220 microseconds)
ThrId: 6, Tsc: 258193544 (156256.551556 microseconds)
ThrId: 1, Tsc: 591537208 (156425.453403 microseconds)
ThrId: 2, Tsc: 925241192 (124841.666706 microseconds)
ThrId: 3, Tsc: 1191567168 (156258.910332 microseconds)
ThrId: 4, Tsc: 1524915864 (62505.642405 microseconds)
ThrId: 5, Tsc: 1658259776 (93755.356699 microseconds)
ThrId: 6, Tsc: 1858269016 (156258.332826 microseconds)
ThrId: 1, Tsc: 2191616480 (156412.215759 microseconds)
ThrId: 2, Tsc: 2525292224 (124854.885600 microseconds)
ThrId: 3, Tsc: 2791646400 (156258.899082 microseconds)
ThrId: 4, Tsc: 3124995072 (62570.289363 microseconds)
ThrId: 5, Tsc: 3258476896 (93810.744805 microseconds)
ThrId: 6, Tsc: 3458604296 (156137.795257 microseconds)
ThrId: 1, Tsc: 3791694616 (156416.573306 microseconds)
ThrId: 2, Tsc: 4125379656 (-1888438.237293 microseconds)
ThrId: 3, Tsc: 96755480 (156259.544089 microseconds)
ThrId: 4, Tsc: 430105528 (62504.772396 microseconds)
ThrId: 5, Tsc: 563447584 (93863.631633 microseconds)
ThrId: 6, Tsc: 763687808 (156213.358584 microseconds)
ThrId: 1, Tsc: 1096939328 (156356.036394 microseconds)
ThrId: 2, Tsc: 1430495224 (124848.090526 microseconds)
ThrId: 3, Tsc: 1696834904 (156259.094084 microseconds)
ThrId: 4, Tsc: 2030183992 (62504.907397 microseconds)
ThrId: 5, Tsc: 2163526336 (93755.184197 microseconds)
ThrId: 6, Tsc: 2363535208 (156259.484088 microseconds)
ThrId: 1, Tsc: 2696885128 (156425.854658 microseconds)
ThrId: 2, Tsc: 3030589968 (124841.846708 microseconds)
ThrId: 3, Tsc: 3296916328 (156430.245956 microseconds)
ThrId: 4, Tsc: 3630630536 (62331.666753 microseconds)
ThrId: 5, Tsc: 3763603304 (93756.605463 microseconds)
ThrId: 6, Tsc: 3963615208 (-1857030.312519 microseconds)
ThrId: 1, Tsc: 1993872 (156520.648195 microseconds)
ThrId: 2, Tsc: 335900936 (124745.639405 microseconds)
ThrId: 3, Tsc: 602022056 (156259.457838 microseconds)
ThrId: 4, Tsc: 935371920 (62588.117058 microseconds)
ThrId: 5, Tsc: 1068891776 (93672.229540 microseconds)
ThrId: 6, Tsc: 1268723680 (156258.235324 microseconds)
ThrId: 1, Tsc: 1602070936 (156261.816614 microseconds)
ThrId: 2, Tsc: 1935425832 (125005.273495 microseconds)
ThrId: 3, Tsc: 2202100832 (156258.955332 microseconds)
ThrId: 4, Tsc: 2535449624 (66547.751616 microseconds)
ThrId: 5, Tsc: 2677416608 (89727.381393 microseconds)
ThrId: 6, Tsc: 2868832928 (156328.481093 microseconds)
ThrId: 1, Tsc: 3202330040 (156175.813173 microseconds)
ThrId: 2, Tsc: 3535501464 (125005.993503 microseconds)
ThrId: 3, Tsc: 3802178000 (156260.931604 microseconds)
ThrId: 4, Tsc: 4135531008 (62531.975193 microseconds)
ThrId: 5, Tsc: 4268931096 (-1919560.476443 microseconds)
ThrId: 6, Tsc: 173913536 (156507.905555 microseconds)
ThrId: 1, Tsc: 507793416 (124759.199554 microseconds)
ThrId: 2, Tsc: 773943464 (62501.903615 microseconds)
ThrId: 3, Tsc: 907279400 (62590.052079 microseconds)
ThrId: 4, Tsc: 1040803384 (2384.928585 microseconds)
ThrId: 5, Tsc: 1045891176 (87.525957 microseconds)
ThrId: 6, Tsc: 1046077896 (-490354.377001 microseconds)
Press any key to continue ...



TIMES_TO_SWITCH EQU 10
THREADS_COUNT EQU 2

;==============================================================================
include \masm32\include\masm32rt.inc
.686
;==============================================================================
.data
    hThread1 dd 0
    hThread2 dd 0
    pBuff    dd 0
    nextThr  dd 0
    lok      dd 1
.code


ThrCtl struct DWORD
thrId       dd  ?
switchCnt   dd  ?
ThrCtl ends


;==============================================================================
ThreadProc1 proc uses ebx esi edi lpParameter:ptr ThrCtl

mov esi,lpParameter

mov ecx,[esi].ThrCtl.thrId
mov ebx,pBuff
lea edi,[ecx*8-8]

push ecx
invoke crt_printf,CTXT("Thread #%d started and waiting for synchronization flag...",10),ecx
pop ecx

lock inc nextThr

.while lok
.endw


    .WHILE [esi].ThrCtl.switchCnt

        @loop1:
            cmp ecx,nextThr
        jnz @loop1
       
        xor eax, eax
        push ebx
        push ecx
        cpuid
        pop ecx
        pop ebx
        rdtsc
        mov DWORD PTR [ebx+edi], ecx
        mov DWORD PTR [ebx+edi+4], eax
        lea edi,[edi+THREADS_COUNT*8]

        lea eax,[ecx+1]
        cmp eax,THREADS_COUNT
        jbe @F
        mov eax,1
        @@:

        dec [esi].ThrCtl.switchCnt
        mov nextThr,eax
       
    .ENDW
    ret
ThreadProc1 endp
;==============================================================================

start proc
;==============================================================================
LOCAL hthrs[THREADS_COUNT]:DWORD
LOCAL tdd:DWORD

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    ;--------------------------------------------------------------------------
    ; Minimize interruptions (THESE PRIORITY LEVELS NOT SAFE FOR SINGLE CORE).
    ;--------------------------------------------------------------------------

    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke GetCurrentThread
    invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

    ;-----------------------------------------------------------
    ; Buffer entries will be a dword thread identifier (1 or 2)
    ; followed by the low-order dword of the current TSC.
    ;-----------------------------------------------------------

    mov ebx,TIMES_TO_SWITCH*THREADS_COUNT*8+8

    invoke GlobalAlloc,GMEM_ZEROINIT,ebx
    mov pBuff,eax

    invoke crt_printf,CTXT("Commiting the memory loops...")

    mov esi,3
    @@:
    invoke crt_printf,CTXT(" %lu"),esi
    mov edi,pBuff
    mov ecx,ebx
    shr ecx,2
    xor eax,eax
    rep stosd
    dec esi
    jnz @B

    invoke crt_printf,CTXT(10,"Starting threads...",10)
   
    ;mov esi,THREADS_COUNT
    xor esi,esi
    @@:
    push TIMES_TO_SWITCH
    lea ecx,[esi+1]
    push ecx
    mov edx,esp
    invoke CreateThread, NULL, 0, ThreadProc1, edx, 0, addr tdd
    mov hthrs[esi*4],eax
    inc esi
    cmp esi,THREADS_COUNT
    jb @B

    @@:
    invoke Sleep,10
    cmp nextThr,THREADS_COUNT
    jb @B
    invoke crt_printf,CTXT("All threads started and ready to work, flagging them...",10)

    mov nextThr,1
   
    and lok,0

    invoke WaitForMultipleObjects, THREADS_COUNT, ADDR hthrs, TRUE, INFINITE

    ;--------------------------------------------------------------
    ; Display only buffer entries where thread identifier changed.
    ;--------------------------------------------------------------

    push eax
    push eax
    invoke QueryPerformanceFrequency,esp
    fild qword ptr [esp]
    mov dword ptr [esp],1000000
    fild dword ptr [esp]
    fdivp st(1),st(0)
       
   
    mov ebx, -1
    xor edi, edi
    mov esi, pBuff
  @@:
    and dword ptr [esp+4],0
    mov eax,[esi+edi+8+4]
    mov [esp],eax
    fild qword ptr [esp]
    mov eax,[esi+edi+4]
    mov [esp],eax
    fild qword ptr [esp]
    fsubp st(1),st(0)
    fdiv st(0),st(1)
    fstp qword ptr [esp]       
    invoke crt_printf,CTXT("ThrId: %d, Tsc: %lu (%lf microseconds)",10),dword ptr [esi+edi],dword ptr [esi+edi+4]
    add edi,8
    cmp edi,TIMES_TO_SWITCH*THREADS_COUNT*8
    jb @B
   
  @@:
    pop eax
    pop eax
   
    inkey
    exit
;==============================================================================
start endp
end start



Michael, your idea is very good :t

qWord

Quote from: Antariy on May 14, 2013, 01:55:00 PM
I have some doubts that the timer resolution has any relation to the scheduler work
you're absolutely right, it sounds more than unrealistic...

BTW: http://technet.microsoft.com/en-us/sysinternals/bb963901, Chapter 5 is available as book preview: http://download.microsoft.com/download/1/4/0/14045A9E-C978-47D1-954B-92B9FD877995/97807356648739_SampleChapters.pdf
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

Quote from: Antariy on May 14, 2013, 09:20:13 PM
Michael, your idea is very good :t

I can echo that one :t

My "catch the outlier" approach works, but it's not very practical. What I can observe, though, is that inside a timeslice you encounter about 1-3 outliers, i.e. cycle counts that are higher than expected but not caused by a context switch (which has a much higher cycle count, e.g. 100x). These outliers might be hardware interrupts, but I am not sure about that.

From qWord's link:
QuoteA thread might not get to complete its quantum, however, because Windows implements a preemptive scheduler: if another thread with a higher priority becomes ready to run, the currently running thread might be preempted before finishing its time slice. In fact, a thread can be selected to run next and be preempted before even beginning its quantum!

That may explain occasional shorter time slices but not the outliers mentioned above.

Antariy

Update.

To see whether Sleep(0) works, one can define the DROP_TIMESLICE equate, and then we see that for every thread the timeslice is obviously dropped.


Commiting the memory loops... 3 2 1
Starting threads...
Thread #1 started and waiting for synchronization flag...
Thread #2 started and waiting for synchronization flag...
Thread #3 started and waiting for synchronization flag...
Thread #4 started and waiting for synchronization flag...
Thread #5 started and waiting for synchronization flag...
Thread #6 started and waiting for synchronization flag...
All threads started and ready to work, flagging them...
ThrId: 1, Tsc: 3973525984 (15160 clocks, 7.106328 microseconds)
ThrId: 2, Tsc: 3973541144 (14608 clocks, 6.847575 microseconds)
ThrId: 3, Tsc: 3973555752 (15240 clocks, 7.143828 microseconds)
ThrId: 4, Tsc: 3973570992 (14528 clocks, 6.810074 microseconds)
ThrId: 5, Tsc: 3973585520 (14568 clocks, 6.828825 microseconds)
ThrId: 6, Tsc: 3973600088 (14576 clocks, 6.832575 microseconds)
ThrId: 1, Tsc: 3973614664 (14600 clocks, 6.843825 microseconds)
ThrId: 2, Tsc: 3973629264 (14456 clocks, 6.776324 microseconds)
ThrId: 3, Tsc: 3973643720 (14584 clocks, 6.836325 microseconds)
ThrId: 4, Tsc: 3973658304 (14592 clocks, 6.840075 microseconds)
ThrId: 5, Tsc: 3973672896 (14568 clocks, 6.828825 microseconds)
ThrId: 6, Tsc: 3973687464 (14544 clocks, 6.817575 microseconds)
ThrId: 1, Tsc: 3973702008 (14520 clocks, 6.806324 microseconds)
ThrId: 2, Tsc: 3973716528 (14456 clocks, 6.776324 microseconds)
ThrId: 3, Tsc: 3973730984 (14576 clocks, 6.832575 microseconds)
ThrId: 4, Tsc: 3973745560 (14632 clocks, 6.858825 microseconds)
ThrId: 5, Tsc: 3973760192 (14576 clocks, 6.832575 microseconds)
ThrId: 6, Tsc: 3973774768 (14560 clocks, 6.825075 microseconds)
ThrId: 1, Tsc: 3973789328 (14528 clocks, 6.810074 microseconds)
ThrId: 2, Tsc: 3973803856 (14456 clocks, 6.776324 microseconds)
ThrId: 3, Tsc: 3973818312 (14576 clocks, 6.832575 microseconds)
ThrId: 4, Tsc: 3973832888 (14632 clocks, 6.858825 microseconds)
ThrId: 5, Tsc: 3973847520 (14576 clocks, 6.832575 microseconds)
ThrId: 6, Tsc: 3973862096 (14560 clocks, 6.825075 microseconds)
ThrId: 1, Tsc: 3973876656 (14568 clocks, 6.828825 microseconds)
ThrId: 2, Tsc: 3973891224 (14440 clocks, 6.768824 microseconds)
ThrId: 3, Tsc: 3973905664 (14552 clocks, 6.821325 microseconds)
ThrId: 4, Tsc: 3973920216 (14632 clocks, 6.858825 microseconds)
ThrId: 5, Tsc: 3973934848 (14576 clocks, 6.832575 microseconds)
ThrId: 6, Tsc: 3973949424 (14560 clocks, 6.825075 microseconds)
ThrId: 1, Tsc: 3973963984 (14528 clocks, 6.810074 microseconds)
ThrId: 2, Tsc: 3973978512 (14456 clocks, 6.776324 microseconds)
ThrId: 3, Tsc: 3973992968 (14576 clocks, 6.832575 microseconds)
ThrId: 4, Tsc: 3974007544 (14632 clocks, 6.858825 microseconds)
ThrId: 5, Tsc: 3974022176 (14576 clocks, 6.832575 microseconds)
ThrId: 6, Tsc: 3974036752 (14560 clocks, 6.825075 microseconds)
ThrId: 1, Tsc: 3974051312 (14528 clocks, 6.810074 microseconds)
ThrId: 2, Tsc: 3974065840 (14456 clocks, 6.776324 microseconds)
ThrId: 3, Tsc: 3974080296 (14576 clocks, 6.832575 microseconds)
ThrId: 4, Tsc: 3974094872 (14632 clocks, 6.858825 microseconds)
ThrId: 5, Tsc: 3974109504 (14592 clocks, 6.840075 microseconds)
ThrId: 6, Tsc: 3974124096 (14568 clocks, 6.828825 microseconds)
ThrId: 1, Tsc: 3974138664 (14536 clocks, 6.813825 microseconds)
ThrId: 2, Tsc: 3974153200 (14480 clocks, 6.787574 microseconds)
ThrId: 3, Tsc: 3974167680 (14608 clocks, 6.847575 microseconds)
ThrId: 4, Tsc: 3974182288 (14544 clocks, 6.817575 microseconds)
ThrId: 5, Tsc: 3974196832 (14568 clocks, 6.828825 microseconds)
ThrId: 6, Tsc: 3974211400 (14576 clocks, 6.832575 microseconds)
ThrId: 1, Tsc: 3974225976 (14520 clocks, 6.806324 microseconds)
ThrId: 2, Tsc: 3974240496 (14472 clocks, 6.783824 microseconds)
ThrId: 3, Tsc: 3974254968 (14576 clocks, 6.832575 microseconds)
ThrId: 4, Tsc: 3974269544 (14632 clocks, 6.858825 microseconds)
ThrId: 5, Tsc: 3974284176 (14560 clocks, 6.825075 microseconds)
ThrId: 6, Tsc: 3974298736 (14552 clocks, 6.821325 microseconds)
ThrId: 1, Tsc: 3974313288 (188328 clocks, 88.279716 microseconds)
ThrId: 2, Tsc: 3974501616 (81648 clocks, 38.272919 microseconds)
ThrId: 3, Tsc: 3974583264 (108864 clocks, 51.030558 microseconds)
ThrId: 4, Tsc: 3974692128 (66280 clocks, 31.069090 microseconds)
ThrId: 5, Tsc: 3974758408 (123112 clocks, 57.709381 microseconds)
ThrId: 6, Tsc: 3974881520 (2147483648 clocks, -1863246.091754 microseconds)
Press any key to continue ...



The archive contains the program compiled with the "default" settings - threads count is 2, do not drop the timeslice. It would be interesting to see how it performs on x64 OSes.

The results for it are:

Commiting the memory loops... 3 2 1
Starting threads...
Thread #1 started and waiting for synchronization flag...
Thread #2 started and waiting for synchronization flag...
All threads started and ready to work, flagging them...
ThrId: 1, Tsc: 3190639352 (66646168 clocks, 31240.732946 microseconds)
ThrId: 2, Tsc: 3257285520 (66680104 clocks, 31256.640620 microseconds)
ThrId: 1, Tsc: 3323965624 (66655280 clocks, 31245.004242 microseconds)
ThrId: 2, Tsc: 3390620904 (66685200 clocks, 31259.029396 microseconds)
ThrId: 1, Tsc: 3457306104 (66683048 clocks, 31258.020635 microseconds)
ThrId: 2, Tsc: 3523989152 (66653304 clocks, 31244.077982 microseconds)
ThrId: 1, Tsc: 3590642456 (66661424 clocks, 31247.884274 microseconds)
ThrId: 2, Tsc: 3657303880 (66683016 clocks, 31258.005634 microseconds)
ThrId: 1, Tsc: 3723986896 (66689880 clocks, 31261.223170 microseconds)
ThrId: 2, Tsc: 3790676776 (66651008 clocks, 31243.001720 microseconds)
ThrId: 1, Tsc: 3857327784 (66654632 clocks, 31244.700489 microseconds)
ThrId: 2, Tsc: 3923982416 (66700200 clocks, 31266.060723 microseconds)
ThrId: 1, Tsc: 3990682616 (66639888 clocks, 31237.789163 microseconds)
ThrId: 2, Tsc: 4057322504 (66684024 clocks, 31258.478140 microseconds)
ThrId: 1, Tsc: 4124006528 (66654648 clocks, 31244.707989 microseconds)
ThrId: 2, Tsc: 4190661176 (83116016 clocks, 38961.058637 microseconds)
ThrId: 1, Tsc: 4273777192 (2147483648 clocks, -1989743.356568 microseconds)
ThrId: 2, Tsc: 29037792 (66760952 clocks, 31294.538534 microseconds)
ThrId: 1, Tsc: 95798744 (189608 clocks, 88.879722 microseconds)
ThrId: 2, Tsc: 95988352 (4198978944 clocks, -44995.032133 microseconds)
Press any key to continue ...


Now the TSC (still DWORD sized), the clocks, and the microseconds for a thread timeslice are shown.

Request for a test, please :biggrin:



Quote from: qWord on May 14, 2013, 10:47:52 PM
Quote from: Antariy on May 14, 2013, 01:55:00 PM
I have some doubts that the timer resolution has any relation to the scheduler work
you're absolutely right, it sounds more than unrealistic...

BTW: http://technet.microsoft.com/en-us/sysinternals/bb963901, Chapter 5 is available as book preview: http://download.microsoft.com/download/1/4/0/14045A9E-C978-47D1-954B-92B9FD877995/97807356648739_SampleChapters.pdf

Thank you for the information and the file, qWord! :t (I have not had a chance to check them yet, but I will later.)


Quote from: jj2007 on May 14, 2013, 11:17:17 PM
Quote from: Antariy on May 14, 2013, 09:20:13 PM
Michael, your idea is very good :t

I can echo that one :t

My "catch the outlier" approach works, but it's not very practical. What I can observe, though, is that inside a timeslice you encounter about 1-3 outliers, i.e. cycle counts that are higher than expected but not caused by a context switch (which has a much higher cycle count, e.g. 100x). These outliers might be hardware interrupts, but I am not sure about that.

I think you're right, Jochen :biggrin: Probably it has something to do with the OS's "games" :biggrin:

Antariy

Added the archive to the previous post. I had forgotten to do it when posting ::)

jj2007

Commiting the memory loops... 3 2 1
Starting threads...
Thread #1 started and waiting for synchronization flag...
Thread #2 started and waiting for synchronization flag...
All threads started and ready to work, flagging them...
ThrId: 1, Tsc: 40662020 (73754145 clocks, 2950165.800000 microseconds)
ThrId: 2, Tsc: 114416165 (73754112 clocks, 2950164.480000 microseconds)
ThrId: 1, Tsc: 188170277 (73752661 clocks, 2950106.440000 microseconds)
ThrId: 2, Tsc: 261922938 (73797654 clocks, 2951906.160000 microseconds)
ThrId: 1, Tsc: 335720592 (73710676 clocks, 2948427.040000 microseconds)
ThrId: 2, Tsc: 409431268 (73756159 clocks, 2950246.360000 microseconds)
ThrId: 1, Tsc: 483187427 (73751858 clocks, 2950074.320000 microseconds)
ThrId: 2, Tsc: 556939285 (73755018 clocks, 2950200.720000 microseconds)
ThrId: 1, Tsc: 630694303 (73752243 clocks, 2950089.720000 microseconds)
ThrId: 2, Tsc: 704446546 (73753786 clocks, 2950151.440000 microseconds)
ThrId: 1, Tsc: 778200332 (73754262 clocks, 2950170.480000 microseconds)
ThrId: 2, Tsc: 851954594 (73752758 clocks, 2950110.320000 microseconds)
ThrId: 1, Tsc: 925707352 (73767315 clocks, 2950692.600000 microseconds)
ThrId: 2, Tsc: 999474667 (73741593 clocks, 2949663.720000 microseconds)
ThrId: 1, Tsc: 1073216260 (73752790 clocks, 2950111.600000 microseconds)
ThrId: 2, Tsc: 1146969050 (73754554 clocks, 2950182.160000 microseconds)
ThrId: 1, Tsc: 1220723604 (73753815 clocks, 2950152.600000 microseconds)
ThrId: 2, Tsc: 1294477419 (73753211 clocks, 2950128.440000 microseconds)
ThrId: 1, Tsc: 1368230630 (90926 clocks, 3637.040000 microseconds)
ThrId: 2, Tsc: 1368321556 (2926645740 clocks, -54732862.240000 microseconds)

FORTRANS

CPU frequency: 801 [MHz]
Timer: MinimumResolution : 0.0100144 -> clocks/quantum: 2.67384E+006
Timer: MaximumResolution : 0.0010032 -> clocks/quantum: 267854
Timer: ActualResolution : 0.0100144 -> clocks/quantum: 2.67384E+006
Press any key to continue ...

qWord

I'm really curious about some results here (or am I just unable to interpret them?):
Quote from: jj2007 on May 14, 2013, 11:41:21 PMThrId: 1, Tsc: 40662020 (73754145 clocks, 2950165.800000 microseconds)
--> 2.95 seconds per time slice / quantum? The same strange results for my i7: 32 seconds per quantum  :dazzled:

The following quote is from the above linked book "Windows Internals, Sixth Edition, Microsoft Press, Chapter 5: Processes, Threads, and Jobs" (this chapter is available as book preview):
QuoteQuantum
As mentioned earlier in the chapter, a quantum is the amount of time a thread gets to run before
Windows checks to see whether another thread at the same priority is waiting to run. If a thread
completes its quantum and there are no other threads at its priority, Windows permits the thread to
run for another quantum.
On client versions of Windows, threads run by default for 2 clock intervals; on server systems, by
default, a thread runs for 12 clock intervals. (We'll explain how you can change these values later.) The
rationale for the longer default value on server systems is to minimize context switching. By having
a longer quantum, server applications that wake up as the result of a client request have a better
chance of completing the request and going back into a wait state before their quantum ends.
The length of the clock interval varies according to the hardware platform. The frequency of the
clock interrupts is up to the HAL, not the kernel. For example, the clock interval for most x86 uniprocessors
is about 10 milliseconds (note that these machines are no longer supported by Windows and
are only used here for example purposes), and for most x86 and x64 multiprocessors it is about 15
milliseconds.
This clock interval is stored in the kernel variable KeMaximumIncrement as hundreds of
nanoseconds.
Because thread run-time accounting is based on processor cycles, although threads still run in
units of clock intervals, the system does not use the count of clock ticks as the deciding factor for
how long a thread has run and whether its quantum has expired. Instead, when the system starts up, a calculation is made whose result is the number of clock cycles that each quantum is equivalent to.
(This value is stored in the kernel variable KiCyclesPerClockQuantum.) This calculation is made by multiplying
the processor speed in Hz (CPU clock cycles per second) with the number of seconds it takes
for one clock tick to fire (based on the KeMaximumIncrement value described earlier).
The result of this accounting method is that threads do not actually run for a quantum number
based on clock ticks; they instead run for a quantum target, which represents an estimate of what the
number of CPU clock cycles the thread has consumed should be when its turn would be given up.
This target should be equal to an equivalent number of clock interval timer ticks because, as you just
saw, the calculation of clock cycles per quantum is based on the clock interval timer frequency, which
you can check using the following experiment. On the other hand, because interrupt cycles are not
charged to the thread, the actual clock time might be longer.
[...]
Internally, a quantum unit is represented as one third of a clock tick. (So one clock tick equals three
quantums.) This means that on client Windows systems, threads, by default, have a quantum reset value of 6 (2 * 3), and that server systems have a quantum reset value of 36 (12 * 3). For this reason,
the KiCyclesPerClockQuantum value is divided by three at the end of the calculation previously described,
because the original value describes only CPU clock cycles per clock interval timer tick.

qWord

jj2007

Quote from: qWord on May 15, 2013, 12:45:29 AM
I'm really curious about some results here (or am I just unable to interpret them?):
Quote from: jj2007 on May 14, 2013, 11:41:21 PMThrId: 1, Tsc: 40662020 (73754145 clocks, 2950165.800000 microseconds)
--> 2.95 seconds per time slice / quantum? The same strange results for my i7: 32 seconds

73,754,145 clocks is about 30ms. Probably a comma error somewhere.

qWord

Quote from: jj2007 on May 15, 2013, 01:49:23 AM73,754,145 clocks is about 30ms. Probably a comma error somewhere.
I think the problem is that the code uses QueryPerformanceFrequency to convert the TSC value.

qWord

I've also found another interesting fact:
Quote from: msdn: timeBeginPeriod functionThis function affects a global Windows setting. Windows uses the lowest value (that is, highest resolution) requested by any process. Setting a higher resolution can improve the accuracy of time-out intervals in wait functions. However, it can also reduce overall system performance, because the thread scheduler switches tasks more often. High resolutions can also prevent the CPU power management system from entering power-saving modes

After closing all running applications, the timer resolution on my machine decreased by a factor of 10 (from 1 ms to 10 ms), which means that the quantum length increased by a factor of 10:
QuoteCPU frequency: 2294 [MHz]
Timer: MinimumResolution : 0.0156001 s -> clocks/quantum: 1.19289E+007
Timer: MaximumResolution : 0.0005 s -> clocks/quantum: 382333
Timer: ActualResolution : 0.01 s -> clocks/quantum: 7.64667E+006
Press any key to continue ...

dedndave

i think the real answer is....
the length of a timeslice varies, with a number of contributing factors
not the least of which is the OS version
it seems the scheduler underwent major changes with each version of windows

that's not to say that it can't be discovered and utilized
i would say that, if you can run a test in less than the length of a quantum, you might get more stable results

the variations in results seem to be due to using CPUID to serialize instructions
at least, that's my take on it
now - it may be that the very nature of serializing instructions accounts for the variations in CPUID - lol
sounds like a chicken and egg thing

i have some ideas on serialization that don't involve CPUID
maybe i'll write some test code and post   :biggrin:
give me a few days to finish up this current project

MichaelW

By running at the highest possible priority you can at least minimize the effects of context switches. And although I have no hardware to test this on, I suspect that on a system with multiple physical cores, and with your test confined to a single core, you can practically eliminate the effects of context switches.

Well Microsoft, here's another nice mess you've gotten us into.

jj2007

Quote from: MichaelW on May 15, 2013, 08:13:40 AM... , you can practically eliminate the effects of context switches.

The best option would be to eliminate the context switches themselves. Your two-thread code provides a way to calculate how many test loops are needed to get timings just before the next context switch (with many extra cycles) happens.

Antariy

Quote from: qWord on May 15, 2013, 12:45:29 AM
I'm really curious about some results here (or am I just unable to interpret them?):
Quote from: jj2007 on May 14, 2013, 11:41:21 PMThrId: 1, Tsc: 40662020 (73754145 clocks, 2950165.800000 microseconds)
--> 2.95 seconds per time slice / quantum? The same strange results for my i7: 32 seconds per quantum  :dazzled:

The following quote is from the above linked book "Windows Internals, Sixth Edition, Microsoft Press, Chapter 5: Processes, Threads, and Jobs" (this chapter is available as book preview):


Quote from: jj2007 on May 15, 2013, 01:49:23 AM
Quote from: qWord on May 15, 2013, 12:45:29 AM
I'm really curious about some results here (or am I just unable to interpret them?):
Quote from: jj2007 on May 14, 2013, 11:41:21 PMThrId: 1, Tsc: 40662020 (73754145 clocks, 2950165.800000 microseconds)
--> 2.95 seconds per time slice / quantum? The same strange results for my i7: 32 seconds

73,754,145 clocks is about 30ms. Probably a comma error somewhere.

Thank you for tests and notes! :biggrin:


Quote from: qWord on May 15, 2013, 02:01:59 AM
Quote from: jj2007 on May 15, 2013, 01:49:23 AM73,754,145 clocks is about 30ms. Probably a comma error somewhere.
I think the problem is that the code uses QueryPerformanceFrequency to convert the TSC value.

Hmm... interesting. Maybe some systems return not the CPU frequency but some scaled value derived from it, thank you, qWord! :t I changed the code to calculate the CPU frequency at runtime; it will probably work better than relying on an API ::)


Quote from: qWord on May 15, 2013, 04:02:39 AM
I've also found another interesting fact:
Quote from: msdn: timeBeginPeriod functionThis function affects a global Windows setting. Windows uses the lowest value (that is, highest resolution) requested by any process. Setting a higher resolution can improve the accuracy of time-out intervals in wait functions. However, it can also reduce overall system performance, because the thread scheduler switches tasks more often. High resolutions can also prevent the CPU power management system from entering power-saving modes

After closing all running applications, the timer resolution on my machine decreases by a factor of 10 (from 1ms to 10ms), which means that the quantum length increases by a factor of 10:
QuoteCPU frequency: 2294 [MHz]
Timer: MinimumResolution : 0.0156001 s -> clocks/quantum: 1.19289E+007
Timer: MaximumResolution : 0.0005 s -> clocks/quantum: 382333
Timer: ActualResolution : 0.01 s -> clocks/quantum: 7.64667E+006
Press any key to continue ...

Strange, I played with these functions (timeBeginPeriod and NtSetTimerResolution) yesterday (previous page), but saw no change at all. Probably you're right that this requires closing all apps, and maybe running the prog with admin privileges (did you run it as admin?).


Quote from: dedndave on May 15, 2013, 04:45:26 AM
the variations in results seem to be due to using CPUID to serialize instructions
at least, that's my take on it
now - it may be that the very nature of serializing instructions accounts for the variations in CPUID - lol
sounds like a chicken and egg thing

i have some ideas on serialization that don't involve CPUID
maybe i'll write some test code and post   :biggrin:
give me a few days to finish up this current project

Yes, Dave, it does sound like that :biggrin: In the "Test of timing code" thread you saw that almost all the problems with the timing frame were due to instability in CPUID timings - on some CPUs, like Jochen's, it is very stable, but on most others it has a very high bias (relative to code which runs in just tens of clocks).

It will be interesting to see your idea :t




As for the attached code: I rewrote the timeslice "calculation" code as stand-alone code. It now consists of two procs and has no global vars to rely on, i.e. it is now reentrant (and may be run, for example, in different threads, each set up on its own core) and reusable (copy-paste). I also added conditional compilation flags to suppress info output like "Thread #x started" etc. - in case anyone wants to use it silently in their progs; and, when the conditional compilation flag allows info output, a BOOL flag can switch the output on/off at runtime.

Usage:


LOCAL pBuff:DWORD
...
    invoke calc_timeslice,TIMES_TO_SWITCH,THREADS_COUNT,addr pBuff,1
...
mov eax,pBuff
; EAX now points to a buffer whose data format is described below


First param - TIMES_TO_SWITCH - sets how many times the threads need to be switched. Do not set it too high: if the timeslice is ~31 ms and there are 2 threads, for instance, then 10 switches will take at least ~620 ms to calculate the timeslices.
Second param - THREADS_COUNT - is the count of CPU-greedy concurrent threads to run in the test. The best value is 2 - higher values may produce instability due to a severe lack of CPU time.
Third param is a pointer to a variable which after the call will hold the address of a buffer containing data in this format - an array of structure elements:
ThreadIdentified  dd ?
TimeStampCounter dq ?

Yes, the code now uses full precision for TSC saving, so there should not be any problems with overflows.
The TSC is the value recorded at the moment the thread was switched to, so if you want a timeslice in clocks - just subtract the TSC of the required_thread_id array entry from the required_thread_id+1 array entry.
Since we actually need the length of a timeslice in clocks rather than in seconds, the data array holds clock counts; the code doesn't convert them to milliseconds - that is done by the testing and display code in the "main" proc.
Fourth param is a BOOL value - if it's FALSE, no info will be displayed while the code runs.

The code doesn't set affinity or priority levels - that's left to the user, and there is a lot to experiment with there.


Also updated the display code to work with QWORDs for the TSC and the TSC difference (it's excessive, but OK).

Also changed the CPU frequency detection code to a "handmade" one, in the display part of the prog.

There may be some other changes too, but the most relevant ones are described above.

Thoughts are welcome :t



Commiting the memory loops... 3 2 1
Starting threads...
Thread #1 started and waiting for synchronization flag...
Thread #2 started and waiting for synchronization flag...
All threads started and ready to work, flagging them...
Calculating CPU speed... (Sleep(2000) takes: 2000): ~2115181524 Hz
ThrId: 1, Tsc: DC58E182110 (66240824 clocks, 31316.851 microseconds)
ThrId: 2, Tsc: DC5920AE248 (67112224 clocks, 31728.825 microseconds)
ThrId: 1, Tsc: DC5960AEF68 (66228776 clocks, 31311.155 microseconds)
ThrId: 2, Tsc: DC599FD8190 (66749072 clocks, 31557.136 microseconds)
ThrId: 1, Tsc: DC59DF80420 (66594280 clocks, 31483.955 microseconds)
ThrId: 2, Tsc: DC5A1F02A08 (67835888 clocks, 32070.953 microseconds)
ThrId: 1, Tsc: DC5A5FB41F8 (65500152 clocks, 30966.681 microseconds)
ThrId: 2, Tsc: DC5A9E2B5F0 (67209024 clocks, 31774.589 microseconds)
ThrId: 1, Tsc: DC5ADE43D30 (66131200 clocks, 31265.023 microseconds)
ThrId: 2, Tsc: DC5B1D55230 (67680192 clocks, 31997.345 microseconds)
ThrId: 1, Tsc: DC5B5DE09F0 (65662680 clocks, 31043.520 microseconds)
ThrId: 2, Tsc: DC5B9C7F8C8 (67017584 clocks, 31684.082 microseconds)
ThrId: 1, Tsc: DC5BDC69438 (66318928 clocks, 31353.776 microseconds)
ThrId: 2, Tsc: DC5C1BA8688 (82939136 clocks, 39211.356 microseconds)
ThrId: 1, Tsc: DC5C6AC1388 (50481152 clocks, 23866.109 microseconds)
ThrId: 2, Tsc: DC5C9AE5B88 (67595328 clocks, 31957.223 microseconds)
ThrId: 1, Tsc: DC5CDB5C7C8 (65664232 clocks, 31044.254 microseconds)
ThrId: 2, Tsc: DC5D19FBCB0 (67142456 clocks, 31743.118 microseconds)
Press any key to continue ...