CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)

guga · January 05, 2016, 09:36:57 AM

:icon_mrgreen: No problem :) These comments are good for development and help getting ideas on what to do.

The parameter is a good idea, but, this is to be done by the user if he wants to create his profiler. I mean, i´m focusing more on the library itself, rather then a dialog or app to use it. This is because it seems that the library is hard to calibrate internally.

For example, i´m writing a example that uses dialogs and displays the result on the screen etc (As i did before, but this time, testing the new functon CreateTimeProfileEx. So far, the api was doing an excellent job in terms of accuracy (I´m using Jochen´s algos to test, because they are better for me to interpret the results. Since tehy are extremelly fast, it is better for me to enhance the accuracy), but...I added one single line of code on the dialog calling sprintf (msvcrt.dll) to it display one of the parameters the profile was using, and it was enough tpo interfer the results considerably.

I have absolutely no idea why a single line with

Code Select

C_call 'msvcrt.sprintf' SzPass, {B$ "Algo method %.d is running", 0}, D$edi+CT_Nfo.AlgoIDDis

on the main executable can simply ruin the acuracy of the profiler dll.

When adding this line to the executable, Jochen´s result jump from 3.76034787419823 ns to 5.66218814474938 ns !!!!!!

The weird is that, the executable do have sprintf Api displaying other parameters as well. And if i remove this single line, JJ´s value back to normal (3,74 ns) !!!

This single line of code is ruining the performance and i have no idea why. I didn´t even touched the dll !

Why the hell this...

Code Select


   .If D@dwStatusCode = CT_CALIBRATION_START

        C_call 'msvcrt.sprintf' SzPass, {B$ "Algo method %.d is running", 0}, D$edi+CT_Nfo.AlgoIDDis
        call 'USER32.SetDlgItemTextA' D$ThisWindow, IDC_ALGORUNNING, SzPass

        call 'USER32.SetDlgItemTextA' D$ThisWindow, IDC_WARNING, {B$ "Calibration Started.", 0}
        mov D$CalibrationAlreadyRunning &FALSE
        ;mov D$PreviousGoodSample 0
        call 'user32.SetDlgItemInt' D$ThisWindow, IDC_OVERHEAD, 0, &TRUE

    .Else_If D@dwStatusCode = CT_CALIBRATION_RUNNING

Is almost duplicating the results ? The code above without this lines brings the profiler back to his normal behaviour with JJ value at 3,74 ns

Code Select



   .If D@dwStatusCode = CT_CALIBRATION_START

       ; C_call 'msvcrt.sprintf' SzPass, {B$ "Algo method %.d is running", 0}, D$edi+CT_Nfo.AlgoIDDis <--- this line ruins the performance
        ;call 'USER32.SetDlgItemTextA' D$ThisWindow, IDC_ALGORUNNING, SzPass <--- this line donp´t cause any change

        call 'USER32.SetDlgItemTextA' D$ThisWindow, IDC_WARNING, {B$ "Calibration Started.", 0}
        mov D$CalibrationAlreadyRunning &FALSE
        ;mov D$PreviousGoodSample 0
        call 'user32.SetDlgItemInt' D$ThisWindow, IDC_OVERHEAD, 0, &TRUE

    .Else_If D@dwStatusCode = CT_CALIBRATION_RUNNING

Of course, the duplication happens only in Jochen´s algo, since it is extremely fast. On slower functions to be tested, this difference is smaller..But, the main problem is...it have a difference. Something is interfering the results when using functions that adjusts the stack (like msvcrt ones that uses _cdecl)

I need to understand what exactly is happening, because since it is a library, it must be the stabler as possible without being influenced by the code that surrounds it, or other threads etc.

One thing that i also noticed is that, if i use a VisualStyle theme on the main executable, it also interferes the result ! And that´s not wishable. I need to understand what is going on to try to prevent the library to be influenced by those sort of things.

So far, i´m clueless

Grincheux · January 05, 2016, 09:41:29 AM

We will give you a medal

:t

fearless · January 05, 2016, 10:12:53 AM

It does make sense that calling other functions whilst in the callback routine would increase the overall time of the operation. I also would imagine the visual styles thing has to go through other layers of code and pass stuff to the dwm (desktop windows management? i forget exactly what its called), for rendering newer visual control styles, which probably has an impact on overall performance of the app / and controls. Not much you can do about that, the library seems fine as it is, its down to how the user uses it to convey the information, that might add time to the overall process. I mean there is no way you can prevent someone putting a call to their own function after CT_CALIBARATION_START that takes ages to run, initializes some memory, copies some strings, sets some controls etc etc. So the call to sprintf is probably adding to the time i think.

Only thing i can think of off the top of my head is to start the timer stuff after the first callback to CT_CALIBARATION_START, but i think that might be inpractical. Or stop timer at each callback call and resume it after it returns from it, which again might be more headache to code and account for - lol

As for the dialog thing, i would leave that to the user to handle, but sure you could provide a couple examples of simple projects with basic dialogs that uses the library to show how it can be used.

guga · January 05, 2016, 10:14:08 AM

Instead a medal, what about a beer of a bottle of wine

Understanding this is giving me headaches. It have absolutelly no reason for this difference. nevertheless, it do exists :(

Ok that the Api is doing a better job (in terms of accuracy) then simply using cpuid+rdtsc to measure functions, but, even with this level of accuracy i really need to understand what a hell 'winblows' is doing to change the results on this manner.

Stability of the results is a must to get a better interpretation of the results. Afterall, if your processor is built to execute a function in let´s say 3 nanoseconds (10 clock cycles), there is no reason why this value goes up. The only reason is that something is interfering in the measuring itself. Perhaps, some of those Winblows Apis functions decreases or increases the speed of the processor ? Or it varies it speed considerably what may be the cause for this differences.

For example, lets say that when 3000 iterations is being measured, the clock speed was at 2,93 but, on another stage of the measures the clock was at 2,73 etc... This variances may interfere the results but...find exactly what and how it varies is what i´m trying to figure it out. (presuming that is some problem with frequency variances of the processor speed, of course).

Damn...I need a beer or two

guga · January 05, 2016, 10:30:59 AM

Quotestart the timer stuff after the first callback to CT_CALIBARATION_START,

makes sense, but the timers are already after CT_CALIBRATION_START (inside the dll, i mean). I succeeded to isolate the main function "StartBenchmark" that is where it do the main stages i explained on the guide.

I got that results mainly forcing the processor to have enough time to do the measures. Dave´s strategy of using a serialization on the way he did (a series of push/pop) was great to make the results more stable. I enhanced it adding them inside their own functions and in pairs and create a macro i called BARRIER that is a loop to itself 10 times. This "hyperserialization" technique ensures that the timings are measured properly.

For instance on the main dll you will see this:

Code Select



Proc StartTime2:
    Uses eax, ebx, ecx, edx

    SERIALIZE3

    cpuid
    rdtsc                     ; read clock counter
    mov D$User.StartTime eax      ; save count
    mov D$User.StartTime+4 edx

    SERIALIZE4

EndP

and the correspondent macros

Code Select



[BARRIER | mov ecx 10 | Do | Loop_Until_Zero ecx]

[SERIALIZE3 | call SerializeTest1 | call SerializeTest1 | call SerializeTest1 | BARRIER] ; Hyperserialize
[SERIALIZE4 | call SerializeTest2 | call SerializeTest2 | call SerializeTest2 | BARRIER] ; Hyperserialize

[SERIALIZE |
        push    eax
        push    ecx
        push    edx
        push    ebx
        push    esp
        push    ebp
        push    esi
        push    edi]

[SERIALIZE2 |
        pop     eax
        pop     ecx
        pop     edx
        pop     ebx
        pop     eax
        pop     ecx
        pop     esi
        pop     edi]

____________________________________________________________________________________

Proc SerializeTest1:
    Uses eax, ecx, ebx, edx, esi, edi

    SERIALIZE
    SERIALIZE2
    SERIALIZE
    SERIALIZE2
    SERIALIZE
    SERIALIZE2

EndP
____________________________________________________________________________________

Proc SerializeTest2:
    Uses eax, ecx, ebx, edx, esi, edi

    SERIALIZE
    SERIALIZE2
    SERIALIZE
    SERIALIZE2
    SERIALIZE
    SERIALIZE2

EndP

The macros SERIALIZE3 and SERIALIZE4, used on the start and end of the measuring functions are excellent to grant accuracy of the results. They fixed some adjustment problems on cpuid opcode for example.

QuoteI mean there is no way you can prevent someone putting a call to their own function after CT_CALIBARATION_START that takes ages to run, initializes some memory, copies some strings, sets some controls etc etc. So the call to sprintf is probably adding to the time i think.

Indeed. probably you are correct, but, there should be a way to make those external functions not interfere the result. I could simply add a error rate of 2 nanoseconds (or a percentage of of the results, let´s say. 1% or a error rate), but, then we wouldn´t retrieving the correct values, but only a raw estimation.

fearless · January 05, 2016, 11:07:29 AM

Quote from: guga on January 05, 2016, 10:30:59 AMthere should be a way to make those external functions not interfere the result.

Might have to collect time spent so far, save that, do callback, resume.

so just before callback, get time with rdstc, save to stoptime, get difference from starttime, save it to OverallTime or something, after callback, get time with rdstc, save it to starttime and continue on. So you would be only adding those times that the library is doing its thing, and not measuring times whilst callback occurs. I think ;-)

guga · January 05, 2016, 01:30:47 PM

QuoteMight have to collect time spent so far, save that, do callback, resume.

so just before callback, get time with rdstc, save to stoptime, get difference from starttime, save it to OverallTime or something, after callback, get time with rdstc, save it to starttime and continue on. So you would be only adding those times that the library is doing its thing, and not measuring times whilst callback occurs. I think ;-)

Agree, but, that if i was using the callback while the function is retrieving the timings. I mean, the main functions that uses starttime/endtime have the less interference as possible of external functions or callbacks or windows Apis. (In fact it has none interference. The only external data is the user´s target function that is under analysis).

The callback is used only after (way after, btw) the time was collected and analyzed. I made this to make sure that nothing was changing the results.

The callback is stored on a function named "PlaceHolder" that is called after the analysis and collecting of time is done. The main function that do the timmings is "Do_Benchmark" from where will make the calls to starttime/endtime. The main routine is like:

Code Select


    mov D@GoodPassCnt 0

    .Do

        L1:

            ; call DoProcessSleep D@hProcess, &REALTIME_PRIORITY_CLASS, D@hThread, &THREAD_PRIORITY_TIME_CRITICAL ;<-- Useless

            call Do_Benchmark D@AlgoMethod, D@IterationsCount, D@lpStartAddress ; Main routine. Start collecting time and computing differences
            call GetSTDFromSquaredMean TmpBKMean, TmpBKSquaredMean, D@IterationsCount, StandardDeviation ; Do STD from the results
            call GoodPassAnalysis D@GoodPassCnt ; analyze the results from the Standard Deviation above. Pay attention to timings above the equivalent of 1 tick.
; Do not consider all of them !!! Everything above 1 tick (converted to his equivalent in nanosec) generates more accumulation of errors.
; Take care on negative results. Analyze relation between variance, STD and Mean.
; Set maximum innacuracy rate to 1% for above or below the mean and the mean found on the previous sample 
;(Only increase this value if previous mean is bigger then the current and increase only at a rate of 0.01% to maintain accuracy intact as possible .
; Eliminate zero values. Leave everything ready to stage 3.

            If D$OverheadsCount > OVERHEAD_LIMIT
                xor eax eax ; Too much overheads, stallings etc. Innacurated data was collected, exit.
                ExitP
            End_If

            ;call DoProcessSleep D@hProcess, D@dwPrcsClass, D@hThread, D@dwThrdLevel; <-- Useless

            test eax eax | jz L1<

        ; now we start collecting from xxx (MAX_GOOD_PASSES) "good samples" the one that have the smallest mean
        fld R$TmpBKMean | fstp R$ebx | add ebx 8
        fld R$TmpBKSquaredMean | fstp R$ebx | add ebx 8
        inc D@GoodPassCnt ; The real amount of "good samples" to be analyzed on the 3rd stage. Used here as a simple variable


        If D@lpStartAddress = DoCalibration
            call PlaceHolder D@lpCallBack, CT_Nfo, CT_CALIBRATION_RUNNING ; <--- save to callback
        Else
            mov eax D@GoodPassCnt | mov D$CT_Nfo.GoodSamples eax
            call PlaceHolder D@lpCallBack, CT_Nfo, CT_BENCHMARK_RUNNING; <--- save to callback
        End_If

        ; call DoProcessSleep D@hProcess, D@dwPrcsClass, D@hThread, D@dwThrdLevel ; <--- Totally useless.
;  Using "sleeps" routines or apis just causes the timings to be inaccurate since it
; potentiate the accumulation of errors while they are being collected.
; Never use Api Sleep or SleepEx.
; Instead using SetPriorityClass/SetThreadPriority and loops to slowdown are more effective.
; But, yet, it generates errors and increases the overall time to finish :(

    .Loop_Until_Zero D@GoodPasses

This variances of time must be resultant somewhere else. I´ll give a try and see if the clock frequency is being changed too much if those sort of functions are being used externally. I doubt it is the case, but...who knows ?

Damn..i really needed a beer right now

Grincheux · January 05, 2016, 04:12:04 PM

QuoteStability of the results is a must to get a better interpretation of the results. Afterall, if your processor is built to execute a function in let´s say 3 nanoseconds (10 clock cycles), there is no reason why this value goes up. The only reason is that something is interfering in the measuring itself. Perhaps, some of those Winblows Apis functions decreases or increases the speed of the processor ? Or it varies it speed considerably what may be the cause for this differences.

Sometimes antivirus can replace original api function with their own one. Perhaps it is not as well written as the ap's one?

guga · January 05, 2016, 04:55:57 PM

I don´t know if the problem is in the AV that is influencing the processing speed somehow. One thing that i noticed is that it was not only with sprintf. I replaced that Api with a itoa function and the problem persists when analyzing JJ´s algo. Whenever i touch the main executable this variance shows up. I´m totally clueless what maybe the cause of that.

This is happenning only on JJ´s algo and i have no idea why. I tested with other functions that does not uses SSE instructions and everything went ok (If it makes any difference/sense think that the cause maybe something with the target´s SSE instructions)

I´ll reboot and take a Knapp for a while. Too tired to think of this problem today. All i can say is that it is weird, because the new function (CreateTimeProfileEx) is running well, fast and with stability, despite this damn difference. :icon_mrgreen:

guga · January 05, 2016, 10:59:30 PM

After a night of sleep and rebooting the PC :icon_mrgreen: the app is running fine :t

Some app was running while i was working causing the clock frequency to vary it´s speed. Probably was firefox or PaintShopPro that were opened "eating" memory. Or as phillipe said, the AV also eating memory of the PC running on the background. The odd thing is that it was reacting like that only when i used masmbasicstrlen function to test. Weirdness of "winblows"... as usual :icon_mrgreen:

I´ll finish the executable example on the new function and try to prepare one small example for masm.

guga · January 06, 2016, 01:50:15 AM

I´m currently trying to make a example for Masm user´s

In the meanwhile, can someone please test this example that uses the new Function ?

The problem of yesterday was fixed but, still, i was able to reproduce the same error today. It seems a problem with testing fast functions and the usage of a thread. I solved it putting the callback function before the Start of the application. The line before "Main: ("Start" token in masm).

I have no idea why this difference is happening if i insert the callback function on other line of code (At the very beginning before start as explained), but, nevertheless it works here as expected with the changes on the executable.

But..here a shot of the good results after the changes on the main executable

Btw..what are the BBCode on the forum for me resize whenever i post a image ? I tried [img height = "100 width = "100"].... and nothing :(

jj2007 · January 06, 2016, 02:30:57 AM

Works fine:

Grincheux · January 06, 2016, 02:41:34 AM

guga · January 06, 2016, 02:54:52 AM

Tks for the feedback, guys.

Jochen, 204 seconds on a I5... Seems a bit slow, i guess. Did you measured on your I5 some of your benchmark with 90 million operations ?

I mean, if you use your benchmark tools with masmbasicstrlen forcing it to pass 8 times (or looping 90 million times), do you have an idea of the time it will take to finish ?

I would like to compare the time it is taking to work on this huge amount of loops to see if i can be able to optimize it.

Phillip, Those results seems wrong to me. Try again and let me know the results, please.

QueryPerformance Counter is by far, the worst timming measure technique. To it have displayed that value i t means that the others were zeroed.

I guess i forgot to put an error message on such cases. If i remember well from other posts, your processor don´t have all instructions enabled. method1, 2 and 8 are available as far i can remember, but not the others. Maybe, i´m forgetting to include a error message when the algo that is being tested is not compatible with the processor.

fearless · January 06, 2016, 03:09:03 AM

The MASM Forum

News:

CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)

guga

Grincheux

fearless

guga

guga

fearless

guga

Grincheux

guga

guga

guga

jj2007

Grincheux

guga

fearless