The MASM Forum

General => The Laboratory => Topic started by: guga on December 30, 2015, 08:33:13 AM

Title: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on December 30, 2015, 08:33:13 AM
(http://i64.tinypic.com/11v1sht.jpg)

LIBRARY UPDATED !!!

An updated version of my benchmark library, "CodeTune". The library was made to give programmers an easier and better approach to benchmarking their functions.

The API is simple to use and has been converted for MASM as well. (It was originally made for RosAsm; I ported it to MASM so you guys can use it too.)

Working examples in MASM and RosAsm are in the attached zip file (CodeTuneV1Light.zip).
Due to the size limit on the forum, the complete updated API guide is stored at the link at the end of this post.

The new version includes:

TODO:

The current version's API includes:

I – Functions

1 - Main functionality:

•   CreateTimeProfile
•   CreateTimeProfileEx

2 - Complementary functions:

•   RunTimeDataProc
•   SetupTimeProfiler
•   UserTargetProc

3 - Extras:

•   CpuSettIngs
•   GetCpuFrequencyEx

II – Structures

•   CPUData
•   CT_STANDARD_DEVIATION
•   CT_STDEx
•   CT_Nfo

III – Equates

•   CPU_CPUID_AVALIABLE
•   CPU_RDTSCP_AVALIABLE
•   CPU_RDTSC_AVALIABLE
•   CT_ALGO1
•   CT_ALGO2
•   CT_ALGO3
•   CT_ALGO4
•   CT_ALGO5
•   CT_ALGO6
•   CT_ALGO7
•   CT_ALGO8
•   CT_ALGO_METHOD_ERROR
•   CT_ANALYSIS_ERROR1
•   CT_ANALYSIS_ERROR2
•   CT_ANALYSIS_ERROR3
•   CT_ANALYSIS_ERROR4
•   CT_ANALYSIS_START
•   CT_ANALYSIS_SUCESS
•   CT_BENCHMARK_FINISHED
•   CT_BENCHMARK_RUNNING
•   CT_BENCHMARK_START
•   CT_CALIBRATION_FINISHED
•   CT_CALIBRATION_RUNNING
•   CT_CALIBRATION_START
•   CT_ERROR_BENCHMARK_OVERHEAD
•   CT_ERROR_CALIBRATION_OVERHEAD
•   CT_ERROR_INPUT_VALUE
•   CT_INCONCLUSIVE
•   CT_INSUFFICIENT_FEATURES
•   CT_STATUSCODE_ERROR
•   MAX_ITERATIONS
•   MAX_SAMPLES
•   OVERHEAD_LIMIT

IV – Type Definitions

•   LP_RUNTIMEDATA_CALLBACK_ROUTINE
•   LP_USERTARGET_CALLBACK_ROUTINE



A small tip: to improve the accuracy of the library, it helps to align your code. I don't know how to do a proper alignment in MASM, but the best approach is to align the start of each function in your app to a 16-byte boundary. (I'm not talking about aligning the PE sections, only your functions.) This makes your app work better and also improves the accuracy of the library itself.
Although the library's APIs are already aligned, aligning your own apps this way will probably give you better performance (at least on 32-bit x86; I don't know how it is on 64-bit).
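
To make "aligned to a 16-byte boundary" concrete, here is a minimal sketch of the address arithmetic in Python. The addresses are made up for illustration and the helper names are hypothetical; in an assembler the padding is normally produced by an alignment directive rather than computed by hand.

```python
def align_up(address: int, boundary: int = 16) -> int:
    """Round address up to the next multiple of boundary (a power of two)."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    return (address + boundary - 1) & ~(boundary - 1)


def padding_needed(address: int, boundary: int = 16) -> int:
    """Number of filler bytes required before a function starting at address."""
    return align_up(address, boundary) - address


# Hypothetical example: a function that would otherwise start at 0x401003
start = 0x401003
print(hex(align_up(start)), padding_needed(start))
```

An address that is already a multiple of 16 is returned unchanged, so no padding is added in that case.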


Complete documentation (guide and library) is stored at the link below (I couldn't attach it here due to size limits).

CodeTune Api Reference (http://www.4shared.com/zip/gpEC4Hweba/CodeTuneV1.html)

Mirror

CodeTune Api Reference (http://docdro.id/onPXIMc)

A light version containing only the library and examples is available as an attachment to this post!
Title: Re: CodeTune Timming Analyzer - Beta (functional for marm users)
Post by: guga on January 03, 2016, 03:37:22 AM
A new version of the CodeTune library, ported to MASM. The zip file includes the DLL, an inc file and a lib file already ported to MASM.

I hope I succeeded in making the correct translation to MASM. I'm currently updating the library and building other functions as well, such as an estimate of how long the algo will take to finish, a full analysis that runs all algorithm methods at once, and converters from nanoseconds to other units.

I'm currently looking for a way to make a faster version of qsort. I posted earlier looking for a more specific one, but couldn't find one yet.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 03, 2016, 11:25:06 AM
Good news. In the next version I'll implement CreateTimeProfileEx, which will be able to analyse the data more deeply. It will take into account all the available algo methods and interpret them in a similar way to the CreateTimeProfile function.

In my preliminary tests I had a gain in accuracy of around 15% to 20%! Today, after releasing the library, I improved it a little and gained a bit more accuracy, but now, after finishing the new function CreateTimeProfileEx, the accuracy has increased considerably :)

The side effect is the time spent computing. Using all 8 algorithms that are available on my CPU, the new function takes 8 to 10 times longer to run than CreateTimeProfile. It takes around 8 to 10 seconds to finish analyzing MasmBasic strlen (which reached the amazing mark of only 3,74917033.... nanoseconds, around 10,998..... clock cycles), while CreateTimeProfile takes only 1 to 1,2 seconds to complete.

The time spent versus the number of computations (iterations) in the new function is as follows:
a) 140 million iterations (loops) in 83,6710 seconds. Reached the mark of 3,37 nanoseconds (10,82 clock cycles) for MasmBasic strlen.
b) 14 million iterations (loops) in 10 seconds. Reached the mark of 3,75 nanoseconds (11,01 clock cycles) for MasmBasic strlen.

So between the two there is still a difference of around 1,75% in clock cycles for the extra 126 million loops. In other words, the extra 126 million loops gave a gain of 1,75% in accuracy. This level of accuracy is good enough, because if the results stay stable I will be able to make an error estimation. So, instead of feeding the algo with 3000 samples and 3000 iterations, people can keep using 300 samples x 3000 iterations (or even fewer) so that the function runs faster and still keeps the accuracy intact.
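
The comparison between the two runs can be checked with a few lines of arithmetic. This is only a sketch in Python using the clock-cycle figures quoted above; the function name is hypothetical and is not part of the library.

```python
def relative_difference_percent(reference: float, other: float) -> float:
    """Difference between two measurements as a percentage of the reference."""
    return abs(other - reference) / reference * 100.0


long_run_cycles = 10.82    # run (a): 140 million iterations
short_run_cycles = 11.01   # run (b): 14 million iterations

extra_iterations = 140_000_000 - 14_000_000
diff = relative_difference_percent(long_run_cycles, short_run_cycles)

print(f"{extra_iterations:,} extra iterations change the result by {diff:.2f}%")
```

With these inputs the difference comes out at roughly 1,76%, in line with the figure of about 1,75% given in the post.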

I'm currently working on the new function to include in the library. I'm not sure if I can reduce the time spent, but I'll do my best to keep the high accuracy in a short time.

10 seconds may not seem like much, but this is the time on an i7. On older processors it may take longer to finish. That's why I need to optimize what I can without ruining the rest of the functions inside the library.


Btw: for those who want to take a look at the source code but don't have RosAsm installed, here it is. (Sorry, it is in RosAsm syntax, but that is close to NASM, so the syntax shouldn't be a problem, I hope.)
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 03, 2016, 09:20:32 PM
How to use it?
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 03, 2016, 10:57:19 PM
It is in the API guide in this topic. The link for the guide is:

http://www.4shared.com/zip/Kr64SbPbba/CodeTuneGuide.html

Basically, all you have to do is this:

Example:

   call 'CodeTune.CreateTimeProfile' 300, 3000, CT_ALGO6, MyPointer, Algorithm1, StatusCode


or


   call 'CodeTune.CreateTimeProfile' 300, 3000, CT_ALGO2, 0, Algorithm1, StatusCode



In MASM syntax it should be something like this (I'm not sure about the "ADDR" token. I don't remember the correct MASM syntax, since those tokens are unnecessary in RosAsm):


   invoke CreateTimeProfile, 300, 3000, CT_ALGO6, ADDR MyPointer, ADDR Algorithm1, offset StatusCode


or


   invoke CreateTimeProfile, 300, 3000, CT_ALGO2, 0, ADDR Algorithm1, offset StatusCode


The equate values (CT_ALGO2, CT_ALGO6....) are listed in the guide.

MyPointer = A pointer to a callback function where you can show some information while the function is running. Useful if you plan to use CreateTimeProfile inside a thread.
Algorithm1 = A pointer to the function that holds the code to be tested. It works like the function-pointer parameters of CreateThread or qsort, except that it takes no parameters. It is used as a placeholder for the function you are testing.
StatusCode = A pointer to a variable that will store the current status of the API (whether it is running, whether it has finished, error messages and so on).

Read the guide, and if you are still having problems, let me know and I'll try to make a small example in MASM.

Btw, CreateTimeProfile calibrates itself internally, so you don't have to do anything like that. All you have to do is call this function, choosing the desired method (CT_ALGOXXX equates) and filling in the parameters (1st = samples, 2nd = iterations....).

If I remember the MASM syntax correctly, Algorithm1 (the placeholder for the function to be benchmarked) can be written as:

.data
szText db "Hello World !",0 ; keep the string in the data section so it is never executed as code

.code
Algorithm1 proc ; This function is a placeholder. Void (no parameters)

    invoke strlen, ADDR szText ; <--- function to be tested
    ret

Algorithm1 endp
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 04, 2016, 12:03:30 AM
Thanks, I only found source code. That does not interest me, but profiling does. I will try.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 04, 2016, 12:20:28 AM
Only the source? That's weird  :dazzled:. I uploaded the binary (DLL) and lib/inc files to the 1st post as an attachment. Can't you download it from there? If not, I'll post it somewhere else for you.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 04, 2016, 09:02:13 AM
The source does not interest me, only the tool (dll).
I don't need the source; I will not be able to understand it :(
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 04, 2016, 09:23:28 AM
Hmm... you are not able to download the DLL? What browser are you using? The file (DLL) is attached to the 1st post.

I re-uploaded it here for you in case you can't download it from the board:

http://www.4shared.com/zip/pfkwttEBce/CodeTuneLibrary.html
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 04, 2016, 09:30:18 AM
I have the dll; that is ok.
I downloaded the html file from the link and I got VIRUSES.
Could you send another link ? or post by mail?
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 04, 2016, 09:45:20 AM
Virus??? From 4shared? Are you sure it is not a false alarm? I just downloaded it and found no viruses here.

I'll send it to you by email.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 05, 2016, 02:20:43 AM
OK guys, I finished the new function CreateTimeProfileEx. This function uses all the algorithms available on the user's CPU and provides 2 data sets to be interpreted.

One contains the best mean value, that is, the smallest mean value found using all available algorithms.

The other contains the fastest value found across all of the algorithm methods, that is, the smallest (fastest) values found among the best means described above.

In general, the values of the 1st and 2nd data sets match, meaning that the fastest value was found with the same algorithm that found the best mean... but sometimes they can differ.

Sometimes the fastest value is not found with the same algorithm method as the one that found the best mean. What matters when choosing the most accurate value is always the smallest (fastest) one, regardless of what value was found for the "best" mean.

This is because the algorithms are collecting the actual timings of the tested function. After all the internal interpretation is done, all that is left are the values that best represent the actual time. So the smallest one is the one nearest to the time your function really takes.
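
The difference between the two data sets can be illustrated with a small Python sketch. This is one possible reading of the description above, not the library's code: the `MethodResult` type, the method IDs and all the timing numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MethodResult:
    method: int     # hypothetical algorithm method ID
    mean_ns: float  # mean timing collected with this method
    min_ns: float   # smallest timing collected with this method


def best_mean(results):
    """First data set: the method with the smallest mean."""
    return min(results, key=lambda r: r.mean_ns)


def fastest(results):
    """Second data set: the method holding the smallest single value."""
    return min(results, key=lambda r: r.min_ns)


# Illustrative numbers only
results = [
    MethodResult(method=2, mean_ns=3.81, min_ns=3.76),
    MethodResult(method=6, mean_ns=3.78, min_ns=3.77),
    MethodResult(method=7, mean_ns=3.80, min_ns=3.74),
]

print(best_mean(results).method, fastest(results).method)
```

In this made-up data the best mean comes from method 6 while the fastest value comes from method 7, which is the case where the two data sets do not point at the same method.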

The problem is, I'm not able to put that into English. I mean, what are the proper terms to name those data sets??? Or a proper name for the structure itself? (STDEx2 could be renamed as.... ?)

The representation is like this:

(Note: STDEx2 is what the function returns if it succeeds.)


[STDEx2:
STDEx2.BestTimming: R$ 0 ; smallest value
STDEx2.IDFound: D$ 0 ; the smallest value was found with this method

; Best Mean values
STDEx2.Data1.AlgoMethod: D$ 0 ; the method used to collect the best mean
STDEx2.Data1.Mean: R$ 0.0 ; <---- Below is a simple Standard deviation structure "CT_STANDARD_DEVIATION"
STDEx2.Data1.PopulationStd.Max: R$ 0.0
STDEx2.Data1.PopulationStd.Min: R$ 0.0
STDEx2.Data1.PopulationStd.Variance: R$ 0.0
STDEx2.Data1.PopulationStd.StandardDeviation: R$ 0.0
STDEx2.Data1.SampleStd.Max: R$ 0.0
STDEx2.Data1.SampleStd.Min: R$ 0.0
STDEx2.Data1.Sample.Variance: R$ 0.0
STDEx2.Data1.Sample.StandardDeviation: R$ 0.0

; Fastest value was found on this dataset
STDEx2.Data2.AlgoMethod: D$ 0 ; the fastest value was found with this method
STDEx2.Data2.Mean: R$ 0.0 ; ----> same as before. Simple standard deviation structure for the fastest value found
STDEx2.Data2.PopulationStd.Max: R$ 0.0
STDEx2.Data2.PopulationStd.Min: R$ 0.0
STDEx2.Data2.PopulationStd.Variance: R$ 0.0
STDEx2.Data2.PopulationStd.StandardDeviation: R$ 0.0
STDEx2.Data2.SampleStd.Max: R$ 0.0
STDEx2.Data2.SampleStd.Min: R$ 0.0  ;<------------ In general, the fastest value is always found on this member !
STDEx2.Data2.Sample.Variance: R$ 0.0
STDEx2.Data2.Sample.StandardDeviation: R$ 0.0]


How should Data1 and Data2 be properly named??

Btw: I'm deeply amazed by the speed of Jochen's MasmBasicStrLen. The final time I found for his algo is only 3.74033346275425 ns (around 10,99..... clock cycles). The margin of error of the new function seems to be only 0,7 to 0,8%. Fortunately, I succeeded in decreasing this rate to less than 1% :) So, in JJ's case, it is an error of less than 0,03 clock cycles. :) :)
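
As a quick check of what a 0,8% margin means for these figures, the bound can be worked out in both units. This is only arithmetic on the numbers quoted above, written in Python; the helper name is hypothetical. The bound comes to about 0,03 when expressed in nanoseconds and about 0,09 when expressed in clock cycles.

```python
def error_bound(value: float, percent: float) -> float:
    """Absolute error corresponding to a relative margin given in percent."""
    return value * percent / 100.0


time_ns = 3.74033346275425   # measured time quoted in the post
cycles = 10.99               # approximate clock cycles quoted in the post
margin_percent = 0.8         # upper end of the quoted 0,7 to 0,8% margin

err_ns = error_bound(time_ns, margin_percent)
err_cycles = error_bound(cycles, margin_percent)

print(f"+/- {err_ns:.4f} ns, +/- {err_cycles:.4f} clock cycles")
```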

I only need to decide on the proper terms in English before releasing it. What do you think are the most appropriate terms for Data1, Data2 and STDEx2?
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 05, 2016, 04:01:21 AM
Just a suggestion:
Would it be possible to display your dialog box, passing it the offset of the function to test? That way we could test more algorithms.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 05, 2016, 04:28:45 AM
Ok, i´ll post the examples for testing.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 05, 2016, 08:08:03 AM
I am afraid when I write an answer to your posts. I tell myself "he will think I am never happy". Grumpy!
Having such a dialog would be a great interface in our programs during the test phase, and better than running a profiler.
I suggest a parameter that indicates whether to profile the function or ignore it. That way, under certain circumstances, we could test again in real time even if the program was finished many months ago. I imagine a parameter in an ini file that tells the program it must profile.
A little bit like debuggers do.

(http://www.phrio.biz/imgphr/Grincheux.jpg)
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 05, 2016, 09:36:57 AM
 :icon_mrgreen: No problem :) These comments are good for development and help getting ideas on what to do.

The parameter is a good idea, but this is something for the user to do if he wants to create his own profiler. I mean, I'm focusing more on the library itself than on a dialog or app that uses it. This is because the library seems to be hard to calibrate internally.

For example, I'm writing an example that uses dialogs and displays the result on the screen etc. (as I did before, but this time testing the new function CreateTimeProfileEx). So far, the API was doing an excellent job in terms of accuracy. (I'm using Jochen's algos to test because their results are easier for me to interpret. Since they are extremely fast, they are better for improving the accuracy.) But... I added one single line of code to the dialog, calling sprintf (msvcrt.dll) to display one of the parameters the profiler was using, and it was enough to interfere with the results considerably.

I have absolutely no idea why a single line with
C_call 'msvcrt.sprintf' SzPass, {B$ "Algo method %.d is running", 0}, D$edi+CT_Nfo.AlgoIDDis

in the main executable can simply ruin the accuracy of the profiler DLL.

When this line is added to the executable, Jochen's result jumps from 3.76034787419823 ns to 5.66218814474938 ns!!!!!!

The weird thing is that the executable already uses the sprintf API to display other parameters as well. And if I remove this single line, JJ's value goes back to normal (3,74 ns)!!!

This single line of code is ruining the performance and I have no idea why. I didn't even touch the DLL!

Why the hell does this...

   .If D@dwStatusCode = CT_CALIBRATION_START

        C_call 'msvcrt.sprintf' SzPass, {B$ "Algo method %.d is running", 0}, D$edi+CT_Nfo.AlgoIDDis
        call 'USER32.SetDlgItemTextA' D$ThisWindow, IDC_ALGORUNNING, SzPass

        call 'USER32.SetDlgItemTextA' D$ThisWindow, IDC_WARNING, {B$ "Calibration Started.", 0}
        mov D$CalibrationAlreadyRunning &FALSE
        ;mov D$PreviousGoodSample 0
        call 'user32.SetDlgItemInt' D$ThisWindow, IDC_OVERHEAD, 0, &TRUE

    .Else_If D@dwStatusCode = CT_CALIBRATION_RUNNING



...almost double the results? The same code without these lines brings the profiler back to its normal behaviour, with JJ's value at 3,74 ns:



   .If D@dwStatusCode = CT_CALIBRATION_START

       ; C_call 'msvcrt.sprintf' SzPass, {B$ "Algo method %.d is running", 0}, D$edi+CT_Nfo.AlgoIDDis <--- this line ruins the performance
        ;call 'USER32.SetDlgItemTextA' D$ThisWindow, IDC_ALGORUNNING, SzPass <--- this line doesn't cause any change

        call 'USER32.SetDlgItemTextA' D$ThisWindow, IDC_WARNING, {B$ "Calibration Started.", 0}
        mov D$CalibrationAlreadyRunning &FALSE
        ;mov D$PreviousGoodSample 0
        call 'user32.SetDlgItemInt' D$ThisWindow, IDC_OVERHEAD, 0, &TRUE

    .Else_If D@dwStatusCode = CT_CALIBRATION_RUNNING



Of course, the doubling happens only with Jochen's algo, since it is extremely fast. With slower functions under test, this difference is smaller. But the main problem is... there is a difference. Something is interfering with the results when using functions that adjust the stack (like the msvcrt ones that use _cdecl).

I need to understand exactly what is happening, because since it is a library, it must be as stable as possible, without being influenced by the code that surrounds it, other threads, etc.

One thing I also noticed is that using a visual style theme in the main executable also interferes with the result! And that's not desirable. I need to understand what is going on to try to prevent the library from being influenced by those sorts of things.

So far, i´m clueless  :dazzled:
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 05, 2016, 09:41:29 AM
We will give you a medal

:badgrin: :t
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: fearless on January 05, 2016, 10:12:53 AM
It does make sense that calling other functions whilst in the callback routine would increase the overall time of the operation. I would also imagine the visual styles thing has to go through other layers of code and pass stuff to the dwm (desktop windows management? I forget exactly what it's called) for rendering newer visual control styles, which probably has an impact on the overall performance of the app and controls. Not much you can do about that. The library seems fine as it is; it's down to how the user uses it to convey the information, which might add time to the overall process. I mean, there is no way you can prevent someone putting a call to their own function after CT_CALIBRATION_START that takes ages to run, initializes some memory, copies some strings, sets some controls etc etc. So the call to sprintf is probably adding to the time, I think.

The only thing I can think of off the top of my head is to start the timer stuff after the first callback to CT_CALIBRATION_START, but I think that might be impractical. Or stop the timer at each callback call and resume it after it returns, which again might be more of a headache to code and account for - lol

As for the dialog thing, I would leave that to the user to handle, but sure, you could provide a couple of examples of simple projects with basic dialogs that use the library to show how it can be used.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 05, 2016, 10:14:08 AM
Instead of a medal, what about a beer or a bottle of wine :greensml:

Understanding this is giving me headaches. There is absolutely no reason for this difference; nevertheless, it does exist :(

OK, the API is doing a better job (in terms of accuracy) than simply using cpuid+rdtsc to measure functions, but even with this level of accuracy I really need to understand what the hell 'winblows' is doing to change the results this way.

Stability of the results is a must for a better interpretation. After all, if your processor is built to execute a function in, let's say, 3 nanoseconds (10 clock cycles), there is no reason for this value to go up. The only reason is that something is interfering with the measurement itself. Perhaps some of those Winblows API functions decrease or increase the speed of the processor? Or the processor varies its speed considerably, which may be the cause of these differences.

For example, let's say that while 3000 iterations are being measured the clock speed was at 2,93, but at another stage of the measurements the clock was at 2,73 etc... These variations may interfere with the results, but finding exactly what varies and how is what I'm trying to figure out (presuming it is a problem with variations in the processor frequency, of course).
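
To see how much a frequency change of that size could matter, here is a minimal Python sketch. It assumes the quoted clock speeds are in GHz and that the function always costs the same 10 clock cycles; both are assumptions for illustration, and the helper name is hypothetical.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Time in nanoseconds for a cycle count at a given clock rate (GHz = cycles per ns)."""
    return cycles / clock_ghz


cycles = 10.0                          # hypothetical fixed cost of the function
at_293 = cycles_to_ns(cycles, 2.93)    # clock at 2,93 (assumed GHz)
at_273 = cycles_to_ns(cycles, 2.73)    # clock at 2,73 (assumed GHz)

drift_percent = (at_273 - at_293) / at_293 * 100.0
print(f"{at_293:.3f} ns vs {at_273:.3f} ns, {drift_percent:.2f}% slower")
```

Under these assumptions the same 10 cycles take about 3,41 ns at the higher clock and about 3,66 ns at the lower one, a drift of roughly 7%, which is far larger than the sub-1% margin the library is aiming for.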

Damn...I need a beer or two  :greensml: :greensml: :greensml:
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 05, 2016, 10:30:59 AM
Quote: start the timer stuff after the first callback to CT_CALIBRATION_START,

Makes sense, but the timers are already after CT_CALIBRATION_START (inside the DLL, I mean). I succeeded in isolating the main function "StartBenchmark", which is where it does the main stages I explained in the guide.

I got those results mainly by forcing the processor to have enough time to do the measurements. Dave's strategy of serializing the way he did (a series of push/pop) was great for making the results more stable. I enhanced it by putting them inside their own functions, in pairs, and creating a macro I called BARRIER, which is a loop to itself 10 times. This "hyperserialization" technique ensures that the timings are measured properly.

For instance, in the main DLL you will see this:


Proc StartTime2:
    Uses eax, ebx, ecx, edx

    SERIALIZE3

    cpuid
    rdtsc                     ; read clock counter
    mov D$User.StartTime eax      ; save count
    mov D$User.StartTime+4 edx

    SERIALIZE4

EndP


and the corresponding macros


[BARRIER | mov ecx 10 | Do | Loop_Until_Zero ecx]

[SERIALIZE3 | call SerializeTest1 | call SerializeTest1 | call SerializeTest1 | BARRIER] ; Hyperserialize
[SERIALIZE4 | call SerializeTest2 | call SerializeTest2 | call SerializeTest2 | BARRIER] ; Hyperserialize

[SERIALIZE |
        push    eax
        push    ecx
        push    edx
        push    ebx
        push    esp
        push    ebp
        push    esi
        push    edi]

[SERIALIZE2 |
        pop     eax
        pop     ecx
        pop     edx
        pop     ebx
        pop     eax
        pop     ecx
        pop     esi
        pop     edi]

____________________________________________________________________________________

Proc SerializeTest1:
    Uses eax, ecx, ebx, edx, esi, edi

    SERIALIZE
    SERIALIZE2
    SERIALIZE
    SERIALIZE2
    SERIALIZE
    SERIALIZE2

EndP
____________________________________________________________________________________

Proc SerializeTest2:
    Uses eax, ecx, ebx, edx, esi, edi

    SERIALIZE
    SERIALIZE2
    SERIALIZE
    SERIALIZE2
    SERIALIZE
    SERIALIZE2

EndP



The macros SERIALIZE3 and SERIALIZE4, used at the start and end of the measuring functions, are excellent for ensuring the accuracy of the results. They fixed some adjustment problems with the cpuid opcode, for example.

Quote: I mean there is no way you can prevent someone putting a call to their own function after CT_CALIBRATION_START that takes ages to run, initializes some memory, copies some strings, sets some controls etc etc. So the call to sprintf is probably adding to the time I think.

Indeed, you are probably correct, but there should be a way to keep those external functions from interfering with the result. I could simply add an error rate of 2 nanoseconds (or a percentage of the results, let's say 1%, as an error rate), but then we wouldn't be retrieving the correct values, only a rough estimate.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: fearless on January 05, 2016, 11:07:29 AM
Quote from: guga on January 05, 2016, 10:30:59 AM
there should be a way to make those external functions not interfere the result.

Might have to collect time spent so far, save that, do callback, resume.

So just before the callback, get the time with rdtsc, save it to stoptime, get the difference from starttime, and save it to OverallTime or something. After the callback, get the time with rdtsc, save it to starttime and continue on. So you would only be adding up the time in which the library is doing its thing, and not measuring time whilst the callback occurs. I think ;-)
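
The stop/resume idea described above can be sketched in a few lines of Python. This is only an illustration of the technique, not how the library works: the class and method names are hypothetical, and `time.perf_counter_ns` stands in for rdtsc.

```python
import time


class PausableTimer:
    """Accumulates elapsed time while running and excludes time spent in callbacks."""

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock   # injectable clock, stands in for rdtsc here
        self._start = None    # None means the timer is currently paused
        self.total = 0        # accumulated time, like "OverallTime" in the post

    def start(self):
        self._start = self._clock()

    def stop(self):
        if self._start is not None:
            self.total += self._clock() - self._start
            self._start = None

    def run_callback(self, callback, *args):
        """Pause, run the user's callback, then resume, so its cost is not counted."""
        self.stop()
        try:
            return callback(*args)
        finally:
            self.start()


timer = PausableTimer()
timer.start()
timer.run_callback(lambda: time.sleep(0.01))  # the sleep is not added to timer.total
timer.stop()
```

The clock is passed in as a parameter so the same logic can be checked with a fake clock that returns fixed values.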
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 05, 2016, 01:30:47 PM
Quote: Might have to collect time spent so far, save that, do callback, resume.

so just before callback, get time with rdtsc, save to stoptime, get difference from starttime, save it to OverallTime or something, after callback, get time with rdtsc, save it to starttime and continue on. So you would be only adding those times that the library is doing its thing, and not measuring times whilst callback occurs. I think ;-)

Agreed, but that would apply if I were using the callback while the function is retrieving the timings. I mean, the main functions that use starttime/endtime have as little interference as possible from external functions, callbacks or Windows APIs. (In fact there is no interference. The only external code is the user's target function under analysis.)

The callback is used only after (way after, btw) the time has been collected and analyzed. I did this to make sure that nothing was changing the results.

The callback is stored in a function named "PlaceHolder", which is called after the analysis and the time collection are done. The main function that does the timings is "Do_Benchmark", which makes the calls to starttime/endtime. The main routine looks like this:


    mov D@GoodPassCnt 0

    .Do

        L1:

            ; call DoProcessSleep D@hProcess, &REALTIME_PRIORITY_CLASS, D@hThread, &THREAD_PRIORITY_TIME_CRITICAL ;<-- Useless

            call Do_Benchmark D@AlgoMethod, D@IterationsCount, D@lpStartAddress ; Main routine. Start collecting time and computing differences
            call GetSTDFromSquaredMean TmpBKMean, TmpBKSquaredMean, D@IterationsCount, StandardDeviation ; Do STD from the results
            call GoodPassAnalysis D@GoodPassCnt ; analyze the results from the Standard Deviation above. Pay attention to timings above the equivalent of 1 tick.
; Do not consider all of them !!! Everything above 1 tick (converted to its equivalent in nanosec) generates more accumulation of errors.
; Take care with negative results. Analyze the relation between variance, STD and Mean.
; Set the maximum inaccuracy rate to 1% above or below the mean and the mean found on the previous sample
; (Only increase this value if the previous mean is bigger than the current one, and increase it only at a rate of 0.01% to keep the accuracy as intact as possible).
; Eliminate zero values. Leave everything ready for stage 3.

            If D$OverheadsCount > OVERHEAD_LIMIT
                xor eax eax ; Too many overheads, stalls etc. Inaccurate data was collected, exit.
                ExitP
            End_If

            ;call DoProcessSleep D@hProcess, D@dwPrcsClass, D@hThread, D@dwThrdLevel; <-- Useless

            test eax eax | jz L1<

        ; now we start collecting from xxx (MAX_GOOD_PASSES) "good samples" the one that have the smallest mean
        fld R$TmpBKMean | fstp R$ebx | add ebx 8
        fld R$TmpBKSquaredMean | fstp R$ebx | add ebx 8
        inc D@GoodPassCnt ; The real amount of "good samples" to be analyzed on the 3rd stage. Used here as a simple variable


        If D@lpStartAddress = DoCalibration
            call PlaceHolder D@lpCallBack, CT_Nfo, CT_CALIBRATION_RUNNING ; <--- save to callback
        Else
            mov eax D@GoodPassCnt | mov D$CT_Nfo.GoodSamples eax
            call PlaceHolder D@lpCallBack, CT_Nfo, CT_BENCHMARK_RUNNING; <--- save to callback
        End_If

        ; call DoProcessSleep D@hProcess, D@dwPrcsClass, D@hThread, D@dwThrdLevel ; <--- Totally useless.
; Using "sleep" routines or APIs just makes the timings inaccurate, since it
; amplifies the accumulation of errors while they are being collected.
; Never use the Sleep or SleepEx APIs.
; Using SetPriorityClass/SetThreadPriority and loops to slow down is more effective instead.
; But it still generates errors and increases the overall time to finish :(

    .Loop_Until_Zero D@GoodPasses
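
The routine above passes a mean and a mean of squares to GetSTDFromSquaredMean. The usual way to get a standard deviation from those two values is variance = mean of squares minus squared mean. The Python sketch below shows that identity plus a 1% tolerance check like the one mentioned in the comments. It is an assumption about what such a step could look like, not the library's actual code, and both function names are hypothetical.

```python
import math


def std_from_squared_mean(mean: float, squared_mean: float, n: int):
    """Return (population_std, sample_std) from E[x] and E[x^2] over n values."""
    population_variance = max(squared_mean - mean * mean, 0.0)  # clamp rounding noise
    sample_variance = population_variance * n / (n - 1) if n > 1 else 0.0
    return math.sqrt(population_variance), math.sqrt(sample_variance)


def within_tolerance(mean: float, previous_mean: float, tolerance: float = 0.01) -> bool:
    """True if mean lies within +/- tolerance (1% by default) of the previous mean."""
    return abs(mean - previous_mean) <= tolerance * previous_mean


# Illustrative timings only
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(samples)
mean = sum(samples) / n
squared_mean = sum(x * x for x in samples) / n

population_std, sample_std = std_from_squared_mean(mean, squared_mean, n)
print(population_std, round(sample_std, 4))
```

Keeping only a running sum and a running sum of squares means the individual timings do not have to be stored, which fits the way the loop above saves just two values per good pass.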



These time variations must be coming from somewhere else. I'll give it a try and see whether the clock frequency changes too much when those sorts of functions are used externally. I doubt that is the case, but... who knows?

Damn..i really needed a beer right now :greensml: :greensml: :greensml:
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 05, 2016, 04:12:04 PM
Quote: Stability of the results is a must for a better interpretation. After all, if your processor is built to execute a function in, let's say, 3 nanoseconds (10 clock cycles), there is no reason for this value to go up. The only reason is that something is interfering with the measurement itself. Perhaps some of those Winblows API functions decrease or increase the speed of the processor? Or the processor varies its speed considerably, which may be the cause of these differences.

Sometimes an antivirus can replace the original API functions with its own. Perhaps they are not as well written as the API's own?
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 05, 2016, 04:55:57 PM
I don't know if the problem is the AV somehow influencing the processing speed. One thing I noticed is that it was not only with sprintf. I replaced that API with an itoa function and the problem persists when analyzing JJ's algo. Whenever I touch the main executable, this variation shows up. I'm totally clueless about what may be causing it.

This is happening only with JJ's algo and I have no idea why. I tested other functions that do not use SSE instructions and everything went OK (if it makes any difference or sense, the cause may have something to do with the target's SSE instructions).

I'll reboot and take a nap for a while. I'm too tired to think about this problem today. All I can say is that it is weird, because the new function (CreateTimeProfileEx) is running well, fast and stable, despite this damn difference.  :icon_mrgreen:
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 05, 2016, 10:59:30 PM
After a night of sleep and rebooting the PC   :icon_mrgreen:  the app is running fine  :t

Some app running while I was working was causing the clock frequency to vary. It was probably Firefox or PaintShopPro, which were open and "eating" memory. Or, as Phillipe said, the AV was also eating memory in the background. The odd thing is that it reacted like that only when I used the masmbasicstrlen function to test. Weirdness of "winblows"... as usual  :icon_mrgreen:

I'll finish the executable example for the new function and try to prepare a small example for MASM.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 01:50:15 AM
I'm currently trying to make an example for Masm users.

In the meantime, can someone please test this example that uses the new function?

The problem of yesterday was fixed but, still, i was able to reproduce the same error today. It seems to be a problem with testing fast functions and the usage of a thread. I solved it by putting the callback function before the start of the application, on the line before "Main:" (the "Start" token in masm).

I have no idea why this difference happens if i insert the callback function on another line of code (rather than at the very beginning before start, as explained) but, nevertheless, it works here as expected with the changes to the executable.


But... here is a shot of the good results after the changes to the main executable:
(http://i68.tinypic.com/5a4xo3.jpg)

Btw... what is the BBCode on the forum to resize an image whenever i post one? I tried [img height = "100 width = "100"].... and nothing :(
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: jj2007 on January 06, 2016, 02:30:57 AM
Works fine:
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 06, 2016, 02:41:34 AM
(http://phrio.biz/masmforum/Guga.jpg)
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 02:54:52 AM
Tks for the feedback, guys.

Jochen, 204 seconds on an I5... Seems a bit slow, i guess. Did you measure some of your benchmarks on your I5 with 90 million operations?

I mean, if you use your benchmark tools with masmbasicstrlen forcing it to pass 8 times (or looping 90 million times), do you have an idea of the time it will take to finish ?

I would like to compare the time it takes to work through this huge amount of loops to see if i am able to optimize it.

Phillip, those results seem wrong to me. Try again and let me know the results, please.

QueryPerformanceCounter is, by far, the worst timing measurement technique. For it to have displayed that value, it means that the others were zeroed.

I guess i forgot to put an error message for such cases. If i remember well from other posts, your processor doesn't have all instructions enabled. Methods 1, 2 and 8 are available as far as i can remember, but not the others. Maybe i'm forgetting to include an error message when the algo being tested is not compatible with the processor.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: fearless on January 06, 2016, 03:09:03 AM
(http://i265.photobucket.com/albums/ii214/nwofearless/Misc/th_results_zpsspjusyup.png) (http://s265.photobucket.com/user/nwofearless/media/Misc/results_zpsspjusyup.png.html)
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 03:33:47 AM
Fearless, your results seem accurate, but the total amount of time it is taking to run is way too high. I probably forgot to include an error message somewhere.

I'll take a better look and try to use the app as if my processor didn't support some of the methods to see what happens.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: jj2007 on January 06, 2016, 03:47:35 AM
Quote from: guga on January 06, 2016, 02:54:52 AM
I mean, if you use your benchmark tools with masmbasicstrlen forcing it to pass 8 times (or looping 90 million times), do you have an idea of the time it will take to finish ?

Depends on
- len of string(s)
- position of strings in memory

What do you suggest?
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 04:04:37 AM
This is the way i´m testing.

call MasmBasicStrLen {B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero", 0}

This function is used inside the internal loops. Each algo method used to compute the Mean and STD values passes through it around 9 million times at least (loops, as i saw on your benchmark tools). The amount of loops can be estimated like this: Good Samples*Iterations*3, then multiply the result by 8 (the total number of algorithm methods used).

Since you tested it with 300 samples +3000 iterations, the total amount of loops performed is something around 21600000.
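The estimate above can be written out as a quick calculation. This is only a sketch of the arithmetic in the post, using Python for illustration; the function name and its parameters are made up here and are not part of the CodeTune API.

```python
def estimated_loops(good_samples, iterations, passes=3, methods=8):
    """Good Samples * Iterations * 3, then multiplied by the 8 algo methods."""
    return good_samples * iterations * passes * methods


# 300 samples and 3000 iterations, as in the test run discussed above
print(estimated_loops(300, 3000))  # 21600000
```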

I would like to see how much time it takes to run this amount of loops (using that string passed through your function). This way i can compare it to the results of the profiler and see if it can be optimized.

I'm pretty sure i forgot some error message somewhere, based on the results of Phillip, and maybe it is related to the increased amount of time it is taking to finish on yours and on fearless's.

But if you think that your benchmark tools will take more than 204 seconds to run, no problem. I just wanted to know if this huge amount of loops really is that slow on older processors or if it was me (again) ruining something inside the dll  :icon_mrgreen:
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: jj2007 on January 06, 2016, 05:02:51 AM
Quote from: guga on January 06, 2016, 04:04:37 AM
This is the way i´m testing.

call MasmBasicStrLen {B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero", 0}

This function is used inside the internal loops. Each algo method that uses to compute the Mean and STD values, passes it around 9 million times at least (Loops as i saw on your benchmark tools).  The amount of loops can be estimated like this: Good Samples*Iterations*3 then multiply the result by 8 (that is the total algorithm methods used).

Since you tested it with 300 samples +3000 iterations, the total amount of loops performed is something around 21600000.

With 10 Mio loops, it's 64 cycles per loop for the 100 byte string.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 06:59:19 AM
Tks... and how long did it take to finish running? Did it take more than 200 seconds too?
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 06, 2016, 07:00:29 AM
(http://www.phrio.biz/masmforum/Guga2.jpg)
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: jj2007 on January 06, 2016, 08:06:16 AM
Quote from: guga on January 06, 2016, 06:59:19 AM
Tks...and how long did it take to finish running ? did it take more then 200 seconds too ?

No, it takes about 350 milliseconds.
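For comparison, the 350 ms figure can be scaled to the loop count estimated earlier in the thread. This is a rough sketch only: it assumes the time grows linearly with the number of loops and ignores everything the profiler does besides calling the function (calibration, statistics, thread handling). The constant names are chosen here for illustration.

```python
JJ_LOOPS = 10_000_000          # loops in the benchmark quoted above
JJ_SECONDS = 0.350             # "about 350 milliseconds"
PROFILER_LOOPS = 21_600_000    # loop estimate given earlier in the thread
PROFILER_SECONDS = 204.0       # run time reported for the profiler on the I5


def scaled_seconds(loops, ref_loops=JJ_LOOPS, ref_seconds=JJ_SECONDS):
    """Linear scaling of a reference timing to another loop count."""
    return ref_seconds * loops / ref_loops


expected = scaled_seconds(PROFILER_LOOPS)
ratio = PROFILER_SECONDS / expected
print(f"{expected:.3f} s expected, about {ratio:.0f}x slower observed")
```

Under these assumptions the same number of calls would need well under a second, so most of the 204 seconds would have to come from the profiler itself rather than from the tested function.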
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 09:32:21 AM
350 milliseconds? Hmm.. considering the several functions, it would be acceptable to take 20 seconds to run on yours, but 204 is something i need to look into.


Btw.. here goes a small example in masm. I assembled it with RadAsm. (No fancy stuff... simply a dialog that displays the Mean.) Once i succeed in finding the cause of this huge delay, i'll post other examples of the new function and will update the library.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 09:41:28 AM
Phillip, can you please test with this other version i'm attaching? This one uses only CreateTimeProfile and not CreateTimeProfileEx.

I would like to see what is happening with yours. If possible, please test all available algo methods, especially the last one (QPC). I would like to compare the results from CreateTimeProfile and CreateTimeProfileEx on your AMD.

Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 06, 2016, 12:51:32 PM
Here are the results: 4 algorithms.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 06, 2016, 12:55:29 PM
Your RadAsm test
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 02:00:11 PM
Tks....

Ok, so AMD (or, most likely, the OS) has some serious issues with QPC. From the images, the normal smallest time on your processor is something around 13 ns. On the complete opposite hand, QPC says 0.06 ns.

It is most likely returning 0. If that is the case, then M$ (again) is providing wrong documentation. On msdn it says that QPC never returns 0, but that is a false statement. On the same page there are users complaining about this.

https://msdn.microsoft.com/en-us/library/windows/desktop/ms644904%28v=vs.85%29.aspx

However, their solution of using SetThreadAffinityMask seems senseless. It will only take a huge amount of time to process, i suppose. I tried it before and it takes 10 times longer to finish and the results are not at all accurate (in fact, the results are duplicated using those other APIs). I then inserted these APIs outside the main loops, just before and after the major loops are used, and nothing happened to effectively stabilize the result.

I'll make some tests on this and see if i am able to reproduce the error here. If i cannot reproduce it, i'll simply remove QPC from the list of algo methods and see what happens to the overall performance of the CodeTune API.

Which OS are u using ?

About the RadAsm example: the result is ok. You are getting the correct timing. :t
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 06, 2016, 02:28:06 PM
i'll try to analyse what QPC is doing internally. The API calls into hal.dll, to an API named _KeQueryPerformanceCounter (which, btw, calls HalpAcpiTimerQueryPerfCount). I'll look at what this internal API is doing to see if it can return 0 and how. I guess that the insane amount of time it is taking on some PCs is due to the usage of QPC as well.

On Win7 QPC uses rdtsc without being a part of the system, but on XP it is a system call that also seems to use only rdtsc. http://permalink.gmane.org/gmane.comp.web.chromium.devel/37936
I'm disassembling the library to check whether it is true or not. If QPC simply uses a fancy (bogus, in fact) method of calling rdtsc, i'll simply remove it, since one of the methods already uses only rdtsc.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 06, 2016, 04:28:38 PM
I use Windows 10 Pro

Quote: It certainly seems like QPC could potentially be too costly.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 07, 2016, 01:13:55 AM
Indeed, they were correct. QPC (Quotient of Pile of Crap  :icon_mrgreen:) is only a rdtsc instruction that gets executed after XXX calls until it finally arrives inside hal.dll (probably Win10 is faster).

I'm looking at the way Windows XP did it. The only thing that may be interesting is the usage of the pause instruction. I inserted it immediately before the main loop (after the starttime and endtime do their jobs) and also forced the calibration and benchmark variables (pointers to memory, in fact) to be aligned on a 128 bit boundary.

With these other techniques i got these results:

Pros: The library is now more stable and faster (1/3 of the overall time to finish. Instead of finishing in 10 seconds here, it finishes in 3)
Cons: The accuracy dropped a bit. I mean, the old problem of that difference of timing on JJ's function showed up again :(

Today i'm trying a couple of things with the pause instruction to see if it is really that useful.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 07, 2016, 07:22:37 AM
Guys, can you test this version? I would like to know if the results are closer to normal now. Also, fearless and Jochen, if you can, please test this. I would like to know how long it takes to finish on yours.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 07, 2016, 07:46:47 AM
Here are my results
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: fearless on January 07, 2016, 07:59:41 AM
(http://s18.postimg.org/ik1m1af6h/codeprofiler.png)
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 07, 2016, 08:17:58 AM
 :t :t :t

Finally !!! It seems fixed  :biggrin:

10 seconds on Fearless (against 263 on  the previous test)

and 9 nanoseconds on philipe instead of 0.03  :t

I guess accuracy and running time are fixed now for this function. I'll make a couple more tests before releasing the updates on the 1st post.

I hope that on Jochen's I5 the elapsed time has decreased as well, as it did on Fearless's. Jochen, can you test it on your I5, please?

Many thanks, guys !
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: jj2007 on January 07, 2016, 08:59:17 AM
Guga, where is the latest version?
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 07, 2016, 09:11:24 AM
On this post "Guys, can you test this version ? I would like to know if the results are more closer to normal now. Also, fearless and Jonchen, if you can, please test this. I would want to know how long did it is taking to finish on yours."

But i uploaded it for you here too, in the attachment.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: fearless on January 07, 2016, 09:50:52 AM
(http://s14.postimg.org/v6mgiwtfl/codeprofiler2.png)
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Grincheux on January 07, 2016, 09:57:50 AM
New results
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 07, 2016, 10:14:37 AM
Tks Phillip. It is working correctly on yours and on Fearless :)

I'm currently cleaning up the code (removing comments and some test functions that i changed while trying to figure out what was wrong) before releasing the next version and updating the documentation. I made a true mess of the code while trying to fix those oddities :icon_mrgreen:
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: jj2007 on January 07, 2016, 11:01:07 AM
It works:
The fastest results was found in Algo method: 4

Value: 7.53133083071831 ns

Standard Deviation Results

Mean: 7.92986730296088 ns

Max (STD Population): 8.32423050342501 ns

Min (STD Population): 7.53550410249675 ns

Variance (STD Population): 0.15552233388031 ns

Standard Deviation (STD Population): 0.39436320046413 ns

Max (STD Sample): 8.32840377520345 ns

Min (STD Sample): 7.53133083071831 ns

Variance (STD Sample): 0.15883131970755 ns

Standard Deviation (STD Sample): 0.39853647224257 ns
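
Reading the figures above, the reported values appear to be related in a simple way: each Variance is the square of the matching Standard Deviation, and Max and Min sit exactly one Standard Deviation above and below the Mean. This is an observation from the posted numbers only, not a statement about the library internals. A small Python check (the helper name is made up for illustration):

```python
import math


def consistent(mean, std, variance, maximum, minimum, tol=1e-9):
    """True if variance == std^2, max == mean + std and min == mean - std."""
    return (math.isclose(std ** 2, variance, abs_tol=tol)
            and math.isclose(mean + std, maximum, abs_tol=tol)
            and math.isclose(mean - std, minimum, abs_tol=tol))


MEAN = 7.92986730296088

# "STD Population" block from the results above
population_ok = consistent(MEAN, 0.39436320046413, 0.15552233388031,
                           8.32423050342501, 7.53550410249675)

# "STD Sample" block from the results above
sample_ok = consistent(MEAN, 0.39853647224257, 0.15883131970755,
                       8.32840377520345, 7.53133083071831)

print(population_ok, sample_ok)
```

If this reading is right, the "Value" reported as fastest (7.5313 ns) is the Mean minus one sample Standard Deviation rather than a single raw measurement.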
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: guga on January 07, 2016, 11:16:04 AM
Tks Jochen  :t

How long did it take to finish on your I5 this time? Was it faster than before?
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: Siekmanski on January 07, 2016, 03:56:34 PM
The fastest results was found in Algo method: 8

Value: 3.01931424724331 ns

Standard Deviation Results

Mean: 3.16572185937386 ns

Max (STD Population): 3.31153310400678 ns

Min (STD Population): 3.01991061474095 ns

Variance (STD Population): 0.02126091906140 ns

Standard Deviation (STD Population): 0.14581124463292 ns

Max (STD Sample): 3.31212947150442 ns

Min (STD Sample): 3.01931424724331 ns

Variance (STD Sample): 0.02143518888977 ns

Standard Deviation (STD Sample): 0.14640761213056 ns

Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: jj2007 on January 07, 2016, 06:39:28 PM
Quote from: guga on January 07, 2016, 11:16:04 AM
How long did it took to finish on your I5 this time ? Was it faster then before ?

Yes, much faster, less than ten seconds.
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: TWell on January 07, 2016, 08:25:50 PM
AMD + Win10
The fastest results was found in Algo method: 1
Value: 11.11060643132881 ns
Standard Deviation Results
Mean: 11.42314959020781 ns
Max (STD Population): 11.73505685186878 ns
Min (STD Population): 11.11124232854683 ns
Variance (STD Population): 0.09728613987685 ns
Standard Deviation (STD Population): 0.31190726166098 ns
Max (STD Sample): 11.73569274908680 ns
Min (STD Sample): 11.11060643132881 ns
Variance (STD Sample): 0.09768322616206 ns
Standard Deviation (STD Sample): 0.31254315887900 ns
Title: Re: CodeTune Timming Analyzer - Beta (functional for masm users)
Post by: avcaballero on January 07, 2016, 09:50:37 PM
Hello
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 10, 2016, 08:44:08 AM
Thanks guys


I updated the library in the 1st post.

New version include


TODO:

I also updated the Api Guide in pdf format, but due to the size limitation of the forum i uploaded it to 4shared (link also in the 1st post)

A small tip: to enhance the accuracy of the library it is good to align your code. I don't know how to do a proper alignment in Masm, but the better approach is to align the start of each of your functions inside your app to a 16 byte boundary (i'm not saying to align the PE sections, but to align your functions to a 16 byte boundary). This will make your app work better and also enhance the accuracy of the library itself.
Although the library APIs are already aligned, if you align your own apps that way, you will probably get better performance (at least on x86 32 bits, dunno how it is on 64 bits)
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 11, 2016, 08:10:46 AM
Damn, i found something that is tearing the library apart :(

Has anyone ever tested the Blowfish algorithm ???

Here it only works if i set
sample: 50 to 300
iteration = 100

The result is 86708 nanoseconds !!!!  = 254350 clock cycles !!!!!!!!

Any value of iteration higher than that takes ages to finish.

I realize that BlowFish internally performs a few thousand loops, but is it that slow ????????????????????

Jochen, have you ever tried this algorithm? If you need, i can post a working example of it here. I ported the file from Paul Kocher to asm.

I´m testing the initialization routine (Blowfish_Init) and not the encrypter/decrypter.

How is it possible that a function takes about 0.08 milliseconds to run in a single call ????
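
The two figures quoted above (86708 ns and 254350 clock cycles) can be cross-checked against each other. A minimal sketch in Python, assuming both numbers describe the same single call; the function name is made up for illustration:

```python
def implied_ghz(cycles, nanoseconds):
    """Cycles per nanosecond is numerically the clock rate in GHz."""
    return cycles / nanoseconds


ghz = implied_ghz(254350, 86708)
milliseconds = 86708 / 1e6

print(f"{ghz:.2f} GHz")                 # 2.93 GHz
print(f"{milliseconds:.3f} ms per call")  # 0.087 ms per call
```

So the two values agree with a clock of roughly 2.9 GHz, and 86708 ns is indeed a bit under 0.09 milliseconds per call.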
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: Grincheux on January 11, 2016, 08:35:51 AM
I sent you a mail with the Sph dll, which contains many algorithms. Too big to post here.
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 11, 2016, 08:52:47 AM
Tks philipe. I´ll take a look :)
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 11, 2016, 08:58:30 AM
Hmm..the dll is for x64 only. Do you have a 32 bits version ?
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: Grincheux on January 11, 2016, 09:07:52 AM
I sent you the source and the Pelles C project. Just modify the project to select 32 bits.
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: Grincheux on January 11, 2016, 09:14:17 AM
Another library I used : http://www.libtom.org/ (http://www.libtom.org/) It has a BlowFish algo
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 11, 2016, 11:06:08 AM
I can´t compile it. :(

I truly hate Visual Studio :icon_mrgreen: Whenever i try to compile the blowfish.c from the tom library, it says unreferenced library XXXX.....

Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: Grincheux on January 11, 2016, 03:51:47 PM
Try this link to download Pelles C Compiler : http://www.smorgasbordet.com/pellesc/
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 12, 2016, 04:03:30 AM
Well... whoever "blew the fish" back in the 90's deserves to be buried upside-down in a hole filled with sand, crabs and some ants, to be eaten alive sloooowly  :greensml: :greensml: :greensml: :greensml:

Blowfish value is, in fact, something around 87247 ns but it is taking 990 seconds (300 samples x 3000 iterations = 9 to 27 million loops using all algos in CreateTimeProfileEx not considering the multiplication by millions internal loops of the tested function itself. OUCH !!!) to finish on the Api after i made a couple of preliminary tests to try to stop the hang.
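
As a rough plausibility check of those numbers: dividing the 990 seconds by the 87247 ns per call gives the number of calls that would fit in that time. This is a sketch only; it assumes every call costs the same and that the profiler's own overhead is negligible next to the tested function.

```python
PER_CALL_NS = 87247      # measured time of one Blowfish_Init call
TOTAL_SECONDS = 990      # total run time reported above

calls = TOTAL_SECONDS * 1e9 / PER_CALL_NS
print(f"{calls / 1e6:.1f} million calls")  # 11.3 million calls
```

That lands inside the 9 to 27 million range mentioned above, so the long run time looks consistent with the tested function simply being expensive rather than with a hang.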

I have to make some design choices on the strategy to overcome problems with tested functions that contain endless loops internally. I'm reviewing the good sample criteria to make sure the new changes will not break the library for "normal" functions to be tested.

The criteria i have chosen for a sample to be considered good are that the Variance is smaller than the Mean, and also that the STD/Variance values are smaller than the time the user's processor takes to perform 1 clock cycle, which i take as a basis for enabling or disabling the other good sample criteria.

Concerning the Variance being bigger than the Mean: to overcome the problem with BlowFish i had to use Mean^2 (the Mean squared) so it is more or less comparable to the Variance (the Variance is the square of the STD, so...). It worked, but it is still not the best choice for other functions to be tested.
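
The first criterion can be sketched as follows. This is not the library's code: the function name, the flag and the sample timings are all hypothetical, and Python's population variance stands in for whatever the DLL actually computes.

```python
import statistics


def variance_below_mean(timings_ns, use_mean_squared=False):
    """Good-sample test: Variance < Mean, or Variance < Mean^2 if requested."""
    mean = statistics.fmean(timings_ns)
    variance = statistics.pvariance(timings_ns)
    limit = mean ** 2 if use_mean_squared else mean
    return variance < limit


# Hypothetical timings: a fast function and a heavy, Blowfish-like one
fast = [7.5, 7.9, 8.3]
heavy = [86700, 87200, 87700]

print(variance_below_mean(fast))                          # True
print(variance_below_mean(heavy))                         # False
print(variance_below_mean(heavy, use_mean_squared=True))  # True
```

One reason the plain comparison breaks down for slow functions is units: the Variance is in ns squared while the Mean is in ns, so the test gets stricter as timings grow. Comparing against Mean^2 is the same as requiring STD < Mean, which is scale independent but also much more permissive for fast functions.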

I'm thinking of a general solution that does not ruin the performance and accuracy just because some tested functions have extremely "heavy" computations of their own.
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 13, 2016, 03:38:28 AM
I guess i succeeded in fixing that particular problem. I'm making some tests to be sure before releasing the update.

Also, i'm porting Agner Fog's code to test rdmsr as another algo method. If it really works as expected, then it may be helpful to have more information about what is going on with your code, like branch predictions etc... But i confess that it is hard to port.

Since he uses C and i never converted wdm header files to RosAsm before, it is painful for me to understand but... i'm trying :)

Also i had to create a Fastcall macro, since the ntoskrnl API uses the fastcall calling convention for some APIs.
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 13, 2016, 04:06:58 AM
Btw... has anyone ever used Agner Fog's pcmtest file on XP (32 bits) before?

I can't run startcounters here. Neither can i compile the main file. All i was able to compile was the textB App. :( But i'm unable to understand the complete functionality.

What exactly is this?



SCounterDefinition CounterDefinitions[] = {
    //  id   scheme cpu    countregs eventreg event  mask   name
    {100,  S_P4, PRALL,  4,   7,     0,      9,      7,  "Uops"     }, // uops from any source
    {101,  S_P4, PRALL,  4,   7,     0,      9,      2,  "UopsTC"   }, // uops from trace cache
    {102,  S_P4, PRALL,  4,   7,     0,      9,      1,  "UopsDec"  }, // uops directly from decoder
    {103,  S_P4, PRALL,  4,   7,     0,      9,      4,  "UopsMCode"}, // uops from microcode ROM
    {110,  S_P4, PRALL, 12,  17,     4,      1,      1,  "UopsNB"   }, // uops non-bogus
    {111,  S_P4, PRALL, 12,  17,     4,      2,   0x0c,  "UopsBogus"}, // uops bogus
    {150,  S_P4, PRALL,  8,  11,     1,      4, 0x8000,  "UopsFP"   }, // uops floating point, except move etc.
    {151,  S_P4, PRALL,  8,  11,     1,   0x2e,      8,  "UopsFPMov"}, // uops floating point and SIMD move
    {152,  S_P4, PRALL,  8,  11,     1,   0x2e,   0x10,  "UopsFPLd" }, // uops floating point and SIMD load
    {160,  S_P4, PRALL,  8,  11,     1,      2, 0x8000,  "UopsMMX"  }, // uops 64-bit MMX
    {170,  S_P4, PRALL,  8,  11,     1,   0x1a, 0x8000,  "UopsXMM"  }, // uops 128-bit integer XMM
    {200,  S_P4, PRALL, 12,  17,     5,      6,   0x0f,  "Branch"   }, // branches
    {201,  S_P4, PRALL, 12,  17,     5,      6,   0x0c,  "BrTaken"  }, // branches taken
    {202,  S_P4, PRALL, 12,  17,     5,      6,   0x03,  "BrNTaken" }, // branches not taken
    {203,  S_P4, PRALL, 12,  17,     5,      6,   0x05,  "BrPredict"}, // branches predicted
    {204,  S_P4, PRALL, 12,  17,     4,      3,   0x01,  "BrMispred"}, // branches mispredicted
    {210,  S_P4, PRALL,  4,   7,     2,      5,   0x02,  "CondJMisp"}, // conditional jumps mispredicted
    {211,  S_P4, PRALL,  4,   7,     2,      5,   0x04,  "CallMisp" }, // indirect call mispredicted
    {212,  S_P4, PRALL,  4,   7,     2,      5,   0x08,  "RetMisp"  }, // return mispredicted
    {220,  S_P4, PRALL,  4,   7,     2,      5,   0x10,  "IndirMisp"}, // indirect calls, jumps and returns mispredicted
    {310,  S_P4, PRALL,  0,   3,     0,      3,   0x01,  "TCMiss"   }, // trace cache miss
    {320,  S_P4, PRALL,  0,   3,     7,   0x0c,  0x100,  "Cach2Miss"}, // level 2 cache miss
    {321,  S_P4, PRALL,  0,   3,     7,   0x0c,  0x200,  "Cach3Miss"}, // level 3 cache miss
    {330,  S_P4, PRALL,  0,   3,     3,   0x18,   0x02,  "ITLBMiss" }, // instructions TLB Miss
    {340,  S_P4, PRALL,  0,   3,     2,      3,   0x3a,  "LdReplay" }, // memory load replay


    //  id   scheme cpu    countregs eventreg event  mask   name
    {  9,  S_P1, PRALL,  0,   1,     0,   0x16,        2,  "Instruct" }, // instructions executed
    { 11,  S_P1, PRALL,  0,   1,     0,   0x17,        2,  "InstVpipe"}, // instructions executed in V-pipe
    {202,  S_P1, PRALL,  0,   1,     0,   0x15,        2,  "Flush"    }, // pipeline flush due to branch misprediction or serializing event   
    {310,  S_P1, PRALL,  0,   1,     0,   0x0e,        2,  "CodeMiss" }, // code cache miss
    {311,  S_P1, PRALL,  0,   1,     0,   0x29,        2,  "DataMiss" }, // data cache miss


    //  id   scheme  cpu     countregs eventreg event  mask   name
    {  9, S_P2MC, PRALL,    0,   1,     0,   0xc0,     0,  "Instruct" }, // instructions executed
    { 10, S_P2MC, PRALL,    0,   1,     0,   0xd0,     0,  "IDecode"  }, // instructions decoded
    { 20, S_P2MC, PRALL,    0,   1,     0,   0x80,     0,  "IFetch"   }, // instruction fetches
    { 21, S_P2MC, PRALL,    0,   1,     0,   0x86,     0,  "IFetchStl"}, // instruction fetch stall
    { 22, S_P2MC, PRALL,    0,   1,     0,   0x87,     0,  "ILenStal" }, // instruction length decoder stalls
    {100, S_P2MC, INTEL_PM, 0,   1,     0,   0xc2,     0,  "Uops(F)"  }, // microoperations in fused domain
    {100, S_P2MC, PRALL,    0,   1,     0,   0xc2,     0,  "Uops"     }, // microoperations
    {110, S_P2MC, INTEL_PM, 0,   1,     0,   0xa0,     0,  "Uops(UF)" }, // unfused microoperations submitted to execution units (Undocumented counter!)
    {104, S_P2MC, INTEL_PM, 0,   1,     0,   0xda,     0,  "UopsFused"}, // fused uops
    {115, S_P2MC, INTEL_PM, 0,   1,     0,   0xd3,     0,  "SynchUops"}, // stack synchronization uops
    {121, S_P2MC, PRALL,    0,   1,     0,   0xd2,     0,  "PartRStl" }, // partial register access stall
    {130, S_P2MC, PRALL,    0,   1,     0,   0xa2,     0,  "Rs Stall" }, // all resource stalls
    {201, S_P2MC, PRALL,    0,   1,     0,   0xc9,     0,  "BrTaken"  }, // branches taken
    {204, S_P2MC, PRALL,    0,   1,     0,   0xc5,     0,  "BrMispred"}, // mispredicted branches
    {205, S_P2MC, PRALL,    0,   1,     0,   0xe6,     0,  "BTBMiss"  }, // static branch prediction made
    {310, S_P2MC, PRALL,    0,   1,     0,   0x28,  0x0f,  "CodeMiss" }, // level 2 cache code fetch
    {311, S_P2MC, INTEL_P23,0,   1,     0,   0x29,  0x0f,  "L1D Miss" }, // level 2 cache data fetch
(...)
    //  end of list   
    {0, S_UNKNOWN, PRUNKNOWN, 0,  0,     0,      0,     0,    0     }  // list must end with a record of all 0
};

Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 14, 2016, 05:24:49 AM
I'm making some design choices here.

I succeeded in porting Agner Fog's code to RosAsm (weell.. kind-of  :icon_mrgreen:), but i needed to update RosAsm to allow it to use those MONSTER wdm structures.

Didn't know how weird those structures were until now. But i hope i ported it right. I'll make some tests tonight.
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: dedndave on January 14, 2016, 06:21:02 AM
isn't that the TIB structure ?
it looks like it (thread info block aka thread environment block)

i.e.,
1) you don't have to initialize it
2) all you really need are offsets to work with specific values

i often use something like

    mov     eax,fs:[8]

notice that it's the stack limit - but my code doesn't need to know the name - lol
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 14, 2016, 06:31:34 AM
Hi Dave

No, the driver uses something called KPCR http://masm32.com/board/index.php?topic=5023.0;topicseen

It took me hours to port this beast. I already ported the NT_TIB/TEB structure before, but nothing compares to this monster. The worst thing is that the M$ header i had was utterly wrong. It said that KPCR was a simple structure but, in fact, it contains more than 1700 members !!!

I had to compare the info from 2 different sites and port all the other structures used to build such a thing. I also had to take a look at my Windows Kernel (the one used for M$ research released to the public/students and the leaked one i had - the same used by ReactOS, btw)
http://msdn.moonsols.com/win7rtm_x86/KPCR.html
http://www.nirsoft.net/kernel_struct/vista/KPCR.html

I also made a macro called KeGetCurrentProcessorNumber that behaves the same way as his.

[KPCR.NumberDis 81] ; equate value related to the KPCR structure
; KeGetCurrentProcessorNumber
[KeGetCurrentProcessorNumber | movzx eax B$fs:KPCR.NumberDis | cdq]


The original disassembled file of this was:

mov cl B$fs:051 (in hex)
movzx eax cl
cdq


I hope i succeeded in porting it correctly. :greensml:


Btw, some routines may be useful for you, dave. For example, how to enable/disable rdtsc/rdpmc. I ported it as:

                .Else_If D@command = PMC_ENABLE
                    mov eax CR4 ; Read CR4
                    and eax 0-5 ; Enable RDTSC
                    or eax 0100 ; Enable RDPMC
                    mov CR4 eax ; Write CR4
                .Else_If D@command = PMC_DISABLE
                    mov eax CR4 ; Read CR4
                    and eax 0FFFFFEFF ; 0-257 Disable RDPMC
                    mov CR4 eax ; Write CR4


What i don't know yet is how he found the values after the and/or instructions (-5, 0100, etc.), how he knew the proper bit to enable or disable. I would like to create proper equates for that, instead of using the hexadecimal values (to make it easier to read).
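
For reference, the constants map onto two documented CR4 flags in the Intel manuals: bit 2 is TSD (Time Stamp Disable; when set, RDTSC is restricted to ring 0) and bit 8 is PCE (Performance-Monitoring Counter Enable; when set, RDPMC is allowed at any privilege level). The sketch below only models the bit arithmetic in Python; the names CR4_TSD and CR4_PCE are chosen here for illustration, and the real CR4 can of course only be read or written from ring 0.

```python
CR4_TSD = 1 << 2   # bit 2: Time Stamp Disable
CR4_PCE = 1 << 8   # bit 8: Performance-Monitoring Counter Enable
MASK32 = 0xFFFFFFFF


def enable_pmc(cr4):
    """Clear TSD (RDTSC usable everywhere) and set PCE (RDPMC usable everywhere)."""
    cr4 &= ~CR4_TSD & MASK32
    cr4 |= CR4_PCE
    return cr4


def disable_pmc(cr4):
    """Clear PCE so RDPMC is ring 0 only again."""
    return cr4 & (~CR4_PCE & MASK32)


# The literal values in the ported code are these masks in disguise
assert (~CR4_TSD & MASK32) == (-5 & MASK32) == 0xFFFFFFFB
assert CR4_PCE == 0x100
assert (~CR4_PCE & MASK32) == (-257 & MASK32) == 0xFFFFFEFF
```

In RosAsm the same idea would be two equates for the bit values, with the and-masks written as their complements.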
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: dedndave on January 14, 2016, 06:37:59 AM
the CR (control registers) are defined in the Intel manuals
most of us rarely use them, so that part of the manual is easily skipped   :biggrin:
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 14, 2016, 09:06:38 AM
Found some good info from intel and other sites

http://www.cise.ufl.edu/~sb3/files/pmc.pdf
http://www.rcollins.org/p6/opcodes/RDPMC.html
https://en.wikipedia.org/wiki/Control_register
http://search.luky.org/linux-kernel.2000/msg25962.html

You may want to read it for your functions that read CPU contents :)
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 14, 2016, 12:52:58 PM
Can someone please test to see if i ported Agner Fog's driver correctly?

Here, on an I7, the driver seems to work and the app is reading the rdpmc instruction. I'll look at the app itself (the executable) and try to implement it in the library as well, if it is really working as expected :)
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: Grincheux on January 14, 2016, 02:43:48 PM
Windows cannot launch it

Quote: the application failed to start because its side by side configuration ...
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 14, 2016, 03:22:20 PM
Tks :)

Did you use "teste.bat"?

Perhaps it is not launching because the app was built for Intel and not AMD, or because your CPU doesn't have the RDPMC instruction, I guess. The driver works here on Intel.

Agner Fog's manual says to enable the rdpmc counters with the command startcounters and to disable them with stopcounters. So, in the batch file you must use "guga.exe startcounters" to open them and "guga.exe stopcounters" to close them.

But... Agner Fog says that after enabling the counters you must close them again immediately afterwards (you can't leave them open, as far as I read). So, you can write 2 more batch files for the necessary commands. Example:

File 1 - Create a batch file named "Start.bat" to start the counters:
guga.exe startcounters
pause


File 2 - Create a batch file named "Stop.bat" to close the counters:
guga.exe stopcounters
pause


File 3 - The batch file provided in the zip, to run after you have started the counters (without any command). Name it "Go.bat":
guga.exe
pause


So, if you are using the batch files, you can do this (in order):
a) Click on Start.bat to activate (enable) the counters
b) Click on Go.bat to run and measure the timings
c) Click on Stop.bat to stop (disable) the counters
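
The same three steps can also be folded into one small launcher, so that the counters are always disabled again after the measurement. A minimal sketch in C, assuming guga.exe sits in the current directory and returns 0 on success (an assumption; the exit codes are not stated in this thread). run_counter_session, run_fn and dry_run are hypothetical helpers, not part of Agner Fog's package:

```c
#include <stdio.h>
#include <stdlib.h>

/* Anything that runs a command line and returns its exit status, e.g. system(). */
typedef int (*run_fn)(const char *cmd);

/* Start the counters, run the measurement, then always stop the counters.
   Returns the status of the first step that failed, or 0 if all succeeded. */
int run_counter_session(run_fn run)
{
    int rc = run("guga.exe startcounters");      /* same as Start.bat */
    if (rc != 0)
        return rc;                               /* counters never enabled, nothing to undo */

    rc = run("guga.exe");                        /* same as Go.bat */

    int stop_rc = run("guga.exe stopcounters");  /* same as Stop.bat, runs even if the measurement failed */
    return rc != 0 ? rc : stop_rc;
}

/* Stand-in runner that only prints the command, to try the sequence without the driver. */
int dry_run_calls = 0;

int dry_run(const char *cmd)
{
    printf("would run: %s\n", cmd);
    dry_run_calls++;
    return 0;
}
```

Calling run_counter_session(system) executes the real commands; run_counter_session(dry_run) only lists them in order.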

Btw: guga.exe is Agner Fog's pcmtestB.exe, which I compiled with VS and renamed. What I ported to Asm was the driver (and not the main AF executable).



But... I'm testing it here too and, as far as I can tell, the results of this rdpmc did not impress me at all. In fact, the timings are similar to rdtsc+lfence and way, way worse than the timings measured with the new CreateTimeProfileEx API.
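
For comparison, an "rdtsc+lfence" style measurement usually looks like the following. A minimal sketch in C, assuming an x86 CPU with SSE2 and a GCC/Clang or MSVC compiler; it shows a common fencing pattern, not the code used inside CodeTune or in Agner Fog's test program, and read_tsc_fenced, min_cycles and empty_target are hypothetical names:

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtsc, _mm_lfence */
#else
#include <x86intrin.h>   /* __rdtsc, _mm_lfence */
#endif

/* Read the time-stamp counter with LFENCE on both sides, which keeps RDTSC
   from being reordered with the surrounding instructions
   (documented behaviour on Intel CPUs). */
uint64_t read_tsc_fenced(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Time fn() `runs` times and keep the smallest tick count,
   which discards runs disturbed by interrupts or cold caches. */
uint64_t min_cycles(void (*fn)(void), unsigned runs)
{
    uint64_t best = UINT64_MAX;
    for (unsigned i = 0; i < runs; i++) {
        uint64_t t0 = read_tsc_fenced();
        fn();
        uint64_t t1 = read_tsc_fenced();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Empty target: timing it gives the overhead of the measurement itself. */
void empty_target(void) { }
```

The result is in TSC ticks. Turning it into nanoseconds needs the TSC frequency (ticks divided by the frequency in GHz), and the overhead measured with empty_target should be subtracted from the result for a real function.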

I'm currently reading his manual and trying to port what I can, to analyze whether it will be worthwhile to use it in the library. The timings really didn't impress me, but I'll run some tests on latency, throughput, etc. as described in his files.

If I manage to understand how he computes those things, I can try to do the same without needing the driver (or the rdpmc instruction).

Just so you have an idea of how it measures:

Enabling rdpmc here on my i7 and using it with the CreateTimeProfile API resulted in 5.43 nanoseconds on Jochen's function (a similar result to rdtsc).

When I tested it under the new API of my library (CreateTimeProfileEx), I still got a better time with rdtscp (3.78 nanoseconds).

The only good thing I'm seeing about this is that the results using the driver seem more stable (but not accurate at all). I'm trying to see exactly how all of this works, to find out whether it will really be useful somehow.
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 14, 2016, 03:30:02 PM
About that particular error... I guess I found why:


"This issue is caused by a conflict with some of the files in the 2008 version of the C run-time libraries. These libraries are part of the Visual Studio 2008 release, the version numbers start with 9.0. These libraries may be installed with several different Microsoft and third party products."

https://support.microsoft.com/en-us/kb/2525435

This is one of the reasons why I truly hate VS  :icon_mrgreen:
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: Grincheux on January 14, 2016, 04:19:52 PM
Quote: Did you use "teste.bat"?

YES
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: qWord on January 24, 2016, 06:25:10 AM
guga,

might it be possible to upload your documentation to some place that requires neither registration nor a 1000 s wait time?

regards
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: fearless on January 24, 2016, 10:31:36 AM
I created a chm help file from the original pdf (and from the updated pdf for the newest CreateTimeProfileEx functions and structures).

I put it together a couple of weeks ago and have been meaning to upload it, but got sidetracked with other stuff. It's so that guga can package it with his library, if he wishes to do so, or leave it as an additional extra someone can download if they want to, or ignore it altogether - all valid and totally cool.

I have the source project as well, for guga to use as he wishes, to add/modify etc. I haven't included appendices I and II in the chm file - hadn't got round to adding them.
All credits are to guga, I just put the chm together.

The project was created using RoboHelp HTML X5.0.1, the source is linked here, on dropbox: https://www.dropbox.com/s/9o0ke0lc8aj70tj/CodeTune%20Library%20CHM%20Source.zip?dl=0

This is the dropbox link for the chm file itself, probably most people will opt to download this: https://www.dropbox.com/s/2eh9n3n65woywrn/CodeTune%20Library.chm?dl=0

Hopefully someone will find it useful.
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 24, 2016, 02:31:28 PM
Hi qWord. 1000 secs? Damn... 4shared is making people wait that long when they are not registered?

Try this one  :t

http://docdro.id/onPXIMc

Let me know if this site still makes people wait too much.

Fearless....Huge Thanks !!!!  :t :t :t :t

I'm a bit busy updating a couple of things in RosAsm before the next release. I plan to include an update of CodeTune as well, but I needed to fix some issues in RosAsm that have been there for ages :icon_mrgreen:
I'm also trying to update the interface. (I'm tired of the old visual style of the toolbar.)

This is a hell of a job, because the major functions in RosAsm are too tightly attached to the interface, so it is hard to isolate them without breaking other parts of the code. It is needed, though, since I plan to eventually make some functions able to import and export lib and obj files too. (Not only view their contents, as the LibScanner currently does.)

I succeeded in isolating some major functions related to memory management and built them as a dll (RosMem.dll), but along the way I messed up some parts of RosAsm that used them, and I'm tracing where the errors occur so I can fix them. The good thing is that I managed to fix some allocation problems with huge files that were stopping the debugger from loading files such as Sony Vegas (or the dlls it uses, especially packed ones). Now I can load and create a Sony Vegas plugin and debug the whole program without problems (finally  :biggrin:).
I also improved the disassembler in terms of accuracy and speed, but the inner core of the decoder still needs some adjustments before I try to implement the new SSE3/SSE4 instructions. Since those "new" opcodes are not that relevant at the moment (at this stage of development, I mean), I chose to improve the decoder before adding newer opcodes.
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: qWord on January 24, 2016, 03:20:49 PM
Quote from: guga on January 24, 2016, 02:31:28 PM
Let me know if this site still makes people wait too much.
much better - no wait time nor any other restriction.

The mathematic excursus makes your documentation somehow pleasant, BTW  :P
Title: Re: CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)
Post by: guga on January 24, 2016, 11:47:43 PM
Quote: The mathematic excursus makes your documentation somehow pleasant, BTW  :P
:icon_mrgreen: Many tks

I like those explanations to be as simple as possible. It is much easier to understand that way than simply putting all the formulas in an article.

I studied engineering when I was young (but never graduated; I left it and moved to law school  :greensml:), but it is hard for me to remember all those math symbols and formulas, so I choose to read texts that are more pleasant to read and easier to follow.

Whenever I have to understand some of those sigma, mu, theta and complicated math formulas (integrals, derivatives, etc.) to try to use them in some algos (for video/image enhancement, especially), my mind twists 3 times, does some spins around, burns and finally blows off the top of my head :icon_mrgreen: :icon_mrgreen: :icon_mrgreen: