The MASM Forum

General => The Laboratory => Topic started by: guga on December 16, 2015, 01:56:19 AM

Title: Benchmark with minimum overhead
Post by: guga on December 16, 2015, 01:56:19 AM
Hi guys

I'm building an app that can be used to test the performance of the different benchmark tools we use. I'm testing several different algorithms to measure benchmark performance versus accuracy, to see if we can find a method that comes closer to the real timings a given function takes.

So far, the best performance was found on the start and ending of the computations as:
Method1
; start to calculate the data
    xor eax eax ; serialize
    xor edx edx ; serialize
    xor ecx ecx ; serialize
    xor ebx ebx ; serialize
    rdtscp       ; read clock counter
    lfence
    ; store edx/eax as the start
(user code)
; end of calculate the data
    xor eax eax ; serialize
    xor edx edx ; serialize
    xor ecx ecx ; serialize
    xor ebx ebx ; serialize
    rdtscp       ; read clock counter
    lfence

   ; subtract the start edx:eax from the ending edx:eax to get the delta value used to compute the STD
loop back until XXX iterations

After the loop, compute the STD and analyse the results. Whenever the STD is bigger than 1, start the test all over again. Repeat this major loop about 30-100 times (not sure yet) to make sure you are collecting data with the smallest overhead possible.

After collecting these data, perform another STD pass on them to compute the true mean. This should represent the real time used by your code with a minimum of overhead.




This leaves a small amount of overhead that can be quantified from a standard deviation computation. The smallest STD or variance (close to 0, or smaller than 1) represents the true clock cycles spent by the algo, i.e. with the minimum possible overhead/throughput effects.

Method2
Alternatively, we can compute the overhead before all the major loops (as a calibration), calculating the mean of the variances over a series of iterations. Let's say that after calibrating we have a mean variance of 30000.
Then we can simply subtract it from the resulting STD computed in the main benchmark function.

But this 2nd method may be inaccurate, I guess.

I´ll try finishing it, and post here the results and App to be tested.

Btw... cpuid/rdtsc seems to be the worst, because it produces hundreds of overhead hits (variances in the STD) before finding a value smaller than 1 (for the variance and STD, I mean).
Title: Re: Benchmark with minimum overhead
Post by: guga on December 16, 2015, 07:04:12 AM
Preliminary tests show the stability of the rdtscp+lfence series. Across runs of the app there are no variances; the mean seems to remain the same no matter how many times we use it. (Of course, this is just a preliminary test and I did find one variation in the result, but this can be overcome as soon as I finish the analysis.)

When clicked on 1st time..
(http://i64.tinypic.com/2uonr53.jpg)

And clicked again, right after it :)  Btw... I tested both closing the app and not closing it (only restarting the benchmark, I mean).

(http://i63.tinypic.com/2hn75v9.jpg)

Clicked in sequence about 40 times and found only one major difference in the mean. All the rest remained fixed at 180.56 nanoseconds :) which seems to be the exact amount of time spent by the tested code.
Title: Re: Benchmark with minimum overhead
Post by: guga on December 18, 2015, 08:43:26 AM
Can someone please test the app on other machines?

So far the result is accurate (although a bit slow after a 3000 * 3000 * 500 * 4 internal loop or something :)

I disabled the edit control for increasing/decreasing the iterations, because the 3000 shown is only the total amount of "good results" the app collects. The total amount of iterations is something around the number above (including the calibration).

In this example, it counts the total amount of time (in nanoseconds) that the function memcpy_SSE uses and also tries to compute how many overhead hits are found while trying to stabilize the results.

I'm analysing it for stability of the results. So the algo with the fewest variations is the one that represents the most accurate value.

I'll try optimizing it a bit and review how many "iterations for good passes" are necessary to collect stable results. In previous tests, something around 300 was more than enough for stabilization, but I'll take a look at it later.

Btw... I have not implemented any message saying the app is running some specific part (or enabled a progress bar yet). So if you can tell me how much time it takes to run on your machine, I'd really appreciate it.

This version tests the total amount of time (in nanosecs) that the code below takes to run

Proc memcpy_SSE:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; we are copying memory 128 bits (16 bytes) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
    jz L0> ; The memory block is smaller than 16 bytes. Jump over

        ; Now we must compute the remainder, to see how many times we will loop
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 remainder bytes
        mov edx 0 ; here it is used as an index
        L1:
           ; movlps XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
            ;movhps XMM1 X$esi+edx*8+8 ; copy the 1st 4 dwords from esi to register XMM
            ;movlps X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            ;movhps X$edi+edx*8+8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            movsd | movsd | movsd | movsd


;            movupd XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
;            movupd X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            dec ecx
            ;lea edx D$edx+2
            jnz L1<
        emms ; clear the registers back to use on FPU
        test eax eax | jz L4> ; No remainders ? Exit
        jmp L3> ; jmp to the remainder computation

L0:
   ; If we are here, it means the data is smaller than 16 bytes, and we need to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 remainder bytes

L2:

    ; If the memory is not 4-dword aligned we may have some remainder bytes here. So, just copy them.
    test eax eax | jz L4>  ; No remainders ? Exit
L9:
        lea edi D$edi+edx*8 ; mul edx by 8 to get the pos
        mov eax eax ; fix potential stallings
        lea esi D$esi+edx*8 ; mul edx by 8 to get the pos

L3:  movsb | dec eax | jnz L3<

L4:

EndP
Title: Re: Benchmark with minimum overhead
Post by: guga on December 18, 2015, 09:22:36 AM
On this other version, it counts how much time the operation "xor eax, eax" takes.

On my machine the time varies depending on the tested algorithm. It is something around 0.8 to 1.2 nanosecs (or 3.4 with QueryPerformanceCounter).

The difference in the results represents the interaction of the measuring algorithms with the tested code. After calibrating, it will collect the samples and try to bypass as many overhead hits as possible.

The problem is still the variation with some code combinations. For example, cpuid/rdtsc seems to have problems: although I find some stability after repeated runs, sometimes it finds a value of 0.01 nanosecs and other times 0.52 nanosecs, meaning that the operations involved in computing the time are somehow interfering with the real counting of the code.

Acceptable differences seem to be around 25% above or below each algorithm.

In any case, the resulting timing of "xor eax eax" is something around 1 nanosec, which matches what Agner Fog relates and what is reported in Intel's documents.

The "stability" is measured from the variance, standard deviation and overhead results. The smaller the values, the more stable the counting algorithm.

One thing I'm finding is that, when computing the "end time", it seems there is no need to serialize beforehand with the extra "xor" operations:
    ;xor eax eax                 ; serialize
    ;xor edx edx                 ; serialize
    ;xor ecx ecx                 ; serialize
    ;xor ebx ebx                 ; serialize
    lfence
    rdtscp

If I comment out those lines, the variance/standard deviation decreases considerably and the speed is computed more accurately. I'm trying to fix what I can and make further tests.

When I achieve a rate of less than 10% variance between each of the algorithms, that should represent the best accuracy among them all.
Title: Re: Benchmark with minimum overhead
Post by: hutch-- on December 18, 2015, 10:16:03 AM
Guga,

It does not seem to work properly on my Win7 64. The iteration count is set to 3000 and it cannot be edited. When I hit the benchmark button, it just locks up.
Title: Re: Benchmark with minimum overhead
Post by: guga on December 18, 2015, 10:57:31 AM
Tks for the report, Steve:)

So 3000 is too much. Ok, I'll re-enable the button and add some more controls or messages so the user knows it is running. I'll post another version as soon as I finish.
Title: Re: Benchmark with minimum overhead
Post by: Magnum on December 18, 2015, 11:10:08 AM
It locks up also on my machine.

Intel Core 2 Duo CPU P8600 2.4 GHz
Title: Re: Benchmark with minimum overhead
Post by: guga on December 18, 2015, 09:41:47 PM
Ok guys, new version

Added 2 user inputs:

Input sample: the minimum amount of good samples the algo needs to find before being considered stabilized (something around 100 to 300 should be enough).
Input number of iterations: the total amount of iterations (loops) used to analyze the code being tested. During the loops, as many overhead hits as possible are already discarded from the computation, so you don't have to insert a large number here (otherwise the app will take a looong time to finish the analysis). A number between 300 and 30000 should be enough.

Also added routines to detect the presence of cpuid and rdtscp. If either of those mnemonics is not present on your processor, the corresponding checkboxes are disabled (along with the routines that use them).

Stabilization of the app and the tested algos can be better interpreted when you run the app twice with different iteration values. For example, test once with 3000 and, when it finishes, test again with 30000 and compare the results. Good stabilization seems to happen whenever the difference between them is less than 5%. Ex: the 1st time you run, the resulting mean is 161.5, and on the next one it is 161, etc.


If it is still hanging on newer Windows versions (I'm using XP here), let me know, because perhaps there is a problem with the manifest in the resource section. I'm not sure if Windows Vista or others use the same manifest as XP.

If it is running OK, without hangs, I'll implement the conversion of nanosecs to millisecs and to ticks, and probably fine-tune the "good sample" routines. Maybe it will be necessary to add a final comparison between all algos at once. I mean, perhaps the best approach is to let the app run all the good candidate algos (accuracy x speed) and choose the value that best fits the performance in general.

If it is working now, please let me know the results for each tested algo. That way I can think of a better way to enhance the analysis.

Once all tests are OK, I'll try to make it a DLL so you guys can also use it inside your own apps. That will be better for those who are not used to RosAsm code yet.
Title: Re: Benchmark with minimum overhead
Post by: Siekmanski on December 18, 2015, 11:25:57 PM
Windows 8.1 64bit
Title: Re: Benchmark with minimum overhead
Post by: Grincheux on December 19, 2015, 12:34:33 AM
(http://www.phrio.biz/download/CodeProfiler.jpg)
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 01:07:11 AM
Those are odd results. On both, the CPU frequency is not being retrieved properly. I used QueryPerformanceFrequency to retrieve it, and it seems it is not working as expected on 64-bit Windows and on AMD.

I'll use Dave's algo to retrieve the correct CPU frequency instead. He did excellent work  :t

http://masm32.com/board/index.php?topic=4693.45
Title: Re: Benchmark with minimum overhead
Post by: FORTRANS on December 19, 2015, 01:39:10 AM
Hi,

   Windows 2000, did not run.  The procedure entry point GetNativeSystemInfo could not be located in the dynamic link library KERNEL32.dll.

   Windows XP results in attached JPG.  Took quite a while to run.

Regards,

Steve N.
Title: Re: Benchmark with minimum overhead
Post by: TWell on December 19, 2015, 01:39:59 AM
Another AMD 2.8 GHz
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 02:53:27 AM
Hi Guys, many thanks for the reports

About GetNativeSystemInfo: I completely forgot I had put that routine in RosMem.dll, many thanks. I'll fix it for the next release.

About the time spent: this is because QueryPerformanceFrequency is not working as expected on all OSes or processors. I'm making tests with Dave's function to retrieve the correct CPU frequency, but I found a small problem in his function that is interfering with the results.

I guess I succeeded in overcoming this problem by serializing his function before it starts with a sequence of "xor" operations like this:

GetCpuFrequency:
xor eax eax
xor edx edx
xor ecx ecx
xor ebx ebx
xor edi edi
xor esi esi
    push eax
    push eax
    push esp
(...)

Title: Re: Benchmark with minimum overhead
Post by: mabdelouahab on December 19, 2015, 03:10:08 AM
win 8.1 x64
(http://gdurl.com/DYPo)
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 04:55:31 AM
New version. Fixed the problem on Windows 2000, removed the manifest file (which was causing problems on Win8.1 x64) and tried to fix the slowdown problems and the bad identification of the CPU frequency.

Please, let me know if the errors were fixed. many thanks
Title: Re: Benchmark with minimum overhead
Post by: TWell on December 19, 2015, 06:28:19 AM
AMD again
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 06:54:39 AM
Great, TWell. many thanks.

It seems to work properly on an AMD. Did it take as long this time to finish? On my machine it takes something around 2 seconds. (I'll consider inserting a function to report the total run time only after I reach the stability I want. This is because, if I have to use GetTickCount or other ways to measure it, it may interfere (again) with the computation.)

On my i7 870, 2.93 GHz, it has an average of

162.32416152339977 nanoseconds with Algo1 (Cpuid/rdtsc)
161.72941143718396 nanoseconds with Algo2 (Lfence/rdtsc)
163.20196901594997 nanoseconds with Algo3 (Lfence/rdtsc/lfence)

But I'm still seeing a fluctuation of 1 nanosecond after restarting the counting. These variations are due to some minor interference in the new GetCpuFrequency function and the instability generated by the use of cpuid/rdtsc. On my machine, the 2nd, 5th and 7th algorithms seem the most stable.

I succeeded in improving GetCpuFrequency a bit, but it seems it still has a margin of error of 1 or 2 nanoseconds here and there. Although this is not a problem for timing real functions, it causes errors when analysing short code, such as analysing only "mov eax 0" or "xor eax eax", etc. The timings I was achieving before that change were around 0.6 to 0.8 nanoseconds (which is closer to the measures from Intel and Agner Fog).

Anyway, if the averages you find with the different algos don't vary that much, then the app is working as planned :) I'll continue the analysis here to see if I can get more stability and accuracy.
Title: Re: Benchmark with minimum overhead
Post by: mabdelouahab on December 19, 2015, 06:57:11 AM
win8.1 x64
(http://gdurl.com/mPy4)
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 07:04:11 AM
Great timings, mabdelouahab. Since you have rdtscp on your machine, try Algo7 or 5 and check the results.
I'm surprised it found a variance of 43!!! That wasn't supposed to happen. I'll review the code for collecting the "good samples". A variance of that size means some sample passed through the function that was supposed to bar it.

Although the minimum STD you found was 279 (which may represent the actual time your processor takes to compute it), it is weird to find a variance so high. It is not uncommon for CPUID to cause high variance, but at this level it means I need to review the function used to collect the samples.
Title: Re: Benchmark with minimum overhead
Post by: mabdelouahab on December 19, 2015, 07:19:46 AM
(http://gdurl.com/sxFn)

(http://gdurl.com/lhTv)
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 07:56:05 AM
Many tks. Ok, you got more stability with Algo7. The variance is still high, but I guess the results can help me find what is going on.  :t

The good news is that the differences between the population and sample standard deviations are almost zero, meaning I'm on the right path for collecting the samples to be analysed :) :) :) The main problem here seems to be a matter of fine tuning.
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 08:05:53 AM
But... wait... there's something weird. Why does your CPU frequency have different values? The value at the bottom, retrieved from the CPUID string, says 1.70 GHz, but the value collected by Dave's function says 2.39?

I'll ask Dave to see which values are the correct ones.
Title: Re: Benchmark with minimum overhead
Post by: mabdelouahab on December 19, 2015, 09:22:12 AM
Quote from: guga on December 19, 2015, 08:05:53 AM
But... wait... there's something weird. Why does your CPU frequency have different values? The value at the bottom, retrieved from the CPUID string, says 1.70 GHz, but the value collected by Dave's function says 2.39?

I'll ask Dave to see which values are the correct ones.

Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz 2.40GHz  :redface:


(http://gdurl.com/PwbX)
Title: Re: Benchmark with minimum overhead
Post by: FORTRANS on December 19, 2015, 09:37:38 AM
Hi,

   Windows 2000 results.  Ran in less than a minute.  Much, much
quicker than before.

Cheers,

Steve N.
Title: Re: Benchmark with minimum overhead
Post by: hutch-- on December 19, 2015, 09:50:53 AM
Hi Guga, this one works fine on my Win7 64 box.
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 10:42:26 AM
Hi Guys

many thanks. I´m working on it, trying to improve the accuracy. I´m glad it is working now on other OSes and processors
Title: Re: Benchmark with minimum overhead
Post by: Siekmanski on December 19, 2015, 04:18:11 PM
Windows 8.1 64bit

Algo 5 & 7

Title: Re: Benchmark with minimum overhead
Post by: Grincheux on December 19, 2015, 09:04:07 PM
I try to answer when someone requests tests to run. Now I would like to know the goal of this program. Will it be useful for us?
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 11:24:54 PM
Hi Grincheux

Tks. The info I'm looking for is how fast the different algorithms run. The purpose is to make a benchmark app that is more accurate. The current benchmark apps used to measure time make a raw approximation of the total amount of time your code takes to run. The one I'm developing tries to be more stable and accurate by using different algorithms to measure time, trying to avoid all the overheads that may interfere with the computation of the real time.

Concerning usage: it can be used by developers to measure the time x performance of their apps. For example, in my free time I'm building a plugin for Sony Vegas. The algorithm used to process the image needs to be fast, otherwise the plugin can take hours to finish. So, to know how fast the plugin will run, a better way to measure the timings is needed; and the best way to know how much time it will take is to measure the functions you are programming, so you can try to improve your app.

In image or video processing, for example, the app you create needs to be fast and accurate, depending on the technique you are using. If you are trying to create a color transfer algorithm, for instance, the app demands heavy computation, and optimization may or may not be needed. So, to reach a more reliable estimate of how much time the app will take, you need a better benchmark tool.

And this is the purpose of the app I'm building: trying to make a better approximation of the amount of time your code takes to run and how "stable" it is. What I mean by stability is whether your code can be easily influenced by other parts of the app or not.

To analyse the speed of the code you are benchmarking, you analyse the resulting mean (the minimum values can also be used to analyse the timings).
To analyse the stability, you look at the STD and variance across different runs, to see how much they differ on each execution, and also check whether the tests hit too many "overheads".

The speed of the code you are analysing (the mean) must be as fixed as possible, no matter which algo method you use. After all, the time the mnemonics in your function take to execute is fixed by the processor; the variation is minimal. So, if your function takes 128 clock cycles to run, no matter how many times you test (benchmark) it, the result must always be those 128 clock cycles.

Once I finish the analysis of the different methods, I'll try to choose the ones best suited to measure the timings and create an alternate DLL version for general usage. So you can benchmark your own apps using the DLL; unless you are coding with RosAsm, in which case the executable version will be enough.
Title: Re: Benchmark with minimum overhead
Post by: guga on December 19, 2015, 11:36:24 PM
Many tks Siekmanski

Your results are close to mine :) Minimal variation and STD, and Algo7 is the preferred choice. I'm currently reorganizing the code to make it faster and trying to fix the problem of the huge variances reported by others, which are eventually also being measured.
Title: Re: Benchmark with minimum overhead
Post by: Grincheux on December 20, 2015, 01:34:38 AM
Guga, that means that your program has a kind of interpreter.
I don't use RosAsm but I have an account on your forum.


It's a kind of VTune (Intel) or CodeAnalyst (AMD), better of course.
Title: Re: Benchmark with minimum overhead
Post by: guga on December 21, 2015, 12:20:36 AM
Hi Grincheux,

Tks for the comments :) I never used VTune or CodeAnalyst before, but you are right, the app is a kind of interpreter. The main problem I'm facing now is with tiny pieces of code. It seems that the mnemonics used to measure the timings are not all that accurate when dealing with small chunks of code.

The problem is that when the loop runs before and after the code you are testing, the mnemonics don't have enough time to measure, causing errors in the result.

This is particularly true when the code you test is as simple as "xor eax eax", "mov eax 1", etc. I mean, when you analyse those chunks alone. To compute the correct time, I made a calibration of the algorithm, which is basically the subtraction of a "null" function from the one you are testing. With this, we have 2 measured times: one resulting from the calibration and the other for the code you are analysing. Ex:

Calibration function:
Proc Dummy:
EndP


Benchmarking test:
Proc Benchmark:
      xor eax eax
EndP


Considering that Proc and EndP are macros, they also contain mnemonics. So, in fact, what is being measured is:
Calibration function:
Dummy:
push ebp
mov ebp esp

mov esp ebp
pop ebp
ret


Benchmarking test:
push ebp
mov ebp esp

     xor eax eax

mov esp ebp
pop ebp
ret


Ok, so let's say that at the end of the analysis, the calibration function has an average time of 30 ns and the Benchmark function has an average of 31 ns.

The resulting value is simply the subtraction of the two. So, 31 ns - 30 ns = 1 ns, which is the time the instruction "xor eax eax" alone takes to run.

Well... this is where things start to go wrong. :(

In theory the logic is correct, but in practice the results are the same as if there were no instruction at all to be measured. It is as if "xor eax eax" never existed, leading to a result of 0 or 0.001, etc... much too similar to the timing found in the calibration.
And sometimes the results are even negative :( which is the worst case!

These are due to some margin of error that I must establish in order to correctly handle these variations in timing for very tiny pieces of code.

And for that I need a sort algorithm. So, if anyone can help me with a sort algo, I'll really appreciate it.

The sort algorithm needs to sort an array of structures, each containing 2 double floats (REAL data type; I don't remember if MASM has the same notation as RosAsm or NASM. They are the 8-byte floats). The sort must be keyed on the 1st element of each structure. Ex:

Array:
[MyStructArray:

MyStruct.Data1: R$ 5.25
MyStruct.Data1a: R$ 125689.4

MyStruct.Data2: R$ 15.01
MyStruct.Data2a: R$ 145.5

MyStruct.Data3: R$ 1.25
MyStruct.Data3a: R$ 458]



After sorting the result must be:

[MyStructArray:

MyStruct.Data1: R$ 1.25
MyStruct.Data1a: R$ 458

MyStruct.Data2: R$ 5.25
MyStruct.Data2a: R$ 125689.4

MyStruct.Data3: R$ 15.01
MyStruct.Data3a: R$ 145.5]


Does anyone have a sort algorithm for this? Fast/optimized if possible.

Btw: attached the new version. I hope it is still fast
Title: Re: Benchmark with minimum overhead
Post by: Grincheux on December 21, 2015, 02:52:43 AM
My results : seconds

(http://www.phrio.biz/download/CodeProfiler2.jpg)

Thanks Guga for describing your process.
It's a bit hard to understand.

I ran the test a second time and noticed :

Population Standard deviation

Minimum : 142.36290032751498
Maximum : 142.36298469603162
Variance : 0.00000000177951
Standard deviation : 0.00004218425831
Title: Re: Benchmark with minimum overhead
Post by: TWell on December 21, 2015, 03:04:20 AM
@Grincheux
Could you run cpu_z (http://www.cpuid.com/softwares/cpu-z.html) to see what it tells about your CPU speed?
AMD X2 250 is locked to 3.0 GHz?
Title: Re: Benchmark with minimum overhead
Post by: Grincheux on December 21, 2015, 03:38:25 AM
CPU-Z result (http://valid.x86.fr/r1d6vy)
Title: Re: Benchmark with minimum overhead
Post by: Grincheux on December 21, 2015, 03:40:51 AM

Image result
Title: Re: Benchmark with minimum overhead
Post by: Grincheux on December 21, 2015, 03:45:29 AM
My results
Title: Re: Benchmark with minimum overhead
Post by: guga on December 21, 2015, 05:40:15 AM
New version. Faster, and it seems that stability was not affected (I hope so  :icon_mrgreen:)

(http://i64.tinypic.com/2jb6wzp.jpg)

I'm still in need of the sorting algo I mentioned.
Title: Re: Benchmark with minimum overhead
Post by: Grincheux on December 21, 2015, 07:05:12 AM
(http://www.phrio.biz/download/CodeProfiler3.jpg)


I hope that could help you
Title: Re: Benchmark with minimum overhead
Post by: guga on December 21, 2015, 08:11:02 AM
Helped a lot, many tks. The app is working perfectly on your AMD. A variance of almost 0 means the mean found is close to the correct timing the tested function takes on your machine :)

I'm working now on the sorting algorithm to try to fix a problem it has with tiny chunks of code. Once I succeed, I'll start porting it to a DLL so you guys can use it in your apps too :)
Title: Re: Benchmark with minimum overhead
Post by: guga on December 23, 2015, 05:27:19 PM
Ok, guys... Finished fixing the bugs. I guess it is now closer to the real timing of the user function :) Tomorrow I'll start cleaning the code and preparing the DLL version.

Any suggestions on how to implement it as a DLL are more than welcome :)

I was considering putting the main functions in a thread, so that when you use it inside your apps the function won't hang (I mean, get stuck until the analysis is finished), but I have no idea how to create a thread to handle this.
Title: Re: Benchmark with minimum overhead
Post by: guga on December 29, 2015, 05:06:23 AM
Succeeded in cleaning up, and I'm currently making some fine tunes to the code before implementing the DLL.

There is a weird behaviour with the TSC mnemonics (cpuid, rdtsc, rdtscp) when dealing with functions that use callbacks or memory allocation inside. Although I succeeded in stabilizing the results, I'm getting variations of around 10% in the timing just from changing other functions in the algo not related to the benchmark itself. For example, the simple fact that I replace a cmp with a test in an internal routine causes the result to vary from 72 ns to 80 ns on functions that use callbacks!!!

For example, this behaviour happens when measuring the timing of qsort.

c_call 'msvcrt.qsort' TestingArray7, 5, 16, StructSortCompare

When I leave the internal routines of CodeTune as test eax eax... the result is 72 ns
but...
If I change the same routines to cmp eax 0, the result jumps to 80 ns.

It happens only with those kinds of functions under analysis. It seems that when we are analysing a function with callbacks or internal memory allocation, it may cause a pipeline or throughput problem somewhere else that interferes with the results.

I'm trying to keep the result at the lowest value possible while doing the necessary optimizations of CodeTune before implementing the DLL.

Although the app is still fast, and I succeeded in fixing a major problem when the chosen method is QPC, I need it to be a bit faster. It is taking 1.5 to 2.3 seconds on my i7 depending on the chosen algo, which is almost 2 times slower than before. For instance, cpuid seems to be the worst algorithm for time measurement (not to mention QPC), but it is still possible to make it work faster and with the same accuracy as rdtsc or rdtscp.

Of course, this wouldn't be a big deal, but if I can keep the older speed it will be better for all of us, especially when benchmarking other people's code as many times as we want.

My goal is to keep the total elapsed time at around 1.0 seconds or less on my i7. If I succeed in making it work at the same speed as the executable version (no DLL), it will be better for older processors.

Attached is the newer version, which runs the main function in a thread and analyses the qsort API. All other parts of the function are ready for the user to insert his own code, but I need to make some fine tunes, as I explained, before releasing the proper DLL.
Title: Re: Benchmark with minimum overhead
Post by: fearless on December 29, 2015, 05:29:07 AM
Getting an error messagebox: "The program can't start because ROMEM.dll is missing from your computer. Try reinstalling the program to fix this problem."
Title: Re: Benchmark with minimum overhead
Post by: guga on December 29, 2015, 05:34:41 AM
I forgot to include it. Here it is :)
Title: Re: Benchmark with minimum overhead
Post by: guga on December 29, 2015, 05:55:49 AM
Btw, many tks for the tips.

Now the main function to be used by the user is simply this:

call TimeProfiler D$SamplesToCollect, D$Iterations, D$AlgoMethod, MyPointer, Algoritm1

SamplesToCollect: the amount of samples to collect
Iterations: the number of iterations the user wants performed
AlgoMethod: a value (equate) corresponding to the chosen algo method
MyPointer: a callback function so the user can use the generated results as he wants, inside or outside a thread
Algoritm1: the user's target callback function to be tested

If the function succeeds, it returns in eax a structure containing the standard deviation results for the user.
If the function fails, it returns 0 and places the error messages in dwStatusCode (the parameter of the "MyPointer" callback), as you suggested.
Title: Re: Benchmark with minimum overhead
Post by: guga on December 29, 2015, 11:27:39 AM
Started the 1st tests on the DLL. So far, the porting to a library went ok. I'll make some minor modifications to the main app that calls it, to verify the library is running ok.
Title: Re: Benchmark with minimum overhead
Post by: guga on January 03, 2016, 04:00:31 AM
Library finished.

Released a beta version including MASM inc and lib files ready to be used, and a complete guide for the API usage.

http://masm32.com/board/index.php?topic=4962.msg53374#msg53374

I hope I succeeded in making the port to MASM. I tried to do it in a way that lets you guys use the invoke token with it.

The new version enhances the accuracy, and it is no longer necessary to use RosMem.dll :) I chose to eliminate the need for virtually allocating memory with external APIs.