Benchmark with minimum overhead

guga · December 16, 2015, 01:56:19 AM

Hi guys

I'm building an app to test the performance of the different benchmarking methods in use. I'm testing several different algorithms to weigh benchmark performance against accuracy, to see if we can get a method that comes closer to the real timings a given function takes.

So far, the best results came from bracketing the start and end of the computation like this:
Method 1

; start to calculate the data
    xor eax eax ; serialize
    xor edx edx ; serialize
    xor ecx ecx ; serialize
    xor ebx ebx ; serialize
    rdtscp       ; read clock counter
    lfence
    ; store edx/eax as the start
(user code)
; end of calculate the data
    xor eax eax ; serialize
    xor edx edx ; serialize
    xor ecx ecx ; serialize
    xor ebx ebx ; serialize
    rdtscp       ; read clock counter
    lfence

   ; subtract the start value in edx/eax from the ending value in edx/eax to get the delta used to compute the STD
loop back until XXX iterations

After the loop, compute the STD and analyse the results. Whenever the STD is bigger than 1, start the test all over again. Do this major loop about 30-100 times (not sure yet) to make sure you are collecting data with the smallest overhead possible.

After collecting this data, perform another STD on it to compute the true mean. This should represent the real time used by your code with a minimum of overhead.

This leaves a small amount of overhead that can be estimated from the standard deviation of the results. So the smallest STD or variance (close to 0, or at least smaller than 1) represents the true clock cycles spent by the algo, with as little overhead as possible.
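
A rough sketch of the Method 1 bookkeeping, written in Python rather than assembly just to keep the statistics readable. The function name `stable_mean`, the `run_batch` callback, the canned tick deltas and the choice of population standard deviation are all assumptions for illustration; the app's real routines are not shown in this thread.

```python
import statistics

def stable_mean(run_batch, good_batches=30, max_std=1.0, max_attempts=10_000):
    """Keep only batches whose standard deviation is <= max_std.

    run_batch() is a hypothetical callback returning a list of
    (end - start) tick deltas for one inner loop.
    """
    good_means = []
    for _ in range(max_attempts):
        deltas = run_batch()
        if statistics.pstdev(deltas) > max_std:
            continue                      # noisy batch: start this test over
        good_means.append(statistics.fmean(deltas))
        if len(good_means) == good_batches:
            break
    if len(good_means) < good_batches:
        raise RuntimeError("measurements never stabilized")
    # Second pass: mean and spread of the accepted batch means.
    return statistics.fmean(good_means), statistics.pstdev(good_means)

# Made-up data: the first batch contains an outlier and is rejected.
batches = iter([
    [180, 950, 181, 180],
    [180, 180, 181, 180],
    [181, 180, 180, 180],
])
mean, spread = stable_mean(lambda: next(batches), good_batches=2)
print(mean, spread)
```

The `max_attempts` cap is not in the description above; it is only there so that a machine that never settles does not loop forever.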

Method 2
Or we can compute the overhead before all the major loops (as a calibration), calculating the mean of the variances over a series of iterations. Let's say that after calibrating we have a mean variance of 30000.
Then we can simply subtract it from the resulting STD computed on the main benchmark function.

But this 2nd method may be inaccurate, I guess.
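
For comparison, a minimal sketch of the calibrate-then-subtract idea. Note that it uses a common variant, subtracting the mean tick count of an empty timing bracket, rather than the variance-based figure described above; that substitution, the function names and the sample numbers are all assumptions made for illustration.

```python
import statistics

def calibrate_overhead(empty_deltas):
    """Mean tick count of the timing bracket with no user code inside."""
    return statistics.fmean(empty_deltas)

def net_ticks(measured_deltas, overhead):
    """Mean measured ticks minus the calibrated overhead, clamped at zero."""
    return max(0.0, statistics.fmean(measured_deltas) - overhead)

overhead = calibrate_overhead([30, 32, 31, 31])      # made-up calibration run
cost = net_ticks([211, 212, 211, 212], overhead)     # made-up benchmark run
print(overhead, cost)
```

The weak point is the same one suspected above: the overhead measured during calibration is not guaranteed to be the overhead present while the real code runs.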

I'll try to finish it and post the results and the app here to be tested.

Btw, cpuid/rdtsc seems to be the worst, because it goes through hundreds of overheads (variances in the STD) before finding a value smaller than 1 (for the variance and STD, I mean).

guga · December 16, 2015, 07:04:12 AM

Preliminary tests show that the rdtscp+lfence series is stable. On each run of the app there is no variance. The mean seems to remain the same no matter how many times we run it. (Of course, this is just a preliminary test and I did find one variation in the result, but this can be overcome as soon as I finish the analysis.)

When clicked the 1st time..

And clicked again, just after it :) Btw, I tested both closing the app and not closing it (only restarting the benchmark, I mean).

Clicked in sequence about 40 times, and found only one major difference in the mean. All the rest remained fixed at 180.56 nanoseconds :) which seems to be the exact amount of time spent by the tested code.

guga · December 18, 2015, 08:43:26 AM

Can someone please test the app on other machines?

So far the result is accurate (although a bit slow after 3000 * 3000 * 500 * 4 internal loops or something :)

I disabled the edit control for increasing/decreasing the iterations, because the 3000 shown there is only the total number of "good results" the app collects. The total number of iterations is something around the number above (including the calibration).

In this example, it counts the total amount of time (in nanoseconds) that the function memcpy_SSE takes, and also tries to compute how many overheads it finds while trying to stabilize the results.

I'm analysing it for stability of the results. So the algo with the least variation is the one that gives the most accurate value.

I'll try optimizing it a bit and review how many "iterations for good passes" are necessary to collect stable results. In previous tests, something around 300 was more than enough for stabilization, but I'll take a look at it later.

Btw, I have not implemented any message telling the user which part the app is currently running (or enabled a progress bar yet). So if you can tell me how long it takes to run on your machine, I'll really appreciate it.

This version tests the total amount of time (in nanosecs) that the code below takes to run

Proc memcpy_SSE:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; we copy the memory 16 bytes (128 bits) at a time
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; block count. Divide by 16 (4 dwords)
    jz L0> ; The memory size is smaller than 16 bytes. Jmp over

        ; Now we must compute the remainder, to see how many bytes are left after the 16-byte loop
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes
        mov edx 0 ; here it is used as an index
        L1:
           ; movlps XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
            ;movhps XMM1 X$esi+edx*8+8 ; copy the 1st 4 dwords from esi to register XMM
            ;movlps X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            ;movhps X$edi+edx*8+8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            movsd | movsd | movsd | movsd


;            movupd XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
;            movupd X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            dec ecx
            ;lea edx D$edx+2
            jnz L1<
        emms ; clear the registers back to use on FPU
        test eax eax | jz L4> ; No remainders ? Exit
        jmp L3> ; jmp to the remainder computation

L0:
   ; If we are here, it means that the data is smaller than 16 bytes, and we need to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

    ; If the length is not a multiple of 16 bytes (4 dwords) we may have some remainder here. So, just copy it.
    test eax eax | jz L4>  ; No remainders ? Exit
L9:
        lea edi D$edi+edx*8 ; mul edx by 8 to get the pos
        mov eax eax ; fix potential stallings
        lea esi D$esi+edx*8 ; mul edx by 8 to get the pos

L3:  movsb | dec eax | jnz L3<

L4:

EndP
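
To make the block/remainder arithmetic in the routine easier to follow, here is a small Python model of it. It only mirrors the length split and the two copy loops; the names `split_length` and `model_copy` are invented for this sketch and say nothing about how fast the real routine is.

```python
def split_length(length):
    """Return (16-byte blocks, leftover bytes), as the shr/shl/sub sequence does."""
    blocks = length >> 4                 # shr ecx 4
    remainder = length - (blocks << 4)   # shl edx 4 | sub eax edx
    return blocks, remainder

def model_copy(source):
    """Copy a bytes object block by block, then byte by byte for the tail."""
    blocks, remainder = split_length(len(source))
    dest = bytearray()
    for i in range(blocks):              # the L1 loop: four movsd per pass
        dest += source[i * 16:(i + 1) * 16]
    tail_start = blocks * 16
    for i in range(remainder):           # the L3 loop: one movsb per pass
        dest.append(source[tail_start + i])
    return bytes(dest)

data = bytes(range(37))
print(split_length(37), model_copy(data) == data)
```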

guga · December 18, 2015, 09:22:36 AM

In this other version it measures how much time the operation "xor eax, eax" takes.

On my machine the time varies depending on the timing algorithm tested. It is somewhere around 0.8 nanosecs to 1.2 nanosecs (or 3.4 with QueryPerformanceCounter).

The difference between the results reflects how each timing algorithm interacts with the code. After calibrating, it collects samples and tries to discard as much overhead as possible.

The problem is still the variation with some code combinations. For example, cpuid/rdtsc seems to have problems: although I find some stability after clicking to run it, sometimes it finds a value of 0.01 nanosecs and other times 0.52 nanosecs, meaning that the operations involved in computing the time are somehow interfering with the real counting of the code.

Acceptable differences seem to be around 25% above or below for each algorithm.

In any case, the timing of "xor eax eax" comes out at something around 1 nanosec, which seems to match what Agner Fog and the Intel documents report.

The "stability" is measured from the variance, standard deviation and overhead results. The smaller the values, the more stable the counting algorithm is.

One thing I'm finding is that when computing the "end time", it doesn't seem to need the extra "xor operations" in front to serialize it:
;xor eax eax ; serialize
;xor edx edx ; serialize
;xor ecx ecx ; serialize
;xor ebx ebx ; serialize
lfence
rdtscp

If I comment out those lines, the variance/standard deviation seems to decrease considerably and the speed is calculated more properly. I'm trying to fix what I can and run further tests.

When I achieve less than 10% variance between the algorithms, the result should better represent the best accuracy of them all.

hutch-- · December 18, 2015, 10:16:03 AM

Guga,

It does not seem to work properly on my Win7 64. The iteration count is set to 3000 and it cannot be edited. When I hit the benchmark button, it just locks up.

guga · December 18, 2015, 10:57:31 AM

Tks for the report, Steve:)

So 3000 is too much. Ok, I'll re-enable the button and add some more variables or messages so the user knows it is running. I'll post another version as soon as I finish.

Magnum · December 18, 2015, 11:10:08 AM

It also locks up on my machine.

Intel Core 2 Duo CPU P8600 2.4 GHz

guga · December 18, 2015, 09:41:47 PM

Ok guys, new version

Added 2 user inputs

Input sample: the minimum number of good samples the algo needs to find before it is considered stabilized (something between 100 and 300 should be enough)
Input Number of iterations: the total number of iterations (loops) used to analyze the code being tested. During the loops, as much overhead as possible is already being discarded from the computation, so you don't have to enter a large number here (otherwise, the app will take a looong time to finish the analysis). A number between 300 and 30000 should be enough.

Also added routines to detect the presence of cpuid and rdtscp. If either of those instructions is not present on your processor, the corresponding checkboxes are disabled (and also the routines that use them).

Stabilization of the app and the tested algos is easier to judge when you run the app twice with different iteration counts. For example, test once with 3000 and, when it finishes, test again with 30000 and compare the results. Good stabilization seems to happen whenever the difference between them is less than 5%. Ex: the 1st time you run it the resulting mean is 161.5, and on the next one it is 161, etc.
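
The 5% rule of thumb can be written down as a one-line check. This is only a sketch in Python; the name `is_stable` and the choice to measure the difference relative to the larger of the two means are assumptions, since the post does not say which run is the reference.

```python
def is_stable(mean_a, mean_b, tolerance=0.05):
    """True when two run means differ by less than `tolerance` (relative)."""
    reference = max(abs(mean_a), abs(mean_b))
    if reference == 0:
        return True                      # both means are zero: nothing to compare
    return abs(mean_a - mean_b) / reference < tolerance

print(is_stable(161.5, 161.0))   # the example pair from the post
print(is_stable(161.5, 120.0))   # a made-up pair that is clearly off
```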

If it is still hanging on newer Windows versions (I'm using XP here), let me know, because perhaps there is a problem with the manifest in the resource section? Not sure if Windows Vista or others use the same manifest as XP.

If it is running OK, without hanging, I'll implement the conversion from nanosecs to millisecs and to ticks, and probably fine-tune the "good sample" routines. Maybe it would be necessary to add a final comparison between all algos at once. I mean, perhaps the best approach is to let the app run all algos (the good candidates: accuracy x speed) and choose the value that best fits the performance in general.
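
The planned unit conversions are simple arithmetic once the CPU frequency is known. A minimal Python sketch, assuming the frequency is supplied in Hz by whatever routine ends up measuring it (the function names and the 3 GHz figure are invented for the example):

```python
def ns_to_ms(nanoseconds):
    """Nanoseconds to milliseconds."""
    return nanoseconds / 1_000_000

def ns_to_ticks(nanoseconds, cpu_hz):
    """Nanoseconds to clock ticks for a CPU running at cpu_hz."""
    return nanoseconds * cpu_hz / 1_000_000_000

print(ns_to_ms(180.56))                    # the mean reported earlier in the thread
print(ns_to_ticks(1000, 3_000_000_000))    # 1000 ns on a hypothetical 3 GHz CPU
```

The tick figure is only as good as the frequency passed in, so an error in the frequency measurement carries straight through to the converted result.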

If it is working now, please let me know the results for each tested algo. That way I can think of a better way to enhance the analysis.

Once all tests are OK, I'll try to make it a dll so that you guys can also use it inside your own apps. That will be better for those who are not used to RosAsm code yet.

Siekmanski · December 18, 2015, 11:25:57 PM

Windows 8.1 64bit

Grincheux · December 19, 2015, 12:34:33 AM

guga · December 19, 2015, 01:07:11 AM

Those are odd results. On both, the CPU frequency is not being retrieved properly. I used QueryPerformanceFrequency to retrieve it, and it seems it does not work as expected on 64-bit Windows and on AMD.

I'll use dave's algo to retrieve the correct CPU frequency instead. He did excellent work :t

http://masm32.com/board/index.php?topic=4693.45

FORTRANS · December 19, 2015, 01:39:10 AM

Hi,

Windows 2000, did not run. The procedure entry point GetNativeSystemInfo could not be located in the dynamic link library KERNEL32.dll.

Windows XP results in attached JPG. Took quite a while to run.

Regards,

Steve N.

TWell · December 19, 2015, 01:39:59 AM

Another AMD 2.8 GHz

guga · December 19, 2015, 02:53:27 AM

Hi Guys, many thanks for the reports

About GetNativeSystemInfo: I completely forgot I had put this routine in RosMem.dll, many thanks. I'll fix it for the next release.

About the time spent: this is because QueryPerformanceFrequency is not working as expected on all OSes or processors. I'm testing dave's function to retrieve the correct CPU frequency, but I found a small problem in his function that is interfering with the results.

I guess I managed to overcome this problem by serializing his function before it starts, with a sequence of "xor operations" like this:

GetCpuFrequency:
xor eax eax
xor edx edx
xor ecx ecx
xor ebx ebx
xor edi edi
xor esi esi
    push eax
    push eax
    push esp
(...)

mabdelouahab · December 19, 2015, 03:10:08 AM

win 8.1 x64

The MASM Forum
