Benchmark with minimum overhead

Started by guga, December 16, 2015, 01:56:19 AM


guga

Many thanks, Siekmanski.

Your results are close to mine :) Minimal variation and STD, and Algo7 is the preferred choice. I'm currently reorganizing the code to make it faster and trying to fix the problem of the huge variances reported by others, which eventually end up being measured as well.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Grincheux

Guga, that means that your program has a kind of interpreter.
I don't use RosAsm, but I have an account on your forum.


It's a kind of VTune (Intel) or Code Analyst (AMD), better of course.
Kenavo (Bye)
----------------------
Help me if you can, I'm feeling down...

guga

Hi Grincheux,

Thanks for the comments :) I never used VTune or Code Analyst before, but you are right, the app is a kind of interpreter. The main problem I'm facing now is with tiny pieces of code. It seems that the mnemonics used to measure the timings are not all that accurate when dealing with small chunks of code.

The problem is that when the measuring loop runs before and after the code you are testing, the mnemonics don't have enough time to measure, which causes errors in the result.

This is particularly true when the code you test is as simple as an "xor eax eax", a "mov eax 1", etc., I mean, when you analyse those chunks alone. To compute the correct time, I calibrate the algorithm, which is basically subtracting the time of a "null" function from the time of the one you are testing. So we have two measured times, one resulting from the calibration and the other for the code you are analysing. Ex:

Calibration function:
Proc Dummy:
EndP


Benchmarking test:
Proc Benchmark:
      xor eax eax
EndP


Considering that Proc and EndP are macros, they also contain mnemonics. So, in fact, what is being measured is:
Calibration function:
Dummy:
push ebp
mov ebp esp

mov esp ebp
pop ebp
ret


Benchmarking test:
push ebp
mov ebp esp

     xor eax eax

mov esp ebp
pop ebp
ret


OK, so let's say that at the end of the whole analysis the calibration function has an average time of 30 ns and the Benchmark function has an average of 31 ns.

The resulting value is simply the subtraction of the two. So, 31 ns - 30 ns = 1 ns, which is the time the instruction "xor eax eax" alone takes to run.
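Just to illustrate the idea (this is not CodeTune's actual code; the function names, the use of the __rdtsc intrinsic and the iteration count are my own assumptions), the calibrate-and-subtract scheme looks roughly like this in C:

#include <stdio.h>
#include <intrin.h>                         /* __rdtsc() intrinsic (MSVC) */

#define RUNS 1000000

static volatile int sink;

static void Dummy(void)     { }             /* "null" calibration function   */
static void Benchmark(void) { sink ^= 1; }  /* stands in for "xor eax eax"   */

/* Average TSC ticks per call of fn() over RUNS iterations. */
static double Measure(void (*fn)(void))
{
    unsigned __int64 start = __rdtsc();
    for (int i = 0; i < RUNS; i++)
        fn();
    return (double)(__rdtsc() - start) / RUNS;
}

int main(void)
{
    double overhead = Measure(Dummy);       /* loop + call/ret overhead       */
    double total    = Measure(Benchmark);   /* overhead + the instruction     */

    /* The instruction's own cost is the difference; converting ticks to ns
       requires the TSC frequency. For tiny snippets the difference can come
       out as ~0 or even negative, which is exactly the problem described
       below. */
    printf("calibration %.3f, benchmark %.3f, delta %.3f ticks\n",
           overhead, total, total - overhead);
    return 0;
}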

Well... this is where things start to go wrong. :(

In theory the logic is correct, but in practice the results are the same as if there were no instruction at all to be measured. It is as if "xor eax eax" never existed, leading to a result of 0, 0.001, etc., very close to the timing found in the calibration.
And sometimes the results are even negative :( which are the worst cases!

To handle those cases I must create some margin of error in order to correctly fix those timing variations for very tiny pieces of code.

And for that I need a sort algorithm. So, if anyone can help me with a sort algo, I'll really appreciate it.

The sort algorithm needs to sort an array of structures, each containing 2 double floats (the REAL data type; I don't remember if MASM uses the same notation as RosAsm or NASM. They are the 8-byte floats). The sort must be keyed on the 1st member of each structure. Ex:

Array:
[MyStructArray:

MyStruct.Data1: R$ 5.25
MyStruct.Data1a: R$ 125689.4

MyStruct.Data2: R$ 15.01
MyStruct.Data2a: R$ 145.5

MyStruct.Data3: R$ 1.25
MyStruct.Data3a: R$ 458]



After sorting the result must be:

[MyStructArray:

MyStruct.Data1: R$ 1.25
MyStruct.Data1a: R$ 458

MyStruct.Data2: R$ 5.25
MyStruct.Data2a: R$ 125689.4

MyStruct.Data3: R$ 15.01
MyStruct.Data3a: R$ 145.5]


Does anyone have a sort algorithm for this? Fast/optimized if possible.
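Below is a minimal sketch in C of that kind of sort, using the CRT's qsort with a comparator keyed on the first double of each structure (the struct and field names are just illustrative; in assembly the equivalent is a call to msvcrt.qsort with element size 16 and a compare callback, which is essentially what the qsort test later in this thread does):

#include <stdio.h>
#include <stdlib.h>

/* One entry: two 8-byte floats, sorted on the first one. */
typedef struct {
    double key;      /* e.g. the measured time            */
    double value;    /* e.g. the value associated with it */
} Sample;

/* qsort comparator: ascending order on .key */
static int CompareByKey(const void *a, const void *b)
{
    const Sample *x = (const Sample *)a;
    const Sample *y = (const Sample *)b;
    return (x->key > y->key) - (x->key < y->key);
}

int main(void)
{
    Sample arr[] = { { 5.25, 125689.4 }, { 15.01, 145.5 }, { 1.25, 458.0 } };
    size_t n = sizeof arr / sizeof arr[0];

    qsort(arr, n, sizeof(Sample), CompareByKey);

    /* Prints 1.25/458, 5.25/125689.4, 15.01/145.5, as in the example above. */
    for (size_t i = 0; i < n; i++)
        printf("%g\t%g\n", arr[i].key, arr[i].value);
    return 0;
}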

Btw: attached is the new version. I hope it is still fast.

Grincheux

My results (seconds):



Thanks Guga for describing your process.
It's a bit hard to understand.

I ran the test a second time and noticed:

Population Standard deviation

Minimum : 142.36290032751498
Maximum : 142.36298469603162
Variance : 0.00000000177951
Standard deviation : 0.00004218425831

TWell

@Grincheux
Could you run CPU-Z to see what it says about your CPU speed?
Is the AMD X2 250 locked to 3.0 GHz?


guga

New version. Faster and it seems that stability was not affected (I hope so  :icon_mrgreen:)



I'm still in need of the sorting algo I mentioned.


guga

Helped a lot, many thanks. The app is working perfectly on your AMD. A variance of almost 0 means the mean found is close to the correct timing of the tested function on your machine :)

I'm working now on the sorting algorithm to try to fix a problem it has with tiny chunks of code. Once I succeed, I'll start porting it to a DLL so you guys can use it in your apps too :)

guga

OK, guys... finished fixing the bugs. I guess it is now closer to the real timing of the user function :) Tomorrow I'll start cleaning the code and preparing the DLL version.

Any suggestions on how to implement it as a DLL are more than welcome :)

I was considering putting the main functions in a thread so that when you use it inside your apps the function won't hang, I mean, won't get stuck until the analysis is finished, but I have no idea how to make a thread handle this.
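For what it's worth, here is a minimal sketch of one way to do that with the Win32 API, using CreateThread and a single WaitForSingleObject when the caller eventually needs the result (the function and variable names are my own, not CodeTune's):

#include <windows.h>
#include <stdio.h>

/* Hypothetical worker: runs the whole analysis without blocking the caller. */
static DWORD WINAPI BenchmarkThread(LPVOID param)
{
    double *result = (double *)param;
    /* ... run the timing analysis here ... */
    *result = 1.0;   /* placeholder for the measured time */
    return 0;
}

int main(void)
{
    double result = 0.0;

    HANDLE hThread = CreateThread(NULL, 0, BenchmarkThread, &result, 0, NULL);
    if (hThread == NULL)
        return 1;

    /* The caller keeps doing its own work (pumping messages, etc.) here, */
    /* and only waits for the thread when it actually needs the result.  */
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);

    printf("benchmark result: %g ns\n", result);
    return 0;
}

The DLL could also take a user callback and invoke it from the worker thread when the analysis finishes, so the host app never has to wait at all.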

guga

I succeeded in cleaning up and I'm currently doing some fine-tuning on the code before implementing the DLL.

There is a weird behaviour with the TSC mnemonics (cpuid, rdtsc, rdtscp) when dealing with functions that use callbacks or memory allocation internally. Although I succeeded in stabilizing the results, I'm getting variations of around 10% of the timing while cleaning up other functions of the algo not related to the benchmark itself. For example, simply replacing a cmp with a test in an internal routine causes the algo to vary the result from 72 ns to 80 ns on functions that use callbacks!!!

For example, this behaviour happens when measuring the timing of qsort.

c_call 'msvcrt.qsort' TestingArray7, 5, 16, StructSortCompare

When I leave the internal routines of CodeTune using test eax eax... the result is 72 ns,
but...
if I change the same routines to cmp eax 0, the result jumps to 80 ns.

It happens only with those kinds of functions under analysis. It seems that when we are analysing a function with callbacks or internal memory allocation, it may cause a pipeline or throughput problem somewhere else that interferes with the results.

I'm trying to keep the result as low as possible while doing the necessary optimizations of CodeTune before implementing the DLL.

Although the app is still fast and I succeeded in fixing a major problem when the chosen method is QPC, I need it to be a bit faster. It is taking 1.5 to 2.3 seconds on my i7 depending on the chosen algo, which is almost 2 times slower than before. For instance, cpuid seems to be the worst algorithm for time measurement (not to mention QPC), but it should still be possible to make it work faster and with the same accuracy as rdtsc or rdtscp.
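For reference, the usual serialized read pattern with those instructions looks roughly like this in C with the MSVC intrinsics (a sketch of the common cpuid/rdtsc/rdtscp fencing, not necessarily how CodeTune implements it):

#include <intrin.h>   /* __cpuid, __rdtsc, __rdtscp (MSVC) */

/* TSC ticks spent in fn(), with serialized start/stop reads. */
static unsigned __int64 TimeOnce(void (*fn)(void))
{
    int regs[4];
    unsigned int aux;
    unsigned __int64 start, stop;

    __cpuid(regs, 0);          /* serialize before the first read             */
    start = __rdtsc();

    fn();                      /* code under test                             */

    stop = __rdtscp(&aux);     /* waits until previous instructions executed  */
    __cpuid(regs, 0);          /* keeps later instructions out of the timing  */

    return stop - start;
}

cpuid itself is comparatively expensive, which may help explain why the cpuid-based method comes out as the worst of the TSC-based options here.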

Of course, this wouldn't be a big deal, but if I could keep the older speed it would be better for all of us, especially when benchmarking other people's code as many times as we want.

My goal is to keep the total elapsed time at around 1.0 second or less on my i7. If I can make it work at the same speed as the executable version (no DLL), it will be better when using older processors.

Attached is the newer version, which runs the main function as a thread and analyses the qsort API. All other parts of the function are ready for the user to insert his own code, but I need to do some fine-tuning, as explained, before releasing the proper DLL.

fearless

Getting an error messagebox: "The program can't start because ROMEM.dll is missing from your computer. Try reinstalling the program to fix this problem."

guga

I forgot to include it. Here it is :)