Benchmark with minimum overhead

Started by guga, December 16, 2015, 01:56:19 AM


guga

Many thanks, Siekmanski.

Your results are close to mine :) Minimal variation and STD, and Algo7 is the preferred choice. I'm currently reorganizing the code to make it faster and trying to fix the problem of the huge variances reported by others, which eventually end up being measured as well.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Grincheux

Guga, that means that your program has a kind of interpreter.
I don't use RosAsm, but I have an account on your forum.


It's a kind of VTune (Intel) or Code Analyst (AMD), better of course.
Kenavo (Bye)
----------------------
Help me if you can, I'm feeling down...

guga

Hi Grincheux,

Thanks for the comments :) I never used VTune or Code Analyst before, but you are right, the app is a kind of interpreter. The main problem I'm facing now is with tiny pieces of code. It seems that the mnemonics used to measure the timings are not all that accurate when dealing with small chunks of code.

The problem is that when the measuring loop runs before and after the code you are testing, the mnemonics don't have enough time to measure, which causes errors in the result.

This is particularly true when the code you test is as simple as an "xor eax eax", a "mov eax 1", etc., I mean, when you analyse those chunks alone. To compute the correct time, I calibrate the algorithm, which is basically subtracting the time of a "null" function from the time of the one you are testing. So we have two measured times, one resulting from the calibration and the other for the code you are analysing. Ex:

Calibration function:
Proc Dummy:
EndP


Benchmarking test:
Proc Benchmark:
      xor eax eax
EndP


Considering that Proc and EndP are macros, they also contain mnemonics. So, in fact, what is being measured is:
Calibration function:
Dummy:
push ebp
mov ebp esp

mov esp ebp
pop ebp
ret


Benchmarking test:
push ebp
mov ebp esp

     xor eax eax

mov esp ebp
pop ebp
ret


OK, so let's say that at the end of the whole analysis the calibration function has an average time of 30 ns and the Benchmark function has an average of 31 ns.

The resulting value is simply the subtraction of the two. So, 31 ns - 30 ns = 1 ns, which is the time the instruction "xor eax eax" alone takes to run.
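Just to illustrate the idea (this is not CodeTune's actual code; the function names, the use of the __rdtsc intrinsic and the iteration count are my own assumptions), the calibrate-and-subtract scheme looks roughly like this in C:

#include <stdio.h>
#include <intrin.h>                         /* __rdtsc() intrinsic (MSVC) */

#define RUNS 1000000

static volatile int sink;

static void Dummy(void)     { }             /* "null" calibration function   */
static void Benchmark(void) { sink ^= 1; }  /* stands in for "xor eax eax"   */

/* Average TSC ticks per call of fn() over RUNS iterations. */
static double Measure(void (*fn)(void))
{
    unsigned __int64 start = __rdtsc();
    for (int i = 0; i < RUNS; i++)
        fn();
    return (double)(__rdtsc() - start) / RUNS;
}

int main(void)
{
    double overhead = Measure(Dummy);       /* loop + call/ret overhead       */
    double total    = Measure(Benchmark);   /* overhead + the instruction     */

    /* The instruction's own cost is the difference; converting ticks to ns
       requires the TSC frequency. For tiny snippets the difference can come
       out as ~0 or even negative, which is exactly the problem described
       below. */
    printf("calibration %.3f, benchmark %.3f, delta %.3f ticks\n",
           overhead, total, total - overhead);
    return 0;
}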

Well... this is where things start to go wrong. :(

In theory the logic is correct, but in practice the results are the same as if there were no instruction at all to be measured. It is as if "xor eax eax" never existed, leading to a result of 0, 0.001, etc., very close to the timing found in the calibration.
And sometimes the results are even negative :( which are the worst cases!

To handle those cases I must create some margin of error in order to correctly fix those timing variations for very tiny pieces of code.

And for that I need a sort algorithm. So, if anyone can help me with a sort algo, I'll really appreciate it.

The sort algorithm needs to sort an array of structures, each containing 2 double floats (the REAL data type; I don't remember if MASM uses the same notation as RosAsm or NASM. They are the 8-byte floats). The sort must be keyed on the 1st member of each structure. Ex:

Array:
[MyStructArray:

MyStruct.Data1: R$ 5.25
MyStruct.Data1a: R$ 125689.4

MyStruct.Data2: R$ 15.01
MyStruct.Data2a: R$ 145.5

MyStruct.Data3: R$ 1.25
MyStruct.Data3a: R$ 458]



After sorting the result must be:

[MyStructArray:

MyStruct.Data1: R$ 1.25
MyStruct.Data1a: R$ 458

MyStruct.Data2: R$ 5.25
MyStruct.Data2a: R$ 125689.4

MyStruct.Data3: R$ 15.01
MyStruct.Data3a: R$ 145.5]


Does anyone have a sort algorithm for this? Fast/optimized if possible.
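Below is a minimal sketch in C of that kind of sort, using the CRT's qsort with a comparator keyed on the first double of each structure (the struct and field names are just illustrative; in assembly the equivalent is a call to msvcrt.qsort with element size 16 and a compare callback, which is essentially what the qsort test later in this thread does):

#include <stdio.h>
#include <stdlib.h>

/* One entry: two 8-byte floats, sorted on the first one. */
typedef struct {
    double key;      /* e.g. the measured time            */
    double value;    /* e.g. the value associated with it */
} Sample;

/* qsort comparator: ascending order on .key */
static int CompareByKey(const void *a, const void *b)
{
    const Sample *x = (const Sample *)a;
    const Sample *y = (const Sample *)b;
    return (x->key > y->key) - (x->key < y->key);
}

int main(void)
{
    Sample arr[] = { { 5.25, 125689.4 }, { 15.01, 145.5 }, { 1.25, 458.0 } };
    size_t n = sizeof arr / sizeof arr[0];

    qsort(arr, n, sizeof(Sample), CompareByKey);

    /* Prints 1.25/458, 5.25/125689.4, 15.01/145.5, as in the example above. */
    for (size_t i = 0; i < n; i++)
        printf("%g\t%g\n", arr[i].key, arr[i].value);
    return 0;
}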

Btw: attached is the new version. I hope it is still fast.

Grincheux

My results (seconds):



Thanks Guga for describing your process.
It's a bit hard to understand.

I ran the test a second time and noticed:

Population Standard deviation

Minimum : 142.36290032751498
Maximum : 142.36298469603162
Variance : 0.00000000177951
Standard deviation : 0.00004218425831

TWell

@Grincheux
Could you run CPU-Z to see what it says about your CPU speed?
Is the AMD X2 250 locked to 3.0 GHz?


guga

New version. Faster and it seems that stability was not affected (I hope so  :icon_mrgreen:)



I'm still in need of the sorting algo I mentioned.


guga

Helped a lot, many thanks. The app is working perfectly on your AMD. A variance of almost 0 means the mean found is close to the correct timing of the tested function on your machine :)

I'm working now on the sorting algorithm to try to fix a problem it has with tiny chunks of code. Once I succeed, I'll start porting it to a DLL so you guys can use it in your apps too :)

guga

OK, guys... finished fixing the bugs. I guess it is now closer to the real timing of the user function :) Tomorrow I'll start cleaning the code and preparing the DLL version.

Any suggestions on how to implement it as a DLL are more than welcome :)

I was considering putting the main functions in a thread so that when you use it inside your apps the function won't hang, I mean, won't get stuck until the analysis is finished, but I have no idea how to make a thread handle this.
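For what it's worth, here is a minimal sketch of one way to do that with the Win32 API, using CreateThread and a single WaitForSingleObject when the caller eventually needs the result (the function and variable names are my own, not CodeTune's):

#include <windows.h>
#include <stdio.h>

/* Hypothetical worker: runs the whole analysis without blocking the caller. */
static DWORD WINAPI BenchmarkThread(LPVOID param)
{
    double *result = (double *)param;
    /* ... run the timing analysis here ... */
    *result = 1.0;   /* placeholder for the measured time */
    return 0;
}

int main(void)
{
    double result = 0.0;

    HANDLE hThread = CreateThread(NULL, 0, BenchmarkThread, &result, 0, NULL);
    if (hThread == NULL)
        return 1;

    /* The caller keeps doing its own work (pumping messages, etc.) here, */
    /* and only waits for the thread when it actually needs the result.  */
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);

    printf("benchmark result: %g ns\n", result);
    return 0;
}

The DLL could also take a user callback and invoke it from the worker thread when the analysis finishes, so the host app never has to wait at all.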

guga

I succeeded in cleaning up and I'm currently doing some fine-tuning on the code before implementing the DLL.

There is a weird behaviour with the TSC mnemonics (cpuid, rdtsc, rdtscp) when dealing with functions that use callbacks or memory allocation internally. Although I succeeded in stabilizing the results, I'm getting variations of around 10% of the timing while cleaning up other functions of the algo not related to the benchmark itself. For example, simply replacing a cmp with a test in an internal routine causes the algo to vary the result from 72 ns to 80 ns on functions that use callbacks!!!

For example, this behaviour happens when measuring the timing of qsort.

c_call 'msvcrt.qsort' TestingArray7, 5, 16, StructSortCompare

When I leave the internal routines of CodeTune using test eax eax... the result is 72 ns,
but...
if I change the same routines to cmp eax 0, the result jumps to 80 ns.

It happens only with those kinds of functions under analysis. It seems that when we are analysing a function with callbacks or internal memory allocation, it may cause a pipeline or throughput problem somewhere else that interferes with the results.

I'm trying to keep the result as low as possible while doing the necessary optimizations of CodeTune before implementing the DLL.

Although the app is still fast and I succeeded in fixing a major problem when the chosen method is QPC, I need it to be a bit faster. It is taking 1.5 to 2.3 seconds on my i7 depending on the chosen algo, which is almost 2 times slower than before. For instance, cpuid seems to be the worst algorithm for time measurement (not to mention QPC), but it should still be possible to make it work faster and with the same accuracy as rdtsc or rdtscp.
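For reference, the usual serialized read pattern with those instructions looks roughly like this in C with the MSVC intrinsics (a sketch of the common cpuid/rdtsc/rdtscp fencing, not necessarily how CodeTune implements it):

#include <intrin.h>   /* __cpuid, __rdtsc, __rdtscp (MSVC) */

/* TSC ticks spent in fn(), with serialized start/stop reads. */
static unsigned __int64 TimeOnce(void (*fn)(void))
{
    int regs[4];
    unsigned int aux;
    unsigned __int64 start, stop;

    __cpuid(regs, 0);          /* serialize before the first read             */
    start = __rdtsc();

    fn();                      /* code under test                             */

    stop = __rdtscp(&aux);     /* waits until previous instructions executed  */
    __cpuid(regs, 0);          /* keeps later instructions out of the timing  */

    return stop - start;
}

cpuid itself is comparatively expensive, which may help explain why the cpuid-based method comes out as the worst of the TSC-based options here.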

Of course, this wouldn't be a big deal, but if I could keep the older speed it would be better for all of us, especially when benchmarking other people's code as many times as we want.

My goal is to keep the total elapsed time at around 1.0 second or less on my i7. If I can make it work at the same speed as the executable version (no DLL), it will be better when using older processors.

Attached is the newer version, which runs the main function as a thread and analyses the qsort API. All other parts of the function are ready for the user to insert his own code, but I need to do some fine-tuning, as explained, before releasing the proper DLL.

fearless

Getting an error messagebox: "The program can't start because ROMEM.dll is missing from your computer. Try reinstalling the program to fix this problem."

guga

I forgot to include it. Here it is :)