CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)

guga · January 06, 2016, 03:33:47 AM

Fearless, your results seem accurate, but the total time it takes to run is way too high. I probably forgot to include an error message somewhere.

I'll take a closer look and try running the app as if my processor didn't support some of the methods, to see what happens.

jj2007 · January 06, 2016, 03:47:35 AM

Quote from: guga on January 06, 2016, 02:54:52 AM
I mean, if you use your benchmark tools with MasmBasicStrLen, forcing it to pass 8 times (or looping 90 million times), do you have an idea of the time it will take to finish?

Depends on
- len of string(s)
- position of strings in memory

What do you suggest?

guga · January 06, 2016, 04:04:37 AM

This is the way I'm testing.

Code:

call MasmBasicStrLen {B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero", 0}

This function is used inside the internal loops. Each algo method that computes the Mean and STD values calls it at least around 9 million times (the loop count I saw in your benchmark tools). The number of loops can be estimated like this: Good Samples * Iterations * 3, then multiply the result by 8 (the total number of algorithm methods used).

Since you tested it with 300 samples and 3000 iterations, the total number of loops performed is around 21,600,000 (300 * 3000 * 3 * 8).
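As a quick check of that estimate, here is a minimal Python sketch of the formula described above. The function and parameter names are hypothetical (they are not part of the CodeTune API); only the factors 3 and 8 and the 300/3000 figures come from the post.

```python
def estimate_total_loops(good_samples, iterations, passes_per_iteration=3, methods=8):
    """Estimate how many times the tested function is called in total.

    passes_per_iteration=3 and methods=8 are the factors quoted in the post;
    the names themselves are illustrative assumptions.
    """
    loops_per_method = good_samples * iterations * passes_per_iteration
    return loops_per_method * methods


if __name__ == "__main__":
    # 300 good samples and 3000 iterations, as in the test discussed here
    print(estimate_total_loops(300, 3000))
```

With 300 samples and 3000 iterations this gives 2,700,000 calls per method and 21,600,000 across the 8 methods.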

I would like to see how long it takes to run this number of loops (using that string passed through your function). That way I can compare it with the profiler's results and see whether it can be optimized.

I'm pretty sure I forgot an error message somewhere, based on Phillip's results, and maybe it is related to the increased time it is taking to finish on yours and on Fearless's machine.

But if you think that your benchmark tools will take more than 204 seconds to run, no problem. I just wanted to know whether this huge number of loops really is that slow on older processors or whether it was me (again) ruining something inside the dll :icon_mrgreen:

jj2007 · January 06, 2016, 05:02:51 AM

Quote from: guga on January 06, 2016, 04:04:37 AM
This is the way I'm testing.

Code:
call MasmBasicStrLen {B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero", 0}

This function is used inside the internal loops. Each algo method that computes the Mean and STD values calls it at least around 9 million times (the loop count I saw in your benchmark tools). The number of loops can be estimated like this: Good Samples * Iterations * 3, then multiply the result by 8 (the total number of algorithm methods used).

Since you tested it with 300 samples and 3000 iterations, the total number of loops performed is around 21,600,000 (300 * 3000 * 3 * 8).

With 10 million loops, it's 64 cycles per loop for the 100-byte string.

guga · January 06, 2016, 06:59:19 AM

Thanks... and how long did it take to finish running? Did it take more than 200 seconds too?

Grincheux · January 06, 2016, 07:00:29 AM

jj2007 · January 06, 2016, 08:06:16 AM

Quote from: guga on January 06, 2016, 06:59:19 AM
Thanks... and how long did it take to finish running? Did it take more than 200 seconds too?

No, it takes about 350 milliseconds.
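To put the figures quoted in this thread side by side (10 million loops, 64 cycles per loop, about 350 ms), here is a rough Python sketch. It assumes the 350 ms refers to the 10-million-loop run and that the cost per call stays at 64 cycles; both are assumptions, and the function names are made up for illustration.

```python
def implied_clock_hz(loops, cycles_per_loop, elapsed_seconds):
    """Clock rate implied by a measured run (assumes cycles are core clock cycles)."""
    return loops * cycles_per_loop / elapsed_seconds


def expected_seconds(loops, cycles_per_loop, clock_hz):
    """Time the same code should need for a different number of loops."""
    return loops * cycles_per_loop / clock_hz


if __name__ == "__main__":
    # Figures from the thread: 10 million loops, 64 cycles each, about 350 ms
    clock = implied_clock_hz(10_000_000, 64, 0.350)
    print(f"implied clock: {clock / 1e9:.2f} GHz")
    # The 21,600,000 loops estimated earlier in the thread
    print(f"21.6 million loops: {expected_seconds(21_600_000, 64, clock):.3f} s")
```

Under these assumptions the bare calls for 21.6 million loops would cost well under a second, so a run of 204 seconds would be dominated by something other than the tested function itself.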

guga · January 06, 2016, 09:32:21 AM

200 milliseconds? Hmm... considering the several functions, it would be acceptable for it to take 20 seconds to run on yours, but 204 seconds is something I need to look into.

By the way, here is a small example in MASM. I assembled it with RadASM. (No fancy stuff... simply a dialog that displays the Mean.) Once I find the cause of this huge delay, I'll post other examples of the new function and update the library.

guga · January 06, 2016, 09:41:28 AM

Phillip, can you please test with this other version I'm attaching? This one uses only CreateTimeProfile and not CreateTimeProfileEx.

I would like to see what is happening with yours. If possible, please test all available algo methods, especially the last one (QPC). I would like to compare the results from CreateTimeProfile and CreateTimeProfileEx on your AMD.

Grincheux · January 06, 2016, 12:51:32 PM

Here are the results: 4 algorithms.

Grincheux · January 06, 2016, 12:55:29 PM

Your RadAsm test

guga · January 06, 2016, 02:00:11 PM

Thanks...

OK, so AMD (or, more likely, the OS) has some serious issues with QPC. From the images, the normal smallest time on your processor is around 13 ns. In complete contrast, QPC reports 0.06 ns.

It is most likely returning 0. If that is the case, then M$ (again) is providing wrong documentation. MSDN says that QPC never returns 0, but that is a false statement. On the same page there are users complaining about this.
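To make the "returning 0" suspicion easier to check, here is a small Python sketch of how raw counter readings turn into nanoseconds and how to count zero-length intervals. It does not call the Windows API; the tick values and the 10 MHz frequency in the example are assumed purely for illustration, and the function names are hypothetical.

```python
def ticks_to_ns(start_ticks, end_ticks, frequency_hz):
    """Convert a counter interval to nanoseconds, given the counter frequency in Hz."""
    return (end_ticks - start_ticks) * 1_000_000_000 / frequency_hz


def resolution_ns(frequency_hz):
    """Smallest non-zero interval the counter can report, in nanoseconds."""
    return 1_000_000_000 / frequency_hz


def zero_delta_fraction(deltas):
    """Fraction of measured intervals where the counter did not advance at all."""
    return sum(1 for d in deltas if d == 0) / len(deltas)


if __name__ == "__main__":
    freq = 10_000_000  # assumed counter frequency (10 MHz), not taken from the thread
    print(resolution_ns(freq))                  # one tick, in ns
    print(ticks_to_ns(1000, 1001, freq))        # a one-tick interval, in ns
    print(zero_delta_fraction([0, 0, 0, 1]))    # made-up tick deltas
```

If one counter tick is much longer than the code being timed, many individual intervals will read as 0 ticks, so logging the fraction of zero deltas is one way to see whether the counter itself is the problem.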

https://msdn.microsoft.com/en-us/library/windows/desktop/ms644904%28v=vs.85%29.aspx

However, their solution of using SetThreadAffinityMask seems senseless. I suppose it will only take a huge amount of time to process. I tried it before: it takes 10 times longer to finish and the results are not accurate at all (in fact, the results are duplicated when using those other APIs). I then inserted these API calls outside the main loops, just before and after the major loops, and it did nothing to stabilize the result.

I'll run some tests on this and see whether I can reproduce the error here. If I cannot reproduce it, I'll simply remove QPC from the list of algo methods and see what happens to the overall performance of the CodeTune API.

Which OS are you using?

About the RadASM example: the result is OK. You are getting the correct timing. :t

guga · January 06, 2016, 02:28:06 PM

I'll try to analyse what QPC is doing internally. Inside hal.dll the API calls _KeQueryPerformanceCounter (which, by the way, calls HalpAcpiTimerQueryPerfCount). I'll check what this internal API does, to see whether it can return 0 and how. I guess the insane amount of time it takes on some PCs is due to the use of QPC as well.

On Win7, QPC uses rdtsc without going through a system call, but on XP it is a system call that also seems to use only rdtsc. http://permalink.gmane.org/gmane.comp.web.chromium.devel/37936
I'm disassembling the library to check whether that is true or not. If QPC simply uses a fancy (bogus, in fact) way of calling rdtsc, I'll simply remove it, since one of the methods already uses only rdtsc.

Grincheux · January 06, 2016, 04:28:38 PM

I use Windows 10 Pro

Quote
It certainly seems like QPC could potentially be too costly.

guga · January 07, 2016, 01:13:55 AM

Indeed, they were correct. QPC (Quotient of Pile of Crap :icon_mrgreen:) is just an rdtsc instruction that gets executed after XXX calls, once it finally arrives inside hal.dll (Win10 is probably faster).

I'm looking at the way Windows XP did it. The only thing that may be interesting is the use of the pause instruction. I inserted it immediately before the main loop (after the starttime and endtime do their jobs) and also forced the calibration and benchmark variables (pointers to memory, in fact) to be aligned on a 128-bit boundary.

With these other techniques I got these results:

Pros: The library is now more stable and faster (1/3 of the overall time to finish: instead of finishing in 10 seconds here, it finishes in 3).
Cons: Accuracy dropped a bit. I mean, the old problem of that timing difference on JJ's function showed up again :(

Today I'm trying a couple of things to see whether the pause instruction is really that useful.

The MASM Forum
