Test of timing code

Started by Antariy, May 08, 2013, 01:11:33 AM


Gunther

Hi Alex,

here are my results:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

Integer looped code test

Cycles count: 68 (minimal was: 314, overhead: 246)
Cycles count: 70 (minimal was: 319, overhead: 246)
Cycles count: 76 (minimal was: 322, overhead: 246)
Cycles count: 68 (minimal was: 317, overhead: 246)
Cycles count: 76 (minimal was: 322, overhead: 246)
Cycles count: 70 (minimal was: 317, overhead: 246)
Cycles count: 68 (minimal was: 315, overhead: 246)
Cycles count: 70 (minimal was: 317, overhead: 246)
Cycles count: 68 (minimal was: 322, overhead: 246)
Cycles count: 76 (minimal was: 322, overhead: 246)


SSE code test

Cycles count: 18 (minimal was: 265, overhead: 246)
Cycles count: 16 (minimal was: 263, overhead: 246)
Cycles count: 18 (minimal was: 264, overhead: 246)
Cycles count: 18 (minimal was: 264, overhead: 246)
Cycles count: 16 (minimal was: 266, overhead: 246)
Cycles count: 16 (minimal was: 264, overhead: 246)
Cycles count: 18 (minimal was: 264, overhead: 246)
Cycles count: 16 (minimal was: 266, overhead: 246)
Cycles count: 18 (minimal was: 264, overhead: 246)
Cycles count: 16 (minimal was: 262, overhead: 246)
--- ok ---

I hope that helps a bit.

Gunther
You have to know the facts before you can distort them.

Antariy

Thank you all very much for your tests :biggrin: Thank you, thank you! :biggrin:


We have all noticed, of course, that on more or less modern CPUs there is something strange with the timings: they are obviously inadequate. Can you imagine that a loop of 10+ instructions, running 8 iterations, plus the function prologue, the epilogue (pops/ret etc.) and the call to it, all takes ~20 cycles? Or, even for the SSE code, the timings are unbelievably small - yes, SSE is a fast beast, but just think about it: how can ~20 instructions take only 2 cycles to be called, processed and returned from? It is simply impossible technically.

Well, my guess is that every CPU with out-of-order execution and prediction logic, i.e. PPro and above, just "plays" a short-cut, keeping some kind of history: when we set a high loop count to get an average, we just force the CPU to be smart, and after, let's say, a couple of thousand repeats it simply shows us how good the prediction logic of that particular generation is. That's why we see that the PIII has pretty stable timings, the PIV not so much, and the Core family reports increasingly inadequate timings with every new generation.

Just imagine such a short-cut algorithm: some code gets called; the CPU fetches it, decodes it, executes it. This first run will be the slowest, but now the CPU has the code ready to run as micro-ops. The CPU also tracks the history of where the code accessed memory, so the next time it checks: if the code accesses the same memory location again, and that region is not "dirty", i.e. not marked as changed, the CPU knows the data is the same; and if it knows (and it knows, of course :biggrin: especially if the test data is smaller than one page) that the code will not access anything beyond the same area as before, then the code will produce the same results as before, and there is no reason to execute it again - it just returns the previously tracked result. So, instead of the execution time we get the time the CPU's logic took to check that the execution context is the same as before and to return the predicted result.

This is a rough example, probably, but you know what I mean.


Here I tried to solve the main problem we have with timing measurements: the influence of the OS' background work when the loop counter specified for the timing macro is small, and the influence of the advanced CPU logic when that loop counter is high.

If we use a high loop counter, we get a "better" average, but we also count a lot of superfluous stuff like background driver execution for every hardware interrupt (every HDD I/O operation - the entire virtual memory subsystem is based on HDD operations, so it can have an influence even when at first glance nothing should cause any I/O; mouse movements, network adapters and so on). With high loop counts this stuff gets divided between the loops, so we get more or less correct timings on CPUs without out-of-order / prediction logic. On more or less modern CPUs we get inadequate timings, simply because it is not execution time but rather prediction time, as I guessed above. I.e. the timing macro works well and does its job, but the CPU has started to play smart games, so what the timing code measures is not what we want to measure.

If we use a small loop counter, we get inadequate timings too: where with high counts that background OS work was divided by a high counter to get an average, with small counts we have hundreds or thousands of background (mostly driver) cycles divided by a small value, so the timings do not show any useful info at all.

So, we need to use as small a loop count as possible, while at the same time avoiding the influence of the multitasking environment.

The code below tries to use the smallest possible loop count by looking for the smallest time in which the tested code gets executed. The logic is: we run the code inside the timing frame, then we compare the total timing with a value that we assume is the minimal possible for the combination of the timing code and the tested code. If the current measurement is bigger than that minimal, it means either that the minimal is too small, or that the OS interrupted our tested code somewhere during the test. The first issue we fix by consequently increasing the minimal value we compare with; the second we fix simply by re-executing the timing frame. Statistically there should be plenty of timespans in which the test runs without any interrupts, and this method catches such a timespan. As you noticed, it is not bothered much even by dropping a timeslice after a test / printing of results, and it still works, because, for instance, "20 cycles" here means that the timing frame + tested code ran only about 20 times: starting from a minimal (for the empty frame that is the overhead) up to minimal + cycles count. I.e. not hundreds, not thousands, not millions of loops that force the CPU to show off its advanced out-of-order techniques, but just ~20.
The minimal - the overhead - is computed by running the "empty" timing loop, starting the check with a minimal of zero. Since the timing frame consists mostly of non-OoO code (it has CPUID), the value we measure is pretty stable, and it is used as the starting minimal for the further measurements to avoid excessive measurement loops (thereby eliminating the CPU's short-cuts as much as possible).
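For clarity, here is a rough C sketch of the idea described above (this is not the actual MASM testbed - the names and the use of MSVC's __rdtsc/__cpuid intrinsics are only my illustration, and the real code may differ in details):

Code:
#include <stdio.h>
#include <intrin.h>   /* __rdtsc, __cpuid - MSVC intrinsics */

/* Serialize the instruction stream, then read the time stamp counter. */
static unsigned __int64 read_tsc(void)
{
    int info[4];
    __cpuid(info, 0);
    return __rdtsc();
}

/* Search for the smallest measurement an undisturbed run can reach:
   time one run of the frame; if it does not fit under the current bound
   ("minimal"), raise the bound by one cycle and try again.  An OS
   interruption only costs one extra iteration, so the search converges
   on the fastest clean run instead of an average over thousands of runs. */
static unsigned __int64 min_cycles(void (*code)(void), unsigned __int64 minimal)
{
    for (;;)
    {
        unsigned __int64 t0 = read_tsc();
        if (code) code();                 /* NULL = empty frame (overhead run) */
        unsigned __int64 t = read_tsc() - t0;

        if (t <= minimal)
            return t;                     /* a clean run fit under the bound */
        ++minimal;                        /* bound too low, or we were interrupted */
    }
}

static void tested(void) { /* code under test goes here */ }

int main(void)
{
    /* Overhead: the empty frame, searched starting from a minimal of zero. */
    unsigned __int64 overhead = min_cycles(NULL, 0);
    unsigned __int64 t        = min_cycles(tested, overhead);
    printf("Cycles count: %llu (overhead: %llu)\n",
           (unsigned long long)(t - overhead), (unsigned long long)overhead);
    return 0;
}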


This means that for a timing frame of, let's say, 100 clocks (the actual number is the "minimal was: XXX" printed in this testbed), the entire measurement process takes only about 2000 CPU cycles, which on modern CPUs means less than one microsecond - the OS' timeslice is much bigger, so we will obviously find some "free time" during which nothing interrupts our test. At the same time, 20...100... (needs checking) loops is not enough to get the CPU's advanced logic fully involved, at least not as much as with high loop counts.




Now the drawbacks. If the tested code runs longer than some maximal count of cycles, the test will be interrupted by the OS, and with the small number of measurement loops we will probably get inadequate results.
If we assume a timeslice for user code of 250 milliseconds, the rough maximum for the example above will be ~5000 clock cycles, i.e. if the tested code runs close to or longer than 5000 cycles, the timings will not be stable.

So, this is not an absolutely universal timing method; it is rather a solution for measurements on modern CPUs and/or for very fast algos that simply cannot be measured adequately with the timing method we are currently using widely in our testbeds. So, it's something like a "quantum measurement" for lightning fast algos :biggrin:


I think the tests proved that this method is good, so I will probably try to turn it into macros.

Many thanks to Michael (MichaelW) for his timing macros! And to Jochen (jj2007) - it was while talking with him that I came up with this idea; now it needs wide testing in the form of a simplified-usage macro :biggrin:

Antariy

And here we can see that the CPU's prediction logic works best of all with SSE code - which seems pretty logical: SSE does not bother with flags, nor is it used for "unpredictably-driven" things like unpredictable runtime changes of the execution path. GPR code like MOV EAX,[somewhere] / ... / MOV ECX,[EAX + EDX] is hard to predict; such things are rare and almost unused in SSE code, where we rather have sequential access to a memory block and simple math on the XMM regs that does not relate to the execution path at all. So, if the CPU knows that the data has not changed (the cache line / page is not marked as dirty) since the previous call, it just plays its short-cut and returns the predicted result - that is the reason for the "proud" 2 clock cycle timings for a complex piece of code :biggrin:

This timing method shows that even on very powerful and modern CPUs the real performance of SSE is far from "near zero" cycles for a bunch of dependent instructions like the tested code. And this obviously should be so.

Hm... A note for Jochen: what do you think, will you-know-who register here to sprinkle all of us with his hydrophobic saliva now, seeing results like these, for instance:

Quote from: habran on May 08, 2013, 06:21:37 AM
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

Integer looped code test

Cycles count: 46 (minimal was: 108, overhead: 58)
Cycles count: 48 (minimal was: 106, overhead: 58)

...

SSE code test

Cycles count: 36 (minimal was: 94, overhead: 58)
Cycles count: 34 (minimal was: 93, overhead: 58)
...

--- ok ---


:greensml:

After reading such an impolite note it will be a hard question for him: to register, or not to register, that is the question :greensml:

(a note for those who don't know what I'm talking about with Jochen: there is someone who will maybe be disappointed to see that his SSE code is not way faster than some old-fashioned i386 code, both running on a very modern CPU model - at least not one test shows it running at "zero" clock cycles :lol:)

Antariy

BTW, in my big post two posts above:

> If we will guess a timeslice for user code as 250 milliseconds

That is a typo: I meant microseconds, not milliseconds - the other numbers are right, since they were computed for microseconds; it's just a typo.


Also, in the test results you may notice interesting details, for instance:

Quote from: habran on May 08, 2013, 06:21:37 AM
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

Integer looped code test

Cycles count: 46 (minimal was: 108, overhead: 58) 1)
Cycles count: 48 (minimal was: 106, overhead: 58) 2)

...

SSE code test

Cycles count: 36 (minimal was: 94, overhead: 58) 3)
Cycles count: 34 (minimal was: 93, overhead: 58) 4)
...

--- ok ---



In test cases 1) and 4) you can see that overhead + cycles count is smaller than the minimal. For instance, in 1) the measurement process took 46+58=104 clock cycles, but the minimal was 108 cycles. What does this mean? It probably means that the test was interrupted and/or slowed down during the measurement loops with 103 <= minimal < 108, so the timings there were too big. Before minimal=104 the timings did not fit because the code was genuinely slower than that; but after the minimal reached 103 or 104 there was a slowdown from the OS side, and right after that everything calmed down and the first value was grabbed that passed the comparison with the minimal, i.e. was smaller than it.

This is a representative picture of how the timing method filters out and eliminates the issues of the multitasking environment.

Antariy

It is interesting how small the penalties are on the P-MMX, which is not a true OoO CPU and so does not have the deep pipelines that cause these penalties when CPUID is executed (the overhead value is very small):

Quote from: FORTRANS on May 08, 2013, 01:47:23 AM
A:\>35hex2dw

Integer looped code test

Cycles count: 116 (minimal was: 145, overhead: 29)
Cycles count: 116 (minimal was: 145, overhead: 29)
Cycles count: 116 (minimal was: 145, overhead: 29)
Cycles count: 121 (minimal was: 150, overhead: 29)
Cycles count: 116 (minimal was: 145, overhead: 29)
Cycles count: 116 (minimal was: 145, overhead: 29)
Cycles count: 116 (minimal was: 145, overhead: 29)
Cycles count: 116 (minimal was: 145, overhead: 29)
Cycles count: 121 (minimal was: 150, overhead: 29)
Cycles count: 121 (minimal was: 150, overhead: 29)


The data collected in this thread is very informative, thank you all again!

Thesis
With the numbers collected here we can guess that CPUs with a high overhead value have a deeper pipeline and/or higher penalties for mispredictions, but for code that is well predictable (for example well serializable) the deep pipeline gives the advantages of OoO and prediction logic.

Just an analysis of the data collected:

Quote from: habran on May 08, 2013, 06:21:37 AM
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

Integer looped code test

Cycles count: 46 (minimal was: 108, overhead: 58)
Cycles count: 48 (minimal was: 106, overhead: 58)

SSE code test

Cycles count: 36 (minimal was: 94, overhead: 58)
Cycles count: 34 (minimal was: 93, overhead: 58)


Not a deep pipeline:
• The CPU recovers quickly from mispredictions;
• Very predictable and serialized code like the SSE test does not gain much over the looped GPR code - the SSE:GPR timing factor is only about 1.35 (46/34).

Also, just a thought: maybe this is because this mobile CPU model has a different circuit design for the predictor logic - to consume less power it has fewer logic elements => a smaller circuit => a smaller TDP than Gunther's CPU:




Quote from: Gunther on May 08, 2013, 07:32:34 AM

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

Integer looped code test

Cycles count: 68 (minimal was: 314, overhead: 246)
Cycles count: 70 (minimal was: 319, overhead: 246)

SSE code test

Cycles count: 18 (minimal was: 265, overhead: 246)
Cycles count: 16 (minimal was: 263, overhead: 246)



A much deeper pipeline (this paragraph goes here only because it relates to the previous CPU, which has the same architecture; otherwise the next paragraph, about the other CPU, would have come here):
• The CPU has bigger penalties for mispredictions;
• Well serializable code like the tested SSE code gains more of an advantage: the SSE:GPR factor is 4.25 (68/16) (!).

So, it looks like this CPU should have a more complex circuit => higher power consumption.

OK, that was a guess... let's check it now... Here are the numbers: habran's CPU TDP is 45 W, Gunther's CPU TDP is 77 W - just as guessed :biggrin: So now we see that from the timing results we can even guess some details of a particular CPU model's design :biggrin:

It would be very nice if John (sinsi) would kindly give us timings for his desktop CPU model :biggrin:




Quote from: anta40 on May 08, 2013, 01:32:55 AM

Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)

Integer looped code test

Cycles count: 36 (minimal was: 124, overhead: 87)
Cycles count: 36 (minimal was: 125, overhead: 87)

SSE code test

Cycles count: 15 (minimal was: 102, overhead: 87)
Cycles count: 15 (minimal was: 102, overhead: 87)


A deeper pipeline (the overhead is 87 clocks instead of 58):
• As already said, the CPU has bigger penalties for mispredictions and faults of the OoO logic;
• Well serializable code runs comparatively better: the SSE:GPR ratio is 2.4 (36/15).




Quote from: hutch-- on May 08, 2013, 02:01:14 AM

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)

Integer looped code test

Cycles count: 36 (minimal was: 261, overhead: 225)
Cycles count: 27 (minimal was: 259, overhead: 225)

SSE code test

Cycles count: 9 (minimal was: 235, overhead: 225)
Cycles count: 9 (minimal was: 234, overhead: 225)


A deeper pipeline: the SSE:GPR ratio is 3 (27/9).




Here you should look at Gunther's CPU data - it fits into this series of consequently growing overhead = pipeline depth = prediction capability. As we remember, its SSE:GPR ratio is 4.25.




Quote from: Magnum on May 08, 2013, 06:38:30 AM

Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)

Integer looped code test

Cycles count: 90 (minimal was: 783, overhead: 684)
Cycles count: 90 (minimal was: 774, overhead: 684)

SSE code test

Cycles count: 18 (minimal was: 702, overhead: 684)
Cycles count: 18 (minimal was: 702, overhead: 684)


The deepest pipeline in this series of tests: the SSE:GPR factor is 5 (90/18).



So, as you can see, the collected experimental data proves the thesis from the start of this message: a higher overhead value indicates more complex prediction logic in the CPU.
Here I used only the data of the Core family CPUs, but you all know that Netburst has a deep pipeline and behaves very badly when the code is not well predictable or when a prediction fails, so the PIV and Celerons here show the same behaviour in this series of tests, as does the one AMD - it has a small overhead (63), and its SSE:GPR ratio is small too (1.31) :biggrin:

Probably we have a good additional measurement instrument :biggrin:

sinsi

Your wish is my command  :biggrin:

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)

Integer looped code test

Cycles count: 53 (minimal was: 141, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 53 (minimal was: 141, overhead: 87)
Cycles count: 53 (minimal was: 143, overhead: 87)
Cycles count: 53 (minimal was: 143, overhead: 87)
Cycles count: 53 (minimal was: 143, overhead: 87)


SSE code test

Cycles count: 9 (minimal was: 96, overhead: 87)
Cycles count: 6 (minimal was: 93, overhead: 87)
Cycles count: 9 (minimal was: 116, overhead: 87)
Cycles count: 35 (minimal was: 122, overhead: 87)
Cycles count: 35 (minimal was: 122, overhead: 87)
Cycles count: 35 (minimal was: 123, overhead: 87)
Cycles count: 35 (minimal was: 122, overhead: 87)
Cycles count: 9 (minimal was: 98, overhead: 87)
Cycles count: 35 (minimal was: 123, overhead: 87)
Cycles count: 35 (minimal was: 123, overhead: 87)

Antariy

Thank you very much, John! :biggrin: Well, I'm crazy already, tired of typing tonight... :greensml: Anyone who wants to can explain its behaviour, but I'm going offline. At least half of the time its SSE:GPR factor is 1.51 (53/35), which fits the thesis perfectly, but sometimes there is something weird; probably the desktop CPU models have prediction logic so powerful that even 9 loops is enough to make it use the short-cut, and the same applies to the overhead computation loop, so... We strongly need more experimental data! :lol:
BTW, you ran it from WinRAR, didn't you?

My next thought was a "threshold" value to add to the testing method, but that is already a job for the macros...
Please, John, test the archive I have just posted; it has a hard-coded threshold for both algos, chosen specifically for your results. We could actually automate such things and get a timing scheme where the code gets called just a couple of times - that would really be a quantum measurement :lol:
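Just to illustrate what I mean by a threshold, here is a rough C sketch of one possible way to wire it in (this is only my guess at the idea, not what the posted archive actually does; the THRESHOLD value and the names are made up for illustration, and the real testbed is MASM):

Code:
#include <intrin.h>   /* __rdtsc, __cpuid - MSVC intrinsics */

#define THRESHOLD 30  /* cycles above the overhead - an illustrative value only */

/* Serialize, then read the time stamp counter. */
static unsigned __int64 read_tsc(void)
{
    int info[4];
    __cpuid(info, 0);
    return __rdtsc();
}

/* Same minimal-seeking search as before, but measurements that come in
   below overhead + THRESHOLD are treated as prediction short-cuts and
   discarded, so an implausibly fast "run" cannot end the search early. */
static unsigned __int64 min_cycles_thr(void (*code)(void),
                                       unsigned __int64 minimal,
                                       unsigned __int64 overhead)
{
    for (;;)
    {
        unsigned __int64 t0 = read_tsc();
        if (code) code();
        unsigned __int64 t = read_tsc() - t0;

        if (t < overhead + THRESHOLD)
            continue;                 /* too good to be true: discard and retry */
        if (t <= minimal)
            return t;                 /* plausible and under the bound: accept */
        ++minimal;                    /* bound too low, or we were interrupted */
    }
}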

My results:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

Integer looped code test

Cycles count: 56 (minimal was: 528, overhead: 472)
Cycles count: 48 (minimal was: 522, overhead: 472)
Cycles count: 48 (minimal was: 522, overhead: 472)
Cycles count: 48 (minimal was: 522, overhead: 472)
Cycles count: 48 (minimal was: 522, overhead: 472)
Cycles count: 48 (minimal was: 522, overhead: 472)
Cycles count: 48 (minimal was: 522, overhead: 472)
Cycles count: 48 (minimal was: 523, overhead: 472)
Cycles count: 48 (minimal was: 523, overhead: 472)
Cycles count: 48 (minimal was: 522, overhead: 472)


SSE code test

Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
Cycles count: 16 (minimal was: 504, overhead: 472)
--- ok ---


Please test it, everyone - but if you have problems like unstable timings, remember that it is because the threshold is hard-coded and is not known for your particular system. It should actually make the timings more stable, as it did for my CPU, but a hard-coded value is still just hard-coded...

sinsi


Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)

Integer looped code test

Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 144, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 144, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 144, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)
Cycles count: 56 (minimal was: 143, overhead: 87)


SSE code test

Cycles count: 9 (minimal was: 119, overhead: 87)
Cycles count: 35 (minimal was: 122, overhead: 87)
Cycles count: 6 (minimal was: 120, overhead: 87)
Cycles count: 35 (minimal was: 122, overhead: 87)
Cycles count: 35 (minimal was: 122, overhead: 87)
Cycles count: 9 (minimal was: 119, overhead: 87)
Cycles count: 9 (minimal was: 119, overhead: 87)
Cycles count: 6 (minimal was: 122, overhead: 87)
Cycles count: 9 (minimal was: 121, overhead: 87)
Cycles count: 38 (minimal was: 125, overhead: 87)

BTW, for the first test I ran it from WinRAR; the first two SSE tests were -23, which I ignored.

Antariy

Thank you, John :biggrin: At least the timings for the GPR code are now stable; there still seems to be something funny with the highly predictable SSE code... well, choosing the threshold manually is an improper thing to do, so this obviously needs some sharpening and automation :lol: Maybe even recalculate the overhead before every measurement, just like Michael's macros do.
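Continuing the rough C sketch from my earlier post (min_cycles, read_tsc and tested are the illustration names used there, not the real testbed), recalculating the overhead before every measurement would simply mean redoing the empty-frame search each time, e.g. by replacing that sketch's main():

Code:
/* Recalibrate the empty-frame overhead before every measurement, in the
   spirit of MichaelW's macros, instead of measuring it only once. */
int main(void)
{
    int i;
    for (i = 0; i < 10; ++i)
    {
        unsigned __int64 overhead = min_cycles(NULL, 0);   /* fresh overhead */
        unsigned __int64 t        = min_cycles(tested, overhead);
        printf("Cycles count: %llu (overhead: %llu)\n",
               (unsigned long long)(t - overhead), (unsigned long long)overhead);
    }
    return 0;
}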

So, thoughts are welcome :t

habran

me know nothing...  :icon_redface:
you :t
:biggrin: :biggrin: :biggrin:
Cod-Father

FORTRANS

Quote from: Antariy on May 08, 2013, 03:33:26 PM
Please all - test it,

Hi Alex,

   Ran it on the same computers as in Reply #7, and the results
look to be unchanged.

Regards,

Steve

Antariy

Quote from: habran on May 08, 2013, 05:39:05 PM
me know nothing...  :icon_redface:
you :t
:biggrin: :biggrin: :biggrin:

:biggrin: :t (no, I really don't know either :icon_redface:)

Quote from: FORTRANS on May 08, 2013, 10:56:21 PM
Hi Alex,

   Ran it on the same computers as in Reply #7, and the results
look to be unchanged.

Thank you very much, Steve!


So it seems that the reason lies in John's CPU :biggrin:

Gunther

Alex,

here again are the results with the program from test2.zip:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

Integer looped code test

Cycles count: 36 (minimal was: 189, overhead: 136)
Cycles count: 48 (minimal was: 186, overhead: 136)
Cycles count: 40 (minimal was: 187, overhead: 136)
Cycles count: 44 (minimal was: 186, overhead: 136)
Cycles count: 44 (minimal was: 187, overhead: 136)
Cycles count: 40 (minimal was: 186, overhead: 136)
Cycles count: 40 (minimal was: 187, overhead: 136)
Cycles count: 56 (minimal was: 193, overhead: 136)
Cycles count: 52 (minimal was: 188, overhead: 136)
Cycles count: 28 (minimal was: 186, overhead: 136)


SSE code test

Cycles count: 8 (minimal was: 168, overhead: 136)
Cycles count: 12 (minimal was: 168, overhead: 136)
Cycles count: 12 (minimal was: 168, overhead: 136)
Cycles count: 12 (minimal was: 168, overhead: 136)
Cycles count: 8 (minimal was: 168, overhead: 136)
Cycles count: 12 (minimal was: 168, overhead: 136)
Cycles count: 12 (minimal was: 168, overhead: 136)
Cycles count: 12 (minimal was: 168, overhead: 136)
Cycles count: 8 (minimal was: 168, overhead: 136)
Cycles count: 12 (minimal was: 168, overhead: 136)
--- ok ---

Gunther
You have to know the facts before you can distort them.

Antariy


habran

Hey Antariy,
I appreciate your enthusiasm :t
I guess you must be a young man, judging by your energy level :biggrin:
   
Cod-Father