Thank you all very much for your tests

Thank you, thank you!

We have all noticed, of course, that on more or less modern CPUs there is something strange with the timings: they are obviously inadequate. Can you imagine that a loop whose body consists of 10+ instructions, running 8 iterations, plus the function prologue, the epilogue (pops/ret etc.) and the call to it, all takes ~20 cycles? Or even for SSE code - the timings are unbelievably small. Yes, SSE is a fast beast, but just think about it: how can ~20 instructions take only 2 cycles to be called, processed and returned from? It is technically just impossible.
Well, my guess is that every CPU with out-of-order execution and prediction logic, i.e. PPro and above, just "plays" a short-cut, some kind of history: when we set a high loop count to get an average, we just force the CPU to be smart, and after, let's say, a couple of thousand repeats it just shows us how good the prediction logic of that particular generation is. So we see things like: the PIII has pretty good stability of timings, the PIV not so good, and the Core family shows increasingly inadequate reported timings with every new generation.
Just imagine such a short-cut algorithm: some code gets called; the CPU fetches it, decodes it, executes it. This first run will be the slowest, but now the CPU has the code ready to run as micro-ops. The CPU also tracks the history - which memory locations the code accessed - so the next time, the CPU checks: if the code accesses the same memory locations again, and that region is not "dirty", i.e. not marked as changed, the CPU knows the data is the same. And if it knows (and it does know, of course, especially if the testing data is smaller than one page) that the code will not access anything beyond the same area as before, then the code will produce the same results as before, and there is no reason to execute it again - it just returns the previously tracked result. So, instead of the execution time, we get the time the CPU's logic took to check that the execution context is the same as before and to return the predicted result.
This is a rough example, probably, but you know what I mean.
Here I tried to solve the main problems we have with timing measurements: the influence of the OS' background work when the loop count specified for the timing macro is small, and the influence of the advanced CPU logic when the loop count specified for the timing macro is high.
If we use a high loop count, we get a "better" average, but we also count a lot of superfluous stuff like background driver execution for every hardware interrupt (every HDD I/O operation - the entire virtual memory subsystem is based on HDD operations, so it can have an influence even when, at first glance, there is nothing to cause any I/O; mouse movements, network adapters and so on). With high loop counts this stuff gets divided between the loops, so we get more or less correct timings on CPUs without out-of-order/prediction logic. On more or less modern CPUs we get inadequate timings - just because it is not execution time but rather prediction time, as I guessed above. I.e., the timing macro works well and does its job, but the CPU has started to play smart games, so what the timing code measures is not what we want to measure.
If we use a small loop count, we get inadequate timings for the opposite reason: where with high counts that background OS work was divided by a high counter to get the average, with small counts we have hundreds or thousands of background (mostly driver) cycles divided by a small value, so the timings do not show any useful info at all. For example, a single driver interrupt that steals 10,000 cycles adds only 0.01 cycles per loop to an average over 1,000,000 loops, but 1,000 cycles per loop to an average over 10 loops.
So, we need to use as small a loop count as possible, but at the same time avoid the influence of the multitasking environment.
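To make the trade-off concrete, here is a rough C sketch of the conventional high-loop-count approach (hypothetical names; this is just the idea, not MichaelW's actual macros, which are MASM and serialize with CPUID):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang) */

    /* Conventional approach: run the payload many times and average.
       Background interrupts get amortized over the loops, but the high
       repeat count also lets the CPU's predictors "learn" the code. */
    static uint64_t measure_average(void (*payload)(void), uint64_t loops)
    {
        uint64_t t0, t1, i;
        _mm_lfence();                /* fence stands in for CPUID serialization */
        t0 = __rdtsc();
        for (i = 0; i < loops; i++)
            payload();
        _mm_lfence();
        t1 = __rdtsc();
        return (t1 - t0) / loops;    /* average cycles per iteration */
    }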
The code below tries to use the smallest possible loop count by catching the smallest time in which the tested code gets executed. The logic is: we run the code inside a timing frame, then we compare the total timing with a value that we think is the minimal possible for the union of the timing code and the tested code. If the current measurement is bigger than that minimum, it means either that the minimum is too small, or that the OS interrupted our tested code somewhere during the test. The first issue we fix by consequently increasing the minimal value we check against; the second issue we fix just by re-executing the timing frame again. Statistically there will be a time window in which the test runs without any interrupts, and this method catches that window. As you noticed, it doesn't even bother with dropping a timeslice after a test / printing results - and it still works, because, for instance, "20 cycles" here means that the timing frame + tested code got run just 20 times: starting from the minimum (for the empty frame it is the overhead) up to the minimum + loop count. I.e. not hundreds, not thousands, not millions of loops that force the CPU to show us its advanced out-of-order techniques, but just 20.
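Here is a minimal C sketch of that logic (hypothetical names, assuming the tested code is wrapped in a payload function; the real testbed is assembly and serializes with CPUID, for which _mm_lfence is a simplified stand-in):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang) */

    /* One pass of the timing frame: serialize, read TSC, run the tested
       code once, serialize, read TSC again. */
    static uint64_t timed_pass(void (*payload)(void))
    {
        uint64_t t0, t1;
        _mm_lfence();
        t0 = __rdtsc();
        payload();
        _mm_lfence();
        t1 = __rdtsc();
        return t1 - t0;
    }

    /* Accept a pass only when it fits under a threshold that starts at the
       calibrated frame overhead and grows by one cycle per retry. A pass
       that was interrupted by the OS simply fails the check and gets re-run;
       the loop stops at the first undisturbed pass, so the number of loops
       executed equals the reported cycle count. */
    static uint64_t measure_min(void (*payload)(void), uint64_t overhead)
    {
        uint64_t threshold = overhead;
        while (timed_pass(payload) > threshold)
            threshold++;             /* threshold too small, or we got interrupted */
        return threshold - overhead; /* cycles attributed to the tested code */
    }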
The minimum - the overhead - is computed just by executing an "empty" timing frame, starting the check from a zero minimum. Since the timing frame consists mostly of serializing, non-OoO code (it has CPUID), the value we measure is pretty stable, and it is used as the starting minimum in the further measurements to avoid excessive measurement loops (and so eliminate the CPU's short-cuts as much as possible).
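Continuing the sketch above, the calibration is just the same routine applied to an empty payload with a zero starting minimum:

    #include <stdio.h>

    static void empty(void) { }      /* empty timing frame */

    int main(void)
    {
        /* frame overhead: smallest time in which the empty frame completes */
        uint64_t overhead = measure_min(empty, 0);
        printf("minimal was: %llu\n", (unsigned long long)overhead);

        /* then any tested code is measured against that minimum, e.g.:
           printf("%llu cycles\n",
                  (unsigned long long)measure_min(tested_code, overhead)); */
        return 0;
    }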
This means that for a timing frame of, let's say, 100 clocks (the actual number is the "minimal was: XXX" printed in this testbed), we take only 100 x 20 = 2000 CPU cycles for the entire measurement process, which on modern CPUs means no more than a microsecond - the OS' timeslice is much bigger, so we will obviously find some "free time" when nothing interrupts our test. At the same time, 20...100... (needs checking) loops is not enough to get the CPU's advanced logic involved, at least not as much as with high loop counts.
Now the drawbacks. If the tested code runs longer than some maximal number of cycles, the test will be interrupted by the OS, and with the small loop count of measurements we will probably get inadequate results.
If we guess the timeslice for user code to be 250 milliseconds, then the rough maximum for the example above is ~5000 clock cycles, i.e., if the tested code runs close to or longer than 5000 cycles, the timings will not be stable.
So, this is not an absolutely universal timing method; it is rather a solution for measurement on modern CPUs and/or for very fast algos that just cannot be adequately measured with the timing method we are currently using widely in our testbeds. It is something like a "quantum measurement" for lightning-fast algos.

I think the tests have proved that this method is good, so I will probably try to package it as macros.
Many thanks to Michael (MichaelW) for his timing macros! And to Jochen (jj2007) - it was while talking with him that I thought out this idea. Now it needs wide testing in the form of a simplified macro usage.
