News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Time to crash the "best algo" to format dword unsigned numbers

Started by frktons, December 26, 2012, 11:06:06 AM

Previous topic - Next topic

frktons

Quote from: dedndave on December 28, 2012, 02:22:57 PM
if you want more repeatable results, let the test run a little longer   :P
These results are more real than making the loop 1 million times.
In real life, and in the "File Compressor" 1000 times are more than enough.
:P
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

Gunther

Hi frktons,

Quote from: frktons on December 28, 2012, 03:44:10 PM
In real life, and in the "File Compressor" 1000 times are more than enough.
:P

that's not quiet right. A few years ago, I've written a fractal image compressor. The bottle neck were 5 simple procedures (calculating arithmetic mean, dot product etc.) which wasted 90% of computing time. For a small image with, lets say, 256x256 pixels those functions were called over 30 million times. That's the reality.

Gunther
You have to know the facts before you can distort them.

dedndave

what i was refering to was increasing the loop count for longer tests
these numbers won't jump all over the place - makes it easier to get a comparison
Quote8,568   cycles for NumFormatX - IDIV / Stack
8,523   cycles for NumFormatX - IDIV / Stack
8,678   cycles for NumFormatX - IDIV / Stack
8,526   cycles for NumFormatX - IDIV / Stack
8,576   cycles for NumFormatX - IDIV / Stack
9,091   cycles for NumFormatX - IDIV / Stack
8,394   cycles for NumFormatX - IDIV / Stack
8,301   cycles for NumFormatX - IDIV / Stack
8,709   cycles for NumFormatX - IDIV / Stack
8,390   cycles for NumFormatX - IDIV / Stack
8,605   cycles for NumFormatX - IDIV / Stack
8,648   cycles for NumFormatX - IDIV / Stack
8,643   cycles for NumFormatX - IDIV / Stack
8,529   cycles for NumFormatX - IDIV / Stack
8,199   cycles for NumFormatX - IDIV / Stack
8,527   cycles for NumFormatX - IDIV / Stack
8,406   cycles for NumFormatX - IDIV / Stack
8,622   cycles for NumFormatX - IDIV / Stack
8,533   cycles for NumFormatX - IDIV / Stack
8,516   cycles for NumFormatX - IDIV / Stack
8,832   cycles for NumFormatX - IDIV / Stack
8,570   cycles for NumFormatX - IDIV / Stack
8,402   cycles for NumFormatX - IDIV / Stack
8,663   cycles for NumFormatX - IDIV / Stack
8,141   cycles for NumFormatX - IDIV / Stack
8,369   cycles for NumFormatX - IDIV / Stack
8,360   cycles for NumFormatX - IDIV / Stack
8,292   cycles for NumFormatX - IDIV / Stack
they are jumping around +/- 6 %   ::)

frktons

Quote from: Gunther on December 28, 2012, 10:31:18 PM
Hi frktons,

Quote from: frktons on December 28, 2012, 03:44:10 PM
In real life, and in the "File Compressor" 1000 times are more than enough.
:P

that's not quiet right. A few years ago, I've written a fractal image compressor. The bottle neck were 5 simple procedures (calculating arithmetic mean, dot product etc.) which wasted 90% of computing time. For a small image with, lets say, 256x256 pixels those functions were called over 30 million times. That's the reality.

Gunther
Hi Gunther,
In your case it's good to think in these terms, in my case, with
a routine that formats binary numbers into strings, it isn't.
The reality I'm speaking of is just the program in which I'm going
to use the routine, not the universal and irreversible truth. :P

Quote from: dedndave on December 28, 2012, 10:52:06 PM
what i was refering to was increasing the loop count for longer tests
these numbers won't jump all over the place - makes it easier to get a comparison,
....
they are jumping around +/- 6 %   ::)
According to my experience, the most credible tests are those that
simulate in the best possible way the reality we have to simulate.
Having a loop with million/s calls to the routines is useful only
when you presume the program is going to do that, otherwise there
is no point simulating a non-existing/high improbable case, that takes in the overhead
of a multitask environment, among other influences as well.

+/- 6% is a reasonable amount I can accept.  :biggrin:

Frank
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave


frktons

Quote from: dedndave on December 29, 2012, 07:50:39 AM
i'm good if you are   8)
I'm more than good if you are good. :biggrin:
I only hope that you understand my point of view, even if you disagree,
or your experince tells you otherwise. :t
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

MichaelW

Quote from: frktons on December 29, 2012, 07:26:22 AM
+/- 6% is a reasonable amount I can accept.

What about when you are trying to optimize your code for speed? In this case with a 12% uncertainty in the run time you would have to run multiple trials to recognize small increases in speed. With a higher loop count you could recognize these increases in one trial.
Well Microsoft, here's another nice mess you've gotten us into.

frktons

Quote from: MichaelW on December 29, 2012, 07:58:16 AM
Quote from: frktons on December 29, 2012, 07:26:22 AM
+/- 6% is a reasonable amount I can accept.

What about when you are trying to optimize your code for speed? In this case with a 12% uncertainty in the run time you would have to run multiple trials to recognize small increases in speed. With a higher loop count you could recognize these increases in one trial.


Yes Michael,  of course the tests have to be useful for
something. In this case I'm just showing that an old algo can
be surpassed by a new one parallellizing some passages.
I started the thread saying that we can have an algo that
is about 2:1 faster than the previous one. If I run the test 1,000
times or 1 million times, it doesn't change a lot the results,
and if it does, it is highly probable external influences that do
it, not the algo in itself.
But the most important thing is, from my point of view, what are we
looking for in a test. About 2:1 is about 100% faster, that's enough
for what I'm searching for. If a +/- 6% occurr in "about 100%" then
it is reasonable, +/-20/30% would be less reasonable.

Frank
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

frktons

For lovers of longer loop counts  :lol:

This test had a loop count of 10 millions:
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
3.005   cycles for NumFormatX - IDIV / Stack
2.203   cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.367   cycles for NumFormatF-I - IDIV / Stack / Table
1.185   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.113   cycles for NumFormatX - IDIV / Stack
2.338   cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.475   cycles for NumFormatF-I - IDIV / Stack / Table
1.126   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.130   cycles for NumFormatX - IDIV / Stack
2.121   cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.496   cycles for NumFormatF-I - IDIV / Stack / Table
1.109   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.197   cycles for NumFormatX - IDIV / Stack
2.285   cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.463   cycles for NumFormatF-I - IDIV / Stack / Table
1.015   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------

Press a key to exit ...

This is NOT the final test, by the way. The III step is work in progress...

Dave, in your opinion/knowledge, what's the meaning of 6% difference
between:
Quote
3.005   cycles for NumFormatX - IDIV / Stack
and
Quote
3.197   cycles for NumFormatX - IDIV / Stack

I'm not able to interpret it correctly, not to speak of 16% difference:
Quote
1.185   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
and
Quote
1.015   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table

Is somebody able to tell me what this means? I just think this is
a sign of external influence [OS, swap, other processes in memory...]
and has nothing to do with the algo we are testing.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

MichaelW

Maybe it will work as expected if you try 10 billion :biggrin:

Now that I look at your code I see that you are not setting an affinity mask to limit the process/thread to a single core. You cannot depend on the time-stamp counters being in sync between the cores, and the system can (allegedly) move the process/thread to a different core than it started on. If that does not correct the problem then try increasing the priority. Running on a P4 processor with HT, I have had no problems at the highest possible priority (combining REALTIME_PRIORITY_CLASS  with THREAD_PRIORITY_TIME_CRITICAL), even with test code that triggered an exception. Note that the macros you are using cannot control the thread priority, but there is nothing to stop you from doing this separately.


Well Microsoft, here's another nice mess you've gotten us into.

frktons

Quote from: MichaelW on December 29, 2012, 06:54:29 PM
Maybe it will work as expected if you try 10 billion :biggrin:

Now that I look at your code I see that you are not setting an affinity mask to limit the process/thread to a single core. You cannot depend on the time-stamp counters being in sync between the cores, and the system can (allegedly) move the process/thread to a different core than it started on. If that does not correct the problem then try increasing the priority. Running on a P4 processor with HT, I have had no problems at the highest possible priority (combining REALTIME_PRIORITY_CLASS  with THREAD_PRIORITY_TIME_CRITICAL), even with test code that triggered an exception. Note that the macros you are using cannot control the thread priority, but there is nothing to stop you from doing this separately.
Well Michael, I'm here to learn something. So what should I do?
Should I modify this line:

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

or use a new macro/call/whatever?
Better if you post the code other than your suggestion, so I can
understand what you mean. :biggrin:
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave

before you run any tests...
        INVOKE  GetCurrentProcess
        INVOKE  SetProcessAffinityMask,eax,1
        INVOKE  Sleep,500

that sets the processor affinity to a single core (core 0)
the Sleep gives it a chance to bind, 750 mS works a little better, but who wants to wait that long   :P
the idea is to relinquish the current time-slice, then wait a little extra

then, for each test, try to choose a loop count that yields a 500 mS test
even using HIGH_PRIORITY_CLASS, you will see fairly stable results
if you use REALTIME_PRIORITY_CLASS, and the test is a bit too long, it may hang

you may get a little extra stability by diddling with the thread priority, but generally not worth the effort
on machines with more than 64 cores, you might have to play with thread affinity
the documents tell us little about machines that have 33 to 64 cores   :lol:
i don't think any forum members have more than 8 cores
for us normal people, thread affinity is simply a subset of process affinity

dedndave

there is some behaviour that i have not figured out
that is - we usually run some code to identify and display the cpu name and features
then we run the tests

if you put the cpu id code after the tests, you will get different results
at least, that's how it is on my machine, which is a little weird because it's running XP MCE

frktons

Using the 3 lines of code that Dave suggests, and running the
test 1 million times:
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
2.923   cycles for NumFormatX - IDIV / Stack
2.179   cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.463   cycles for NumFormatF-I - IDIV / Stack / Table
1.249   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.119   cycles for NumFormatX - IDIV / Stack
2.197   cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.475   cycles for NumFormatF-I - IDIV / Stack / Table
1.249   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.131   cycles for NumFormatX - IDIV / Stack
2.198   cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.484   cycles for NumFormatF-I - IDIV / Stack / Table
1.256   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.134   cycles for NumFormatX - IDIV / Stack
2.210   cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.475   cycles for NumFormatF-I - IDIV / Stack / Table
1.248   cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
What can we say now?
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave

it's not "how many times" you run the test, really
it's how much time you spend running it
you want to run it for a long enough period of time so that OS intervention is negligible
or, at least, consistent - lol

Quotethen, for each test, try to choose a loop count that yields a 500 mS test