Time to crash the "best algo" to format dword unsigned numbers

frktons · December 28, 2012, 03:44:10 PM

Quote from: dedndave on December 28, 2012, 02:22:57 PM
if you want more repeatable results, let the test run a little longer :P

These results are more realistic than running the loop 1 million times.
In real life, and in the "File Compressor", 1000 times are more than enough.
:P

Gunther · December 28, 2012, 10:31:18 PM

Hi frktons,

Quote from: frktons on December 28, 2012, 03:44:10 PM
In real life, and in the "File Compressor", 1000 times are more than enough.
:P

that's not quite right. A few years ago, I wrote a fractal image compressor. The bottleneck was 5 simple procedures (calculating the arithmetic mean, dot product, etc.), which consumed 90% of the computing time. For a small image of, let's say, 256x256 pixels, those functions were called over 30 million times. That's the reality.

Gunther

dedndave · December 28, 2012, 10:52:06 PM

what i was referring to was increasing the loop count for longer tests
these numbers won't jump all over the place - makes it easier to get a comparison

Quote
8,568 cycles for NumFormatX - IDIV / Stack
8,523 cycles for NumFormatX - IDIV / Stack
8,678 cycles for NumFormatX - IDIV / Stack
8,526 cycles for NumFormatX - IDIV / Stack
8,576 cycles for NumFormatX - IDIV / Stack
9,091 cycles for NumFormatX - IDIV / Stack
8,394 cycles for NumFormatX - IDIV / Stack
8,301 cycles for NumFormatX - IDIV / Stack
8,709 cycles for NumFormatX - IDIV / Stack
8,390 cycles for NumFormatX - IDIV / Stack
8,605 cycles for NumFormatX - IDIV / Stack
8,648 cycles for NumFormatX - IDIV / Stack
8,643 cycles for NumFormatX - IDIV / Stack
8,529 cycles for NumFormatX - IDIV / Stack
8,199 cycles for NumFormatX - IDIV / Stack
8,527 cycles for NumFormatX - IDIV / Stack
8,406 cycles for NumFormatX - IDIV / Stack
8,622 cycles for NumFormatX - IDIV / Stack
8,533 cycles for NumFormatX - IDIV / Stack
8,516 cycles for NumFormatX - IDIV / Stack
8,832 cycles for NumFormatX - IDIV / Stack
8,570 cycles for NumFormatX - IDIV / Stack
8,402 cycles for NumFormatX - IDIV / Stack
8,663 cycles for NumFormatX - IDIV / Stack
8,141 cycles for NumFormatX - IDIV / Stack
8,369 cycles for NumFormatX - IDIV / Stack
8,360 cycles for NumFormatX - IDIV / Stack
8,292 cycles for NumFormatX - IDIV / Stack

they are jumping around +/- 6 % ::)
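
As a rough check on that figure, the spread of the 28 readings quoted above can be computed directly. This is a minimal sketch in Python, not code from the thread, and the variable names are only illustrative:

```python
# The 28 NumFormatX cycle counts quoted in the post above.
samples = [
    8568, 8523, 8678, 8526, 8576, 9091, 8394, 8301, 8709, 8390,
    8605, 8648, 8643, 8529, 8199, 8527, 8406, 8622, 8533, 8516,
    8832, 8570, 8402, 8663, 8141, 8369, 8360, 8292,
]

mean = sum(samples) / len(samples)
lowest, highest = min(samples), max(samples)

# Half of the min-to-max range, as a percentage of the mean.
half_range_pct = (highest - lowest) / 2 / mean * 100

# Largest deviation above and below the mean, in percent.
above_pct = (highest - mean) / mean * 100
below_pct = (mean - lowest) / mean * 100

print(f"mean {mean:.0f}, min {lowest}, max {highest}")
print(f"half range: +/- {half_range_pct:.1f} %")
print(f"worst case: +{above_pct:.1f} % / -{below_pct:.1f} %")
```

The half range comes out a little under 6 % of the mean, which is consistent with the estimate above.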

frktons · December 29, 2012, 07:26:22 AM

Quote from: Gunther on December 28, 2012, 10:31:18 PM
Hi frktons,

Quote from: frktons on December 28, 2012, 03:44:10 PM
In real life, and in the "File Compressor", 1000 times are more than enough.
:P

that's not quite right. A few years ago, I wrote a fractal image compressor. The bottleneck was 5 simple procedures (calculating the arithmetic mean, dot product, etc.), which consumed 90% of the computing time. For a small image of, let's say, 256x256 pixels, those functions were called over 30 million times. That's the reality.

Gunther

Hi Gunther,
In your case it's good to think in these terms; in my case, with
a routine that formats binary numbers into strings, it isn't.
The reality I'm speaking of is just the program in which I'm going
to use the routine, not the universal and irreversible truth. :P

Quote from: dedndave on December 28, 2012, 10:52:06 PM
what i was referring to was increasing the loop count for longer tests
these numbers won't jump all over the place - makes it easier to get a comparison,
....
they are jumping around +/- 6 % ::)

In my experience, the most credible tests are those that
simulate as closely as possible the conditions the code will actually run under.
A loop with millions of calls to the routines is useful only
when you expect the program to do that; otherwise there
is no point simulating a non-existent or highly improbable case, which also picks up
the overhead of a multitasking environment, among other influences.

+/- 6% is a reasonable amount I can accept.

Frank

dedndave · December 29, 2012, 07:50:39 AM

i'm good if you are 8)

frktons · December 29, 2012, 07:56:39 AM

Quote from: dedndave on December 29, 2012, 07:50:39 AM
i'm good if you are 8)

I'm more than good if you are good.

I only hope that you understand my point of view, even if you disagree,
or your experience tells you otherwise. :t

MichaelW · December 29, 2012, 07:58:16 AM

Quote from: frktons on December 29, 2012, 07:26:22 AM
+/- 6% is a reasonable amount I can accept.

What about when you are trying to optimize your code for speed? In this case with a 12% uncertainty in the run time you would have to run multiple trials to recognize small increases in speed. With a higher loop count you could recognize these increases in one trial.

frktons · December 29, 2012, 08:13:20 AM

Quote from: MichaelW on December 29, 2012, 07:58:16 AM
Quote from: frktons on December 29, 2012, 07:26:22 AM
+/- 6% is a reasonable amount I can accept.

What about when you are trying to optimize your code for speed? In this case with a 12% uncertainty in the run time you would have to run multiple trials to recognize small increases in speed. With a higher loop count you could recognize these increases in one trial.

Yes Michael, of course the tests have to be useful for
something. In this case I'm just showing that an old algo can
be surpassed by a new one by parallelizing some passages.
I started the thread saying that we can have an algo that
is about 2:1 faster than the previous one. Whether I run the test 1,000
times or 1 million times, it doesn't change the results much,
and if it does, it is most probably external influences that do
it, not the algo itself.
But the most important thing, from my point of view, is what we are
looking for in a test. About 2:1 is about 100% faster, and that's enough
for what I'm searching for. If a +/- 6% variation occurs in "about 100%", then
it is reasonable; +/- 20-30% would be less reasonable.

Frank
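
The arithmetic behind that argument can be sketched as follows. This is an illustration in Python, not code from the thread. It assumes, pessimistically, that the two timings are each off by the full error in opposite directions, and the function name is hypothetical.

```python
def ratio_bounds(nominal_ratio, rel_error):
    """Worst-case range of old_time / new_time when each timing
    may be off by +/- rel_error (given as a fraction, e.g. 0.06)."""
    low = nominal_ratio * (1 - rel_error) / (1 + rel_error)
    high = nominal_ratio * (1 + rel_error) / (1 - rel_error)
    return low, high

low, high = ratio_bounds(2.0, 0.06)
print(f"2:1 with +/- 6 % noise  -> between {low:.2f}:1 and {high:.2f}:1")

low, high = ratio_bounds(2.0, 0.30)
print(f"2:1 with +/- 30 % noise -> between {low:.2f}:1 and {high:.2f}:1")
```

Under that assumption, a 2:1 result with 6 % noise still lies between roughly 1.77:1 and 2.26:1, whereas with 30 % noise the lower bound falls to about 1.08:1, which is barely an improvement.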

frktons · December 29, 2012, 08:40:01 AM

For lovers of longer loop counts :lol:

This test had a loop count of 10 million:

Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
3.005 cycles for NumFormatX - IDIV / Stack
2.203 cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.367 cycles for NumFormatF-I - IDIV / Stack / Table
1.185 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.113 cycles for NumFormatX - IDIV / Stack
2.338 cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.475 cycles for NumFormatF-I - IDIV / Stack / Table
1.126 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.130 cycles for NumFormatX - IDIV / Stack
2.121 cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.496 cycles for NumFormatF-I - IDIV / Stack / Table
1.109 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.197 cycles for NumFormatX - IDIV / Stack
2.285 cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.463 cycles for NumFormatF-I - IDIV / Stack / Table
1.015 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------

Press a key to exit ...

This is NOT the final test, by the way. Step III is a work in progress...

Dave, in your opinion/knowledge, what's the meaning of a 6% difference
between:

Quote
3.005 cycles for NumFormatX - IDIV / Stack

and

Quote
3.197 cycles for NumFormatX - IDIV / Stack

I'm not able to interpret it correctly, not to mention the 16% difference between:

Quote
1.185 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table

and

Quote
1.015 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table

Is somebody able to tell me what this means? I just think this is
a sign of external influence [OS, swap, other processes in memory...]
and has nothing to do with the algo we are testing.
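
For reference, the two percentages can be reproduced from the quoted readings. This is a small Python sketch, not part of the test program, and it takes the smaller reading as the baseline:

```python
def pct_diff(a, b):
    """Difference between two readings, as a percentage of the smaller one."""
    lo, hi = sorted((a, b))
    return (hi - lo) / lo * 100

# Readings quoted above (cycle counts as printed by the test program).
print(f"NumFormatX:    {pct_diff(3.005, 3.197):.1f} %")
print(f"NumFormatF-II: {pct_diff(1.185, 1.015):.1f} %")
```

This gives about 6.4 % for NumFormatX and about 16.7 % for NumFormatF-II.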

MichaelW · December 29, 2012, 06:54:29 PM

Maybe it will work as expected if you try 10 billion

Now that I look at your code I see that you are not setting an affinity mask to limit the process/thread to a single core. You cannot depend on the time-stamp counters being in sync between the cores, and the system can (allegedly) move the process/thread to a different core than it started on. If that does not correct the problem then try increasing the priority. Running on a P4 processor with HT, I have had no problems at the highest possible priority (combining REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL), even with test code that triggered an exception. Note that the macros you are using cannot control the thread priority, but there is nothing to stop you from doing this separately.

frktons · December 30, 2012, 12:20:09 AM

Quote from: MichaelW on December 29, 2012, 06:54:29 PM
Maybe it will work as expected if you try 10 billion

Now that I look at your code I see that you are not setting an affinity mask to limit the process/thread to a single core. You cannot depend on the time-stamp counters being in sync between the cores, and the system can (allegedly) move the process/thread to a different core than it started on. If that does not correct the problem then try increasing the priority. Running on a P4 processor with HT, I have had no problems at the highest possible priority (combining REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL), even with test code that triggered an exception. Note that the macros you are using cannot control the thread priority, but there is nothing to stop you from doing this separately.

Well Michael, I'm here to learn something. So what should I do?
Should I modify this line:

Code

	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS

or use a new macro/call/whatever?
It would be better if you posted the code along with your suggestion, so I can
understand what you mean.

dedndave · December 30, 2012, 12:50:32 AM

before you run any tests...

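Code

        INVOKE  GetCurrentProcess                ; pseudo-handle of this process, returned in EAX
        INVOKE  SetProcessAffinityMask,eax,1     ; mask 1 = bit 0 set = core 0 only
        INVOKE  Sleep,500                        ; give up the time-slice, then wait a little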

that sets the processor affinity to a single core (core 0)
the Sleep gives it a chance to bind, 750 mS works a little better, but who wants to wait that long :P
the idea is to relinquish the current time-slice, then wait a little extra

then, for each test, try to choose a loop count that yields a 500 mS test
even using HIGH_PRIORITY_CLASS, you will see fairly stable results
if you use REALTIME_PRIORITY_CLASS, and the test is a bit too long, it may hang

you may get a little extra stability by diddling with the thread priority, but generally not worth the effort
on machines with more than 64 cores, you might have to play with thread affinity
the documents tell us little about machines that have 33 to 64 cores :lol:
i don't think any forum members have more than 8 cores
for us normal people, thread affinity is simply a subset of process affinity
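
The same idea carries over to test harnesses written in other languages. Below is a minimal sketch in Python, not the code used in this thread. The function name `pin_to_core0` and the `settle_seconds` parameter are made up for illustration. On Windows it calls the same `SetProcessAffinityMask` API through `ctypes`; on Linux it falls back to `os.sched_setaffinity`.

```python
import os
import sys
import time

def pin_to_core0(settle_seconds=0.5):
    """Try to restrict the current process to core 0.

    Returns True if the affinity was changed, False otherwise.
    """
    ok = False
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
        kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
        # Mask 1 = bit 0 set = core 0 only, as in the MASM snippet above.
        ok = bool(kernel32.SetProcessAffinityMask(kernel32.GetCurrentProcess(), 1))
    elif hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, {0})  # Linux counterpart; pid 0 = this process
            ok = True
        except OSError:
            ok = False
    # Give up the time-slice and wait a little, like the Sleep call above.
    time.sleep(settle_seconds)
    return ok

if __name__ == "__main__":
    print("pinned to core 0:", pin_to_core0())
```

The function reports failure by returning False, so a harness can still run its tests unpinned if the affinity cannot be set.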

dedndave · December 30, 2012, 01:05:00 AM

there is some behaviour that i have not figured out
that is - we usually run some code to identify and display the cpu name and features
then we run the tests

if you put the cpu id code after the tests, you will get different results
at least, that's how it is on my machine, which is a little weird because it's running XP MCE

frktons · December 30, 2012, 02:41:01 AM

Using the 3 lines of code that Dave suggests, and running the
test 1 million times:

Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
2.923 cycles for NumFormatX - IDIV / Stack
2.179 cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.463 cycles for NumFormatF-I - IDIV / Stack / Table
1.249 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.119 cycles for NumFormatX - IDIV / Stack
2.197 cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.475 cycles for NumFormatF-I - IDIV / Stack / Table
1.249 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.131 cycles for NumFormatX - IDIV / Stack
2.198 cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.484 cycles for NumFormatF-I - IDIV / Stack / Table
1.256 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.134 cycles for NumFormatX - IDIV / Stack
2.210 cycles for NumFormatX2 - Reciprocal IMUL / Stack

1.475 cycles for NumFormatF-I - IDIV / Stack / Table
1.248 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------

What can we say now?
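
One way to put a number on the change is to compare the run-to-run spread in the two listings. This is a sketch in Python, not part of the test program, using only the readings posted in this thread:

```python
# Readings from the two result listings in this thread, four runs each.
before = {  # 10 million loops, no affinity mask
    "NumFormatX":    [3.005, 3.113, 3.130, 3.197],
    "NumFormatX2":   [2.203, 2.338, 2.121, 2.285],
    "NumFormatF-I":  [1.367, 1.475, 1.496, 1.463],
    "NumFormatF-II": [1.185, 1.126, 1.109, 1.015],
}
after = {  # 1 million loops, process bound to core 0
    "NumFormatX":    [2.923, 3.119, 3.131, 3.134],
    "NumFormatX2":   [2.179, 2.197, 2.198, 2.210],
    "NumFormatF-I":  [1.463, 1.475, 1.484, 1.475],
    "NumFormatF-II": [1.249, 1.249, 1.256, 1.248],
}

def spread_pct(values):
    """Min-to-max range as a percentage of the minimum."""
    return (max(values) - min(values)) / min(values) * 100

for name in before:
    print(f"{name:14s} before {spread_pct(before[name]):5.1f} %   "
          f"after {spread_pct(after[name]):5.1f} %")
```

For three of the four routines the spread drops from roughly 9 to 17 % down to under 2 %. NumFormatX is the exception: its first reading (2.923) sits well below the other three, so its spread stays at about 7 %.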

dedndave · December 30, 2012, 03:11:36 AM

it's not "how many times" you run the test, really
it's how much time you spend running it
you want to run it for a long enough period of time so that OS intervention is negligible
or, at least, consistent - lol

Quote
then, for each test, try to choose a loop count that yields a 500 mS test
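
Choosing such a loop count can be automated by timing a short probe run and scaling it up. The sketch below is in Python rather than MASM. The names `calibrate_loop_count` and `work` are hypothetical, and `work` is only a stand-in for the routine under test.

```python
import time

def calibrate_loop_count(func, target_seconds=0.5, probe_count=1000):
    """Estimate how many calls of func() take about target_seconds."""
    start = time.perf_counter()
    for _ in range(probe_count):
        func()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:            # timer too coarse to measure the probe
        return probe_count
    per_call = elapsed / probe_count
    return max(1, int(target_seconds / per_call))

def work():
    # Stand-in for the routine under test (hypothetical).
    return format(1234567890, ",")

loops = calibrate_loop_count(work, target_seconds=0.5)
print(f"about {loops} calls for a 500 ms test")
```

The probe run is itself subject to the noise discussed above, so the result is only a starting point for the real test.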
