There are a lot of algos to convert integer binary to ASCII numbers.
Some of them also add thousand separator. The best algo for this
kind of job was posted some time ago by clive. I don't know
if something new and better has come out in the last two years, but
for what I have in my mind, that best algo can be replaced with a faster
one that is about 2:1 faster.
While I'm going to do some tests to see how my ideas can become a
real algo, I'll post clive's algo to see if someone has already done
something faster, testing it vs all the solutions you have realized for
your own pleasure or necessity, and are eager to compare with other's.
This is another drop in the ocean called "File Compressor" that is and
will be my main project for the months to come. :biggrin:
Frank
Here we start with the algos to deal with.
Both were written by clive about 2 years ago and,
at that time, were the fastest around.
Enjoy the challenge. 8)
Frank
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
3.233 cycles for NumFormat - IDIV / Stack written by clive
1.934 cycles for NumFormatX - Rec.IMUL / Stack written by clive
------------------------------------------------------------------------
2.947 cycles for NumFormat - IDIV / Stack written by clive
1.962 cycles for NumFormatX - Rec.IMUL / Stack written by clive
------------------------------------------------------------------------
2.939 cycles for NumFormat - IDIV / Stack written by clive
2.869 cycles for NumFormatX - Rec.IMUL / Stack written by clive
------------------------------------------------------------------------
4.356 cycles for NumFormat - IDIV / Stack written by clive
2.899 cycles for NumFormatX - Rec.IMUL / Stack written by clive
------------------------------------------------------------------------
According to my vision, these algos are simple and efficient
in their design. And this fact is confirmed by the speed test
we performed some time ago.
After two years of no coding at all, in my mind have arised
new ideas about optimization, that mostly depends on a
paradigm shift and a change in the algos design.
I see now clearly some weakness in the overall design that
I didn't see some time ago.
The most important weakness I can see is the single byte
processing. To obtain a single formatted byte the algo has to:
divide or multiply / move / add / push / pop / move.
I have the idea that we can have a much faster algo if we
use a parallel approach [at least 4 bytes in the same cycle]
bypassing unnecessary steps altogether.
So the first step is to put together some processing, making
the single byte processing into a multibytes processing.
In next post I'll show you the code doing this kind of transformation.
Stay tuned, and post your code as well if you like.
Frank :biggrin:
feeling kinda lonely in this thread, Frank ? :(
i will come and keep you company for a minute
you may not get a lot of participation on this one
that's because you are optimizing a function that will be used only a few times per session
very rare are the cases where you might want to format a million string values :P
Quote from: dedndave on December 27, 2012, 05:43:22 AM
feeling kinda lonely in this thread, Frank ? :(
i will come and keep you company for a minute
you may not get a lot of participation on this one
that's because you are optimizing a function that will be used only a few times per session
very rare are the cases where you might want to format a million string values :P
I need this function for formatting about 100-120 numbers per page
when I'll examine the recurrences of chars or tokens in the program
to compress a text file.
It doesn't matter, for the time being, how many partecipants will be
into this drop of optimization. I'm just sharing my thoughts and code.
Thanks anyway for your welcome company. :t
Of course everybody is welcome. :biggrin:
And talking about optimization, here we are with the first step
in the speed up process. I'm afraid
the first step already crashedthe
previous fastest algo :redface:
But the new algo design is so powerful it just will shine in the
next steps. :lol:
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
4.489 cycles for NumFormat - IDIV / Stack written by clive
2.082 cycles for NumFormatX - Rec.IMUL / Stack written by clive
1.301 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
2.995 cycles for NumFormat - IDIV / Stack written by clive
2.075 cycles for NumFormatX - Rec.IMUL / Stack written by clive
1.950 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
4.488 cycles for NumFormat - IDIV / Stack written by clive
3.093 cycles for NumFormatX - Rec.IMUL / Stack written by clive
1.961 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
4.505 cycles for NumFormat - IDIV / Stack written by clive
3.117 cycles for NumFormatX - Rec.IMUL / Stack written by clive
1.951 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
prescott w/htt
Quote----------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
----------------------------------------------------------------------------
8,627 cycles for NumFormat - IDIV / Stack written by clive
3,551 cycles for NumFormatX - Rec.IMUL / Stack written by clive
3,606 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
8,715 cycles for NumFormat - IDIV / Stack written by clive
3,578 cycles for NumFormatX - Rec.IMUL / Stack written by clive
3,587 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
8,654 cycles for NumFormat - IDIV / Stack written by clive
3,535 cycles for NumFormatX - Rec.IMUL / Stack written by clive
3,622 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
8,631 cycles for NumFormat - IDIV / Stack written by clive
3,619 cycles for NumFormatX - Rec.IMUL / Stack written by clive
3,663 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
Dave, for your old PC :redface: I think you'll have to wait until
the II step to see "The Crash" :lol:
So far the new version is 2.5:1 faster compared to its old equivalent
anyway, so my suspects are confirmed. Designing a parallel data
processing pays. :P
Ciao Frank,
It looks very promising :t
Buon Natale,
Jochen
Quote from: jj2007 on December 27, 2012, 10:40:39 AM
Ciao Frank,
It looks very promising :t
Buon Natale,
Jochen
Buone feste anche a te, Jochen.
Well tomorrow I'm going to post the II step algo, that should
definitively outdo the old formatting algo performances.
Hi frktons,
here are my results:
----------------------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
----------------------------------------------------------------------------
2.323 cycles for NumFormat - IDIV / Stack written by clive
1.472 cycles for NumFormatX - Rec.IMUL / Stack written by clive
1.210 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
2.158 cycles for NumFormat - IDIV / Stack written by clive
1.795 cycles for NumFormatX - Rec.IMUL / Stack written by clive
1.143 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
1.749 cycles for NumFormat - IDIV / Stack written by clive
1.264 cycles for NumFormatX - Rec.IMUL / Stack written by clive
1.204 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
1.937 cycles for NumFormat - IDIV / Stack written by clive
1.476 cycles for NumFormatX - Rec.IMUL / Stack written by clive
1.463 cycles for NumFormatF-I - IDIV / Stack / Table written by frktons
----------------------------------------------------------------------------
--- ok ---
Gunther
Quote from: Gunther on December 27, 2012, 11:12:47 PM
Hi frktons,
here are my results:
Gunther
Thanks Gunther, your results confirm mine, on new machines
parallel processing is much faster.
Hi frktons,
Quote from: frktons on December 28, 2012, 01:30:51 AM
Thanks Gunther, your results confirm mine, on new machines
parallel processing is much faster.
yes, of course. That isn't a surprise.
Gunther
Quote from: Gunther on December 28, 2012, 01:39:32 AM
Hi frktons,
yes, of course. That isn't a surprise.
Gunther
Not a surprise, but it is good to see it by myself. :P
The III step version is coming in a few days.
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
2.957 cycles for NumFormatX - IDIV / Stack
2.210 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.988 cycles for NumFormatF-I - IDIV / Stack / Table
1.581 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
4.430 cycles for NumFormatX - IDIV / Stack
3.017 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.988 cycles for NumFormatF-I - IDIV / Stack / Table
1.563 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
4.545 cycles for NumFormatX - IDIV / Stack
3.034 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.988 cycles for NumFormatF-I - IDIV / Stack / Table
1.562 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
4.461 cycles for NumFormatX - IDIV / Stack
3.016 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.988 cycles for NumFormatF-I - IDIV / Stack / Table
1.562 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
Press a key to exit ...
if you want more repeatable results, let the test run a little longer :P
QuoteIntel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
----------------------------------------------------------------------------
8,338 cycles for NumFormatX - IDIV / Stack
3,616 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,319 cycles for NumFormatF-I - IDIV / Stack / Table
1,296 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
8,598 cycles for NumFormatX - IDIV / Stack
3,499 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,281 cycles for NumFormatF-I - IDIV / Stack / Table
1,297 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
9,171 cycles for NumFormatX - IDIV / Stack
3,497 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,313 cycles for NumFormatF-I - IDIV / Stack / Table
1,297 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
8,566 cycles for NumFormatX - IDIV / Stack
3,479 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,295 cycles for NumFormatF-I - IDIV / Stack / Table
1,296 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
Quote from: dedndave on December 28, 2012, 02:22:57 PM
if you want more repeatable results, let the test run a little longer :P
These results are more real than making the loop 1 million times.
In real life, and in the "File Compressor" 1000 times are more than enough.
:P
Hi frktons,
Quote from: frktons on December 28, 2012, 03:44:10 PM
In real life, and in the "File Compressor" 1000 times are more than enough.
:P
that's not quiet right. A few years ago, I've written a fractal image compressor. The bottle neck were 5 simple procedures (calculating arithmetic mean, dot product etc.) which wasted 90% of computing time. For a small image with, lets say, 256x256 pixels those functions were called over 30 million times. That's the reality.
Gunther
what i was refering to was increasing the loop count for longer tests
these numbers won't jump all over the place - makes it easier to get a comparison
Quote8,568 cycles for NumFormatX - IDIV / Stack
8,523 cycles for NumFormatX - IDIV / Stack
8,678 cycles for NumFormatX - IDIV / Stack
8,526 cycles for NumFormatX - IDIV / Stack
8,576 cycles for NumFormatX - IDIV / Stack
9,091 cycles for NumFormatX - IDIV / Stack
8,394 cycles for NumFormatX - IDIV / Stack
8,301 cycles for NumFormatX - IDIV / Stack
8,709 cycles for NumFormatX - IDIV / Stack
8,390 cycles for NumFormatX - IDIV / Stack
8,605 cycles for NumFormatX - IDIV / Stack
8,648 cycles for NumFormatX - IDIV / Stack
8,643 cycles for NumFormatX - IDIV / Stack
8,529 cycles for NumFormatX - IDIV / Stack
8,199 cycles for NumFormatX - IDIV / Stack
8,527 cycles for NumFormatX - IDIV / Stack
8,406 cycles for NumFormatX - IDIV / Stack
8,622 cycles for NumFormatX - IDIV / Stack
8,533 cycles for NumFormatX - IDIV / Stack
8,516 cycles for NumFormatX - IDIV / Stack
8,832 cycles for NumFormatX - IDIV / Stack
8,570 cycles for NumFormatX - IDIV / Stack
8,402 cycles for NumFormatX - IDIV / Stack
8,663 cycles for NumFormatX - IDIV / Stack
8,141 cycles for NumFormatX - IDIV / Stack
8,369 cycles for NumFormatX - IDIV / Stack
8,360 cycles for NumFormatX - IDIV / Stack
8,292 cycles for NumFormatX - IDIV / Stack
they are jumping around +/- 6 % ::)
Quote from: Gunther on December 28, 2012, 10:31:18 PM
Hi frktons,
Quote from: frktons on December 28, 2012, 03:44:10 PM
In real life, and in the "File Compressor" 1000 times are more than enough.
:P
that's not quiet right. A few years ago, I've written a fractal image compressor. The bottle neck were 5 simple procedures (calculating arithmetic mean, dot product etc.) which wasted 90% of computing time. For a small image with, lets say, 256x256 pixels those functions were called over 30 million times. That's the reality.
Gunther
Hi Gunther,
In your case it's good to think in these terms, in my case, with
a routine that formats binary numbers into strings, it isn't.
The reality I'm speaking of is just the program in which I'm going
to use the routine, not the universal and irreversible truth. :P
Quote from: dedndave on December 28, 2012, 10:52:06 PM
what i was refering to was increasing the loop count for longer tests
these numbers won't jump all over the place - makes it easier to get a comparison,
....
they are jumping around +/- 6 % ::)
According to my experience, the most credible tests are those that
simulate in the best possible way the reality we have to simulate.
Having a loop with million/s calls to the routines is useful only
when you presume the program is going to do that, otherwise there
is no point simulating a non-existing/high improbable case, that takes in the overhead
of a multitask environment, among other influences as well.
+/- 6% is a reasonable amount I can accept. :biggrin:
Frank
i'm good if you are 8)
Quote from: dedndave on December 29, 2012, 07:50:39 AM
i'm good if you are 8)
I'm more than good if you are good. :biggrin:
I only hope that you understand my point of view, even if you disagree,
or your experince tells you otherwise. :t
Quote from: frktons on December 29, 2012, 07:26:22 AM
+/- 6% is a reasonable amount I can accept.
What about when you are trying to optimize your code for speed? In this case with a 12% uncertainty in the run time you would have to run multiple trials to recognize small increases in speed. With a higher loop count you could recognize these increases in one trial.
Quote from: MichaelW on December 29, 2012, 07:58:16 AM
Quote from: frktons on December 29, 2012, 07:26:22 AM
+/- 6% is a reasonable amount I can accept.
What about when you are trying to optimize your code for speed? In this case with a 12% uncertainty in the run time you would have to run multiple trials to recognize small increases in speed. With a higher loop count you could recognize these increases in one trial.
Yes Michael, of course the tests have to be useful for
something. In this case I'm just showing that an old algo can
be surpassed by a new one parallellizing some passages.
I started the thread saying that we can have an algo that
is about 2:1 faster than the previous one. If I run the test 1,000
times or 1 million times, it doesn't change a lot the results,
and if it does, it is highly probable external influences that do
it, not the algo in itself.
But the most important thing is, from my point of view, what are we
looking for in a test. About 2:1 is about 100% faster, that's enough
for what I'm searching for. If a +/- 6% occurr in "about 100%" then
it is reasonable, +/-20/30% would be less reasonable.
Frank
For lovers of longer loop counts :lol:
This test had a loop count of 10 millions:
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
3.005 cycles for NumFormatX - IDIV / Stack
2.203 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.367 cycles for NumFormatF-I - IDIV / Stack / Table
1.185 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.113 cycles for NumFormatX - IDIV / Stack
2.338 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.475 cycles for NumFormatF-I - IDIV / Stack / Table
1.126 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.130 cycles for NumFormatX - IDIV / Stack
2.121 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.496 cycles for NumFormatF-I - IDIV / Stack / Table
1.109 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.197 cycles for NumFormatX - IDIV / Stack
2.285 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.463 cycles for NumFormatF-I - IDIV / Stack / Table
1.015 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
Press a key to exit ...
This is NOT the final test, by the way. The III step is
work in progress...Dave, in your opinion/knowledge, what's the meaning of 6% difference
between:
Quote
3.005 cycles for NumFormatX - IDIV / Stack
and
Quote
3.197 cycles for NumFormatX - IDIV / Stack
I'm not able to interpret it correctly, not to speak of 16% difference:
Quote
1.185 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
and
Quote
1.015 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
Is somebody able to tell me what this means? I just think this is
a sign of external influence [OS, swap, other processes in memory...]
and has nothing to do with the algo we are testing.
Maybe it will work as expected if you try 10 billion :biggrin:
Now that I look at your code I see that you are not setting an affinity mask to limit the process/thread to a single core. You cannot depend on the time-stamp counters being in sync between the cores, and the system can (allegedly) move the process/thread to a different core than it started on. If that does not correct the problem then try increasing the priority. Running on a P4 processor with HT, I have had no problems at the highest possible priority (combining REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL), even with test code that triggered an exception. Note that the macros you are using cannot control the thread priority, but there is nothing to stop you from doing this separately.
Quote from: MichaelW on December 29, 2012, 06:54:29 PM
Maybe it will work as expected if you try 10 billion :biggrin:
Now that I look at your code I see that you are not setting an affinity mask to limit the process/thread to a single core. You cannot depend on the time-stamp counters being in sync between the cores, and the system can (allegedly) move the process/thread to a different core than it started on. If that does not correct the problem then try increasing the priority. Running on a P4 processor with HT, I have had no problems at the highest possible priority (combining REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL), even with test code that triggered an exception. Note that the macros you are using cannot control the thread priority, but there is nothing to stop you from doing this separately.
Well Michael, I'm here to learn something. So what should I do?
Should I modify this line:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
or use a new macro/call/whatever?
Better if you post the code other than your suggestion, so I can
understand what you mean. :biggrin:
before you run any tests...
INVOKE GetCurrentProcess
INVOKE SetProcessAffinityMask,eax,1
INVOKE Sleep,500
that sets the processor affinity to a single core (core 0)
the Sleep gives it a chance to bind, 750 mS works a little better, but who wants to wait that long :P
the idea is to relinquish the current time-slice, then wait a little extra
then, for each test, try to choose a loop count that yields a 500 mS test
even using HIGH_PRIORITY_CLASS, you will see fairly stable results
if you use REALTIME_PRIORITY_CLASS, and the test is a bit too long, it may hang
you may get a little extra stability by diddling with the thread priority, but generally not worth the effort
on machines with more than 64 cores, you might have to play with thread affinity
the documents tell us little about machines that have 33 to 64 cores :lol:
i don't think any forum members have more than 8 cores
for us normal people, thread affinity is simply a subset of process affinity
there is some behaviour that i have not figured out
that is - we usually run some code to identify and display the cpu name and features
then we run the tests
if you put the cpu id code after the tests, you will get different results
at least, that's how it is on my machine, which is a little weird because it's running XP MCE
Using the 3 lines of code that Dave suggests, and running the
test 1 million times:
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
2.923 cycles for NumFormatX - IDIV / Stack
2.179 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.463 cycles for NumFormatF-I - IDIV / Stack / Table
1.249 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.119 cycles for NumFormatX - IDIV / Stack
2.197 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.475 cycles for NumFormatF-I - IDIV / Stack / Table
1.249 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.131 cycles for NumFormatX - IDIV / Stack
2.198 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.484 cycles for NumFormatF-I - IDIV / Stack / Table
1.256 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
3.134 cycles for NumFormatX - IDIV / Stack
2.210 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.475 cycles for NumFormatF-I - IDIV / Stack / Table
1.248 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
----------------------------------------------------------------------------
What can we say now?
it's not "how many times" you run the test, really
it's how much time you spend running it
you want to run it for a long enough period of time so that OS intervention is negligible
or, at least, consistent - lol
Quotethen, for each test, try to choose a loop count that yields a 500 mS test
Quote from: dedndave on December 30, 2012, 03:11:36 AM
it's not "how many times" you run the test, really
it's how much time you spend running it
you want to run it for a long enough period of time so that OS intervention is negligible
or, at least, consistent - lol
Quotethen, for each test, try to choose a loop count that yields a 500 mS test
The program I'm going to use this function in will never have so long
time to run for this specific function. So I really don't understand what's
the point of doing so long test. The results I get are not the speed the
function is going to have while I use it.
By the way I've almost finished:
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
3.268 cycles for NumFormatX - IDIV / Stack
2.332 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.514 cycles for NumFormatF-I - IDIV / Stack / Table
1.282 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1.199 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
891 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
3.371 cycles for NumFormatX - IDIV / Stack
2.253 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.592 cycles for NumFormatF-I - IDIV / Stack / Table
1.237 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1.296 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
973 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
3.220 cycles for NumFormatX - IDIV / Stack
2.306 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.477 cycles for NumFormatF-I - IDIV / Stack / Table
1.267 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1.215 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
908 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
In a few days I'm going to post the final results and program. :biggrin:
the purpose is merely to get stable, repeatable results
it isn't for my benefit - it is for yours
it will make it easier to interpret the data, is all
Quote from: dedndave on December 30, 2012, 08:07:30 AM
the purpose is merely to get stable, repeatable results
it isn't for my benefit - it is for yours
it will make it easier to interpret the data, is all
If the principle of getting stable and repeatable results
is referred to general algos, I do understand all the things
that you, Michael, or others have said.
My doubts are only referred to the application of these
principles to a case in which I forecast the routine will
be called 100-200 times for each cycle.
Time to move towards other routines and the main
project: Text File Scanner & Compressor.
For the time being I've already fulfilled the goal of a
2:1 faster algo. It is in reality about 3:1 faster.
My routines right align the numbers, as you can see with
one of the two programs attached, clive's ones align left.
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
3.091 cycles for NumFormatX - IDIV / Stack
2.463 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.425 cycles for NumFormatF-I - IDIV / Stack / Table
1.225 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1.193 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
912 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
3.252 cycles for NumFormatX - IDIV / Stack
2.447 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.408 cycles for NumFormatF-I - IDIV / Stack / Table
1.261 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1.177 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
929 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
3.240 cycles for NumFormatX - IDIV / Stack
2.434 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1.403 cycles for NumFormatF-I - IDIV / Stack / Table
1.230 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1.179 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
902 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
If somebody posts his results I'd be nice.
;)
QuoteMicrosoft Windows XP [Version 5.1.2600]
(C) Copyri0ht 1985-2001 Micros0ft Corp. 0
0 9 9 9
9Documen10 and Settings\Dave10esktop => pf 10
10 99 99 99
99 100 100 100
100 999 999 999
999 1,000 1,000 1.000 1.
000 9,999 9,999 9.999 9.
999 10,000 10,000 10.000 10.
000 99,999 99,999 99.999 99.
999 100,000 100,000 100.000 100.
000 999,999 999,999 999.999 999.
999 1,000,000 1,000,000 1.000.000 1.000.
000 9,999,999 9,999,999 9.999.999 9.999.
999 10,000,000 10,000,000 10.000.000 10.000.
000 99,999,999 99,999,999 99.999.999 99.999.
999 100,000,000 100,000,000 100.000.000 100.000.
000 999,999,999 999,999,999 999.999.999 999.999.
999 1,000,000,000 1,000,000,000 1.000.000.000 1.000.000.
000 4,294,967,295 4,294,967,295 4.294.967.295 4.294.967.295
prescott w/htt
Quote----------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
----------------------------------------------------------------------------
8,328 cycles for NumFormatX - IDIV / Stack
3,178 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,364 cycles for NumFormatF-I - IDIV / Stack / Table
2,581 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,283 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
1,124 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
8,322 cycles for NumFormatX - IDIV / Stack
3,201 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,365 cycles for NumFormatF-I - IDIV / Stack / Table
2,564 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,280 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
1,125 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
8,266 cycles for NumFormatX - IDIV / Stack
3,172 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,382 cycles for NumFormatF-I - IDIV / Stack / Table
2,545 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,285 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
1,109 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
Thanks Dave for your collaboration. As you can see your
student is trying to learn something, my master. :lol:
Oh, sorry. I forgot the usual $50 for you. :t
I decided not to do, for the time being, the SSE version,
maybe in the future. Now there are a thousand more routines to prepare :lol:
If somebody with an english setting of Windows can run
this test and post the results, I'll be glad.
I have to verify that the routines use the appropriate thousand separator
for english and non-english setting.
These routines have been checked and corrected in
order to adapt to the locale setting of the system.
Thanks
0 0 0 0
9 9 9 9
10 10 10 10
99 99 99 99
100 100 100 100
999 999 999 999
1,000 1,000 1,000 1,000
9,999 9,999 9,999 9,999
10,000 10,000 10,000 10,000
99,999 99,999 99,999 99,999
100,000 100,000 100,000 100,000
999,999 999,999 999,999 999,999
1,000,000 1,000,000 1,000,000 1,000,000
9,999,999 9,999,999 9,999,999 9,999,999
10,000,000 10,000,000 10,000,000 10,000,000
99,999,999 99,999,999 99,999,999 99,999,999
100,000,000 100,000,000 100,000,000 100,000,000
999,999,999 999,999,999 999,999,999 999,999,999
1,000,000,000 1,000,000,000 1,000,000,000 1,000,000,000
4,294,967,295 4,294,967,295 4,294,967,295 4,294,967,295
Thanks Sinsi. This is a good test :t
I forgot the last touch to get an algo 3:1 faster than previous ones:
Quote
----------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------------------------------
3,243 cycles for NumFormatX - IDIV / Stack
2,432 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1,463 cycles for NumFormatF-I - IDIV / Stack / Table
1,278 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,173 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
763 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
3,293 cycles for NumFormatX - IDIV / Stack
2,419 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1,453 cycles for NumFormatF-I - IDIV / Stack / Table
1,250 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,170 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
776 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
3,259 cycles for NumFormatX - IDIV / Stack
2,411 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1,453 cycles for NumFormatF-I - IDIV / Stack / Table
1,238 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,174 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
768 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
Now it's OK. :t
:t
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz
Instructions: MMX, SSE1, SSE2, SSE3
--------------------------------------------------------------------
2,896 cycles for NumFormatX - IDIV / Stack
2,279 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1,327 cycles for NumFormatF-I - IDIV / Stack / Table
1,231 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,055 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
777 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
--------------------------------------------------------------------
2,895 cycles for NumFormatX - IDIV / Stack
2,279 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1,328 cycles for NumFormatF-I - IDIV / Stack / Table
1,241 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,076 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
777 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
--------------------------------------------------------------------
2,889 cycles for NumFormatX - IDIV / Stack
2,286 cycles for NumFormatX2 - Reciprocal IMUL / Stack
1,341 cycles for NumFormatF-I - IDIV / Stack / Table
1,271 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,056 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
777 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
prescott w/htt
Quote----------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz
Instructions: MMX, SSE1, SSE2, SSE3
----------------------------------------------------------------------------
8,306 cycles for NumFormatX - IDIV / Stack
3,730 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,355 cycles for NumFormatF-I - IDIV / Stack / Table
2,556 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,293 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
968 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
8,285 cycles for NumFormatX - IDIV / Stack
3,744 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,368 cycles for NumFormatF-I - IDIV / Stack / Table
2,569 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,293 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
995 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
8,381 cycles for NumFormatX - IDIV / Stack
3,719 cycles for NumFormatX2 - Reciprocal IMUL / Stack
3,364 cycles for NumFormatF-I - IDIV / Stack / Table
2,543 cycles for NumFormatF-IA - IDIV / STRUCT / Table
1,286 cycles for NumFormatF-II - Reciprocal IMUL / Stack / Table
972 cycles for NumFormatF-III - Reciprocal IMUL / STRUCT / Table
----------------------------------------------------------------------------
Dave,
on your PC my last algo is 4:1 faster than previous ones.
Very good my Master, you teached it well. :t
You'll find your $50 at the usual link. :lol:
what do you think you are, some kind of jedi knight or something ?
waving your link around in front of me
i'm an assembly programmer
mind tricks don't work on me
only money !
(http://img593.imageshack.us/img593/5135/tworkonmeonlymoney.jpg)
Quote from: dedndave on January 07, 2013, 10:52:05 AM
what do you think you are, some kind of jedi knight or something ?
waving your link around in front of me
i'm an assembly programmer
mind tricks don't work on me
only money !
(http://img593.imageshack.us/img593/5135/tworkonmeonlymoney.jpg)
Nowadays we only use electronic money, so here it is, prepayment
for next algo included 8)
(http://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Usdollar100front.jpg/640px-Usdollar100front.jpg)
ole' Ben looks pissed :lol:
Quote from: dedndave on January 07, 2013, 12:00:15 PM
ole' Ben looks pissed :lol:
On his planet they don't use US dollars anymore. :lol:
So is it ready to be posted as a challenge on Pelle's C Forum? What is the corresponding CRT algo?
:P
Quote from: jj2007 on January 07, 2013, 05:44:09 PM
So is it ready to be posted as a challenge on Pelle's C Forum? What is the corresponding CRT algo?
:P
Hi Jochen.
I don't know about C matters, or challenge on Pelle's C Forum.
Do you mean they could try to do better in C or there is already
a formatting routine in C? I haven't seen any of them so far. :icon_rolleyes: