Inspired by ahsat's Binary to displayable text, any base (https://masm32.com/board/index.php?topic=11817.0) thread, I searched the Lab for our fastest dword-to-string. To my surprise, there is almost nothing; please correct me if I am wrong.
So I fumbled together a testbed. Can I have some timings, please? Even better: some really fast algos?
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
4471 cycles for 100 * dwtoa
31143 cycles for 100 * dw2a
3531 cycles for 100 * dw2str
4428 cycles for 100 * dwtoa
31325 cycles for 100 * dw2a
3456 cycles for 100 * dw2str
4463 cycles for 100 * dwtoa
31232 cycles for 100 * dw2a
3440 cycles for 100 * dw2str
4400 cycles for 100 * dwtoa
31364 cycles for 100 * dw2a
3458 cycles for 100 * dw2str
4440 cycles for 100 * dwtoa
31783 cycles for 100 * dw2a
3472 cycles for 100 * dw2str
Averages:
4444 cycles for dwtoa
31307 cycles for dw2a
3462 cycles for dw2str
20 bytes for dwtoa
20 bytes for dw2a
74 bytes for dw2str
dwtoa 123456789
dw2a 123456789
dw2str 123456789
dw2a is CRT printf
dw2str is an adaption of the Masm32 lib dwtoa algo
Version 2, with Ray's algo:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
4485 cycles for 100 * dwtoa
3452 cycles for 100 * dw2str
16911 cycles for 100 * MasmBasic Str$()
12479 cycles for 100 * Ray's algo
4412 cycles for 100 * dwtoa
3480 cycles for 100 * dw2str
17000 cycles for 100 * MasmBasic Str$()
12466 cycles for 100 * Ray's algo
4446 cycles for 100 * dwtoa
3470 cycles for 100 * dw2str
16844 cycles for 100 * MasmBasic Str$()
12497 cycles for 100 * Ray's algo
4400 cycles for 100 * dwtoa
3444 cycles for 100 * dw2str
16814 cycles for 100 * MasmBasic Str$()
12511 cycles for 100 * Ray's algo
Averages:
4429 cycles for dwtoa
3461 cycles for dw2str
16878 cycles for MasmBasic Str$()
12488 cycles for Ray's algo
20 bytes for dwtoa
74 bytes for dw2str
16 bytes for MasmBasic Str$()
138 bytes for Ray's algo
dwtoa 123456789
dw2a 123456789
dw2str 123456789
MasmBasic Str$() 123456789
Ray's algo 123456789
Here is Ray's algo, with minor modifications:
;----------------------------------------------------------------------------
; Original by Ray Gwinn
; Any number to any base with a specified digit length, will pad leading zeros.
; Parameters:
; eax The number to convert
; ebx The desired base
; ecx The desired digit count
; edi Where to place the converted string
;----------------------------------------------------------------------------
ctAnyToAny:
push edx ; save edx, or a digit
cdq ; same as xor edx, edx: zero edx
div ebx ; generate next digit in edx
dec ecx ; decrement digit count
.if !Zero? ; br if done
call ctAnyToAny ; generate next digit
.endif
mov al, @digits[edx] ; get the ascii value
stosb ; save it
pop edx ; restore edx, or get next digit
ret
13th Gen Intel(R) Core(TM) i9-13900KF (SSE4)
2155 cycles for 100 * dwtoa
1928 cycles for 100 * dw2str
11807 cycles for 100 * MasmBasic Str$()
2912 cycles for 100 * Ray's algo
2137 cycles for 100 * dwtoa
1897 cycles for 100 * dw2str
11673 cycles for 100 * MasmBasic Str$()
2922 cycles for 100 * Ray's algo
2133 cycles for 100 * dwtoa
1913 cycles for 100 * dw2str
11662 cycles for 100 * MasmBasic Str$()
2934 cycles for 100 * Ray's algo
2133 cycles for 100 * dwtoa
1897 cycles for 100 * dw2str
11687 cycles for 100 * MasmBasic Str$()
2924 cycles for 100 * Ray's algo
Averages:
2135 cycles for dwtoa
1905 cycles for dw2str
11680 cycles for MasmBasic Str$()
2923 cycles for Ray's algo
20 bytes for dwtoa
74 bytes for dw2str
16 bytes for MasmBasic Str$()
138 bytes for Ray's algo
dwtoa 123456789
dw2a 123456789
dw2str 123456789
MasmBasic Str$() 123456789
Ray's algo 123456789
AMD Athlon(tm) II X2 220 Processor (SSE3)
7967 cycles for 100 * dwtoa
7799 cycles for 100 * dw2str
27640 cycles for 100 * MasmBasic Str$()
31049 cycles for 100 * Ray's algo
7999 cycles for 100 * dwtoa
7750 cycles for 100 * dw2str
27413 cycles for 100 * MasmBasic Str$()
31224 cycles for 100 * Ray's algo
7968 cycles for 100 * dwtoa
7747 cycles for 100 * dw2str
27365 cycles for 100 * MasmBasic Str$()
31046 cycles for 100 * Ray's algo
7965 cycles for 100 * dwtoa
7747 cycles for 100 * dw2str
27587 cycles for 100 * MasmBasic Str$()
31092 cycles for 100 * Ray's algo
Averages:
7968 cycles for dwtoa
7748 cycles for dw2str
27500 cycles for MasmBasic Str$()
31070 cycles for Ray's algo
20 bytes for dwtoa
74 bytes for dw2str
16 bytes for MasmBasic Str$()
138 bytes for Ray's algo
dwtoa 123456789
dw2a 123456789
dw2str 123456789
MasmBasic Str$() 123456789
Ray's algo 123456789
--- ok ---
Any base that is a power of two, such as, 2=bianry, 8=octal, 16=hex, will convert faster because they can be done with shifts, avoiding timely multiply and divide instructions. They will always be faster than the recursive code.
Any base that is not a power of two, like our everyday base 10=decimal and base 60=Sexagesimal :-), will convert faster using the recursive code.
Don't forget that in the everyday world, base 10 is the most used.
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (SSE4)
3709 cycles for 100 * dwtoa
3567 cycles for 100 * dw2str
17710 cycles for 100 * MasmBasic Str$()
11878 cycles for 100 * Ray's algo
4336 cycles for 100 * dwtoa
3609 cycles for 100 * dw2str
15036 cycles for 100 * MasmBasic Str$()
9591 cycles for 100 * Ray's algo
3862 cycles for 100 * dwtoa
3340 cycles for 100 * dw2str
13489 cycles for 100 * MasmBasic Str$()
10160 cycles for 100 * Ray's algo
4346 cycles for 100 * dwtoa
3768 cycles for 100 * dw2str
16407 cycles for 100 * MasmBasic Str$()
10623 cycles for 100 * Ray's algo
Quote from: sinsi on April 01, 2024, 06:22:32 AM13th Gen Intel(R) Core(TM) i9-13900KF (SSE4)
Intel must have really improved the speed of a divide on this processor.
Version 3, dword to decimal string, this time with a negative number:
Averages:
4670 cycles for dwtoa
2923 cycles for dw2str
17191 cycles for MasmBasic Str$()
14168 cycles for Ray's algo
20 bytes for dwtoa
74 bytes for dw2str
16 bytes for MasmBasic Str$()
110 bytes for Ray's algo
dwtoa -123456789
dw2a 4171510507
dw2str -123456789
MasmBasic Str$() -123456789
Ray's algo 171510507
For positive numbers, all algos work fine. My dw2str got a little speed boost, strangely enough also thanks to a jecxz :cool:
OK, it's spring, and apparently that's time for a favorite indoor sport around here: How Fast is My Algo?
I always have to laugh at all this attention paid here to speed when it comes to numeric-conversion routines, which seem to be the favorite category in this sport. Why? Because it's all so ridiculous; really, except in a few cases, who cares how fast they are? Usually these routines will be used to either accept or display user input/output, in which case the time it takes to do the conversion is grossly swamped by the input/output time.
But "Oh, look, my a2bin function is a whole 25 microseconds faster than yours!".
But I suppose it's a harmless sport at that. And so far as our OP here goes (here's looking at you, @ahsat), it's all part of your assembly learning experience, so I won't begrudge you that. Just don't fall for the hype here.
Oh dear... can you explain, in a few words, why you are hanging around here? Apparently, it's not for assembly programming.
I just like pissing you off, JJ.
Besides, as you well know, I've posted plenty of code here and continue to do so.
Quote from: NoCforMe on April 01, 2024, 09:37:41 AMOK, it's spring, and apparently that's time for a favorite indoor sport around here: How Fast is My Algo?
I always have to laugh at all this attention paid here to speed when it comes to numeric-conversion routines, which seem to be the favorite category in this sport. Why? Because it's all so ridiculous; really, except in a few cases, who cares how fast they are? Usually these routines will be used to either accept or display user input/output, in which case the time it takes to do the conversion is grossly swamped by the input/output time.
But I suppose it's a harmless sport at that. And so far as our OP here goes (here's looking at you, @ahsat), it's all part of your assembly learning experience, so I won't begrudge you that. Just don't fall for the hype here.
So when Excel wants to show 10000 integers in a spreadsheet an algo 4 times slower is enough?
More results
13th Gen Intel(R) Core(TM) i9-13900KF (SSE4)
3219 cycles for 100 * dwtoa
2059 cycles for 100 * dw2str
14581 cycles for 100 * MasmBasic Str$()
3617 cycles for 100 * Ray's algo
2692 cycles for 100 * dwtoa
2056 cycles for 100 * dw2str
14797 cycles for 100 * MasmBasic Str$()
3627 cycles for 100 * Ray's algo
3120 cycles for 100 * dwtoa
2063 cycles for 100 * dw2str
11700 cycles for 100 * MasmBasic Str$()
2900 cycles for 100 * Ray's algo
2201 cycles for 100 * dwtoa
1669 cycles for 100 * dw2str
11825 cycles for 100 * MasmBasic Str$()
2916 cycles for 100 * Ray's algo
Averages:
2906 cycles for dwtoa
2058 cycles for dw2str
13203 cycles for MasmBasic Str$()
3266 cycles for Ray's algo
20 bytes for dwtoa
74 bytes for dw2str
16 bytes for MasmBasic Str$()
110 bytes for Ray's algo
dwtoa -123456789
dw2a 4171510507
dw2str -123456789
MasmBasic Str$() -123456789
Ray's algo 171510507
--- ok ---
Quote from: NoCforMe on April 01, 2024, 09:57:18 AMI just like pissing you off, JJ.
Besides, as you well know, I've posted plenty of code here and continue to do so.
No problem. From now on, I will intervene under each and every of your coding attempts and leave some crispy remarks. I'm quite good at that game, too :thup:
Btw what happened to your funny "cubby holes" editor (https://masm32.com/board/index.php?topic=10319.0)? Any chance to see a release version this year?
Quote from: sinsi on April 01, 2024, 10:00:49 AMQuote from: NoCforMe on April 01, 2024, 09:37:41 AMOK, it's spring, and apparently that's time for a favorite indoor sport around here: How Fast is My Algo?
I always have to laugh at all this attention paid here to speed when it comes to numeric-conversion routines, which seem to be the favorite category in this sport. Why? Because it's all so ridiculous; really, except in a few cases, who cares how fast they are? Usually these routines will be used to either accept or display user input/output, in which case the time it takes to do the conversion is grossly swamped by the input/output time.
So when Excel wants to show 10000 integers in a spreadsheet an algo 4 times slower is enough?
As I wrote:
except in a few cases, who cares how fast they are?. That's one of those few cases.
Quote from: jj2007 on April 01, 2024, 10:02:21 AMBtw what happened to your funny "cubby holes" editor (https://masm32.com/board/index.php?topic=10319.0)? Any chance to see a release version this year?
Yes, good question and thanks for asking. Yes, there'll be a new "release" coming along. And I think that's a perfectly fine (and humorous) way to describe it. (I do wish I had a bit more feedback on it, though.)
And of course I wish you the best of luck with your editor efforts (which I have admitted have some very nice functionality that mine lacks).
@Sinsi: thanks :thumbsup:
Quote from: ahsat on April 01, 2024, 08:00:16 AMIntel must have really improved the speed of a divide on this processor.
The
div ebx is indeed the bottleneck in your algo. I've tried to replace it with a
mul 0CCCCCCCDh but with little success, speedwise. It's a pity because your algo is really very elegant and short :thumbsup:
ctAnyToAny:
push edx ; save edx, or a digit
xor edx, edx ; same as cdq but maybe a tick faster
div ebx ; generate next digit in edx
; dec ecx ; decrement digit count
test eax, eax ; anything left?
.if !Zero? ; br if done
call ctAnyToAny ; generate next digit
.endif
mov al, @digits[edx] ; get the ascii value
stosb ; save it
pop edx ; restore edx, or get next digit
ret
Btw you could drop the digit counter and instead check if anything is left in eax:
div ebx ; generate next digit in edx
; dec ecx ; decrement digit count
test eax, eax ; anything left?
.if !Zero? ; br if done
call ctAnyToAny ; generate next digit
.endif
Works fine and leaves one register intact. Speedwise there is no difference, it seems.
Quote from: NoCforMe on April 01, 2024, 10:07:05 AMAs I wrote: except in a few cases, who cares how fast they are?. That's one of those few cases.
Or Explorer showing the details of 1000 files.
Or 7-zip showing a 20,000 file archive in flat view.
Anyway, I wouldn't class Excel (or Access) as "a few cases", seeing how well Office sells :biggrin:
For me, a more useful algo would be to have thousand separators.
Rather than continue this (tangential) discussion here, I've started a new topic in the Soap Box (https://masm32.com/board/index.php?topic=11822.0) in this whole algorithm-speed thing.
Quote from: sinsi on April 01, 2024, 11:02:59 AMFor me, a more useful algo would be to have thousand separators.
If you're talking about putting in, say, commas every 3 digits:
3,549,782
then here's my code that does that:
;====================================================================
; CommaNum (inputBuffer, outputBuffer)
;
; Number comma-formatting routine.
; On entry,
; inputBuffer has ASCII number string (zero-terminated)
;
; On exit,
; outputBuffer contains the comma-formatted number string.
;====================================================================
CommaNum PROC inputBuffer:DWORD, outputBuffer:DWORD
LOCAL digitMoveCount:DWORD, quotient:DWORD, remainder:DWORD
PUSH ESI
PUSH EDI
MOV EDI, outputBuffer ;Where to store our finished product.
MOV digitMoveCount, 3 ;Default # of digits to move/"triad".
; Inline strlen() here:
MOV EAX, inputBuffer
MOV ESI, EAX ;Set up for later moves.
XOR EDX, EDX ;Len. counter
sl10: CMP BYTE PTR [EAX + EDX], NULL
JE sl88
INC EDX
JMP sl10
sl88: MOV EAX, EDX ;EAX = # digits.
TEST EAX, EAX ;Test for zero length.
JZ done ;Cap off output buffer and exit.
XOR EDX, EDX
MOV ECX, 3
DIV ECX ;How many "triads"?
MOV quotient, EAX
MOV remainder, EDX
; If quotient = 0, use remainder as digit count for last "triad":
CMP quotient, 0
JNE @F
MOV EAX, remainder
MOV digitMoveCount, EAX
MOV quotient, 1 ;1 time through move loop.
JMP SHORT triads
; First see if there's a short "triad" at beginning:
@@: MOV ECX, remainder
JECXZ triads ;No remainder, so no short "triad".
REP MOVSB ;Yes, so move that many digits.
MOV AL, ','
STOSB ;Put in comma.
; Now we're just moving "triads" and putting in commas after them:
triads: MOV ECX, quotient
nloop: PUSH ECX
MOV ECX, digitMoveCount
REP MOVSB
POP ECX
; Do we put a comma in, or are we on the last "triad"?
CMP ECX, 1
JE nocom
MOV AL, ','
STOSB
nocom: LOOP nloop
; Cap off formatted # buffer:
done: MOV BYTE PTR [EDI], 0
exit99: POP EDI
POP ESI
RET
CommaNum ENDP
Quote from: jj2007 on April 01, 2024, 10:24:11 AMIt's a pity because your algo is really very elegant and short
Thank you. I like elegant and short. I developed this algo circa 1978. We needed it for debugging on embedded systems. Replace the stosb with a print, and it printed one character at a time, usually on a tty.
This algo is also much more powerful, because it will do any base.
Oh, I have code to do that. It does it while converting the number to ascii.
I was just wondering if there was a way of optimising it.
Quote from: jj2007 on April 01, 2024, 10:24:11 AMBtw you could drop the digit counter and instead check if anything is left in eax:
You did not look at the code for Anybase.exe. There are two versions of the algorithm. One test eax, the other does a jecxz.
I now wonder if your speed tests were flawed. Base 10 should have use the algo that uses eax.
Thank you ahsat, :thumbsup:
for positive integers less than 100,000,000 = 5F5E100h I use:
; Int_to_string - Convert a small binary interger into an string
; IN: RAX = binary integer
; RCX = address of location to store string
align 16
db 90h,90h,90h
i2a proc
mov r9, rcx
mov r8d, 10
xor ecx, ecx
xor edx, edx
@@:
div r8 ; divide by the number-base
shl rcx, 8
add rdx, 30h
xchg dl, cl
test eax, eax
jne @b
mov [r9],rcx
ret
i2a endp
xchg dl, cl
That, mister, is the best instruction (for how many things it does) that I've ever seen. :greenclp:
Quote from: ahsat on April 01, 2024, 12:58:32 PMThere are two versions of the algorithm. One test eax, the other does a jecxz.
Apologies. I tested both versions, and there is no difference, speedwise. If you have a 32-bit version of the jecxz algo, please post it, so that I can incorporate it into the testbed. I also noted that your code produces 00000123 for dword=123, i.e. leading zeros.
Quote from: NoCforMe on April 01, 2024, 10:10:12 AMI wish you the best of luck with your editor
I wish you both good luck. I am amazed at how poor text editors are today compared to what they use to be. The best text editor I have ever used was Brief. There is a fair start of al clone of Brief, called Grief (https://github.com/adamyg/grief), on Github. You guys should take a look if you haven't already.
A real text editor is like assembly language, it is not for the beginners.
Quote from: jj2007 on April 01, 2024, 07:34:04 PMI also noted that your code produces 00000123 for dword=123
Not my code, but your implementation of my code. The original 64bit version that I posted only put leading zeros into hex and Sexagesimal, but no leading zeros with base 10. You would not use the jecxz version of the algo, unless you wanted the possibility of leading zeros.
The 32 bit version of Anybase.exe is working now. I have to clean it up but I will post it latter today.
Look closer this time, and you will see two versions of the algo, one with leading zeros and one without leading zeros.
I remember using Brief, briefly back in the day. I thought it was OK.
The editor I really, really miss is MultiEdit, which up to XP would run under Windoze (no more on Win7). It did everything: regular expression search & replace, infinitely extensible through a sophisticated macro language (yes, JJ!), columnar text selection, so much more.
But then for years I plodded along using Notepad as an editor, until I wrote my own.
BTW, if you care to take a look at my editor, here's the latest version (https://masm32.com/board/index.php?msg=127757). If you do peek at it I'd be very interested in your take on it.
Quote from: jj2007 on April 01, 2024, 07:34:04 PMIf you have a 32-bit version of the jecxz algo
I just posted the 32 bit version of Anybase. It has both, and always had, both versions of the algo.
Sorry, I don't know how to directly link you to the post, but it is in the same message thread that I released the 64 bit version, the "Miscellaneous Projects " part of this forum.
EDIT: https://masm32.com/board/index.php?topic=11817.60 (https://masm32.com/board/index.php?topic=11817.60)
On each post, the text at top ("Today at xx:xx:xx AM/PM") is a link you can copy to link to that reply in the thread. Use the link button (3rd button in 5th group) to insert a link into the post.
Quote from: ahsat on April 02, 2024, 05:56:36 AMI don't know how to directly link you to the post
It's here (https://masm32.com/board/index.php?topic=11817.msg128470#msg128470).
Quote from: jj2007 on April 01, 2024, 10:24:11 AMThe div ebx is indeed the bottleneck in your algo. I've tried to replace it with a mul 0CCCCCCCDh but with little success, speedwise. It's a pity because your algo is really very elegant and short :thumbsup:
Using your version 3 and enabling test F I get
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (SSE4)
6026 cycles for 100 * dwtoa
6271 cycles for 100 * dw2str
22536 cycles for 100 * MasmBasic Str$()
15352 cycles for 100 * Ray's algo
5964 cycles for 100 * Ray+mul
6054 cycles for 100 * dwtoa
6152 cycles for 100 * dw2str
22515 cycles for 100 * MasmBasic Str$()
15347 cycles for 100 * Ray's algo
5983 cycles for 100 * Ray+mul
5941 cycles for 100 * dwtoa
6034 cycles for 100 * dw2str
22455 cycles for 100 * MasmBasic Str$()
15310 cycles for 100 * Ray's algo
5944 cycles for 100 * Ray+mul
6060 cycles for 100 * dwtoa
6359 cycles for 100 * dw2str
22421 cycles for 100 * MasmBasic Str$()
15315 cycles for 100 * Ray's algo
5984 cycles for 100 * Ray+mul
Averages:
6040 cycles for dwtoa
6212 cycles for dw2str
22485 cycles for MasmBasic Str$()
15331 cycles for Ray's algo
5974 cycles for Ray+mul
20 bytes for dwtoa
74 bytes for dw2str
16 bytes for MasmBasic Str$()
110 bytes for Ray's algo
74 bytes for Ray+mul
dwtoa -123456789
dw2a 4171510507
dw2str -123456789
MasmBasic Str$() -123456789
Ray's algo 171510507
Ray+mul j
Seems faster with the multiply to me ?
Quote from: jimg on April 02, 2024, 07:43:17 AMI've tried to replace it with a mul 0CCCCCCCDh but with little success, speedwise.
Quote from: jimg on April 02, 2024, 07:43:17 AMSeems faster with the multiply to me ?
If you made it faster, show me the code.
Quote from: jimg on April 02, 2024, 07:43:17 AMUsing your version 3 and enabling test F I get
It's definitely faster, Jim, but look at the result: there was a reason why I disabled it :biggrin:
Some fine tuning needed; dwtoa and dw2str use the mul variant.
Quote from: jj2007 on April 02, 2024, 09:40:50 AMIt's definitely faster, Jim, but look at the result:
Is there any chance I can see the code with the multiply?
Quote from: ahsat on April 02, 2024, 09:46:58 AMIs there any chance I can see the code with the multiply?
It's in the archive posted in your thread (https://masm32.com/board/index.php?msg=128470) as Test_F, but disabled. Here is the code:
_xmul dd 0CCCCCCCDh
ctAnyToAnyMul:
push edx ; save edx, or a digit
push ecx
mov ecx, eax
; int 3
mul _xmul ; something is wrong here
shr edx, 3
mov eax, edx
add edx, edx
lea edx, [edx*4+edx+"0"]
sub edx, ecx
neg edx
pop ecx
dec ecx ; decrement digit count
.if !Zero? ; br if done
call ctAnyToAnyMul ; generate next digit
.endif
mov al, @digits[edx] ; get the ascii value
stosb ; save it
pop edx ; restore edx, or get next digit
ret
This is a multiply by 10, then add 30h:
add edx, edx
lea edx, [edx*4+edx+"0"]
Note it will probably work with minor changes, but I was too tired to fix it...
Instead of the
decrement digit count, I would vote for a test eax, eax :cool:
it normally looks like this. I added comments so it isn't so cryptic-
mov ecx, 0CCCCCCCDh ; 8 * 1/10
mov ebx,eax ; save original
mul ecx ; num * 1/10 magic number
shr edx, 3 ; magic number fixup
lea edx,[edx*4+edx] ; *5
add edx,edx ; *10
sub ebx,edx ; remainder
I couldn't resist. This seems to produce the correct number. Patched together from dwtoa and ctAnyToAnyMul. Could use some refinement, I just wanted to see it work.
ctAnyToAnyMul: ; setup
test eax,eax
jnz sign
mov word ptr [edi],30h
ret
sign:
jns pos
mov byte ptr [edi],'-'
neg eax
add edi, 1
pos:
call ctanydoit
ret
ctanydoit:
push edx ; save edx, or a digit
xor edx,edx
push ecx
mov ecx,0CCCCCCCDh ; 8 * 1/10
mov ebx,eax ; save original
mul ecx ; num * 1/10 magic number
shr edx, 3 ; magic number fixup
mov eax,edx
lea edx,[edx*4+edx] ; *5
add edx,edx ; *10
sub ebx,edx ; remainder
mov edx,ebx
pop ecx
dec ecx ; decrement digit count
.if !Zero? ; br if done
call ctanydoit ; generate next digit
.endif
mov al, @digits[edx] ; get the ascii value
stosb ; save it
pop edx ; restore edx, or get next digit
ret
A little slower than dwtoa, but I didn't spend any time optimizing for speed.
Quote from: jimg on April 02, 2024, 01:15:09 PMI couldn't resist.
That is really cool, if it works. But I would leave it up to the caller to handle the sigh, unless negative numbers break things.
Quote from: jimg on April 02, 2024, 01:15:09 PMA little slower than dwtoa, but I didn't spend any time optimizing for speed.
It can be optimized a little, but I tested it and now see that it only works for base 10. I like my algorithm which will convert to any base.
Of course. This was just for fun. You'd need the magic number for each base you wanted to use.
Quote from: jimg on April 02, 2024, 04:52:38 PMYou'd need the magic number for each base you wanted to use.
Which wouldn't be too difficult, actually.
.CODE
_xmul dd 0CCCCCCCDh
ctAnyToAnyMul:
push ecx ; save edx, or a digit
mov ecx, eax
mul esi ; e.g. 0CCCCCCCDh
shr edx, 3
mov eax, edx
add edx, edx
lea edx, [edx*4+edx-"0"+1]
sub ecx, edx
test eax, eax
.if !Zero? ; br if done
call ctAnyToAnyMul ; generate next digit
.endif
mov al, dl ; @digits[edx] ; get the ascii value
stosb ; save it
pop ecx ; restore edx, or get next digit
ret
...
mov eax, TheNumber
mov edi, offset result2
push esi
mov esi, _xmul[0*4] ; 0...2 index
call ctAnyToAnyMul
pop esi
And voilà, it beats good ol' dwtoa :biggrin:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
3400 cycles for dwtoa
2180 cycles for dw2str
15316 cycles for MasmBasic Str$()
10984 cycles for Ray's algo I
3086 cycles for Ray+mul
20 bytes for dwtoa
82 bytes for dw2str
16 bytes for MasmBasic Str$()
110 bytes for Ray's algo I
64 bytes for Ray+mul
dwtoa 1234567
dw2a 1234567
dw2str 1234567
MasmBasic Str$() 1234567
Ray's algo I 001234567
Ray+mul
QuoteYou'd need the magic number for each base you wanted to use.
n/base = n*(1/base) ; Not all bases have magic numbers,so "div" can't change into "mul" sometimes.
Only 10,19,28,37,46,55 is a magic number in 2-60.
QuoteCheck if a number is magic (Recursive sum of digits is 1)
A number is said to be a magic number, if the sum of its digits are calculated till a single digit recursively by adding the sum of the digits after every addition. If the single digit comes out to be 1,then the number is a magic number.
For example-
Number= 50113
=> 5+0+1+1+3=10
=> 1+0=1
This is a Magic Number
For example-
Number= 1234
=> 1+2+3+4=10
=> 1+0=1
This is a Magic Number
ChkMagicNumber proc uses rbx n:QWORD
Local sum:QWORD
mov rax,n
mov sum,0
mov rcx,10
.while (rax > 0) || (sum > 9)
.if rax == 0
mov rax,sum
mov sum,0
.endif
xor rdx,rdx
div rcx
mov rbx,sum
add rbx,rdx
mov sum,rbx
.endw
mov rax,sum ;rax = 1, n is a magic number.
ret
ChkMagicNumber endp
Quote from: six_L on April 02, 2024, 08:42:46 PMOnly 10,19,28,37,46,55 is a magic number in 2-60.
Oops, that's bad news :sad:
So we'll better stick with Ray's div ebx. It's a very elegant algo indeed.
Quote from: jj2007 on April 02, 2024, 08:58:34 PMSo we'll better stick with Ray's div ebx. It's a very elegant algo indeed.
But now you will have to make another decision. Which is valid, 0x0A or 0x0a.
Recursion can also be used to very simply handle unzipping zips within zips, or any like problem.
Quote from: ahsat on April 03, 2024, 12:22:46 AMRecursion can also be used to very simply handle unzipping zips within zips, or any like problem.
I wonder if you have skill with chess recursive code to climb up and down a tree until find best chess move?
Simpler code is do it with tic tac toe
Magic number generator by qWord
Hi,jj2007
Quotethat's bad news
That's a good news because of avoid to waste your unnecessary times.
regard
Finally got something to beat dwtoa consistently using Rays recursion technique. (Apologies to Ray, we're trying to beat dwtoa in this thread, not trash your fine general purpose algorithm.)
ctAnyToAnyMul: ; setup
test eax,eax
jnz sign
mov word ptr [edi],30h
ret
sign:
jns pos
mov byte ptr [edi],'-'
neg eax
add edi, 1
pos:
push ebx
mov ecx,0CCCCCCCDh ; 8 * 1/10
call ctanydoit
pop ebx
ret
align 16
ctanydoit:
push edx ; save edx, or a digit
mov ebx,eax ; save original
mul ecx ; 0CCCCCCCDh 8 * 1/10 num * 1/10 magic number
shr edx, 3 ; magic number fixup (divide by 8)
mov eax,edx
lea edx,[edx*4+edx] ; *5
add edx,edx ; *10
sub ebx,edx ; remainder
mov edx,ebx
test eax,eax
.if !zero?
call ctanydoit ; generate next digit
.endif
add dl,'0'
mov [edi],dl
inc edi
pop edx ; restore edx, or get next digit
ret
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (SSE4)
5984 cycles for 100 * dwtoa
5921 cycles for 100 * Ray+mul
5979 cycles for 100 * dwtoa
5905 cycles for 100 * Ray+mul
5976 cycles for 100 * dwtoa
5877 cycles for 100 * Ray+mul
5978 cycles for 100 * dwtoa
5882 cycles for 100 * Ray+mul
Averages:
5978 cycles for dwtoa
5894 cycles for Ray+mul
20 bytes for dwtoa
104 bytes for Ray+mul
dwtoa -123456789
Ray+mul -123456789
Okay, those test are probably invalid. I added a duplicate in TestG, called it Ray+mul2, and it ran slower than the other two, even though the code was identical. Maybe a page crossing? Which one is faster depends on where it sits in the program and in memory I guess. But the new routine is at least as fast as dwtoa.
6014 cycles for 100 * dwtoa
5975 cycles for 100 * Ray+mul
6083 cycles for 100 * Ray+mul2
6038 cycles for 100 * dwtoa
5914 cycles for 100 * Ray+mul
6111 cycles for 100 * Ray+mul2
6041 cycles for 100 * dwtoa
5886 cycles for 100 * Ray+mul
6074 cycles for 100 * Ray+mul2
6032 cycles for 100 * dwtoa
5870 cycles for 100 * Ray+mul
6108 cycles for 100 * Ray+mul2
Averages:
6035 cycles for dwtoa
5900 cycles for Ray+mul
6096 cycles for Ray+mul2
20 bytes for dwtoa
104 bytes for Ray+mul
104 bytes for Ray+mul2
dwtoa -123456789
Ray+mul -123456789
Ray+mul2 -123456789
Quote from: daydreamer on April 03, 2024, 12:54:31 AMI wonder if you have skill with chess recursive code
No, but I did a tic tac toe in fortran once, without recursion.
Hi,all
Perhaps the Ray's algorithm has an important research and application field in compression/encryption/large number calculation. since it includes "any".
regard
Quote from: jimg on April 03, 2024, 02:46:08 AMMaybe a page crossing?
I think you can force either or both to a page boundary with an "Align 4096" and then a "db 4090 dup (?)" and then the code.
Can someone here explain how these magic numbers work (for example, how you can use them to turn a divide into a multiply)? In a simple way? My non-higher-math head hurts from all this.
Hi
The web is full of good examples.
This technique is now part of compiler optimisations.
https://www.bartol.udel.edu/mri/sam/Athlon_code_optimization_guide.pdf (https://www.bartol.udel.edu/mri/sam/Athlon_code_optimization_guide.pdf) Integer Optimizations
https://www.complang.tuwien.ac.at/papers/ertl19kps.pdf (https://www.complang.tuwien.ac.at/papers/ertl19kps.pdf) Good intro
https://gmplib.org/~tege/divcnst-pldi94.pdf (https://gmplib.org/~tege/divcnst-pldi94.pdf) Original publication by Torbjörn Granlund (~1991)
Biterider
Thanks, but all of those are too technical or too mathematically abstract. Can't someone here explain, in simple terms, how it works?
Hi
In chapter "2 Background" of the second link, it is explained very well.
Biterider
Quote from: NoCforMe on April 03, 2024, 06:28:08 AMhow it works?
Simply stated, you can divide by a number, or multiply by it's reciprocal and get the same result. The reciprocal of a number is dividing 1 by the number.
For example 1 divided by 2 is 0.5, thus 0.5 is the reciprocal of 2. So we can divide a number by 2, or we can multiply by 0.5 and get the same result.
Very simply stated, what they are doing is multiplying by the reciprocal of 10, instead of dividing my 10. But its the remainder of the divide they want, so that is what the rest of the code is about. I have not studied the code to see how they come up with the remainder.
Thanks! That's what I was after, a simple explanation rather than a 5,000-word treatise on processor optimizations or whatever.
Still not sure how 0CCCCCCCDh works out to be the reciprocal of 10 ...
0ccccccc..ccc = 0.0001100110011001100110011..... in binary
you know the stuff to the left of the binary point are powers of two, 2^1,2^2,2^3, etc.
the stuff to the right are also, 2^-1,2^-2,2^-3,etc. (1/2, 1/4,1/8,1/16,etc)
so in .00001100110011 we have 0*1/2 + 0*1/4 + 0*1/8 + 1*1/16 + 1*1/32 + 0*1/64 etc. so
sum
1/16=.0625 .0625
+ 1/32=.03125 .09375
+ 1/256=.00390625 .09765625
etc, approaching .1
The final D is just the remaining CCCC... rounded up.
OK ... I guess.
What's still hard to wrap my head around here is that what we're actually doing is multiplying, by a very large non-zero number, so I still don't get how we get multiplication by the reciprocal powers of two (2^-1, 2^-2, etc.) from that operation. How does that work? How do we get that "virtual" binary point?
Thanks to all who are putting up with my simple-mindedness here.
Quote from: NoCforMe on April 03, 2024, 07:14:28 AMStill not sure how 0CCCCCCCDh works out to be the reciprocal of 10 ...
Neither am I, but I understand what they are doing, but not the details.
we are throwing away the overflow and keeping only the bits that line up the way we want. If you look closely, we are actually multiplying by 8*1/10 to get things lined up. Later down we shift edx over 3 places to divide out the extra 8.
okay, here is the latest and final for me. rearranged some registers and scrapped of a tiny bit of time.
ctAnyToAnyMul: ; setup (needs a better name)
test eax,eax
jnz @f
mov word ptr [edi],30h
ret
@@: jns @f
mov byte ptr [edi],'-'
neg eax
add edi, 1
@@: mov ecx,0CCCCCCCDh ; 8 * 1/10
call ctanydoit
ret
align 16
ctanydoit: ; this one needs a better name also
push ebx ; save ebx, or a digit
mov ebx,eax ; save original
mul ecx ; 0CCCCCCCDh 8 * 1/10 num * 1/10 magic number
shr edx, 3 ; magic number fixup (divide by 8)
mov eax,edx ; save answer
lea edx,[edx*4+edx] ; *5
add edx,edx ; *10
sub ebx,edx ; remainder
test eax,eax
.if !zero?
call ctanydoit ; generate next digit
.endif
add bl,'0'
mov [edi],bl
inc edi
pop ebx ; restore ebx, or get next digit
ret
Quote from: jimg on April 03, 2024, 10:05:44 AMokay, here is the latest and final for me.
How was the timing? Post what you got.
Quote from: jimg on April 03, 2024, 10:05:44 AMokay, here is the latest and final for me
Well done :thumbsup:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4483 cycles for dwtoa
2836 cycles for dw2str
16663 cycles for MasmBasic Str$()
13924 cycles for Ray's algo I
3870 cycles for Ray's algo, mod JimG
20 bytes for dwtoa
82 bytes for dw2str
16 bytes for MasmBasic Str$()
110 bytes for Ray's algo I
153 bytes for Ray's algo, mod JimG
dwtoa -123456789
dw2a 4171510507
dw2str -123456789
MasmBasic Str$() -123456789
Ray's algo I 171510507
Ray's algo, mod JimG -123456789
Thank you jimg,
For big integers use my faster code.
Would you like to include my code in your speed test?
; Int_to_string - Convert a big integer into an string
; IN: RAX = binary integer
; RCX = address of location to store string
align 16
db 15 dup(90h)
i2aA proc
push rdi
xor r9d, r9d
mov rdi, rcx
mov r8d, 10
xor ecx, ecx
xor edx, edx
@@1:
div r8 ; divide by the number-base
shl rcx, 8
or edx, 30h
add r9, 1
xchg dl, cl
test eax, eax
je @@3
cmp r9d, 8
jne @@1
push rsi
mov rsi,rcx
xor ecx, ecx
@@2:
div r8 ; divide by the number-base
shl rcx, 8
or edx, 30h
inc r9 ; r9 -> counter
xchg dl, cl
test eax, eax
jne @@2
mov [rdi],rcx
mov [rdi+r9-8],rsi
pop rsi
pop rdi
ret
align 16
@@3:
mov [rdi],rcx
pop rdi
ret
i2aA endp
Note:
The code for the minus sign and leading zeros must be kept out of the algorithm.
It's not normal to bloat the algorithm with 99.99% unusable code.
Quote from: jimg on April 03, 2024, 02:46:08 AMWhich one is faster depends on where it sits in the program and in memory I guess
Thanks, Ray. I checked that and you are right, there was a problem with the align_64 macro that I borrowed from Nidud (https://masm32.com/board/index.php?topic=4545.msg48734) years ago.
The latest test (https://masm32.com/board/index.php?msg=128545) (above, reply #68) has an align 256. That assembles with UAsm only, though. M$ ml.exe throws an error.
Quote from: lingo on April 03, 2024, 11:22:50 AMWould you like to include my code in your speed test?
My speed test is 32-bit code. Besides, we all know how slow
div is - no need to include a slow algo.
Quote from: jj2007 on April 03, 2024, 11:28:46 AMQuote from: jimg on April 03, 2024, 02:46:08 AMWhich one is faster depends on where it sits in the program and in memory I guess
Thanks, Ray. I checked that and you are right, there was a problem with the align_64 macro that I borrowed from Nidud (https://masm32.com/board/index.php?topic=4545.msg48734) years ago.
The latest test (https://masm32.com/board/index.php?msg=128545) (above, reply #68) has an align 256. That assembles with UAsm only, though. M$ ml.exe throws an error.
I am glad you got it working.
Quote from: NoCforMe on April 03, 2024, 08:02:31 AMOK ... I guess.
What's still hard to wrap my head around here is that what we're actually doing is multiplying, by a very large non-zero number, so I still don't get how we get multiplication by the reciprocal powers of two (2^-1, 2^-2, etc.) from that operation. How does that work? How do we get that "virtual" binary point?
Thanks to all who are putting up with my simple-mindedness here.
You know how exponential number work, right?
2000 * 300 = (2.0*10^3) * (3.0*10^2) = (2*3)*10^(3+2)= 6* 10^5 = 600000
to multiply exponentials, you multiply the mantissas and add the exponents
Same for binary, it's just powers of 2 rather than 10.
So, if you look a the hex representation of .1 in IEEE754 floating point it looks like 3dcccccd hex.
The leftmost bit is the sign, followed by 8 bits of exponent followed by the mantissa with an implied 1 as the high bit.
exp=7bh=123, exponents are stored with 127 added, so 123-127=-4, so exponent is 2^-4
The mantissa for 1/10 is cccccdh. look familiar?
so to multiply by 1/10 you multiply your numbers mantissa by that mantissa, and adjust the exponent as needed.
The mantissa of your number is just the number with the exponent of (10^0=1) e.g. 1234 is 1234*10^0
Try these if you are hazy on floating point numbers.
FloatConverter (https://www.h-schmidt.net/FloatConverter/IEEE754.html)
FloatToHex (https://gregstoll.com/~gregstoll/floattohex/)
OK, I do understand floating-point. But what I don't understand is how you get an integer value (say 0CCCCCCCDh) to behave as if it were a floating-point value. It's the mechanics of the procedure that escape me: would you mind going through the process, step by step, including the shifts and all, for dummies like me?
I could actually use some pictures and diagrams (like in Alice's Restaurant: 8x10 color glossy photographs with circles and arrows and a paragraph on the back of each one explainin' what each one was).
okay, that's going to have to wait for tomorrow.
But you do see that magic number is the mantissa of .1, right?
So you multiply that mantissa by the mantissa of the number you want to divide by 10.
I didn't design this system, and it took me a while to figure it out.
Well, I have to give you points for seeming to understand it.
I can wait.
It's called a magic number for a reason :biggrin:
AMD Athlon(tm) II X2 220 Processor (SSE3)
61 bytes for other
8105 cycles for 100 * dwtoa
6404 cycles for 100 * dw2str
48371 cycles for 100 * MasmBasic Str$()
34694 cycles for 100 * Ray's algo I
10410 cycles for 100 * Ray's algo, mod JimG
8207 cycles for 100 * dwtoa
6404 cycles for 100 * dw2str
26515 cycles for 100 * MasmBasic Str$()
34499 cycles for 100 * Ray's algo I
10303 cycles for 100 * Ray's algo, mod JimG
8109 cycles for 100 * dwtoa
6405 cycles for 100 * dw2str
26517 cycles for 100 * MasmBasic Str$()
34405 cycles for 100 * Ray's algo I
10303 cycles for 100 * Ray's algo, mod JimG
8104 cycles for 100 * dwtoa
6413 cycles for 100 * dw2str
27009 cycles for 100 * MasmBasic Str$()
34508 cycles for 100 * Ray's algo I
10303 cycles for 100 * Ray's algo, mod JimG
Averages:
8107 cycles for dwtoa
6404 cycles for dw2str
26763 cycles for MasmBasic Str$()
34504 cycles for Ray's algo I
10303 cycles for Ray's algo, mod JimG
20 bytes for dwtoa
82 bytes for dw2str
16 bytes for MasmBasic Str$()
110 bytes for Ray's algo I
153 bytes for Ray's algo, mod JimG
dwtoa -123456789
dw2a 4171510507
dw2str -123456789
MasmBasic Str$() -123456789
Ray's algo I 171510507
Ray's algo, mod JimG -123456789
--- ok ---
For the sake of completeness, Ray's algo used for converting to a bin$:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
29156 cycles for crt__itoa
928 cycles for MasmBasic Bin$()
45257 cycles for Ray's algo
26 bytes for crt__itoa
16 bytes for MasmBasic Bin$()
110 bytes for Ray's algo
crt__itoa 100101101011010000111
MasmBasic Bin$() 00000000000100101101011010000111
Ray's algo 00000000000100101101011010000111
As expected, the div ebx with ebx=2 is not efficient, compared to a simple sar eax, 1.
One more bin-to-dec, with a modification proposed by JimG:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4440 cycles for dwtoa
2843 cycles for dw2str
16348 cycles for MasmBasic Str$()
13864 cycles for Ray's algo I
3767 cycles for Ray's algo, mod JimG
20 bytes for dwtoa
82 bytes for dw2str
16 bytes for MasmBasic Str$()
110 bytes for Ray's algo I
1129 bytes for Ray's algo, mod JimG
dwtoa -123456789
dw2a 4171510507
dw2str -123456789
MasmBasic Str$() -123456789
Ray's algo I 171510507
Ray's algo, mod JimG -123456789
Quote from: NoCforMe on April 03, 2024, 01:41:21 PMWell, I have to give you points for seeming to understand it.
I can wait.
I've given this a lot of thought, and I'm not good enough to explain this more than I have. If there is something in particular you don't understand, I'd be happy to explain that, but a blanket explain everything is beyond me.
Hi,Lingo
Quote00001220 clock cycles, (Ray_AnyToAny)x1000, OutPut: 987654321098
00001130 clock cycles, (lingo_i2aA)x1000, OutPut: 987654321098
00000416 clock cycles, (Roberts_dqtoa)x1000, OutPut: 987654321098
Roberts_dqtoa:
dqtoa proc uses rbx rsi rdi dqValue:QWORD, lpBuffer:QWORD
; -------------------------------------------------------------
; convert QWORD to ascii string
; dqValue is value to be converted
; lpBuffer is the address of the receiving buffer
; EXAMPLE:
; invoke dqtoa,dqValue,ADDR buffer
; -------------------------------------------------------------
mov rax, dqValue
mov rdi, lpBuffer
test rax,rax
jnz @pos
@zero:
mov word ptr [rdi],30h
jmp @exit
@pos:
mov rcx, 0cccccccccccccccdh ; 8 * 1/10
mov rsi, rdi
.while (rax > 0)
mov rbx,rax ; save original
mul rcx ; num * 1/10 magic number. rax/10 = rax* magic number
shr rdx, 3 ; magic number fixup
mov rax,rdx ;
lea rdx,[rdx*4+rdx] ; *5
add rdx,rdx ; *10
sub rbx,rdx ; remainder
add bl,'0' ; get the ascii value
mov [rdi],bl ; save it
add rdi, 1 ; get next digit
.endw
mov byte ptr [rdi], 0 ; terminate the string
; We now have all the digits, but in reverse order.
.while (rsi < rdi)
sub rdi, 1
mov al, [rsi]
mov ah, [rdi]
mov [rdi], al
mov [rsi], ah
add rsi, 1
.endw
@exit:
ret
dqtoa endp
Quote from: jimg on April 04, 2024, 04:59:41 AMQuote from: NoCforMe on April 03, 2024, 01:41:21 PMWell, I have to give you points for seeming to understand it.
I can wait.
I've given this a lot of thought, and I'm not good enough to explain this more than I have. If there is something in particular you don't understand, I'd be happy to explain that, but a blanket explain everything is beyond me.
Maybe you're better at explaining than you think you are.
Let's try this: can you explain, in some detail, how the bits of the number being "divided" (multiplied) and the magic number get shuffled and manipulated to come up with the final answer? This would explain how the virtual binary point comes into play, and how we turn a lowly integer (like
0CCCC ... CCCDh) into a virtual floating-point number. Possible?
I know I'm asking a lot, but at this point I really do not understand how this works. And that bothers me. I don't like taking things that I don't understand for granted.
Quote from: six_L on April 04, 2024, 05:29:49 AMlea rdx,[rdx*4+rdx]
That can be made faster, see reply #80.
Thank you six_L, :thumbsup:
Perfect!
Quote from: NoCforMe on April 04, 2024, 05:35:24 AMLet's try this: can you explain, in some detail, how the bits of the number being "divided" (multiplied) and the magic number get shuffled and manipulated to come up with the final answer? This would explain how the virtual binary point comes into play, and how we turn a lowly integer (like 0CCCC ... CCCDh) into a virtual floating-point number. Possible?
I know I'm asking a lot, but at this point I really do not understand how this works. And that bothers me. I don't like taking things that I don't understand for granted.
Okay, you made me take a closer look and found out my assumptions were wrong all along.
The answers come out correct, the algorithm works, it just that it seems the extra
multiply by 8, divide by 8 was unnecessary as far as I can see. I'll have to write
a test program before I'm 100% sure, but it looks like it from here.
So, removing the factor of eight means we save a shr,3 instruction.
The "magic" number now becomes 1/8 less or 19999999 hex. Strange coincidence with
a decimal looking number, but it's not. Originally, I thought the extra factor of eight
was required to maximize the precision of the numbers, left justifying the binary in the
maximum 32 bits. For now, it seems it's unneccessary. I tested on the min and max
32 bit binary number (minimum value of -2,147,483,648 and a maximum value of 2,147,483,647)
and it seems to work.
This was my original explanation, in case it turns out we do need the extra *8
" 1/10 is a repeating number in binary 0.000110011001100....
we multiply by 8 to maximize the precision= .11001100110011001100
on the bottom end, we slid in additional bits from the repeating binary
then, since there is more repeating binary left, we round up to a D
We do all this because, without full 32 bits of precision, the last
digit of our result would suffer from truncation error and would be
incorrect.
"
What hogwash.
Anyway, here is my new dialog-
eax = number in binary = 1234
mov ebx,eax ; save original for subtraction below to get remainder
mul ecx ; 19999999h = 1/10
1/10 in binary = 0.00011001100110011001100110011001 = 19999999 hex
1234 * 19999999h = 1234 * 429,496,729.625 = 529,998,964,357.25 = 7B66666685 hex
7B66666685 => edx=7B eax=66666685 edx= 123
edx is the overflow, from the multiplication, the integer part of the answer
the fractional part being in eax.
mov eax,edx ; save answer
lea edx,[edx*4+edx] ; *5 = 123*5 = 615
add edx,edx ; * = *10 total = 1230
sub ebx,edx ; remainder 1234-1230 = 4 our last digit,
which we push onto the stack for saving
next time through we have 123 in eax and do the same thing
mov ebx,eax ; save original 123
mul ecx ; 19999999h = 1/10 gives 12 in edx, don't care about eax
mov eax,edx ; save answer
lea edx,[edx*4+edx] ; *5 12*5=60
add edx,edx ; *2=120
sub ebx,edx ; remainder 123-120 = 3, our next digit
etc until our remainder was zero, then we are done.
at the end we pop the digits off the stack and save them. You know this.
I hope this helps. If not, ask again.
Quote from: jimg on April 04, 2024, 07:35:50 AMOkay, you made me take a closer look and found out my assumptions were wrong all along.
Wow. Very impressive answer. I haven't had time to delve into it yet, but I will. Thanks!
Thank you six_L :thumbsup:
Would you like to include my new 64 bit code in your perfect speed test, together with the Robert's code?
; Int_to_string - Convert a big integer into an string
; IN: RAX = binary integer
; RCX = address of location to store string
; OUT: RAX = new start of the string in the buffer
align 16
db 90h
i2a64 proc
push rbx
lea r9, [rcx+24] ; lpBuffer
mov r8, 0cccccccccccccccdh ; 8 * 1/10
@@:
mov rbx, rax
mul r8
shr rdx, 3
sub r9, 1
lea rcx, [rdx*4+rdx-18h]
lea rax, [rcx+rcx]
sub rbx, rax
mov rax, rdx
mov [r9],bl
test eax, eax
jnz @b
mov rax, r9 ; return in RAX new start of the buffer
pop rbx
ret
i2a64 endp
Note:
The code for the minus sign and leading zeros must be kept out of the algorithm.
It's not normal to bloat the algorithm with 99.99% unusable code.
So....
I know I said I was done before, but with all the changes, here's the latest.
You can blame NoCforMe.
.DATA?
align 4
resultJim db 16 dup(?)
.CODE
alignProc
TestG_s:
db 2 dup (?) ; dummy spacer to align instructions best
ctAnyToAnyMulJim: ; setup (needs a better name)
test eax,eax
jnz @f
mov word ptr [edi],30h
ret
@@: jns @f
mov byte ptr [edi],'-'
neg eax
add edi, 1
@@: mov ecx,19999999h ; 1/10
call ctanydoit
ret
align 16
nop
ctanydoit: ; this one needs a better name also
push ebx ; save ebx, or a digit
mov ebx,eax ; save original
mul ecx ; 19999999h = 1/10
mov eax,edx ; save answer
lea edx,[edx*4+edx] ; *5
add edx,edx ; *10
sub ebx,edx ; subtract from original gives remainder digit
test eax,eax
.if !zero?
call ctanydoit ; generate next digit
.endif
add bl,'0'
mov [edi],bl
inc edi
pop ebx ; restore ebx, or get next digit
ret
results:
Averages:
5989 cycles for dwtoa
5380 cycles for dw2str
22244 cycles for MasmBasic Str$()
14633 cycles for Ray's algo I
5313 cycles for Ray's algo, mod JimG
Averages:
5862 cycles for dwtoa
5386 cycles for dw2str
22392 cycles for MasmBasic Str$()
14412 cycles for Ray's algo I
5341 cycles for Ray's algo, mod JimG
Averages:
5861 cycles for dwtoa
5376 cycles for dw2str
22837 cycles for MasmBasic Str$()
14746 cycles for Ray's algo I
5310 cycles for Ray's algo, mod JimG
Averages:
5856 cycles for dwtoa
5415 cycles for dw2str
22320 cycles for MasmBasic Str$()
14484 cycles for Ray's algo I
5230 cycles for Ray's algo, mod JimG
That four consecutive passes, no cherry picking.
Not bad for having used dwtoa for a couple decades. Even beat that cheater dw2str.
Of course, now we need to update all those other routines with the changes (use the new 1/10 and don't shift) to get fair comparisons.
I've attach JJ's routine with changes I made because some things were skewing the results.
Quote from: jimg on April 04, 2024, 10:52:19 AMI know I said I was done before
A good programmer is never done, always looking for a better way. Its best to say, "done for now".
You certainly did a very good job of adapting the algorithm for base 10.
Well, with that being said......
If you weren't impressed with the last one, hang onto your socks.
I did something I've been meaning to try all along.
Personally I dislike recursion in the strongest terms.
So I got rid of the recursion.
.DATA?
align 4
resultJim2 db 16 dup(?)
.CODE
alignProc
TestH_s:
db 2 dup (?) ; dummy spacer to align instructions best
dword2ascii: ; setup
test eax,eax
jnz @f
mov word ptr [edi],30h
ret
@@: jns @f
mov byte ptr [edi],'-'
neg eax
add edi, 1
@@: mov ecx,19999999h ; 1/10
push esi
mov esi,0
call dword2asciiloop
pop esi
ret
align 16
nop ; code optimization
dword2asciiloop:
push ebx ; save ebx, or a digit
inc esi
mov ebx,eax ; save original
mul ecx ; 19999999h = 1/10
mov eax,edx ; save answer
lea edx,[edx*4+edx] ; *5
add edx,edx ; *10
sub ebx,edx ; subtract from original gives remainder digit
test eax,eax ; are we done?
jnz dword2asciiloop
.repeat
add bl,'0'
mov [edi],bl
inc edi
pop ebx ; restore ebx, or get next digit
dec esi
.until zero?
ret
results:
Averages:
5932 cycles for dwtoa
5363 cycles for dw2str
22231 cycles for MasmBasic Str$()
5204 cycles for Ray's algo, mod JimG
4581 cycles for jgnorecurse
Averages:
5859 cycles for dwtoa
5376 cycles for dw2str
22270 cycles for MasmBasic Str$()
5282 cycles for Ray's algo, mod JimG
4596 cycles for jgnorecurse
Averages:
5860 cycles for dwtoa
5376 cycles for dw2str
22506 cycles for MasmBasic Str$()
5282 cycles for Ray's algo, mod JimG
4595 cycles for jgnorecurse
Averages:
5855 cycles for dwtoa
5376 cycles for dw2str
22690 cycles for MasmBasic Str$()
5264 cycles for Ray's algo, mod JimG
4593 cycles for jgnorecurse
dwtoa -123456789
dw2str -123456789
MasmBasic Str$() -123456789
Ray's algo, mod JimG -123456789
jgnorecurse -123456789
Quote from: jimg on April 04, 2024, 10:52:19 AMEven beat that cheater dw2str
How dare you :rofl:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4618 cycles for dwtoa
2994 cycles for dw2str
3930 cycles for Ray's algo, mod JimG
3040 cycles for dw2$X
Averages:
4620 cycles for dwtoa
2994 cycles for dw2str
3951 cycles for Ray's algo, mod JimG
3044 cycles for dw2$X
Averages:
4634 cycles for dwtoa
3000 cycles for dw2str
3975 cycles for Ray's algo, mod JimG
3036 cycles for dw2$X
> If you weren't impressed with the last one, hang onto your socks.
P.S.: socks algo not yet included, sorry - it's 2:40 AM here...
And a final cleanup. Only gained about 200 from the change.
alignProc
TestH_s:
db 2 dup (?) ; dummy spacer to align instructions best
dword2ascii: ; setup
test eax,eax
jnz @f
mov word ptr [edi],30h
ret
@@: jns @f
mov byte ptr [edi],'-'
neg eax
add edi, 1
@@: mov ecx,19999999h ; 1/10
push esi
mov esi,0
.repeat
push ebx ; save ebx, or a digit
inc esi
mov ebx,eax ; save original
mul ecx ; 19999999h = 1/10
mov eax,edx ; save answer
lea edx,[edx*4+edx-24] ; *5 also subtract out half of '0' (24) to convert last byte to ascii ala lingo
add edx,edx ; *10 double it, so full value of '0' is covered. so -48 extra to dl
sub ebx,edx ; subtract from original gives remainder digit -(-48) =+48=+'0' bl now in ascii
test eax,eax ; are we done?
.until zero?
.repeat
mov [edi],bl
inc edi
pop ebx ; restore ebx, or get next digit
dec esi
.until zero?
pop esi
ret
NameH equ <jgnorecurse>
Quote from: jimg on April 04, 2024, 12:10:44 PMAnd a final cleanup.
Very good job, excellent work.
Just read the comments.
I'm using Jochen's trick here. I didn't see any gain in speed, but I left it in anyway-
lea edx,[edx*4+edx-26] ; *5 also subtract out half of '0' (26) to convert last byte to ascii
add edx,edx ; *10 double it, so full value of '0' is covered. so -52 extra to dl
sub ebx,edx ; subtract from original gives remainder digit -(-52) =+52=+'0' bl now in ascii
I'll add these to the code above
QuoteI'm using Jochen's trick here. I didn't see any gain in speed, but I left it in anyway-
lea edx,[edx*4+edx-26] ; *5 also subtract out half of '0' (26) to convert last byte to ascii
add edx,edx ; *10 double it, so full value of '0' is covered. so -52 extra to dl
sub ebx,edx ; subtract from original gives remainder digit -(-52) =+52=+'0' bl now in ascii
26*2=52='4' -> ASCII code
This trick was stolen from my post (see page 6 bottom in my post) by a very stupid thief, ha-ha-hah... :badgrin: :badgrin: :badgrin: :skrewy:
The true number in my post is for 18h=24 * 2 = 30h=48 = '0'
lea rcx, [rdx*4+rdx-18h]
lea rax, [rcx+rcx]
...........
You are absolutely correct. My appologies. I'm going to see why it works.
This is totally my fault. I tried to modify one of Jochen's test procs to test my new proc, and I missed where I had to change one value, and it picked up the results from the previous tests answers. No excuse, just stupidity. Had I been printing the correct results, I would have immediately seen that something was wrong. Stupid.
Again, my apologies for not crediting the correct person. But anyway, I'll fix the above code.
Please don't start a war over this, this was totally unintentional.
On another note, I said above that dw2str was a cheater. That because it requires one to preset what the length of the answer is going to be. But of course, one doesn't know what the length is going to be. By presetting it, it saves all the code to reverse the string. Not usable unless you know you want a right justified string, which is totally possible of being the case, just not for this test.
Unless I'm completely misreading the code, which, given what happened just above, is possible.
One other caveat. I'm optimizing this on an Intel computer. If you are running AMD, use Jochen's result to choose a routine. Here is my latest results for three consequtive runs fixing the screwup above-
Averages:
5862 cycles for dwtoa
5389 cycles for dw2str
22890 cycles for MasmBasic Str$()
5332 cycles for Ray's algo, mod JimG
4280 cycles for jgnorecurse
Averages:
5876 cycles for dwtoa
5377 cycles for dw2str
23038 cycles for MasmBasic Str$()
5293 cycles for Ray's algo, mod JimG
4256 cycles for jgnorecurse
Averages:
5865 cycles for dwtoa
5376 cycles for dw2str
22772 cycles for MasmBasic Str$()
5220 cycles for Ray's algo, mod JimG
4256 cycles for jgnorecurse
20 bytes for dwtoa
82 bytes for dw2str
16 bytes for MasmBasic Str$()
100 bytes for Ray's algo, mod JimG
80 bytes for jgnorecurse
dwtoa -123456789
dw2str -123456789
MasmBasic Str$() -123456789
Ray's algo, mod JimG -123456789
jgnorecurse -123456789
Hi,jj2007
QuoteNo negative numbers?
the lingo_i2aA hasn't the negative numbers.
For the sake of fair race, I made a bit modification to remove the negative numbers.
Hi,Lingo
1)
Quote00001088 clock cycles, (Ray_AnyToAny)x1000, OutPut: 987654321098
00000281 clock cycles, (lingo_i2a64)x1000, OutPut: 987654321098
00000386 clock cycles, (Roberts_dqtoa)x1000, OutPut: 987654321098
You are the fastest this time.
2)
if you have some idle time, help me check the X64timers.asm. whether is it correctly? it's been translated from 32bit asm.
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
; These two macros perform the grunt work involved in measuring the
; processor clock cycle count for a block of code. These macros must
; be used in pairs, and the block of code must be placed in between
; the counter_begin and counter_end macro calls. The counter_end macro
; returns the clock cycle count for a single pass through the block of
; code, corrected for the test loop overhead, in RAX.
;
; These macros require a .586 or higher processor directive.
;
; If your code is using MMX instructions and not executing an EMMS
; at the end of each MMX instruction sequence, defining the symbol
; _EMMS will cause the ctr_end macro to insert an EMMS in front of
; the FPU instructions.
;
; The loopcount parameter should be set to a relatively high value to
; produce repeatable results.
;
; Note that setting the priority parameter to REALTIME_PRIORITY_CLASS
; involves some risk, as it will cause your process to preempt *all*
; other processes, including critical Windows processes. Setting the
; priority parameter to HIGH_PRIORITY_CLASS instead will significantly
; reduce the risk, and in most cases will produce the same cycle count.
;------------------------------------------------------------------------------
counter_begin MACRO loopcount:REQ, priority
LOCAL label
IFNDEF __counter__stuff__defined__
__counter__stuff__defined__ equ <1>
.data
__counter__pc__count__delta LARGE_INTEGER <>
__counter__pc__count__0 LARGE_INTEGER <>
__counter__pc__count__1 LARGE_INTEGER <>
__counter__pc__count__2 LARGE_INTEGER <>
__counter__pc__count__3 LARGE_INTEGER <>
__counter__loop__count__ dq 0
__counter__loop__counter__ dq 0
__counter__dq_count__ dq 0
.code
ENDIF
mov __counter__loop__count__, loopcount
IFNB <priority>
invoke GetCurrentProcess
invoke SetPriorityClass, rax, priority
ENDIF
invoke QueryPerformanceCounter, ADDR __counter__pc__count__0
mov __counter__loop__counter__, loopcount
@@: ;; Start an empty reference loop
sub __counter__loop__counter__, 1
jnz @B
invoke QueryPerformanceCounter, ADDR __counter__pc__count__1
mov rax,__counter__pc__count__1.QuadPart
sub rax,__counter__pc__count__0.QuadPart
mov __counter__pc__count__delta.QuadPart,rax ;; Overhead count
mov __counter__loop__counter__, loopcount
invoke QueryPerformanceCounter, ADDR __counter__pc__count__2
label: ;; Start test loop
__counter__loop__label__ equ <label>
ENDM
;------------------------------------------------------------------------------
counter_end MACRO
sub __counter__loop__counter__, 1
jnz __counter__loop__label__
invoke QueryPerformanceCounter, ADDR __counter__pc__count__3
invoke GetCurrentProcess
invoke SetPriorityClass, rax, NORMAL_PRIORITY_CLASS
IFDEF _EMMS
EMMS
ENDIF
mov rax,__counter__pc__count__3.QuadPart
sub rax,__counter__pc__count__2.QuadPart
sub rax,__counter__pc__count__delta.QuadPart
mov __counter__pc__count__3.QuadPart,rax ;; count
finit
fild __counter__pc__count__3.QuadPart
fild __counter__loop__count__
fdiv
fistp __counter__dq_count__
mov rax, __counter__dq_count__
ENDM
;------------------------------------------------------------------------------
; These two macros perform the grunt work involved in measuring the
; execution time in milliseconds for a specified number of loops
; through a block of code. These macros must be used in pairs, and
; the block of code must be placed in between the timer_begin and
; timer_end macro calls. The timer_end macro returns the elapsed
; milliseconds for the entire loop in RAX.
;
; These macros utilize the high-resolution performance counter.
; The return value will be zero if the high-resolution performance
; counter is not available.
;
; If your code is using MMX instructions and not executing an EMMS
; at the end of each MMX instruction sequence, defining the symbol
; _EMMS will cause the timer_end macro to insert an EMMS in front of
; the FPU instructions.
;
; The loopcount parameter should be set to a relatively high value to
; produce repeatable results.
;
; Note that setting the priority parameter to REALTIME_PRIORITY_CLASS
; involves some risk, as it will cause your process to preempt *all*
; other processes, including critical Windows processes. Setting the
; priority parameter to HIGH_PRIORITY_CLASS instead will significantly
; reduce the risk, and in most cases will produce very nearly the same
; result.
;------------------------------------------------------------------------------
timer_begin MACRO loopcount:REQ, priority
LOCAL label
IFNDEF __timer__stuff__defined__
__timer__stuff__defined__ equ <1>
.data
__timer__pc__frequency__ LARGE_INTEGER <>
__timer__pc__count__delta LARGE_INTEGER <>
__timer__pc__count__0 LARGE_INTEGER <>
__timer__pc__count__1 LARGE_INTEGER <>
__timer__pc__count__2 LARGE_INTEGER <>
__timer__pc__count__3 LARGE_INTEGER <>
__timer__loop__counter__ dq 0
__timer__dq_count__ dq 0
.code
ENDIF
invoke QueryPerformanceFrequency, ADDR __timer__pc__frequency__
.if rax != 0
IFNB <priority>
invoke GetCurrentProcess
invoke SetPriorityClass, rax, priority
ENDIF
invoke QueryPerformanceCounter, ADDR __timer__pc__count__0
mov __timer__loop__counter__, loopcount
@@: ;; Start an empty reference loop
sub __timer__loop__counter__, 1
jnz @B
invoke QueryPerformanceCounter, ADDR __timer__pc__count__1
mov rax,__timer__pc__count__1.QuadPart
sub rax, __timer__pc__count__0.QuadPart
mov __timer__pc__count__delta.QuadPart,rax ;; Overhead count
invoke QueryPerformanceCounter, ADDR __timer__pc__count__2 ;; Start test count
mov __timer__loop__counter__, loopcount
label: ;; Start test loop
__timer__loop__label__ equ <label>
.endif
ENDM
;------------------------------------------------------------------------------
timer_end MACRO
sub __timer__loop__counter__, 1
jnz __timer__loop__label__
invoke QueryPerformanceFrequency, ADDR __timer__pc__frequency__
.IF rax != 0
invoke QueryPerformanceCounter, ADDR __timer__pc__count__3
invoke GetCurrentProcess
invoke SetPriorityClass, rax, NORMAL_PRIORITY_CLASS
IFDEF _EMMS
EMMS
ENDIF
mov rax,__timer__pc__count__3.QuadPart
sub rax, __timer__pc__count__2.QuadPart
sub rax, __timer__pc__count__delta.QuadPart
mov __timer__pc__count__3.QuadPart,rax ;; count
finit
fild __timer__pc__count__3.QuadPart
fild __timer__pc__frequency__.QuadPart
fdiv
mov __timer__dq_count__, 1000000 ;ns*1000000=ms
fild __timer__dq_count__
fmul
fistp __timer__dq_count__
mov rax, __timer__dq_count__
.ELSE
xor rax, rax ;; No performance counter
.ENDIF
ENDM
;------------------------------------------------------------------------------
SpinUp MACRO
invoke Sleep, 1
mov rax, 100000000 ; tell the CPU it's needed
.Repeat
dec rax
.Until Sign?
ENDM
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
i modify the "counter_begin 10, HIGH_PRIORITY_CLASS" into "counter_begin 20, HIGH_PRIORITY_CLASS", there has no distinction.
Thank you six_L, :thumbsup:
Excellent!
That's better than ever. :greenclp:
Hi,Lingo
Quote00001237 clock cycles, (Ray_AnyToAny)x1000, OutPut: 987654321098
00000297 clock cycles, (lingo_i2a64)x1000, OutPut: 987654321098
00000458 clock cycles, (Roberts_dqtoa)x1000, OutPut: 987654321098
00000319 clock cycles, (jimg_qword2ascii)x1000, OutPut: 98765431::98
jimg_qword2ascii:
;db 2 dup (?) ; dummy spacer to align instructions best
qword2ascii proc
test rax,rax
jnz @f
mov word ptr [rdi],30h
ret
;@@:
; jns @f
; mov byte ptr [rdi],'-'
; neg rax
; add rdi, 1
@@:
mov rcx,1999999999999999h ; 1/10
push rsi
mov rsi,0
.repeat
push rbx ; save rbx, or a digit
inc rsi
mov rbx,rax ; save original
mul rcx ; 19999999h = 1/10
mov rax,rdx ; save answer
lea rdx,[rdx*4+rdx-18h] ; *5 also subtract out half of '0' (24) to convert last byte to ascii ala lingo
add rdx,rdx ; *10 double it, so full value of '0' is covered. so -48 extra to dl
sub rbx,rdx ; subtract from original gives remainder digit -(-48) =+48=+'0' bl now in ascii
test rax,rax ; are we done?
.until zero?
.repeat
mov [rdi],bl
inc rdi
pop rbx ; restore rbx, or get next digit
dec rsi
.until zero?
pop rsi
ret
qword2ascii endp
but the jimg_qword2ascii is not right result. maybe i translated errorly.
regard
Quote from: lingo on April 04, 2024, 03:08:53 PMa very stupid thief, ha-ha-hah... :badgrin: :badgrin: :badgrin: :skrewy:
See
Bravo, Lingo! (https://masm32.com/board/index.php?topic=11831.0) in the Colosseum ;-)
Quote from: jimg on April 04, 2024, 12:10:44 PMAnd a final cleanup.
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4638 cycles for dwtoa
2986 cycles for dw2str
3225 cycles for jgnorecurse
3042 cycles for dw2$X
Averages:
4616 cycles for dwtoa
2988 cycles for dw2str
3232 cycles for jgnorecurse
3042 cycles for dw2$X
Averages:
4608 cycles for dwtoa
2982 cycles for dw2str
3210 cycles for jgnorecurse
3026 cycles for dw2$X
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Averages:
7034 cycles for dwtoa
5718 cycles for dw2str
4764 cycles for jgnorecurse
6172 cycles for dw2$X
Averages:
7232 cycles for dwtoa
5883 cycles for dw2str
4932 cycles for jgnorecurse
6285 cycles for dw2$X
Averages:
7382 cycles for dwtoa
6289 cycles for dw2str
5239 cycles for jgnorecurse
6760 cycles for dw2$X
Looks like an Intel vs AMD problem...
That is a significant and surprising difference.
Jochen-
Would you check this out if you get a chance?
This is the test program but with a conditional compile around test H. On line 462 you should find
dotest=2 ; 1=test without DW2$X, else test with DS2$X
if dotest eq 1
Case 1 is just a copy of my routine for TestH.
Case 2 is your DS2$X routine.
When I run, my routine runs 120 cycles faster for case 1 than case 2. This makes no sense to me, but it is consistent. Does this happen on AMD also?
Quote from: jimg on April 05, 2024, 01:44:25 AMWould you check this out if you get a chance?
Averages:
4552 cycles for dwtoa
3041 cycles for dw2str
3176 cycles for jgnorecurse
2993 cycles for dw2$X
Averages:
4450 cycles for dwtoa
2845 cycles for dw2str
3072 cycles for jgnorecurse
3072 cycles for jgnorecurse2
Timings are not very consistent :rolleyes:
Same but with TimerLoops = 100000 (the default - you set it very low at 1000):
Averages:
4480 cycles for dwtoa
2870 cycles for dw2str
3100 cycles for jgnorecurse
3099 cycles for jgnorecurse2
Averages:
4597 cycles for dwtoa
2968 cycles for dw2str
3227 cycles for jgnorecurse
3016 cycles for dw2$X
Hmmmm...
Looks like it ran an extra 100 cycles on the first test, and an extra 127 on the second test, so you are seeing the same effect. Strange.
From reply #106
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
61 bytes for other
9439 cycles for 100 * dwtoa
7928 cycles for 100 * dw2str
7370 cycles for 100 * jgnorecurse
7980 cycles for 100 * dw2$X
9527 cycles for 100 * dwtoa
8051 cycles for 100 * dw2str
7511 cycles for 100 * jgnorecurse
7954 cycles for 100 * dw2$X
9533 cycles for 100 * dwtoa
8081 cycles for 100 * dw2str
7475 cycles for 100 * jgnorecurse
7945 cycles for 100 * dw2$X
9567 cycles for 100 * dwtoa
8053 cycles for 100 * dw2str
7472 cycles for 100 * jgnorecurse
7942 cycles for 100 * dw2$X
Averages:
9530 cycles for dwtoa
8052 cycles for dw2str
7474 cycles for jgnorecurse
7950 cycles for dw2$X
20 bytes for dwtoa
82 bytes for dw2str
80 bytes for jgnorecurse
106 bytes for dw2$X
dwtoa -123456789
dw2str -123456789
jgnorecurse -123456789
dw2$X -123456789
--- ok ---
:smiley:
Hi,Lingo
Quotebut the jimg_qword2ascii is not right result. maybe i translated errorly.
1) i know that's why. we can't use the 01999,9999,9999,9999h(1/10),must use the 0ccc,cccc,cccc,cccdh(8*1/10).
otherwise, when a few zero in qword, the qword2ascii will be failed.
2) you use the "little end" to save the result. that eliminate the unnecessary codes.
3) {lea rdx,[rdx*4+rdx-18h]}, We really can't deliberate it. That's very nice.
i learned a lot skills from you.
Thank you very much!
regard
Thanks for that. I was just getting around to testing. So for now, back to the old way.
.DATA?
align 4
resultJim db 16 dup(?)
.CODE
alignProc
TestG_s:
db 2 dup (?) ; dummy spacer to align instructions best
dword2ascii: ; setup
test eax,eax
jnz @f
mov word ptr [edi],30h
ret
@@: jns @f
mov byte ptr [edi],'-'
neg eax
add edi, 1
@@: mov ecx,0CCCCCCCDh ; 8 * 1/10
push esi
mov esi,0
.repeat
push ebx ; save ebx, or a digit
inc esi
mov ebx,eax ; save original
mul ecx ; 0CCCCCCCDh =8 * 1/10
shr edx,3 ; take out the factor of 8
mov eax,edx ; save answer
lea edx,[edx*4+edx-24] ; *5 also subtract out half of '0' (24) to convert last byte to ascii
add edx,edx ; *10 double it, so full value of '0' is covered. so -48 extra to dl
sub ebx,edx ; subtract from original gives remainder digit -(-48) =+48=+'0' bl now in ascii
test eax,eax ; are we done?
.until zero?
.repeat
mov [edi],bl
inc edi
pop ebx ; restore ebx, or get next digit
dec esi
.until zero?
pop esi
ret
After a little testing, it turns out that two bits was enough, i.e. could use 66666667h to multiply, and shr 2, but no gain in speed so I'll leave it at 3 bits ( *8) to be compatible with everyone else.
Hi, jimg
Quoteuse 66666667h to multiply
This is a significant improvement.
Quote00000374 clock cycles, (lingo_i2a64)x1000, OutPut: 98765432109876 ;use 8*1/10
00000360 clock cycles, (lingo_i2a64+)x1000, OutPut: 98765432109876 ;use 4*1/10
Thank you very much!
regard
it shouldn't really make a difference as you have to shift anyway. And I tested only 32 bit numbers, may not be true for 64 bit.
Quote from: jimg on April 05, 2024, 04:30:50 AMAfter a little testing, it turns out that two bits was enough, i.e. could use 66666667h to multiply, and shr 2, but no gain in speed so I'll leave it at 3 bits ( *8) to be compatible with everyone else.
66666667h/2 and shr 1 would be a tick faster. The sh* reg32,
1 instructions are faster than shifts with a counter higher than one. Unfortunately, that limits the range of valid multiplies - try the numbers above 1,073,741,824 :cool:
Hi,jj2007
Quote66666667h/2 and shr 1 would be a tick faster.
:cool:
This ia an another improvement.
Could you test the speed of below codes?
Ldword2ascii proc uses ebx esi edi dwVaule:DWORD,pOutBuf:DWORD
local nFlag:BOOL
mov nFlag,FALSE
mov eax,dwVaule
mov edi,pOutBuf
lea esi,[edi+16]
test eax,eax
jnz @f
mov word ptr [esi],30h
jmp @exit
@@:
jns @f
mov nFlag,TRUE
neg eax
@@:
mov ecx,033333334h ; 2*(1/10)
.repeat
mov ebx,eax ; save original
mul ecx ; 033333334h = 2*(1/10)
shr edx,1 ; /2
mov edi,edx ; edx = remainder of eax/10
lea edx,[edx*4+edx-18h] ; *5 also subtract out half of '0' (24) to convert last byte to ascii ala lingo
lea edx,[edx+edx] ; *10 double it, so full value of '0' is covered. so -48 extra to dl
sub ebx,edx ; subtract from original gives remainder digit -(-48) =+48=+'0' bl now in ascii
sub esi,1
mov [esi],bl
mov eax,edi ; save answer
test eax,eax ; are we done?
.until zero?
.if nFlag == TRUE
sub esi, 1
mov byte ptr [esi],'-'
.endif
@exit:
mov eax,esi ;eax output addr in pOutBuf
ret
Ldword2ascii endp
local Outbuf[64] :BYTE
invoke RtlZeroMemory,addr Outbuf,sizeof Outbuf
invoke Ldword2ascii, -9076541,addr Outbuf
invoke MessageBox,0,eax,0,0
This is it for me. This is the all up proc. Overall, about a 10% improvement over dwtoa (dwtoa averaged 72 cycles, dword2ascii averaged 64 cycles).
align 16
dword2ascii proc uses esi edi, Value,buff
; converts 32 bit signed integer to ascii
; Value to be converted
; buff = address of where to store the results
mov eax,Value
mov edi,buff
test eax,eax
jnz @f
mov word ptr [edi],30h
ret
@@: jns @f
mov byte ptr [edi],'-'
neg eax
add edi, 1
@@: mov ecx,66666667h ; 4 * 1/10
push esi
xor esi,esi
.repeat
push ebx ; save ebx, or a digit
inc esi
mov ebx,eax ; save original
mul ecx ; 66666667h = 4 * 1/10
shr edx,2 ; take out the factor of 4
mov eax,edx ; save answer
add edx,edx ; *2
lea edx,[edx*4+edx-'0'] ; *10 also subtract out '0' to convert last byte to ascii
sub ebx,edx ; subtract from original gives remainder, digit - (-'0') = +'0' bl now in ascii
test eax,eax ; are we done?
.until zero?
.repeat
mov [edi],bl
inc edi
pop ebx ; restore ebx, or get next digit
dec esi
.until zero?
mov byte ptr [edi],0
pop esi
ret
dword2ascii endp
Hi six_L,
It is a new version which return length of the string in rcx.
; qword to string - Convert a qword into string
; IN: RAX = qword
; RCX = address of location to store string
; OUT: RAX = new start of the string in the buffer
RCX = length of the string
align 16
db 8 Dup (90h)
q2a64 proc
lea r9, [rcx+24] ; lpBuffer
mov r8, 0cccccccccccccccdh ; 8 * 1/10
mov byte ptr[r9],0
mov r10, r9
mov rcx, rax
@@:
mul r8
shr rdx, 3 ; : 8
sub r9, 1
lea rax, [rdx*4+rdx] ; * 5
neg rax
lea rcx, [rcx+2*rax+30h] ; * 2 +30h
mov rax, rdx
mov [r9],cl
mov rcx, rdx
test eax, eax
jnz @b
mov rcx, r10
mov rax, r9 ; return in RAX new starting address of tne string of the buffer
sub rcx, r9 ; return in RCX length of the string
ret
q2a64 endp
Note:
The code for the minus sign and leading zeros must be kept out of the algorithm.
It's not normal to bloat the algorithm with 99.99% unusable code.
I am curious when the stupid thief will come back...ha,ha :skrewy:
Hi,Lingo
This is the tested result.
Quote00005214 clock cycles, (Roberts_dqtoa)x10000, OutPut: 98765432109876
00004638 clock cycles, (Roberts_Lqword2ascii)x10000, OutPut: 98765432109876
00014990 clock cycles, (Ray_AnyToAny)x10000, OutPut: 98765432109876
00004458 clock cycles, (lingo_i2a64_1)x10000, OutPut: 98765432109876
00003845 clock cycles, (lingo_q2a64_2)x10000, OutPut: 98765432109876
the attachment is the tested exe.
Thank you six_L, :thumbsup:
Excellent results!
Quote from: jimg on April 04, 2024, 03:42:13 PMAgain, my apologies for not crediting the correct person.
Jim, as described here, you credited the right person (https://masm32.com/board/index.php?topic=11831.0). Btw I wouldn't call Lingo a thief just because he used one of my ideas, that's not my style. Actually, I feel flattered that our great genius, the greatest coder since Adam & Eve, uses my ideas.
Hi,All
Note to all:
If anyone wants to argue, hurry up and argue hotly. Actually, there will be a few time left. because our member NoCforMe forecasted "No planet, Zero bytes at 2030".
I quite approve of his view. Since the World Thief Biden forgot the UN's rule which the five permanent members of the UN cannot attack each other.
Good Luck to all in those remaining time.
Like I say, enjoy the ride while it lasts!
Thank you six_L,
I was raised not to argue with elderly people with mental and/or physical disabilities.
With specific regard to certain statements, suggestions and accusations made here recently, may I Politely and Respectfully refer you ALL to this:
A Reiteration of certain forum rules (https://masm32.com/board/index.php?topic=11836.0)
Thanks, Stoo :thup:
Back to the topic: may I have some timings, please?
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4603 cycles for dwtoa
2954 cycles for dw2str
2212 cycles for dw2$JJ
3248 cycles for jgnorecurse
2957 cycles for dw2$X
Quote from: jj2007 on April 07, 2024, 08:52:16 AMBack to the topic: may I have some timings, please?
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (SSE4)
3144 cycles for 100 * dwtoa
2625 cycles for 100 * dw2str
1474 cycles for 100 * dw2$JJ
2276 cycles for 100 * jgnorecurse
2810 cycles for 100 * dw2$X
3243 cycles for 100 * dwtoa
2634 cycles for 100 * dw2str
1514 cycles for 100 * dw2$JJ
2338 cycles for 100 * jgnorecurse
2880 cycles for 100 * dw2$X
3226 cycles for 100 * dwtoa
2704 cycles for 100 * dw2str
1605 cycles for 100 * dw2$JJ
2359 cycles for 100 * jgnorecurse
2861 cycles for 100 * dw2$X
3275 cycles for 100 * dwtoa
2686 cycles for 100 * dw2str
1528 cycles for 100 * dw2$JJ
2388 cycles for 100 * jgnorecurse
2882 cycles for 100 * dw2$X
Averages:
3234 cycles for dwtoa
2660 cycles for dw2str
1521 cycles for dw2$JJ
2348 cycles for jgnorecurse
2870 cycles for dw2$X
20 bytes for dwtoa
82 bytes for dw2str
94 bytes for dw2$JJ
76 bytes for jgnorecurse
106 bytes for dw2$X
dwtoa-123456789
dw2str-123456789
dw2$JJ-123456789
jgnorecurse-123456789
dw2$X-123456789
13th Gen Intel(R) Core(TM) i9-13900KF (SSE4)
2180 cycles for 100 * dwtoa
1748 cycles for 100 * dw2str
905 cycles for 100 * dw2$JJ
1798 cycles for 100 * jgnorecurse
2090 cycles for 100 * dw2$X
2227 cycles for 100 * dwtoa
1799 cycles for 100 * dw2str
995 cycles for 100 * dw2$JJ
1843 cycles for 100 * jgnorecurse
2088 cycles for 100 * dw2$X
2249 cycles for 100 * dwtoa
1796 cycles for 100 * dw2str
957 cycles for 100 * dw2$JJ
1815 cycles for 100 * jgnorecurse
2056 cycles for 100 * dw2$X
2241 cycles for 100 * dwtoa
1790 cycles for 100 * dw2str
958 cycles for 100 * dw2$JJ
1814 cycles for 100 * jgnorecurse
2061 cycles for 100 * dw2$X
Averages:
2234 cycles for dwtoa
1793 cycles for dw2str
958 cycles for dw2$JJ
1814 cycles for jgnorecurse
2074 cycles for dw2$X
20 bytes for dwtoa
82 bytes for dw2str
94 bytes for dw2$JJ
76 bytes for jgnorecurse
106 bytes for dw2$X
dwtoa -123456789
dw2str -123456789
dw2$JJ -123456789
jgnorecurse -123456789
dw2$X -123456789
And the winner is ... dw2$JJ by a length!
I'm sure this'll be great news to all coders who use this to display user output ... wow, I saved a whole 800 cycles!
Quote from: NoCforMe on April 07, 2024, 10:37:40 AMAnd the winner is ... dw2$JJ by a length!
Not yet. The algo works fine for 123456789, less so for smaller numbers. Will be fixed tomorrow ;-)
Quote from: NoCforMe on April 07, 2024, 10:37:40 AMAnd the winner is ... dw2$JJ by a length!
I'm sure this'll be great news to all coders who use this to display user output ... wow, I saved a whole 800 cycles!
For some of us that aren't just hobbyists, having a user sitting twiddling their thumbs is not a good look.
After all, this is The Laboratory
QuoteThe Laboratory
Algorithm and code design research laboratory. This is the place to post assembler algorithms and code design for discussion, optimisation and any other improvements that can be made on it. Post code here to be beaten to death to make it better, smaller, faster or more powerful.
Maybe you should hang around The Campus :biggrin:
The show must go on :thumbsup:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4761 cycles for dwtoa
3044 cycles for dw2str
2122 cycles for MbDw2Str <<<<<<<<<<<<<<<<<<<< tested for the full range -1...0
3364 cycles for jgnorecurse
2948 cycles for dw2$X
The new algo uses a table, as suggested initially by Ray alias ahsat :thumbsup:
Quote from: jj2007 on April 07, 2024, 08:52:16 AMThanks, Stoo :thup:
Back to the topic: may I have some timings, please?
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4603 cycles for dwtoa
2954 cycles for dw2str
2212 cycles for dw2$JJ
3248 cycles for jgnorecurse
2957 cycles for dw2$X
13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)
1794 cycles for 100 * dwtoa
1416 cycles for 100 * dw2str
751 cycles for 100 * dw2$JJ
1442 cycles for 100 * jgnorecurse
1707 cycles for 100 * dw2$X
1845 cycles for 100 * dwtoa
1464 cycles for 100 * dw2str
793 cycles for 100 * dw2$JJ
1496 cycles for 100 * jgnorecurse
1710 cycles for 100 * dw2$X
1839 cycles for 100 * dwtoa
1466 cycles for 100 * dw2str
806 cycles for 100 * dw2$JJ
1509 cycles for 100 * jgnorecurse
1686 cycles for 100 * dw2$X
1837 cycles for 100 * dwtoa
1466 cycles for 100 * dw2str
799 cycles for 100 * dw2$JJ
1612 cycles for 100 * jgnorecurse
1693 cycles for 100 * dw2$X
Averages:
1838 cycles for dwtoa
1465 cycles for dw2str
796 cycles for dw2$JJ
1502 cycles for jgnorecurse
1700 cycles for dw2$X
20 bytes for dwtoa
82 bytes for dw2str
94 bytes for dw2$JJ
76 bytes for jgnorecurse
106 bytes for dw2$X
dwtoa -123456789
dw2str -123456789
dw2$JJ -123456789
jgnorecurse -123456789
dw2$X -123456789
--- ok ---
Quote from: jj2007 on April 07, 2024, 07:55:18 PMThe show must go on :thumbsup:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4761 cycles for dwtoa
3044 cycles for dw2str
2122 cycles for MbDw2Str <<<<<<<<<<<<<<<<<<<< tested for the full range -1...0
3364 cycles for jgnorecurse
2948 cycles for dw2$X
The new algo uses a table, as suggested initially by Ray alias ahsat :thumbsup:
13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)
1782 cycles for 100 * dwtoa
1383 cycles for 100 * dw2str
997 cycles for 100 * MbDw2Str
1400 cycles for 100 * jgnorecurse
1650 cycles for 100 * dw2$X
1830 cycles for 100 * dwtoa
1442 cycles for 100 * dw2str
1052 cycles for 100 * MbDw2Str
1440 cycles for 100 * jgnorecurse
1657 cycles for 100 * dw2$X
1846 cycles for 100 * dwtoa
1442 cycles for 100 * dw2str
1038 cycles for 100 * MbDw2Str
1447 cycles for 100 * jgnorecurse
1650 cycles for 100 * dw2$X
1833 cycles for 100 * dwtoa
1446 cycles for 100 * dw2str
1000 cycles for 100 * MbDw2Str
1437 cycles for 100 * jgnorecurse
1647 cycles for 100 * dw2$X
Averages:
1832 cycles for dwtoa
1442 cycles for dw2str
1019 cycles for MbDw2Str
1438 cycles for jgnorecurse
1650 cycles for dw2$X
20 bytes for dwtoa
82 bytes for dw2str
102 bytes for MbDw2Str
76 bytes for jgnorecurse
106 bytes for dw2$X
dwtoa 123456789
dw2str 123456789
MbDw2Str 123456789
jgnorecurse 123456789
dw2$X 123456789
--- ok ---
Quote from: LiaoMi on April 07, 2024, 08:06:59 PM13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)
Thanks, LiaoMi - so even on Intel MbDw2Str performs well :thumbsup:
Special edition for JimG - I had activated an older version of his algo. Here is the correct one :thup:
Sorry for that :cool:
56.7% faster than dwtoa is ok, right?
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
4618 cycles for dwtoa
2938 cycles for dw2str
2000 cycles for MbDw2Str
3972 cycles for jgnorecurse
2968 cycles for dw2$X
20 bytes for dwtoa
82 bytes for dw2str
94 bytes for MbDw2Str
104 bytes for jgnorecurse
106 bytes for dw2$X
dwtoa -123456789
dw2str -123456789
MbDw2Str -123456789
jgnorecurse -123456789
dw2$X -123456789
Hi,Lingo
In my memory, jj2007 always loses in the speed racing. But this time, He eventually won once.
Quote00004502 clock cycles, (Roberts_dqtoa)x10000, OutPut: 98765432109876
00004291 clock cycles, (Roberts_Lqword2ascii)x10000, OutPut: 98765432109876
00014634 clock cycles, (Ray_AnyToAny)x10000, OutPut: 98765432109876
00004312 clock cycles, (Lingo_i2a64_1)x10000, OutPut: 98765432109876
00003799 clock cycles, (Lingo_q2a64_2)x10000, OutPut: 98765432109876
00002310 clock cycles, (jj2007_MbDq2Str)x10000, OutPut: 98765432109876
His method:
1) div 100, handle 2 digits at once.
2) query the table includes 100dw digits
; init:
dw2aTable dw 100 dup (0)
xor rbx,rbx
lea rdi,dw2aTable
.repeat
invoke RtlZeroMemory,addr tmp,sizeof tmp
invoke wsprintf,addr tmp,CStr("%02i"),rbx
lea rcx,tmp
mov ax,[rcx]
mov [rdi+2*rbx],ax
inc rbx
.until rbx == 100
...
MbDq2Str proc uses rsi rdi rbx dqValue:QWORD,OutBuffer:QWORD
mov rax, dqValue
;mov r8, rax ;for sign
lea rsi, dw2aTable
mov rdi, OutBuffer
lea rdi, [rdi+24]
test rax, rax
jnz @F
mov word ptr [rdi],30h
jmp @Exit
;@@:
;jns @F
;neg rax
@@:
mov byte ptr [rdi], 0 ; terminate the string
dec rdi
mov rcx, 051eb851eb851eb85h ; 051EB851EB851EB85h=32*1/100=4*(0cccccccccccccccdh=8*1/10)/10
.while rax > 0
xor rdx, rdx
mov rbx, rax ; number
mul rcx ; *32x1/100
shr rdx, 5 ; /32
mov rax, rdx
imul rdx, rdx, 100
sub rbx, rdx
movzx rdx, word ptr [rsi+2*rbx] ;00-99,word table
sub rdi, 2
mov [rdi], dx
.endw
.if rbx < 10
mov byte ptr [rdi], 0 ;figure the "0" in front number out
inc rdi
.endif
;test r8, r8
;.if Sign?
; dec rdi
; mov byte ptr [rdi], "-"
;.endif
@Exit:
xchg rax, rdi
ret
MbDq2Str endp
Edit: + exe attachment
regard
Thank you six_L, :thumbsup:
Do you include the time to complete the table?
Can I have your test?
I want to run your test on my computer.
Hi, six_L
Have 'moved' / 'merged' your post into the requested Dword to ascii thread in The Laboratory + the attachment.
Admin'
:thumbsup:
I don't really like 64-bit code, but I guess there is enough interest for a 64-bit version:
843 ms for 50Mio*MbDw2Str result $rdi 1234567890123456789
2172 ms for 100Mio*Roberts result $rdi 1234567890123456789
15406 ms for 100Mio*sprintf result $rdi 1234567890123456789
"Roberts" is the code from reply #82 (https://masm32.com/board/index.php?topic=11820.msg128574#msg128574). Unfortunately, it doesn't handle negative numbers, which distorts the numbers a bit (handling negative numbers costs some cycles).
Hi,Lingo
QuoteDo you include the time to complete the table?
No, the digit's table is outside the MbDq2Str proc. it is being created at the testing thread startup.
Hi,jj2007
QuoteI don't really like 64-bit code.
the 32bit codes are working in VM of 64bit OS. Perhaps your tested result was thingemyed.
Thank you
six_L,
I don't believe there is a masochist who would fill in the table before using this "algo"...?! :badgrin:
It's 32 bit slow and bloated garbage...
Any 64 bit code will be shorter and faster.
You can try the same with
20 rows less code:
align 16
db 8 dup(90h)
Dq2Str proc
lea r9, dw2aTable
mov r10, rax
mov r8, 51EB851EB851EB85h
add rcx, 18h
@@:
mul r8
shr rdx, 5 ; :32
sub rcx, 2
mov rax, rdx
imul rdx, 64h ; *100
sub r10, rdx
mov dx, [r9+r10*2] ; r9 = addr dw2aTable
mov r10, rax
mov [rcx], dx
cmp rax, 0
ja @b
cmp rax, 0
ja @f
cmp rbx, 0Ah
jnb @Exit
@@:
mov byte ptr [rcx], 0
inc rcx
@Exit:
mov rax, rcx
ret ; minus 20 rows
Dq2Str endp
QuoteHis method:
1) div 100, handle 2 digits at once.
2) query the table includes 100dw digits
Next version will:
1) div 1000, handle 4 digits at once.
2) query the table includes ?? dd digits...ha,ha,hah :badgrin: :badgrin: :skrewy:
QuotePerhaps your tested result was thingemyed.
Don't believe his manipulated "tests". The reason for that is that there is no one from this forum who can compile his test "sources" normally... :undecided:
Quote from: lingo on April 10, 2024, 04:42:10 PMThe reason for that is that there is no one from this forum who can compile his test "sources" normally... :undecided:
Some of us can even understand the syntax :biggrin:
Quote from: six_L on April 10, 2024, 02:07:09 PMthe 32bit codes are working in VM of 64bit OS.
You should try to understand how 32-bit code runs on a 64-bit OS.
QuotePerhaps your tested result was thingemyed.
All benchmarks have little problems. Mine have been tested hundreds of times in the Lab, and many members of this forum have contributed to make them reliable. Btw most of us here understand that including the time to build the table has no measurable effect on a test that runs with a Million iterations. We know our stuff here in the Lab.
0 µs for building the tableSometimes it's even
one microsecond :bgrin:
Quote from: sinsi on April 10, 2024, 05:51:54 PMSome of us can even understand the syntax :biggrin:
Right :biggrin:
And some even know how to open a Rich Text Format file, and happily build my sources:
Quote from: jimg on April 02, 2024, 07:43:17 AMUsing your version 3 and enabling test F I get
Quote from: lingo on April 10, 2024, 04:42:10 PMYou can try the same with 20 rows less code:
align 16
db 8 dup(90h)
Dq2Str proc
lea r9, dw2aTable
mov r10, rax
mov r8, 51EB851EB851EB85h
add rcx, 18h
@@:
mul r8
To those who think that "Lingo" is just an impostor:
No, he is the real Lingo. His code is badly copied, and of course, it crashes (he also forgets to handle negative numbers, but that's only minor criticism).
Hi,Lingo
1)
You have surpassed jj2007 again.
Quote00004451 clock cycles, (Roberts_dqtoa)x10000, OutPut: 98765432109876
00004603 clock cycles, (Roberts_Lqword2ascii)x10000, OutPut: 98765432109876
00014591 clock cycles, (Ray_AnyToAny)x10000, OutPut: 98765432109876
00004130 clock cycles, (Lingo_i2a64_1)x10000, OutPut: 98765432109876
00004036 clock cycles, (Lingo_q2a64_2)x10000, OutPut: 98765432109876
00002609 clock cycles, (jj2007_MbDq2Str)x10000, OutPut: 98765432109876
00002426 clock cycles, (Lingo3_Dq2Str)x10000, OutPut: 98765432109876
No the best codes, Only has the better codes. I hope the competition continues.
2)
; init:
dw2aTable dw 100 dup (0)
xor rbx,rbx
lea rdi,dw2aTable
.repeat
invoke RtlZeroMemory,addr tmp,sizeof tmp
invoke wsprintf,addr tmp,CStr("%02i"),rbx
lea rcx,tmp
mov ax,[rcx]
mov [rdi+2*rbx],ax
inc rbx
.until rbx == 100
...
QuoteI don't believe there is a masochist who would fill in the table before using this "algo"...?! :badgrin:
That's my codes for read easiely and verifying the correctness of his logic. You should blame me, not him.
regard.
Hi
six_L, :thumbsup:
QuoteYou should blame me, not him.
I cannot and do not blame anyone directly, neither you nor
jj.
I only expressed my opinion about the algorithm and the code of the program.
I wonder why
jj is interrupting
OUR conversation and acting in such a nervous manner.
QuoteI hope the competition continues.
After these insults from
jj, I lost the interest, sorry! :sad:
The latest version. I had to tweak the code from our genius a little bit because a) it crashed and b) it overwrote the table, but now it works :thumbsup:
And indeed, Lingo's code is faster than mine :thumbsup: :thumbsup: :thumbsup:
I feel honoured that he used almost all my tricks :cool:
There are tiny little problems here and there, but hey, it's fast :thumbsup:
This program was assembled with UAsm64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
41 bytes for loadTable
140 bytes for MbDw2Str
-1
0
1
-1234567890123456789
1234567890123456789
Lingo: a) no negative numbers, b) leading zero, c) but only for numbers with odd number of digits
938 ms for 50Mio*MbDw2Str result $rdi 1234567890123456789
2187 ms for 50Mio*Roberts result $rdi 1234567890123456789
797 ms for 50Mio*Lingo's result $rdi 01234567890123456789
1532 ms for 5 Mio*sprintf result $rdi 1234567890123456789
718 ms for 50Mio*MbDw2Str result $rdi 123456789012345678
2032 ms for 50Mio*Roberts result $rdi 123456789012345678
687 ms for 50Mio*Lingo's result $rdi 123456789012345678
1484 ms for 5 Mio*sprintf result $rdi 123456789012345678
719 ms for 50Mio*MbDw2Str result $rdi 12345678901234567
1953 ms for 50Mio*Roberts result $rdi 12345678901234567
672 ms for 50Mio*Lingo's result $rdi 012345678901234567
1250 ms for 5 Mio*sprintf result $rdi 12345678901234567
906 ms for 50Mio*MbDw2Str result $rdi -1234567890123456789
2297 ms for 50Mio*Roberts result $rdi 17212176183586094827
766 ms for 50Mio*Lingo's result $rdi 17212176183586094827
1531 ms for 5 Mio*sprintf result $rdi -1234567890123456789
Quote from: jj2007 on April 11, 2024, 05:25:22 AMThe latest version. I had to tweak the code from our genius a little bit because a) it crashed and b) it overwrote the table, but now it works :thumbsup:
And indeed, Lingo's code is faster than mine :thumbsup: :thumbsup: :thumbsup:
I feel honoured that he used almost all my tricks :cool:
There are tiny little problems here and there, but hey, it's fast :thumbsup:
This program was assembled with UAsm64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
41 bytes for loadTable
140 bytes for MbDw2Str
-1
0
1
-1234567890123456789
1234567890123456789
Lingo: a) no negative numbers, b) leading zero, c) but only for numbers with odd number of digits
938 ms for 50Mio*MbDw2Str result $rdi 1234567890123456789
2187 ms for 50Mio*Roberts result $rdi 1234567890123456789
797 ms for 50Mio*Lingo's result $rdi 01234567890123456789
1532 ms for 5 Mio*sprintf result $rdi 1234567890123456789
718 ms for 50Mio*MbDw2Str result $rdi 123456789012345678
2032 ms for 50Mio*Roberts result $rdi 123456789012345678
687 ms for 50Mio*Lingo's result $rdi 123456789012345678
1484 ms for 5 Mio*sprintf result $rdi 123456789012345678
719 ms for 50Mio*MbDw2Str result $rdi 12345678901234567
1953 ms for 50Mio*Roberts result $rdi 12345678901234567
672 ms for 50Mio*Lingo's result $rdi 012345678901234567
1250 ms for 5 Mio*sprintf result $rdi 12345678901234567
906 ms for 50Mio*MbDw2Str result $rdi -1234567890123456789
2297 ms for 50Mio*Roberts result $rdi 17212176183586094827
766 ms for 50Mio*Lingo's result $rdi 17212176183586094827
1531 ms for 5 Mio*sprintf result $rdi -1234567890123456789
This program was assembled with UAsm64 in 64-bit format.
13th Gen Intel(R) Core(TM) i9-13980HX
41 bytes for loadTable
140 bytes for MbDw2Str
-1
0
1
-1234567890123456789
1234567890123456789
Lingo: a) no negative numbers, b) leading zero, c) but only for numbers with odd number of digits
313 ms for 50Mio*MbDw2Str result $rdi 1234567890123456789
734 ms for 50Mio*Roberts result $rdi 1234567890123456789
250 ms for 50Mio*Lingo's result $rdi 01234567890123456789
438 ms for 5 Mio*sprintf result $rdi 1234567890123456789
281 ms for 50Mio*MbDw2Str result $rdi 123456789012345678
687 ms for 50Mio*Roberts result $rdi 123456789012345678
219 ms for 50Mio*Lingo's result $rdi 123456789012345678
406 ms for 5 Mio*sprintf result $rdi 123456789012345678
297 ms for 50Mio*MbDw2Str result $rdi 12345678901234567
641 ms for 50Mio*Roberts result $rdi 12345678901234567
219 ms for 50Mio*Lingo's result $rdi 012345678901234567
391 ms for 5 Mio*sprintf result $rdi 12345678901234567
484 ms for 50Mio*MbDw2Str result $rdi -1234567890123456789
765 ms for 50Mio*Roberts result $rdi 17212176183586094827
250 ms for 50Mio*Lingo's result $rdi 17212176183586094827
438 ms for 5 Mio*sprintf result $rdi -1234567890123456789
Thanks, LiaoMi :thup:
Your cpu beats mine by almost a factor three. That is a fine machine - I could become a bit envious :mrgreen:
I attach sources and executables for both 64- and 32-bit versions. Note it's the same source ("dual assembly"), there is just one option OPT_64 n on top, with n=0 or 1. My MbDw2Str has only two tiny conditionals on entry:
if @64
mov rax, rcx ; xwValue
else
mov eax, xwValue
endif
...
if @64
mov rcx, 051EB851EB851EB85h
else
mov ecx, 051EB851Fh
endif
CRT sprintf requires a different format specifier:
mov rcx, number
if @64
jinvoke sprintf, rdi, Chr$("%lld"), rcx
else
jinvoke sprintf, rdi, Chr$("%d"), rcx
endif
The Roberts and Lingo versions cannot be converted to 32-bit because of their register usage. Latest timings:
a) 64-bit:
This program was assembled with UAsm64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
41 bytes for loadTable
111 bytes for MbDw2Str
-1 minus 1
0 zero
1 one
Lingo: a) no negative numbers, b) leading zero, c) but only for numbers with odd number of digits
828 ms for 50Mio*MbDw2Str result $rdi 1234567890123456789
2156 ms for 50Mio*Roberts result $rdi 1234567890123456789
782 ms for 50Mio*Lingo's result $rdi 01234567890123456789
1500 ms for 5 Mio*sprintf result $rdi 1234567890123456789
687 ms for 50Mio*MbDw2Str result $rdi 123456789012345678
1984 ms for 50Mio*Roberts result $rdi 123456789012345678
688 ms for 50Mio*Lingo's result $rdi 123456789012345678
1453 ms for 5 Mio*sprintf result $rdi 123456789012345678
703 ms for 50Mio*MbDw2Str result $rdi 12345678901234567
1953 ms for 50Mio*Roberts result $rdi 12345678901234567
672 ms for 50Mio*Lingo's result $rdi 012345678901234567
1235 ms for 5 Mio*sprintf result $rdi 12345678901234567
843 ms for 50Mio*MbDw2Str result $rdi -1234567890123456789
2219 ms for 50Mio*Roberts result $rdi 17212176183586094827
766 ms for 50Mio*Lingo's result $rdi 17212176183586094827
1531 ms for 5 Mio*sprintf result $rdi -1234567890123456789
b) 32-bit:
This program was assembled with UAsm64 in 32-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
37 bytes for loadTable
86 bytes for MbDw2Str
-1 minus 1
0 zero
1 one
469 ms for 50Mio*MbDw2Str result $rdi 123456789
1078 ms for 5 Mio*sprintf result $rdi 123456789
328 ms for 50Mio*MbDw2Str result $rdi 12345678
1016 ms for 5 Mio*sprintf result $rdi 12345678
375 ms for 50Mio*MbDw2Str result $rdi 1234567
765 ms for 5 Mio*sprintf result $rdi 1234567
469 ms for 50Mio*MbDw2Str result $rdi -123456789
1063 ms for 5 Mio*sprintf result $rdi -123456789
Hi,Lingo
QuoteAfter these insults from jj, I lost the interest, sorry! :sad:
This is a scene that most of our members do not want to see. I respect your choice. It's you who exhibited us the power of assembly language and the dazzling skills.
jj2007 not only argued with you, but also with many members. such as hutch--/aw27/nidud/_japheth. Some outstanding members had left forever because of arguing. Arguing is his inherent attribute. Human beings need the biological diversity. i guess our forum also needs member diversity. you may ignore him.
regard
Quote from: six_L on April 11, 2024, 10:52:13 PMjj2007 not only argued with you, but also with many members. such as hutch--/aw27/nidud/_japheth.
Stop this nonsense, dear. Yes, I argued with some of them, but it was always civilised. With Hutch, I had a very good relationship.
Lingo comes here out of the blue, never supplied a valid email address to our administrators, wants to take over the forum, posts crappy code, most of which (badly) copied from mine, and then has the chuzpe to call
me a thief.
This is all documented in Bravo, Lingo! (https://masm32.com/board/index.php?topic=11831.msg128620#msg128620)
I have no problem with Lingo
using my code, although just using other registers to hide that sort of "reuse" is a bit fishy imho. I
do have a problem, though, if he calls me a thief.
If I were the moderator, I would have kicked him out a long time ago.
Many of us hoped, that this site's old style vanish, but old dogs are, what they are.
I am only here, as i use Pelle's C and i need to know something about assembly, when i try to hunt bugs.
Hopefully this site users still have an open mind for programming.
I've added the fast algo to the MasmBasic library of 12 April 2024 (http://masm32.com/board/index.php?topic=94.0), in two flavours:
a) as Str$(number):
Print Str$("Rect left/top/right/bottom=%i", [esi.RECT.left]), "/", Str$(MyRC.top), "/", Str$(MyRC.right), "/", Str$([esi.RECT.bottom])
Flexibel, and a tick faster than the Masm32 SDK str$().
b) invoke MbDw2Str, number: ultrafast but no concatenation as in a).
I'll try this Again
lingo
Can you Please Read and Respond to the PM's I have sent.
Thanking you in advance,
Stewart
Testing the laptop. As predicted elsewhere, not very high performance. Not even medium-performance.
64 bit
This program was assembled with UAsm64 in 64-bit format.
AMD A8-6410 APU with AMD Radeon R5 Graphics
41 bytes for loadTable
111 bytes for MbDw2Str
-1 minus 1
0 zero
1 one
Lingo: a) no negative numbers, b) leading zero, c) but only for numbers with odd
number of digits
2762 ms for 50Mio*MbDw2Str result $rdi 1234567890123456789
5382 ms for 50Mio*Roberts result $rdi 1234567890123456789
2558 ms for 50Mio*Lingo's result $rdi 01234567890123456789
2449 ms for 5 Mio*sprintf result $rdi 1234567890123456789
2528 ms for 50Mio*MbDw2Str result $rdi 123456789012345678
4492 ms for 50Mio*Roberts result $rdi 123456789012345678
2309 ms for 50Mio*Lingo's result $rdi 123456789012345678
2356 ms for 5 Mio*sprintf result $rdi 123456789012345678
2558 ms for 50Mio*MbDw2Str result $rdi 12345678901234567
5008 ms for 50Mio*Roberts result $rdi 12345678901234567
2309 ms for 50Mio*Lingo's result $rdi 012345678901234567
2230 ms for 5 Mio*sprintf result $rdi 12345678901234567
2793 ms for 50Mio*MbDw2Str result $rdi -1234567890123456789
4852 ms for 50Mio*Roberts result $rdi 17212176183586094827
2558 ms for 50Mio*Lingo's result $rdi 17212176183586094827
2558 ms for 5 Mio*sprintf result $rdi -1234567890123456789
32 bit
This program was assembled with UAsm64 in 32-bit format.
AMD A8-6410 APU with AMD Radeon R5 Graphics
37 bytes for loadTable
86 bytes for MbDw2Str
-1 minus 1
0 zero
1 one
1092 ms for 50Mio*MbDw2Str result $rdi 123456789
1779 ms for 5 Mio*sprintf result $rdi 123456789
936 ms for 50Mio*MbDw2Str result $rdi 12345678
1622 ms for 5 Mio*sprintf result $rdi 12345678
936 ms for 50Mio*MbDw2Str result $rdi 1234567
1482 ms for 5 Mio*sprintf result $rdi 1234567
1139 ms for 50Mio*MbDw2Str result $rdi -123456789
1841 ms for 5 Mio*sprintf result $rdi -123456789
On the bright side, it does have 4 cores... :undecided:
With old PC
This program was assembled with UAsm64 in 32-bit format.
AMD Athlon(tm) II X2 220 Processor
37 bytes for loadTable
86 bytes for MbDw2Str
-1 minus 1
0 zero
1 one
967 ms for 50Mio*MbDw2Str result $rdi 123456789
1903 ms for 5 Mio*sprintf result $rdi 123456789
874 ms for 50Mio*MbDw2Str result $rdi 12345678
1638 ms for 5 Mio*sprintf result $rdi 12345678
780 ms for 50Mio*MbDw2Str result $rdi 1234567
1357 ms for 5 Mio*sprintf result $rdi 1234567
1092 ms for 50Mio*MbDw2Str result $rdi -123456789
1763 ms for 5 Mio*sprintf result $rdi -123456789
This program was assembled with UAsm64 in 64-bit format.
AMD Athlon(tm) II X2 220 Processor
41 bytes for loadTable
111 bytes for MbDw2Str
-1 minus 1
0 zero
1 one
Lingo: a) no negative numbers, b) leading zero, c) but only for numbers with odd
number of digits
2075 ms for 50Mio*MbDw2Str result $rdi 1234567890123456789
4040 ms for 50Mio*Roberts result $rdi 1234567890123456789
1841 ms for 50Mio*Lingo's result $rdi 01234567890123456789
2792 ms for 5 Mio*sprintf result $rdi 1234567890123456789
1841 ms for 50Mio*MbDw2Str result $rdi 123456789012345678
3760 ms for 50Mio*Roberts result $rdi 123456789012345678
1669 ms for 50Mio*Lingo's result $rdi 123456789012345678
2590 ms for 5 Mio*sprintf result $rdi 123456789012345678
1840 ms for 50Mio*MbDw2Str result $rdi 12345678901234567
3604 ms for 50Mio*Roberts result $rdi 12345678901234567
1654 ms for 50Mio*Lingo's result $rdi 012345678901234567
2433 ms for 5 Mio*sprintf result $rdi 12345678901234567
2091 ms for 50Mio*MbDw2Str result $rdi -1234567890123456789
4181 ms for 50Mio*Roberts result $rdi 17212176183586094827
1856 ms for 50Mio*Lingo's result $rdi 17212176183586094827
2824 ms for 5 Mio*sprintf result $rdi -1234567890123456789