Dword to ascii (dw2a, dwtoa, dw2str, Str$, ...)

LiaoMi · April 07, 2024, 08:04:51 PM

Quote from: jj2007 on April 07, 2024, 08:52:16 AM
Thanks, Stoo

Back to the topic: may I have some timings, please?

AMD Athlon Gold 3150U with Radeon Graphics (SSE4)

Averages:
4603    cycles for dwtoa
2954    cycles for dw2str
2212    cycles for dw2$JJ
3248    cycles for jgnorecurse
2957    cycles for dw2$X

13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)

1794    cycles for 100 * dwtoa
1416    cycles for 100 * dw2str
751     cycles for 100 * dw2$JJ
1442    cycles for 100 * jgnorecurse
1707    cycles for 100 * dw2$X

1845    cycles for 100 * dwtoa
1464    cycles for 100 * dw2str
793     cycles for 100 * dw2$JJ
1496    cycles for 100 * jgnorecurse
1710    cycles for 100 * dw2$X

1839    cycles for 100 * dwtoa
1466    cycles for 100 * dw2str
806     cycles for 100 * dw2$JJ
1509    cycles for 100 * jgnorecurse
1686    cycles for 100 * dw2$X

1837    cycles for 100 * dwtoa
1466    cycles for 100 * dw2str
799     cycles for 100 * dw2$JJ
1612    cycles for 100 * jgnorecurse
1693    cycles for 100 * dw2$X

Averages:
1838    cycles for dwtoa
1465    cycles for dw2str
796     cycles for dw2$JJ
1502    cycles for jgnorecurse
1700    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
94      bytes for dw2$JJ
76      bytes for jgnorecurse
106     bytes for dw2$X

dwtoa                                   -123456789
dw2str                                  -123456789
dw2$JJ                                  -123456789
jgnorecurse                             -123456789
dw2$X                                   -123456789

--- ok ---

LiaoMi · April 07, 2024, 08:06:59 PM

Quote from: jj2007 on April 07, 2024, 07:55:18 PM
The show must go on

AMD Athlon Gold 3150U with Radeon Graphics (SSE4)

Averages:
4761    cycles for dwtoa
3044    cycles for dw2str
2122    cycles for MbDw2Str     <<<<<<<<<<<<<<<<<<<< tested for the full range -1...0
3364    cycles for jgnorecurse
2948    cycles for dw2$X
The new algo uses a table, as suggested initially by Ray alias ahsat

13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)

1782    cycles for 100 * dwtoa
1383    cycles for 100 * dw2str
997     cycles for 100 * MbDw2Str
1400    cycles for 100 * jgnorecurse
1650    cycles for 100 * dw2$X

1830    cycles for 100 * dwtoa
1442    cycles for 100 * dw2str
1052    cycles for 100 * MbDw2Str
1440    cycles for 100 * jgnorecurse
1657    cycles for 100 * dw2$X

1846    cycles for 100 * dwtoa
1442    cycles for 100 * dw2str
1038    cycles for 100 * MbDw2Str
1447    cycles for 100 * jgnorecurse
1650    cycles for 100 * dw2$X

1833    cycles for 100 * dwtoa
1446    cycles for 100 * dw2str
1000    cycles for 100 * MbDw2Str
1437    cycles for 100 * jgnorecurse
1647    cycles for 100 * dw2$X

Averages:
1832    cycles for dwtoa
1442    cycles for dw2str
1019    cycles for MbDw2Str
1438    cycles for jgnorecurse
1650    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
102     bytes for MbDw2Str
76      bytes for jgnorecurse
106     bytes for dw2$X

dwtoa                                   123456789
dw2str                                  123456789
MbDw2Str                                123456789
jgnorecurse                             123456789
dw2$X                                   123456789

--- ok ---

jj2007 · April 07, 2024, 08:12:53 PM

Quote from: LiaoMi on April 07, 2024, 08:06:59 PM
13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)

Thanks, LiaoMi - so even on Intel MbDw2Str performs well

jj2007 · April 08, 2024, 02:55:33 AM

Special edition for JimG - I had activated an older version of his algo. Here is the correct one

Sorry for that

jj2007 · April 08, 2024, 07:33:28 AM

56.7% fewer cycles than dwtoa is ok, right?

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

Averages:
4618    cycles for dwtoa
2938    cycles for dw2str
2000    cycles for MbDw2Str
3972    cycles for jgnorecurse
2968    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
94      bytes for MbDw2Str
104     bytes for jgnorecurse
106     bytes for dw2$X

dwtoa                                   -123456789
dw2str                                  -123456789
MbDw2Str                                -123456789
jgnorecurse                             -123456789
dw2$X                                   -123456789

six_L · April 09, 2024, 07:37:47 PM

Hi, Lingo
As I remember it, jj2007 always loses the speed races. But this time he finally won one.

Quote
00004502 clock cycles, (Roberts_dqtoa)x10000, OutPut: 98765432109876
00004291 clock cycles, (Roberts_Lqword2ascii)x10000, OutPut: 98765432109876
00014634 clock cycles, (Ray_AnyToAny)x10000, OutPut: 98765432109876
00004312 clock cycles, (Lingo_i2a64_1)x10000, OutPut: 98765432109876
00003799 clock cycles, (Lingo_q2a64_2)x10000, OutPut: 98765432109876
00002310 clock cycles, (jj2007_MbDq2Str)x10000, OutPut: 98765432109876

His method:
1) Divide by 100, handling 2 digits at once.
2) Look each digit pair up in a table of 100 words (dw), one per pair "00" to "99".

 ; init:
    dw2aTable    dw 100 dup (0)

    xor    rbx,rbx
    lea    rdi,dw2aTable
    .repeat
        invoke    RtlZeroMemory,addr tmp,sizeof tmp
        invoke    wsprintf,addr tmp,CStr("%02i"),rbx
        lea    rcx,tmp
        mov    ax,[rcx]
        mov    [rdi+2*rbx],ax
        inc    rbx
    .until rbx == 100
...
MbDq2Str proc uses rsi rdi rbx dqValue:QWORD,OutBuffer:QWORD

    mov    rax, dqValue
    ;mov    r8,  rax    ;for sign
    lea    rsi, dw2aTable
    mov    rdi, OutBuffer
    lea    rdi, [rdi+24]
    test    rax, rax
    jnz    @F
    mov    word ptr [rdi],30h
    jmp    @Exit
;@@:    
    ;jns    @F
    ;neg    rax
@@:    
    mov    byte ptr [rdi], 0            ; terminate the string
    dec    rdi
    mov    rcx, 051eb851eb851eb85h            ; 51EB851EB851EB85h = floor(2^69/100), i.e. 32/100 scaled by 2^64
    .while rax > 0
        xor    rdx, rdx
        mov    rbx, rax            ; number
        mul    rcx                ; *32x1/100
        shr    rdx, 5                ; /32
        mov    rax, rdx
        imul    rdx, rdx, 100
        sub    rbx, rdx
        movzx    rdx, word ptr [rsi+2*rbx]    ;00-99,word table
        sub    rdi, 2
        mov    [rdi], dx
    .endw
    .if    rbx < 10
        mov byte ptr [rdi], 0            ; top pair is a single digit: drop its leading "0"
        inc rdi
    .endif
    ;test    r8, r8
    ;.if Sign?
    ;    dec    rdi
    ;    mov    byte ptr [rdi], "-"
    ;.endif
@Exit:    
    xchg    rax, rdi
    ret
MbDq2Str endp
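
For readers who prefer C, the two steps listed under "His method" (divide by 100, then look the remainder up in a 100-entry table of digit pairs) can be sketched roughly as follows. This is only an illustration, not six_L's or jj2007's code: the names `pair_table`, `init_pair_table` and `u64_to_str` are made up here, the division is left to the compiler, and the last one or two digits are handled separately instead of trimming a leading "0" afterwards.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is not the forum code. */

/* 100 two-character entries: "00", "01", ..., "99". */
static char pair_table[100][2];

/* Fill the lookup table once, before any conversion. */
static void init_pair_table(void)
{
    for (int i = 0; i < 100; i++) {
        pair_table[i][0] = (char)('0' + i / 10);
        pair_table[i][1] = (char)('0' + i % 10);
    }
}

/*
 * Convert an unsigned 64-bit value to decimal.
 * buf must hold at least 21 bytes (20 digits + NUL).
 * Digits are written right to left; the return value points
 * at the first digit inside buf.
 */
static char *u64_to_str(uint64_t value, char *buf)
{
    char *p = buf + 20;
    *p = '\0';

    /* Peel off two digits per iteration while at least three remain. */
    while (value >= 100) {
        uint64_t q = value / 100;               /* quotient for the next round */
        unsigned r = (unsigned)(value - q * 100); /* remainder 0..99 */
        p -= 2;
        memcpy(p, pair_table[r], 2);
        value = q;
    }

    /* 0..99 left: emit one or two digits, so no leading zero appears. */
    if (value >= 10) {
        p -= 2;
        memcpy(p, pair_table[value], 2);
    } else {
        *--p = (char)('0' + value);
    }
    return p;
}
```

The caller passes a 21-byte buffer and uses the returned pointer, which points somewhere inside that buffer. The assembly listing likewise returns a pointer into OutBuffer, not its start.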

Edit: added exe attachment
Regards

lingo · April 10, 2024, 12:57:07 AM

Thank you six_L,

Do you include the time to complete the table?
Can I have your test?
I want to run your test on my computer.

six_L · April 10, 2024, 02:34:46 AM

Hi, six_L
I have moved and merged your post, together with the attachment, into the requested Dword to ascii thread in The Laboratory.

Admin'

jj2007 · April 10, 2024, 08:39:51 AM

I don't really like 64-bit code, but I guess there is enough interest for a 64-bit version:

843     ms for 50Mio*MbDw2Str   result  $rdi    1234567890123456789
2172    ms for 100Mio*Roberts   result  $rdi    1234567890123456789
15406   ms for 100Mio*sprintf   result  $rdi    1234567890123456789

"Roberts" is the code from reply #82. Unfortunately, it doesn't handle negative numbers, which distorts the numbers a bit (handling negative numbers costs some cycles).

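On the remark that handling negative numbers costs some cycles: the extra work is a sign test, a negate, and one more stored byte. A rough C sketch of that step follows. It uses a plain one-digit-per-step loop for brevity, the name `i64_to_str` is hypothetical, and the negate is done in unsigned arithmetic so that the most negative value does not overflow.

```c
#include <stdint.h>

/*
 * Hypothetical helper: signed 64-bit value to decimal.
 * buf needs 21 bytes: '-' + 19 digits + NUL.
 * Returns a pointer to the first character inside buf.
 */
static char *i64_to_str(int64_t value, char *buf)
{
    char *p = buf + 20;

    /* Take the magnitude in unsigned arithmetic so INT64_MIN is safe. */
    uint64_t mag = (value < 0) ? 0 - (uint64_t)value : (uint64_t)value;

    *p = '\0';
    do {
        *--p = (char)('0' + mag % 10);  /* lowest digit first, right to left */
        mag /= 10;
    } while (mag != 0);

    if (value < 0)
        *--p = '-';                     /* the extra store for negatives */
    return p;
}
```
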
six_L · April 10, 2024, 02:07:09 PM

Hi, Lingo

Quote
Do you include the time to complete the table?

No, the digit table is built outside the MbDq2Str proc. It is created once, when the test thread starts.

Hi, jj2007

Quote
I don't really like 64-bit code.

32-bit code runs in a VM on a 64-bit OS. Perhaps your test result was thingemyed.

lingo · April 10, 2024, 04:42:10 PM

Thank you six_L,

I don't believe there is a masochist who would fill in the table before using this "algo"...?!

It's 32 bit slow and bloated garbage...
Any 64 bit code will be shorter and faster.
You can try the same with 20 rows less code:

align 16
db 8 dup(90h)
Dq2Str proc
       lea  r9,  dw2aTable
       mov  r10, rax
       mov  r8,  51EB851EB851EB85h
       add  rcx, 18h
@@:
       mul  r8
       shr  rdx, 5           ; :32
       sub  rcx, 2
       mov  rax, rdx
       imul rdx, 64h         ; *100
       sub  r10, rdx
       mov  dx,  [r9+r10*2]  ; r9 = addr dw2aTable
       mov  r10, rax
       mov  [rcx], dx
       cmp  rax, 0
       ja   @b
       cmp  rax, 0
       ja   @f
       cmp  rbx, 0Ah
       jnb  @Exit
@@:
       mov  byte ptr [rcx], 0
       inc  rcx
@Exit:
       mov  rax, rcx
       ret     ; minus 20 rows
Dq2Str endp
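
Both 64-bit listings replace the division by 100 with a multiply by 51EB851EB851EB85h followed by `shr rdx, 5`. The C sketch below mimics that sequence so the constant can be checked on its own. It assumes a compiler with `unsigned __int128` (GCC or Clang), and all function names are made up here. Because the constant is 2^69/100 rounded down, the result comes out one too low when the input is an exact multiple of 100: for example, 100 yields 0, which leaves a remainder of 100 and a table index one past the last entry. The thread's test value 98765432109876 never hits such a case. The second function is a shift-first form, similar to what optimizing C compilers tend to generate for `x / 100`, shown only for comparison.

```c
#include <stdint.h>

/* Constant from the listings: floor(2^69 / 100). */
#define MAGIC_TRUNC 0x51EB851EB851EB85ULL

/* High 64 bits of a 64x64-bit product (what MUL leaves in RDX).
   Assumes unsigned __int128 is available (GCC, Clang). */
static uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Multiply and shift exactly as in the listings: mul, then shr rdx,5. */
static uint64_t div100_listing(uint64_t x)
{
    return mulhi64(x, MAGIC_TRUNC) >> 5;
}

/* Comparison variant: shift right by 2 first, multiply by
   ceil(2^66 / 25), then shift the high half right by 2. */
static uint64_t div100_preshift(uint64_t x)
{
    return mulhi64(x >> 2, 0x28F5C28F5C28F5C3ULL) >> 2;
}
```

A quick way to compare the two is to run both against `x / 100` for a few multiples of 100 and for values around them.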

Quote
His method:
1) Divide by 100, handling 2 digits at once.
2) Look each digit pair up in a table of 100 words (dw), one per pair "00" to "99".

Next version will:
1) div 1000, handle 4 digits at once.
2) query the table includes ?? dd digits...ha,ha,hah

Quote
Perhaps your test result was thingemyed.

Don't believe his manipulated "tests". The reason for that is that there is no one from this forum who can compile his test "sources" normally...

sinsi · April 10, 2024, 05:51:54 PM

Quote from: lingo on April 10, 2024, 04:42:10 PM
The reason for that is that there is no one from this forum who can compile his test "sources" normally...

Some of us can even understand the syntax

jj2007 · April 10, 2024, 06:52:17 PM

Quote from: six_L on April 10, 2024, 02:07:09 PM
32-bit code runs in a VM on a 64-bit OS.

You should try to understand how 32-bit code runs on a 64-bit OS.

Quote
Perhaps your test result was thingemyed.

All benchmarks have little problems. Mine have been tested hundreds of times in the Lab, and many members of this forum have contributed to making them reliable. Btw, most of us here understand that including the time to build the table has no measurable effect on a test that runs a million iterations. We know our stuff here in the Lab.

0 µs for building the table

Sometimes it's even one microsecond

Quote from: sinsi on April 10, 2024, 05:51:54 PM
Some of us can even understand the syntax

Right

And some even know how to open a Rich Text Format file, and happily build my sources:

Quote from: jimg on April 02, 2024, 07:43:17 AM
Using your version 3 and enabling test F I get

jj2007 · April 10, 2024, 07:13:00 PM

Quote from: lingo on April 10, 2024, 04:42:10 PM
You can try the same with 20 rows less code:

align 16
db 8 dup(90h)
Dq2Str proc
       lea  r9,  dw2aTable
       mov  r10, rax
       mov  r8,  51EB851EB851EB85h
       add  rcx, 18h
@@:
       mul  r8

To those who think that "Lingo" is just an impostor: No, he is the real Lingo. His code is badly copied, and of course, it crashes (he also forgets to handle negative numbers, but that's only minor criticism).

six_L · April 11, 2024, 01:46:24 AM

Hi, Lingo
1)
You have surpassed jj2007 again.

Quote
00004451 clock cycles, (Roberts_dqtoa)x10000, OutPut: 98765432109876
00004603 clock cycles, (Roberts_Lqword2ascii)x10000, OutPut: 98765432109876
00014591 clock cycles, (Ray_AnyToAny)x10000, OutPut: 98765432109876
00004130 clock cycles, (Lingo_i2a64_1)x10000, OutPut: 98765432109876
00004036 clock cycles, (Lingo_q2a64_2)x10000, OutPut: 98765432109876
00002609 clock cycles, (jj2007_MbDq2Str)x10000, OutPut: 98765432109876
00002426 clock cycles, (Lingo3_Dq2Str)x10000, OutPut: 98765432109876

There is no best code, only better code. I hope the competition continues.
2)

; init:
    dw2aTable    dw 100 dup (0)

    xor    rbx,rbx
    lea    rdi,dw2aTable
    .repeat
        invoke    RtlZeroMemory,addr tmp,sizeof tmp
        invoke    wsprintf,addr tmp,CStr("%02i"),rbx
        lea    rcx,tmp
        mov    ax,[rcx]
        mov    [rdi+2*rbx],ax
        inc    rbx
    .until rbx == 100
...

Quote
I don't believe there is a masochist who would fill in the table before using this "algo"...?!

That is my code, written to be easy to read and to verify the correctness of his logic. You should blame me, not him.

Regards.
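
The init loop in this post fills each table entry by formatting the index with wsprintf and copying the first two characters as one word. The same 100 words can also be computed arithmetically, as in the C sketch below. The names `dw2a_table` and `fill_dw2a_table` are hypothetical, and the byte order assumes a little-endian target such as x86, where the tens digit lands in the low byte, just as `mov ax,[rcx]` reads it.

```c
#include <stdint.h>

/* One 16-bit word per value 0..99, holding its two ASCII digits. */
static uint16_t dw2a_table[100];

/* Build the table without any formatting call. */
static void fill_dw2a_table(void)
{
    for (unsigned i = 0; i < 100; i++) {
        unsigned tens = '0' + i / 10;   /* first character in memory  */
        unsigned ones = '0' + i % 10;   /* second character in memory */
        /* Little-endian: low byte is stored first. */
        dw2a_table[i] = (uint16_t)(tens | (ones << 8));
    }
}
```

Either way the table is built once at startup, so the choice mainly affects readability, not the measured conversion loop.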

The MASM Forum
