Dword to ascii (dw2a, dwtoa, dw2str, Str$, ...)

Started by jj2007, April 01, 2024, 03:42:50 AM

LiaoMi

Quote from: jj2007 on April 07, 2024, 08:52:16 AM
Thanks, Stoo :thup:

Back to the topic: may I have some timings, please?

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

Averages:
4603    cycles for dwtoa
2954    cycles for dw2str
2212    cycles for dw2$JJ
3248    cycles for jgnorecurse
2957    cycles for dw2$X

13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)

1794    cycles for 100 * dwtoa
1416    cycles for 100 * dw2str
751     cycles for 100 * dw2$JJ
1442    cycles for 100 * jgnorecurse
1707    cycles for 100 * dw2$X

1845    cycles for 100 * dwtoa
1464    cycles for 100 * dw2str
793     cycles for 100 * dw2$JJ
1496    cycles for 100 * jgnorecurse
1710    cycles for 100 * dw2$X

1839    cycles for 100 * dwtoa
1466    cycles for 100 * dw2str
806     cycles for 100 * dw2$JJ
1509    cycles for 100 * jgnorecurse
1686    cycles for 100 * dw2$X

1837    cycles for 100 * dwtoa
1466    cycles for 100 * dw2str
799     cycles for 100 * dw2$JJ
1612    cycles for 100 * jgnorecurse
1693    cycles for 100 * dw2$X

Averages:
1838    cycles for dwtoa
1465    cycles for dw2str
796     cycles for dw2$JJ
1502    cycles for jgnorecurse
1700    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
94      bytes for dw2$JJ
76      bytes for jgnorecurse
106     bytes for dw2$X

dwtoa                                   -123456789
dw2str                                  -123456789
dw2$JJ                                  -123456789
jgnorecurse                             -123456789
dw2$X                                   -123456789

--- ok ---

LiaoMi

Quote from: jj2007 on April 07, 2024, 07:55:18 PM
The show must go on :thumbsup:

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

Averages:
4761    cycles for dwtoa
3044    cycles for dw2str
2122    cycles for MbDw2Str  <<<<<<<<<<<<<<<<<<<< tested for the full range -1...0
3364    cycles for jgnorecurse
2948    cycles for dw2$X

The new algo uses a table, as suggested initially by Ray alias ahsat :thumbsup:

13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)

1782    cycles for 100 * dwtoa
1383    cycles for 100 * dw2str
997     cycles for 100 * MbDw2Str
1400    cycles for 100 * jgnorecurse
1650    cycles for 100 * dw2$X

1830    cycles for 100 * dwtoa
1442    cycles for 100 * dw2str
1052    cycles for 100 * MbDw2Str
1440    cycles for 100 * jgnorecurse
1657    cycles for 100 * dw2$X

1846    cycles for 100 * dwtoa
1442    cycles for 100 * dw2str
1038    cycles for 100 * MbDw2Str
1447    cycles for 100 * jgnorecurse
1650    cycles for 100 * dw2$X

1833    cycles for 100 * dwtoa
1446    cycles for 100 * dw2str
1000    cycles for 100 * MbDw2Str
1437    cycles for 100 * jgnorecurse
1647    cycles for 100 * dw2$X

Averages:
1832    cycles for dwtoa
1442    cycles for dw2str
1019    cycles for MbDw2Str
1438    cycles for jgnorecurse
1650    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
102     bytes for MbDw2Str
76      bytes for jgnorecurse
106     bytes for dw2$X

dwtoa                                   123456789
dw2str                                  123456789
MbDw2Str                                123456789
jgnorecurse                             123456789
dw2$X                                   123456789

--- ok ---

jj2007

Quote from: LiaoMi on April 07, 2024, 08:06:59 PM
13th Gen Intel(R) Core(TM) i9-13980HX (SSE4)

Thanks, LiaoMi - so even on Intel MbDw2Str performs well :thumbsup:

jj2007

Special edition for JimG - I had activated an older version of his algo. Here is the correct one :thup:

Sorry for that :cool:

jj2007

56.7% faster than dwtoa is ok, right?

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

Averages:
4618    cycles for dwtoa
2938    cycles for dw2str
2000    cycles for MbDw2Str
3972    cycles for jgnorecurse
2968    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
94      bytes for MbDw2Str
104     bytes for jgnorecurse
106     bytes for dw2$X

dwtoa                                   -123456789
dw2str                                  -123456789
MbDw2Str                                -123456789
jgnorecurse                             -123456789
dw2$X                                   -123456789
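
One way to read the 56.7% figure is as the reduction in cycle count relative to dwtoa, using the AMD averages above (4618 cycles for dwtoa, 2000 for MbDw2Str). A small Python check, with helper names made up for illustration:

```python
def percent_fewer_cycles(baseline: int, candidate: int) -> float:
    """Reduction in cycles relative to the baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline

def speedup_factor(baseline: int, candidate: int) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate

# Averages taken from the AMD Athlon Gold 3150U listing above.
DWTOA_CYCLES = 4618
MBDW2STR_CYCLES = 2000

print(round(percent_fewer_cycles(DWTOA_CYCLES, MBDW2STR_CYCLES), 1))  # 56.7
print(round(speedup_factor(DWTOA_CYCLES, MBDW2STR_CYCLES), 2))        # 2.31
```

Under this reading MbDw2Str needs 56.7% fewer cycles, which is the same as running about 2.31 times as fast.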

six_L

#140
Hi, Lingo
As far as I remember, jj2007 always loses the speed races. But this time he finally won one.

Quote
00004502 clock cycles, (Roberts_dqtoa)x10000, OutPut: 98765432109876
00004291 clock cycles, (Roberts_Lqword2ascii)x10000, OutPut: 98765432109876
00014634 clock cycles, (Ray_AnyToAny)x10000, OutPut: 98765432109876
00004312 clock cycles, (Lingo_i2a64_1)x10000, OutPut: 98765432109876
00003799 clock cycles, (Lingo_q2a64_2)x10000, OutPut: 98765432109876
00002310 clock cycles, (jj2007_MbDq2Str)x10000, OutPut: 98765432109876

His method:
1) div 100, handle 2 digits at once.
2) query the table includes 100dw digits

; init:
    dw2aTable    dw 100 dup (0)

    xor    rbx,rbx
    lea    rdi,dw2aTable
    .repeat
        invoke    RtlZeroMemory,addr tmp,sizeof tmp
        invoke    wsprintf,addr tmp,CStr("%02i"),rbx
        lea    rcx,tmp
        mov    ax,[rcx]
        mov    [rdi+2*rbx],ax
        inc    rbx
    .until rbx == 100
...
MbDq2Str proc uses rsi rdi rbx dqValue:QWORD,OutBuffer:QWORD

    mov    rax, dqValue
    ;mov    r8,  rax    ;for sign
    lea    rsi, dw2aTable
    mov    rdi, OutBuffer
    lea    rdi, [rdi+24]
    test    rax, rax
    jnz    @F
    mov    word ptr [rdi],30h
    jmp    @Exit
;@@:   
    ;jns    @F
    ;neg    rax
@@:   
    mov    byte ptr [rdi], 0            ; terminate the string
    dec    rdi
    mov    rcx, 051eb851eb851eb85h            ; 2^69/100 rounded down (= 2^64 * 32/100); mul by it, then shr 5, replaces div by 100
    .while rax > 0
        xor    rdx, rdx
        mov    rbx, rax            ; number
        mul    rcx                ; *32x1/100
        shr    rdx, 5                ; /32
        mov    rax, rdx
        imul    rdx, rdx, 100
        sub    rbx, rdx
        movzx    rdx, word ptr [rsi+2*rbx]    ;00-99,word table
        sub    rdi, 2
        mov    [rdi], dx
    .endw
    .if    rbx < 10
        mov byte ptr [rdi], 0            ; top pair is "0x": overwrite its leading "0" and skip it
        inc rdi
    .endif
    ;test    r8, r8
    ;.if Sign?
    ;    dec    rdi
    ;    mov    byte ptr [rdi], "-"
    ;.endif
@Exit:   
    xchg    rax, rdi
    ret
MbDq2Str endp
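
A minimal sketch of the two steps listed under "His method", written in Python rather than assembly so it can be run anywhere. It is a model, not six_L's or jj2007's code: the names `DIGIT_PAIRS` and `u64_to_dec` are made up here, the division is exact (`divmod`) where the assembly uses a multiply and a shift, and only unsigned values are handled, matching the listing where the sign code is commented out.

```python
# Table of the 100 two-character strings "00".."99".
# The assembly keeps the same data as 100 words (dw), one per digit pair.
DIGIT_PAIRS = [f"{i:02d}" for i in range(100)]

def u64_to_dec(value: int) -> str:
    """Convert an unsigned 64-bit integer to decimal, two digits per step."""
    if not 0 <= value < 2**64:
        raise ValueError("value must fit in an unsigned 64-bit integer")
    if value == 0:
        return "0"
    pairs = []                              # digit pairs, least significant first
    while value > 0:
        value, rem = divmod(value, 100)     # step 1: peel off two digits
        pairs.append(DIGIT_PAIRS[rem])      # step 2: table lookup, no per-digit math
    text = "".join(reversed(pairs))
    # The most significant pair may start with "0" (e.g. "07"): drop that one digit.
    return text[1:] if text[0] == "0" else text

print(u64_to_dec(98765432109876))
```

The loop runs once per two digits, so a 20-digit value needs 10 divisions instead of 20; the table costs 200 bytes in the assembly version.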
Edit: + exe attachment
Regards
Say you, Say me, Say the codes together for ever.

lingo

Thank you six_L, :thumbsup:

Do you include the time to complete the table?
Can I have your test?
I want to run your test on my computer.
Quid sit futurum cras fuge quaerere.

six_L

#142
Hi, six_L
Have 'moved' / 'merged' your post into the requested Dword to ascii thread in The Laboratory + the attachment.


Admin'
:thumbsup:
Say you, Say me, Say the codes together for ever.

jj2007

I don't really like 64-bit code, but I guess there is enough interest for a 64-bit version:
843     ms for 50Mio*MbDw2Str   result  $rdi    1234567890123456789
2172    ms for 100Mio*Roberts   result  $rdi    1234567890123456789
15406   ms for 100Mio*sprintf   result  $rdi    1234567890123456789

"Roberts" is the code from reply #82. Unfortunately, it doesn't handle negative numbers, which distorts the numbers a bit (handling negative numbers costs some cycles).

six_L

Hi, Lingo
Quote
Do you include the time to complete the table?
No, the digit table is outside the MbDq2Str proc. It is created when the testing thread starts.

Hi, jj2007
Quote
I don't really like 64-bit code.
the 32bit codes are working in VM of 64bit OS. Perhaps your tested result was thingemyed.
Say you, Say me, Say the codes together for ever.

lingo

Thank you six_L,

I don't believe there is a masochist who would fill in the table before using this "algo"...?! :badgrin:
It's 32 bit slow and bloated garbage...
Any 64 bit code will be shorter and faster.
You can try the same with 20 rows less code:
align 16
db 8 dup(90h)
Dq2Str proc
       lea  r9,  dw2aTable
       mov  r10, rax
       mov  r8,  51EB851EB851EB85h
       add  rcx, 18h
@@:
       mul  r8
       shr  rdx, 5           ; :32
       sub  rcx, 2
       mov  rax, rdx
       imul rdx, 64h         ; *100
       sub  r10, rdx
       mov  dx,  [r9+r10*2]  ; r9 = addr dw2aTable
       mov  r10, rax
       mov  [rcx], dx
       cmp  rax, 0
       ja   @b
       cmp  rax, 0
       ja   @f
       cmp  rbx, 0Ah
       jnb  @Exit
@@:
       mov  byte ptr [rcx], 0
       inc  rcx
@Exit:
       mov  rax, rcx
       ret     ; minus 20 rows
Dq2Str endp
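
Both listings replace `div` by 100 with a multiply by 51EB851EB851EB85h followed by a right shift of the high half by 5. The Python sketch below models that shortcut so it can be compared with exact division. `div100_mul_shift` and `first_mismatch` are hypothetical helper names, and Python's big integers stand in for the rdx:rax register pair.

```python
MAGIC = 0x51EB851EB851EB85      # constant used in the listings
MASK64 = 2**64 - 1

def div100_mul_shift(x: int) -> int:
    """Model of 'mul MAGIC' followed by 'shr rdx, 5' for a 64-bit x."""
    high = ((x & MASK64) * MAGIC) >> 64     # rdx after mul: high 64 bits of the product
    return high >> 5

def first_mismatch(limit: int):
    """Return the first x below limit where the shortcut differs from x // 100."""
    for x in range(limit):
        if div100_mul_shift(x) != x // 100:
            return x
    return None

print(MAGIC == 2**69 // 100)                # the constant is 2^69/100 rounded down
print(div100_mul_shift(98765432109876))     # quotient for the benchmark value
print(first_mismatch(1000))                 # first input where the model disagrees
```

In this model the shortcut agrees with exact division for the benchmark value 98765432109876, but because the constant is rounded down it returns one less than `x // 100` whenever x is an exact multiple of 100 (the first case is x = 100). That is a property of the model above; it is worth checking against the real routines before relying on them for arbitrary input.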
Quote
His method:
1) div 100, handle 2 digits at once.
2) query the table includes 100dw digits
Next version will:
1) div 1000, handle 4 digits at once.
2) query the table includes ?? dd digits...ha,ha,hah  :badgrin:  :badgrin:  :skrewy:
Quote
Perhaps your tested result was thingemyed.
Don't believe his manipulated "tests". The reason for that is that there is no one from this forum who can compile his test "sources" normally... :undecided:
Quid sit futurum cras fuge quaerere.

sinsi

Quote from: lingo on April 10, 2024, 04:42:10 PM
The reason for that is that there is no one from this forum who can compile his test "sources" normally... :undecided:
Some of us can even understand the syntax  :biggrin:
🍺🍺🍺

jj2007

Quote from: six_L on April 10, 2024, 02:07:09 PM
the 32bit codes are working in VM of 64bit OS.

You should try to understand how 32-bit code runs on a 64-bit OS.

Quote
Perhaps your tested result was thingemyed.

All benchmarks have little problems. Mine have been tested hundreds of times in the Lab, and many members of this forum have contributed to make them reliable. Btw most of us here understand that including the time to build the table has no measurable effect on a test that runs with a Million iterations. We know our stuff here in the Lab.

0 µs for building the table

Sometimes it's even one microsecond :bgrin:

Quote from: sinsi on April 10, 2024, 05:51:54 PM
Some of us can even understand the syntax  :biggrin:

Right :biggrin:

And some even know how to open a Rich Text Format file, and happily build my sources:

Quote from: jimg on April 02, 2024, 07:43:17 AM
Using your version 3 and enabling test F I get

jj2007

Quote from: lingo on April 10, 2024, 04:42:10 PM
You can try the same with 20 rows less code:
align 16
db 8 dup(90h)
Dq2Str proc
       lea  r9,  dw2aTable
       mov  r10, rax
       mov  r8,  51EB851EB851EB85h
       add  rcx, 18h
@@:
       mul  r8

To those who think that "Lingo" is just an impostor: No, he is the real Lingo. His code is badly copied, and of course, it crashes (he also forgets to handle negative numbers, but that's only minor criticism).

six_L

Hi, Lingo
1)
You have surpassed jj2007 again.
Quote
00004451 clock cycles, (Roberts_dqtoa)x10000, OutPut: 98765432109876
00004603 clock cycles, (Roberts_Lqword2ascii)x10000, OutPut: 98765432109876
00014591 clock cycles, (Ray_AnyToAny)x10000, OutPut: 98765432109876
00004130 clock cycles, (Lingo_i2a64_1)x10000, OutPut: 98765432109876
00004036 clock cycles, (Lingo_q2a64_2)x10000, OutPut: 98765432109876
00002609 clock cycles, (jj2007_MbDq2Str)x10000, OutPut: 98765432109876
00002426 clock cycles, (Lingo3_Dq2Str)x10000, OutPut: 98765432109876

There is no best code, only better code. I hope the competition continues.
2)
; init:
    dw2aTable    dw 100 dup (0)

    xor    rbx,rbx
    lea    rdi,dw2aTable
    .repeat
        invoke    RtlZeroMemory,addr tmp,sizeof tmp
        invoke    wsprintf,addr tmp,CStr("%02i"),rbx
        lea    rcx,tmp
        mov    ax,[rcx]
        mov    [rdi+2*rbx],ax
        inc    rbx
    .until rbx == 100
...
Quote
I don't believe there is a masochist who would fill in the table before using this "algo"...?! :badgrin:
That is my code, written to be easy to read and to verify the correctness of his logic. You should blame me, not him.

Regards.
Say you, Say me, Say the codes together for ever.