Dword to ascii (dw2a, dwtoa, dw2str, Str$, ...)

Started by jj2007, April 01, 2024, 03:42:50 AM

Quote from: jimg on April 04, 2024, 12:10:44 PM
And a final cleanup.

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)
4638    cycles for dwtoa
2986    cycles for dw2str
3225    cycles for jgnorecurse
3042    cycles for dw2$X

4616    cycles for dwtoa
2988    cycles for dw2str
3232    cycles for jgnorecurse
3042    cycles for dw2$X

4608    cycles for dwtoa
2982    cycles for dw2str
3210    cycles for jgnorecurse
3026    cycles for dw2$X

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
7034    cycles for dwtoa
5718    cycles for dw2str
4764    cycles for jgnorecurse
6172    cycles for dw2$X

7232    cycles for dwtoa
5883    cycles for dw2str
4932    cycles for jgnorecurse
6285    cycles for dw2$X

7382    cycles for dwtoa
6289    cycles for dw2str
5239    cycles for jgnorecurse
6760    cycles for dw2$X

Looks like an Intel vs AMD problem...


That is a significant and surprising difference.   



Would you check this out if you get a chance?

This is the test program but with a conditional compile around test H.  On line 462 you should find

dotest=2  ; 1=test without DW2$X, else test with DS2$X

if dotest eq 1

Case 1 is just a copy of my routine for TestH.
Case 2 is your DW2$X routine.

When I run, my routine runs 120 cycles faster for case 1 than case 2.  This makes no sense to me, but it is consistent.  Does this happen on AMD also?


4552    cycles for dwtoa
3041    cycles for dw2str
3176    cycles for jgnorecurse
2993    cycles for dw2$X

4450    cycles for dwtoa
2845    cycles for dw2str
3072    cycles for jgnorecurse
3072    cycles for jgnorecurse2

Timings are not very consistent :rolleyes:

Same but with TimerLoops = 100000 (the default - you set it very low at 1000):

4480    cycles for dwtoa
2870    cycles for dw2str
3100    cycles for jgnorecurse
3099    cycles for jgnorecurse2

4597    cycles for dwtoa
2968    cycles for dw2str
3227    cycles for jgnorecurse
3016    cycles for dw2$X



Looks like it ran an extra 100 cycles on the first test, and an extra 127 on the second test, so you are seeing the same effect.  Strange.


Intel(R) Core(TM)2 Duo CPU    E8400  @ 3.00GHz (SSE4)
61      bytes for other

9439    cycles for 100 * dwtoa
7928    cycles for 100 * dw2str
7370    cycles for 100 * jgnorecurse
7980    cycles for 100 * dw2$X

9527    cycles for 100 * dwtoa
8051    cycles for 100 * dw2str
7511    cycles for 100 * jgnorecurse
7954    cycles for 100 * dw2$X

9533    cycles for 100 * dwtoa
8081    cycles for 100 * dw2str
7475    cycles for 100 * jgnorecurse
7945    cycles for 100 * dw2$X

9567    cycles for 100 * dwtoa
8053    cycles for 100 * dw2str
7472    cycles for 100 * jgnorecurse
7942    cycles for 100 * dw2$X

9530    cycles for dwtoa
8052    cycles for dw2str
7474    cycles for jgnorecurse
7950    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
80      bytes for jgnorecurse
106    bytes for dw2$X

dwtoa                                  -123456789
dw2str                                  -123456789
jgnorecurse                            -123456789
dw2$X                                  -123456789

--- ok ---

"We are living in interesting times"   :tongue:


1) Now I know why: we can't use 01999,9999,9999,9999h (1/10); we must use 0cccc,cccc,cccc,cccdh (8 * 1/10).
Otherwise, when the qword contains a few zeros, qword2ascii fails.
2) You use the "little end" to store the result, which eliminates unnecessary code.
3) {lea rdx,[rdx*4+rdx-18h]}: I really would not have thought of that. Very nice.

I learned a lot of skills from you.
Thank you very much!
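For readers following point 3, here is a minimal C sketch of what the lea trick computes. It is a 32-bit illustration only (the post above discusses the 64-bit version), and the helper name split_last_digit is made up for this example, not taken from anyone's code in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the thread: returns the ASCII character
   for n % 10 and stores n / 10 through *quot. */
static char split_last_digit(uint32_t n, uint32_t *quot)
{
    assert(quot != 0);
    /* 0xCCCCCCCD is 8 * (1/10) scaled by 2^32 and rounded up. The shift of
       35 bits is the high dword (32 bits) plus the "shr edx,3". */
    uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
    uint32_t t = q * 5u - 24u;  /* lea edx,[edx*4+edx-24]: 5*q minus half of '0' */
    t += t;                     /* add edx,edx: 10*q - 48, modulo 2^32 */
    *quot = q;
    return (char)(n - t);       /* n - 10*q + 48 = (n % 10) + '0' */
}
```

Subtracting half of '0' before the doubling means the final subtraction yields the ASCII digit directly, with no separate "add 30h" step.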


Thanks for that.  I was just getting around to testing.   So for now, back to the old way.

align 4
resultJim        db 16 dup(?)
db 2 dup (?) ; dummy spacer to align instructions best
dword2ascii:  ; setup
    test eax,eax
    jnz @f
    mov word ptr [edi],30h
@@: jns @f
    mov byte ptr [edi],'-'
    neg eax
    add edi, 1
@@: mov ecx,0CCCCCCCDh  ; 8 * 1/10
    push esi
    mov esi,0
    .repeat               ; first loop: extract digits, least significant first
      push ebx            ; save ebx, or a digit
      inc esi
      mov ebx,eax         ; save original
      mul ecx             ; 0CCCCCCCDh =8 * 1/10
      shr edx,3           ; take out the factor of 8
      mov eax,edx         ; save answer
      lea edx,[edx*4+edx-24] ; *5  also subtract out half of '0' (24) to convert last byte to ascii
      add edx,edx         ; *10    double it, so full value of '0' is covered.  so -48 extra to dl
      sub ebx,edx         ; subtract from original gives remainder digit  -(-48) =+48=+'0' bl now in ascii
      test eax,eax        ; are we done?
    .until zero?
    .repeat               ; second loop: store digits, most significant first
      mov [edi],bl
      inc edi
      pop ebx      ; restore ebx, or get next digit
      dec esi
    .until zero?
    pop esi
    ret


After a little testing, it turns out that two bits was enough, i.e. could use 66666667h to multiply, and shr 2, but no gain in speed so I'll leave it at 3 bits ( *8) to be compatible with everyone else.


Hi, jimg
This is a significant improvement.
Quote
00000374 clock cycles, (lingo_i2a64)x1000, OutPut: 98765432109876  ;use 8*1/10
00000360 clock cycles, (lingo_i2a64+)x1000, OutPut: 98765432109876 ;use 4*1/10
Thank you very much!

Say you, Say me, Say the codes together for ever.


It shouldn't really make a difference, as you have to shift anyway.  And I tested only 32-bit numbers; it may not be true for 64-bit.


66666667h/2 and shr 1 would be a tick faster. The sh* reg32, 1 instructions are faster than shifts with a count higher than one. Unfortunately, that limits the range of valid multiplies - try the numbers above 1,073,741,824 :cool:
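The range limits of the three multipliers can be checked directly. Below is a minimal C sketch; the helper names are invented for this example. By my arithmetic, 33333334h with shr 1 first goes wrong at 1,073,741,829, and 66666667h with shr 2 first goes wrong at 2,863,311,539, which is beyond the largest magnitude (2,147,483,648) that a negated signed dword can have. 0CCCCCCCDh with shr 3 is exact for every 32-bit unsigned value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: "mul ecx" followed by "shr edx,extra_shift",
   i.e. the 64-bit product shifted right by 32 + extra_shift bits. */
static uint32_t div10_magic(uint32_t n, uint32_t magic, unsigned extra_shift)
{
    assert(extra_shift < 32);
    return (uint32_t)(((uint64_t)n * magic) >> (32 + extra_shift));
}

/* 8 * 1/10: exact over the whole unsigned 32-bit range. */
static uint32_t div10_shr3(uint32_t n) { return div10_magic(n, 0xCCCCCCCDu, 3); }

/* 4 * 1/10: exact up to 2,863,311,538, enough for signed magnitudes. */
static uint32_t div10_shr2(uint32_t n) { return div10_magic(n, 0x66666667u, 2); }

/* 2 * 1/10: exact only up to 1,073,741,828. */
static uint32_t div10_shr1(uint32_t n) { return div10_magic(n, 0x33333334u, 1); }
```

Comparing each function against a plain n / 10 around those thresholds shows where the cheaper shift stops being safe.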


This is another improvement.
Could you test the speed of the code below?
Ldword2ascii proc uses ebx esi edi dwValue:DWORD,pOutBuf:DWORD
    local    nFlag:BOOL
    mov    nFlag,FALSE
    mov    eax,dwValue
    mov    edi,pOutBuf
    lea    esi,[edi+16]            ; digits are written backwards from here
    test    eax,eax
    jnz    @f
    mov    word ptr [esi],30h        ; zero: store '0' plus terminator
    jmp    @exit
@@:
    jns    @f
    mov    nFlag,TRUE            ; remember the sign
    neg    eax
@@:
    mov    ecx,033333334h            ; 2*(1/10)
    .repeat
        mov    ebx,eax            ; save original
        mul    ecx            ; 033333334h = 2*(1/10)
        shr    edx,1            ; /2
        mov    edi,edx            ; edi = quotient of eax/10
        lea    edx,[edx*4+edx-18h]    ; *5  also subtract out half of '0' (24) to convert last byte to ascii  ala lingo
        lea    edx,[edx+edx]        ; *10    double it, so full value of '0' is covered.  so -48 extra to dl
        sub    ebx,edx            ; subtract from original gives remainder digit  -(-48) =+48=+'0' bl now in ascii
        sub    esi,1
        mov    [esi],bl
        mov    eax,edi            ; quotient is the next value to divide
        test    eax,eax            ; are we done?
    .until zero?
    .if    nFlag == TRUE
        sub    esi, 1
        mov    byte ptr [esi],'-'
    .endif
@exit:
    mov    eax,esi                ;eax output addr in pOutBuf
    ret
Ldword2ascii  endp

local Outbuf[64] :BYTE

    invoke    RtlZeroMemory,addr Outbuf,sizeof Outbuf
        invoke    Ldword2ascii, -9076541,addr Outbuf
        invoke    MessageBox,0,eax,0,0
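
The proc above fills the buffer from the back and returns a pointer to the first character, so no reversal pass is needed. A rough C sketch of that idea follows. The function name and the "end of buffer" parameter are assumptions made for this example, and it swaps in the 0CCCCCCCDh multiplier (shift by 3) so that it stays exact over the whole 32-bit range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C counterpart of the backward-filling idea. `end` must point
   one past the last byte of a buffer holding at least 12 bytes
   (sign, 10 digits, terminator). */
static char *dword_to_ascii_backward(int32_t value, char *end)
{
    assert(end != 0);
    /* Unsigned negation keeps -2147483648 well defined. */
    uint32_t n = value < 0 ? 0u - (uint32_t)value : (uint32_t)value;
    char *p = end;
    *--p = '\0';
    do {
        uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35); /* n / 10 */
        *--p = (char)('0' + (n - q * 10u));                          /* n % 10 */
        n = q;
    } while (n != 0);
    if (value < 0)
        *--p = '-';
    return p;  /* first character of the text, somewhere inside the buffer */
}
```

The caller uses the returned pointer rather than the start of the buffer, just as the invoke above passes eax to MessageBox.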


This is it for me.  This is the all up proc.  Overall, about a 10% improvement over dwtoa (dwtoa averaged 72 cycles, dword2ascii averaged 64 cycles). 

align 16
dword2ascii proc uses esi edi, Value,buff
   ; converts 32 bit signed integer to ascii
   ; Value to be converted
   ; buff = address of where to store the results
    mov eax,Value
    mov edi,buff
    test eax,eax
    jnz @f
    mov word ptr [edi],30h
@@: jns @f
    mov byte ptr [edi],'-'
    neg eax 
    add edi, 1
@@: mov ecx,66666667h    ; 4 * 1/10
    push esi
    xor esi,esi
    .repeat              ; first loop: extract digits, least significant first
      push ebx           ; save ebx, or a digit
      inc esi
      mov ebx,eax        ; save original
      mul ecx            ; 66666667h = 4 * 1/10
      shr edx,2          ; take out the factor of 4
      mov eax,edx        ; save answer
      add edx,edx        ; *2
      lea edx,[edx*4+edx-'0'] ; *10  also subtract out '0' to convert last byte to ascii
      sub ebx,edx        ; subtract from original gives remainder, digit - (-'0') = +'0' bl now in ascii
      test eax,eax       ; are we done?
    .until zero?
    .repeat              ; second loop: store digits, most significant first
      mov [edi],bl
      inc edi
      pop ebx      ; restore ebx, or get next digit
      dec esi
    .until zero?
    mov byte ptr [edi],0 ; terminate the string
    pop esi
    ret
dword2ascii endp
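
For anyone who wants to check the logic outside the assembler, here is a rough C rendering of the same two-loop structure. It is a sketch under assumptions, not a translation anyone in the thread posted: the function name, the small digits array standing in for the pushes, and the returned length are all inventions of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C rendering. Writes at most 12 bytes (sign, 10 digits, NUL)
   and returns the number of characters written, excluding the terminator. */
static int dword_to_ascii(int32_t value, char *buf)
{
    assert(buf != 0);
    char digits[10];           /* stands in for the pushes onto the stack */
    int count = 0;
    char *out = buf;
    uint32_t n = (uint32_t)value;
    if (value < 0) {
        *out++ = '-';
        n = 0u - n;            /* neg eax: magnitude is at most 2^31 */
    }
    do {                       /* first loop: least significant digit first */
        /* 0x66666667 is 4 * 1/10, adequate here because n <= 2^31 */
        uint32_t q = (uint32_t)(((uint64_t)n * 0x66666667u) >> 34);
        digits[count++] = (char)(n - (q * 10u - '0'));  /* remainder + '0' */
        n = q;
    } while (n != 0);
    while (count > 0)          /* second loop: most significant digit first */
        *out++ = digits[--count];
    *out = '\0';
    return (int)(out - buf);
}
```

The assembly avoids the separate array by reusing ebx and the stack: each push saves the previous digit, and the last pop restores the caller's ebx.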