Dword to ascii (dw2a, dwtoa, dw2str, Str$, ...)

Started by jj2007, April 01, 2024, 03:42:50 AM


jj2007

Quote from: jimg on April 04, 2024, 12:10:44 PM
And a final cleanup.

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)
Averages:
4638    cycles for dwtoa
2986    cycles for dw2str
3225    cycles for jgnorecurse
3042    cycles for dw2$X

Averages:
4616    cycles for dwtoa
2988    cycles for dw2str
3232    cycles for jgnorecurse
3042    cycles for dw2$X

Averages:
4608    cycles for dwtoa
2982    cycles for dw2str
3210    cycles for jgnorecurse
3026    cycles for dw2$X

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Averages:
7034    cycles for dwtoa
5718    cycles for dw2str
4764    cycles for jgnorecurse
6172    cycles for dw2$X

Averages:
7232    cycles for dwtoa
5883    cycles for dw2str
4932    cycles for jgnorecurse
6285    cycles for dw2$X

Averages:
7382    cycles for dwtoa
6289    cycles for dw2str
5239    cycles for jgnorecurse
6760    cycles for dw2$X

Looks like an Intel vs AMD problem...

jimg

That is a significant and surprising difference.   

jimg

Jochen-

Would you check this out if you get a chance?

This is the test program but with a conditional compile around test H.  On line 462 you should find

dotest=2  ; 1=test without DW2$X, else test with DW2$X

if dotest eq 1


Case 1 is just a copy of my routine for TestH.
Case 2 is your DW2$X routine.

When I run, my routine runs 120 cycles faster for case 1 than case 2.  This makes no sense to me, but it is consistent.  Does this happen on AMD also?

jj2007

Quote from: jimg on April 05, 2024, 01:44:25 AM
Would you check this out if you get a chance?

Averages:
4552    cycles for dwtoa
3041    cycles for dw2str
3176    cycles for jgnorecurse
2993    cycles for dw2$X

Averages:
4450    cycles for dwtoa
2845    cycles for dw2str
3072    cycles for jgnorecurse
3072    cycles for jgnorecurse2

Timings are not very consistent :rolleyes:

Same but with TimerLoops = 100000 (the default - you set it very low at 1000):

Averages:
4480    cycles for dwtoa
2870    cycles for dw2str
3100    cycles for jgnorecurse
3099    cycles for jgnorecurse2

Averages:
4597    cycles for dwtoa
2968    cycles for dw2str
3227    cycles for jgnorecurse
3016    cycles for dw2$X

Hmmmm...

jimg

Looks like it ran an extra 100 cycles on the first test, and an extra 127 on the second test, so you are seeing the same effect.  Strange.

zedd151

From reply #106

Intel(R) Core(TM)2 Duo CPU    E8400  @ 3.00GHz (SSE4)
61      bytes for other

9439    cycles for 100 * dwtoa
7928    cycles for 100 * dw2str
7370    cycles for 100 * jgnorecurse
7980    cycles for 100 * dw2$X

9527    cycles for 100 * dwtoa
8051    cycles for 100 * dw2str
7511    cycles for 100 * jgnorecurse
7954    cycles for 100 * dw2$X

9533    cycles for 100 * dwtoa
8081    cycles for 100 * dw2str
7475    cycles for 100 * jgnorecurse
7945    cycles for 100 * dw2$X

9567    cycles for 100 * dwtoa
8053    cycles for 100 * dw2str
7472    cycles for 100 * jgnorecurse
7942    cycles for 100 * dw2$X

Averages:
9530    cycles for dwtoa
8052    cycles for dw2str
7474    cycles for jgnorecurse
7950    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
80      bytes for jgnorecurse
106    bytes for dw2$X

dwtoa                                  -123456789
dw2str                                  -123456789
jgnorecurse                            -123456789
dw2$X                                  -123456789

--- ok ---
:smiley:

Windows 10 is the best.  :azn:

six_L

Hi, Lingo
Quote
but the jimg_qword2ascii result is not right. Maybe I translated it incorrectly.
1) Now I know why. We can't use 01999,9999,9999,9999h (1/10); we must use 0ccc,cccc,cccc,cccdh (8 * 1/10). Otherwise qword2ascii fails when the qword contains a few zeros.
2) You save the result in "little endian" order, which eliminates unnecessary code.
3) {lea rdx,[rdx*4+rdx-18h]}: we really couldn't have come up with that ourselves. That's very nice.
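
Point 1 can be checked outside the assembler with a short C sketch. It is not anyone's posted routine: the function names are made up for illustration, and `unsigned __int128` is a GCC/Clang extension assumed here to stand in for the RDX half of a 64-bit `mul`. The truncated constant (2^64/10 rounded down) undershoots on exact multiples of ten, which would explain failures on values with trailing zeros, while the rounded-up 8/10 constant followed by a shift of 3 does not.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a*b (what MUL leaves in RDX).
   unsigned __int128 is a compiler extension, assumed available. */
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* 1/10 as floor(2^64 / 10): rounded down, so the quotient can come out
   one too small. */
uint64_t div10_truncated(uint64_t n)
{
    return mulhi64(n, 0x1999999999999999ULL);
}

/* 8/10 as ceil(2^67 / 10), then shift out the factor of 8. */
uint64_t div10_scaled(uint64_t n)
{
    return mulhi64(n, 0xCCCCCCCCCCCCCCCDULL) >> 3;
}
```

With these definitions, `div10_truncated(10)` yields 0 instead of 1, while `div10_scaled` agrees with plain `n / 10` on the values tried, including the largest 64-bit value.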

I learned a lot of skills from you.
Thank you very much!

Regards

Say you, Say me, Say the codes together for ever.

jimg

Thanks for that.  I was just getting around to testing.   So for now, back to the old way.


.DATA?
align 4
resultJim        db 16 dup(?)
.CODE
alignProc
TestG_s:
db 2 dup (?) ; dummy spacer to align instructions best
dword2ascii:  ; setup
    test eax,eax
    jnz @f
    mov word ptr [edi],30h
    ret
@@: jns @f
    mov byte ptr [edi],'-'
    neg eax
    add edi, 1
@@: mov ecx,0CCCCCCCDh  ; 8 * 1/10
    push esi
    mov esi,0
    .repeat
      push ebx            ; save ebx, or a digit
      inc esi
      mov ebx,eax         ; save original
      mul ecx             ; 0CCCCCCCDh =8 * 1/10
      shr edx,3           ; take out the factor of 8
      mov eax,edx         ; save answer
      lea edx,[edx*4+edx-24] ; *5  also subtract out half of '0' (24) to convert last byte to ascii
      add edx,edx         ; *10    double it, so full value of '0' is covered.  so -48 extra to dl
      sub ebx,edx         ; subtract from original gives remainder digit  -(-48) =+48=+'0' bl now in ascii
      test eax,eax        ; are we done?
    .until zero?
    .repeat
      mov [edi],bl
      inc edi
      pop ebx      ; restore ebx, or get next digit
      dec esi
    .until zero?
    pop esi
    ret
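
For readers who want to follow the arithmetic of one loop pass, here is a rough C transcription. It is a sketch, not part of the test program: `peel_digit` is a hypothetical name, and unsigned wraparound in C stands in for 32-bit register arithmetic. It shows why subtracting 24 before the doubling leaves the remainder already converted to ASCII.

```c
#include <stdint.h>

/* One pass of the digit loop. Returns the ASCII digit (what ends up in bl)
   and stores the quotient (what ends up in eax) in *quot. */
char peel_digit(uint32_t n, uint32_t *quot)
{
    /* mul ecx / shr edx,3 : q = n / 10 via 0CCCCCCCDh = ceil(2^35 / 10) */
    uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);

    uint32_t t = q * 5u - 24u;   /* lea edx,[edx*4+edx-24] */
    t += t;                      /* add edx,edx : t = 10*q - 48 (mod 2^32) */

    *quot = q;
    return (char)(n - t);        /* sub ebx,edx : (n - 10*q) + 48 = digit + '0' */
}
```

Calling it repeatedly until the quotient reaches zero produces the digits from least to most significant, which is why the assembly version parks them on the stack and pops them back in reverse.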

jimg

After a little testing, it turns out that two bits was enough, i.e. could use 66666667h to multiply, and shr 2, but no gain in speed so I'll leave it at 3 bits ( *8) to be compatible with everyone else.
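
A quick way to see how far the two-bit version reaches is to model it in C. This is only a sketch with hypothetical function names. By my arithmetic the 66666667h / shr 2 pair is exact for every magnitude a signed dword can produce after `neg` (up to 2^31), but not over the full unsigned range; 2863311539 is one input where it comes out one too high, whereas the 0CCCCCCCDh / shr 3 pair stays exact.

```c
#include <stdint.h>

/* n / 10 via 66666667h = ceil(2^34 / 10): mul, then shr edx,2 */
uint32_t div10_shift2(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0x66666667u) >> 34);
}

/* n / 10 via 0CCCCCCCDh = ceil(2^35 / 10): mul, then shr edx,3 */
uint32_t div10_shift3(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```

So for a signed dwtoa replacement either constant is sufficient, and the three-bit version is the one to keep if the same loop should also serve unsigned input.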

six_L

Hi, jimg
Quote
use 66666667h to multiply
This is a significant improvement.
Quote
00000374 clock cycles, (lingo_i2a64)x1000, OutPut: 98765432109876  ;use 8*1/10
00000360 clock cycles, (lingo_i2a64+)x1000, OutPut: 98765432109876 ;use 4*1/10
Thank you very much!

Regards

jimg

It shouldn't really make a difference, as you have to shift anyway. Also, I tested only 32-bit numbers, so it may not hold for 64-bit.

jj2007

Quote from: jimg on April 05, 2024, 04:30:50 AM
After a little testing, it turns out that two bits was enough, i.e. could use 66666667h to multiply, and shr 2, but no gain in speed so I'll leave it at 3 bits ( *8) to be compatible with everyone else.

66666667h/2 and shr 1 would be a tick faster. The sh* reg32, 1 instructions are faster than shifts with a count higher than one. Unfortunately, that limits the range of valid multiplies - try the numbers above 1,073,741,824 :cool:
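
The range limit is easy to reproduce in C. The sketch below uses hypothetical function names and assumes the constant is rounded up to 33333334h (ceil(2^33 / 10)). A plain integer halving of 66666667h gives 33333333h, which by my arithmetic already returns 0 for an input of 10.

```c
#include <stdint.h>

/* n / 10 via 33333334h = ceil(2^33 / 10): mul, then shr edx,1 */
uint32_t div10_shift1(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0x33333334u) >> 33);
}

/* Same shift with the rounded-down constant 33333333h, for comparison */
uint32_t div10_shift1_floor(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0x33333333u) >> 33);
}
```

With the rounded-up constant the result is exact below 1,073,741,824 (2^30); above that, inputs ending in 9 such as 1073741829 come out one too high.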

six_L

Hi, jj2007
Quote
66666667h/2 and shr 1 would be a tick faster.
:cool:
This is another improvement.
Could you test the speed of the code below?
Ldword2ascii proc uses ebx esi edi dwVaule:DWORD,pOutBuf:DWORD
    local    nFlag:BOOL
   
    mov    nFlag,FALSE
    mov    eax,dwVaule
    mov    edi,pOutBuf
    lea    esi,[edi+16]
    test    eax,eax
    jnz    @f
    mov    word ptr [esi],30h
    jmp    @exit
@@:
    jns    @f
    mov    nFlag,TRUE
    neg    eax
@@:
    mov    ecx,033333334h            ; 2*(1/10), exact only for values below 1,073,741,824 (2^30)
    .repeat
        mov    ebx,eax            ; save original
        mul    ecx            ; 033333334h = 2*(1/10)
        shr    edx,1            ; /2
        mov    edi,edx            ; edi = quotient eax/10, kept for the next pass
        lea    edx,[edx*4+edx-18h]    ; *5  also subtract out half of '0' (24) to convert last byte to ascii  ala lingo
        lea    edx,[edx+edx]        ; *10    double it, so full value of '0' is covered.  so -48 extra to dl
        sub    ebx,edx            ; subtract from original gives remainder digit  -(-48) =+48=+'0' bl now in ascii
        sub    esi,1
        mov    [esi],bl
        mov    eax,edi            ; continue with the quotient
        test    eax,eax            ; are we done?
    .until zero?
    .if    nFlag == TRUE
        sub    esi, 1
        mov    byte ptr [esi],'-'
    .endif
@exit:   
    mov    eax,esi                ;eax output addr in pOutBuf

    ret

Ldword2ascii  endp

local Outbuf[64] :BYTE

    invoke    RtlZeroMemory,addr Outbuf,sizeof Outbuf
        invoke    Ldword2ascii, -9076541,addr Outbuf
        invoke    MessageBox,0,eax,0,0
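
The back-to-front layout used above (fill the buffer from offset 16 downwards and return a pointer to the first character) can be sketched in C for checking results. This is an illustration only: the function name is hypothetical, and it deliberately swaps in the 0CCCCCCCDh / shift-by-3 constant so that it stays exact across the whole dword range rather than mirroring the 33333334h variant being timed.

```c
#include <stdint.h>

/* Writes the decimal text of v so that its terminator sits at buf[16] and
   returns a pointer to the first character. buf must hold at least 17 bytes;
   the longest output, "-2147483648", needs 11 characters plus the terminator. */
char *dword_to_ascii_backward(int32_t v, char *buf)
{
    char *p = buf + 16;
    /* magnitude as unsigned, like neg eax; well defined even for INT32_MIN */
    uint32_t n = (v < 0) ? 0u - (uint32_t)v : (uint32_t)v;

    *p = '\0';
    do {
        uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35); /* n / 10 */
        *--p = (char)('0' + (n - q * 10u));                         /* n % 10 */
        n = q;
    } while (n != 0);

    if (v < 0)
        *--p = '-';
    return p;
}
```

Because the digits are stored as they are produced, no second reversing loop is needed; the cost is that the text does not start at the beginning of the buffer, so the caller has to use the returned pointer.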

jimg

This is it for me.  This is the complete proc.  Overall, about a 10% improvement over dwtoa (dwtoa averaged 72 cycles, dword2ascii averaged 64 cycles).



align 16
dword2ascii proc uses esi edi, Value,buff
   ; converts 32 bit signed integer to ascii
   ; Value to be converted
   ; buff = address of where to store the results
    mov eax,Value
    mov edi,buff
    test eax,eax
    jnz @f
    mov word ptr [edi],30h
    ret
@@: jns @f
    mov byte ptr [edi],'-'
    neg eax 
    add edi, 1
@@: mov ecx,66666667h    ; 4 * 1/10
    push esi
    xor esi,esi
    .repeat
      push ebx           ; save ebx, or a digit
      inc esi
      mov ebx,eax        ; save original
      mul ecx            ; 66666667h = 4 * 1/10
      shr edx,2          ; take out the factor of 4     
      mov eax,edx        ; save answer
      add edx,edx        ; *2
      lea edx,[edx*4+edx-'0'] ; *10  also subtract out '0' to convert last byte to ascii
      sub ebx,edx        ; subtract from original gives remainder, digit - (-'0') = +'0' bl now in ascii
      test eax,eax       ; are we done?
    .until zero?
    .repeat
      mov [edi],bl
      inc edi
      pop ebx      ; restore ebx, or get next digit
      dec esi
    .until zero?
    mov byte ptr [edi],0
    pop esi
    ret
dword2ascii endp