Dword to ascii (dw2a, dwtoa, dw2str, Str$, ...)

jj2007 · April 04, 2024, 07:53:29 PM

Quote from: lingo on April 04, 2024, 03:08:53 PM
a very stupid thief, ha-ha-hah...

See Bravo, Lingo! in the Colosseum ;-)

jj2007 · April 04, 2024, 10:31:14 PM

Quote from: jimg on April 04, 2024, 12:10:44 PM
And a final cleanup.

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)
Averages:
4638    cycles for dwtoa
2986    cycles for dw2str
3225    cycles for jgnorecurse
3042    cycles for dw2$X

Averages:
4616    cycles for dwtoa
2988    cycles for dw2str
3232    cycles for jgnorecurse
3042    cycles for dw2$X

Averages:
4608    cycles for dwtoa
2982    cycles for dw2str
3210    cycles for jgnorecurse
3026    cycles for dw2$X

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Averages:
7034    cycles for dwtoa
5718    cycles for dw2str
4764    cycles for jgnorecurse
6172    cycles for dw2$X

Averages:
7232    cycles for dwtoa
5883    cycles for dw2str
4932    cycles for jgnorecurse
6285    cycles for dw2$X

Averages:
7382    cycles for dwtoa
6289    cycles for dw2str
5239    cycles for jgnorecurse
6760    cycles for dw2$X

Looks like an Intel vs AMD problem...

jimg · April 05, 2024, 12:32:01 AM

That is a significant and surprising difference.

jimg · April 05, 2024, 01:44:25 AM

Jochen-

Would you check this out if you get a chance?

This is the test program, but with a conditional compile around test H. On line 462 you should find:

dotest=2 ; 1=test without DW2$X, else test with DS2$X

if dotest eq 1

Case 1 is just a copy of my routine for TestH.
Case 2 is your DW2$X routine.

When I run it, my routine is 120 cycles faster in case 1 than in case 2. This makes no sense to me, but it is consistent. Does this happen on AMD also?

jj2007 · April 05, 2024, 02:15:48 AM

Quote from: jimg on April 05, 2024, 01:44:25 AM
Would you check this out if you get a chance?

Averages:
4552    cycles for dwtoa
3041    cycles for dw2str
3176    cycles for jgnorecurse
2993    cycles for dw2$X

Averages:
4450    cycles for dwtoa
2845    cycles for dw2str
3072    cycles for jgnorecurse
3072    cycles for jgnorecurse2

Timings are not very consistent.

Same but with TimerLoops = 100000 (the default - you set it very low at 1000):

Averages:
4480    cycles for dwtoa
2870    cycles for dw2str
3100    cycles for jgnorecurse
3099    cycles for jgnorecurse2

Averages:
4597    cycles for dwtoa
2968    cycles for dw2str
3227    cycles for jgnorecurse
3016    cycles for dw2$X

Hmmmm...

jimg · April 05, 2024, 02:25:13 AM

Looks like it ran an extra 100 cycles on the first test, and an extra 127 on the second test, so you are seeing the same effect. Strange.

zedd · April 05, 2024, 03:04:00 AM

From reply #106

Intel(R) Core(TM)2 Duo CPU    E8400  @ 3.00GHz (SSE4)
61      bytes for other

9439    cycles for 100 * dwtoa
7928    cycles for 100 * dw2str
7370    cycles for 100 * jgnorecurse
7980    cycles for 100 * dw2$X

9527    cycles for 100 * dwtoa
8051    cycles for 100 * dw2str
7511    cycles for 100 * jgnorecurse
7954    cycles for 100 * dw2$X

9533    cycles for 100 * dwtoa
8081    cycles for 100 * dw2str
7475    cycles for 100 * jgnorecurse
7945    cycles for 100 * dw2$X

9567    cycles for 100 * dwtoa
8053    cycles for 100 * dw2str
7472    cycles for 100 * jgnorecurse
7942    cycles for 100 * dw2$X

Averages:
9530    cycles for dwtoa
8052    cycles for dw2str
7474    cycles for jgnorecurse
7950    cycles for dw2$X

20      bytes for dwtoa
82      bytes for dw2str
80      bytes for jgnorecurse
106    bytes for dw2$X

dwtoa                                  -123456789
dw2str                                  -123456789
jgnorecurse                            -123456789
dw2$X                                  -123456789

--- ok ---

six_L · April 05, 2024, 03:25:12 AM

Hi, Lingo

Quote
but the jimg_qword2ascii does not give the right result. Maybe I translated it incorrectly.

1) I now know why. We can't use 01999999999999999h (1/10); we must use 0CCCCCCCCCCCCCCCDh (8 * 1/10). Otherwise qword2ascii fails when the qword contains a few zeros.
2) You store the result in little-endian order, which eliminates the unnecessary code.
3) {lea rdx,[rdx*4+rdx-18h]}: we really couldn't have come up with that ourselves. That's very nice.

I learned a lot of skills from you.
Thank you very much!

Regards
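
Point 1 can be checked outside the assembler. The C sketch below is not Lingo's or jimg's code: `mulhi64`, `div10_with` and the two macro names are hypothetical, and `mulhi64` stands in for the upper half (RDX) that a 64-bit MUL produces. It compares the truncated reciprocal 1999999999999999h with the rounded-up, pre-scaled 0CCCCCCCCCCCCCCCDh followed by a right shift of 3.

```c
#include <stdint.h>

/* Upper 64 bits of a 64x64-bit unsigned product, i.e. what MUL leaves in RDX.
   Built from 32-bit halves so no 128-bit integer type is needed. */
static uint64_t mulhi64(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    /* Sum of the middle terms; cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Hypothetical helper: approximate n / 10 as (n * magic) >> (64 + shift). */
static uint64_t div10_with(uint64_t n, uint64_t magic, unsigned shift) {
    return mulhi64(n, magic) >> shift;
}

#define MAGIC_TRUNCATED 0x1999999999999999ULL /* floor(2^64 / 10), used without a shift */
#define MAGIC_ROUNDED   0xCCCCCCCCCCCCCCCDULL /* ceil(2^67 / 10), followed by shr 3 */
```

With these definitions, `div10_with(10, MAGIC_TRUNCATED, 0)` yields 0 rather than 1: the truncated constant is slightly smaller than 2^64/10, so exact multiples of ten (values ending in a zero digit) come out one too low. `div10_with(10, MAGIC_ROUNDED, 3)` yields 1.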

jimg · April 05, 2024, 04:13:46 AM

Thanks for that. I was just getting around to testing. So for now, back to the old way.

.DATA?
align 4
resultJim db 16 dup(?)
.CODE
alignProc
TestG_s:
db 2 dup (?) ; dummy spacer to align instructions best
dword2ascii: ; setup
test eax,eax
jnz @f
mov word ptr [edi],30h
ret
@@: jns @f
mov byte ptr [edi],'-'
neg eax
add edi, 1
@@: mov ecx,0CCCCCCCDh ; 8 * 1/10
push esi
mov esi,0
.repeat
push ebx ; save ebx, or a digit
inc esi
mov ebx,eax ; save original
mul ecx ; 0CCCCCCCDh =8 * 1/10
shr edx,3 ; take out the factor of 8
mov eax,edx ; save answer
lea edx,[edx*4+edx-24] ; *5 also subtract out half of '0' (24) to convert last byte to ascii
add edx,edx ; *10 double it, so full value of '0' is covered. so -48 extra to dl
sub ebx,edx ; subtract from original gives remainder digit -(-48) =+48=+'0' bl now in ascii
test eax,eax ; are we done?
.until zero?
.repeat
mov [edi],bl
inc edi
pop ebx ; restore ebx, or get next digit
dec esi
.until zero?
pop esi
ret
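
The comments on the lea/add/sub lines compress a small piece of arithmetic: with q = n / 10, (5q - 24) * 2 = 10q - 48, and n - (10q - 48) = (n mod 10) + '0'. A minimal C model of just that step follows; the function names are hypothetical, and unsigned 32-bit wraparound stands in for the registers.

```c
#include <stdint.h>

/* Models: lea edx,[edx*4+edx-24] / add edx,edx / sub ebx,edx
   n is the value before the division (EBX), q is n / 10 (EDX after the shr). */
static char digit_half_bias(uint32_t n, uint32_t q) {
    uint32_t t = q * 5u - 24u;  /* 5q - 24: half of '0' folded in; may wrap, as in the register */
    t += t;                     /* 10q - 48 */
    return (char)(n - t);       /* (n - 10q) + 48 = remainder + '0' */
}

/* The plain form, for comparison. */
static char digit_plain(uint32_t n) {
    return (char)('0' + n % 10u);
}
```

Both functions return the same character for any n when q is the true quotient; the wraparound for small q cancels out in the final subtraction.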

jimg · April 05, 2024, 04:30:50 AM

After a little testing, it turns out that two bits was enough, i.e. could use 66666667h to multiply, and shr 2, but no gain in speed so I'll leave it at 3 bits ( *8) to be compatible with everyone else.

six_L · April 05, 2024, 04:52:03 AM

Hi, jimg

Quote
use 66666667h to multiply

This is a significant improvement.

Quote
00000374 clock cycles, (lingo_i2a64)x1000, OutPut: 98765432109876 ;use 8*1/10
00000360 clock cycles, (lingo_i2a64+)x1000, OutPut: 98765432109876 ;use 4*1/10

Thank you very much!

Regards

jimg · April 05, 2024, 05:06:11 AM

It shouldn't really make a difference, as you have to shift anyway. Also, I tested only 32-bit numbers; it may not hold for 64-bit.

jj2007 · April 05, 2024, 09:54:20 AM

Quote from: jimg on April 05, 2024, 04:30:50 AM
After a little testing, it turns out that two bits was enough, i.e. could use 66666667h to multiply, and shr 2, but no gain in speed so I'll leave it at 3 bits ( *8) to be compatible with everyone else.

66666667h/2 and shr 1 would be a tick faster. The sh* reg32, 1 instructions are faster than shifts with a count higher than one. Unfortunately, that limits the range of valid multiplies: try numbers above 1,073,741,824.
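
The range limit can be reproduced away from the timing harness. The C sketch below is an illustration, not code from the thread; `div10` and `first_mismatch` are hypothetical names. It models `mul ecx` followed by `shr edx,k` for a given multiplier and reports the first input in a window where the result differs from a true division. For the halved constant it assumes the rounded-up value 33333334h.

```c
#include <stdint.h>

/* Divide by ten the way the loops in this thread do:
   MUL by the magic constant, keep the high dword (EDX), shift right. */
static uint32_t div10(uint32_t n, uint32_t magic, unsigned shift) {
    uint64_t product = (uint64_t)n * magic;
    return (uint32_t)(product >> 32) >> shift;
}

/* Returns the first n in [from, to] where div10 disagrees with n / 10,
   or 0 if every value in the window agrees (n = 0 itself never disagrees). */
static uint32_t first_mismatch(uint32_t magic, unsigned shift,
                               uint32_t from, uint32_t to) {
    for (uint64_t n = from; n <= to; n++) {   /* 64-bit counter: safe when to == 0xFFFFFFFF */
        if (div10((uint32_t)n, magic, shift) != (uint32_t)n / 10u)
            return (uint32_t)n;
    }
    return 0;
}
```

Scanning a window around 2^30 with (33333334h, shr 1) reports 1,073,741,829, in line with the limit of 1,073,741,824 given above. With (66666667h, shr 2), a window around 2,863,311,531 reports 2,863,311,539; that is beyond 2,147,483,648, the largest magnitude a negated signed dword can have, which fits the observation that two bits are enough for signed input. (0CCCCCCCDh, shr 3) is exact for every unsigned 32-bit input.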

six_L · April 06, 2024, 03:05:42 AM

Hi, jj2007

Quote
66666667h/2 and shr 1 would be a tick faster.

This is another improvement.
Could you test the speed of the code below?

Ldword2ascii proc uses ebx esi edi dwValue:DWORD,pOutBuf:DWORD
    local    nFlag:BOOL
    
    mov    nFlag,FALSE
    mov    eax,dwValue
    mov    edi,pOutBuf
    lea    esi,[edi+16]
    test    eax,eax
    jnz    @f
    mov    word ptr [esi],30h
    jmp    @exit
@@: 
    jns    @f
    mov    nFlag,TRUE
    neg    eax
@@: 
    mov    ecx,033333334h            ; 2*(1/10)
    .repeat
        mov    ebx,eax            ; save original
        mul    ecx            ; 033333334h = 2*(1/10)
        shr    edx,1            ; /2
        mov    edi,edx            ; edi = quotient (eax/10)
        lea    edx,[edx*4+edx-18h]    ; *5  also subtract out half of '0' (24) to convert last byte to ascii  ala lingo
        lea    edx,[edx+edx]        ; *10    double it, so full value of '0' is covered.  so -48 extra to dl
        sub    ebx,edx            ; subtract from original gives remainder digit  -(-48) =+48=+'0' bl now in ascii
        sub    esi,1
        mov    [esi],bl
        mov    eax,edi            ; save answer
        test    eax,eax            ; are we done?
    .until zero?
    .if    nFlag == TRUE
        sub    esi, 1
        mov    byte ptr [esi],'-'
    .endif
@exit:    
    mov    eax,esi                ;eax output addr in pOutBuf

    ret

Ldword2ascii  endp

local Outbuf[64] :BYTE

invoke RtlZeroMemory,addr Outbuf,sizeof Outbuf
invoke Ldword2ascii, -9076541,addr Outbuf
invoke MessageBox,0,eax,0,0
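
For reference, the approach in Ldword2ascii (fill the buffer from the back and return a pointer to the first character, so no reversal pass is needed) can be modelled in C. This is a sketch under assumptions, not a timed translation: the name `dword_to_ascii_backward` and the 16-byte minimum buffer are illustrative, it writes its own terminator instead of relying on a zeroed buffer, and it uses plain `/` and `%` rather than the 33333334h multiply, so it covers the full 32-bit range.

```c
#include <stdint.h>

/* Writes the decimal text of value into the tail of buf (at least 16 bytes)
   and returns a pointer to its first character. */
static char *dword_to_ascii_backward(int32_t value, char *buf) {
    char *p = buf + 15;             /* last byte of the 16-byte area */
    uint32_t n = (uint32_t)value;
    if (value < 0)
        n = 0u - n;                 /* like NEG EAX; also correct for -2147483648 */
    *p = '\0';
    do {
        *--p = (char)('0' + n % 10u);   /* least significant digit first, moving left */
        n /= 10u;
    } while (n != 0);
    if (value < 0)
        *--p = '-';
    return p;
}
```

As in the assembly version, the caller must use the returned pointer rather than the start of the buffer, because the text is right-aligned inside it.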

jimg · April 06, 2024, 03:14:24 AM

This is it for me. This is the complete proc. Overall, about a 10% improvement over dwtoa (dwtoa averaged 72 cycles, dword2ascii averaged 64 cycles).

align 16
dword2ascii proc uses esi edi, Value,buff
; converts 32 bit signed integer to ascii
; Value to be converted
; buff = address of where to store the results
mov eax,Value
mov edi,buff
test eax,eax
jnz @f
mov word ptr [edi],30h
ret
@@: jns @f
mov byte ptr [edi],'-'
neg eax
add edi, 1
@@: mov ecx,66666667h ; 4 * 1/10
push esi
xor esi,esi
.repeat
push ebx ; save ebx, or a digit
inc esi
mov ebx,eax ; save original
mul ecx ; 66666667h = 4 * 1/10
shr edx,2 ; take out the factor of 4
mov eax,edx ; save answer
add edx,edx ; *2
lea edx,[edx*4+edx-'0'] ; *10 also subtract out '0' to convert last byte to ascii
sub ebx,edx ; subtract from original gives remainder, digit - (-'0') = +'0' bl now in ascii
test eax,eax ; are we done?
.until zero?
.repeat
mov [edi],bl
inc edi
pop ebx ; restore ebx, or get next digit
dec esi
.until zero?
mov byte ptr [edi],0
pop esi
ret
dword2ascii endp
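
A C model of the proc above can serve as a reference when checking its output. It is a sketch, not jimg's code: the name `dword2ascii_model` and the local `stack` array (standing in for the pushes and pops) are assumptions, while the division uses the same 66666667h multiply and shift by 2.

```c
#include <stdint.h>

/* Converts a signed 32-bit value to a zero-terminated decimal string in buf
   (at least 12 bytes). Returns buf. */
static char *dword2ascii_model(int32_t value, char *buf) {
    char stack[10];                 /* a dword has at most 10 decimal digits */
    int count = 0;
    char *out = buf;
    uint32_t n = (uint32_t)value;

    if (value == 0) {               /* the early-out path: "0" plus terminator */
        buf[0] = '0';
        buf[1] = '\0';
        return buf;
    }
    if (value < 0) {
        *out++ = '-';
        n = 0u - n;                 /* NEG EAX; 80000000h stays 2147483648 when read as unsigned */
    }
    do {
        /* mul ecx / shr edx,2 with ecx = 66666667h */
        uint32_t q = (uint32_t)(((uint64_t)n * 0x66666667u) >> 32) >> 2;
        stack[count++] = (char)(n - (q * 10u - '0'));  /* remainder + '0' */
        n = q;
    } while (n != 0);
    while (count > 0)               /* pop digits, most significant first */
        *out++ = stack[--count];
    *out = '\0';
    return buf;
}
```

The multiply-and-shift is only relied on for magnitudes up to 2,147,483,648, which is all a signed dword can produce after negation.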

The MASM Forum
