Dword to ascii (dw2a, dwtoa, dw2str, Str$, ...)

Started by jj2007, April 01, 2024, 03:42:50 AM


jimg

So....
I know I said I was done before, but with all the changes, here's the latest.
You can blame NoCforMe.


.DATA?
align 4
resultJim        db 16 dup(?)
.CODE
alignProc
TestG_s:
db 2 dup (?) ; dummy spacer to align instructions best
ctAnyToAnyMulJim:  ; setup  (needs a better name)
  test eax,eax
  jnz @f
  mov word ptr [edi],30h
  ret

@@: jns @f
    mov byte ptr [edi],'-'
    neg eax
    add edi, 1
@@: mov ecx,1999999Ah  ; ceil(2^32/10); edx after mul is eax/10 for eax < 2^30
    call ctanydoit
    ret

align 16
nop
ctanydoit:            ; this one needs a better name also
  push ebx            ; save ebx, or a digit
  mov ebx,eax        ; save original
  mul ecx            ; edx = eax/10 (ecx = 1999999Ah)
  mov eax,edx        ; save answer
  lea edx,[edx*4+edx] ; *5
  add edx,edx        ; *10
  sub ebx,edx        ; subtract from original gives remainder digit
  test eax,eax
  .if !zero?
      call ctanydoit ; generate next digit
  .endif
  add bl,'0'
  mov [edi],bl
  inc edi
  pop ebx ; restore ebx, or get next digit
  ret



results:

Averages:
5989    cycles for dwtoa
5380    cycles for dw2str
22244  cycles for MasmBasic Str$()
14633  cycles for Ray's algo I
5313    cycles for Ray's algo, mod JimG

Averages:
5862    cycles for dwtoa
5386    cycles for dw2str
22392  cycles for MasmBasic Str$()
14412  cycles for Ray's algo I
5341    cycles for Ray's algo, mod JimG

Averages:
5861    cycles for dwtoa
5376    cycles for dw2str
22837  cycles for MasmBasic Str$()
14746  cycles for Ray's algo I
5310    cycles for Ray's algo, mod JimG

Averages:
5856    cycles for dwtoa
5415    cycles for dw2str
22320  cycles for MasmBasic Str$()
14484  cycles for Ray's algo I
5230    cycles for Ray's algo, mod JimG

That's four consecutive passes, no cherry-picking.

Not bad for having used dwtoa for a couple decades.  Even beat that cheater dw2str.

Of course, now we need to update all those other routines with the changes (use the new 1/10 and don't shift) to get fair comparisons.
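
The "new 1/10, don't shift" idea can be checked outside the assembler. Below is a small Python model (the function names are made up for illustration and are not part of the test harness) of two ways to get eax/10 from the high half of a 32x32 multiply. It assumes the shifted variant is the usual 0CCCCCCCDh / shr 3 pair. How the no-shift constant is rounded matters: 19999999h is 2^32/10 rounded down, 1999999Ah is the same value rounded up.

```python
def div10_shift(x):
    """mul by 0CCCCCCCDh, take edx, then shr edx,3."""
    return ((x * 0xCCCCCCCD) >> 32) >> 3

def div10_noshift(x, m):
    """mul by m and use edx directly as the quotient (no shift)."""
    return (x * m) >> 32

samples = [9, 10, 123456789, 2**30 - 1, 1073741829, 2**32 - 1]
for x in samples:
    print(x, x // 10,
          div10_shift(x),
          div10_noshift(x, 0x19999999),   # rounded-down constant
          div10_noshift(x, 0x1999999A))   # rounded-up constant
```

In this model the rounded-down constant already misses at x = 10 (it returns 0, so the remainder becomes 10). The rounded-up constant agrees with x // 10 for every x below 2^30 and first goes wrong at 1073741829, so the no-shift form trades one instruction for a smaller valid range.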

I've attached JJ's routine with changes I made, because some things were skewing the results.

ahsat

Quote from: jimg on April 04, 2024, 10:52:19 AM
I know I said I was done before

A good programmer is never done, always looking for a better way. It's best to say, "done for now".

You certainly did a very good job of adapting the algorithm for base 10.

jimg

Well, with that being said......

If you weren't impressed with the last one, hang onto your socks.

I did something I've been meaning to try all along.
Personally I dislike recursion in the strongest terms.
So I got rid of the recursion.
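
Both versions rely on the same fact: dividing by 10 yields the digits least-significant first, so they have to be held somewhere until the most significant one is known. A rough Python sketch (illustrative names, not a translation of the assembly) of the two ways to hold them, the call stack versus an explicit stack with a counter:

```python
def digits_recursive(x):
    """Each call keeps one digit until the deeper calls have written theirs."""
    q, r = divmod(x, 10)
    head = digits_recursive(q) if q else ""
    return head + chr(48 + r)

def digits_stack(x):
    """Push digits in a loop, then pop them back in reverse order."""
    stack = []
    while True:
        x, r = divmod(x, 10)
        stack.append(r)          # plays the role of push ebx / inc esi
        if x == 0:
            break
    out = []
    while stack:                 # plays the role of pop ebx / dec esi
        out.append(chr(48 + stack.pop()))
    return "".join(out)

for n in (0, 7, 10, 123456789):
    print(n, digits_recursive(n), digits_stack(n))
```

The loop form does the same work without a call and return per digit, which is one plausible reason for the lower cycle counts.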


.DATA?
align 4
resultJim2        db 16 dup(?)
.CODE
alignProc
TestH_s:
db 2 dup (?) ; dummy spacer to align instructions best
dword2ascii:  ; setup
  test eax,eax
  jnz @f
  mov word ptr [edi],30h
  ret

@@: jns @f
    mov byte ptr [edi],'-'
    neg eax
    add edi, 1
@@: mov ecx,1999999Ah  ; ceil(2^32/10); edx after mul is eax/10 for eax < 2^30
    push esi
    mov esi,0
    call dword2asciiloop
    pop esi
    ret

align 16
  nop              ; code optimization
dword2asciiloop:
  push ebx            ; save ebx, or a digit
  inc esi
  mov ebx,eax        ; save original
  mul ecx            ; edx = eax/10 (ecx = 1999999Ah)
  mov eax,edx        ; save answer
  lea edx,[edx*4+edx] ; *5
  add edx,edx        ; *10
  sub ebx,edx        ; subtract from original gives remainder digit
  test eax,eax        ; are we done?
  jnz dword2asciiloop
  .repeat
      add bl,'0'
      mov [edi],bl
      inc edi
      pop ebx ; restore ebx, or get next digit
      dec esi
  .until zero?
  ret



results:

Averages:
5932    cycles for dwtoa
5363    cycles for dw2str
22231  cycles for MasmBasic Str$()
5204    cycles for Ray's algo, mod JimG
4581    cycles for jgnorecurse

Averages:
5859    cycles for dwtoa
5376    cycles for dw2str
22270  cycles for MasmBasic Str$()
5282    cycles for Ray's algo, mod JimG
4596    cycles for jgnorecurse

Averages:
5860    cycles for dwtoa
5376    cycles for dw2str
22506  cycles for MasmBasic Str$()
5282    cycles for Ray's algo, mod JimG
4595    cycles for jgnorecurse

Averages:
5855    cycles for dwtoa
5376    cycles for dw2str
22690  cycles for MasmBasic Str$()
5264    cycles for Ray's algo, mod JimG
4593    cycles for jgnorecurse

dwtoa                                  -123456789
dw2str                                  -123456789
MasmBasic Str$()                        -123456789
Ray's algo, mod JimG                    -123456789
jgnorecurse                            -123456789



jj2007

Quote from: jimg on April 04, 2024, 10:52:19 AM
Even beat that cheater dw2str

How dare you :rofl:

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)
Averages:
4618    cycles for dwtoa
2994    cycles for dw2str
3930    cycles for Ray's algo, mod JimG
3040    cycles for dw2$X

Averages:
4620    cycles for dwtoa
2994    cycles for dw2str
3951    cycles for Ray's algo, mod JimG
3044    cycles for dw2$X

Averages:
4634    cycles for dwtoa
3000    cycles for dw2str
3975    cycles for Ray's algo, mod JimG
3036    cycles for dw2$X

> If you weren't impressed with the last one, hang onto your socks.
P.S.: socks algo not yet included, sorry - it's 2:40 AM here...

jimg

And a final cleanup.  Only gained about 200 from the change.


alignProc
TestH_s:
db 2 dup (?) ; dummy spacer to align instructions best
dword2ascii:  ; setup
    test eax,eax
    jnz @f
    mov word ptr [edi],30h
    ret
@@: jns @f
    mov byte ptr [edi],'-'
    neg eax
    add edi, 1
@@: mov ecx,1999999Ah   ; ceil(2^32/10); edx after mul is eax/10 for eax < 2^30
    push esi
    mov esi,0
    .repeat
      push ebx            ; save ebx, or a digit
      inc esi
      mov ebx,eax         ; save original
      mul ecx             ; edx = eax/10 (ecx = 1999999Ah)
      mov eax,edx         ; save answer
      lea edx,[edx*4+edx-24] ; *5  also subtract out half of '0' (24) to convert last byte to ascii  ala lingo
      add edx,edx         ; *10    double it, so full value of '0' is covered.  so -48 extra to dl
      sub ebx,edx         ; subtract from original gives remainder digit  -(-48) =+48=+'0' bl now in ascii
      test eax,eax        ; are we done?
    .until zero?
    .repeat
      mov [edi],bl
      inc edi
      pop ebx      ; restore ebx, or get next digit
      dec esi
    .until zero?
    pop esi
    ret

NameH equ <jgnorecurse>

ahsat

Just read the comments.

jimg

I'm using Jochen's trick here.  I didn't see any gain in speed, but I left it in anyway-

    lea edx,[edx*4+edx-26] ; *5    also subtract out half of '0' (26) to convert last byte to ascii
    add edx,edx         ; *10      double it, so full value of '0' is covered.  so -52 extra to dl
    sub ebx,edx         ; subtract from original gives remainder digit  -(-52) =+52=+'0'   bl now in ascii

I'll add these to the code above

lingo

Quote
I'm using Jochen's trick here.  I didn't see any gain in speed, but I left it in anyway-

    lea edx,[edx*4+edx-26] ; *5    also subtract out half of '0' (26) to convert last byte to ascii
    add edx,edx        ; *10      double it, so full value of '0' is covered.  so -52 extra to dl
    sub ebx,edx        ; subtract from original gives remainder digit  -(-52) =+52=+'0'  bl now in ascii
26*2=52='4'  -> ASCII code

This trick was stolen from my post (see page 6 bottom in my post)  by a very stupid thief, ha-ha-hah... :badgrin:  :badgrin:  :badgrin:  :skrewy:
The correct value in my post is 18h = 24; doubled, it gives 30h = 48 = '0':
      lea  rcx, [rdx*4+rdx-18h]
        lea  rax, [rcx+rcx]
 ...........
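
The arithmetic behind the two constants can be modelled in a few lines of Python. This is only a sketch with an invented function name: it mirrors the lea / add / sub sequence, where x is the value being converted and q is x divided by 10.

```python
def ascii_digit(x, q, half_bias):
    t = 5 * q - half_bias    # lea edx,[edx*4+edx-half_bias]
    t += t                   # add edx,edx   -> 10*q - 2*half_bias
    return (x - t) & 0xFF    # sub ebx,edx   -> remainder + 2*half_bias, low byte

# 24 doubles to 48 = '0', so digit r comes out as the character '0'+r
print(chr(ascii_digit(123, 12, 24)))
# 26 doubles to 52 = '4', so every digit is shifted four characters up
print(chr(ascii_digit(120, 12, 26)))
```

With 24 the remainder 3 becomes '3'. With 26 a remainder of 0 becomes '4', which is the mismatch pointed out above.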

Quid sit futurum cras fuge quaerere.

jimg

You are absolutely correct.  My apologies.  I'm going to see why it works.

jimg

This is totally my fault.  I tried to modify one of Jochen's test procs to test my new proc, missed one value that I had to change, and so it picked up the answers from the previous test.  No excuse, just stupidity.  Had I been printing the correct results, I would have seen immediately that something was wrong.   Stupid.
Again, my apologies for not crediting the correct person.  Anyway, I'll fix the above code.
Please don't start a war over this; it was totally unintentional.

jimg

On another note, I said above that dw2str was a cheater.   That's because it requires one to preset what the length of the answer is going to be.  But of course, one doesn't know what the length is going to be.  By presetting it, it saves all the code to reverse the string.  Not usable unless you know you want a right-justified string, which may well be the case, just not for this test.

Unless I'm completely misreading the code, which, given what happened just above, is possible.
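
To illustrate the distinction being drawn (this is not dw2str's actual code, just a hypothetical Python sketch with made-up names): filling a fixed-width field from its right end needs no reversal, while a left-justified result of unknown length does.

```python
def right_justified(x, width, pad=" "):
    """Write digits from the end of a preset field; nothing to reverse."""
    buf = [pad] * width
    i = width
    while True:
        if i == 0:
            raise ValueError("field too narrow")
        x, r = divmod(x, 10)
        i -= 1
        buf[i] = chr(48 + r)
        if x == 0:
            break
    return "".join(buf)

def left_justified(x):
    """Length unknown up front: collect digits, then reverse them."""
    buf = []
    while True:
        x, r = divmod(x, 10)
        buf.append(chr(48 + r))
        if x == 0:
            break
    buf.reverse()
    return "".join(buf)

print(repr(right_justified(123, 6)), repr(left_justified(123)))
```

The first form is cheaper, but only when the caller already knows the field width and wants padding on the left.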

One other caveat.  I'm optimizing this on an Intel computer.  If you are running AMD, use Jochen's results to choose a routine.  Here are my latest results for three consecutive runs, with the screwup above fixed:

Averages:
5862    cycles for dwtoa
5389    cycles for dw2str
22890   cycles for MasmBasic Str$()
5332    cycles for Ray's algo, mod JimG
4280    cycles for jgnorecurse

Averages:
5876    cycles for dwtoa
5377    cycles for dw2str
23038   cycles for MasmBasic Str$()
5293    cycles for Ray's algo, mod JimG
4256    cycles for jgnorecurse

Averages:
5865    cycles for dwtoa
5376    cycles for dw2str
22772   cycles for MasmBasic Str$()
5220    cycles for Ray's algo, mod JimG
4256    cycles for jgnorecurse

20      bytes for dwtoa
82      bytes for dw2str
16      bytes for MasmBasic Str$()
100     bytes for Ray's algo, mod JimG
80      bytes for jgnorecurse

dwtoa                                   -123456789
dw2str                                  -123456789
MasmBasic Str$()                        -123456789
Ray's algo, mod JimG                    -123456789
jgnorecurse                             -123456789


six_L

Hi, jj2007

Quote
No negative numbers?

lingo_i2aA doesn't handle negative numbers.
For the sake of a fair race, I made a small modification to remove the negative-number handling.

Hi, Lingo
1)
Quote
00001088 clock cycles, (Ray_AnyToAny)x1000, OutPut: 987654321098
00000281 clock cycles, (lingo_i2a64)x1000, OutPut: 987654321098
00000386 clock cycles, (Roberts_dqtoa)x1000, OutPut: 987654321098
You are the fastest this time.

2)
If you have some spare time, could you help me check whether X64timers.asm is correct? It was translated from 32-bit asm.
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
  ; These two macros perform the grunt work involved in measuring the
  ; time taken by a block of code. These macros must
  ; be used in pairs, and the block of code must be placed in between
  ; the counter_begin and counter_end macro calls. The counter_end macro
  ; returns the count for a single pass through the block of
  ; code, corrected for the test loop overhead, in RAX. In this 64-bit
  ; version the count is in QueryPerformanceCounter ticks, not
  ; processor clock cycles.
  ;
  ; These macros require a .586 or higher processor directive.
  ;
  ; If your code is using MMX instructions and not executing an EMMS
  ; at the end of each MMX instruction sequence, defining the symbol
  ; _EMMS will cause the counter_end macro to insert an EMMS in front of
  ; the FPU instructions.
  ;
  ; The loopcount parameter should be set to a relatively high value to
  ; produce repeatable results.
  ;
  ; Note that setting the priority parameter to REALTIME_PRIORITY_CLASS
  ; involves some risk, as it will cause your process to preempt *all*
  ; other processes, including critical Windows processes. Setting the
  ; priority parameter to HIGH_PRIORITY_CLASS instead will significantly
  ; reduce the risk, and in most cases will produce the same cycle count.
;------------------------------------------------------------------------------
counter_begin MACRO loopcount:REQ, priority
    LOCAL label

    IFNDEF __counter__stuff__defined__
        __counter__stuff__defined__ equ <1>
        .data
        __counter__pc__count__delta LARGE_INTEGER <>
        __counter__pc__count__0     LARGE_INTEGER <>
        __counter__pc__count__1     LARGE_INTEGER <>
        __counter__pc__count__2     LARGE_INTEGER <>
        __counter__pc__count__3     LARGE_INTEGER <>
        __counter__loop__count__    dq 0
        __counter__loop__counter__  dq 0
        __counter__dq_count__       dq 0
        .code
    ENDIF

    mov   __counter__loop__count__, loopcount
    IFNB <priority>
        invoke GetCurrentProcess
        invoke SetPriorityClass, rax, priority
    ENDIF
    invoke QueryPerformanceCounter, ADDR __counter__pc__count__0
    mov   __counter__loop__counter__, loopcount
  @@:                                   ;; Start an empty reference loop
    sub   __counter__loop__counter__, 1
    jnz   @B
    invoke QueryPerformanceCounter, ADDR __counter__pc__count__1

    mov   rax, __counter__pc__count__1.QuadPart
    sub   rax, __counter__pc__count__0.QuadPart
    mov   __counter__pc__count__delta.QuadPart, rax   ;; Overhead count

    mov   __counter__loop__counter__, loopcount
    invoke QueryPerformanceCounter, ADDR __counter__pc__count__2
  label:                                ;; Start test loop
    __counter__loop__label__ equ <label>
ENDM
;------------------------------------------------------------------------------
counter_end MACRO
    sub   __counter__loop__counter__, 1
    jnz   __counter__loop__label__

    invoke QueryPerformanceCounter, ADDR __counter__pc__count__3

    invoke GetCurrentProcess
    invoke SetPriorityClass, rax, NORMAL_PRIORITY_CLASS

    IFDEF _EMMS
        EMMS
    ENDIF

    mov   rax, __counter__pc__count__3.QuadPart
    sub   rax, __counter__pc__count__2.QuadPart
    sub   rax, __counter__pc__count__delta.QuadPart
    mov   __counter__pc__count__3.QuadPart, rax       ;; count

    finit
    fild  __counter__pc__count__3.QuadPart
    fild  __counter__loop__count__
    fdiv
    fistp __counter__dq_count__

    mov   rax, __counter__dq_count__
ENDM
;------------------------------------------------------------------------------
  ; These two macros perform the grunt work involved in measuring the
  ; execution time in milliseconds for a specified number of loops
  ; through a block of code. These macros must be used in pairs, and
  ; the block of code must be placed in between the timer_begin and
  ; timer_end macro calls. The timer_end macro returns the elapsed
  ; milliseconds for the entire loop in RAX.
  ;
  ; These macros utilize the high-resolution performance counter.
  ; The return value will be zero if the high-resolution performance
  ; counter is not available.
  ;
  ; If your code is using MMX instructions and not executing an EMMS
  ; at the end of each MMX instruction sequence, defining the symbol
  ; _EMMS will cause the timer_end macro to insert an EMMS in front of
  ; the FPU instructions.
  ;
  ; The loopcount parameter should be set to a relatively high value to
  ; produce repeatable results.
  ;
  ; Note that setting the priority parameter to REALTIME_PRIORITY_CLASS
  ; involves some risk, as it will cause your process to preempt *all*
  ; other processes, including critical Windows processes. Setting the
  ; priority parameter to HIGH_PRIORITY_CLASS instead will significantly
  ; reduce the risk, and in most cases will produce very nearly the same
  ; result.
;------------------------------------------------------------------------------
timer_begin MACRO loopcount:REQ, priority
    LOCAL label

    IFNDEF __timer__stuff__defined__
        __timer__stuff__defined__ equ <1>
        .data
        __timer__pc__frequency__    LARGE_INTEGER <>
        __timer__pc__count__delta   LARGE_INTEGER <>
        __timer__pc__count__0       LARGE_INTEGER <>
        __timer__pc__count__1       LARGE_INTEGER <>
        __timer__pc__count__2       LARGE_INTEGER <>
        __timer__pc__count__3       LARGE_INTEGER <>
        __timer__loop__counter__    dq 0
        __timer__dq_count__         dq 0
        .code
    ENDIF

    invoke QueryPerformanceFrequency, ADDR __timer__pc__frequency__
    .if rax != 0

        IFNB <priority>
            invoke GetCurrentProcess
            invoke SetPriorityClass, rax, priority
        ENDIF

        invoke QueryPerformanceCounter, ADDR __timer__pc__count__0

        mov   __timer__loop__counter__, loopcount
      @@:                               ;; Start an empty reference loop
        sub   __timer__loop__counter__, 1
        jnz   @B

        invoke QueryPerformanceCounter, ADDR __timer__pc__count__1

        mov   rax, __timer__pc__count__1.QuadPart
        sub   rax, __timer__pc__count__0.QuadPart
        mov   __timer__pc__count__delta.QuadPart, rax   ;; Overhead count

        invoke QueryPerformanceCounter, ADDR __timer__pc__count__2   ;; Start test count

        mov   __timer__loop__counter__, loopcount
      label:                            ;; Start test loop
        __timer__loop__label__ equ <label>
    .endif
ENDM
;------------------------------------------------------------------------------
timer_end MACRO
    sub   __timer__loop__counter__, 1
    jnz   __timer__loop__label__

    invoke QueryPerformanceFrequency, ADDR __timer__pc__frequency__
    .IF rax != 0
        invoke QueryPerformanceCounter, ADDR __timer__pc__count__3

        invoke GetCurrentProcess
        invoke SetPriorityClass, rax, NORMAL_PRIORITY_CLASS

        IFDEF _EMMS
            EMMS
        ENDIF

        mov   rax, __timer__pc__count__3.QuadPart
        sub   rax, __timer__pc__count__2.QuadPart
        sub   rax, __timer__pc__count__delta.QuadPart
        mov   __timer__pc__count__3.QuadPart, rax   ;; count

        finit

        fild  __timer__pc__count__3.QuadPart
        fild  __timer__pc__frequency__.QuadPart
        fdiv                                        ;; ticks / ticks-per-second = seconds
        mov   __timer__dq_count__, 1000             ;; seconds * 1000 = milliseconds
        fild  __timer__dq_count__
        fmul
        fistp __timer__dq_count__
        mov   rax, __timer__dq_count__
    .ELSE
        xor   rax, rax        ;; No performance counter
    .ENDIF
ENDM
;------------------------------------------------------------------------------
SpinUp MACRO
    invoke Sleep, 1
    mov rax, 100000000 ; tell the CPU it's needed
    .Repeat
        dec rax
    .Until Sign?
ENDM
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
I changed "counter_begin 10, HIGH_PRIORITY_CLASS" to "counter_begin 20, HIGH_PRIORITY_CLASS"; it made no difference.
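
That outcome is what the macro arithmetic predicts: counter_end divides the overhead-corrected tick difference by the loop count, so the value in RAX is an average per pass, and doubling the loop count should leave it roughly unchanged. A minimal Python model of that arithmetic (the function name and the sample tick values are invented for illustration):

```python
def per_pass(t0, t1, t2, t3, loops):
    """t0..t3 stand for the four QueryPerformanceCounter readings."""
    overhead = t1 - t0                  # empty reference loop
    total = (t3 - t2) - overhead        # test loop minus loop overhead
    return round(total / loops)         # fild / fild / fdiv / fistp

# Hypothetical readings: 5 ticks of loop overhead, 30 ticks per pass.
ten    = per_pass(0, 5, 5, 5 + 5 + 10 * 30, 10)
twenty = per_pass(0, 5, 5, 5 + 5 + 20 * 30, 20)
print(ten, twenty)
```

A larger loop count mainly helps by averaging over more passes, which matters because the performance counter ticks much more coarsely than the CPU clock.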
Say you, Say me, Say the codes together for ever.

lingo

Thank you six_L,  :thumbsup:

Excellent!
That's better than ever. :greenclp:

six_L

Hi, Lingo
Quote
00001237 clock cycles, (Ray_AnyToAny)x1000, OutPut: 987654321098
00000297 clock cycles, (lingo_i2a64)x1000, OutPut: 987654321098
00000458 clock cycles, (Roberts_dqtoa)x1000, OutPut: 987654321098
00000319 clock cycles, (jimg_qword2ascii)x1000, OutPut: 98765431::98
jimg_qword2ascii:
;db 2 dup (?) ; dummy spacer to align instructions best
qword2ascii proc
    test rax,rax
    jnz @f
    mov word ptr [rdi],30h
    ret
;@@:
;   jns @f
;   mov byte ptr [rdi],'-'
;   neg rax
;   add rdi, 1
@@:
    mov rcx,1999999999999999h   ; 1/10
    push rsi
    mov rsi,0
    .repeat
        push rbx            ; save rbx, or a digit
        inc rsi
        mov rbx,rax         ; save original
        mul rcx             ; rcx = 1/10, quotient ends up in rdx
        mov rax,rdx         ; save answer
        lea rdx,[rdx*4+rdx-18h] ; *5  also subtract out half of '0' (24) to convert last byte to ascii  ala lingo
        add rdx,rdx         ; *10    double it, so full value of '0' is covered.  so -48 extra to dl
        sub rbx,rdx         ; subtract from original gives remainder digit  -(-48) =+48=+'0' bl now in ascii
        test rax,rax        ; are we done?
    .until zero?
    .repeat
        mov [rdi],bl
        inc rdi
        pop rbx             ; restore rbx, or get next digit
        dec rsi
    .until zero?
    pop rsi
    ret

qword2ascii  endp
But jimg_qword2ascii gives the wrong result. Maybe I made a mistake in the translation.
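
One likely cause is the multiplier rather than the translation itself. 1999999999999999h is 2^64/10 rounded down, so for any multiple of 10 the high half of the product comes out one too small and the "digit" becomes 10, which prints as ':' (ASCII 58). The Python model below (function and constant names are illustrative) reproduces the posted output with that constant, and gives the expected string when the constant is rounded up instead; the rounded-up value is an assumption about the fix, exact in this model for inputs below 2^62.

```python
def to_ascii(x, recip, bits=64):
    """Model of the loop: q = high half of x*recip, digit = x - 10*q + '0'."""
    chars = []
    while True:
        q = (x * recip) >> bits            # mul rcx, quotient taken from rdx
        chars.append(chr(x - 10 * q + 48)) # remainder plus '0'
        x = q
        if x == 0:
            break
    return "".join(reversed(chars))

FLOOR_RECIP = 0x1999999999999999   # 2^64/10 rounded down (as posted)
CEIL_RECIP  = 0x199999999999999A   # 2^64/10 rounded up (assumed fix)

print(to_ascii(987654321098, FLOOR_RECIP))   # 98765431::98
print(to_ascii(987654321098, CEIL_RECIP))    # 987654321098
```

The two ':' characters appear exactly where the intermediate values 9876543210 and 987654320 are multiples of 10.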

Regards