qword to ascii conversion

Started by allynm, June 06, 2012, 02:36:32 AM

jimg

So, I ended up here while looking for a fast qw2asc routine. Is b2a3264 still the fastest? Have there been any usability updates?

jimg

Scratch that.  I couldn't get b2a3264 to work.   It's not working in the last posted test program.  Looks like UBTD is the champ.

jj2007

What is the problem with b2a3264? It works fine for me, but the 2nd algo, uqword, is a tick faster.

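Btw we did really exotic things at the time:

  mov esi, pQword
  fld FP10(0.00000000000000000009999999999999999972)  ; scale factor, about 1e-19
  fld FP10(10.000000000000000028)                     ; about 10
  fild qword ptr [esi]      ; ST(0)=qword, ST(1)=~10, ST(2)=~1e-19
  mov ecx, 19               ; 19 digits
  fmul st, st(2)            ; turn the qword into a fraction
  mov edi, pBuffer
  push 0                    ; scratch dword on the stack, "previous digit" = 0
  .Repeat
    fisub dword ptr [esp]   ; remove the previous digit
    fmul st, st(1)          ; move the next digit in front of the decimal point
    fist dword ptr [esp]    ; store ST(0) as an integer (rounded per the FPU control word)
    mov eax, [esp]
    add eax, "0"            ; digit to ASCII
    stosb
    dec ecx
  .Until Zero?
:biggrin: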

jimg

Perhaps I'm just misunderstanding how it works.
I took your last post and added the following lines after each counter_end to see the results of the last run:

pusha
print offset Dest," - "
invoke RtlZeroMemory,addr Dest,100
popa

And here are my results:

Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
4343434343434343434

4343434343434343434 - 862       cycles for Str$
- 211  cycles for uqword
4444444444444444444 - 773       cycles for uqw2a (The Svin)
4444444444444444444 - 880       cycles for uqw2a (mCoder)
- 92   cycles for i64toa (Towers)
44444444444444/4+21 - 674       cycles for JJ
4,444,444,444,444,393,520 - 537        cycles for UBTD (Dave)
6 - 214 cycles for b2a3264

- 664  cycles for Str$
- 211  cycles for uqword
4444444444444393520 - 754       cycles for uqw2a (The Svin)
4444444444444393520 - 770       cycles for uqw2a (mCoder)
- 90   cycles for i64toa (Towers)
44444444444444/4+21 - 647       cycles for JJ
4,444,444,444,444,393,520 - 451        cycles for UBTD (Dave)
6 - 214 cycles for b2a3264

If the answer isn't in Dest, where is it?
If it is, then most of them aren't working.


edit:

Just to see if the answer was somewhere, I tried printing every printable character in Dest.
I replaced my previous insertions with a macro called checkit:

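checkit macro
    pusha
    mov ebx,99              ; scan the first 99 bytes of Dest
    lea esi,Dest            ; read pointer
    mov edi,esi             ; write pointer (compacts in place)
    .repeat
        lodsb
        .if al>31           ; skip zero bytes and control characters
            stosb
        .endif
        dec ebx
    .until ebx==0
    mov al,0
    stosb                   ; zero-terminate the compacted string
    print offset Dest," - "
    invoke RtlZeroMemory,addr Dest,100
    popa
endm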
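
For anyone who doesn't read MASM macros fluently, here is roughly what checkit does to the buffer, as a minimal C sketch. The name keep_printable is made up for this illustration and is not part of the test program; the sketch assumes the buffer has one spare byte for the terminator, as the 100-byte Dest does for a 99-byte scan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the test program): move every byte
   above 31 to the front of buf, in place, and zero-terminate the result.
   buf must hold at least len + 1 bytes. Returns the number of bytes kept. */
size_t keep_printable(char *buf, size_t len)
{
    size_t w = 0;                          /* write index, like edi */

    assert(buf != NULL);
    for (size_t r = 0; r < len; r++) {     /* read index, like esi */
        unsigned char c = (unsigned char)buf[r];
        if (c > 31)                        /* same test as ".if al>31" */
            buf[w++] = (char)c;
    }
    buf[w] = '\0';
    return w;
}
```

If a routine writes its digits somewhere other than the start of Dest, or leaves zero bytes between them, a plain print shows nothing, while the compacted buffer shows whatever digits are in there.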

and for results I got:
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
4343434343434343434

4343434343434343434 - 871       cycles for Str$
343434343434343434 - 212        cycles for uqword
4444444444444444444 - 735       cycles for uqw2a (The Svin)
4444444444444444444 - 882       cycles for uqw2a (mCoder)
- 92   cycles for i64toa (Towers)
44444444444444/4+21 - 670       cycles for JJ
4,444,444,444,444,393,520 - 313        cycles for UBTD (Dave)
6513854137424602010 - 214       cycles for b2a3264

- 445  cycles for Str$
343434343434343434 - 151        cycles for uqword
4444444444444393520 - 655       cycles for uqw2a (The Svin)
4444444444444393520 - 882       cycles for uqw2a (mCoder)
- 92   cycles for i64toa (Towers)
44444444444444/4+21 - 663       cycles for JJ
4,444,444,444,444,393,520 - 382        cycles for UBTD (Dave)
6513854137424602010 - 184       cycles for b2a3264

So if the answer is in there somewhere, I don't see it.

I got somewhat better results with an earlier test program (attached as tst4):

Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (SSE4)

Qword to Ascii algos:
1216    cycles for sprintf, result:     12345678901234567890
702     cycles for Asc64, result:       12345678901234567890
214     cycles for U64ToStr, result:    12345678901234567890
132     cycles for UBTD, result:        12,345,678,901,234,567,890
131     cycles for UBTDx, result:       12345678901234567890
36      cycles for b2a3264, result:

1172    cycles for sprintf, result:     12345678901234567890
701     cycles for Asc64, result:       12345678901234567890
207     cycles for U64ToStr, result:    12345678901234567890
138     cycles for UBTD, result:        12,345,678,901,234,567,890
125     cycles for UBTDx, result:       12345678901234567890
34      cycles for b2a3264, result:

1165    cycles for sprintf, result:     12345678901234567890
703     cycles for Asc64, result:       12345678901234567890
207     cycles for U64ToStr, result:    12345678901234567890
135     cycles for UBTD, result:        12,345,678,901,234,567,890
129     cycles for UBTDx, result:       12345678901234567890
34      cycles for b2a3264, result:


Code sizes:
Asc64        = 52
U64ToStr     = 178
b2a3264      = 834 + 200 for chartable



jimg

Added umqtoa to the mix.  Very respectable.  Still can't get b2a3264 to work.
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (SSE4)

Qword to Ascii algos:
1211    cycles for sprintf, result:     12345678901234567890
306     cycles for umqtoa, result:      12345678901234567890
712     cycles for Asc64, result:       12345678901234567890
216     cycles for U64ToStr, result:    12345678901234567890
137     cycles for UBTD, result:        12,345,678,901,234,567,890
125     cycles for UBTDx, result:       12345678901234567890
34      cycles for b2a3264, result:

1200    cycles for sprintf, result:     12345678901234567890
302     cycles for umqtoa, result:      12345678901234567890
727     cycles for Asc64, result:       12345678901234567890
209     cycles for U64ToStr, result:    12345678901234567890
125     cycles for UBTD, result:        12,345,678,901,234,567,890
137     cycles for UBTDx, result:       12345678901234567890
36      cycles for b2a3264, result:

1177    cycles for sprintf, result:     12345678901234567890
298     cycles for umqtoa, result:      12345678901234567890
711     cycles for Asc64, result:       12345678901234567890
213     cycles for U64ToStr, result:    12345678901234567890
130     cycles for UBTD, result:        12,345,678,901,234,567,890
125     cycles for UBTDx, result:       12345678901234567890
33      cycles for b2a3264, result:


Code sizes:
Asc64        = 52
U64ToStr     = 178
b2a3264      = 834 + 200 for chartable

jj2007

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

Qword to Ascii algos:
1114    cycles for sprintf, result:     12345678901234567890
340     cycles for umqtoa, result:      12345678901234567890
619     cycles for Asc64, result:       12345678901234567890
259     cycles for U64ToStr, result:    12345678901234567890
154     cycles for UBTD, result:        12,345,678,901,234,567,890
168     cycles for UBTDx, result:       12345678901234567890
44      cycles for b2a3264, result:

1175    cycles for sprintf, result:     12345678901234567890
346     cycles for umqtoa, result:      12345678901234567890
638     cycles for Asc64, result:       12345678901234567890
254     cycles for U64ToStr, result:    12345678901234567890
152     cycles for UBTD, result:        12,345,678,901,234,567,890
150     cycles for UBTDx, result:       12345678901234567890
45      cycles for b2a3264, result:

1134    cycles for sprintf, result:     12345678901234567890
330     cycles for umqtoa, result:      12345678901234567890
643     cycles for Asc64, result:       12345678901234567890
261     cycles for U64ToStr, result:    12345678901234567890
150     cycles for UBTD, result:        12,345,678,901,234,567,890
154     cycles for UBTDx, result:       12345678901234567890
45      cycles for b2a3264, result:

Code sizes:
Asc64        = 52
U64ToStr     = 178
b2a3264      = 834 + 200 for chartable


Quote from: jimg on July 07, 2017, 03:28:38 AM
Still can't get b2a3264 to work.

Two lines are missing:
option prologue:none ; turn it off
option epilogue:none ;


Here are a few exotic ones (source attached):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

39062   cycles for 100 * MB Str$()
27288   cycles for 100 * q2a fpu 1
27213   cycles for 100 * q2a fpu 2
28920   cycles for 100 * q2a fpu 3
15460   cycles for 100 * UBTDx

37583   cycles for 100 * MB Str$()
26653   cycles for 100 * q2a fpu 1
27548   cycles for 100 * q2a fpu 2
28913   cycles for 100 * q2a fpu 3
15407   cycles for 100 * UBTDx


Exotic because they use the FPU. That old thread had some problems - some algos crashed, others didn't deliver the right results. More experimental than serious.

Btw where is umqtoa in your version?

jimg

It's part of fpulib.  I haven't looked at the actual code yet.

And the earlier one has option prologue none, with the same results.  I'll look more closely, but I really don't understand how it's supposed to work.

jimg

I just took the code right out of your latest post of QW2Ascii, prologue stuff and all, and ran it without changes, exactly as you ran it, and I can't get correct answers out of it.

I can't believe I'm getting this bad.   If you get the chance, and only if you feel like it, would you put together a quickie with nothing but b2a3264 doing one number and printing the results? 

jj2007

Jim,
The problem sits here apparently:

  mov edi,[esp+2*4]
  mov [ecx+18],edx
  mov eax, [esp+1*4]  ; added by me
  retn 2*4  ; changed by me

Mine runs with this, but the result looks crappy.

Note the order, different from the others:
       invoke b2a3264, addr Dest, addr Src

P.S.: Lingo's code is fast, bloated and crappy... as usual.
35821   cycles for 100 * MB Str$()
27236   cycles for 100 * q2a fpu 1
26186   cycles for 100 * q2a fpu 2
25412   cycles for 100 * q2a fpu 3
14913   cycles for 100 * UBTDx
30051   cycles for 100 * umqtoa
4865    cycles for 100 * Lingo

18      bytes for MB Str$()
99      bytes for q2a fpu 1
87      bytes for q2a fpu 2
119     bytes for q2a fpu 3
215     bytes for UBTDx
187     bytes for umqtoa
883     bytes for Lingo


Search the source (*.asc in RichMasm/WordPad/MS Word, or *.asm in whatever) for the string ForLingo. For unknown reasons, it required some extra qwords to be filled with the source values in order to produce results. An incredible mess 8)

jimg

Thank you.
I thought I could pull out just the parts needed to run the proc once, but somewhere, I screwed it up and I can't see where.  I did my best to pull out your exact code.  If you get a chance, please take a look and let me know what I fouled up.

jj2007

Hi Jim,

Here is one that should work. Minor modifications, and it was definitely not your fault.

jimg

Much thanks.  Now the fun begins :)

jimg

Take a look at the last line in the proc.

jj2007

mov [ecx+8], edx ;
add ecx, -1 ;
jne Jo5 ;


I gave up trying to understand Lingo's code a long time ago. But it is pretty clear that ecx will never reach the value 1 8)

jimg

I've modded the proc to more normal specifications (b2a3264x5): no use of esp, normal prologue, normal qword input.
I'm pretty happy with the results, and it's certainly much faster than any of the other algos.
I still don't understand what it's doing; I can't even figure out what his magic numbers are.
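
On the magic numbers: I can't say what b2a3264 actually does, but two tricks are common in fast integer-to-decimal code. One is replacing div with a multiplication by a scaled reciprocal (0CCCCCCCDh is the well-known constant for unsigned 32-bit division by 10). The other is a table holding "00" to "99" so that each step emits two digits; the 200 bytes listed for chartable would fit such a table, though that is only a guess. The C sketch below shows both ideas. The names div10_magic, u64_to_dec and pair_table are made up for this illustration, and it is not a translation of Lingo's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unsigned 32-bit division by 10 without a divide instruction:
   0xCCCCCCCD is ceil(2^35 / 10), and (x * 0xCCCCCCCD) >> 35 equals
   x / 10 for every 32-bit x. */
uint32_t div10_magic(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* 100 two-character entries "00".."99" = 200 bytes, built on first use. */
static char pair_table[200];
static int pair_table_ready = 0;

static void init_pair_table(void)
{
    for (int i = 0; i < 100; i++) {
        pair_table[2 * i]     = (char)('0' + i / 10);
        pair_table[2 * i + 1] = (char)('0' + i % 10);
    }
    pair_table_ready = 1;
}

/* Writes value as decimal digits plus a terminating zero into out
   (at least 21 bytes). Returns the number of digits written. */
size_t u64_to_dec(uint64_t value, char *out)
{
    char tmp[20];                  /* 2^64 - 1 has 20 digits */
    size_t pos = sizeof tmp;       /* tmp is filled from the right */

    assert(out != NULL);
    if (!pair_table_ready)
        init_pair_table();

    while (value >= 100) {         /* two digits per iteration */
        unsigned pair = (unsigned)(value % 100);
        value /= 100;
        tmp[--pos] = pair_table[2 * pair + 1];
        tmp[--pos] = pair_table[2 * pair];
    }
    if (value >= 10) {             /* two leading digits left */
        tmp[--pos] = pair_table[2 * value + 1];
        tmp[--pos] = pair_table[2 * value];
    } else {                       /* one leading digit left */
        tmp[--pos] = (char)('0' + value);
    }

    size_t len = sizeof tmp - pos;
    memcpy(out, tmp + pos, len);
    out[len] = '\0';
    return len;
}
```

A compiler will normally turn the `/ 100` and `% 100` into reciprocal multiplications of its own, which is where unexplained constants in hand-written or disassembled code often come from.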

Here are my results :)

Qword to Ascii algos:
1228    cycles for sprintf, result:     18446744073709551615
305     cycles for umqtoa, result:      18446744073709551615
699     cycles for Asc64, result:       18446744073709551615
213     cycles for U64ToStr, result:    18446744073709551615
124     cycles for UBTD, result:        18,446,744,073,709,551,615
163     cycles for UBTDx, result:       18446744073709551615
42      cycles for b2a3264, result:     18446744073709551615
41      cycles for b2a3264x5, result:   18446744073709551615

1173    cycles for sprintf, result:     12345678901234567890
299     cycles for umqtoa, result:      12345678901234567890
699     cycles for Asc64, result:       12345678901234567890
207     cycles for U64ToStr, result:    12345678901234567890
137     cycles for UBTD, result:        12,345,678,901,234,567,890
175     cycles for UBTDx, result:       12345678901234567890
41      cycles for b2a3264, result:     12345678901234567890
35      cycles for b2a3264x5, result:   12345678901234567890

615     cycles for sprintf, result:     4294967296
85      cycles for umqtoa, result:
425     cycles for Asc64, result:
74      cycles for U64ToStr, result:    4294967296
160     cycles for UBTD, result:                     4,294,967,296
188     cycles for UBTDx, result:       4294967296
42      cycles for b2a3264, result:     4294967296
40      cycles for b2a3264x5, result:   4294967296

601     cycles for sprintf, result:     3012345678
64      cycles for umqtoa, result:
418     cycles for Asc64, result:
61      cycles for U64ToStr, result:    3012345678
158     cycles for UBTD, result:                     3,012,345,678
188     cycles for UBTDx, result:       3012345678
19      cycles for b2a3264, result:     3012345678
14      cycles for b2a3264x5, result:   3012345678

429     cycles for sprintf, result:     123456
41      cycles for umqtoa, result:
295     cycles for Asc64, result:
31      cycles for U64ToStr, result:    123456
173     cycles for UBTD, result:                           123,456
195     cycles for UBTDx, result:       123456
19      cycles for b2a3264, result:     123456
13      cycles for b2a3264x5, result:   123456

191     cycles for sprintf, result:     1
13      cycles for umqtoa, result:
190     cycles for Asc64, result:
10      cycles for U64ToStr, result:    1
194     cycles for UBTD, result:                                 1
210     cycles for UBTDx, result:       1
14      cycles for b2a3264, result:     1
6       cycles for b2a3264x5, result:   1

189     cycles for sprintf, result:     0
12      cycles for umqtoa, result:
179     cycles for Asc64, result:
9       cycles for U64ToStr, result:    0
188     cycles for UBTD, result:                                 0
214     cycles for UBTDx, result:       0
15      cycles for b2a3264, result:     0
7       cycles for b2a3264x5, result:   0
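
Since several result columns above are blank, it may be worth checking each routine against a known-good conversion before looking at the cycle counts. A minimal C sketch of such a check, assuming the routine under test can be reached through a wrapper of the form `void f(unsigned long long value, char *dest)`. The names conv_fn, check_converter, ref_convert and broken_convert are hypothetical; the test values are the ones from the runs above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef void (*conv_fn)(unsigned long long value, char *dest);

/* The test values used in the runs above. */
static const unsigned long long test_values[] = {
    18446744073709551615ULL, 12345678901234567890ULL, 4294967296ULL,
    3012345678ULL, 123456ULL, 1ULL, 0ULL
};

/* Returns the number of test values for which fn's output differs
   from what snprintf("%llu") produces; 0 means every value matched. */
int check_converter(conv_fn fn)
{
    int failures = 0;
    size_t n = sizeof test_values / sizeof test_values[0];

    assert(fn != NULL);
    for (size_t i = 0; i < n; i++) {
        char expected[32], got[32];
        snprintf(expected, sizeof expected, "%llu", test_values[i]);
        memset(got, 0, sizeof got);        /* like RtlZeroMemory on Dest */
        fn(test_values[i], got);
        if (strcmp(expected, got) != 0) {
            printf("MISMATCH for %llu: got \"%s\"\n", test_values[i], got);
            failures++;
        }
    }
    return failures;
}

/* Stand-in converter: plain divide-by-10 loop, digits built right to left. */
void ref_convert(unsigned long long value, char *dest)
{
    char tmp[20];
    size_t pos = sizeof tmp;

    do {
        tmp[--pos] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    memcpy(dest, tmp + pos, sizeof tmp - pos);
    dest[sizeof tmp - pos] = '\0';
}

/* Deliberately broken converter: leaves dest empty for values below
   2^32, only to show what the check reports. */
void broken_convert(unsigned long long value, char *dest)
{
    if (value >> 32)
        ref_convert(value, dest);
    else
        dest[0] = '\0';
}
```

Calling check_converter(ref_convert) should report no mismatches, while check_converter(broken_convert) flags the four values below 2^32. A routine with a blank result column would show up the same way, whatever its cycle count.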