Perhaps I'm just misunderstanding how it works.
I took the code from your last post and added the following lines after each counter_end to see the result of the last run:
pusha
print offset Dest," - "
invoke RtlZeroMemory,addr Dest,100
popa
And here are my results:
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
4343434343434343434
4343434343434343434 - 862 cycles for Str$
- 211 cycles for uqword
4444444444444444444 - 773 cycles for uqw2a (The Svin)
4444444444444444444 - 880 cycles for uqw2a (mCoder)
- 92 cycles for i64toa (Towers)
44444444444444/4+21 - 674 cycles for JJ
4,444,444,444,444,393,520 - 537 cycles for UBTD (Dave)
6 - 214 cycles for b2a3264
- 664 cycles for Str$
- 211 cycles for uqword
4444444444444393520 - 754 cycles for uqw2a (The Svin)
4444444444444393520 - 770 cycles for uqw2a (mCoder)
- 90 cycles for i64toa (Towers)
44444444444444/4+21 - 647 cycles for JJ
4,444,444,444,444,393,520 - 451 cycles for UBTD (Dave)
6 - 214 cycles for b2a3264
If the answer isn't in Dest, where is it?
If it is, then most of them aren't working.
Edit:
Just to see if the answer was somewhere, I tried printing every printable character in Dest.
I replaced my previous insertions with a macro called checkit:
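checkit macro
    pusha                               ; save all registers
    mov ebx,99                          ; scan the first 99 bytes of Dest
    lea esi,Dest                        ; read pointer
    mov edi,esi                         ; write pointer (compact in place)
    .repeat
        lodsb                           ; al = next byte from Dest
        .if al>31                       ; skip zeros and control characters
            stosb                       ; keep the byte, packed to the front
        .endif
        dec ebx
    .until ebx==0
    mov al,0
    stosb                               ; terminate the compacted string
    print offset Dest," - "
    invoke RtlZeroMemory,addr Dest,100  ; clear Dest for the next algo
    popa                                ; restore registers
endm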
And for results I got:
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
4343434343434343434
4343434343434343434 - 871 cycles for Str$
343434343434343434 - 212 cycles for uqword
4444444444444444444 - 735 cycles for uqw2a (The Svin)
4444444444444444444 - 882 cycles for uqw2a (mCoder)
- 92 cycles for i64toa (Towers)
44444444444444/4+21 - 670 cycles for JJ
4,444,444,444,444,393,520 - 313 cycles for UBTD (Dave)
6513854137424602010 - 214 cycles for b2a3264
- 445 cycles for Str$
343434343434343434 - 151 cycles for uqword
4444444444444393520 - 655 cycles for uqw2a (The Svin)
4444444444444393520 - 882 cycles for uqw2a (mCoder)
- 92 cycles for i64toa (Towers)
44444444444444/4+21 - 663 cycles for JJ
4,444,444,444,444,393,520 - 382 cycles for UBTD (Dave)
6513854137424602010 - 184 cycles for b2a3264
So if the answer is in there somewhere, I don't see it.
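For anyone who wants to sanity-check the filtering logic outside the timing harness, here is a rough C sketch of what checkit does to the buffer. The function name compact_printable is made up for illustration, and the sketch assumes the buffer holds at least n + 1 bytes, matching how Dest is used above (99 bytes scanned, 100 bytes zeroed). One thing it makes visible: zero bytes are skipped rather than treated as terminators, so digits left anywhere in the first 99 bytes get joined into one string, whether or not they start at Dest.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the checkit macro: walk the first n bytes
   of buf, copy every byte greater than 31 to the front of buf (in place),
   then write a terminating zero. Returns the length of the packed string.
   Assumes buf has room for at least n + 1 bytes. */
static size_t compact_printable(unsigned char *buf, size_t n)
{
    size_t out = 0;                 /* write index, plays the role of edi */

    for (size_t i = 0; i < n; i++) {    /* i plays the role of esi */
        if (buf[i] > 31) {          /* unsigned compare, like .if al>31 */
            buf[out++] = buf[i];
        }
    }
    buf[out] = '\0';                /* same as the final mov al,0 / stosb */
    return out;
}
```

For example, with "123" at offset 40 and "45" at offset 60 in an otherwise zeroed 100-byte buffer, compact_printable(dest, 99) should leave "12345" at the start of the buffer and return 5.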
I got slightly better results with an earlier test program (attached as tst4):
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (SSE4)
Qword to Ascii algos:
1216 cycles for sprintf, result: 12345678901234567890
702 cycles for Asc64, result: 12345678901234567890
214 cycles for U64ToStr, result: 12345678901234567890
132 cycles for UBTD, result: 12,345,678,901,234,567,890
131 cycles for UBTDx, result: 12345678901234567890
36 cycles for b2a3264, result:
1172 cycles for sprintf, result: 12345678901234567890
701 cycles for Asc64, result: 12345678901234567890
207 cycles for U64ToStr, result: 12345678901234567890
138 cycles for UBTD, result: 12,345,678,901,234,567,890
125 cycles for UBTDx, result: 12345678901234567890
34 cycles for b2a3264, result:
1165 cycles for sprintf, result: 12345678901234567890
703 cycles for Asc64, result: 12345678901234567890
207 cycles for U64ToStr, result: 12345678901234567890
135 cycles for UBTD, result: 12,345,678,901,234,567,890
129 cycles for UBTDx, result: 12345678901234567890
34 cycles for b2a3264, result:
Code sizes:
Asc64 = 52
U64ToStr = 178
b2a3264 = 834 + 200 for chartable