Hi All!
Does anybody have a 64-bit procedure to convert signed qword integers to a Unicode string?
Thanks in advance, HSE.
Hector,
What about one of the MSVC functions?
Hi Hutch!
Quote from: hutch-- on April 27, 2022, 01:10:26 AM
What about one of the MSVC functions?
No, because I want it for programs running directly from UEFI.
It's not a critical problem, because I can load the integer into the FPU and the float-to-Unicode conversion works well, but I think a more direct solution would be better.
Hi HSE
What format? Decimal, Hex, Bin?
Biterider
include \Masm32\MasmBasic\Res\JBasic.inc ; ## builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
q2aBuffer db 80 dup(?)
.code
q2a:
push rsi
push rdi
mov rsi, offset q2aBuffer+32
lea rdi, [rsi-32]
FBSTP REAL10 ptr [rsi]
mov ecx, REAL10
@@: movzx edx, byte ptr [rsi+rcx]
test edx, edx
je NoNumber
mov al, dl
sar al, 4
and al, 15
add al, "0"
stosw
mov al, dl
and al, 15
add al, "0"
stosw
NoNumber:
dec ecx
jns @B
pop rdi
pop rsi
ret
MyQ QWORD 123456789012345678
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
fild MyQ
call q2a
jinvoke printf, Chr$("Result=%ls"), offset q2aBuffer
EndOfCode
Output:
This program was assembled with ml64 in 64-bit format.
Result=123456789012345678
Hi JJ!
Very strange:
\Masm32\MasmBasic\Res\JBasic.inc(554) : fatal error A1000:cannot open file : \Masm32\MasmBasic☺
That is from the command line, because RichMasm has some path problem.
Using the procedure with numero qword 15, the result is:
q2aBuffer = 1F5F [q2u.asm, 226]
Which assembler?
Does \Masm32\MasmBasic\Res\JBasic.inc exist?
Can you post the code that produces garbage for numero 15?
Here everything works fine, and I see no reason why it shouldn't work :rolleyes:
Quote from: jj2007 on April 27, 2022, 07:19:53 AM
Which assembler?
*** Start D:\masm32\MasmBasic\Res\bldallRM.bat ***
*** 64-bit assembly ***
*** Assemble, link and run q2a ***
*** Assemble using \masm32\bin64\ml64 ***
El sistema no puede encontrar la ruta especificada. [The system cannot find the path specified.]
*** Assembly error ***
Quote from: jj2007 on April 27, 2022, 07:19:53 AM
Does \Masm32\MasmBasic\Res\JBasic.inc exist?
The error is in JBasic.inc
Quote from: jj2007 on April 27, 2022, 07:19:53 AM
Can you post the code that produces garbage for numero 15?
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
% include @Environ(OBJASM_PATH)\Code\Macros\Model.inc ;Include & initialize standard modules
SysSetup OOP, WIDE_STRING, NUI64, DEBUG(CON) ;Load OOP files and basic OS support
.data
ConInput CHR 10 DUP(0) ;Get some space for the console input buffer
dBytesRead DWORD 0
numero qword 15
q2aBuffer db 80 dup(?)
.code
q2a:
push rsi
push rdi
mov rsi, offset q2aBuffer+32
lea rdi, [rsi-32]
FBSTP REAL10 ptr [rsi]
mov ecx, REAL10
@@: movzx edx, byte ptr [rsi+rcx]
test edx, edx
je NoNumber
mov al, dl
sar al, 4
and al, 15
add al, "0"
stosw
mov al, dl
and al, 15
add al, "0"
stosw
NoNumber:
dec ecx
jns @B
pop rdi
pop rsi
ret
start proc
SysInit
DbgClearAll
fild numero
call q2a
DbgStrA q2aBuffer
DbgText "Press [ENTER] to continue..."
invoke CreateFile, $OfsCStr("CONIN$"), GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, 0, 0
invoke ReadFile, xax, addr ConInput, sizeof(ConInput), addr dBytesRead, NULL
SysDone
invoke ExitProcess,0
ret
start endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
Quote from: HSE on April 27, 2022, 07:42:24 AM
*** Start D:\masm32\MasmBasic\Res\bldallRM.bat ***
*** 64-bit assembly ***
*** Assemble, link and run q2a ***
*** Assemble using \masm32\bin64\ml64 ***
El sistema no puede encontrar la ruta especificada. [The system cannot find the path specified.]
*** Assembly error ***
So you are using an OPT_Assembler \masm32\bin64\ml64 in your source... sorry, that won't work. RichMasm assumes that all your tools reside in \masm32\bin\*. Copy ml64.exe there, and use OPT_Assembler ml (or let RichMasm use the default \masm32\bin\UAsm64.exe) :cool:
Quote from: jj2007 on April 27, 2022, 07:19:53 AM
Does \Masm32\MasmBasic\Res\JBasic.inc exist?
The error is in JBasic.inc
Right - sorry. So what is at the error line 554 in your JBasic.inc, causing fatal error A1000:cannot open file : \Masm32\MasmBasic☺?
Open "I", #0, repargA(fname)
xchg rsi, rax
jinvoke GetFileSize, rsi, addr bytesWritten
inc rax
xchg rax, rdi <<<<<<<<<<<<<<<<<<<<<<<<< line 554 <<<<<<<<<<<<<<<<<<
jinvoke HeapAlloc, MbProHeap, HEAP_GENERATE_EXCEPTIONS, rdi
mov MbFileReadPtr, rax
jinvoke ReadFile, rsi, rax, rdi, addr bytesWritten, 0
mov rdx, MbFileReadPtr
Quote from: jj2007 on April 27, 2022, 07:19:53 AM
Can you post the code that produces garbage for numero 15?
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
...
I can't see where it fails, can you post the exe, please?
Quote from: jj2007 on April 27, 2022, 09:21:43 AM
... sorry, that won't work. RichMasm assumes that all your tools reside in \masm32\bin\*
All 64-bit tools are in the bin64 folder because that is the Masm64 SDK standard, but no problem. We can wait until you make the corrections :biggrin: :biggrin: :biggrin:
Quote from: jj2007 on April 27, 2022, 09:21:43 AM
Right - sorry. So what is at the error line 554 in your JBasic.inc, causing fatal error A1000:cannot open file : \Masm32\MasmBasic☺?
:biggrin: I have DualMacs.inc, but I lost DualWin.inc somewhere, and I have never seen any pt.inc. I am surely missing some update.
Quote from: jj2007 on April 27, 2022, 09:21:43 AM
I can't see where it fails, can you post the exe, please?
Attached.
Quote from: HSE on April 27, 2022, 09:50:16 AM
Quote from: jj2007 on April 27, 2022, 09:21:43 AM
I can't see where it fails, can you post the exe, please?
Attached.
Quote
@@: movzx ecx, byte ptr [rsi+rdx]
jecxz NoNumber
mov eax, ecx
sar al, 4
and al, 15
add al, "0"
stosw
In this forum, beating the CRT by at least a factor 5 is our favourite pastime :biggrin:
This program was assembled with ml64 in 64-bit format.
561 ticks for crt swprintf
Result=123456789012345678
109 ticks for q2a
Result=123456789012345678
This program was assembled with ml in 32-bit format.
687 ticks for crt swprintf
Result=123456789012345678
109 ticks for q2a
Result=123456789012345678
:biggrin:
numero qword 5000
q2aBuffer = 50 [q2u.asm, 69]
Quote from: HSE on April 27, 2022, 09:54:53 PM
:biggrin:
numero qword 5000
q2aBuffer = 50 [q2u.asm, 69]
Result=5000
:biggrin:
Post your exe...
I found a very interesting procedure from bitRAKE for ASCII, pretty easy to adapt for Unicode:
.data
align 64
digit_table dw '0','1','2','3','4','5','6','7','8','9'
dw 'A','B','C','D','E','F','G','H','I','J'
dw 'K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'
.code
;-------------------------------------------------------------------------------
; Proc UINT64__Baseform
; Modification from bitRAKE's fasmg_playground
; https://github.com/bitRAKE/fasmg_playground/blob/master/string/baseform.asm
;-------------------------------------------------------------------------------
UINT64__Baseform:
; RAX number to convert
; RCX number base to use [2,36]
; RDI string buffer of [65,14] characters, i.e. [130,28] bytes in this Unicode version (overridden by q2aBuffer below)
push rbx
push rdi
lea rdi, q2aBuffer
lea rbx, digit_table
push 0
A: xor edx,edx
div rcx
push qword ptr [rbx+rdx*2]
test rax,rax
jnz A
B: pop rax
stosw
test al,al
jnz B
mov rax, rdi ; comment for timing
pop rdi
pop rbx
ret
; RCX unchanged
; RAX end of null-terminated string
mov rax, 1500
mov rcx, 10
call UINT64__Baseform
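For readers who want to check the digit order without an assembler, here is a minimal C sketch of the same div/push/pop idea. It is only a reference model, not bitRAKE's or HSE's code: the name uq2base_model, the uint16_t buffer type and the returned digit count are assumptions made for this illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical C model of the div/push/pop scheme (name invented here).
   dst must hold up to 65 wide chars for base 2, 14 for base 36. */
static size_t uq2base_model(uint16_t *dst, uint64_t value, unsigned base)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    uint16_t stack[65];          /* 64 binary digits plus the terminator */
    size_t sp = 0, n = 0;

    stack[sp++] = 0;             /* like "push 0": the terminator pops last */
    do {                         /* loop A: remainders, least significant first */
        stack[sp++] = (uint16_t)digits[value % base];
        value /= base;
    } while (value != 0);
    do {                         /* loop B: pop and store until the 0 is stored */
        dst[n] = stack[--sp];
    } while (dst[n++] != 0);
    return n - 1;                /* digits written, terminator not counted */
}
```

Calling uq2base_model(buf, 1500, 10) should leave the wide string 1500 in buf, matching the call shown above with RAX = 1500 and RCX = 10.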
LATER: JJ, your algorithm fails because it cannot handle a "00" byte :thdn: :sad:
JJ, this works:
q2a:
push rsi
push rdi
mov rsi, offset q2aBuffer+32
lea rdi, [rsi-32]
FBSTP REAL10 ptr [rsi]
push REAL10
pop rdx
mov r8, 0
@@:
movzx ecx, byte ptr [rsi+rdx]
add r8, rcx
test r8, r8
je NoNumber
mov r8, 1
mov eax, ecx
shr al, 4
or al, "0"
stosw
mov al, cl
and al, 15
or al, "0"
stosw
NoNumber:
dec rdx
jns @B
pop rdi
pop rsi
ret
Once the number has started, a "00" byte is valid.
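The leading-zero rule can be modelled in C as follows. This is a sketch under stated assumptions, not the q2a routine itself: bcd2wide_model is a hypothetical name, the "started" flag is tested per nibble rather than per byte (so a leading 05 prints as 5), and the handling of the sign byte and of the value zero is added for the illustration. The 10-byte layout is the one FBSTP stores: 18 packed digits in bytes 0 to 8, least significant byte first, and the sign in bit 7 of byte 9.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model: packed BCD (as stored by FBSTP) to a wide string.
   dst needs room for sign + 18 digits + terminator = 20 wide chars. */
static size_t bcd2wide_model(uint16_t *dst, const uint8_t bcd[10])
{
    size_t n = 0;
    int started = 0;                      /* set once a nonzero digit is seen */

    if (bcd[9] & 0x80)                    /* sign byte: assumption of this sketch */
        dst[n++] = '-';
    for (int i = 8; i >= 0; i--) {        /* most significant byte first */
        unsigned hi = bcd[i] >> 4;
        unsigned lo = bcd[i] & 15;
        if (started || hi) {              /* zeros count once the number started */
            dst[n++] = (uint16_t)('0' + hi);
            started = 1;
        }
        if (started || lo || i == 0) {    /* always emit the very last digit */
            dst[n++] = (uint16_t)('0' + lo);
            started = 1;
        }
    }
    dst[n] = 0;                           /* explicit zero terminator */
    return n;                             /* characters written, without terminator */
}
```

With the BCD bytes for 5000 (byte 1 = 50h, byte 0 = 00h) the model emits all four digits, which is the case the first q2a version truncated to 50.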
Clever :thumbsup:
Do you really need the test r8, r8?
Quote from: jj2007 on April 28, 2022, 02:19:52 AM
Do you really need the test r8, r8?
Clever :thumbsup:
Hi
I checked both procs, UINT64__Baseform (bitRAKE) and q2a.
Apart from the ugly leading "0" of q2a, UINT64__Baseform is faster (depending on argument size) and the representation base can be changed.
Argument = 123 => 2x faster (Base = 10)
Argument = 1234567890 => same performance
A signed version is not too hard to code.
Biterider
Very nice, Biterider :thumbsup:
This program was assembled with ml64 in 64-bit format.
87 bytes for q2a
125 bytes for UINT64
1482 ticks for crt swprintf
Result=123456789
452 ticks for q2a
484 ticks for q2a
452 ticks for q2a
468 ticks for q2a
Result=123456789
468 ticks for UINT64__Baseform
484 ticks for UINT64__Baseform
468 ticks for UINT64__Baseform
468 ticks for UINT64__Baseform
Result=123456789
For short strings up to 12345678, your 64-bit code is faster; above 123456789 mine is faster. Your 32-bit version is significantly faster.
The leading zero problem is solved. My routine is signed, but that's a minor difference, of course.
Attached source and executables (built with ML64, but I recommend UAsm64 (http://www.terraspace.co.uk/uasm.html#p2)).
Hi Biterider!
Quote from: Biterider on April 28, 2022, 07:09:06 AM
UINT64__Baseform is faster (depending on argument size) and the representation base can be changed.
Yes, very elegant and versatile. I think uq2baseW is a descriptive enough name.
Quote from: Biterider on April 28, 2022, 07:09:06 AM
A signed version is not too hard to code.
That could be sq2baseW.
HSE
JJ:
Do you have to adjust your glasses? :biggrin:
Quote from: jj2007 on April 28, 2022, 10:44:42 AM
your 64-bit code is faster; above 123456789 mine is faster. Your 32-bit version is significantly faster.
bitRAKE may sound similar to Biterider, but they are different, well-known people. I also deserve some credit: essentially, I changed some "e" to "r" :biggrin: :biggrin: :biggrin:
Old AMD
This program was assembled with ml in 32-bit format.
81 bytes for q2a
126 bytes for UINT64
281 ticks for crt swprintf
Result=123456789
31 ticks for q2a
47 ticks for q2a
47 ticks for q2a
47 ticks for q2a
Result=123456789
46 ticks for UINT64__Baseform
63 ticks for UINT64__Baseform
47 ticks for UINT64__Baseform
62 ticks for UINT64__Baseform
Result=111111111
--- hit any key ---
This program was assembled with ml64 in 64-bit format.
87 bytes for q2a
125 bytes for UINT64
219 ticks for crt swprintf
Result=123456789
31 ticks for q2a
47 ticks for q2a
46 ticks for q2a
32 ticks for q2a
Result=123456789
62 ticks for UINT64__Baseform
47 ticks for UINT64__Baseform
62 ticks for UINT64__Baseform
63 ticks for UINT64__Baseform
Result=111111111
--- hit any key ---
Hi
While coding the signed version of UINT64__Baseform, I became unsure what we expect to see from the conversion from let's say -123 (decimal) to base 16 or to base 2. Are minus signs allowed on bases other than 10?
Does anyone know for sure the correct answer?
Biterider
Quote from: Biterider on April 29, 2022, 06:09:02 AM
Are minus signs allowed on bases other than 10?
:biggrin: Maybe that is the wrong question, because the answer is obvious.
A negative number is negative in any base; the number is always the same.
Perhaps the question is: are negative numbers actually written in bases other than 10? :thumbsup:
It is just that in computing, a negative number shown in base 2 or base 16 is usually not written with a minus sign, because binary and hexadecimal output normally shows the two's complement bit pattern in a fixed register size.
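To make the two readings concrete, here is a hedged C sketch. The names u2base_model and s2base_model are hypothetical and this is not Biterider's signed routine: the signed helper writes a minus sign followed by the magnitude, while passing the same 64 bits to the unsigned helper shows the two's complement pattern.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical unsigned helper: value -> wide string in the given base. */
static size_t u2base_model(uint16_t *dst, uint64_t v, unsigned base)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    uint16_t tmp[64];
    size_t t = 0, n = 0;
    do {                                   /* collect digits, lowest first */
        tmp[t++] = (uint16_t)digits[v % base];
        v /= base;
    } while (v != 0);
    while (t != 0)                         /* reverse into the destination */
        dst[n++] = tmp[--t];
    dst[n] = 0;
    return n;
}

/* Hypothetical signed variant: minus sign plus magnitude, in any base. */
static size_t s2base_model(uint16_t *dst, int64_t v, unsigned base)
{
    if (v >= 0)
        return u2base_model(dst, (uint64_t)v, base);
    dst[0] = '-';
    /* 0 - (uint64_t)v is the magnitude and is well defined even for INT64_MIN */
    return 1 + u2base_model(dst + 1, 0 - (uint64_t)v, base);
}
```

For -123 in base 16, s2base_model gives -7B, whereas u2base_model on the same bits gives FFFFFFFFFFFFFF85. Which of the two a caller expects is exactly the open question above.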
Quote from: Biterider on April 29, 2022, 06:09:02 AM
Are minus signs allowed on bases other than 10?
They are not forbidden but highly unusual. In the meantime, I gave my routines a little speed boost - grateful for some timings:
This program was assembled with UAsm64 in 64-bit format.
87 bytes for q2a
72 bytes for q2asc
117 bytes for UINT64
2699 ticks for crt swprintf
2699 ticks for crt swprintf
Result=123456789012345678
499 ticks for q2a
515 ticks for q2a
499 ticks for q2a
Result=123456789012345678
187 ticks for q2asc
187 ticks for q2asc
187 ticks for q2asc
Result=123456789012345678
1030 ticks for UINT64__Baseform
1045 ticks for UINT64__Baseform
1014 ticks for UINT64__Baseform
Result=123456789012345678
686 ticks for crt swprintf
671 ticks for crt swprintf
Result=123
453 ticks for q2a
436 ticks for q2a
453 ticks for q2a
Result=123
31 ticks for q2asc
31 ticks for q2asc
31 ticks for q2asc
Result=123
156 ticks for UINT64__Baseform
156 ticks for UINT64__Baseform
172 ticks for UINT64__Baseform
Result=123
2265 ticks for crt swprintf
2157 ticks for crt swprintf
Result=123456789012345678
407 ticks for q2a
390 ticks for q2a
422 ticks for q2a
Result=123456789012345678
125 ticks for q2asc
141 ticks for q2asc
109 ticks for q2asc
Result=123456789012345678
766 ticks for UINT64__Baseform
781 ticks for UINT64__Baseform
766 ticks for UINT64__Baseform
Result=123456789012345678
578 ticks for crt swprintf
609 ticks for crt swprintf
Result=123
359 ticks for q2a
375 ticks for q2a
375 ticks for q2a
Result=123
32 ticks for q2asc
15 ticks for q2asc
16 ticks for q2asc
Result=123
109 ticks for UINT64__Baseform
110 ticks for UINT64__Baseform
125 ticks for UINT64__Baseform
Result=123
--- hit any key ---
Old AMD
This program was assembled with UAsm64 in 64-bit format.
87 bytes for q2a
72 bytes for q2asc
117 bytes for UINT64
4103 ticks for crt swprintf
4056 ticks for crt swprintf
Result=123456789012345678
468 ticks for q2a
484 ticks for q2a
468 ticks for q2a
Result=123456789012345678
265 ticks for q2asc
265 ticks for q2asc
281 ticks for q2asc
Result=123456789012345678
1607 ticks for UINT64__Baseform
1607 ticks for UINT64__Baseform
1622 ticks for UINT64__Baseform
Result=123456789012345678
936 ticks for crt swprintf
936 ticks for crt swprintf
Result=123
375 ticks for q2a
358 ticks for q2a
359 ticks for q2a
Result=123
63 ticks for q2asc
31 ticks for q2asc
32 ticks for q2asc
Result=123
140 ticks for UINT64__Baseform
140 ticks for UINT64__Baseform
141 ticks for UINT64__Baseform
Result=123
--- hit any key ---
Thanks, Timo & Hector :thup:
I've been working on this very problem.
AVX512 for 16-figures : 2400mb/s
The idea was stolen :toothy:
;==============================================================
;Integer to String Using AVX512. RCX=unsigned long long. RDX=ptr to char string.
;==============================================================
IntToChar_4 proc
mov r8,rdx
mov rdx,12379400392853802749
mov rax,rcx
mulx rax,rax,rax
mov rdx,rcx
shr rax,26
mov rdx,rax
imul rdx,100000000
sub rcx,rdx
vpxor xmm2,xmm2,xmm2 ;maintain integer domain
vpxor xmm3,xmm3,xmm3
vpbroadcastq zmm0, rax
vpbroadcastq zmm1, rcx
;vmovq xmm2, zeroZ ;original code. don't understand purpose. < 52-bit cutoff.
;vmovdqa64 zmm3, zmm2
vpmadd52luq zmm2, zmm0, zmmword ptr iFMAZ
vpmadd52luq zmm3, zmm1, zmmword ptr iFMAZ
vpbroadcastq zmm4, qword ptr TenZ
vpbroadcastq zmm5, qword ptr CharZ
vmovdqa64 zmm0, zmm5
vpmadd52huq zmm0, zmm4, zmm2
vpmadd52huq zmm5, zmm4, zmm3
vpxor xmm1,xmm1,xmm1 ;not necessary
vmovdqu xmm1, xmmword ptr permZ
vpermi2b zmm1,zmm5,zmm0
vmovdqu xmmword ptr [r8],xmm1
vzeroupper
ret
permZ BYTE 78h,70h,68h,60h,58h,50h,48h,40h,38h,30h,28h,20h,18h,10h,8h,0 ;selects bytes from 2 zmmwords. 0 to 127.
iFMAZ QWORD 0000199999999999ah,0000028f5c28f5c29h, 0000004189374bc6bh, 000000068db8bac72h, 00000000a7c5ac472h,0000000010c6f7a0ch, 0000000001ad7f29bh,00000000002af31dch ;2^52/10^y
;zeroZ QWORD 1A1A400h ;Serves no purpose. Does this become zero?
TenZ QWORD 10
CharZ QWORD '0'
IntToChar_4 endp
No idea how to efficiently do all 20-figures or remove the 0's.
Quote from: InfiniteLoop on April 30, 2022, 07:36:33 AM
No idea how to efficiently do all 20-figures or remove the 0's.
Keep trying :thumbsup:
Hi
I wrote a routine that is a combination of the bitRAKE and the JJ algorithm.
It has the advantage of writing to the beginning of the destination buffer, which avoids an extra string copy in most cases. It also returns the number of bytes written to the buffer.
Performance is much better than UINT64__Baseform and slightly slower than q2asc, but once you add a string copy to the latter, it outperforms both by far.
The attached file contains the unsigned and signed versions of the routine.
Biterider
Quote from: Biterider on May 03, 2022, 12:46:29 AM
The attached file contains the unsigned and signed versions of the routine.
Fantastic :thumbsup:
Meanwhile, I added to ObjMemEFI the bitRAKE algorithm:
; ==================================================================================================
; Title: uq2baseW.asm
; Author: Héctor S. Enrique
; Version: C.1.0
; Notes: Version C.1.0, April 2022
; - First release.
; ---------------------------------------------------------------
; Modification from bitRAKE's Proc UINT64__Baseform in fasmg_playground
; https://github.com/bitRAKE/fasmg_playground/blob/master/string/baseform.asm
; ==================================================================================================
% include @Environ(OBJASM_PATH)\Code\OA_SetupEFI.inc
% include &ObjMemPath&ObjMem.cop
.data
align 64
digit_table dw '0','1','2','3','4','5','6','7','8','9'
dw 'A','B','C','D','E','F','G','H','I','J'
dw 'K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'
.code
; ——————————————————————————————————————————————————————————————————————————————————————————————————
; Procedure: uq2baseW
; Purpose: Converts a QWORD to its base WIDE string representation.
; Arguments: Arg1: -> Destination WIDE string buffer.
; Arg2: QWORD value.
; Arg3: QWORD base.
; Return: Nothing.
; Notes: In code
align ALIGN_CODE
uq2baseW proc uses xbx xsi xdi lpBuffer:POINTER, uqValue:QWORD, uqBase:QWORD
mov rax, uqValue ; RAX number to convert
mov rcx, uqBase ; RCX number base to use [2,36]
mov rdi, lpBuffer ; RDI string buffer of [65,14] WIDE chars, i.e. [130,28] bytes
lea rbx, digit_table
push 0
A: xor edx,edx
div rcx
push qword ptr [rbx+rdx*2]
test rax,rax
jnz A
B: pop rax
stosw
test al,al
jnz B
mov rax, rdi ; comment for timing
ret
; RCX unchanged
; RAX end of null-terminated string
uq2baseW endp
end
Quote from: Biterider on May 03, 2022, 12:46:29 AM
I wrote a routine that is a combination of the bitRAKE and the JJ algorithm.
Your uqw2dec is pretty fast :thumbsup:
This program was assembled with UAsm64 in 64-bit format.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
328 ticks for uqw2dec
312 ticks for uqw2dec
312 ticks for uqw2dec
Result=1234567890123456789
249 ticks for q2asc
250 ticks for q2asc
249 ticks for q2asc
Result=1234567890123456789
172 ticks for uqw2dec
156 ticks for uqw2dec
172 ticks for uqw2dec
Result=1234567890
202 ticks for q2asc
125 ticks for q2asc
125 ticks for q2asc
Result=1234567890
250 ticks for q2asc
249 ticks for q2asc
Result=1234567890123456789
1420 ticks for Baseform (bitRAKE)
1388 ticks for Baseform (bitRAKE)
Result=1234567890123456789
96 bytes for uqw2dec
72 bytes for q2a32
88 bytes for q2asc
80 bytes for UINT64
Quote from: Biterider on May 03, 2022, 12:46:29 AM
It has the advantage of writing to the beginning of the destination buffer, which avoids an extra string copy in most cases.
That has been solved some time ago for q2asc.
Hi JJ
Quote from: jj2007 on May 03, 2022, 09:18:04 AM
That has been solved some time ago for q2asc.
That's great!
I got my q2asc version from here: http://masm32.com/board/index.php?topic=10022.15#:~:text=q2asc.zip%20(6.04%20kB%20%2D%20downloaded%207%20times.) Is there a new one?
Biterider
Quote from: Biterider on May 03, 2022, 03:31:32 PM
Is there a new one?
Hi Biterider,
There are many new ones, it's a mess :badgrin:
This is the tail of my current q2asc version. As you can see, it does indeed do a 3*16=48 byte copy, but it's fast:
movups xmm0, [rdi] ; src
movups xmm1, [rdi+16]
if @64
movups xmm2, [rdi+32] ; not needed in 32-bit code because limited to DWORD
endif
movaps [rax], xmm0 ; dest is align 16
movaps [rax+16], xmm1
if @64
movaps [rax+32], xmm2
endif
pop rbx
ife @64
pop rcx
endif
pop rdi
ret
I tested Biterider's code. VS2022 doesn't like macro statements.
For reference, SPrintf() reaches ~100 MB/s using random xorshift64 unsigned long longs.
The naive 20-figure "divide by 10" loop achieves 287 MB/s and removes the leading zeros.
This "SWAR" algorithm is the fastest scalar one yet at ~973 MB/s, although it's still limited to 16 figures and keeps the leading zeros.
;==============================================================
;Integer to String using SWAR method. RCX=num RDX=str
;==============================================================
EncodeTens proc ;rcx,rdx
shl rdx,32
or rcx,rdx
mov rax,20972
imul rax,rcx
shr rax,21
mov r8, 7f0000007fh ;((merged * 10486ULL) >> 20) & ((0x7FULL << 32) | 0x7FULL);
and rax,r8 ;top
mov rdx,100
imul rdx,rax
sub rcx,rdx ;bottom
shl rcx,16
add rcx, rax ;hundreds
mov rax,103
imul rax,rcx
shr rax, 10 ;tens
mov r8,0f000f000f000fh
and rax,r8
lea rdx, [rax+rax]
lea rdx, [rdx*4+rdx]
sub rcx,rdx
shl rcx,8
add rax,rcx
ret
EncodeTens endp
IntToChar_SWAR proc
mov r11,rdx
mov rdx,12379400392853802749
mov r8, 100000000
mov rax,rcx
mulx rax,rax,rax
shr rax,26 ;top
imul r8,rax
sub rcx,r8 ;bottom
push rcx
mov ecx,3518437209
imul rcx,rax
shr rcx,45 ;top\10^4
mov edx,10000
imul edx,ecx
sub eax,edx
mov edx,eax
call EncodeTens
mov r10,3030303030303030h
add rax,r10
mov qword ptr [r11],rax
pop rax
mov ecx,3518437209
imul rcx,rax
shr rcx,45 ;top\10^4
mov edx,10000
imul edx,ecx
sub eax,edx
mov edx,eax
call EncodeTens
add rax,r10
mov qword ptr [r11+8],rax
ret
IntToChar_SWAR endp
;==============================================================
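The first step of IntToChar_SWAR (and of IntToChar_4 earlier) splits the input into two 8-digit halves without a div instruction: it takes the high 64 bits of the product with 12379400392853802749 and shifts right by 26, which yields the quotient by 10^8. Below is a minimal C model of just that step. The function names are hypothetical, and the high multiply is built from 32-bit pieces as a portable stand-in for mulx.

```c
#include <stdint.h>

/* High 64 bits of a 64x64-bit product, built from 32-bit pieces
   (portable stand-in for the high half that mulx delivers). */
static uint64_t mulhi64_model(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* carry out of the middle 32-bit column */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* Split n into n / 10^8 (hi) and n % 10^8 (lo) by reciprocal multiplication.
   12379400392853802749 is 2^90 / 10^8 rounded up, hence the extra shift by 26. */
static void split_1e8_model(uint64_t n, uint64_t *hi, uint64_t *lo)
{
    uint64_t q = mulhi64_model(n, 12379400392853802749ULL) >> 26;
    *hi = q;
    *lo = n - q * 100000000ULL;
}
```

For a 16-digit input such as 1234567890123456 the two halves are 12345678 and 90123456, each small enough for the 8-digit encoding that follows in the assembly.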
Looks interesting, but can you post working code? What does RCX=num RDX=str mean?
See new Lab post The joy of beating the CRT by a factor 10 (http://masm32.com/board/index.php?topic=10037.0).
When debugging some benchmarks, I stumbled over some code that looked very familiar:
.while (eax > 0)
mov ebx,eax
mul ecx
shr edx, 3
mov eax,edx
lea edx,[edx*4+edx]
add edx,edx
sub ebx,edx
add bl,'0'
mov [edi],bl
add edi, 1
.endw
I'm sure Biterider will recognise it, too :biggrin:
Check \Masm32\m32lib\dwtoa.asm :cool:
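The loop above divides by 10 without a div instruction. Assuming ecx holds 0CCCCCCCDh (the excerpt does not show where it is loaded), mul ecx followed by shr edx, 3 takes the 64-bit product shifted right by 35, which is the quotient by 10, and the lea/add/sub sequence then recovers the remainder. A minimal C sketch of that idea follows. dw2a_model is a hypothetical name, and the final reversal into the destination is this sketch's own choice rather than a copy of dwtoa.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of a DWORD-to-ASCII loop using reciprocal multiplication. */
static size_t dw2a_model(char *dst, uint32_t n)
{
    char tmp[10];                          /* a 32-bit value has at most 10 digits */
    size_t t = 0, len = 0;

    do {
        /* (n * 0xCCCCCCCD) >> 35 equals n / 10 for every 32-bit n */
        uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
        tmp[t++] = (char)('0' + (n - q * 10));   /* remainder as ASCII digit */
        n = q;
    } while (n != 0);
    while (t != 0)                         /* digits were produced lowest first */
        dst[len++] = tmp[--t];
    dst[len] = '\0';
    return len;                            /* characters written, without terminator */
}
```

The destination needs 11 bytes for the largest DWORD plus its terminator.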
Hi
Quote from: jj2007 on May 05, 2022, 04:57:06 AM
I'm sure Biterider will recognise it, too :biggrin:
Seems quite familiar to me :biggrin:
Today I found some time to play with this procedure a bit more. I have tried to combine all the code pieces we discussed before, using all available x64 registers, removing all unnecessary frame instructions and interleaving other instructions.
I came up with a combination that gave the best results on my machine and looks like this:
OPTION PROC:NONE
uqw2dec2W proc pBuffer:POINTER, qNumber:QWORD
sub rsp, 32h
lea r9, [rsp + 30h]
mov rax, rdx
mov word ptr [r9], 0
mov r10, 0CCCCCCCCCCCCCCCDh
@@:
sub r9, 2
mov r8, rax
mul r10
shr rdx, 3
mov rax, rdx
lea rdx, [4*rdx + rdx]
lea rdx, [2*rdx - "0"]
sub r8, rdx
mov [r9], r8w
test rax, rax
jne @B
movups xmm0, [r9]
movups xmm1, [r9 + 16]
movups xmm2, [r9 + 32]
add rsp, 32h
movups [rcx], xmm0
mov rax, rsp
movups [rcx + 16], xmm1
sub rax, r9
movups [rcx + 32], xmm2
ret
uqw2dec2W endp
OPTION PROC:DEFAULT
The only requirement is that the destination buffer has at least 48 bytes.
On return, eax contains the number of bytes written, including the zero termination char.
Biterider
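As a cross-check of the contract described above, here is a hedged C model of the routine's structure: digits are generated backwards into a local scratch area, then copied to the start of the destination, and the byte count including the terminator is returned. It is a sketch, not a translation: uqw2dec_model is a hypothetical name, it uses plain / and % instead of the multiply by 0CCCCCCCCCCCCCCCDh, and it copies only the bytes actually used, whereas the assembly always copies 48 bytes. In the assembly, the pair lea rdx, [2*rdx - "0"] and sub r8, rdx folds the ASCII offset into the remainder, since n - (10*q - '0') equals the remainder plus '0'.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: unsigned 64-bit value to a wide decimal string.
   Returns the number of BYTES written, terminator included. */
static size_t uqw2dec_model(uint16_t *dst, uint64_t n)
{
    uint16_t tmp[21];                      /* 20 digits plus the terminator */
    uint16_t *p = tmp + 20;
    size_t bytes;

    *p = 0;                                /* terminator goes in first, at the end */
    do {                                   /* fill backwards, lowest digit first */
        *--p = (uint16_t)('0' + n % 10);
        n /= 10;
    } while (n != 0);
    bytes = (size_t)(tmp + 21 - p) * sizeof(uint16_t);
    memcpy(dst, p, bytes);                 /* string starts at dst[0] */
    return bytes;
}
```

For 1234567890 the model returns 22: ten wide digits plus the wide terminator.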
Looks good, Biterider :thumbsup:
This program was assembled with UAsm64 in 64-bit format.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
905 ticks for uqw2dec
889 ticks for uqw2dec
890 ticks for uqw2dec
Result=1234567890123456789
436 ticks for uqw2dec2W
437 ticks for uqw2dec2W
453 ticks for uqw2dec2W
Result=1234567890
421 ticks for q2asc
421 ticks for q2asc
421 ticks for q2asc
Result=1234567890
842 ticks for q2asc
858 ticks for q2asc
796 ticks for q2asc
Result=1234567890123456789
105 bytes for uqw2dec
88 bytes for q2asc
Hi JJ
Thank you for sharing your recent modifications. :thumbsup:
I tried the lea-not-lea sequence in the main loop, but didn't get the improvement you see on your machine.
The last real boost came from the XMM copy you introduced recently. I didn't use the aligned write because I often concatenate strings and the target isn't guaranteed to be aligned. I timed the change and didn't see a disadvantage, but that may be different on other CPUs.
The returned value of bytes written is very useful for a general API and has little impact on timing.
Biterider
Quote from: Biterider on May 05, 2022, 04:18:29 PM
I tried the lea-not-lea sequence in the main loop, but didn't get the improvement you see on your machine.
The lea-not-lea boost is not that big, but it saves one test rax, rax, of course. The current MasmBasic Str$() uses it in the DWORD and QWORD to Ansi versions. My DWORD to Ansi is roughly 30% faster than the Masm32 SDK dwtoa.
Quote
The last real boost came from the XMM copy you introduced recently. I didn't use the aligned write because I often concatenate strings and the target isn't guaranteed to be aligned. I timed the change and didn't see a disadvantage, but that may be different on other CPUs.
The returned value of bytes written is very useful for a general API and has little impact on timing.
Unaligned write, too, for the latest MasmBasic version (http://masm32.com/board/index.php?topic=94.0). However, I chose to return the end position of the last write, as demonstrated in the Lab post The joy of beating the CRT by a factor 10 (http://masm32.com/board/index.php?topic=10037.0).
Hi
Searching through previously written code I found one by P. Dixon. It's not new as it's been discussed here several times (check the old forum).
It took me some time to build and test the x64 version.
Testing the code is not easy because checking all possible input values takes ages.
The performance is outstanding, surpassing the previously discussed routines by a factor of ~2.
Regards, Biterider
:thumbsup: It is working.
Hi Biterider!
It looks like there is a problem with the ZTC (zero termination char): the previous string in the buffer remains, shifted to the right (at least when running from UEFI).
Regards, HSE.
Hi HSE
Thanks for the feedback.
For a better understanding, could you please write 3-4 lines of code showing the problem?
Biterider
PS: I don't think that it is a UEFI thing :tongue:
Hi HSE
I think I found the problem. There was a typo when setting the ZTC in the wide version of the proc.
I replaced the download from the post above (reply #44).
Biterider
Hi Biterider
Perfect now :thumbsup:
HSE