News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

qword to unicode?

Started by HSE, April 27, 2022, 01:07:01 AM

Previous topic - Next topic

InfiniteLoop

I've been working on this very problem.
AVX512 for 16-figures : 2400mb/s
The idea was stolen  :toothy:
;==============================================================
;Integer to String Using AVX512. RCX=unsigned long long. RDX=ptr to char string.
;==============================================================
IntToChar_4 proc
mov r8,rdx
mov rdx,12379400392853802749
mov rax,rcx
mulx rax,rax,rax
mov rdx,rcx
shr rax,26
mov rdx,rax
imul rdx,100000000
sub rcx,rdx
vpxor xmm2,xmm2,xmm2 ;maintain integer domain
vpxor xmm3,xmm3,xmm3
vpbroadcastq zmm0, rax
vpbroadcastq zmm1, rcx
;vmovq xmm2, zeroZ ;original code. don't understand purpose. < 52-bit cutoff.
;vmovdqa64 zmm3, zmm2
vpmadd52luq zmm2, zmm0, zmmword ptr iFMAZ
vpmadd52luq zmm3, zmm1, zmmword ptr iFMAZ
vpbroadcastq zmm4, qword ptr TenZ
vpbroadcastq zmm5, qword ptr CharZ
vmovdqa64 zmm0, zmm5
vpmadd52huq zmm0, zmm4, zmm2
vpmadd52huq zmm5, zmm4, zmm3
vpxor xmm1,xmm1,xmm1 ;not necessary
vmovdqu xmm1, xmmword ptr permZ
vpermi2b zmm1,zmm5,zmm0
vmovdqu xmmword ptr [r8],xmm1
vzeroupper
ret
permZ BYTE 78h,70h,68h,60h,58h,50h,48h,40h,38h,30h,28h,20h,18h,10h,8h,0 ;selects bytes from 2 zmmwords. 0 to 127.
iFMAZ QWORD 0000199999999999ah,0000028f5c28f5c29h, 0000004189374bc6bh, 000000068db8bac72h, 00000000a7c5ac472h,0000000010c6f7a0ch, 0000000001ad7f29bh,00000000002af31dch ;2^52/10^y
;zeroZ QWORD 1A1A400h ;Serves no purpose. Does this become zero?
TenZ QWORD 10
CharZ QWORD '0'
IntToChar_4 endp


No idea how to efficiently do all 20-figures or remove the 0's.



jj2007

Quote from: InfiniteLoop on April 30, 2022, 07:36:33 AM
No idea how to efficiently do all 20-figures or remove the 0's.

Keep trying :thumbsup:

Biterider

Hi
I wrote a routine that is a combination of the bitRAKE and the JJ algorithm.

It has the advantage of writing to the beginning of the destination buffer, which avoids an extra string copy in most cases. It also returns the number of bytes written to the buffer.

Performance is much better than UINT64__Baseform and slightly slower than q2asc, but when you add a string copy to the last, it far outperforms both.

In the attached file are the unsigned an signed versions out the routine.


Biterider




HSE

#33
Quote from: Biterider on May 03, 2022, 12:46:29 AM
In the attached file are the unsigned an signed versions out the routine.

Fantastic  :thumbsup:

Meanwhile, I added to ObjMemEFI the bitRAKE algorithm:; ==================================================================================================
; Title:      uq2baseW.asm
; Author:     Héctor S. Enrique
; Version:    C.1.0
; Notes:      Version C.1.0, April 2022
;               - First release.
; ---------------------------------------------------------------
;  Modification from bitRAKE's Proc UINT64__Baseform in fasmg_playground
;  https://github.com/bitRAKE/fasmg_playground/blob/master/string/baseform.asm
; ==================================================================================================

% include @Environ(OBJASM_PATH)\\Code\\OA_SetupEFI.inc
% include &ObjMemPath&ObjMem.cop
.data

    align 64
    digit_table dw '0','1','2','3','4','5','6','7','8','9'
                dw 'A','B','C','D','E','F','G','H','I','J'
                dw 'K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'

.code

; ——————————————————————————————————————————————————————————————————————————————————————————————————
; Procedure:  uq2baseW
; Purpose:    Converts a QWORD to its base WIDE string representation.
; Arguments:  Arg1: -> Destination WIDE string buffer.
;             Arg2: QWORD value.
;             Arg3: QWORD base.
; Return:     Nothing.
; Notes:      In code


align ALIGN_CODE
uq2baseW proc uses xbx xsi xdi lpBuffer:POINTER, uqValue:QWORD, uqBase:QWORD
    mov rax, uqValue        ; RAX number to convert
    mov rcx, uqBase         ; RCX number base to use [2,36]
    mov rdi, lpBuffer       ; RDI string buffer of length [65,14] bytes
    lea rbx, digit_table
push 0
A: xor edx,edx
div rcx
push qword ptr [rbx+rdx*2]
test rax,rax
jnz A

B: pop rax
stosw
test al,al
jnz B
       mov rax, rdi         ; comment for timing
ret

; RCX unchanged
; RAX end of null-terminated string

uq2baseW endp

end
Equations in Assembly: SmplMath

jj2007

Quote from: Biterider on May 03, 2022, 12:46:29 AM
I wrote a routine that is a combination of the bitRAKE and the JJ algorithm.

Your uqw2dec is pretty fast :thumbsup:

This program was assembled with UAsm64 in 64-bit format.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

328 ticks for uqw2dec
312 ticks for uqw2dec
312 ticks for uqw2dec
    Result=1234567890123456789

249 ticks for q2asc
250 ticks for q2asc
249 ticks for q2asc
    Result=1234567890123456789

172 ticks for uqw2dec
156 ticks for uqw2dec
172 ticks for uqw2dec
    Result=1234567890

202 ticks for q2asc
125 ticks for q2asc
125 ticks for q2asc
    Result=1234567890

250 ticks for q2asc
249 ticks for q2asc
    Result=1234567890123456789

1420 ticks for Baseform (bitRAKE)
1388 ticks for Baseform (bitRAKE)
    Result=1234567890123456789

96      bytes for uqw2dec
72      bytes for q2a32
88      bytes for q2asc
80      bytes for UINT64


Quote from: Biterider on May 03, 2022, 12:46:29 AMIt has the advantage of writing to the beginning of the destination buffer, which avoids an extra string copy in most cases.

That has been solved some time ago for q2asc.

Biterider


jj2007

Quote from: Biterider on May 03, 2022, 03:31:32 PMIs there a new one?

Hi Biterider,

There are many new ones, it's a mess :badgrin:

This is the tail of my current q2asc version. As you can see, it does indeed a 3*16=48 bytes copy, but it's fast:

  movups xmm0, [rdi] ; src
  movups xmm1, [rdi+16]
  if @64
movups xmm2, [rdi+32] ; not needed in 32-bit code because limited to DWORD
  endif
  movaps [rax], xmm0 ; dest is align 16
  movaps [rax+16], xmm1
  if @64
movaps [rax+32], xmm2
  endif
  pop rbx
  ife @64
pop rcx
  endif
  pop rdi
  ret

InfiniteLoop

I tested BiteRider's code. VS2022 doesn't like macro statements.
For reference SPrintf(): ~100mb/s using random xorshift64 unsigned long longs.
The naiive 20-figure "divide by 10" loop achieves 287mb/s and removes zero's.
This "SWAR" algorithm is the fastest (scalar) yet ~973mb/s, although its still 16-figures with the zeros.

;==============================================================
;Integer to String using SWAR method. RCX=num RDX=str
;==============================================================
EncodeTens proc ;rcx,rdx
shl rdx,32
or rcx,rdx
mov rax,20972
imul rax,rcx
shr rax,21
mov r8, 7f0000007fh ;((merged * 10486ULL) >> 20) & ((0x7FULL << 32) | 0x7FULL);
and rax,r8 ;top
mov rdx,100
imul rdx,rax
sub rcx,rdx ;bottom
shl rcx,16
add rcx, rax ;hundreds
mov rax,103
imul rax,rcx
shr rax, 10 ;tens
mov r8,0f000f000f000fh
and rax,r8
lea rdx, [rax+rax]
lea rdx, [rdx*4+rdx]
sub rcx,rdx
shl rcx,8
add rax,rcx
ret
EncodeTens endp

IntToChar_SWAR proc
mov r11,rdx
mov rdx,12379400392853802749
mov r8, 100000000
mov rax,rcx
mulx rax,rax,rax
shr rax,26 ;top
imul r8,rax
sub rcx,r8 ;bottom
push rcx
mov ecx,3518437209
imul rcx,rax
shr rcx,45 ;top\10^4
mov edx,10000
imul edx,ecx
sub eax,edx
mov edx,eax
call EncodeTens
mov r10,3030303030303030h
add rax,r10
mov qword ptr [r11],rax
pop rax
mov ecx,3518437209
imul rcx,rax
shr rcx,45 ;top\10^4
mov edx,10000
imul edx,ecx
sub eax,edx
mov edx,eax
call EncodeTens
add rax,r10
mov qword ptr [r11+8],rax
ret
IntToChar_SWAR endp
;==============================================================

jj2007

Looks interesting, but can you post working code? What does RCX=num RDX=str mean?

jj2007

See new Lab post The joy of beating the CRT by a factor 10.

When debugging some benchmarks, I stumbled over some code that looked very familiar:
    .while (eax > 0)
      mov ebx,eax
      mul ecx
      shr edx, 3
      mov eax,edx
      lea edx,[edx*4+edx]
      add edx,edx
      sub ebx,edx
      add bl,'0'
      mov [edi],bl
      add edi, 1
    .endw


I'm sure Biterider will recognise it, too :biggrin:

Check \Masm32\m32lib\dwtoa.asm :cool:

Biterider

Hi

Quote from: jj2007 on May 05, 2022, 04:57:06 AM
I'm sure Biterider will recognise it, too :biggrin:
Seems quite familiar to me :biggrin:

Today I found some time to play with this procedure a bit more. I have tried to combine all the code pieces we discussed before, using all available x64 registers, removing all unnecessary frame instructions and interleaving other instructions.
I came up with a combination that gave the best results on my machine and looks like this:
OPTION PROC:NONE
uqw2dec2W proc pBuffer:POINTER, qNumber:QWORD
  sub rsp, 32h
  lea r9, [rsp + 30h]
  mov rax, rdx
  mov word ptr [r9], 0
  mov r10, 0CCCCCCCCCCCCCCCDh
@@:
  sub r9, 2
  mov r8, rax
  mul r10
  shr rdx, 3
  mov rax, rdx
  lea rdx, [4*rdx + rdx]
  lea rdx, [2*rdx - "0"]
  sub r8, rdx
  mov [r9], r8w
  test rax, rax
  jne @B
  movups xmm0, [r9]
  movups xmm1, [r9 + 16]
  movups xmm2, [r9 + 32]
  add rsp, 32h
  movups [rcx], xmm0
  mov rax, rsp
  movups [rcx + 16], xmm1
  sub rax, r9
  movups [rcx + 32], xmm2
  ret
uqw2dec2W endp
OPTION PROC:DEFAULT


The only requirement is that the destination buffer has at least 48 bytes.
On return, eax contains the number of bytes written, including the zero termination char.

Biterider

jj2007

Looks good, Biterider :thumbsup:

This program was assembled with UAsm64 in 64-bit format.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

905 ticks for uqw2dec
889 ticks for uqw2dec
890 ticks for uqw2dec
    Result=1234567890123456789

436 ticks for uqw2dec2W
437 ticks for uqw2dec2W
453 ticks for uqw2dec2W
    Result=1234567890

421 ticks for q2asc
421 ticks for q2asc
421 ticks for q2asc
    Result=1234567890

842 ticks for q2asc
858 ticks for q2asc
796 ticks for q2asc
    Result=1234567890123456789

105     bytes for uqw2dec
88      bytes for q2asc

Biterider

Hi JJ
Thank you for sharing your recent modifications.  :thumbsup:

I tried the lea-not-lea sequence in the main loop, but didn't get the improvement you see on your machine.

The last real boost came from the XMM copy you introduced recently. I didn't use the aligned write because I often concatenate strings and the target isn't guaranteed to be aligned. I timed the change and didn't see a disadvantage, but that may be different on other CPUs.
The returned value of bytes written is very useful for a general API and has little impact on timing.

Biterider

jj2007

Quote from: Biterider on May 05, 2022, 04:18:29 PMI tried the lea-not-lea sequence in the main loop, but didn't get the improvement you see on your machine.

The lea-not-lea boost is not that big, but it saves one test rax, rax, of course. The current MasmBasic Str$() uses it in the DWORD and QWORD to Ansi versions. My DWORD to Ansi is roughly 30% faster than the Masm32 SDK dwtoa.

QuoteThe last real boost came from the XMM copy you introduced recently. I didn't use the aligned write because I often concatenate strings and the target isn't guaranteed to be aligned. I timed the change and didn't see a disadvantage, but that may be different on other CPUs.
The returned value of bytes written is very useful for a general API and has little impact on timing.

Unaligned write, too, for the latest MasmBasic version. However, I chose to return the end position of the last write, as demonstrated in the Lab post The joy of beating the CRT by a factor 10.

Biterider

#44
Hi
Searching through previously written code I found one by P. Dixon. It's not new as it's been discussed here several times (check the old forum).

It took me some time to build and test the x64 version.
Testing the code is not easy because checking all possible input values takes ages.
The performance is outstanding, surpassing the previously discussed routines by a factor of ~2.

Regards, Biterider