The MASM Forum

General => The Laboratory => Topic started by: jj2007 on March 01, 2024, 11:56:20 PM

Title: Decimal ascii to dword conversion
Post by: jj2007 on March 01, 2024, 11:56:20 PM
Can I have some timings, please? (spinoff from this thread (https://masm32.com/board/index.php?topic=11739.0))

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

2257    cycles for 100 * atodw (Masm32 SDK)
1862    cycles for 100 * s2int A
2145    cycles for 100 * s2int B

2165    cycles for 100 * atodw (Masm32 SDK)
1925    cycles for 100 * s2int A
2159    cycles for 100 * s2int B

2274    cycles for 100 * atodw (Masm32 SDK)
1965    cycles for 100 * s2int A
1988    cycles for 100 * s2int B

2178    cycles for 100 * atodw (Masm32 SDK)
2102    cycles for 100 * s2int A
1977    cycles for 100 * s2int B

Averages:
2218    cycles for atodw (Masm32 SDK)
1945    cycles for s2int A
2066    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
30      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

The algos:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
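s2intA proc arg
  pop eax                       ; return address
  pop edx                       ; pointer to the string (the only argument)
  push eax                      ; put the return address back for retn
  xor eax, eax                  ; result = 0
  .While 1
    movzx ecx, byte ptr [edx]   ; next character
    sub ecx, "0"                ; ASCII -> digit value
    js @F                       ; below "0" (e.g. the zero terminator): done
    if 1  ; A
      imul eax, 10
      add eax, ecx              ; result = result*10 + digit
      inc edx
    else  ; B
      lea eax, [eax+4*eax]      ; *5
      inc edx
      lea eax, [2*eax+ecx]      ; *2 + digit
    endif
  .Endw
  @@:
  retn
s2intA endp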
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
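
For readers who do not have MASM at hand, here is a minimal C sketch of the two loop bodies above. It is a model of the assembly, not the tested code: the names s2int_mul and s2int_lea are made up for illustration, and a compiler may well emit different instructions than the hand-written versions. Like the original, the loop stops at the first byte below "0" and does not reject characters above "9".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Variant A: one multiply per digit, modelled on "imul eax, 10" / "add eax, ecx". */
uint32_t s2int_mul(const char *s)
{
    assert(s != NULL);
    uint32_t acc = 0;
    for (;;) {
        int d = (unsigned char)*s - '0';   /* ASCII -> digit value */
        if (d < 0)                         /* byte below '0', e.g. the terminator */
            break;
        acc = acc * 10u + (uint32_t)d;
        s++;
    }
    return acc;
}

/* Variant B: the same step split as (acc*5)*2 + digit, modelled on the two lea's. */
uint32_t s2int_lea(const char *s)
{
    assert(s != NULL);
    uint32_t acc = 0;
    for (;;) {
        int d = (unsigned char)*s - '0';
        if (d < 0)
            break;
        acc = acc + 4u * acc;              /* lea eax, [eax+4*eax] */
        s++;
        acc = 2u * acc + (uint32_t)d;      /* lea eax, [2*eax+ecx] */
    }
    return acc;
}
```

Both functions should return 123456789 for the string "123456789", matching the eax values shown in the timing output.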
Title: Re: Decimal ascii to dword conversion
Post by: TimoVJL on March 02, 2024, 12:06:31 AM
AMD Athlon(tm) II X2 220 Processor (SSE3)

4637    cycles for 100 * atodw (Masm32 SDK)
3985    cycles for 100 * s2int A
6634    cycles for 100 * s2int B

4638    cycles for 100 * atodw (Masm32 SDK)
3984    cycles for 100 * s2int A
6423    cycles for 100 * s2int B

4634    cycles for 100 * atodw (Masm32 SDK)
3986    cycles for 100 * s2int A
6642    cycles for 100 * s2int B

4690    cycles for 100 * atodw (Masm32 SDK)
3981    cycles for 100 * s2int A
6543    cycles for 100 * s2int B

Averages:
4638    cycles for atodw (Masm32 SDK)
3984    cycles for s2int A
6588    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
30      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

--- ok ---
Title: Re: Decimal ascii to dword conversion
Post by: HSE on March 02, 2024, 01:23:22 AM
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

2783    cycles for 100 * atodw (Masm32 SDK)
2894    cycles for 100 * s2int A
2869    cycles for 100 * s2int B

2686    cycles for 100 * atodw (Masm32 SDK)
2859    cycles for 100 * s2int A
2875    cycles for 100 * s2int B

2769    cycles for 100 * atodw (Masm32 SDK)
2920    cycles for 100 * s2int A
2897    cycles for 100 * s2int B

2668    cycles for 100 * atodw (Masm32 SDK)
2881    cycles for 100 * s2int A
2951    cycles for 100 * s2int B

Averages:
2728    cycles for atodw (Masm32 SDK)
2888    cycles for s2int A
2886    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
30      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

-
Title: Re: Decimal ascii to dword conversion
Post by: jj2007 on March 02, 2024, 03:17:28 AM
New version:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Averages:
3727    cycles for atodw (Masm32 SDK)
2627    cycles for s2int A
2945    cycles for s2int B

AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
2185    cycles for atodw (Masm32 SDK)
1822    cycles for s2int A
2008    cycles for s2int B

Quote from: HSE on March 02, 2024, 01:23:22 AM
Averages:
2728    cycles for atodw (Masm32 SDK)
2888    cycles for s2int A
2886    cycles for s2int B

Interesting :rolleyes:
Title: Re: Decimal ascii to dword conversion
Post by: HSE on March 02, 2024, 04:53:57 AM
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

2679    cycles for 100 * atodw (Masm32 SDK)
2302    cycles for 100 * s2int A
2810    cycles for 100 * s2int B

2633    cycles for 100 * atodw (Masm32 SDK)
2251    cycles for 100 * s2int A
2836    cycles for 100 * s2int B

2661    cycles for 100 * atodw (Masm32 SDK)
2294    cycles for 100 * s2int A
2832    cycles for 100 * s2int B

2635    cycles for 100 * atodw (Masm32 SDK)
2247    cycles for 100 * s2int A
2819    cycles for 100 * s2int B

Averages:
2648    cycles for atodw (Masm32 SDK)
2272    cycles for s2int A
2826    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
34      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

--- ok ---

 :thumbsup:
Title: Re: Decimal ascii to dword conversion
Post by: sinsi on March 02, 2024, 07:33:26 AM
Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (SSE4)

1920    cycles for 100 * atodw (Masm32 SDK)
1710    cycles for 100 * s2int A
2080    cycles for 100 * s2int B

1926    cycles for 100 * atodw (Masm32 SDK)
1688    cycles for 100 * s2int A
2059    cycles for 100 * s2int B

1923    cycles for 100 * atodw (Masm32 SDK)
1681    cycles for 100 * s2int A
2061    cycles for 100 * s2int B

1927    cycles for 100 * atodw (Masm32 SDK)
1680    cycles for 100 * s2int A
2062    cycles for 100 * s2int B

Averages:
1924    cycles for atodw (Masm32 SDK)
1684    cycles for s2int A
2062    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
34      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B
Title: Re: Decimal ascii to dword conversion
Post by: jj2007 on March 02, 2024, 09:59:17 AM
Just for fun, the 64-bit version (assembles with ML64, UAsm and JWasm, in 64- or 32-bit mode):

This program was assembled with ml64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890123456789 in 891 ms (version A)
Res=1234567890123456789 in 922 ms (version B)
Res=1234567890123456789 in 812 ms (version A)
Res=1234567890123456789 in 906 ms (version B)
Res=1234567890123456789 in 813 ms (version A)
Res=1234567890123456789 in 922 ms (version B)
This program was assembled with ML in 32-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890 in 515 ms (version A)
Res=1234567890 in 485 ms (version B)
Res=1234567890 in 437 ms (version A)
Res=1234567890 in 469 ms (version B)
Res=1234567890 in 469 ms (version A)
Res=1234567890 in 469 ms (version B)
Title: Re: Decimal ascii to dword conversion
Post by: InfiniteLoop on March 02, 2024, 12:55:59 PM
This one works even better than the last one. It uses AVX2.
It returns the lowest 20 digits of the first number in the string, e.g.:
rcx = "xxxxx0000000000000999999xx11xxx11222233334444xxx5555xxxxxxxxxxxxxxx1xxx23xxxx123"
rax = 999999
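
As a plain reference for the behaviour described above, here is a scalar C sketch. It is not a translation of the AVX2 routine below; the function name first_number_u64 is hypothetical, and how the real routine treats runs longer than 20 digits or values above 2^64-1 is an assumption (this sketch keeps the last 20 digits and lets the unsigned arithmetic wrap).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of the first run of decimal digits in s, using at most its last 20 digits. */
uint64_t first_number_u64(const char *s)
{
    assert(s != NULL);

    /* Skip everything that is not a decimal digit. */
    while (*s != '\0' && (*s < '0' || *s > '9'))
        s++;

    /* Find the end of the digit run. */
    const char *end = s;
    while (*end >= '0' && *end <= '9')
        end++;

    /* Assumption: "lowest 20 digits" means the last 20 digits of the run. */
    if (end - s > 20)
        s = end - 20;

    uint64_t value = 0;
    for (; s < end; s++)
        value = value * 10u + (uint64_t)(*s - '0');   /* wraps modulo 2^64 */
    return value;
}
```

With the example string from this post, the first digit run is "0000000000000999999", so the sketch returns 999999.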

;-----------------------------------------------------------------------------------
;----------------------- String to Integer 64-bit ----------------------
;-----------------------------------------------------------------------------------
String_To_Int_AVX2 proc
vmovdqu ymm0,ymmword ptr [rcx]
vpbroadcastb ymm4, byte ptr [StrASCII]
vpbroadcastb ymm3,byte ptr [StrCmp9]
vpsubb ymm0,ymm0,ymm4
vpcmpeqd ymm2,ymm2,ymm2
vpcmpgtb ymm5,ymm3,ymm0
vpcmpgtb ymm4,ymm0,ymm2
vpand ymm4,ymm4,ymm5
vpand ymm0,ymm0,ymm4 ;"xx0012340xx1xx" -> "00111111100100"
vpmovmskb eax,ymm4 ;001111111100100
tzcnt edx,eax ;5
shrx eax,eax,edx ;111111100100
not eax ;111100011001110011111111111
tzcnt eax,eax ;16
add eax,edx ;shift l 20-eax = 18?
lea rcx,[StrConvShift]
lea rcx, [rcx + rax - 20]
vmovdqu ymm4, ymmword ptr [rcx] ;h
vmovdqu ymm5, ymmword ptr [rcx+16] ;l
vperm2i128 ymm1,ymm0,ymm0,010001b ;h,h
vinserti128 ymm2,ymm0,xmm0,1 ;l,l
vpshufb ymm3,ymm1,ymm4
vpshufb ymm0,ymm2,ymm5
vpor ymm0,ymm0,ymm3
vmovdqu ymm1,ymmword ptr [MulStr101B]
vmovdqu ymm2,ymmword ptr [MulStr101W]
vpmaddubsw ymm0,ymm0,ymm1 ;|ab,cd,ef,gh,ij,kl,mn,op|qr,st|
vpmaddwd ymm0,ymm0,ymm2 ;|abcd,0000,efgh,0000,ijkl,0000,mnop,0000|qrst,0000| save
vmovdqu xmm5,xmmword ptr [MulStr101D] ;|abcd,efgh,0000,0000,ijkl,mnop,0000,0000|
vpshufb xmm1,xmm0,xmmword ptr [StrConvShif16]
vpmaddwd xmm1,xmm1,xmm5 ;|00000000abcdefgh,00000000ijklmnop||
vpbroadcastd xmm2,dword ptr [Mul10000]
vpmuludq xmm1,xmm1,xmm2 ;|abcdefgh0000,ijklmnop0000|
vextracti128 xmm0,ymm0,1
vpshufd xmm3,xmm1,00001110b
vpaddq xmm0,xmm0,xmm3
vmovq rax,xmm1
imul rax,100000000
vmovq rdx,xmm0
add rax,rdx
ret
BYTE 32 DUP(-1)
StrConvShift:
BYTE 16 DUP(-1)
BYTE 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
BYTE 16 DUP(-1)
BYTE 32 DUP(-1)
StrCmp9 BYTE 10
StrASCII BYTE 48
Mul10000 DWORD 10000
MulStr101B BYTE 10 DUP(10,1),12 DUP(0)
MulStr101W WORD 5 DUP(100,1),6 DUP(0)
MulStr101D WORD 2 DUP(10000,1,0,0)
StrConvShif16 BYTE 0,1,4,5,4 DUP(-1),8,9,12,13,4 DUP(-1)
String_To_Int_AVX2 endp
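
The comments next to vpmaddubsw and vpmaddwd describe a pairwise combining scheme: digits become two-digit values, those become four-digit values, and so on. A minimal scalar C sketch of that idea for exactly eight digits follows. The name parse8_pairwise is made up for illustration, and the sketch leaves out the masking and shuffling that the routine above does first.

```c
#include <assert.h>
#include <stdint.h>

/* Combine exactly 8 ASCII digits in three pairwise steps instead of 8 serial ones. */
uint32_t parse8_pairwise(const char digits[8])
{
    uint8_t  d[8];
    uint16_t pair[4];
    uint32_t quad[2];

    for (int i = 0; i < 8; i++) {
        assert(digits[i] >= '0' && digits[i] <= '9');
        d[i] = (uint8_t)(digits[i] - '0');
    }

    /* Step 1, weights (10,1): a,b -> ab. This is the role of vpmaddubsw. */
    for (int i = 0; i < 4; i++)
        pair[i] = (uint16_t)(d[2 * i] * 10 + d[2 * i + 1]);

    /* Step 2, weights (100,1): ab,cd -> abcd. This is the role of vpmaddwd. */
    for (int i = 0; i < 2; i++)
        quad[i] = (uint32_t)pair[2 * i] * 100u + pair[2 * i + 1];

    /* Step 3, weights (10000,1): abcd,efgh -> abcdefgh. */
    return quad[0] * 10000u + quad[1];
}
```

The point of the layout is that each step has no dependency between its lanes, so a SIMD unit can do all multiplications of one step at once, whereas the imul and lea loops carry a dependency from every digit to the next.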

Title: Re: Decimal ascii to dword conversion
Post by: jj2007 on March 02, 2024, 12:59:34 PM
Quote from: InfiniteLoop on March 02, 2024, 12:55:59 PM
This one works even better

shrx eax,eax,edx: Error A2049: Invalid instruction operands

Post complete working code, please.

P.S.: I was curious, so I investigated how to encode this stuff with the latest UAsm release:

db 0C4h, 0E2h, 06Bh, 0F7h, 0C0h      ; shrx eax,eax,edx 111111100100
db 62h, 0F2h, 75h, 28h, 8Dh, 0C0h                    ; vpermb ymm0{k0},ymm1,ymm0

The disassembly looks OK now, but my brand-new AMD Athlon Gold 3150U throws an "illegal encoding" error for vpermb. Does anybody have a CPU that can handle this instruction? Can we see benchmarks?

Don't get me wrong: your code looks very interesting, and I suppose you put a lot of brain work into it. However, AVX-512 runs only on a handful of CPUs, so the value of such code for a library is limited. Besides, looking at the sheer number of instructions, I'd like to see if it's really faster than the straightforward "old" versions.
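
To check hand-encoded bytes like the shrx line above, it helps to pull the three-byte VEX prefix apart. The following C sketch does that under stated assumptions: the struct Vex3 and the function decode_vex3 are hypothetical names, the inverted R/X/B bits are ignored, and only register-to-register forms of exactly five bytes are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned map;     /* mmmmm field: 1 = 0F, 2 = 0F38, 3 = 0F3A          */
    unsigned w;       /* VEX.W                                            */
    unsigned vvvv;    /* extra source register number, already un-inverted */
    unsigned l;       /* VEX.L: 0 = 128-bit or scalar, 1 = 256-bit        */
    unsigned pp;      /* implied prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2 */
    unsigned opcode;
    unsigned modrm;
} Vex3;

/* Returns 1 and fills *out if b starts with a C4 (three-byte VEX) prefix, else 0. */
int decode_vex3(const uint8_t *b, size_t n, Vex3 *out)
{
    assert(b != NULL && out != NULL);
    if (n < 5 || b[0] != 0xC4)
        return 0;
    out->map    = b[1] & 0x1Fu;
    out->w      = (unsigned)b[2] >> 7;
    out->vvvv   = (((unsigned)b[2] >> 3) & 0x0Fu) ^ 0x0Fu;  /* stored inverted */
    out->l      = ((unsigned)b[2] >> 2) & 1u;
    out->pp     = b[2] & 3u;
    out->opcode = b[3];
    out->modrm  = b[4];
    return 1;
}
```

For the bytes C4 E2 6B F7 C0 this gives map 2 (0F38), W = 0, vvvv = 2 (edx), L = 0, pp = 3 (F2), opcode F7 and ModRM C0 (eax, eax), which is consistent with shrx eax,eax,edx.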
Title: Re: Decimal ascii to dword conversion
Post by: InfiniteLoop on March 03, 2024, 10:42:54 AM
It's fixed now. I've tried every combination of letters and zeros. No more vpermb.
Zen 4 and Alder Lake use AVX-512. Zen 5 is supposed to upgrade it further.
Title: Re: Decimal ascii to dword conversion
Post by: jj2007 on March 03, 2024, 12:20:28 PM
Quote from: InfiniteLoop on March 03, 2024, 10:42:54 AM
It's fixed now. I've tried every combination of letters and zeros. No more vpermb.
Zen 4 and Alder Lake use AVX-512. Zen 5 is supposed to upgrade it further.

Compliments, I got it running :thumbsup:

This program was assembled with UAsm64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890123456789 in 500 ms (version AVX-512)
Res=1234567890123456789 in 828 ms (imul)
Res=1234567890123456789 in 891 ms (lea)
Res=1234567890123456789 in 500 ms (version AVX-512)
Res=1234567890123456789 in 797 ms (imul)
Res=1234567890123456789 in 922 ms (lea)
Res=123456789012 in 484 ms (version AVX-512)
Res=123456789012 in 547 ms (imul)
Res=123456789012 in 594 ms (lea)
Res=123456789012 in 484 ms (version AVX-512)
Res=123456789012 in 531 ms (imul)
Res=123456789012 in 594 ms (lea)
Res=12345678 in 500 ms (version AVX-512)
Res=12345678 in 359 ms (imul)
Res=12345678 in 422 ms (lea)
Res=12345678 in 516 ms (version AVX-512)
Res=12345678 in 359 ms (imul)
Res=12345678 in 422 ms (lea)
s2intY (AVX):  491 bytes
s2intA (imul): 29 bytes
s2intB (lea):  30 bytes
Title: Re: Decimal ascii to dword conversion
Post by: sinsi on March 03, 2024, 12:55:20 PM
Code bloat! Seriously though, is the trade-off between code size and execution time worth it?
This program was assembled with UAsm64 in 64-bit format.
Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Res=1234567890123456789 in 281 ms (version AVX-512)
Res=1234567890123456789 in 500 ms (imul)
Res=1234567890123456789 in 484 ms (lea)
Res=1234567890123456789 in 281 ms (version AVX-512)
Res=1234567890123456789 in 500 ms (imul)
Res=1234567890123456789 in 485 ms (lea)
Res=123456789012 in 297 ms (version AVX-512)
Res=123456789012 in 312 ms (imul)
Res=123456789012 in 328 ms (lea)
Res=123456789012 in 297 ms (version AVX-512)
Res=123456789012 in 328 ms (imul)
Res=123456789012 in 328 ms (lea)
Res=12345678 in 282 ms (version AVX-512)
Res=12345678 in 218 ms (imul)
Res=12345678 in 250 ms (lea)
Res=12345678 in 282 ms (version AVX-512)
Res=12345678 in 218 ms (imul)
Res=12345678 in 250 ms (lea)
s2intY (AVX):  491 bytes
s2intA (imul): 29 bytes
s2intB (lea):  30 bytes
Is that actually AVX-512? I don't think my CPU supports that.
Title: Re: Decimal ascii to dword conversion
Post by: jj2007 on March 03, 2024, 01:17:56 PM
Quote from: sinsi on March 03, 2024, 12:55:20 PM
Is that actually AVX-512?

I'm not sure. The name of the function is String_To_Int_AVX2 :rolleyes:
Title: Re: Decimal ascii to dword conversion
Post by: fearless on March 03, 2024, 04:51:23 PM
This program was assembled with UAsm64 in 64-bit format.
AMD Ryzen 9 5950X 16-Core Processor
Res=1234567890123456789 in 171 ms (version AVX-512)
Res=1234567890123456789 in 407 ms (imul)
Res=1234567890123456789 in 265 ms (lea)
Res=1234567890123456789 in 172 ms (version AVX-512)
Res=1234567890123456789 in 391 ms (imul)
Res=1234567890123456789 in 297 ms (lea)
Res=123456789012 in 172 ms (version AVX-512)
Res=123456789012 in 250 ms (imul)
Res=123456789012 in 250 ms (lea)
Res=123456789012 in 171 ms (version AVX-512)
Res=123456789012 in 219 ms (imul)
Res=123456789012 in 266 ms (lea)
Res=12345678 in 172 ms (version AVX-512)
Res=12345678 in 172 ms (imul)
Res=12345678 in 187 ms (lea)
Res=12345678 in 172 ms (version AVX-512)
Res=12345678 in 172 ms (imul)
Res=12345678 in 187 ms (lea)
s2intY (AVX):  491 bytes
s2intA (imul): 29 bytes
s2intB (lea):  30 bytes
Title: Re: Decimal ascii to dword conversion
Post by: TimoVJL on March 03, 2024, 08:07:02 PM
Quote from: jj2007 on March 03, 2024, 01:17:56 PM
Quote from: sinsi on March 03, 2024, 12:55:20 PM
Is that actually AVX-512?

I'm not sure. The name of the function is String_To_Int_AVX2 :rolleyes:
It must be AVX2, since the AMD Athlon Gold 3150U with Radeon Graphics runs it.
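
Rather than inferring the instruction set from which machines run the code, the question can be put to the CPU directly. A minimal C sketch follows, assuming GCC or Clang on x86, where the __builtin_cpu_supports builtin is available. The three wrapper names are made up for illustration, and on any other compiler or architecture they return -1 for "cannot tell".

```c
/* Each function returns 1 = supported, 0 = not supported, -1 = cannot tell here. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define CPU_HAS(feature) (__builtin_cpu_init(), __builtin_cpu_supports(feature) ? 1 : 0)
#else
#define CPU_HAS(feature) (-1)   /* assumption: no portable check on this build */
#endif

int cpu_has_avx2(void)    { return CPU_HAS("avx2"); }     /* vpbroadcastb, vpshufb ymm */
int cpu_has_bmi2(void)    { return CPU_HAS("bmi2"); }     /* shrx */
int cpu_has_avx512f(void) { return CPU_HAS("avx512f"); }  /* AVX-512 foundation */
```

A routine like the one posted above would need cpu_has_avx2() and cpu_has_bmi2() to return 1. It would not need cpu_has_avx512f() once vpermb is gone.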
Title: Re: Decimal ascii to dword conversion
Post by: jj2007 on March 03, 2024, 10:05:46 PM
Here's one more, fresh from the Lab:
AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

1981    cycles for 100 * atodw (Masm32 SDK)
1507    cycles for 100 * s2int imul
1750    cycles for 100 * s2int lea
364     cycles for 100 * StringToDwordSimd

1975    cycles for 100 * atodw (Masm32 SDK)
1512    cycles for 100 * s2int imul
1709    cycles for 100 * s2int lea
396     cycles for 100 * StringToDwordSimd

1983    cycles for 100 * atodw (Masm32 SDK)
1492    cycles for 100 * s2int imul
1711    cycles for 100 * s2int lea
370     cycles for 100 * StringToDwordSimd

1988    cycles for 100 * atodw (Masm32 SDK)
1520    cycles for 100 * s2int imul
1718    cycles for 100 * s2int lea
398     cycles for 100 * StringToDwordSimd

10      bytes for atodw (Masm32 SDK)
34      bytes for s2int imul
30      bytes for s2int lea
114     bytes for StringToDwordSimd

12345678        = eax atodw (Masm32 SDK)
12345678        = eax s2int imul
12345678        = eax s2int lea
12345678        = eax StringToDwordSimd

(Guess what? No MasmBasic needed...)