Decimal ascii to dword conversion

Started by jj2007, March 01, 2024, 11:56:20 PM

jj2007

Can I have some timings, please? (spinoff from this thread)

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

2257    cycles for 100 * atodw (Masm32 SDK)
1862    cycles for 100 * s2int A
2145    cycles for 100 * s2int B

2165    cycles for 100 * atodw (Masm32 SDK)
1925    cycles for 100 * s2int A
2159    cycles for 100 * s2int B

2274    cycles for 100 * atodw (Masm32 SDK)
1965    cycles for 100 * s2int A
1988    cycles for 100 * s2int B

2178    cycles for 100 * atodw (Masm32 SDK)
2102    cycles for 100 * s2int A
1977    cycles for 100 * s2int B

Averages:
2218    cycles for atodw (Masm32 SDK)
1945    cycles for s2int A
2066    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
30      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

The algos:
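OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
s2intA proc arg
  pop eax                       ; return address
  pop edx                       ; arg: pointer to the decimal string
  push eax                      ; put the return address back
  xor eax, eax                  ; result = 0
  .While 1
    movzx ecx, byte ptr [edx]   ; next character
    sub ecx, "0"                ; ASCII -> digit value
    js @F                       ; anything below "0" (e.g. the zero terminator) ends the loop
    if 1  ; A
      imul eax, 10
      add eax, ecx
      inc edx
    else  ; B
      lea eax, [eax+4*eax]      ; *5
      inc edx
      lea eax, [2*eax+ecx]      ; *2 + digit
    endif
  .Endw
  @@:
  retn
s2intA endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef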
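
For anyone who wants to check the two variants against each other outside the assembler, here is a minimal C model of the loop above. It is only a sketch: the names s2int_a_model and s2int_b_model are made up for illustration, and it assumes the same stop condition as the assembly (any byte below "0" ends the loop; bytes above "9" are not rejected).

```c
#include <stdint.h>

/* Variant A: one multiply per digit (models imul eax, 10 / add eax, ecx). */
static uint32_t s2int_a_model(const char *s)
{
    uint32_t acc = 0;
    for (;;) {
        int32_t d = (int32_t)(unsigned char)*s - '0';
        if (d < 0)                      /* models js @F */
            break;
        acc = acc * 10u + (uint32_t)d;
        ++s;
    }
    return acc;
}

/* Variant B: the multiply by 10 is split into *5 and *2, as the two lea do. */
static uint32_t s2int_b_model(const char *s)
{
    uint32_t acc = 0;
    for (;;) {
        int32_t d = (int32_t)(unsigned char)*s - '0';
        if (d < 0)
            break;
        acc = acc + 4u * acc;           /* lea eax, [eax+4*eax] */
        ++s;                            /* inc edx */
        acc = 2u * acc + (uint32_t)d;   /* lea eax, [2*eax+ecx] */
    }
    return acc;
}
```

Called with "123456789", both models return 123456789, the same value the testbed prints for eax. The model says nothing about speed; it only shows that A and B compute the same result.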

TimoVJL

AMD Athlon(tm) II X2 220 Processor (SSE3)

4637    cycles for 100 * atodw (Masm32 SDK)
3985    cycles for 100 * s2int A
6634    cycles for 100 * s2int B

4638    cycles for 100 * atodw (Masm32 SDK)
3984    cycles for 100 * s2int A
6423    cycles for 100 * s2int B

4634    cycles for 100 * atodw (Masm32 SDK)
3986    cycles for 100 * s2int A
6642    cycles for 100 * s2int B

4690    cycles for 100 * atodw (Masm32 SDK)
3981    cycles for 100 * s2int A
6543    cycles for 100 * s2int B

Averages:
4638    cycles for atodw (Masm32 SDK)
3984    cycles for s2int A
6588    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
30      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

--- ok ---
May the source be with you

HSE

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

2783    cycles for 100 * atodw (Masm32 SDK)
2894    cycles for 100 * s2int A
2869    cycles for 100 * s2int B

2686    cycles for 100 * atodw (Masm32 SDK)
2859    cycles for 100 * s2int A
2875    cycles for 100 * s2int B

2769    cycles for 100 * atodw (Masm32 SDK)
2920    cycles for 100 * s2int A
2897    cycles for 100 * s2int B

2668    cycles for 100 * atodw (Masm32 SDK)
2881    cycles for 100 * s2int A
2951    cycles for 100 * s2int B

Averages:
2728    cycles for atodw (Masm32 SDK)
2888    cycles for s2int A
2886    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
30      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

-
Equations in Assembly: SmplMath

jj2007

New version:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Averages:
3727    cycles for atodw (Masm32 SDK)
2627    cycles for s2int A
2945    cycles for s2int B

AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
2185    cycles for atodw (Masm32 SDK)
1822    cycles for s2int A
2008    cycles for s2int B

Quote from: HSE on March 02, 2024, 01:23:22 AM
Averages:
2728    cycles for atodw (Masm32 SDK)
2888    cycles for s2int A
2886    cycles for s2int B

Interesting :rolleyes:

HSE

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

2679    cycles for 100 * atodw (Masm32 SDK)
2302    cycles for 100 * s2int A
2810    cycles for 100 * s2int B

2633    cycles for 100 * atodw (Masm32 SDK)
2251    cycles for 100 * s2int A
2836    cycles for 100 * s2int B

2661    cycles for 100 * atodw (Masm32 SDK)
2294    cycles for 100 * s2int A
2832    cycles for 100 * s2int B

2635    cycles for 100 * atodw (Masm32 SDK)
2247    cycles for 100 * s2int A
2819    cycles for 100 * s2int B

Averages:
2648    cycles for atodw (Masm32 SDK)
2272    cycles for s2int A
2826    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
34      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

--- ok ---

 :thumbsup:

sinsi

Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (SSE4)

1920    cycles for 100 * atodw (Masm32 SDK)
1710    cycles for 100 * s2int A
2080    cycles for 100 * s2int B

1926    cycles for 100 * atodw (Masm32 SDK)
1688    cycles for 100 * s2int A
2059    cycles for 100 * s2int B

1923    cycles for 100 * atodw (Masm32 SDK)
1681    cycles for 100 * s2int A
2061    cycles for 100 * s2int B

1927    cycles for 100 * atodw (Masm32 SDK)
1680    cycles for 100 * s2int A
2062    cycles for 100 * s2int B

Averages:
1924    cycles for atodw (Masm32 SDK)
1684    cycles for s2int A
2062    cycles for s2int B

10      bytes for atodw (Masm32 SDK)
34      bytes for s2int A
30      bytes for s2int B

123456789       = eax atodw (Masm32 SDK)
123456789       = eax s2int A
123456789       = eax s2int B

jj2007

Just for fun, the 64-bit version (assembles with ML64, UAsm and JWasm, in 64- or 32-bit mode):

This program was assembled with ml64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890123456789 in 891 ms (version A)
Res=1234567890123456789 in 922 ms (version B)
Res=1234567890123456789 in 812 ms (version A)
Res=1234567890123456789 in 906 ms (version B)
Res=1234567890123456789 in 813 ms (version A)
Res=1234567890123456789 in 922 ms (version B)
This program was assembled with ML in 32-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890 in 515 ms (version A)
Res=1234567890 in 485 ms (version B)
Res=1234567890 in 437 ms (version A)
Res=1234567890 in 469 ms (version B)
Res=1234567890 in 469 ms (version A)
Res=1234567890 in 469 ms (version B)

InfiniteLoop

This one works even better than the last one. It uses AVX2.
It returns the lowest 20 digits of the first number in the string, e.g.
rcx = "xxxxx0000000000000999999xx11xxx11222233334444xxx5555xxxxxxxxxxxxxxx1xxx23xxxx123"
rax = 999999

;-----------------------------------------------------------------------------------
;----------------------- String to Integer 64-bit ----------------------
;-----------------------------------------------------------------------------------
String_To_Int_AVX2 proc
vmovdqu ymm0,ymmword ptr [rcx]
vpbroadcastb ymm4, byte ptr [StrASCII]
vpbroadcastb ymm3,byte ptr [StrCmp9]
vpsubb ymm0,ymm0,ymm4
vpcmpeqd ymm2,ymm2,ymm2
vpcmpgtb ymm5,ymm3,ymm0
vpcmpgtb ymm4,ymm0,ymm2
vpand ymm4,ymm4,ymm5
vpand ymm0,ymm0,ymm4 ;"xx0012340xx1xx" -> "00111111100100"
vpmovmskb eax,ymm4 ;001111111100100
tzcnt edx,eax ;5
shrx eax,eax,edx ;111111100100
not eax ;111100011001110011111111111
tzcnt eax,eax ;16
add eax,edx ;shift l 20-eax = 18?
lea rcx,[StrConvShift]
lea rcx, [rcx + rax - 20]
vmovdqu ymm4, ymmword ptr [rcx] ;h
vmovdqu ymm5, ymmword ptr [rcx+16] ;l
vperm2i128 ymm1,ymm0,ymm0,010001b ;h,h
vinserti128 ymm2,ymm0,xmm0,1 ;l,l
vpshufb ymm3,ymm1,ymm4
vpshufb ymm0,ymm2,ymm5
vpor ymm0,ymm0,ymm3
vmovdqu ymm1,ymmword ptr [MulStr101B]
vmovdqu ymm2,ymmword ptr [MulStr101W]
vpmaddubsw ymm0,ymm0,ymm1 ;|ab,cd,ef,gh,ij,kl,mn,op|qr,st|
vpmaddwd ymm0,ymm0,ymm2 ;|abcd,0000,efgh,0000,ijkl,0000,mnop,0000|qrst,0000| save
vmovdqu xmm5,xmmword ptr [MulStr101D] ;|abcd,efgh,0000,0000,ijkl,mnop,0000,0000|
vpshufb xmm1,xmm0,xmmword ptr [StrConvShif16]
vpmaddwd xmm1,xmm1,xmm5 ;|00000000abcdefgh,00000000ijklmnop||
vpbroadcastd xmm2,dword ptr [Mul10000]
vpmuludq xmm1,xmm1,xmm2 ;|abcdefgh0000,ijklmnop0000|
vextracti128 xmm0,ymm0,1
vpshufd xmm3,xmm1,00001110b
vpaddq xmm0,xmm0,xmm3
vmovq rax,xmm1
imul rax,100000000
vmovq rdx,xmm0
add rax,rdx
ret
BYTE 32 DUP(-1)
StrConvShift:
BYTE 16 DUP(-1)
BYTE 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
BYTE 16 DUP(-1)
BYTE 32 DUP(-1)
StrCmp9 BYTE 10
StrASCII BYTE 48
Mul10000 DWORD 10000
MulStr101B BYTE 10 DUP(10,1),12 DUP(0)
MulStr101W WORD 5 DUP(100,1),6 DUP(0)
MulStr101D WORD 2 DUP(10000,1,0,0)
StrConvShif16 BYTE 0,1,4,5,4 DUP(-1),8,9,12,13,4 DUP(-1)
String_To_Int_AVX2 endp
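
To make the reduction scheme easier to follow, here is a scalar C sketch of what the multiply-add steps appear to compute, judging from the comments in the listing: 20 right-aligned digits become 10 two-digit values (the vpmaddubsw step with weights 10,1), then 5 four-digit values (the vpmaddwd step with weights 100,1), which are then combined into one 64-bit result. This is not the AVX2 routine itself. The names digits20_to_u64 and first_number_model are hypothetical, and the model scans a zero-terminated string, whereas the assembly loads a fixed 32 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine 20 right-aligned digit values (0..9, zero-padded on the left). */
static uint64_t digits20_to_u64(const uint8_t d[20])
{
    uint32_t pair[10];  /* two-digit values, weights (10, 1) */
    uint32_t quad[5];   /* four-digit values, weights (100, 1) */

    for (int i = 0; i < 10; ++i)
        pair[i] = 10u * d[2 * i] + d[2 * i + 1];
    for (int i = 0; i < 5; ++i)
        quad[i] = 100u * pair[2 * i] + pair[2 * i + 1];

    /* Two eight-digit halves, weights (10000, 1). */
    uint64_t hi8 = 10000u * (uint64_t)quad[0] + quad[1];
    uint64_t lo8 = 10000u * (uint64_t)quad[2] + quad[3];

    /* 16 digits, shifted left by four decimal places, plus the last four.
       A full 20-digit input can wrap modulo 2^64. */
    return (hi8 * 100000000u + lo8) * 10000u + quad[4];
}

/* Find the first run of digits and convert its lowest 20 digits. */
static uint64_t first_number_model(const char *s)
{
    uint8_t d[20] = {0};

    while (*s && (*s < '0' || *s > '9'))    /* skip leading non-digits */
        ++s;
    const char *start = s;
    while (*s >= '0' && *s <= '9')          /* walk to the end of the run */
        ++s;

    size_t len = (size_t)(s - start);
    size_t take = len < 20 ? len : 20;
    const char *src = s - take;             /* lowest `take` digits */
    for (size_t i = 0; i < take; ++i)
        d[20 - take + i] = (uint8_t)(src[i] - '0');

    return digits20_to_u64(d);
}
```

Fed the example string from the post, first_number_model returns 999999, and "1234567890123456789" comes back unchanged. The point of the scheme is that each stage halves the number of partial values, so the work grows with the logarithm of the digit count rather than with one multiply per digit.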


jj2007

Quote from: InfiniteLoop on March 02, 2024, 12:55:59 PM
This one works even better

shrx eax,eax,edx: Error A2049: Invalid instruction operands

Post complete working code, please.

P.S.: I was curious, so I investigated how to encode this stuff with the latest UAsm release:

db 0C4h, 0E2h, 06Bh, 0F7h, 0C0h      ; shrx eax,eax,edx 111111100100
db 62h, 0F2h, 75h, 28h, 8Dh, 0C0h                    ; vpermb ymm0{k0},ymm1,ymm0

The disassembly looks ok now, but my brand new AMD Athlon Gold 3150U throws an "illegal encoding" error for vpermb. Does anybody have a CPU that can handle this instruction? Can we see benchmarks?

Don't get me wrong: your code looks very interesting, and I suppose you put a lot of brain work into it. However, AVX-512 runs only on a handful of CPUs, so the value of such code for a library is limited. Besides, looking at the sheer number of instructions, I'd like to see if it's really faster than the straightforward "old" versions.

InfiniteLoop

It's fixed now. I've tried every combination of letters and zeros. No more vpermb.
Zen 4 and Alder Lake use AVX512. Zen 5 is supposed to upgrade it further.

jj2007

Quote from: InfiniteLoop on March 03, 2024, 10:42:54 AM
It's fixed now. I've tried every combination of letters and zeros. No more vpermb.
Zen 4 and Alder Lake use AVX512. Zen 5 is supposed to upgrade it further.

Compliments, I got it running :thumbsup:

This program was assembled with UAsm64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890123456789 in 500 ms (version AVX-512)
Res=1234567890123456789 in 828 ms (imul)
Res=1234567890123456789 in 891 ms (lea)
Res=1234567890123456789 in 500 ms (version AVX-512)
Res=1234567890123456789 in 797 ms (imul)
Res=1234567890123456789 in 922 ms (lea)
Res=123456789012 in 484 ms (version AVX-512)
Res=123456789012 in 547 ms (imul)
Res=123456789012 in 594 ms (lea)
Res=123456789012 in 484 ms (version AVX-512)
Res=123456789012 in 531 ms (imul)
Res=123456789012 in 594 ms (lea)
Res=12345678 in 500 ms (version AVX-512)
Res=12345678 in 359 ms (imul)
Res=12345678 in 422 ms (lea)
Res=12345678 in 516 ms (version AVX-512)
Res=12345678 in 359 ms (imul)
Res=12345678 in 422 ms (lea)
s2intY (AVX):  491 bytes
s2intA (imul): 29 bytes
s2intB (lea):  30 bytes

sinsi

Code bloat! Seriously though, is the trade-off between code size and execution time worth it?
This program was assembled with UAsm64 in 64-bit format.
Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Res=1234567890123456789 in 281 ms (version AVX-512)
Res=1234567890123456789 in 500 ms (imul)
Res=1234567890123456789 in 484 ms (lea)
Res=1234567890123456789 in 281 ms (version AVX-512)
Res=1234567890123456789 in 500 ms (imul)
Res=1234567890123456789 in 485 ms (lea)
Res=123456789012 in 297 ms (version AVX-512)
Res=123456789012 in 312 ms (imul)
Res=123456789012 in 328 ms (lea)
Res=123456789012 in 297 ms (version AVX-512)
Res=123456789012 in 328 ms (imul)
Res=123456789012 in 328 ms (lea)
Res=12345678 in 282 ms (version AVX-512)
Res=12345678 in 218 ms (imul)
Res=12345678 in 250 ms (lea)
Res=12345678 in 282 ms (version AVX-512)
Res=12345678 in 218 ms (imul)
Res=12345678 in 250 ms (lea)
s2intY (AVX):  491 bytes
s2intA (imul): 29 bytes
s2intB (lea):  30 bytes
Is that actually AVX-512? I don't think my CPU supports that.

jj2007

Quote from: sinsi on March 03, 2024, 12:55:20 PM
Is that actually AVX-512?

I'm not sure. The name of the function is String_To_Int_AVX2 :rolleyes:
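
One way to settle the question on a given machine is to read the CPUID feature bits. The sketch below assumes GCC or Clang on x86 and uses their cpuid.h helper (MSVC would need __cpuidex instead); the struct and function names are made up for illustration. It only reads the feature flags in leaf 7 and does not check whether the OS has enabled the YMM/ZMM state, which a complete test would also need (OSXSAVE plus XGETBV).

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>      /* GCC/Clang helper header */
#endif

struct simd_features {
    int avx2;           /* CPUID.(EAX=7,ECX=0):EBX bit 5 */
    int bmi2;           /* EBX bit 8, needed for shrx */
    int avx512f;        /* EBX bit 16 */
    int avx512vbmi;     /* ECX bit 1, needed for vpermb */
};

static struct simd_features query_simd_features(void)
{
    struct simd_features f = {0, 0, 0, 0};
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* Returns 0 if leaf 7 is not available; all flags then stay 0. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.avx2       = (int)((ebx >> 5) & 1u);
        f.bmi2       = (int)((ebx >> 8) & 1u);
        f.avx512f    = (int)((ebx >> 16) & 1u);
        f.avx512vbmi = (int)((ecx >> 1) & 1u);
    }
#endif
    return f;
}
```

If avx512f and avx512vbmi come back 0 on a machine where the routine still runs, the routine cannot depend on AVX-512 there.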

fearless

This program was assembled with UAsm64 in 64-bit format.
AMD Ryzen 9 5950X 16-Core Processor
Res=1234567890123456789 in 171 ms (version AVX-512)
Res=1234567890123456789 in 407 ms (imul)
Res=1234567890123456789 in 265 ms (lea)
Res=1234567890123456789 in 172 ms (version AVX-512)
Res=1234567890123456789 in 391 ms (imul)
Res=1234567890123456789 in 297 ms (lea)
Res=123456789012 in 172 ms (version AVX-512)
Res=123456789012 in 250 ms (imul)
Res=123456789012 in 250 ms (lea)
Res=123456789012 in 171 ms (version AVX-512)
Res=123456789012 in 219 ms (imul)
Res=123456789012 in 266 ms (lea)
Res=12345678 in 172 ms (version AVX-512)
Res=12345678 in 172 ms (imul)
Res=12345678 in 187 ms (lea)
Res=12345678 in 172 ms (version AVX-512)
Res=12345678 in 172 ms (imul)
Res=12345678 in 187 ms (lea)
s2intY (AVX):  491 bytes
s2intA (imul): 29 bytes
s2intB (lea):  30 bytes

TimoVJL

Quote from: jj2007 on March 03, 2024, 01:17:56 PM
Quote from: sinsi on March 03, 2024, 12:55:20 PM
Is that actually AVX-512?

I'm not sure. The name of the function is String_To_Int_AVX2 :rolleyes:

It must be AVX2, since the AMD Athlon Gold 3150U with Radeon Graphics runs it.