Can I have some timings, please? (spinoff from this thread (https://masm32.com/board/index.php?topic=11739.0))
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
2257 cycles for 100 * atodw (Masm32 SDK)
1862 cycles for 100 * s2int A
2145 cycles for 100 * s2int B
2165 cycles for 100 * atodw (Masm32 SDK)
1925 cycles for 100 * s2int A
2159 cycles for 100 * s2int B
2274 cycles for 100 * atodw (Masm32 SDK)
1965 cycles for 100 * s2int A
1988 cycles for 100 * s2int B
2178 cycles for 100 * atodw (Masm32 SDK)
2102 cycles for 100 * s2int A
1977 cycles for 100 * s2int B
Averages:
2218 cycles for atodw (Masm32 SDK)
1945 cycles for s2int A
2066 cycles for s2int B
10 bytes for atodw (Masm32 SDK)
30 bytes for s2int A
30 bytes for s2int B
123456789 = eax atodw (Masm32 SDK)
123456789 = eax s2int A
123456789 = eax s2int B
The algos:
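OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
s2intA proc arg
  pop eax                       ; return address
  pop edx                       ; arg: pointer to the string
  push eax                      ; put the return address back (the arg is now off the stack)
  xor eax, eax                  ; result = 0
  .While 1
    movzx ecx, byte ptr [edx]   ; next character
    sub ecx, "0"                ; ASCII -> digit value
    js @F                       ; below "0" (e.g. the zero terminator): done
    if 1 ; A
      imul eax, 10              ; result*10
      add eax, ecx              ; + digit
      inc edx
    else ; B
      lea eax, [eax+4*eax]      ; *5
      inc edx
      lea eax, [2*eax+ecx]      ; *2 + digit
    endif
  .Endw
@@:
  retn                          ; plain retn: the arg was already popped above
s2intA endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef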
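For anyone who doesn't read MASM: the two variants differ only in how the running total gets multiplied by ten. Below is a minimal C sketch of the same loop. The names s2int_a and s2int_b are made up for illustration and are not in the attachment. Like the assembly, it stops at the first byte below "0" and does not reject bytes above "9".

```c
#include <stdint.h>

/* Variant A: plain multiply by 10 (the imul path). */
uint32_t s2int_a(const char *s)
{
    uint32_t acc = 0;
    for (;;) {
        int32_t d = (int32_t)(unsigned char)*s - '0';
        if (d < 0)                      /* js @F: byte below '0', e.g. the terminator */
            break;
        acc = acc * 10u + (uint32_t)d;  /* imul eax, 10 / add eax, ecx */
        ++s;
    }
    return acc;
}

/* Variant B: x*10 computed as (x + 4*x) * 2, the two-lea path. */
uint32_t s2int_b(const char *s)
{
    uint32_t acc = 0;
    for (;;) {
        int32_t d = (int32_t)(unsigned char)*s - '0';
        if (d < 0)
            break;
        acc = acc + 4u * acc;           /* lea eax, [eax+4*eax] */
        acc = 2u * acc + (uint32_t)d;   /* lea eax, [2*eax+ecx] */
        ++s;
    }
    return acc;
}
```

Both functions return the same value for any input, so the timings above compare instruction selection only. What a C compiler emits for either form is up to the compiler.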
AMD Athlon(tm) II X2 220 Processor (SSE3)
4637 cycles for 100 * atodw (Masm32 SDK)
3985 cycles for 100 * s2int A
6634 cycles for 100 * s2int B
4638 cycles for 100 * atodw (Masm32 SDK)
3984 cycles for 100 * s2int A
6423 cycles for 100 * s2int B
4634 cycles for 100 * atodw (Masm32 SDK)
3986 cycles for 100 * s2int A
6642 cycles for 100 * s2int B
4690 cycles for 100 * atodw (Masm32 SDK)
3981 cycles for 100 * s2int A
6543 cycles for 100 * s2int B
Averages:
4638 cycles for atodw (Masm32 SDK)
3984 cycles for s2int A
6588 cycles for s2int B
10 bytes for atodw (Masm32 SDK)
30 bytes for s2int A
30 bytes for s2int B
123456789 = eax atodw (Masm32 SDK)
123456789 = eax s2int A
123456789 = eax s2int B
--- ok ---
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
2783 cycles for 100 * atodw (Masm32 SDK)
2894 cycles for 100 * s2int A
2869 cycles for 100 * s2int B
2686 cycles for 100 * atodw (Masm32 SDK)
2859 cycles for 100 * s2int A
2875 cycles for 100 * s2int B
2769 cycles for 100 * atodw (Masm32 SDK)
2920 cycles for 100 * s2int A
2897 cycles for 100 * s2int B
2668 cycles for 100 * atodw (Masm32 SDK)
2881 cycles for 100 * s2int A
2951 cycles for 100 * s2int B
Averages:
2728 cycles for atodw (Masm32 SDK)
2888 cycles for s2int A
2886 cycles for s2int B
10 bytes for atodw (Masm32 SDK)
30 bytes for s2int A
30 bytes for s2int B
123456789 = eax atodw (Masm32 SDK)
123456789 = eax s2int A
123456789 = eax s2int B
New version:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Averages:
3727 cycles for atodw (Masm32 SDK)
2627 cycles for s2int A
2945 cycles for s2int B
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
2185 cycles for atodw (Masm32 SDK)
1822 cycles for s2int A
2008 cycles for s2int B
Quote from: HSE on March 02, 2024, 01:23:22 AM
Averages:
2728 cycles for atodw (Masm32 SDK)
2888 cycles for s2int A
2886 cycles for s2int B
Interesting :rolleyes:
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
2679 cycles for 100 * atodw (Masm32 SDK)
2302 cycles for 100 * s2int A
2810 cycles for 100 * s2int B
2633 cycles for 100 * atodw (Masm32 SDK)
2251 cycles for 100 * s2int A
2836 cycles for 100 * s2int B
2661 cycles for 100 * atodw (Masm32 SDK)
2294 cycles for 100 * s2int A
2832 cycles for 100 * s2int B
2635 cycles for 100 * atodw (Masm32 SDK)
2247 cycles for 100 * s2int A
2819 cycles for 100 * s2int B
Averages:
2648 cycles for atodw (Masm32 SDK)
2272 cycles for s2int A
2826 cycles for s2int B
10 bytes for atodw (Masm32 SDK)
34 bytes for s2int A
30 bytes for s2int B
123456789 = eax atodw (Masm32 SDK)
123456789 = eax s2int A
123456789 = eax s2int B
--- ok ---
:thumbsup:
Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (SSE4)
1920 cycles for 100 * atodw (Masm32 SDK)
1710 cycles for 100 * s2int A
2080 cycles for 100 * s2int B
1926 cycles for 100 * atodw (Masm32 SDK)
1688 cycles for 100 * s2int A
2059 cycles for 100 * s2int B
1923 cycles for 100 * atodw (Masm32 SDK)
1681 cycles for 100 * s2int A
2061 cycles for 100 * s2int B
1927 cycles for 100 * atodw (Masm32 SDK)
1680 cycles for 100 * s2int A
2062 cycles for 100 * s2int B
Averages:
1924 cycles for atodw (Masm32 SDK)
1684 cycles for s2int A
2062 cycles for s2int B
10 bytes for atodw (Masm32 SDK)
34 bytes for s2int A
30 bytes for s2int B
123456789 = eax atodw (Masm32 SDK)
123456789 = eax s2int A
123456789 = eax s2int B
Just for fun, the 64-bit version (assembles with ML64, UAsm and JWasm, in 64- or 32-bit mode):
This program was assembled with ml64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890123456789 in 891 ms (version A)
Res=1234567890123456789 in 922 ms (version B)
Res=1234567890123456789 in 812 ms (version A)
Res=1234567890123456789 in 906 ms (version B)
Res=1234567890123456789 in 813 ms (version A)
Res=1234567890123456789 in 922 ms (version B)
This program was assembled with ML in 32-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890 in 515 ms (version A)
Res=1234567890 in 485 ms (version B)
Res=1234567890 in 437 ms (version A)
Res=1234567890 in 469 ms (version B)
Res=1234567890 in 469 ms (version A)
Res=1234567890 in 469 ms (version B)
This one works even better than the last one. AVX2.
Returns the lowest 20 digits of the first number in the string.
e.g.
rcx = "xxxxx0000000000000999999xx11xxx11222233334444xxx5555xxxxxxxxxxxxxxx1xxx23xxxx123"
rax = 999999
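To check results against that description, a plain scalar reference is handy. This is only a sketch of the stated behaviour (skip to the first digit, then convert the run of digits). first_number_ref is a hypothetical name, and the sketch assumes the digit run fits in 64 bits. Unlike the AVX2 routine, it walks the whole zero-terminated string rather than a single 32-byte load.

```c
#include <stdint.h>

/* Hypothetical reference: value of the first run of decimal digits in s,
   or 0 if s contains no digit. Leading zeros in the run are harmless. */
uint64_t first_number_ref(const char *s)
{
    uint64_t acc = 0;

    /* Skip everything up to the first digit (or the terminator). */
    while (*s && (*s < '0' || *s > '9'))
        ++s;

    /* Convert the digit run; stops at the first non-digit. */
    while (*s >= '0' && *s <= '9') {
        acc = acc * 10u + (uint64_t)(*s - '0');
        ++s;
    }
    return acc;
}
```

For the example string above, this returns 999999, matching the rax value shown.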
;-----------------------------------------------------------------------------------
;----------------------- String to Integer 64-bit ----------------------
;-----------------------------------------------------------------------------------
String_To_Int_AVX2 proc
vmovdqu ymm0,ymmword ptr [rcx]
vpbroadcastb ymm4, byte ptr [StrASCII]
vpbroadcastb ymm3,byte ptr [StrCmp9]
vpsubb ymm0,ymm0,ymm4
vpcmpeqd ymm2,ymm2,ymm2
vpcmpgtb ymm5,ymm3,ymm0
vpcmpgtb ymm4,ymm0,ymm2
vpand ymm4,ymm4,ymm5
vpand ymm0,ymm0,ymm4 ;"xx0012340xx1xx" -> "00111111100100"
vpmovmskb eax,ymm4 ;001111111100100
tzcnt edx,eax ;5
shrx eax,eax,edx ;111111100100
not eax ;111100011001110011111111111
tzcnt eax,eax ;16
add eax,edx ;shift l 20-eax = 18?
lea rcx,[StrConvShift]
lea rcx, [rcx + rax - 20]
vmovdqu ymm4, ymmword ptr [rcx] ;h
vmovdqu ymm5, ymmword ptr [rcx+16] ;l
vperm2i128 ymm1,ymm0,ymm0,010001b ;h,h
vinserti128 ymm2,ymm0,xmm0,1 ;l,l
vpshufb ymm3,ymm1,ymm4
vpshufb ymm0,ymm2,ymm5
vpor ymm0,ymm0,ymm3
vmovdqu ymm1,ymmword ptr [MulStr101B]
vmovdqu ymm2,ymmword ptr [MulStr101W]
vpmaddubsw ymm0,ymm0,ymm1 ;|ab,cd,ef,gh,ij,kl,mn,op|qr,st|
vpmaddwd ymm0,ymm0,ymm2 ;|abcd,0000,efgh,0000,ijkl,0000,mnop,0000|qrst,0000| save
vmovdqu xmm5,xmmword ptr [MulStr101D] ;|abcd,efgh,0000,0000,ijkl,mnop,0000,0000|
vpshufb xmm1,xmm0,xmmword ptr [StrConvShif16]
vpmaddwd xmm1,xmm1,xmm5 ;|00000000abcdefgh,00000000ijklmnop||
vpbroadcastd xmm2,dword ptr [Mul10000]
vpmuludq xmm1,xmm1,xmm2 ;|abcdefgh0000,ijklmnop0000|
vextracti128 xmm0,ymm0,1
vpshufd xmm3,xmm1,00001110b
vpaddq xmm0,xmm0,xmm3
vmovq rax,xmm1
imul rax,100000000
vmovq rdx,xmm0
add rax,rdx
ret
BYTE 32 DUP(-1)
StrConvShift:
BYTE 16 DUP(-1)
BYTE 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
BYTE 16 DUP(-1)
BYTE 32 DUP(-1)
StrCmp9 BYTE 10
StrASCII BYTE 48
Mul10000 DWORD 10000
MulStr101B BYTE 10 DUP(10,1),12 DUP(0)
MulStr101W WORD 5 DUP(100,1),6 DUP(0)
MulStr101D WORD 2 DUP(10000,1,0,0)
StrConvShif16 BYTE 0,1,4,5,4 DUP(-1),8,9,12,13,4 DUP(-1)
String_To_Int_AVX2 endp
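The comments on the vpmaddubsw / vpmaddwd lines describe a pairwise reduction: neighbouring digits are combined with weights (10,1), the resulting pairs with (100,1), then (10000,1), and finally the two halves with 100000000. Here is a scalar C model of that idea, fixed at exactly 16 ASCII digits. It is not a transcription of the routine (which also handles the shuffle/alignment step and up to 20 digits), and combine16_ascii is an invented name.

```c
#include <stdint.h>

/* s points at exactly 16 ASCII digits, most significant first
   (pad with '0' on the left for shorter numbers). */
uint64_t combine16_ascii(const char *s)
{
    uint16_t w[8];  /* 2 digits each: weights (10,1), like vpmaddubsw */
    uint32_t q[4];  /* 4 digits each: weights (100,1), like vpmaddwd  */
    uint64_t h[2];  /* 8 digits each: weights (10000,1)               */
    int i;

    for (i = 0; i < 8; ++i)   /* subtract '0' first, like the vpsubb step */
        w[i] = (uint16_t)((s[2 * i] - '0') * 10 + (s[2 * i + 1] - '0'));
    for (i = 0; i < 4; ++i)
        q[i] = (uint32_t)w[2 * i] * 100u + w[2 * i + 1];
    for (i = 0; i < 2; ++i)
        h[i] = (uint64_t)q[2 * i] * 10000u + q[2 * i + 1];

    return h[0] * 100000000u + h[1];    /* join the two 8-digit halves */
}
```

Each stage halves the number of elements, so 16 digits take four multiply-add steps instead of a 16-iteration dependent chain. That is the reason the SIMD version pulls ahead on long inputs in the timings further down.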
Quote from: InfiniteLoop on March 02, 2024, 12:55:59 PM
This one works even better
shrx eax,eax,edx: Error A2049: Invalid instruction operands
Post complete working code, please.
P.S.: I was curious, so I investigated how to encode this stuff with the latest UAsm release:
db 0C4h, 0E2h, 06Bh, 0F7h, 0C0h ; shrx eax,eax,edx 111111100100
db 62h, 0F2h, 75h, 28h, 8Dh, 0C0h ; vpermb ymm0{k0},ymm1,ymm0
The disassembly looks OK now, but my brand new AMD Athlon Gold 3150U throws an "illegal encoding" error for vpermb. Does anybody have a CPU that can handle this instruction? Can we see benchmarks?
Don't get me wrong: your code looks very interesting, and I suppose you put a lot of brain work into it. However, AVX-512 runs only on a handful of CPUs, so the value of such code for a library is limited. Besides, looking at the sheer number of instructions, I'd like to see if it's really faster than the straightforward "old" versions.
It's fixed now. I've tried every combination of letters and zeros. No more vpermb.
Zen 4 and Alder Lake use AVX-512. Zen 5 is supposed to upgrade it further.
Quote from: InfiniteLoop on March 03, 2024, 10:42:54 AM
It's fixed now. I've tried every combination of letters and zeros. No more vpermb.
Zen 4 and Alder Lake use AVX-512. Zen 5 is supposed to upgrade it further.
Compliments, I got it running :thumbsup:
This program was assembled with UAsm64 in 64-bit format.
AMD Athlon Gold 3150U with Radeon Graphics
Res=1234567890123456789 in 500 ms (version AVX-512)
Res=1234567890123456789 in 828 ms (imul)
Res=1234567890123456789 in 891 ms (lea)
Res=1234567890123456789 in 500 ms (version AVX-512)
Res=1234567890123456789 in 797 ms (imul)
Res=1234567890123456789 in 922 ms (lea)
Res=123456789012 in 484 ms (version AVX-512)
Res=123456789012 in 547 ms (imul)
Res=123456789012 in 594 ms (lea)
Res=123456789012 in 484 ms (version AVX-512)
Res=123456789012 in 531 ms (imul)
Res=123456789012 in 594 ms (lea)
Res=12345678 in 500 ms (version AVX-512)
Res=12345678 in 359 ms (imul)
Res=12345678 in 422 ms (lea)
Res=12345678 in 516 ms (version AVX-512)
Res=12345678 in 359 ms (imul)
Res=12345678 in 422 ms (lea)
s2intY (AVX): 491 bytes
s2intA (imul): 29 bytes
s2intB (lea): 30 bytes
Code bloat! Seriously though, is the trade-off between code size and execution time worth it?
This program was assembled with UAsm64 in 64-bit format.
Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Res=1234567890123456789 in 281 ms (version AVX-512)
Res=1234567890123456789 in 500 ms (imul)
Res=1234567890123456789 in 484 ms (lea)
Res=1234567890123456789 in 281 ms (version AVX-512)
Res=1234567890123456789 in 500 ms (imul)
Res=1234567890123456789 in 485 ms (lea)
Res=123456789012 in 297 ms (version AVX-512)
Res=123456789012 in 312 ms (imul)
Res=123456789012 in 328 ms (lea)
Res=123456789012 in 297 ms (version AVX-512)
Res=123456789012 in 328 ms (imul)
Res=123456789012 in 328 ms (lea)
Res=12345678 in 282 ms (version AVX-512)
Res=12345678 in 218 ms (imul)
Res=12345678 in 250 ms (lea)
Res=12345678 in 282 ms (version AVX-512)
Res=12345678 in 218 ms (imul)
Res=12345678 in 250 ms (lea)
s2intY (AVX): 491 bytes
s2intA (imul): 29 bytes
s2intB (lea): 30 bytes
Is that actually AVX-512? I don't think my CPU supports that.
Quote from: sinsi on March 03, 2024, 12:55:20 PM
Is that actually AVX-512?
I'm not sure. The name of the function is String_To_Int_AVX2 :rolleyes:
This program was assembled with UAsm64 in 64-bit format.
AMD Ryzen 9 5950X 16-Core Processor
Res=1234567890123456789 in 171 ms (version AVX-512)
Res=1234567890123456789 in 407 ms (imul)
Res=1234567890123456789 in 265 ms (lea)
Res=1234567890123456789 in 172 ms (version AVX-512)
Res=1234567890123456789 in 391 ms (imul)
Res=1234567890123456789 in 297 ms (lea)
Res=123456789012 in 172 ms (version AVX-512)
Res=123456789012 in 250 ms (imul)
Res=123456789012 in 250 ms (lea)
Res=123456789012 in 171 ms (version AVX-512)
Res=123456789012 in 219 ms (imul)
Res=123456789012 in 266 ms (lea)
Res=12345678 in 172 ms (version AVX-512)
Res=12345678 in 172 ms (imul)
Res=12345678 in 187 ms (lea)
Res=12345678 in 172 ms (version AVX-512)
Res=12345678 in 172 ms (imul)
Res=12345678 in 187 ms (lea)
s2intY (AVX): 491 bytes
s2intA (imul): 29 bytes
s2intB (lea): 30 bytes
Quote from: jj2007 on March 03, 2024, 01:17:56 PM
Quote from: sinsi on March 03, 2024, 12:55:20 PM
Is that actually AVX-512?
I'm not sure. The name of the function is String_To_Int_AVX2 :rolleyes:
Must be AVX2, as the AMD Athlon Gold 3150U with Radeon Graphics runs it.
Here's one more, fresh from the Lab:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
1981 cycles for 100 * atodw (Masm32 SDK)
1507 cycles for 100 * s2int imul
1750 cycles for 100 * s2int lea
364 cycles for 100 * StringToDwordSimd
1975 cycles for 100 * atodw (Masm32 SDK)
1512 cycles for 100 * s2int imul
1709 cycles for 100 * s2int lea
396 cycles for 100 * StringToDwordSimd
1983 cycles for 100 * atodw (Masm32 SDK)
1492 cycles for 100 * s2int imul
1711 cycles for 100 * s2int lea
370 cycles for 100 * StringToDwordSimd
1988 cycles for 100 * atodw (Masm32 SDK)
1520 cycles for 100 * s2int imul
1718 cycles for 100 * s2int lea
398 cycles for 100 * StringToDwordSimd
10 bytes for atodw (Masm32 SDK)
34 bytes for s2int imul
30 bytes for s2int lea
114 bytes for StringToDwordSimd
12345678 = eax atodw (Masm32 SDK)
12345678 = eax s2int imul
12345678 = eax s2int lea
12345678 = eax StringToDwordSimd
(Guess what? No MasmBasic needed...)