Hello.
I have just created a starting point for validating the ALGO to convert
some millions strings into numbers.
Here is the example I used in PowerBasic, that is considered quite a good Compiler,
and the results I got. From here the ASM versions should start and confront the speed
against it.
The Code I used, quite simple:
QuoteSUB Convert
TIX CycleCount
FOR Counter = 1 TO Indice
MyNumbers(Counter) = VAL(MyStrings(Counter))
NEXT
TIX END CycleCount
END SUB
The strings are 7,110,268 in the string array.
The CPU tix used: 2,607,804,194
Let's see how we can speed it up with some ASM code. :eusa_dance:
The file that I use for the test is quite big > 300 MB. They are real data from
S&P 500 quotations from 1997 up to now with one minute lapse.
Try the sval("123") macro, it's part of masm32rt.inc:
include \masm32\include\masm32rt.inc
.data
x123 db "-12345", 0
.code
start:
MsgBox 0, cat$(str$(sval(offset x123)), " is the value"), "val is great:", MB_OK
exit
end start
MasmBasic Val(a string) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1201) is probably a bit slower (it's an allrounder that reads dec, hex, bin in various flavours).
This algo only deals with non signed numbers.
;first approach
;"1234" == 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0 == 1000+200+30+4
mov eax,"1234" ;30303030h to 39393939h
xor eax,"0000" ;sub "0000"
xor ecx,ecx
xor edx,edx
xor ebx,ebx
shld ecx,eax,8 ;"1"
shl eax,8 ;"234"
lea ebx,[ecx*8] ;1*8=8
lea ebx,[ebx+ecx*2] ;8+(1*2)=10
xor ecx,ecx
shld ecx,eax,8 ;"2"
shl eax,8 ;"34"
add ebx,ecx ;10+2=12
lea edx,[ebx*8] ;12*8=96
lea edx,[edx+ebx*2] ;96+(12*2)=120
xor ecx,ecx
xor ebx,ebx
shld ecx,eax,8 ;"3"
shl eax,8 ;"4"
add edx,ecx ;120+3=123
lea ebx,[edx*8] ;123*8=984
lea ebx,[ebx+edx*2] ;984+(123*2)=1230
xor ecx,ecx
shld ecx,eax,8 ;"4"
add ebx,ecx ;1230+4=1234
;ebx="1234" converted to binary equals to 4d2h
You can create a lookup table with all possible combinations from 0000 to 9999 instead of do calculus of each number each time. Will have a execution time gain in 1 million attempts.
Quote from: jj2007 on November 09, 2021, 10:53:52 AM
Try the sval("123") macro, it's part of masm32rt.inc:
include \masm32\include\masm32rt.inc
.data
x123 db "-12345", 0
.code
start:
MsgBox 0, cat$(str$(sval(offset x123)), " is the value"), "val is great:", MB_OK
exit
end start
MasmBasic Val(a string) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1201) is probably a bit slower (it's an allrounder that reads dec, hex, bin in various flavours).
For didactical purpose I'd like to use just pure and simple ASM opcodes, registers, jumps, moves, adds and so on.
Using a MACRO or a FUNCTION that somebody else has assembled can just be done to test the efficiency and speed
but it's not fully satisfying for my thirst of ASM knowledge :-)
Moreover I can't find the sval() macro in MASM32 help, so I can't see what is it and how it works.
Thanks anyway JJ for your suggestions. :-)
Quote from: mineiro on November 09, 2021, 01:11:54 PM
This algo only deals with non signed numbers.
;first approach
;"1234" == 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0 == 1000+200+30+4
mov eax,"1234" ;30303030h to 39393939h
xor eax,"0000" ;sub "0000"
xor ecx,ecx
xor edx,edx
xor ebx,ebx
shld ecx,eax,8 ;"1"
shl eax,8 ;"234"
lea ebx,[ecx*8] ;1*8=8
lea ebx,[ebx+ecx*2] ;8+(1*2)=10
xor ecx,ecx
shld ecx,eax,8 ;"2"
shl eax,8 ;"34"
add ebx,ecx ;10+2=12
lea edx,[ebx*8] ;12*8=96
lea edx,[edx+ebx*2] ;96+(12*2)=120
xor ecx,ecx
xor ebx,ebx
shld ecx,eax,8 ;"3"
shl eax,8 ;"4"
add edx,ecx ;120+3=123
lea ebx,[edx*8] ;123*8=984
lea ebx,[ebx+edx*2] ;984+(123*2)=1230
xor ecx,ecx
shld ecx,eax,8 ;"4"
add ebx,ecx ;1230+4=1234
;ebx="1234" converted to binary equals to 4d2h
You can create a lookup table with all possible combinations from 0000 to 9999 instead of do calculus of each number each time. Will have a execution time gain in 1 million attempts.
Yes Mineiro, this is the kind of reasoning I'm doing myself.
Just need time to regain my ability to use the MASM opcodes.
My data set has not negative numbers, all of them are not signed.
Thanks
I am thinking about an ASCII approach to the conversion from string to number.
If I have the string number "4735", every digit has an ASCII representation, and value:
4 is ASCII 52
7 is ASCII 55
3 is ASCII 51
5 is ASCII 53
The four bytes in memory have these ASCII codes 52,55,51,53.
If I move these 4 bytes into four registers: EAX, EBX, ECX, EDX
and subtract from each the value 48, I have the value
4 in EAX
7 in EBX
3 in ECX
5 in EDX
EAX contains the thousands (4) so it's value has to be multiplied for 1000
EBX contains the hundreds (7) so it's value has to be multiplied for 100
ECX contains the tens (3) so it's value has to be multiplied for 10
EDX contains the units
the sum of the contents of the 4 registers is the number that is represented in the ASCII string.
I don't know if this approach is efficient for speed, but the logic should work.
If there are MMX or SSE opcodes that do in parallel the work on all the four bytes/registers the
efficiency should be better. But I am not aware at the moment if these opcodes exist.
Quote from: frktons on November 09, 2021, 02:24:00 PM
I am thinking about an ASCII approach to the conversion from string to number.
If I have the string number "4735", every digit has an ASCII representation, and value:
4 is ASCII 52
7 is ASCII 55
3 is ASCII 51
5 is ASCII 53
The four bytes in memory have these ASCII codes 52,55,51,53.
If I move these 4 bytes into four registers: EAX, EBX, ECX, EDX
and subtract from each the value 48, I have the value
4 in EAX
7 in EBX
3 in ECX
5 in EDX
EAX contains the thousands (4) so it's value has to be multiplied for 1000
EBX contains the hundreds (7) so it's value has to be multiplied for 100
ECX contains the tens (3) so it's value has to be multiplied for 10
EDX contains the units
the sum of the contents of the 4 registers is the number that is represented in the ASCII string.
I don't know if this approach is efficient for speed, but the logic should work.
If there are MMX or SSE opcodes that do in parallel the work on all the four bytes/registers the
efficiency should be better. But I am not aware at the moment if these opcodes exist.
you can get ascii numbers being " 123" too instead of "0123",so I prefer AND to mask out 32+16 instead of sub
Yes daydreamer, I think it is an efficient way of getting the same result.
I am not used to using these tricks, and my ASM is a bit rusty from long inactivity.
But I understand the point. Thanks
In the data set I am using there are no spaces, but leading "0" if the case.
There are no cases of leading zero, all the values are between 1000 and 5000 indeed.
:thumbsup:
Quote from: frktons on November 09, 2021, 02:24:00 PMEAX contains the thousands (4) so it's value has to be multiplied for 1000
EBX contains the hundreds (7) so it's value has to be multiplied for 100
ECX contains the tens (3) so it's value has to be multiplied for 10
EDX contains the units
include \masm32\include\masm32rt.inc
.data
x123 db "1234", 0
.code
start:
mov ecx, dword ptr x123
movzx eax, cl ; "1"
and eax, 15 ; 1
imul eax, eax, 1000 ; 1000
movzx edx, ch ; "2"
and edx, 15 ; 2
imul edx, edx, 100 ; 200
add eax, edx ; 1200
bswap ecx
movzx edx, ch ; "3"
and edx, 15 ; 3
imul edx, edx, 10 ; 30
add eax, edx ; 1230
and ecx, 15 ; "4" -> 4
add eax, ecx ; 1234
MsgBox 0, cat$(str$(eax), " is the value"), "That was simple:", MB_OK
exit
end start
Hi JJ.
Nice solution using only one register and its one byte segments.
Great idea, compliment my friend.
I was thinking about something similar, but I was wondering if Full 32 bit registers are faster
doing these operations. Maybe not. Need to check it out.
:thumbsup:
What about a lookup table with 0000 to 9999 members ? Only 40k.
This is a variation from daydreamer code.
.data
align 8
total dw 1,10,100,1000
.code
movd xmm1,qword ptr [total]
pxor xmm2,xmm2
mov eax,"1234" ;mov eax,dword ptr []
;bswap eax
xor eax,"0000" ;and eax,0f0f0f0fh
movd xmm0,eax ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1 ;mul and add
phaddd xmm0,xmm0 ;add
movd eax,xmm0 ;eax=4d2h
Quote from: hutch-- on November 09, 2021, 11:32:05 PM
What about a lookup table with 0000 to 9999 members ? Only 40k.
How would you look up, for example, "1234"? Conversion is from Ascii to binary...
Simple, make a dedicated integer conversion that dealt exactly with 4 characters then use it as an index for the array. What is in the array ? Anything you like.
Quote from: mineiro on November 10, 2021, 01:15:12 AM
This is a variation from daydreamer code.
.data
align 8
total dw 1,10,100,1000
.code
movd xmm1,qword ptr [total]
pxor xmm2,xmm2
mov eax,"1234" ;mov eax,dword ptr []
;bswap eax
xor eax,"0000" ;and eax,0f0f0f0fh
movd xmm0,eax ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1 ;mul and add
phaddd xmm0,xmm0 ;add
movd eax,xmm0 ;eax=4d2h
great :thumbsup:
actually my code was just first to get it working right,after that make ascii->double and double->ascii and possible 128bit and 256bit
@Hutch
LUT works fast,how fast if there is a prime among first 1-65535 LUT used code somewhere in forum
Quote from: hutch-- on November 10, 2021, 06:32:03 AM
Simple, make a dedicated integer conversion that dealt exactly with 4 characters then use it as an index for the array. What is in the array ? Anything you like.
This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...
Let's do some timings, just for fun? :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
469 cycles for 100 * C2D_J
1594 cycles for 100 * atodw
21894 cycles for 100 * sscanf
485 cycles for 100 * C2D_J
1582 cycles for 100 * atodw
22022 cycles for 100 * sscanf
469 cycles for 100 * C2D_J
1590 cycles for 100 * atodw
21863 cycles for 100 * sscanf
476 cycles for 100 * C2D_J
1573 cycles for 100 * atodw
21894 cycles for 100 * sscanf
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
Quote from: jj2007 on November 10, 2021, 07:59:05 AM
This "dedicated integer conversion" looks suspiciously like the method I used in reply #8. Only that no array is needed after you got the index...
I was thinking something like this:
lea esi,lookup_table
mov edx,0f0f0f0fh ;mask, lower bits of each byte
mov ecx,"1234" ;source
pext eax,ecx,edx ;index=eax=00001234h, 64k lookup table *2 (element sizeof) ;BMI2 cpuid
movzx ebx,word ptr [esi+eax*2] ;ebx=converted value
I tried a similar solution like Steve suggested, but in a different logic and shape.
I used 4 X 64 elements array, and at the elements with the index = ASCII code (for "1" that is 49)
I initialized the 49th(ASCII code for "1") element of Thousand array with the value 1000 and did the same
for others value up to "9".
This array was for Thousands, the second array was for Hundreds, and The Tens, and Units.
The conversion from "1234" to its corresponding number is just a SUM Thousands(ASCII code first digit) +
SUM Hundreds(ASCII code second digit) + SUM Tens(ASCII code third digit) + SUM Units(ASCII code fourth digit).
I tried it in PowerBasic, and it is 30% faster than the standard VAL() function.
I suppose in ASM it can be much faster because we can use registers, shifts, and ADD to get the
result with arrays with the values we need: 1-9, 10-90,100-900, 1000-9000.
Thanks all for your suggestions and inspiration. It's a great place to be MASM FORUM.
:thumbsup:
Quote from: jj2007 on November 09, 2021, 10:13:07 PM
Quote from: frktons on November 09, 2021, 02:24:00 PMEAX contains the thousands (4) so it's value has to be multiplied for 1000
EBX contains the hundreds (7) so it's value has to be multiplied for 100
ECX contains the tens (3) so it's value has to be multiplied for 10
EDX contains the units
include \masm32\include\masm32rt.inc
.data
x123 db "1234", 0
.code
start:
mov ecx, dword ptr x123
movzx eax, cl ; "1"
and eax, 15 ; 1
imul eax, eax, 1000 ; 1000
movzx edx, ch ; "2"
and edx, 15 ; 2
imul edx, edx, 100 ; 200
add eax, edx ; 1200
bswap ecx
movzx edx, ch ; "3"
and edx, 15 ; 3
imul edx, edx, 10 ; 30
add eax, edx ; 1230
and ecx, 15 ; "4" -> 4
add eax, ecx ; 1234
MsgBox 0, cat$(str$(eax), " is the value"), "That was simple:", MB_OK
exit
end start
Hi JJ.
The solution you suggested is quite simple and elegant.
Pure 32 bit ASM code.
I suppose that the code:
AND EAX, 15 is the equivalent of SUB EAX, 48 (maybe faster?)
If we have an array with all the rounded thousands, hundreds, tens
and we use directly the ASCII code as an index to these arrays
and at the end we just add the four elements corresponding to
the four ASCII codes, could we have a faster result?
I think IMUL is slower than ADD, and we don't need the AND REG, 15.
What do you think my friend?
Enjoy
Hi mineiro & Frank,
Attached a version that you can use for your algos. TestD and TestE are free - just insert your code as demonstrated below the TestA_s: label.
No MasmBasic required, just the plain Masm32 SDK. Enjoy :thumbsup:
Thanks sir jj2007;
Follow results, please note that code size of my procedure is wrong, is bigger than appears:
~/.wine/drive_c/FourCharsToDword2v1$ wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
449 cycles for 100 * C2D_J
1499 cycles for 100 * atodw
12673 cycles for 100 * sscanf
342 cycles for 100 * mineiro
?? cycles for 100 * testE
442 cycles for 100 * C2D_J
1523 cycles for 100 * atodw
12670 cycles for 100 * sscanf
342 cycles for 100 * mineiro
?? cycles for 100 * testE
435 cycles for 100 * C2D_J
1514 cycles for 100 * atodw
12673 cycles for 100 * sscanf
344 cycles for 100 * mineiro
?? cycles for 100 * testE
435 cycles for 100 * C2D_J
1511 cycles for 100 * atodw
12891 cycles for 100 * sscanf
329 cycles for 100 * mineiro
?? cycles for 100 * testE
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
46 bytes for mineiro
2 bytes for testE
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax mineiro
2 = eax testE
Hi JJ, Mineiro. :eusa_clap:
As supposed, with SSE/SSE2 and more advanced opcodes we can have great results
when we work with multiple registers in parallel operations.
It will take time for me to translate into ASM my ideas, but I presume the results could
be interesting.
Ad maiora.
:thumbsup:
Quote from: mineiro on November 10, 2021, 01:23:05 PM
Follow results, please note that code size of my procedure is wrong, is bigger than appears:
Very nice! Code size is calculated including the loop & call part but not including data, such as the 8 bytes of your
total variable. On my old i5 your code is a tick slower than mine, but I see it's much faster on your i7 :thumbsup:
This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:
movq xmm1,qword ptr [total] ;static values
pxor xmm2,xmm2
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
495 cycles for 100 * C2D_J
1640 cycles for 100 * atodw
21981 cycles for 100 * sscanf
544 cycles for 100 * mineiro
469 cycles for 100 * C2D_J
1578 cycles for 100 * atodw
21924 cycles for 100 * sscanf
550 cycles for 100 * mineiro
471 cycles for 100 * C2D_J
1579 cycles for 100 * atodw
21940 cycles for 100 * sscanf
545 cycles for 100 * mineiro
469 cycles for 100 * C2D_J
1577 cycles for 100 * atodw
22206 cycles for 100 * sscanf
545 cycles for 100 * mineiro
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
54 bytes for mineiro
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax mineiro
P.S.: There is a useE=1 at the beginning of the file. Set useE=0 to suppress the unused TestE loop.
Quote from: jj2007 on November 10, 2021, 07:23:10 PM
This should actually go into the proc, not before the loop, in order to be comparable to my algo; it does not influence speed, though:
movq xmm1,qword ptr [total] ;static values
pxor xmm2,xmm2
Inserting that code inside procedure I got a very close value as your code posted before.
I have tried to favor paralellization and inserted 2 values to be converted at one time call (eax and edx registers); sounds ok.
424 cycles for 100 * C2D_J
1504 cycles for 100 * atodw
12760 cycles for 100 * sscanf
652 cycles for 100 * mineiro ;;converting 2 values simultaneously, all code inside procedure
I remember in past when we compete code that we avoid the use of mul/imul in favor of lea. I have tried that "first approach" posted in this topic but performance was not good in my pc (ever preserving register usage (ebx)), something like 700 ~ 720 cycles.
So, let's wait next round sir jj2007.
Quote from: mineiro on November 10, 2021, 09:04:43 PMI have tried to favor paralellization and inserted 2 values to be converted at one time call (eax and edx registers); sounds ok.
I'm not quite sure what your algo does. The other algos get a string pointer as argument:
push eax
invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
pop eax
Quote from: jj2007 on November 10, 2021, 10:40:38 PM
I'm not quite sure what your algo does. The other algos get a string pointer as argument:
Well, so we need define some rules.
Your algo does the same as mine, only reversed string being used.
mov eax, "4321"
call C2D_J
Follow algo that can perform 2 conversions (min2cvt), and that one that uses lea:
I will edit this message and post other computer results:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
553 cycles for 100 * C2D_J
1487 cycles for 100 * atodw
12890 cycles for 100 * sscanf
739 cycles for 100 * min2cvt
333 cycles for 100 * min1cvt
895 cycles for 100 * min_lea
562 cycles for 100 * C2D_J
1485 cycles for 100 * atodw
12871 cycles for 100 * sscanf
743 cycles for 100 * min2cvt
333 cycles for 100 * min1cvt
895 cycles for 100 * min_lea
566 cycles for 100 * C2D_J
1486 cycles for 100 * atodw
12900 cycles for 100 * sscanf
746 cycles for 100 * min2cvt
332 cycles for 100 * min1cvt
751 cycles for 100 * min_lea
561 cycles for 100 * C2D_J
1559 cycles for 100 * atodw
12603 cycles for 100 * sscanf
742 cycles for 100 * min2cvt
332 cycles for 100 * min1cvt
895 cycles for 100 * min_lea
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
77 bytes for min2cvt
44 bytes for min1cvt
92 bytes for min_lea
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax min_lea
wine FourCharsToDword.exe
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
561 cycles for 100 * C2D_J
1912 cycles for 100 * atodw
19520 cycles for 100 * sscanf
672 cycles for 100 * min2cvt
458 cycles for 100 * min1cvt
765 cycles for 100 * min_lea
568 cycles for 100 * C2D_J
1919 cycles for 100 * atodw
19520 cycles for 100 * sscanf
670 cycles for 100 * min2cvt
457 cycles for 100 * min1cvt
767 cycles for 100 * min_lea
561 cycles for 100 * C2D_J
1918 cycles for 100 * atodw
19488 cycles for 100 * sscanf
672 cycles for 100 * min2cvt
456 cycles for 100 * min1cvt
767 cycles for 100 * min_lea
571 cycles for 100 * C2D_J
1912 cycles for 100 * atodw
19498 cycles for 100 * sscanf
671 cycles for 100 * min2cvt
458 cycles for 100 * min1cvt
769 cycles for 100 * min_lea
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
77 bytes for min2cvt
44 bytes for min1cvt
92 bytes for min_lea
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax min_lea
I changed my previous code to be string pointer as argument:
Previous cycles result don't changed.
I removed lea algo.
Quote from: mineiro on November 11, 2021, 12:32:30 AM
I changed my previous code to be string pointer as argument
I'm afraid you are
not passing a string pointer:
mov eax, "4321" ;first value to be converted
mov edx, "1234" ;second value to be converted
mov eax, chr$("1234") would pass a string pointer, as in
.Repeat
push eax
invoke crt_sscanf, chr$("1234"), chr$("%d"), esp
pop eax
dec ebx
.Until Sign?
Quote from: jj2007 on November 11, 2021, 01:34:32 AM
I'm afraid you are not passing a string pointer:
mov eax, "4321" ;first value to be converted
mov edx, "1234" ;second value to be converted
mov eax, chr$("1234") would pass a string pointer, as in
My and your procedure are passing data direct loaded into eax register while other functions (atodw/crt_sscanf) are passing data pointer by stack.
RULES:
All procedures should pass a pointer to data by using stack, stdcall.
I changed C2D_J,min2cvt and min1cvt.
wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
557 cycles for 100 * C2D_J
1524 cycles for 100 * atodw
12665 cycles for 100 * sscanf
767 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt
579 cycles for 100 * C2D_J
1499 cycles for 100 * atodw
12649 cycles for 100 * sscanf
785 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt
602 cycles for 100 * C2D_J
1496 cycles for 100 * atodw
12631 cycles for 100 * sscanf
783 cycles for 100 * min2cvt
429 cycles for 100 * min1cvt
591 cycles for 100 * C2D_J
1497 cycles for 100 * atodw
12631 cycles for 100 * sscanf
747 cycles for 100 * min2cvt
431 cycles for 100 * min1cvt
54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
84 bytes for min2cvt
52 bytes for min1cvt
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
Excellent, mineiro :thumbsup:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
467 cycles for 100 * C2D_J
1592 cycles for 100 * atodw
21900 cycles for 100 * sscanf
761 cycles for 100 * min2cvt
543 cycles for 100 * min1cvt
467 cycles for 100 * C2D_J
1588 cycles for 100 * atodw
21908 cycles for 100 * sscanf
763 cycles for 100 * min2cvt
543 cycles for 100 * min1cvt
467 cycles for 100 * C2D_J
1600 cycles for 100 * atodw
21898 cycles for 100 * sscanf
768 cycles for 100 * min2cvt
549 cycles for 100 * min1cvt
468 cycles for 100 * C2D_J
1591 cycles for 100 * atodw
21924 cycles for 100 * sscanf
767 cycles for 100 * min2cvt
545 cycles for 100 * min1cvt
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
I would think that using Look-up tables for converting numbers from ascii format to binary may be the fastest option. I don't have the facilities to compare various procedures but someone else may want to test it.
I wrote some code to prepare such an LUT (once which should take only a few nanosecs) then used it to convert an ascii number 1,000,000,000 times (10^9). The timing on my older laptop was 1.5 secs!
Code for preparing LUT
.data
t10 dd 10 DUP(0)
t100 dd 10 DUP(0)
t1000 dd 10 DUP(0)
numbers db "3578"
start:
;fill LUTs of multiples
;of 10
lea edi,t10
xor eax,eax
mov edx,10
mov ecx,10
@@:
stosd
add eax,edx
dec ecx
jnz @B
;of 100
xor eax,eax
mov edx,100
mov ecx,10
@@:
stosd
add eax,edx
dec ecx
jnz @B
;of 1000
xor eax,eax
mov edx,1000
mov ecx,10
@@:
stosd
add eax,edx
dec ecx
jnz @B
;there is NO need to have a look-up table for multiples of units!!!
Then I ran the following code for converting ascii to binary, using the QueryPerformanceFrequency function for the timing
mov ecx,1000000000
start1:
lea esi,numbers
mov edx,[esi]
and edx,0f0f0f0fh
movzx ebx,dl ;1000s
mov eax,t1000[ebx*4]
movzx ebx,dh ;100s
add eax,t100[ebx*4]
bswap edx
movzx ebx,dh ;10s
add eax,t10[ebx*4]
movzx ebx,dl ;units
add eax,ebx
dec ecx
jnz start1
AMD Athlon(tm) II X2 220 Processor (SSE3)
833 cycles for 100 * C2D_J
2410 cycles for 100 * atodw
34220 cycles for 100 * sscanf
1296 cycles for 100 * min2cvt
1057 cycles for 100 * min1cvt
806 cycles for 100 * C2D_J
2437 cycles for 100 * atodw
34498 cycles for 100 * sscanf
1297 cycles for 100 * min2cvt
1053 cycles for 100 * min1cvt
805 cycles for 100 * C2D_J
2436 cycles for 100 * atodw
34558 cycles for 100 * sscanf
1297 cycles for 100 * min2cvt
1055 cycles for 100 * min1cvt
805 cycles for 100 * C2D_J
2410 cycles for 100 * atodw
34503 cycles for 100 * sscanf
1296 cycles for 100 * min2cvt
1060 cycles for 100 * min1cvt
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
--- ok ---
LUT with Pelles C1: #include <stdio.h>
2: int a10[] = {0,10,20,30,40,50,60,70,80,90};
3: int a100[] = {0,100,200,300,400,500,600,700,800,900};
4: int a1000[] = {0,1000,2000,3000,4000,5000,6000,7000,8000,9000};
5:
6: int __cdecl main(void)
_main:
[00000000] 55 push ebp
[00000001] 89E5 mov ebp,esp
[00000003] 83EC08 sub esp,8
7: {
8: char anum[] = "1234";
[00000006] C745FB31323334 mov dword ptr [ebp-5],34333231
[0000000D] C645FF00 mov byte ptr [ebp-1],0
9: int num;
10:
11: (*(int*)&anum) &= 0x0f0f0f0f;
[00000011] 8165FB0F0F0F0F and dword ptr [ebp-5],F0F0F0F
12: num = a1000[anum[0]] + a100[anum[1]] + a10[anum[2]] + anum[3];
[00000018] 0FBE45FB movsx eax,byte ptr [ebp-5]
[0000001C] 8B048500000000 mov eax,dword ptr [eax*4+_a1000]
[00000023] 0FBE55FC movsx edx,byte ptr [ebp-4]
[00000027] 03049500000000 add eax,dword ptr [edx*4+_a100]
[0000002E] 0FBE55FD movsx edx,byte ptr [ebp-3]
[00000032] 03049500000000 add eax,dword ptr [edx*4+_a10]
[00000039] 0FBE55FE movsx edx,byte ptr [ebp-2]
[0000003D] 01D0 add eax,edx
13: printf("%u\n", num);
[0000003F] 50 push eax
[00000040] 6800000000 push @6
[00000045] E800000000 call _printf
[0000004A] 83C408 add esp,8
14: return 0;
[0000004D] 31C0 xor eax,eax
[0000004F] 89EC mov esp,ebp
[00000051] 5D pop ebp
[00000052] C3 ret
using temporary string
char anum2[] = "1234";
char anum[5];
int num;
12: (*(int*)&anum) = (*(int*)&anum2) & 0x0f0f0f0f;
[00000011] 8B45FB mov eax,dword ptr [ebp-5]
[00000014] 250F0F0F0F and eax,F0F0F0F
[00000019] 8945F6 mov dword ptr [ebp-A],eax
Quote from: raymond on November 11, 2021, 03:18:06 PM
I would think that using Look-up tables for converting numbers from ascii format to binary may be the fastest option. I don't have the facilities to compare various procedures but someone else may want to test it.
Thank you, Raymond. Here it is as
ConvertLUT:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
475 cycles for 100 * C2D_J
1597 cycles for 100 * atodw
22224 cycles for 100 * sscanf
772 cycles for 100 * min2cvt
555 cycles for 100 * min1cvt
640 cycles for 100 * ConvertLUT
497 cycles for 100 * C2D_J
1595 cycles for 100 * atodw
22401 cycles for 100 * sscanf
800 cycles for 100 * min2cvt
584 cycles for 100 * min1cvt
631 cycles for 100 * ConvertLUT
489 cycles for 100 * C2D_J
1598 cycles for 100 * atodw
22796 cycles for 100 * sscanf
776 cycles for 100 * min2cvt
551 cycles for 100 * min1cvt
640 cycles for 100 * ConvertLUT
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
118 bytes for ConvertLUT
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ConvertLUT
Good job sir raymond and sir TimoVJL;
I created a static aligned LUT table to your procedures in data section.
I passed data pointer by stack, so, inserted a few modifications to deal with this in both codes.
raymond; when I changed bswap in your code by shr and respective changes it performed a bit better in my machine.
I'm leaving original code in testcase, only changed ebx to ecx register usage to follow register preservations without need of push/pop ebx.
TimoVJL; your code suffered performance because value being passed by stack instead of data pointer being passed by stack. I know this can be done inside our code, but just to follow rules. I have tried to insert same overhead in all procedures so result should be honest.
After "and 0f0f0f0fh" I inserted value back to stack. So, I confess it's not the best try because data is in register and go back to stack and read stack data digits each time. If you know better modifications please tell.
I have tried to not touch in yours code, only adapt to windows rules.
sir jj2007; in your testcase some procedures are getting data pointer from register while other procedures are receiving that from stack. This is creating different overheads. In first case it's necessary "mov reg,[reg]" while in other "mov reg,[esp+4] mov reg,[reg]", and at end of procedure first procedure do only "retn" while other "ret 4". Suggestions?
wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
562 cycles for 100 * C2D_J
1482 cycles for 100 * atodw
12649 cycles for 100 * sscanf
752 cycles for 100 * min2cvt
477 cycles for 100 * min1cvt
471 cycles for 100 * ray_LUT
741 cycles for 100 * TimoVJL_PellesC_LUT
638 cycles for 100 * C2D_J
1480 cycles for 100 * atodw
12835 cycles for 100 * sscanf
756 cycles for 100 * min2cvt
433 cycles for 100 * min1cvt
476 cycles for 100 * ray_LUT
744 cycles for 100 * TimoVJL_PellesC_LUT
564 cycles for 100 * C2D_J
1505 cycles for 100 * atodw
12630 cycles for 100 * sscanf
778 cycles for 100 * min2cvt
427 cycles for 100 * min1cvt
470 cycles for 100 * ray_LUT
729 cycles for 100 * TimoVJL_PellesC_LUT
561 cycles for 100 * C2D_J
1478 cycles for 100 * atodw
12558 cycles for 100 * sscanf
751 cycles for 100 * min2cvt
432 cycles for 100 * min1cvt
465 cycles for 100 * ray_LUT
730 cycles for 100 * TimoVJL_PellesC_LUT
54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
84 bytes for min2cvt
52 bytes for min1cvt
62 bytes for ray_LUT
70 bytes for TimoVJL_PellesC_LUT
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ray_LUT
1234 = eax TimoVJL_PellesC_LUT
PS: sir jj2007 told me in pm that my executables are not running from GUI, so, you must open console to run; or if someone can create an executable direct using windows just post that. I'm doing these thing using linux. Appreciate.
These are results from jj2007 testcase:
wine FourCharsToDword.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
479 cycles for 100 * C2D_J
1528 cycles for 100 * atodw
12745 cycles for 100 * sscanf
725 cycles for 100 * min2cvt
493 cycles for 100 * min1cvt
502 cycles for 100 * ConvertLUT
478 cycles for 100 * C2D_J
1610 cycles for 100 * atodw
12623 cycles for 100 * sscanf
745 cycles for 100 * min2cvt
495 cycles for 100 * min1cvt
501 cycles for 100 * ConvertLUT
550 cycles for 100 * C2D_J
1483 cycles for 100 * atodw
12590 cycles for 100 * sscanf
741 cycles for 100 * min2cvt
581 cycles for 100 * min1cvt
500 cycles for 100 * ConvertLUT
50 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
118 bytes for ConvertLUT
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ConvertLUT
Thanks for the tip sir jj2007;
I checked command line when building executable and subsystem:WINDOWS was default; I changed that to CONSOLE.
Link version that I was using is v8.???, changed that to Version 5.12.8078.
So, I suppose now can work by double clicking.
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:
min1cvt proc _data:dword
option prologue:none
option epilogue:none
mov ecx,[esp+4]
mov eax,[ecx]
xor eax,"0000" ;and eax,0f0f0f0fh
movd xmm0,eax ;01020304h
punpcklbw xmm0,xmm2 ;0001000200030004h
pmaddwd xmm0,xmm1 ;mul and add
;---------------------------
;db 66h,0fh,38h,02h,0c0h ; phaddd xmm0,xmm0 ;SSE3
;movd eax,xmm0
movd eax,xmm0
psrlq xmm0,32
movd ecx,xmm0
add eax,ecx
ret 4
min1cvt endp
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
655 cycles for 100 * C2D_J
1605 cycles for 100 * atodw
22178 cycles for 100 * sscanf
785 cycles for 100 * min2cvt
475 cycles for 100 * min1cvt
599 cycles for 100 * ray_LUT
661 cycles for 100 * TimoVJL_PellesC_LUT
651 cycles for 100 * C2D_J
1599 cycles for 100 * atodw
22808 cycles for 100 * sscanf
823 cycles for 100 * min2cvt
508 cycles for 100 * min1cvt
607 cycles for 100 * ray_LUT
717 cycles for 100 * TimoVJL_PellesC_LUT
714 cycles for 100 * C2D_J
1615 cycles for 100 * atodw
22657 cycles for 100 * sscanf
778 cycles for 100 * min2cvt
488 cycles for 100 * min1cvt
587 cycles for 100 * ray_LUT
657 cycles for 100 * TimoVJL_PellesC_LUT
696 cycles for 100 * C2D_J
1613 cycles for 100 * atodw
22193 cycles for 100 * sscanf
778 cycles for 100 * min2cvt
487 cycles for 100 * min1cvt
595 cycles for 100 * ray_LUT
664 cycles for 100 * TimoVJL_PellesC_LUT
54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
84 bytes for min2cvt
52 bytes for min1cvt
62 bytes for ray_LUT
70 bytes for TimoVJL_PellesC_LUT
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ray_LUT
1234 = eax TimoVJL_PellesC_LUT
It's a pity that this is limited to the exotic test case "four digits" :sad:
Quote from: jj2007 on November 12, 2021, 01:03:01 AM
It's a pity that this is limited to the exotic test case "four digits" :sad:
its not,thanks everyone for testing different approaches to fast ascii->integer :thumbsup:
there is still fastest ascii->double and fastest ascii->64bit integer
Quote from: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:
Not really, I have absolute sure that this code can be defeat by others, and will perform different in others cpus. This is what I learn in this board.
I forgot how to activate xmm, do you remember? That code is being translated to mmx.
I'm adding a "66h byte" as preffix in instructions in other tests to work with xmm, but I remember that exist other way to do this. Maybe macros!?
db 66h
movd xmm0,eax ;01020304h
db 66h
punpcklbw xmm0,xmm2 ;0001000200030004h
...
Quote from: mineiro on November 12, 2021, 06:13:25 AM
Quote from: jj2007 on November 12, 2021, 01:03:01 AM
Thanks, mineiro - it works, and now we have a clear winner, at least on my cpu:
Not really, I have absolute sure that this code can be defeat by others, and will perform different in others cpus. This is what I learn in this board.
I forgot how to activate xmm, do you remember? That code is being translated to mmx.
I'm adding a "66h byte" as preffix in instructions in other tests to work with xmm, but I remember that exist other way to do this. Maybe macros!?
db 66h
movd xmm0,eax ;01020304h
db 66h
punpcklbw xmm0,xmm2 ;0001000200030004h
...
change to newer assembler than ml 6.14 solves it without need for macros
Quote from: daydreamer on November 12, 2021, 06:54:11 AM
change to newer assembler than ml 6.14 solves it without need for macros
Thanks sir daydreamer.
That's great Guys.
I am happy to see many options are coming out and the speed of the code is quite nice. :thumbsup:
Compliments to everyone partecipating in the game. :eusa_clap:
Here is another approach, a finite state machine, AKA look up tree. 4 character strings fed in, integers fed out. Its big but its reasonably fast. Timings below are with my old i7.
Output
2
9999
5432
8619
9876
2345
1234
7776
8193
4352
test return values - done
timing 10 million matches next, press any key ....
ms = 109 duration
Press any key to continue ...
Finite state machine, AKA look up tree vs my algo. I hope I didn't make any mistakes in translation. The attachment does not contain the num.asm file - see Hutch' post.
:thumbsup:
That is genuinely fast and it looks like its producing the right output numbers.
I managed to integrate Hutch' finite state machine into the testbed. To assemble it, the num.asm file must be in the same folder. It's still plain Masm32 SDK, MasmBasic is not required. The min1cvt algo is a clear winner :thumbsup:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
665 cycles for 100 * C2D_J
1600 cycles for 100 * atodw
21983 cycles for 100 * sscanf
766 cycles for 100 * min2cvt
545 cycles for 100 * min1cvt
683 cycles for 100 * ConvertLUT
2292 cycles for 100 * FSM (Hutch)
690 cycles for 100 * C2D_J
1605 cycles for 100 * atodw
21970 cycles for 100 * sscanf
769 cycles for 100 * min2cvt
544 cycles for 100 * min1cvt
653 cycles for 100 * ConvertLUT
2326 cycles for 100 * FSM (Hutch)
660 cycles for 100 * C2D_J
1612 cycles for 100 * atodw
22241 cycles for 100 * sscanf
770 cycles for 100 * min2cvt
547 cycles for 100 * min1cvt
626 cycles for 100 * ConvertLUT
2263 cycles for 100 * FSM (Hutch)
54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
The attached zip file contains the source for the FSM, it is every number between 0000 and 9999 and will test all of them if you make the function call.
It occurs in this form.
0000 0000
0001 0001
0002 0002
0003 0003
0004 0004
............
9995 9995
9996 9996
9997 9997
9998 9998
9999 9999
Pass the first string to the integers procedure and it will return the actual integer in EAX.
AMD Athlon(tm) II X2 220 Processor (SSE3)
1186 cycles for 100 * C2D_J
2428 cycles for 100 * atodw
35137 cycles for 100 * sscanf
1304 cycles for 100 * min2cvt
1059 cycles for 100 * min1cvt
712 cycles for 100 * ConvertLUT
2780 cycles for 100 * FSM (Hutch)
1189 cycles for 100 * C2D_J
2426 cycles for 100 * atodw
34990 cycles for 100 * sscanf
1305 cycles for 100 * min2cvt
1065 cycles for 100 * min1cvt
721 cycles for 100 * ConvertLUT
2738 cycles for 100 * FSM (Hutch)
1185 cycles for 100 * C2D_J
2427 cycles for 100 * atodw
34735 cycles for 100 * sscanf
1311 cycles for 100 * min2cvt
1058 cycles for 100 * min1cvt
710 cycles for 100 * ConvertLUT
3155 cycles for 100 * FSM (Hutch)
54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
--- ok ---
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
532 cycles for 100 * C2D_J
1401 cycles for 100 * atodw
25676 cycles for 100 * sscanf
523 cycles for 100 * min2cvt
435 cycles for 100 * min1cvt
615 cycles for 100 * ConvertLUT
2124 cycles for 100 * FSM (Hutch)
524 cycles for 100 * C2D_J
1371 cycles for 100 * atodw
26048 cycles for 100 * sscanf
524 cycles for 100 * min2cvt
431 cycles for 100 * min1cvt
618 cycles for 100 * ConvertLUT
2129 cycles for 100 * FSM (Hutch)
512 cycles for 100 * C2D_J
1383 cycles for 100 * atodw
25954 cycles for 100 * sscanf
517 cycles for 100 * min2cvt
434 cycles for 100 * min1cvt
615 cycles for 100 * ConvertLUT
2127 cycles for 100 * FSM (Hutch)
54 bytes for C2D_J
10 bytes for atodw
22 bytes for sscanf
98 bytes for min2cvt
62 bytes for min1cvt
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax sscanf
1234 = eax min2cvt
1234 = eax min1cvt
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
-
Example how msvc C optimizer works:
int str4int( char *s)
{
char anum[5];
(*(int*)&anum) = (*(int*)s) & 0x0f0f0f0f;
return a1000[anum[0]] + a100[anum[1]] + a10[anum[2]] + anum[3];
}
_str4int:
[00000000] 8B442404 mov eax,dword ptr [esp+4]
[00000004] 53 push ebx
[00000005] 8B18 mov ebx,dword ptr [eax]
[00000007] 81E30F0F0F0F and ebx,F0F0F0Fh
[0000000D] 8BC3 mov eax,ebx
[0000000F] C1E810 shr eax,10h
[00000012] 0FB6D0 movzx edx,al
[00000015] 8BC3 mov eax,ebx
[00000017] C1E808 shr eax,8
[0000001A] 0FB6C8 movzx ecx,al
[0000001D] 8B049500000000 mov eax,dword ptr [edx*4+_a10]
[00000024] 03048D00000000 add eax,dword ptr [ecx*4+_a100]
[0000002B] 0FB6CB movzx ecx,bl
[0000002E] C1EB18 shr ebx,18h
[00000031] 03048D00000000 add eax,dword ptr [ecx*4+_a1000]
[00000038] 03C3 add eax,ebx
[0000003A] 5B pop ebx
[0000003B] C3 ret
New version, my algo and mineiro's are somewhat faster now:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
554 cycles for 100 * C2D_J
1594 cycles for 100 * atodw
776 cycles for 100 * min2cvt
476 cycles for 100 * min1cvt (SSE3)
625 cycles for 100 * ConvertLUT
2273 cycles for 100 * FSM (Hutch)
545 cycles for 100 * C2D_J
1575 cycles for 100 * atodw
768 cycles for 100 * min2cvt
478 cycles for 100 * min1cvt (SSE3)
625 cycles for 100 * ConvertLUT
2252 cycles for 100 * FSM (Hutch)
548 cycles for 100 * C2D_J
1594 cycles for 100 * atodw
765 cycles for 100 * min2cvt
478 cycles for 100 * min1cvt (SSE3)
637 cycles for 100 * ConvertLUT
2262 cycles for 100 * FSM (Hutch)
54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
@Timo: I've tried the MSVC optimised version, but it's 16% slower and gives a wrong result :sad:
See below C2D_J: (or search for msvc inside the file)
In my machine, "shr reg,16" performs better than "bswap", and respective changes after this modification. The final gain in cycles is about 010~013 cycles.
ConvertLUT can be optimized a bit more, by removing "push/pop ebx" and using ecx register instead.
These are results of last benchmark test:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
513 cycles for 100 * C2D_J
1475 cycles for 100 * atodw
687 cycles for 100 * min2cvt
463 cycles for 100 * min1cvt (SSE3)
640 cycles for 100 * ConvertLUT
1383 cycles for 100 * FSM (Hutch)
467 cycles for 100 * C2D_J
1477 cycles for 100 * atodw
696 cycles for 100 * min2cvt
456 cycles for 100 * min1cvt (SSE3)
639 cycles for 100 * ConvertLUT
1383 cycles for 100 * FSM (Hutch)
456 cycles for 100 * C2D_J
1474 cycles for 100 * atodw
702 cycles for 100 * min2cvt
503 cycles for 100 * min1cvt (SSE3)
494 cycles for 100 * ConvertLUT
1431 cycles for 100 * FSM (Hutch)
54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
453 cycles for 100 * C2D_J
1476 cycles for 100 * atodw
745 cycles for 100 * min2cvt
483 cycles for 100 * min1cvt (SSE3)
477 cycles for 100 * ConvertLUT
1397 cycles for 100 * FSM (Hutch)
457 cycles for 100 * C2D_J
1458 cycles for 100 * atodw
695 cycles for 100 * min2cvt
461 cycles for 100 * min1cvt (SSE3)
470 cycles for 100 * ConvertLUT
1366 cycles for 100 * FSM (Hutch)
450 cycles for 100 * C2D_J
1456 cycles for 100 * atodw
692 cycles for 100 * min2cvt
471 cycles for 100 * min1cvt (SSE3)
473 cycles for 100 * ConvertLUT
1374 cycles for 100 * FSM (Hutch)
54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
-
@minerio about shr reg,16 vs bswap
with your own SSE version you should try if byte shuffle is faster?
my cpu is haswell based,so it supports avx2,anyone more than hutch that can test avx2 code?
A test file with msvc 2019, it output 1234 as expexted.
Just for to see, how C works with LUT.
Quote from: TimoVJL on November 14, 2021, 08:15:40 AM
A test file with msvc 2019, it output 1234 as expexted.
Just for to see, how C works with LUT.
For me it didn't work, the result is 1114 instead of 1234. Please search for msvc inside FourCharsToDword.asm, it should be line 74; change line 56 to
if 0 to test the msvc code. Maybe I made an error when copying & pasting your code :sad:
To appeal to your sense of humour, I split the FSM into 4 procedures, much smaller file, got it to produce the right numbers but it was about 30% slower than the single FSM procedure. :undecided:
Ok, here is version 5, with Timo's MSVC "optimised" algo. On my machine, mineiro's SSE3 algo is a clear winner:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
545 cycles for 100 * C2D_J
1587 cycles for 100 * atodw
769 cycles for 100 * min2cvt
503 cycles for 100 * min1cvt (SSE3)
626 cycles for 100 * ConvertLUT
2267 cycles for 100 * FSM (Hutch)
698 cycles for 100 * MSVC (Timo)
550 cycles for 100 * C2D_J
1593 cycles for 100 * atodw
768 cycles for 100 * min2cvt
477 cycles for 100 * min1cvt (SSE3)
624 cycles for 100 * ConvertLUT
2244 cycles for 100 * FSM (Hutch)
698 cycles for 100 * MSVC (Timo)
547 cycles for 100 * C2D_J
1589 cycles for 100 * atodw
768 cycles for 100 * min2cvt
483 cycles for 100 * min1cvt (SSE3)
624 cycles for 100 * ConvertLUT
2246 cycles for 100 * FSM (Hutch)
698 cycles for 100 * MSVC (Timo)
54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
70 bytes for MSVC (Timo)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
1234 = eax MSVC (Timo)
Minieros also my favourite
, but for ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"
Quote from: daydreamer on November 15, 2021, 12:23:28 AMfor ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"
Really? How exactly would you do that?
ConvertLUT:
mov edx, [eax] ; pointer to string, e.g. "1234"
and edx, 0f0f0f0fh ; convert "1" to 1
movzx ecx, dl ;1000s
mov eax, t1000[ecx*4]
movzx ecx,dh ;100s
add eax, t100[ecx*4]
bswap edx
movzx ecx,dh ;10s
add eax, t10[ecx*4]
movzx ecx, dl ;units
add eax, ecx
retn
Sadly test program don't work with AMD Athlon(tm) II X2 220 Processor (SSE3)
AMD Athlon(tm) II X2 220 Processor (SSE3)
909 cycles for 100 * C2D_J
2480 cycles for 100 * atodw
1301 cycles for 100 * min2cvt
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
556 cycles for 100 * C2D_J
1699 cycles for 100 * atodw
847 cycles for 100 * min2cvt
553 cycles for 100 * min1cvt (SSE3)
588 cycles for 100 * ConvertLUT
1597 cycles for 100 * FSM (Hutch)
552 cycles for 100 * C2D_J
1701 cycles for 100 * atodw
847 cycles for 100 * min2cvt
557 cycles for 100 * min1cvt (SSE3)
586 cycles for 100 * ConvertLUT
1596 cycles for 100 * FSM (Hutch)
554 cycles for 100 * C2D_J
1701 cycles for 100 * atodw
846 cycles for 100 * min2cvt
565 cycles for 100 * min1cvt (SSE3)
588 cycles for 100 * ConvertLUT
1587 cycles for 100 * FSM (Hutch)
54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
--- ok ---
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
348 cycles for 100 * C2D_J
1080 cycles for 100 * atodw
501 cycles for 100 * min2cvt
323 cycles for 100 * min1cvt (SSE3)
359 cycles for 100 * ConvertLUT
1019 cycles for 100 * FSM (Hutch)
338 cycles for 100 * C2D_J
1051 cycles for 100 * atodw
509 cycles for 100 * min2cvt
366 cycles for 100 * min1cvt (SSE3)
385 cycles for 100 * ConvertLUT
1015 cycles for 100 * FSM (Hutch)
330 cycles for 100 * C2D_J
1119 cycles for 100 * atodw
516 cycles for 100 * min2cvt
326 cycles for 100 * min1cvt (SSE3)
376 cycles for 100 * ConvertLUT
1041 cycles for 100 * FSM (Hutch)
54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
--- ok ---
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
536 cycles for 100 * C2D_J
1365 cycles for 100 * atodw
517 cycles for 100 * min2cvt
353 cycles for 100 * min1cvt (SSE3)
615 cycles for 100 * ConvertLUT
2123 cycles for 100 * FSM (Hutch)
618 cycles for 100 * MSVC (Timo)
546 cycles for 100 * C2D_J
1363 cycles for 100 * atodw
516 cycles for 100 * min2cvt
350 cycles for 100 * min1cvt (SSE3)
615 cycles for 100 * ConvertLUT
2152 cycles for 100 * FSM (Hutch)
620 cycles for 100 * MSVC (Timo)
536 cycles for 100 * C2D_J
1371 cycles for 100 * atodw
568 cycles for 100 * min2cvt
350 cycles for 100 * min1cvt (SSE3)
610 cycles for 100 * ConvertLUT
2142 cycles for 100 * FSM (Hutch)
624 cycles for 100 * MSVC (Timo)
54 bytes for C2D_J
10 bytes for atodw
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
118 bytes for ConvertLUT
8 bytes for FSM (Hutch)
70 bytes for MSVC (Timo)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
1234 = eax MSVC (Timo)
Quote from: jj2007 on November 15, 2021, 12:46:38 AM
Quote from: daydreamer on November 15, 2021, 12:23:28 AMfor ascii to double conversion, I think Raymond approach would be more versatile take care of "1234.5678", but also "123.45678" and "12345.678"
Really? How exactly would you do that?
ConvertLUT:
mov edx, [eax] ; pointer to string, e.g. "1234"
and edx, 0f0f0f0fh ; convert "1" to 1
movzx ecx, dl ;1000s
mov eax, t1000[ecx*4]
movzx ecx,dh ;100s
add eax, t100[ecx*4]
bswap edx
movzx ecx,dh ;10s
add eax, t10[ecx*4]
movzx ecx, dl ;units
add eax, ecx
retn
.data
;integer part
t1000 real4 0.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0
t100 real4 0.0,100.0,200.0,300.0,400.0,500.0,600.0,700.0,800.0,900.0
t10 real4 0.0,10.0,20.0,30.0,40.0,50.0,60.0,70.0,80.0,90.0
t1 real4 0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
;decimals
t0dot1 real4 0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
t0dot01 real4 0.0,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09
t0dot001 real4 0.0,0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009
t0dot0001 real4 0.0,0.0001,0.0002,0.0003,0.0004,0.0005,0.0006,0.0007,0.0008,0.0009
Ok, these are the data, and now the code please, daydreamer :thumbsup:
deleted
Quote from: nidud on November 16, 2021, 03:18:06 AM
IMUL is actually a rather fast instruction, even when used with memory operands, so there may not be that much gain coding around it (LEA/LUT). BSWAP however is a relatively slow instruction so that should be avoided if possible.
The qeditor Help section has a listing for opcodes, along with the expected clock cycles for each. The BSWAP one has the following:
QuoteUsage: BSWAP reg32
Modifies flags: none
Changes the byte order of a 32 bit register from big endian to
little endian or vice versa. Result left in destination register
is undefined if the operand is a 16 bit register.
Clocks Size
Operands 808x 286 386 486 Bytes
reg32 - - - 1 2
0F C8+ rd BSWAP r32 Reverses the byte order of a 32-bit register.
If you consider BSWAP as a "
relatively slow instruction", maybe Hutch should review all the tables of that Help section.
Then, in addition, you mention that "
IMUL is actually a rather fast instruction", BUT the imul instruction help section in qeditor shows the following:
Quote
Clocks Size
Operands 808x 286 386 486 Bytes
reg8 80-98 13 9-14 13-18 2
reg16 128-154 21 9-22 13-26 2
reg32 - - 9-38 12-42 2
mem8 86-104 16 12-17 13-18 2-4
mem16 134-160 24 12-25 13-26 2-4
mem32 - - 12-41 13-42 2-4
reg16,reg16 - - 9-22 13-26 3-5
reg32,reg32 - - 9-38 13-42 3-5
reg16,mem16 - - 12-25 13-26 3-5
reg32,mem32 - - 12-41 13-42 3-5
reg16,immed - 21 9-22 13-26 3
reg32,immed - 21 9-38 13-42 3-6
reg16,reg16,immed - 2 9-22 13-26 3-6
reg32,reg32,immed - 21 9-38 13-42 3-6
reg16,mem16,immed - 24 12-25 13-26 3-6
reg32,mem32,immed - 24 12-41 13-42 3-6
Any explanation for this????
Could you provide us with YOUR sources for clock cycles?
Quote from: nidud on November 16, 2021, 03:18:06 AM
IMUL is actually a rather fast instruction, even when used with memory operands, so there may not be that much gain coding around it (LEA/LUT). BSWAP however is a relatively slow instruction so that should be avoided if possible.
...
86804 cycles 3.asm: pmaddwd
89518 cycles 1.asm: imul
94277 cycles 4.asm: pmaddwd+bswap
108254 cycles 2.asm: imul+bswap[/tt]
Thanks, Nidud. Here are my results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
546 cycles for 100 * C2D_J
793 cycles for 100 * imul (Nidud)
762 cycles for 100 * min2cvt
479 cycles for 100 * min1cvt (SSE3)
625 cycles for 100 * ConvertLUT
2243 cycles for 100 * FSM (Hutch)
709 cycles for 100 * MSVC (Timo)
554 cycles for 100 * C2D_J
793 cycles for 100 * imul (Nidud)
775 cycles for 100 * min2cvt
476 cycles for 100 * min1cvt (SSE3)
627 cycles for 100 * ConvertLUT
2245 cycles for 100 * FSM (Hutch)
696 cycles for 100 * MSVC (Timo)
546 cycles for 100 * C2D_J
792 cycles for 100 * imul (Nidud)
766 cycles for 100 * min2cvt
475 cycles for 100 * min1cvt (SSE3)
643 cycles for 100 * ConvertLUT
2252 cycles for 100 * FSM (Hutch)
703 cycles for 100 * MSVC (Timo)
54 bytes for C2D_J
58 bytes for imul (Nidud)
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
122 bytes for ConvertLUT
8 bytes for FSM (Hutch)
70 bytes for MSVC (Timo)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax imul (Nidud)
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
1234 = eax MSVC (Timo)
Quote from: raymond on November 16, 2021, 04:21:42 AM
Could you provide us with YOUR sources for clock cycles?
I don't know what Nidud's sources are, but here is mine - Agner Fog: one cycle for bswap (https://www.agner.org/optimize/instruction_tables.pdf)
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
537 cycles for 100 * C2D_J
723 cycles for 100 * imul (Nidud)
571 cycles for 100 * min2cvt
339 cycles for 100 * min1cvt (SSE3)
521 cycles for 100 * ConvertLUT
2133 cycles for 100 * FSM (Hutch)
630 cycles for 100 * MSVC (Timo)
542 cycles for 100 * C2D_J
632 cycles for 100 * imul (Nidud)
510 cycles for 100 * min2cvt
340 cycles for 100 * min1cvt (SSE3)
510 cycles for 100 * ConvertLUT
2134 cycles for 100 * FSM (Hutch)
621 cycles for 100 * MSVC (Timo)
537 cycles for 100 * C2D_J
639 cycles for 100 * imul (Nidud)
511 cycles for 100 * min2cvt
341 cycles for 100 * min1cvt (SSE3)
510 cycles for 100 * ConvertLUT
2131 cycles for 100 * FSM (Hutch)
619 cycles for 100 * MSVC (Timo)
54 bytes for C2D_J
58 bytes for imul (Nidud)
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
122 bytes for ConvertLUT
8 bytes for FSM (Hutch)
70 bytes for MSVC (Timo)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax imul (Nidud)
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
1234 = eax MSVC (Timo)
deleted
deleted
wonder how fast avx2 will be?
Quote from: daydreamer on November 16, 2021, 07:45:15 AM
wonder how fast avx2 will be?
We are still waiting for your code in reply #61, daydreamer.
deleted
deleted
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
343 cycles for 100 * C2D_J
660 cycles for 100 * imul (Nidud)
449 cycles for 100 * min2cvt
346 cycles for 100 * min1cvt (SSE3)
424 cycles for 100 * ConvertLUT
984 cycles for 100 * FSM (Hutch)
350 cycles for 100 * MSVC (Timo)
343 cycles for 100 * C2D_J
659 cycles for 100 * imul (Nidud)
449 cycles for 100 * min2cvt
347 cycles for 100 * min1cvt (SSE3)
424 cycles for 100 * ConvertLUT
970 cycles for 100 * FSM (Hutch)
546 cycles for 100 * MSVC (Timo)
341 cycles for 100 * C2D_J
662 cycles for 100 * imul (Nidud)
449 cycles for 100 * min2cvt
354 cycles for 100 * min1cvt (SSE3)
412 cycles for 100 * ConvertLUT
969 cycles for 100 * FSM (Hutch)
347 cycles for 100 * MSVC (Timo)
54 bytes for C2D_J
58 bytes for imul (Nidud)
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
122 bytes for ConvertLUT
8 bytes for FSM (Hutch)
70 bytes for MSVC (Timo)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax imul (Nidud)
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
1234 = eax MSVC (Timo)
-
AMD Ryzen 7 3700X 8-Core Processor (SSE4)
362 cycles for 100 * C2D_J
496 cycles for 100 * imul (Nidud)
365 cycles for 100 * min2cvt
282 cycles for 100 * min1cvt (SSE3)
438 cycles for 100 * ConvertLUT
1093 cycles for 100 * FSM (Hutch)
304 cycles for 100 * MSVC (Timo)
362 cycles for 100 * C2D_J
494 cycles for 100 * imul (Nidud)
365 cycles for 100 * min2cvt
277 cycles for 100 * min1cvt (SSE3)
439 cycles for 100 * ConvertLUT
1092 cycles for 100 * FSM (Hutch)
292 cycles for 100 * MSVC (Timo)
374 cycles for 100 * C2D_J
495 cycles for 100 * imul (Nidud)
364 cycles for 100 * min2cvt
279 cycles for 100 * min1cvt (SSE3)
437 cycles for 100 * ConvertLUT
1100 cycles for 100 * FSM (Hutch)
292 cycles for 100 * MSVC (Timo)
54 bytes for C2D_J
58 bytes for imul (Nidud)
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
122 bytes for ConvertLUT
8 bytes for FSM (Hutch)
70 bytes for MSVC (Timo)
1234 = eax C2D_J
1234 = eax atodw
1234 = eax imul (Nidud)
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
1234 = eax MSVC (Timo)
--- ok ---
I opened a new thread called "Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,.. (http://masm32.com/board/index.php?topic=9638.0)" in the Lab :cool:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
23 cycles for 100 * imul 10
54 cycles for 100 * lea: *10
4871 cycles for 100 * lodsd (25 DWORDs)
4863 cycles for 100 * mov eax, [esi] + add esi, 4
58 cycles for 100 * lea10, add eax
57 cycles for 100 * lea10, shl eax, 1
27 cycles for 100 * bswap
62 cycles for 100 * ror 16