I've been working on a fast conversion function for strings (containing decimal digits) to integer form, for an upcoming book on 32-bit ARM assembly language. The code I wrote for the traditional conversion (read a char, multiply accumulator by 10, add in char) ran in about 56 seconds on a Pi 400 (100,000,000 iterations with about a dozen sample input values). I then used the character set lookup table algorithm (used in the hexadecimal string-to-integer algorithm I've posted elsewhere on this board). That ran in about 40 seconds.
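For reference, the traditional scheme reads like this in portable C (a minimal sketch of the classic loop, not the ARM routine from the book):

```c
#include <stdint.h>

/* Classic one-character-at-a-time decimal conversion: for each digit,
   multiply the accumulator by 10 and add in the digit's value.
   Stops at the first non-digit character. */
uint64_t dec_to_int(const char *s)
{
    uint64_t acc = 0;
    while (*s >= '0' && *s <= '9') {
        acc = acc * 10 + (uint64_t)(*s - '0');
        s++;
    }
    return acc;
}
```
The multiply-by-10 plus add is the serial dependency chain that all the faster variants below are trying to shorten.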
One thing that always nagged at me was whether we could process multiple characters at a time (similar to how the fastest numeric-to-string algorithms work). I broke down and wrote some ARM assembly to do this (including handling special cases like preventing a two-byte fetch from crossing an MMU page boundary, which could lead to a segmentation fault). I created a huge state machine to process character pairs of the form:
1. <digit><digit>
2. <digit><underscore>
3. <underscore><digit>
4. <digit><delimiter>
5. <delimiter><anything>
6. <digit><illegal character>
7. <illegal character><anything>
I used the character pair as an index into a 65,536-entry table (four bytes per entry), which held the address of the code to handle that pair (about 120 distinct cases). Obviously, the function was quite large (the lookup table alone was 256KB).
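Stripped of the 256KB dispatch table and the special-case handlers, the arithmetic core of consuming two characters per step looks like this (a simplified C sketch of the idea, not the actual state machine):

```c
#include <stdint.h>
#include <ctype.h>

/* Simplified two-characters-at-a-time conversion. The real function
   dispatches each character pair through a 65,536-entry handler table
   (digit/digit, digit/underscore, digit/delimiter, ...); only the
   common <digit><digit> fast path and the odd-length tail are shown. */
uint64_t dec_to_int_pairs(const char *s)
{
    uint64_t acc = 0;
    while (isdigit((unsigned char)s[0]) && isdigit((unsigned char)s[1])) {
        /* consume two digits per iteration: acc = acc*100 + d0*10 + d1 */
        acc = acc * 100 + (uint64_t)(s[0] - '0') * 10
                        + (uint64_t)(s[1] - '0');
        s += 2;
    }
    if (isdigit((unsigned char)s[0]))   /* odd number of digits */
        acc = acc * 10 + (uint64_t)(s[0] - '0');
    return acc;
}
```
Note this halves the number of multiplies on the critical path but doubles the per-iteration classification work, which is exactly what the lookup table was trying to amortize.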
End result: my sample data (100,000,000 iterations) ran in 38 seconds. It would be hard to justify a function of this size for a two-second improvement (about 5%), especially as it lays waste to the cache and may hurt the performance of the rest of any application that uses it. As such, I consider this approach a failure.
I'm not going to bother converting the code to x86. You can find the ARM code here (https://randallhyde.com/Forum/viewtopic.php?t=707) if you are interested in taking a look. There is probably a better way to process two characters at a time, but I doubt I'd come up with a scheme that is almost twice as fast as the single-character algorithm, so I don't think I'll be trying to advance this idea any further.
If someone has a (very) fast algorithm for converting decimal strings to integer values, I'd love to see it.
Cheers,
Randy Hyde
Sorry, but I have to ask: what's the point of this? I mean, who really cares how much time it takes to convert character representations of numbers to binary? Is there actually some real-world application where this would be necessary, or even helpful? Some huuuuge database? I can't think of a single case.
Quote from: hyder on March 01, 2024, 08:17:46 AM: You can find the ARM code here (https://randallhyde.com/Forum/viewtopic.php?t=707), if you are interested in taking a look.
Quote: I came across many comments along the lines of "premature optimization is a waste of time" and "why do you care how fast this function is?" If the original poster was asking how to do this for a single application, some argument could be made about not rushing to optimize the function until you've profiled your code and determined where the real performance bottlenecks lie. However, this argument does not apply to library code (such as a string-to-integer conversion). Library code is going to be used in a wide variety of applications; some where the performance doesn't matter, some where it matters a lot. Anyone who produces a lot of XML or JSON output can testify that converting integers to human-readable strings, and vice versa, can be a real bottleneck. Library code should always be optimized, independent of any individual application that uses it, because the library's author(s) will never know how important that code might be.
Wise words :thumbsup:
I'd like to hear the OP's take on this.
This is the best I can do for now. Extracts are painful; I think splitting the higher part would avoid one of them.
String_To_Int proc
;abcd efgh ijkl mnop qrst1 -> ab ef ij mn qr -> a e i m q
vmovdqu ymm0,ymmword ptr [rcx]
vpbroadcastb ymm4, byte ptr [StrASCI]
vpsubb ymm0,ymm0,ymm4
vmovdqu ymm1,ymmword ptr [MulStr101B]
vmovdqu ymm2,ymmword ptr [MulStr101W]
vpmaddubsw ymm0,ymm0,ymm1 ;a,b,c,d,e,f,g,h,i,j
vpmaddwd ymm0,ymm0,ymm2 ;a,b,c,d,e
vmovdqu ymm3,ymmword ptr [StrConvPermD] ;aa,bb,cc,dd
vpermd ymm2,ymm3,ymm0
vmovdqu ymm5,ymmword ptr [MulStr101D]
vpmuludq ymm2,ymm2,ymm5 ;abcd0000,efgh,ijkl0000,mnop
vpshufd ymm3,ymm2,00000010b ;0a,0b,0c,0d + 0b,xx,0d,xx
vpaddd ymm2,ymm2,ymm3 ;ab,0,x,x,cd,0,x,x
vextracti128 xmm1,ymm2,1
vmovd eax,xmm2 ;ab
vmovd edx,xmm1 ;cd
imul rax,100000000
add rax,rdx ;abcd
vextracti128 xmm0,ymm0,1
vmovd r8d,xmm0 ;e
imul rax,10000
add rax,r8
ret
StrASCI BYTE 48
MulStr101B BYTE 10 DUP(10,1),12 DUP(0)
MulStr101W WORD 5 DUP(100,1),6 DUP(0)
MulStr101D DWORD 2 DUP(10000,0,1,0)
StrConvPermD DWORD 0,0,1,1,2,2,3,3
String_To_Int endp
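For readers who don't speak AVX2: the vpmaddubsw/vpmaddwd steps above combine adjacent digits, then adjacent pairs, then adjacent quads. The same reduction can be sketched in portable scalar C using the well-known 8-digit SWAR trick (an illustration of the technique, not this routine):

```c
#include <stdint.h>
#include <string.h>

/* SWAR analogue of the SIMD reduction: load 8 ASCII digits into one
   64-bit register, strip the '0' bias, then merge adjacent digits,
   pairs, and quads with three multiply-and-shift steps.
   Assumes a little-endian host and exactly 8 valid digits. */
uint32_t parse_8_digits_swar(const char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    v -= 0x3030303030303030ULL;            /* ASCII -> 0..9          */
    v = (v * ((10ULL << 8) + 1)) >> 8;     /* 2-digit groups         */
    v &= 0x00FF00FF00FF00FFULL;
    v = (v * ((100ULL << 16) + 1)) >> 16;  /* 4-digit groups         */
    v &= 0x0000FFFF0000FFFFULL;
    v = (v * ((10000ULL << 32) + 1)) >> 32; /* full 8-digit value    */
    return (uint32_t)v;
}
```
Each multiply constant has the form (scale << width) + 1, so one multiplication computes "high neighbor * scale + low neighbor" for every group in parallel, which is the same log-step reduction the vpmadd instructions perform on the YMM lanes.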
Quote: read a char, multiply accumulator by 10, add in char
Would shifts (or lea, maybe) be any faster?
Something like this?
; To multiply EAX by 10:
MOV EDX, EAX
SHL EDX, 3
SHL EAX, 1
ADD EAX, EDX
Would that be faster than a MUL?
Quote from: InfiniteLoop on March 01, 2024, 01:44:27 PM: String_To_Int proc
It would be nice to see a working example. I can build it but the results are not correct.
Quote from: sinsi on March 01, 2024, 03:05:07 PM: Quote: read a char, multiply accumulator by 10, add in char
Would shifts (or lea, maybe) be any faster?
Most of the time, imul is faster (https://masm32.com/board/index.php?msg=105827) (and 2 bytes shorter):
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Averages:
0 cycles for *10 with imul
29 cycles for *10 with lea, shl 1
30 cycles for *10 with lea, add eax, eax
Zero cycles means there is no difference to an empty loop.
The three variants:
mov eax, 12345678
imul eax, eax, 10
mov eax, 12345678
lea eax, [4*eax+eax]
shl eax, 1
mov eax, 12345678
lea eax, [4*eax+eax]
add eax, eax
However, the picture changes when using a REPEAT 9:
Averages:
744 cycles for *10 with imul
485 cycles for *10 with lea, shl 1
500 cycles for *10 with lea, add eax, eax
TestA proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
mov eax, 1
REPEAT 9
imul eax, eax, 10
ENDM
dec ebx
.Until Sign?
ret
TestA endp
What does that mean for real life applications? I don't know :cool:
Might modern real-world code take advantage of standard 16-bit Unicode strings, saving the SSE unpack instructions before the SIMD multiplications?
Go ahead, daydreamer, it's time for your SIMD version (https://masm32.com/board/index.php?msg=127589) :thumbsup:
Quote from: NoCforMe on March 01, 2024, 08:35:07 AM: Sorry, but I have to ask: what's the point of this? I mean, who really cares how much time it takes to convert character representations of numbers to binary? Is there actually some real-world application where this would be necessary? or even helpful? Some huuuuge database? I can't think of a single case.
For normal programs, you may very well be correct. However, for library code you really want to optimize the heck out of the functions because you have no control over how that code will be used. If someone uses a really fast library function in a program that doesn't need the speed, little is lost. However, if you use a slow library routine in a program that *does* require speed, you'll wish the author put a little more effort into that code.
Someone suggested elsewhere on this board that numeric-to-string and string-to-numeric conversions don't need to be fast because most of the time they're used to process input from a user (keyboard). While it may be true that those particular applications don't require high performance, consider the person parsing large XML or JSON files. Such applications may very well have a bottleneck associated with string-to-numeric (or vice-versa) conversions. Screwing over the author of such programs because you don't believe high performance is important for such functions is bad library design methodology.
Cheers,
Randy Hyde