An (failed) idea for decimal string-to-numeric

hyder · March 01, 2024, 08:17:46 AM

I've been working on a fast conversion function for strings (containing decimal digits) to integer form, for an upcoming book on 32-bit ARM assembly language. The code I wrote for the traditional conversion (read a char, multiply accumulator by 10, add in char) ran in about 56 seconds on a Pi 400 (100,000,000 iterations with about a dozen sample input values). I then used the character set lookup table algorithm (used in the hexadecimal string-to-integer algorithm I've posted elsewhere on this board). That ran in about 40 seconds.

One thing that always bothered me is why we can process multiple characters at a time (similar to how the fastest numeric-to-string algorithms do this). I broke down and wrote some ARM assembly to do this (including handling special cases like preventing fetching two bytes across an MMU page boundary, which could lead to a segmentation fault). I created a huge state machine to process character pairs of the form:

1. <digit><digit>
2. <digit><underscore>
3. <underscore><digit>
4. <digit><delimiter>
5. <delimiter><anything>
6. <digit><illegal character>
7. <illegal character><anything>

I used the character pair as an index into a 65,536-entry table (each four bytes), which held the address of the code to handle that character pair (about 120 cases, or so). Obviously, the function was quite large (the lookup table alone was 256KB).

End result: my sample data (100,000,000 iterations) ran in 38 seconds. It would be hard to justify a function of this size for a two-second improvement (about 5%). Especially as it lays waste to the cache and may affect the performance of the rest of an application that uses it. As such, I consider this approach a failure.

I'm not going to bother converting the code to x86. You can find the (ARM) code here (https://randallhyde.com/Forum/viewtopic.php?t=707), if you are interested in taking a look. There is probably a better way to process two characters at a time, but I doubt I'd come up with a scheme that is almost twice as fast as the single-character algorithm, so I don't think I'll be trying to advance this idea any further.

If someone has a (very) fast algorithm for converting decimal strings to integer values, I'd love to see it.
Cheers,
Randy Hyde

NoCforMe · March 01, 2024, 08:35:07 AM

Sorry, but I have to ask: what's the point of this? I mean, who really cares how much time it takes to convert character representations of numbers to binary? Is there actually some real-world application where this would be necessary? or even helpful? Some huuuuge database? I can't think of a single case.

jj2007 · March 01, 2024, 10:42:38 AM

Quote from: hyder on March 01, 2024, 08:17:46 AMYou can find the (ARM) code here (https://randallhyde.com/Forum/viewtopic.php?t=707), if you are interested in taking a look.

QuoteI came across many comments along the lines of "premature optimization is a waste of time" and "why do you care how fast this function is?" If the original poster was asking how to do this for a single application, some argument could be made about not rushing to optimize the function until you've profiled your code and determined where the real performance bottlenecks lie. However, this argument does not apply to library code (such as a string-to-integer conversion). Library code is going to be used it a wide variety of applications; some where the performance doesn't matter, some where it matters a lot. Anyone who produces a lot of XML or JSON output can testify that converting integers to human-readable strings, and vice-versa, can be real bottlenecks. Library code should always be optimized, independent of some individual application that uses it, because the library's author(s) will never know how important that code might be.

Wise words

NoCforMe · March 01, 2024, 11:21:50 AM

I'd like to hear the OP's take on this.

InfiniteLoop · March 01, 2024, 01:44:27 PM

This is the best I can do for now. Extracts are painful. I think splitting the higher part would avoid 1.

Code Select

String_To_Int proc
;abcd efgh ijkl mnop qrst1 -> ab ef ij mn qr -> a e i m q
vmovdqu ymm0,ymmword ptr [rcx]
vpbroadcastb ymm4, byte ptr [StrASCI]
vpsubb ymm0,ymm0,ymm4
vmovdqu ymm1,ymmword ptr [MulStr101B]
vmovdqu ymm2,ymmword ptr [MulStr101W]
vpmaddubsw ymm0,ymm0,ymm1        ;a,b,c,d,e,f,g,h,i,j
vpmaddwd ymm0,ymm0,ymm2            ;a,b,c,d,e
vmovdqu ymm3,ymmword ptr [StrConvPermD]    ;aa,bb,cc,dd
vpermd ymm2,ymm3,ymm0
vmovdqu ymm5,ymmword ptr [MulStr101D]
vpmuludq ymm2,ymm2,ymm5    ;abcd0000,efgh,ijkl0000,mnop
vpshufd ymm3,ymm2,00000010b    ;0a,0b,0c,0d + 0b,xx,0d,xx
vpaddd ymm2,ymm2,ymm3        ;ab,0,x,x,cd,0,x,x
vextracti128 xmm1,ymm2,1
vmovd eax,xmm2    ;ab
vmovd edx,xmm1    ;cd
imul rax,100000000
add rax,rdx        ;abcd
vextracti128 xmm0,ymm0,1
vmovd r8d,xmm0    ;e
imul rax,10000
add rax,r8
ret
StrASCI BYTE 48
MulStr101B BYTE 10 DUP(10,1),12 DUP(0)
MulStr101W WORD 5 DUP(100,1),6 DUP(0)
MulStr101D DWORD 2 DUP(10000,0,1,0)
StrConvPermD DWORD 0,0,1,1,2,2,3,3
String_To_Int endp

sinsi · March 01, 2024, 03:05:07 PM

Quoteread a char, multiply accumulator by 10, add in char

Would shifts (or lea, maybe) be any faster?

NoCforMe · March 01, 2024, 07:26:06 PM

Something like this?

Code Select

; # to multiply X 10 in EAX:
    MOV    EDX, EAX
    SHL    EDX, 3
    SHL    EAX, 1
    ADD    EAX, EDX

Would that be faster than a MUL?

jj2007 · March 01, 2024, 09:54:37 PM

Quote from: InfiniteLoop on March 01, 2024, 01:44:27 PMString_To_Int proc

It would be nice to see a working example. I can build it but the results are not correct.

Quote from: sinsi on March 01, 2024, 03:05:07 PM
Quoteread a char, multiply accumulator by 10, add in char
Would shifts (or lea, maybe) be any faster?

Most of the time, imul is faster (and 2 bytes shorter):

Code Select

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)
Averages:
0      cycles for *10 with imul
29      cycles for *10 with lea, shl 1
30      cycles for *10 with lea, add eax, eax

Zero cycles means there is no difference to an empty loop.

The three variants:
mov eax, 12345678
imul eax, eax, 10

mov eax, 12345678
lea eax, [4*eax+eax]
shl eax, 1

mov eax, 12345678
lea eax, [4*eax+eax]
add eax, eax

However, the picture changes when using a REPEAT 9:

Code Select

Averages:
744    cycles for *10 with imul
485    cycles for *10 with lea, shl 1
500    cycles for *10 with lea, add eax, eax

Code Select

TestA proc
  mov ebx, AlgoLoops-1    ; loop e.g. 100x
  align 4
  .Repeat
    mov eax, 1
    REPEAT 9
        imul eax, eax, 10
    ENDM
    dec ebx
  .Until Sign?
  ret
TestA endp

What does that mean for real life applications? I don't know

daydreamer · March 02, 2024, 01:58:13 AM

Modern Real world might take advantage of standard unicode 16 bit = save SSE unpack instructions before SIMD multiplications?

jj2007 · March 02, 2024, 04:31:05 AM

Go ahead, daydreamer, it's time for your SIMD version

hyder · May 14, 2024, 01:23:10 PM

Quote from: NoCforMe on March 01, 2024, 08:35:07 AMSorry, but I have to ask: what's the point of this? I mean, who really cares how much time it takes to convert character representations of numbers to binary? Is there actually some real-world application where this would be necessary? or even helpful? Some huuuuge database? I can't think of a single case.

For normal programs, you may very well be correct. However, for library code you really want to optimize the heck out of the functions because you have no control over how that code will be used. If someone uses a really fast library function in a program that doesn't need the speed, little is lost. However, if you use a slow library routine in a program that *does* require speed, you'll wish the author put a little more effort into that code.

Someone suggested elsewhere on this board that numeric to string and string to numeric conversions don't need to be fast because most of the time they're used to process input from a user (keyboard). While it may be true that those particular applications don't require high performance, consider the person parsing large XML or JSON files. Such applications may very well have a bottleneck associated with string-to-numeric (or vice-versa) conversions. Screwing over the author of such programs because you don't believe high-performance code isn't important for such functions is bad library design methodology.
Cheers,
Randy Hyde

The MASM Forum

News:

An (failed) idea for decimal string-to-numeric

hyder

NoCforMe

jj2007

NoCforMe

InfiniteLoop

sinsi

NoCforMe

jj2007

daydreamer

jj2007

hyder