News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

An (failed) idea for decimal string-to-numeric

Started by hyder, March 01, 2024, 08:17:46 AM

Previous topic - Next topic

hyder

I've been working on a fast conversion function for strings (containing decimal digits) to integer form, for an upcoming book on 32-bit ARM assembly language. The code I wrote for the traditional conversion (read a char, multiply accumulator by 10, add in char) ran in about 56 seconds on a Pi 400 (100,000,000 iterations with about a dozen sample input values). I then used the character set lookup table algorithm (used in the hexadecimal string-to-integer algorithm I've posted elsewhere on this board). That ran in about 40 seconds.

One thing that always bothered me is why we can process multiple characters at a time (similar to how the fastest numeric-to-string algorithms do this). I broke down and wrote some ARM assembly to do this (including handling special cases like preventing fetching two bytes across an MMU page boundary, which could lead to a segmentation fault). I created a huge state machine to process character pairs of the form:

1. <digit><digit>
2. <digit><underscore>
3. <underscore><digit>
4. <digit><delimiter>
5. <delimiter><anything>
6. <digit><illegal character>
7. <illegal character><anything>

I used the character pair as an index into a 65,536-entry table (each four bytes), which held the address of the code to handle that character pair (about 120 cases, or so). Obviously, the function was quite large (the lookup table alone was 256KB).

End result: my sample data (100,000,000 iterations) ran in 38 seconds. It would be hard to justify a function of this size for a two-second improvement (about 5%). Especially as it lays waste to the cache and may affect the performance of the rest of an application that uses it. As such, I consider this approach a failure.

I'm not going to bother converting the code to x86. You can find the (ARM) code here (https://randallhyde.com/Forum/viewtopic.php?t=707), if you are interested in taking a look. There is probably a better way to process two characters at a time, but I doubt I'd come up with a scheme that is almost twice as fast as the single-character algorithm, so I don't think I'll be trying to advance this idea any further.

If someone has a (very) fast algorithm for converting decimal strings to integer values, I'd love to see it.
Cheers,
Randy Hyde

NoCforMe

Sorry, but I have to ask: what's the point of this? I mean, who really cares how much time it takes to convert character representations of numbers to binary? Is there actually some real-world application where this would be necessary? or even helpful? Some huuuuge database? I can't think of a single case.
Assembly language programming should be fun. That's why I do it.

jj2007

Quote from: hyder on March 01, 2024, 08:17:46 AMYou can find the (ARM) code here (https://randallhyde.com/Forum/viewtopic.php?t=707), if you are interested in taking a look.

QuoteI came across many comments along the lines of "premature optimization is a waste of time" and "why do you care how fast this function is?" If the original poster was asking how to do this for a single application, some argument could be made about not rushing to optimize the function until you've profiled your code and determined where the real performance bottlenecks lie. However, this argument does not apply to library code (such as a string-to-integer conversion). Library code is going to be used it a wide variety of applications; some where the performance doesn't matter, some where it matters a lot. Anyone who produces a lot of XML or JSON output can testify that converting integers to human-readable strings, and vice-versa, can be real bottlenecks. Library code should always be optimized, independent of some individual application that uses it, because the library's author(s) will never know how important that code might be.

Wise words :thumbsup:

NoCforMe

Assembly language programming should be fun. That's why I do it.

InfiniteLoop

This is the best I can do for now. Extracts are painful. I think splitting the higher part would avoid 1.
String_To_Int proc
;abcd efgh ijkl mnop qrst1 -> ab ef ij mn qr -> a e i m q
vmovdqu ymm0,ymmword ptr [rcx]
vpbroadcastb ymm4, byte ptr [StrASCI]
vpsubb ymm0,ymm0,ymm4
vmovdqu ymm1,ymmword ptr [MulStr101B]
vmovdqu ymm2,ymmword ptr [MulStr101W]
vpmaddubsw ymm0,ymm0,ymm1        ;a,b,c,d,e,f,g,h,i,j
vpmaddwd ymm0,ymm0,ymm2            ;a,b,c,d,e
vmovdqu ymm3,ymmword ptr [StrConvPermD]    ;aa,bb,cc,dd
vpermd ymm2,ymm3,ymm0
vmovdqu ymm5,ymmword ptr [MulStr101D]
vpmuludq ymm2,ymm2,ymm5    ;abcd0000,efgh,ijkl0000,mnop
vpshufd ymm3,ymm2,00000010b    ;0a,0b,0c,0d + 0b,xx,0d,xx
vpaddd ymm2,ymm2,ymm3        ;ab,0,x,x,cd,0,x,x
vextracti128 xmm1,ymm2,1
vmovd eax,xmm2    ;ab
vmovd edx,xmm1    ;cd
imul rax,100000000
add rax,rdx        ;abcd
vextracti128 xmm0,ymm0,1
vmovd r8d,xmm0    ;e
imul rax,10000
add rax,r8
ret
StrASCI BYTE 48
MulStr101B BYTE 10 DUP(10,1),12 DUP(0)
MulStr101W WORD 5 DUP(100,1),6 DUP(0)
MulStr101D DWORD 2 DUP(10000,0,1,0)
StrConvPermD DWORD 0,0,1,1,2,2,3,3
String_To_Int endp

sinsi

Quoteread a char, multiply accumulator by 10, add in char
Would shifts (or lea, maybe) be any faster?
🍺🍺🍺

NoCforMe

Something like this?
; # to multiply X 10 in EAX:
    MOV    EDX, EAX
    SHL    EDX, 3
    SHL    EAX, 1
    ADD    EAX, EDX
Would that be faster than a MUL?
Assembly language programming should be fun. That's why I do it.

jj2007

Quote from: InfiniteLoop on March 01, 2024, 01:44:27 PMString_To_Int proc

It would be nice to see a working example. I can build it but the results are not correct.

Quote from: sinsi on March 01, 2024, 03:05:07 PM
Quoteread a char, multiply accumulator by 10, add in char
Would shifts (or lea, maybe) be any faster?

Most of the time, imul is faster (and 2 bytes shorter):

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)
Averages:
0      cycles for *10 with imul
29      cycles for *10 with lea, shl 1
30      cycles for *10 with lea, add eax, eax

Zero cycles means there is no difference to an empty loop.

The three variants:
    mov eax, 12345678
    imul eax, eax, 10

    mov eax, 12345678
    lea eax, [4*eax+eax]
    shl eax, 1

    mov eax, 12345678
    lea eax, [4*eax+eax]
    add eax, eax

However, the picture changes when using a REPEAT 9:
Averages:
744    cycles for *10 with imul
485    cycles for *10 with lea, shl 1
500    cycles for *10 with lea, add eax, eax

TestA proc
  mov ebx, AlgoLoops-1    ; loop e.g. 100x
  align 4
  .Repeat
    mov eax, 1
    REPEAT 9
        imul eax, eax, 10
    ENDM
    dec ebx
  .Until Sign?
  ret
TestA endp

What does that mean for real life applications? I don't know :cool:

daydreamer

Modern Real world might take advantage of standard unicode 16 bit = save SSE unpack instructions before SIMD multiplications?
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding


hyder

Quote from: NoCforMe on March 01, 2024, 08:35:07 AMSorry, but I have to ask: what's the point of this? I mean, who really cares how much time it takes to convert character representations of numbers to binary? Is there actually some real-world application where this would be necessary? or even helpful? Some huuuuge database? I can't think of a single case.

For normal programs, you may very well be correct. However, for library code you really want to optimize the heck out of the functions because you have no control over how that code will be used. If someone uses a really fast library function in a program that doesn't need the speed, little is lost. However, if you use a slow library routine in a program that *does* require speed, you'll wish the author put a little more effort into that code.

Someone suggested elsewhere on this board that numeric to string and string to numeric conversions don't need to be fast because most of the time they're used to process input from a user (keyboard). While it may be true that those particular applications don't require high performance, consider the person parsing large XML or JSON files. Such applications may very well have a bottleneck associated with string-to-numeric (or vice-versa) conversions. Screwing over the author of such programs because you don't believe high-performance code isn't important for such functions is bad library design methodology.
Cheers,
Randy Hyde