UTF-8 StrLen (String length computation)

Started by Antariy, September 15, 2013, 01:40:24 AM


Antariy

The Unicode UTF-8 character / string length computation code is below. Any ideas on a simpler implementation?

    u8_loadchar MACRO   charPtr:REQ, theReg:=<EAX>
        LOCAL v1, v2, v3
        v3 TEXTEQU <@CatStr(%OPATTR(charPtr))>
        IF v3 AND 10000Y   ; reg
            v1 TEXTEQU <@CatStr(%SIZE(TYPE(charPtr)))>
            v2 TEXTEQU <@CatStr(%SIZE(TYPE(EAX)))>
            IF v1 GT v2  ; byte-sized reg bug fix
                v1 TEXTEQU <1>
            ENDIF
            IF v1 LT v2
                movzx theReg,charPtr
            ELSE
                mov theReg,charPtr
            ENDIF
        ELSEIF v3 AND 11Y  ; mem ref
            movzx theReg,byte ptr charPtr
       
        ELSE
            echo Unknown reference to a u8_loadchar macro!
            .ERR
        ENDIF
    ENDM


    u8clen    MACRO    theReg:=<EAX>
    ; expects the first byte of a UTF-8 char zero-extended into theReg
    ; (as u8_loadchar does); returns the char's length in bytes in theReg
    LOCAL bytereg

    bytereg TEXTEQU @CatStr(@SubStr(theReg,2,1),<l>)    ; low byte of theReg, e.g. AL for EAX

        ; for instance reg is 11111101 / 01111111 for a
        ; first char of a 6-byte UTF-8 char / one standard-ASCII UTF-8 char

        ; get stop bit pos and take care of a standard-ASCII char
        not bytereg         ; 00000010 / 10000000
        bsr theReg,theReg   ; 1 / 7
        add bytereg,-7      ; -6 / 0
        neg bytereg         ; 6 / 0
        sbb bytereg,-1      ; adds 1 minus CF: NEG of non-zero sets CF (cancels the +1),
                            ; NEG 0 clears CF (keeps the +1) - result: 6 / 1

    ENDM
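
    ; A quick trace of the trick for two first-byte values, assuming the byte
    ; was zero-extended into EAX (as u8_loadchar does):
    ;   0E4h = 11100100 (lead byte of a 3-byte char):
    ;     not al -> 00011011, bsr eax,eax -> 4, add al,-7 -> -3,
    ;     neg al -> 3 (CF=1), sbb al,-1 -> 3+1-1 = 3 bytes
    ;   41h = 01000001 ("A", plain ASCII):
    ;     not al -> 10111110, bsr eax,eax -> 7, add al,-7 -> 0,
    ;     neg al -> 0 (CF=0), sbb al,-1 -> 0+1-0 = 1 byte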
   
    u8strlen    MACRO strPtr:REQ, strLen

       
    LOCAL l1
    LOCAL l0
        xor eax,eax
        cdq
        jmp l1
       
    align 16
    l0:
        u8_loadchar [strPtr+edx],ecx
        u8clen ecx
        add edx,ecx
        inc eax
    l1:
    ifb <strLen>
        cmp byte ptr [strPtr+edx],0
        jnz l0
    else
        cmp edx,strLen
        jb l0
   
    endif
   
    ENDM
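
A minimal usage sketch (the label name and ESI are just examples; it assumes the three macros above are included in the same source):

    .data
        ; "Ab" + Cyrillic "Я" (bytes 0D0h,0AFh) + terminator = 3 chars, 4 bytes
        myUtf8Str   db "Ab",0D0h,0AFh,0
    .code
        lea esi,myUtf8Str
        u8strlen esi        ; scans up to the zero terminator
        ; afterwards: EAX = 3 (characters), EDX = 4 (bytes), ECX is clobbered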

Antariy

Maybe there is a question like "why do you need specific code to compute the UTF-8 string length, why not use the usual ASCII StrLen?" Here is the explanation.

A UTF-8 character has a variable byte length: from one byte for characters with codes below 128 up to 6 bytes (the current standard limits characters to 4 bytes). I.e. in one string you may have, for instance, 3 chars, each with a different byte length, or 1000 characters with widths varying from 1 to 4/6 bytes. The only way to know the character count is to walk through all of them and compute the length of every char. The same applies if you want to get the Nth char of the string: you have to walk through the first N-1 chars, because the byte offset in the string is unknown - it cannot be accessed as easily as in ASCII or UTF-16 (like mov al,[stringPtr+Nth-1]).

The usual ASCII StrLen will not return the correct string length (character count) for a UTF-8 string. The same goes for accessing parts of the string: if you want to access arbitrary positions in the string, you should use code like the above for proper positioning, otherwise you will break the UTF-8 byte sequence and get garbage instead of the part of the text you wanted.
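
For illustration, here is how a tiny mixed string sits in memory (the label name is just an example):

    .data
        ; "A" (U+0041) = 1 byte, "Я" (U+042F) = 2 bytes, "中" (U+4E2D) = 3 bytes
        utf8Mix     db 41h              ; "A"
                    db 0D0h,0AFh        ; "Я"
                    db 0E4h,0B8h,0ADh   ; "中"
                    db 0                ; terminator
        ; 3 characters but 6 data bytes - a usual ASCII StrLen reports 6, not 3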

jj2007

Interesting, Alex :t

Can you give an example of usage? Right now, I can see more examples where it would not work because a byte count would be expected, like printing part of a string to a file...

Antariy

Jochen, a byte count isn't hard to calculate; the really hard thing with UTF-8 is arbitrary, exact positioning inside the string - I explained why in the post above.

Since UTF-8 is a Unicode encoding, it may contain text in any language - characters of variable length - so if you have a text that contains more than just standard ASCII chars (codes less than 128), you cannot use straight pointer arithmetic to access positions inside the string. Take, for instance, "This is testing text. Это тестовый текст. [here are other languages or special characters, like mathematical, pseudographical etc. - Unicode allows them all in one text]. 123 - try to access the substring", and try to access, let's say, its 40th character.

Quote from: Antariy on September 15, 2013, 07:09:51 AM
The same applies if you want to get the Nth char of the string: you have to walk through the first N-1 chars, because the byte offset in the string is unknown - it cannot be accessed as easily as in ASCII or UTF-16 (like mov al,[stringPtr+Nth-1]).

In short: if you want, for instance, to extract part of a UTF-8 text without converting it to UTF-16 first, you need to position the pointer properly by computing the characters' lengths.
The point becomes clear once you try to do anything with the string that requires positioning in it; the explanation is probably a bit unclear with my English.
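
A sketch of what I mean (register choice and the counts are just an example; u8clen is the macro from the first post) - say you want the bytes occupied by 5 characters starting at the 40th char:

        ; ESI = pointer to the UTF-8 string
        ; (assumes the string has at least 44 characters)
        xor edx,edx
        mov ecx,39              ; skip the first 39 chars
    @@: movzx eax,byte ptr [esi+edx]
        u8clen eax              ; EAX = byte length of this char
        add edx,eax
        loop @B
        mov edi,edx             ; EDI = byte offset of the 40th char
        mov ecx,5               ; walk over the 5 wanted chars
    @@: movzx eax,byte ptr [esi+edx]
        u8clen eax
        add edx,eax
        loop @B
        sub edx,edi             ; EDX = byte length of those 5 chars
        ; the substring is the EDX bytes starting at [esi+edi]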

jj2007

Alex,

What you write makes sense, of course. But I wonder where we could use it in a real-life app...
I just made a test with my wPrint console stuff, it works, but it is UTF-16 because that's what you get from a resource file. The byte count of the 7-char string in UTF-8 is 21, by the way.

Maybe it would be useful for webpages; they are very often UTF-8.

   wPrint wLeft$(wRes$(1), 1), wCrLf$
...
   wPrint wLeft$(wRes$(1), 7), wCrLf$


Antariy

Quote from: jj2007 on September 15, 2013, 07:51:22 AM
But I wonder where we could use it in a real-life app...

Simplest example: parsing a UTF-8 string as it is, without conversion to UTF-16.

Quote from: jj2007 on September 15, 2013, 07:51:22 AM
The byte count of the 7-char string in UTF-8 is 21, by the way.

Jochen, this example string is a bit too simplified. In a real app you will have to work with real text containing Latin characters (for webpages that's the tags, the Latin text etc.) and characters of any other languages in any possible intermix (the same text may contain phrases in different languages, special symbols and so on).
(As for 7 chars - 21 bytes: that's because these Chinese characters are 3 bytes wide each. But insert one ASCII letter into the text - let's say "-" or a dot - and it becomes 8 chars and 22 bytes; insert a Russian letter and it becomes 9 chars and 24 bytes, etc. In real text you will have a mix of characters (and words built from them) with different widths, and that is what makes positioning difficult.)

Antariy

The u8clen macro returns the length of a UTF-8 char: load the first byte of the char into the reg, execute the macro, and it returns in the same reg the length in bytes of the char that this (first or only) byte belongs to. The string length computation code is based on this macro - it just sequentially accesses every char, starting from the first one: it computes the char's length in bytes, adds this length to the pointer so it points to the next char in the string, grabs the first byte of that char, computes its length, adds it to the pointer, and so on.

As a simple example, suppose you have three strings: ASCII/ANSI (8 bits), UTF-16 (16 bits, "usual" Unicode), and UTF-8.
To get the 10th char of each string, you do the following:

ASCII/ANSI:

mov al,[stringPointer+9]


UTF-16:

mov ax,[stringPointer+9*2]


UTF-8:

; find the char position in the string
mov ecx,9
xor edx,edx
@@:
movzx eax,byte ptr [stringPointer+edx]  ; zero-extend - u8clen uses the full register
u8clen eax
add edx,eax
loop @B
; now stringPointer+edx is the pos of the 10th char
movzx eax,byte ptr [stringPointer+edx]
u8clen eax
; now the length of the 10th char is in eax


This is just an example.

The point is that such manipulations with UTF-8 are much slower than with fixed-width charsets, but they can have benefits: lower memory usage, and an overall speedup when you have to process a huge number of small strings, or one huge string, and only occasionally need arbitrary access inside the same string - then you don't spend time and memory converting the entire string(s) to UTF-16 just to get simpler access to the chars.