UTF-8 StrLen (String length computation)

Started by Antariy, September 15, 2013, 01:40:24 AM


Antariy

The Unicode UTF-8 character / string length computation code is below. Any ideas on a simpler implementation?

    u8_loadchar MACRO   charPtr:REQ, theReg:=<EAX>
        LOCAL v1, v2, v3
        v3 TEXTEQU <@CatStr(%OPATTR(charPtr))>
        IF v3 AND 10000Y   ; reg
            v1 TEXTEQU <@CatStr(%SIZE(TYPE(charPtr)))>
            v2 TEXTEQU <@CatStr(%SIZE(TYPE(EAX)))>
            IF v1 GT v2  ; byte-sized reg bug fix
                v1 TEXTEQU <1>
            ENDIF
            IF v1 LT v2
                movzx theReg,charPtr
            ELSE
                mov theReg,charPtr
            ENDIF
        ELSEIF v3 AND 11Y  ; mem ref
            movzx theReg,byte ptr charPtr
       
        ELSE
            echo Unknown reference to a u8_loadchar macro!
            .ERR
        ENDIF
    ENDM


    u8clen    MACRO    theReg:=<EAX>
    ; expects the first byte of a UTF-8 char zero-extended into theReg
    ; (as u8_loadchar does); returns the char's length in bytes in theReg
    LOCAL bytereg

    bytereg TEXTEQU @CatStr(@SubStr(theReg,2,1),<l>)    ; low byte of theReg, e.g. AL for EAX

        ; for instance reg is 11111101 / 01111111 for a
        ; first char of a 6-byte UTF-8 char / one standard-ASCII UTF-8 char

        ; get stop bit pos and take care of a standard-ASCII char
        not bytereg         ; 00000010 / 10000000
        bsr theReg,theReg   ; 1 / 7
        add bytereg,-7      ; -6 / 0
        neg bytereg         ; 6 / 0
        sbb bytereg,-1      ; adds 1 minus CF: NEG of non-zero sets CF (cancels the +1),
                            ; NEG 0 clears CF (keeps the +1) - result: 6 / 1

    ENDM
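
    ; A quick trace of the trick for two first-byte values, assuming the byte
    ; was zero-extended into EAX (as u8_loadchar does):
    ;   0E4h = 11100100 (lead byte of a 3-byte char):
    ;     not al -> 00011011, bsr eax,eax -> 4, add al,-7 -> -3,
    ;     neg al -> 3 (CF=1), sbb al,-1 -> 3+1-1 = 3 bytes
    ;   41h = 01000001 ("A", plain ASCII):
    ;     not al -> 10111110, bsr eax,eax -> 7, add al,-7 -> 0,
    ;     neg al -> 0 (CF=0), sbb al,-1 -> 0+1-0 = 1 byte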
   
    u8strlen    MACRO strPtr:REQ, strLen

       
    LOCAL l1
    LOCAL l0
        xor eax,eax
        cdq
        jmp l1
       
    align 16
    l0:
        u8_loadchar [strPtr+edx],ecx
        u8clen ecx
        add edx,ecx
        inc eax
    l1:
    ifb <strLen>
        cmp byte ptr [strPtr+edx],0
        jnz l0
    else
        cmp edx,strLen
        jb l0
   
    endif
   
    ENDM
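
A minimal usage sketch (the label name and ESI are just examples; it assumes the three macros above are included in the same source):

    .data
        ; "Ab" + Cyrillic "Я" (bytes 0D0h,0AFh) + terminator = 3 chars, 4 bytes
        myUtf8Str   db "Ab",0D0h,0AFh,0
    .code
        lea esi,myUtf8Str
        u8strlen esi        ; scans up to the zero terminator
        ; afterwards: EAX = 3 (characters), EDX = 4 (bytes), ECX is clobbered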

Antariy

Maybe there is a question like "why do you need specific code to compute the UTF-8 string length, why not use the usual ASCII StrLen?" Here is the explanation.

A UTF-8 character has a variable byte length: from one byte for characters with codes below 128 up to 6 bytes (the current standard limits characters to 4 bytes). I.e. in one string you may have, for instance, 3 chars, each with a different byte length, or 1000 characters with widths varying from 1 to 4/6 bytes. The only way to know the character count is to walk through all of them and compute the length of every char. The same applies if you want to get the Nth char of the string: you have to walk through the first N-1 chars, because the byte offset in the string is unknown - it cannot be accessed as easily as in ASCII or UTF-16 (like mov al,[stringPtr+Nth-1]).

The usual ASCII StrLen will not return the correct string length (character count) for a UTF-8 string. The same goes for accessing parts of the string: if you want to access arbitrary positions in the string, you should use code like the above for proper positioning, otherwise you will break the UTF-8 byte sequence and get garbage instead of the part of the text you wanted.
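
For illustration, here is how a tiny mixed string sits in memory (the label name is just an example):

    .data
        ; "A" (U+0041) = 1 byte, "Я" (U+042F) = 2 bytes, "中" (U+4E2D) = 3 bytes
        utf8Mix     db 41h              ; "A"
                    db 0D0h,0AFh        ; "Я"
                    db 0E4h,0B8h,0ADh   ; "中"
                    db 0                ; terminator
        ; 3 characters but 6 data bytes - a usual ASCII StrLen reports 6, not 3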

jj2007

Interesting, Alex :t

Can you give an example of usage? Right now, I can see more examples where it would not work because a byte count would be expected, like printing part of a string to a file...

Antariy

Jochen, a byte count isn't hard to calculate; the really hard thing with UTF-8 is arbitrary, exact positioning inside the string - I explained why in the post above.

Since UTF-8 is a Unicode encoding, it may contain text in any language - characters of variable length - so if you have a text that contains more than just standard ASCII chars (codes less than 128), you cannot use straight pointer arithmetic to access positions inside the string. Take, for instance, "This is testing text. Это тестовый текст. [here are other languages or special characters, like mathematical, pseudographical etc. - Unicode allows them all in one text]. 123 - try to access the substring", and try to access, let's say, its 40th character.

Quote from: Antariy on September 15, 2013, 07:09:51 AM
The same applies if you want to get the Nth char of the string: you have to walk through the first N-1 chars, because the byte offset in the string is unknown - it cannot be accessed as easily as in ASCII or UTF-16 (like mov al,[stringPtr+Nth-1]).

In short: if you want, for instance, to extract part of a UTF-8 text without converting it to UTF-16 first, you need to position the pointer properly by computing the characters' lengths.
The point becomes clear once you try to do anything with the string that requires positioning in it; the explanation is probably a bit unclear with my English.
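
A sketch of what I mean (register choice and the counts are just an example; u8clen is the macro from the first post) - say you want the bytes occupied by 5 characters starting at the 40th char:

        ; ESI = pointer to the UTF-8 string
        ; (assumes the string has at least 44 characters)
        xor edx,edx
        mov ecx,39              ; skip the first 39 chars
    @@: movzx eax,byte ptr [esi+edx]
        u8clen eax              ; EAX = byte length of this char
        add edx,eax
        loop @B
        mov edi,edx             ; EDI = byte offset of the 40th char
        mov ecx,5               ; walk over the 5 wanted chars
    @@: movzx eax,byte ptr [esi+edx]
        u8clen eax
        add edx,eax
        loop @B
        sub edx,edi             ; EDX = byte length of those 5 chars
        ; the substring is the EDX bytes starting at [esi+edi]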

jj2007

Alex,

What you write makes sense, of course. But I wonder where we could use it in a real-life app...
I just made a test with my wPrint console stuff, it works, but it is UTF-16 because that's what you get from a resource file. The byte count of the 7-char string in UTF-8 is 21, by the way.

Maybe it would be useful for webpages; they are very often UTF-8.

   wPrint wLeft$(wRes$(1), 1), wCrLf$
...
   wPrint wLeft$(wRes$(1), 7), wCrLf$


Antariy

Quote from: jj2007 on September 15, 2013, 07:51:22 AM
But I wonder where we could use it in a real-life app...

Simplest example: parsing a UTF-8 string as it is, without conversion to UTF-16.

Quote from: jj2007 on September 15, 2013, 07:51:22 AM
The byte count of the 7-char string in UTF-8 is 21, by the way.

Jochen, this example string is a bit too simplified. In a real app you will have to work with real text containing Latin characters (for webpages that's the tags, the Latin text etc.) and characters of any other languages in any possible intermix (the same text may contain phrases in different languages, special symbols and so on).
(As for 7 chars - 21 bytes: that's because these Chinese characters are 3 bytes wide each. But insert one ASCII letter into the text - let's say "-" or a dot - and it becomes 8 chars and 22 bytes; insert a Russian letter and it becomes 9 chars and 24 bytes, etc. In real text you will have a mix of characters (and words built from them) with different widths, and that is what makes positioning difficult.)

Antariy

The u8clen macro returns the length of a UTF-8 char: load the first byte of the char into the reg, execute the macro, and it returns in the same reg the length in bytes of the char that this (first or only) byte belongs to. The string length computation code is based on this macro - it just sequentially accesses every char, starting from the first one: it computes the char's length in bytes, adds this length to the pointer so it points to the next char in the string, grabs the first byte of that char, computes its length, adds it to the pointer, and so on.

As a simple example, suppose you have three strings: ASCII/ANSI (8 bits), UTF-16 (16 bits, "usual" Unicode), and UTF-8.
To get the 10th char of each string, you do the following:

ASCII/ANSI:

mov al,[stringPointer+9]


UTF-16:

mov ax,[stringPointer+9*2]


UTF-8:

; find the char position in the string
mov ecx,9
xor edx,edx
@@:
movzx eax,byte ptr [stringPointer+edx]  ; zero-extend - u8clen uses the full register
u8clen eax
add edx,eax
loop @B
; now stringPointer+edx is the pos of the 10th char
movzx eax,byte ptr [stringPointer+edx]
u8clen eax
; now the length of the 10th char is in eax


This is just an example.

The point is that such manipulations with UTF-8 are much slower than with fixed-width charsets, but they can have benefits: lower memory usage, and an overall speedup when you have to process a huge number of small strings, or one huge string, and only occasionally need arbitrary access inside the same string - then you don't spend time and memory converting the entire string(s) to UTF-16 just to get simpler access to the chars.