Faster Memcopy ...

HSE · August 17, 2023, 07:04:39 AM

Speaking of missing people, rrr314159 hasn't posted anything on Quora since March 4. Not good.

guga · August 17, 2023, 08:22:14 AM

Quote from: jj2007 on August 17, 2023, 06:55:55 AM
Quote from: guga on August 17, 2023, 04:21:21 AM
it has some small flaws when retrieving the length of unaligned strings and also of strings that may contain extra bytes at the end (immediately after the null terminator)

Hi Guga,

I'm not aware of such problems, can you post an example of a string that misbehaves, in Masm syntax?

There is a known issue with VirtualAlloc'ed strings.

Hi JJ, ok... I'll post 2 examples. One now, so you can see what's wrong (and my fix). Then I'll try to recreate the error in the Ansi version of the string length algo.

StrlenW - Unicode version

Code Select

_________________________________
.data
UnicodeString2  db 1 ; Yes, a single byte, to force the string to be unaligned
                text "UTF-16LE", 'Hello',0
NextData        dw 9
                dw 1
                dw 1
                dw 1
                dw 1
                dw 1
                dw 1
                dw 1
.code
(.................)

mov edi, offset UnicodeString2
inc edi ; Make the string start exactly at "Hello", inside our data chain, so it is unaligned
push edi
call StrLenW
add eax, 2
add edi, eax ; edi should now point just past the null terminator word, i.e. to "NextData", but it points to the 2nd "l" in Hello

The function was (the one that returns incorrect values in some circumstances):

Code Select

StrLenW        proc near
Input           = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                push    edx
                mov     eax, [ebp+Input]
                mov     ecx, eax
                and     eax, 0FFFFFFF0h
                and     ecx, 0Fh
                or      edx, 0FFFFFFFFh
                shl     edx, cl
                movdqu  xmm1, xmmword ptr [eax] ; movdqu  xmm1, [eax] ?
                xorps   xmm0, xmm0
                pcmpeqw xmm0, xmmword ptr [eax] ; pcmpeqw xmm0, [eax] ?
                add     eax, 10h
                pmovmskb ecx, xmm0
                xorps   xmm0, xmm0
                and     ecx, edx
                jnz     short loc_4040B0

loc_40409E: 
                movups  xmm1, xmmword ptr [eax]
                pcmpeqw xmm1, xmm0
                pmovmskb ecx, xmm1
                add     eax, 10h
                test    ecx, ecx
                jz      short loc_40409E

loc_4040B0:
                bsf     ecx, ecx
                lea     eax, [ecx+eax-10h]
                sub     eax, [ebp+Input]
                shr     eax, 1
                pop     edx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLenW        endp

In the example above I placed the Unicode text between two pieces of data (1 byte before the string starts, and a sequence of words after it). This was to force a check for alignment problems.

The same problem happens if you replace "Hello" with text from another codepage, such as Chinese. Try replacing Hello with dw 04f60, 0597d, 056fe, 06211, 04e00, 0 and this kind of error will happen more often.
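One way to picture why an odd start address can trip a scan that works on a 16-byte-aligned grid: the aligned grid pairs the bytes differently from the string's own characters, so the high byte of one char and the low byte of the next can look like a zero word. The sketch below is a plain C model and not the forum code; `find_zero_word`, its `phase` parameter and the `sample` bytes are hypothetical names and data chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scan a byte buffer for the first 16-bit zero, pairing bytes starting at
   offset `phase` (0 or 1). Returns the byte offset of that pair, or -1. */
long find_zero_word(const unsigned char *buf, size_t n, size_t phase)
{
    size_t i;
    assert(buf != NULL && phase < 2);
    for (i = phase; i + 1 < n; i += 2) {
        if (buf[i] == 0 && buf[i + 1] == 0)
            return (long)i;
    }
    return -1;
}

/* One pad byte, then U+0041, U+4100, U+0042 and the terminator in
   UTF-16LE. The string itself starts at offset 1, as in the post. */
const unsigned char sample[] = {
    0x01,        /* pad byte, forces an odd start address */
    0x41, 0x00,  /* U+0041 */
    0x00, 0x41,  /* U+4100 */
    0x42, 0x00,  /* U+0042 */
    0x00, 0x00   /* terminator */
};
```

Pairing from offset 1 (the string's own grid) finds the terminator at offset 7, so the string is 6 bytes long. Pairing from offset 0 (the grid an aligned load would use) reports a zero word at offset 2, in the middle of the text.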

To fix that, I reduced the code to just this (Masm syntax):
My fix

Code Select

StrLenW2       proc near
AddBytes        = dword ptr -4
Input           = dword ptr  8

                push    ebp
                mov     ebp, esp
                sub     esp, 4
                push    ecx
                mov     [ebp+AddBytes], 0
                xorps   xmm0, xmm0
                mov     ecx, [ebp+Input]
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqw xmm0, xmm1
                pmovmskb eax, xmm0
                test    ax, ax
                jnz     short loc_4040BA

loc_4040A1:
                add     ecx, 10h
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqw xmm0, xmm1
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short loc_4040A1
                sub     ecx, [ebp+Input]
                mov     [ebp+AddBytes], ecx

loc_4040BA:
                and     ax, 101010101010101b
                bsf     ax, ax
                add     eax, [ebp+AddBytes]
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLenW2        endp

Of course, this can also be optimized (as I did for strlen), but I'll do that later, after I post the Ansi version for you and we check the speed. This fix for the Unicode version (strlenW) works correctly whether or not the string is embedded between other data, whether or not it has extra bytes after the null terminator word, whether or not it is aligned and, especially, whatever codepage you use. It also works if the string is empty. So it works for regular "ASCII + zero" Unicode and for other codepages (2 bytes per char, as in Chinese, Russian etc.), or even for unusual codepages that use 3 or even 4 bytes to represent a char.

The value returned in eax is the total number of bytes used by the Unicode string (not the number of chars). I made it return bytes instead of chars so that we can also use it on strings in other codepages without being forced to convert with M$ APIs etc.
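For readers who do not read SSE2 fluently, the scan described above can be modelled in portable C. This is a scalar sketch of the idea only, not a translation of the listing: `zero_word_mask` stands in for pcmpeqw followed by pmovmskb, `lowest_set_bit` for bsf, and all three function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of pcmpeqw + pmovmskb on one 16-byte block: for each of the
   8 words that equals zero, set the two mask bits of its two bytes. */
uint32_t zero_word_mask(const unsigned char *block)
{
    uint32_t mask = 0;
    unsigned w;
    for (w = 0; w < 8; w++) {
        if (block[2 * w] == 0 && block[2 * w + 1] == 0)
            mask |= 3u << (2 * w);
    }
    return mask;
}

/* Model of bsf: index of the lowest set bit; mask must be non-zero. */
unsigned lowest_set_bit(uint32_t mask)
{
    unsigned index = 0;
    assert(mask != 0);
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}

/* Byte length of a zero-terminated UTF-16LE string, scanned in 16-byte
   blocks from the string's own start address. The caller must make sure
   the buffer is readable up to the end of the block that holds the
   terminator, just as the 16-byte loads in the listing require. */
size_t strlenw_bytes_model(const unsigned char *s)
{
    size_t offset = 0;
    uint32_t mask;
    assert(s != NULL);
    while ((mask = zero_word_mask(s + offset)) == 0)
        offset += 16;
    return offset + lowest_set_bit(mask & 0x5555u);
}
```

Because the blocks start at the string's own address, the word grid always matches the characters, whatever the alignment. Dividing the result by 2 gives the length in 16-bit units.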

I'll now try to reproduce the error in the strlen version (Ansi). The error follows the same principle, although it is a bit harder to find.

Note:
Below is the RosAsm version of StrLenW (not optimized yet), with all the comments I wrote while trying to fix it.

Code Select

Proc StrLenW2::
    Arguments @Input
    Local @AddBytes
    Uses ecx

    mov D@AddBytes 0 ; Create a variable to hold the bytes already scanned, to be added to the string length later
    xorps xmm0 xmm0 ; Zero xmm0. Why zero? Because we look at words until we find the null terminator word (so 2 zero bytes).
                    ; An xmm register holds 8 words (16 bytes), so later we need to scan those 8 words to find the double 0

    mov ecx D@Input
    movups xmm1 X$ecx ; Load the first 16 bytes of the string into xmm1
    pcmpeqw xmm0 xmm1 ; Compare each word in xmm1 with the matching word in xmm0 (all zero), i.e. look for our double zero
                      ; pcmpeqw treats both registers as vectors of words. Whenever a word in xmm1 equals 0,
                      ; all bits of the corresponding word in xmm0 are set to 1. So each word becomes 0FFFF if a zero was found there, or 0 if not.
                      ; At this point, if the string is at most 7 chars (14 bytes) long, not counting the null terminator word, then wherever the terminator is located
                      ; it always shows up in one of the 8 words. Like this: say we have the Unicode string "Hello", 0
                      ; In xmm1 it is laid out as:
                      ; Word7 Word6 Word5 Word4 Word3 Word2 Word1 Word0 ==> (bits 127 - 0)
                      ; 0001  0007  0000  006F  006C  006C  0065  0048 ==> 0001 and 0007 are extra data after the end of the string (which may or may not exist, so it is irrelevant), followed by 'olleH'
                      ;   ?     ?    Ok    o      l     l     e     H
                      ;            (zero)

                      ; After pcmpeqw, xmm0 becomes:

                      ; Word7 Word6 Word5 Word4 Word3 Word2 Word1 Word0 ==> (bits 127 - 0)
                      ; 0000  0000  FFFF  0000  0000  0000  0000  0000 ==> So our null terminator word is the 6th word = "Word5", and all bits of that word were set: 0FFFF = 00__1111_1111__1111_1111
                      ; which means the terminator sits at the 6th position (in words). So, in theory, we have found our length in chars, which is 6-1 = 5. Ok?
                      ; We now know the size of the string, but expressed as a word position, and we need the number of bytes. So let's continue.

    pmovmskb eax xmm0 ; PMOVMSKB builds a mask from the most significant bit of each byte of xmm0 (the source) and stores the result in eax. But hold on... what does that mean?
                      ; It means we simply transfer the position of our null terminator word from xmm0 to eax. But wait... xmm registers are 128 bits wide, and eax is only 32.
                      ; So what is going on here? pmovmskb produces one bit per byte, so the 16 bytes of xmm0 become the low 16 bits of eax.
                      ; The 0FFFF word (the 6th word, with 16 bits set to 1) therefore shows up at the same relative position, not as a full word but as a pair of bits (11 in binary = 3 in decimal)
                      ; So eax becomes 0C00 = 00__0000_1100__0000_0000. Why? Because each 16-bit word of xmm0 corresponds to 2 bits ("half byte" below) of eax. Therefore:

                      ; Half  Half  Half  Half  Half  Half  Half  Half
                      ; Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1 Byte0 ==> (bits 15 - 0)
                      ;  00    00    11    00    00    00    00    00   <==== bits = 00__0000_1100__0000_0000 (binary) = 0C00 (in hexa)

                      ; In other words, all we need is the value in ax (the low half of eax). The high half is always 0, because pmovmskb only produces 16 bits and clears the rest
                      ; So, whatever the value is, the maximum for eax here is 0FFFF

    test ax ax | jnz L1> ; Test the mask against 0. It tells us whether the 16 bytes just loaded (8 words) contain the terminator.
                         ; ax is all we need, since pmovmskb stores the result in the low word of eax and clears the high word
                           ; If any of those words is the null terminator (double zero), it holds 0FFFF after pcmpeqw, and pmovmskb then flags the "half byte" where the zero was found with 11 (in binary)
                           ; If the 8 words contain no zero word, xmm0 is all 0 and therefore eax is also 0 (because no pair of bits is set).
                           ; This means that whenever our string has at most 7 words/chars (14 bytes), the routine jumps to "L1" and computes the actual number of bytes in the string.
                           ; Otherwise we fall through to the next line, add 16 to the pointer (to look for the double 0 in the next 16 bytes) and loop until we find it.

        L0:
                add ecx 16 ; point to the next 16 bytes of our string
                movups xmm1 X$ecx | pcmpeqw xmm0 xmm1 | pmovmskb eax xmm0
            test ax ax | jz L0< ; Did we find a double 0? No: jump back and do it again. Otherwise, go to the next line
        sub ecx D@Input     ; Now comes something easier. eax holds a mask for the last chunk of 16 bytes only, so all we need to do is add the
                            ; bytes already scanned. Subtracting the start of our string (Input) from the current location in ecx gives the number of bytes scanned so far, which we store in AddBytes
        mov D@AddBytes ecx
L1:

    ;movzx eax ax ; eax = 0 here would mean no terminator within the first 16 bytes
    ; Now comes the trick. At this point eax (or rather ax) holds the position (in bytes) of the double zero: in the example, the pair of bits in Half Byte5, counting from Half Byte0.
    ; Since we are dealing with a word, we end up with 2 bits set. The zero word starts at the first (lower) one, so we can safely discard the second (higher) bit of the pair. Doing this lets us identify
    ; the least significant set bit, which corresponds to the index (length) of our string

    ; So we "and" each pair of bits with 01 (in binary), which discards the 2nd (most significant) bit of each pair. Since we have 8 "half bytes", the mask for all 8 is:
    ; 00_01_01_01_01__01_01_01_01 (in binary) = 05555 (in hexa)
    and ax 00_01_01_01_01__01_01_01_01
    ; Now that the upper bit of each pair is gone, only the lower bit of each of the 8 pairs can remain. After the "and", eax is 0400 (hexa) = 00__0000_0100__0000_0000 (in binary)
    ;  which corresponds exactly to:
    ; Half  Half  Half  Half  Half  Half  Half  Half
    ; Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1 Byte0 ==> (bits 15 - 0)
    ;  00    00    01    00    00    00    00    00   <==== bits = 00__0000_0100__0000_0000 (binary) = 0400 (in hexa)
    ; Ok, now the only set bit is in the 6th position (Half Byte5), i.e. bit 10: the terminator is at byte offset 10, after 5 words. All we need to do is find the position of that bit with bsf
    ; Whatever pair was set, the remaining bit is always the lower one of the pair, so the result is never odd and we don't need to align it with a shr eax 1 | shl eax 1 pair or and eax 0FFFF_FFFE to clear bit0
    ; That is because, in the end, we are counting byte positions. Could we use and ax 00_10_10_10_10__10_10_10_10 instead? Yes, but that would give odd values, which we would then need to fix with and eax 0FFFF_FFFE or with shr eax 1 | shl eax 1
    bsf ax ax ; get the byte offset of the terminator within this chunk of words from xmm1 (loaded there earlier, remember?). We don't need bsf eax because only ax holds bits. So, perhaps, bsf ax ax is faster
              ; Consider doing this with lzcnt or popcnt later
    add eax D@AddBytes ; and now we simply add the bytes scanned in the loop (if we entered the loop earlier) to the value in eax, which gives the correct length of the string in bytes.


EndP
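The numbers quoted in the comments above (0C00 after pmovmskb, 0400 after the "and", and a final offset of 10) can be checked with a small scalar model in C. It is a sketch for illustration, not part of the posted routine; `hello_block` reproduces the 16 bytes from the comments, and the three function names are hypothetical stand-ins for pcmpeqw, pmovmskb and bsf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 16 bytes from the comments: "Hello" in UTF-16LE, the terminator,
   then the two trailing words 0007 and 0001. */
const unsigned char hello_block[16] = {
    0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00, 0x6F, 0x00,
    0x00, 0x00, 0x07, 0x00, 0x01, 0x00
};

/* pcmpeqw against zero: a word becomes 0xFFFF if it was 0, else 0. */
void compare_words_with_zero(const unsigned char *in, unsigned char *out)
{
    int w;
    assert(in != NULL && out != NULL);
    for (w = 0; w < 8; w++) {
        unsigned char v = (in[2 * w] == 0 && in[2 * w + 1] == 0) ? 0xFF : 0x00;
        out[2 * w] = v;
        out[2 * w + 1] = v;
    }
}

/* pmovmskb: collect the top bit of each of the 16 bytes into bits 0..15. */
uint32_t byte_sign_mask(const unsigned char *v)
{
    uint32_t mask = 0;
    int i;
    assert(v != NULL);
    for (i = 0; i < 16; i++)
        mask |= (uint32_t)(v[i] >> 7) << i;
    return mask;
}

/* bsf: index of the lowest set bit; the mask must be non-zero. */
unsigned bit_scan_forward(uint32_t mask)
{
    unsigned index = 0;
    assert(mask != 0);
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}
```

Feeding `hello_block` through the three steps gives the mask 0C00, then 0400 after masking with 5555, and a bit index of 10, i.e. 10 bytes or 5 chars.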

jj2007 · August 17, 2023, 08:39:56 AM

Hi Guga,

Your string behaves just fine with wLen() resp. MbStrLenW - see attachment. Are you using an older algo?

guga · August 17, 2023, 09:20:47 AM

Quote from: jj2007 on August 17, 2023, 08:39:56 AM
Hi Guga,

Your string behaves just fine with wLen() resp. MbStrLenW - see attachment. Are you using an older algo?

I don't know, perhaps. I haven't touched the strlen and strlenW algos in a long time. But which are the correct/updated algos? I can't find them in MasmBasic. I saw the source you uploaded, but the algos aren't there (in Masm syntax).

jj2007 · August 17, 2023, 11:35:27 AM

Quote from: guga on August 17, 2023, 09:20:47 AM
But which are the correct/updated algos?

See your PM

Btw it seems we never timed the Unicode version of StrLen, here it is:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23327   cycles for 100 * CRT wcslen
20836   cycles for 100 * Masm32 ucLen
9566    cycles for 100 * MasmBasic wLen
8981    cycles for 100 * _MbStrLenW

23340   cycles for 100 * CRT wcslen
20809   cycles for 100 * Masm32 ucLen
9566    cycles for 100 * MasmBasic wLen
8988    cycles for 100 * _MbStrLenW

23284   cycles for 100 * CRT wcslen
20852   cycles for 100 * Masm32 ucLen
9157    cycles for 100 * MasmBasic wLen
9003    cycles for 100 * _MbStrLenW

23262   cycles for 100 * CRT wcslen
20883   cycles for 100 * Masm32 ucLen
9586    cycles for 100 * MasmBasic wLen
8992    cycles for 100 * _MbStrLenW

14      bytes for CRT wcslen
10      bytes for Masm32 ucLen
10      bytes for MasmBasic wLen
66      bytes for _MbStrLenW

100     = eax CRT wcslen
100     = eax Masm32 ucLen
100     = eax MasmBasic wLen
100     = eax _MbStrLenW

guga · August 17, 2023, 04:12:03 PM

Hi JJ. Thanks, indeed I was using an older version. But can you test it so we can compare the speed?

jj2007 · August 17, 2023, 07:46:51 PM

Quote from: guga on August 17, 2023, 04:12:03 PM
Hi JJ. Thanks, indeed I was using an older version. But can you test it so we can compare the speed?

Hi Guga,

IMHO your algo is fast enough

You beat poor CRT by a factor of almost 9, but ok, as Assembly programmers we are used to that, aren't we?

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23713   cycles for 100 * CRT wcslen
20882   cycles for 100 * Masm32 ucLen
9471    cycles for 100 * MasmBasic wLen
9030    cycles for 100 * _MbStrLenW
9099    cycles for 100 * _MbStrLenW2
2781    cycles for 100 * StrLenW Guga

23718   cycles for 100 * CRT wcslen
20855   cycles for 100 * Masm32 ucLen
9555    cycles for 100 * MasmBasic wLen
9012    cycles for 100 * _MbStrLenW
9014    cycles for 100 * _MbStrLenW2
2696    cycles for 100 * StrLenW Guga

23641   cycles for 100 * CRT wcslen
20836   cycles for 100 * Masm32 ucLen
9541    cycles for 100 * MasmBasic wLen
9023    cycles for 100 * _MbStrLenW
9010    cycles for 100 * _MbStrLenW2
2699    cycles for 100 * StrLenW Guga

23658   cycles for 100 * CRT wcslen
20844   cycles for 100 * Masm32 ucLen
9435    cycles for 100 * MasmBasic wLen
9008    cycles for 100 * _MbStrLenW
9006    cycles for 100 * _MbStrLenW2
2701    cycles for 100 * StrLenW Guga

14      bytes for CRT wcslen
10      bytes for Masm32 ucLen
10      bytes for MasmBasic wLen
66      bytes for _MbStrLenW
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga

100     = eax CRT wcslen
100     = eax Masm32 ucLen
100     = eax MasmBasic wLen
100     = eax _MbStrLenW
100     = eax _MbStrLenW2
100     = eax StrLenW Guga

guga · August 17, 2023, 11:45:44 PM

Tks a lot, JJ

It seems fast, indeed

I'll now do a few more tests on the strlen version (the Ansi one) and finish some comments in the code, so I won't forget what I've done.

Code Select

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

16116   cycles for 100 * CRT wcslen
25196   cycles for 100 * Masm32 ucLen
7524    cycles for 100 * MasmBasic wLen
7574    cycles for 100 * _MbStrLenW
7405    cycles for 100 * _MbStrLenW2
2568    cycles for 100 * StrLenW Guga

16085   cycles for 100 * CRT wcslen
25030   cycles for 100 * Masm32 ucLen
7537    cycles for 100 * MasmBasic wLen
7419    cycles for 100 * _MbStrLenW
7424    cycles for 100 * _MbStrLenW2
2545    cycles for 100 * StrLenW Guga

16308   cycles for 100 * CRT wcslen
25578   cycles for 100 * Masm32 ucLen
7614    cycles for 100 * MasmBasic wLen
7419    cycles for 100 * _MbStrLenW
7456    cycles for 100 * _MbStrLenW2
2573    cycles for 100 * StrLenW Guga

16086   cycles for 100 * CRT wcslen
25075   cycles for 100 * Masm32 ucLen
7600    cycles for 100 * MasmBasic wLen
7447    cycles for 100 * _MbStrLenW
7396    cycles for 100 * _MbStrLenW2
2547    cycles for 100 * StrLenW Guga

14      bytes for CRT wcslen
10      bytes for Masm32 ucLen
10      bytes for MasmBasic wLen
66      bytes for _MbStrLenW
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga

100     = eax CRT wcslen
100     = eax Masm32 ucLen
100     = eax MasmBasic wLen
100     = eax _MbStrLenW
100     = eax _MbStrLenW2
100     = eax StrLenW Guga

--- ok ---

Btw, you may use it in MasmBasic if you wish, and others can use it too, of course :)

I needed to finish those routines to continue updating them in RosAsm. During the updates I found those minor errors in both functions, which required fixes. I hope I can finish the updates (in RosAsm) soon and release the next version, but it's a bit hard because I made several changes to the RosAsm code to try to make it faster and more independent of the interface, and accidentally broke a thing or two, especially in the resources editor (I'm trying to fix it right now). Hell, true hell to do.

One question: in your optimization, you return chars in eax instead of bytes. Considering that the new strlenW function can be used with other codepages (Chinese, Russian, Greek strings etc.) and also with others that may use 3 or 4 bytes to represent a character, is returning chars in eax better than returning bytes? What if the user's codepage is UTF8 (or another one that uses more than 2 bytes per char)?
https://www.ibm.com/docs/en/db2-for-zos/11?topic=unicode-utfs

jj2007 · August 18, 2023, 12:14:46 AM

Quote from: guga on August 17, 2023, 11:45:44 PM
One question: in your optimization, you return chars in eax instead of bytes.

Practically all Windows functions require chars, not bytes. For Utf8, you can get chars with uLen().

guga · August 18, 2023, 12:54:26 AM

Quote from: jj2007 on August 18, 2023, 12:14:46 AM
Practically all Windows functions require chars, not bytes. For Utf8, you can get chars with uLen().

Hi JJ. Ok, I'll change to chars, then. One question: in your version, did you need to detect the codepage?

Code Select

Len, wLen, uLen                                get string length
                mov ebx, Len(My$)
                mov eax, Len("123")                        ; Len returns 3 in eax, so Olly will show mov eax, eax
                void Len("123")                                ; put the result into eax (void avoids the mov eax, eax)
                mov eax, wLen(MyUnicode$)            ; wideLen for Unicode strings
                void wLen(wChr$("123"))                ; returns 3 chars in eax
            Let esi="Добро пожаловатьäöü"                ; assign a UTF-8 string
            uPrint "[Добро пожаловатьäöü]"                ; or print it directly
                Print Str$(" is %i characters long (expected: 19)\n", uLen(esi))
Rem        returns length in eax; bytes for ANSI, chars for Unicode; for UTF-8 encoded strings, Len returns bytes, uLen returns chars

I mean detecting whether the sequence of bytes before the null terminator word belongs to UTF8, UTF7, UTF32 etc. Maybe we can make a small update at the end of strlenW to work out whether the string uses 2, 3 or 4 bytes to represent a character.

A single line after the last instruction, "add eax, ecx", may be enough (with a flag), perhaps.

Something like this: if the result must be for UTF7 (3 bytes per char), we divide the result by 3; if UTF32, by 4; and so on. So we would also add another parameter to the function, used as a flag, perhaps.

Code Select

add eax ecx

test Flag UTF7 | je L0> ; Is the string UTF7? If not, go check UTF32. If so, divide the result by 3
    mov edx 0AAAAAAAB
    mul edx
    shr edx 1
    mov eax edx
jmp L2>
L0:
test Flag UTF32 | je L1> ; Is it UTF32? If so, divide by 4
    shr eax 2
jmp L2>
L1:  ; else keep as Unicode (2 bytes per char)
    shr eax 1
L2:

Is it worthwhile extending the algo to handle UTF7 or even UTF32, or is it unnecessary for general usage? I'm asking because I know that UTF7 and UTF32 exist, but I don't know if they are commonly used in Windows (or other systems).

If it won't be useful, I'll do as you said and use only the shr eax 1 at the end to return chars instead of bytes.
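The divide-by-3 step in the snippet above (multiply by 0AAAAAAAB, then shift edx right by 1) is a reciprocal multiplication. A small C sketch of the same arithmetic, assuming 32-bit unsigned input; `div3` and `check_div3` are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned n / 3 without a div instruction: 0xAAAAAAAB is 2^33 / 3
   rounded up, so the quotient is bits 33 and up of the 64-bit product.
   After "mul edx" the high 32 bits are in edx, and "shr edx 1" supplies
   the remaining shift by one. */
uint32_t div3(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xAAAAAAABu;
    uint32_t high = (uint32_t)(product >> 32);  /* edx after mul */
    return high >> 1;                           /* shr edx 1     */
}

/* Compare the shortcut against ordinary division for one value. */
void check_div3(uint32_t n)
{
    assert(div3(n) == n / 3);
}
```

Calling `check_div3` over a range of lengths is a quick way to convince yourself the constant is right before relying on it in assembly.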

jj2007 · August 18, 2023, 01:17:04 AM

Quote from: guga on August 18, 2023, 12:54:26 AM
in your version, did you need to detect the codepage?

Utf8 is basically Ansi with some tricks. For most uses of Instr() or Len(), like parsing etc, the plain Ansi functions are appropriate. Only in rare cases you need to know the chars - that's when uLen jumps in. There is no codepage test in this sense, but you tell it explicitly "treat this string as Utf8".

Quote
Is it worthwhile extending the algo to handle UTF7 or even UTF32, or is it unnecessary for general usage?

I wouldn't invest any efforts. I've never seen these formats in the wild.

guga · August 18, 2023, 01:34:53 AM

Ok, I think I understand now. I'll do as you said, then.

I'll keep strlenW returning bytes, as it already did (since that is more useful for parsing etc.), and create a variant that returns chars (with shr eax 1 after the add eax ecx), named UstrLen, UCharLen or something like that.

jj2007 · August 18, 2023, 01:49:00 AM

Quote from: guga on August 18, 2023, 01:34:53 AM
Ok, I think I understand now. I'll do as you said, then.

I'll keep strlenW returning bytes, as it already did (since that is more useful for parsing etc.), and create a variant that returns chars (with shr eax 1 after the add eax ecx), named UstrLen, UCharLen or something like that.

Attention: Windows expects chars from "W" variants. Many WinAPI string functions have two variants, "A" and "W". Both expect chars, but in the case of the Ansi variants chars=bytes.

I use uLen() to indicate Utf8 chars, which are fewer than the bytes because one char can be composed of one to four bytes.

include \masm32\MasmBasic\MasmBasic.inc
Init
Let esi="Нажмите на эту кнопку" ; click on this button
mov ecx, uLen(esi)
Print Str$("The string is %i bytes long", Len(esi)), Str$(", but has %i chars\n", ecx)
PrintLine "[", uLeft$(esi, 7), "] - correct, 7 chars"
PrintLine "[", Left$(esi, 7), "] - incorrect, 7 bytes"
uMsgBox 0, esi, "Hi", MB_OK

EndOfCode

Output:
The string is 39 bytes long, but has 21 chars
[Нажмите] - correct, 7 chars
[Наж�] - incorrect, 7 bytes
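The gap between bytes and chars in the output above follows from how UTF-8 is laid out: every character has exactly one lead byte, and all its other bytes have the form 10xxxxxx. The C sketch below counts both; it is a generic illustration and makes no claim about how MasmBasic's Len or uLen are implemented. The function names are hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes up to the terminating zero. */
size_t utf8_byte_len(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}

/* Code points: count every byte that is not a continuation byte
   (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_char_len(const char *s)
{
    size_t chars = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

Each Cyrillic letter takes 2 bytes and each space 1, which is consistent with the 39 bytes and 21 chars reported for the sample string, and with 7 bytes covering only three whole letters plus half of the fourth.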

lingo · August 19, 2023, 12:11:46 AM

Hi,
It is my old StrLenW.

Code Select

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Align 16
db 7 dup (90h) 
StrLenW_Lingo	                proc
				movdqu	 xmm1, [eax]
				pxor	 xmm0, xmm0
				pcmpeqw	 xmm1, xmm0
				pmovmskb ecx,  xmm1	
				test	 ecx,  ecx
				jne	 @L_End
				mov	 edx,  eax
				and	 eax,  -16
@@:
				pcmpeqw	 xmm0, [eax+16]
				pcmpeqw	 xmm1, [eax+32]
				por	 xmm1, xmm0		
				add	 eax,  32		
				pmovmskb ecx,  xmm1	
				test	 ecx,  ecx
				jz	 @b
				shl	 ecx,  16	
				sub	 eax,  edx
				pmovmskb edx,  xmm0
				add	 ecx,  edx	
                              ; and  cx,5555h
				bsf	 ecx,  ecx
				lea	 eax,  [eax+ecx-16]
				ret
@L_End:
                              ; and  cx,5555h
				bsf	 eax, ecx
				ret
StrLenW_Lingo	                endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
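The main loop in the listing above checks 32 bytes per iteration and then has to decide which of the two 16-byte halves held the first zero word. It does so by merging the two pmovmskb results into one 32-bit value before a single bsf. A scalar C sketch of that merge, with hypothetical names (`first_hit_in_pair`, `mask_lo`, `mask_hi`):

```c
#include <assert.h>
#include <stdint.h>

/* mask_lo: pmovmskb result for the first 16 bytes of the pair.
   mask_hi: pmovmskb result for the second 16 bytes.
   At least one of them must be non-zero.
   Returns the byte offset of the first zero word within the 32 bytes. */
unsigned first_hit_in_pair(uint32_t mask_lo, uint32_t mask_hi)
{
    uint32_t combined;
    unsigned index = 0;
    assert(mask_lo <= 0xFFFFu && mask_hi <= 0xFFFFu);
    assert((mask_lo | mask_hi) != 0);
    /* The listing ORs the two compare results before pmovmskb, so the
       value shifted into the upper half is (lo | hi); the mask of the
       first half is then added into the lower half. */
    combined = ((mask_lo | mask_hi) << 16) + mask_lo;
    while ((combined & 1u) == 0) {  /* bsf */
        combined >>= 1;
        index++;
    }
    return index;
}
```

If the first half contains a hit, its bits sit in the low 16 bits and win the bit scan; otherwise the scan lands in the upper half and the result is 16 plus the offset within the second half.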

guga · August 19, 2023, 03:46:49 AM

Hi Lingo

It's very similar. I remember it was based on one that JJ and I did a long time ago for strlen, perhaps a variation of yours, I don't remember.

This one I adapted from strlen, and I have now updated it because it had flaws with unaligned strings and UTF16 (Chinese etc.). Feel free to update yours along these lines too:

Code Select

StrLenW         proc near
InputString     = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+InputString]

ZeroNotFound:
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqw xmm0, xmm1
                add     ecx, 16
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short ZeroNotFound
                sub     ecx, [ebp+InputString]
                add     ecx, -16
                and     ax, 101010101010101b
                bsf     ax, ax
                add     eax, ecx
                shr     eax, 1
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLenW         endp

It works for all codepages that use 2 bytes per char.

If you want one that can be used for parsing and returns the length in bytes, you can use this as well:

Code Select

UniStrLen         proc near
InputString     = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+InputString]

ZeroNotFound:
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqw xmm0, xmm1
                add     ecx, 16
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short ZeroNotFound
                sub     ecx, [ebp+InputString]
                add     ecx, -16
                and     ax, 101010101010101b
                bsf     ax, ax
                add     eax, ecx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
UniStrLen         endp

Or this for Ansi:

Code Select

StrLen         proc near
InputString     = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+InputString]

ZeroNotFound:
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqb xmm0, xmm1
                add     ecx, 16
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short ZeroNotFound
                sub     ecx, [ebp+InputString]
                add     ecx, -16
                bsf     ax, ax
                add     eax, ecx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLen         endp

I don't know if your old version had those fixes, but feel free to update it.
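One thing the three routines above have in common: each iteration loads 16 bytes starting at whatever address the caller passed, so the last load can reach up to 15 bytes (14 in the word versions) past the terminator. JJ's earlier remark about VirtualAlloc'ed strings is not spelled out in this thread; one plausible reading, stated here as an assumption, is a string that ends close to the end of an allocated region, where such a load could touch the following page. A minimal C sketch of the check, assuming 4096-byte pages (`ASSUMED_PAGE_SIZE` and `load16_crosses_page` are hypothetical names; real code would query the system for the page size):

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption, not queried from the OS */

/* Returns 1 if a 16-byte load starting at `address` touches two pages,
   0 if all 16 bytes lie within one page. */
int load16_crosses_page(uintptr_t address)
{
    uintptr_t offset_in_page;
    /* The masking below only works for a power-of-two page size. */
    assert((ASSUMED_PAGE_SIZE & (ASSUMED_PAGE_SIZE - 1u)) == 0);
    offset_in_page = address & (ASSUMED_PAGE_SIZE - 1u);
    return offset_in_page > ASSUMED_PAGE_SIZE - 16u;
}
```

A load from an address with page offset 0FF0h still fits in its page, while one from 0FF1h spills into the next.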

The MASM Forum
