Fast String Copy Algo

guga · October 04, 2020, 01:43:38 AM

Hi Guys

Someone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?

I have one (for ansi) that i use for copy memory (and also fixed len strings) adapted from JJ but it is for fixed lenght, as this:

Code Select



Proc memcpy_SSE_V3:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx, edx, eax

    mov edi D@pDest
    mov esi D@pSource
    ; we are copying a memory from 128 to 128 bytes at once
    mov ecx D@Length
    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
    jz L0> ; The memory size if smaller then 16 bytes long. Jmp over

        ; No we must compute he remainder, to see how many times we will loop
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
        mov edx 0 ; here it is used as an index
        L1:
            movdqu XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
            movdqu X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            dec ecx
            lea edx D$edx+2
            jnz L1<
        test eax eax | jz L4> ; No remainders ? Exit
        jmp L9> ; jmp to the remainder computation

L0:
   ; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes

L2:

    ; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.
    test eax eax | jz L4>  ; No remainders ? Exit
L9:
        lea edi D$edi+edx*8 ; mul edx by 8 to get the pos
        lea esi D$esi+edx*8 ; mul edx by 8 to get the pos

L3:  movsb | dec eax | jnz L3<

L4:

EndP

I´m trying to do the same, but for null termination string (Ansi and Unicode), but i´m clueless how to make it work :(

jj2007 · October 04, 2020, 06:56:53 AM

Quote from: guga on October 04, 2020, 01:43:38 AMSomeone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?

What about this one? Or that endless thread?

guga · October 04, 2020, 08:45:36 AM

Quote from: jj2007 on October 04, 2020, 06:56:53 AM
Quote from: guga on October 04, 2020, 01:43:38 AMSomeone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?

What about this one? Or that endless thread?

Hi JJ.

Yes, i still have those for memcopy and strlen, but how to make a string copy from those ?

I tried to simply calculate 1st the strlen and with the resultant value use it in memcpy, but it will be slower then doing it on a direct way. Do you have any idea how to merge both algos to create a Ansi (and later a unicode) nulll terminated string copy? (For unaligned strings, preferentially )

For example, a normal Ansi null terminated copy string can be done like this:

Code Select


; eax = Size of the copied string
Proc CopyStringEx:
    Arguments @Input, @Output
    Uses esi, edi

    mov esi D@Input
    mov edi D@Output
    .While B$esi <> 0
        movsb
    .End_While
    mov B$edi 0

    mov eax edi
    sub eax D@Output

EndP

How to do it using SSE2 ?

I thought in load the contents of esi in xmm0, then check if any of the 16 bytes is zero (on a similar way it was done in strlen, perhaps ?). If they are not, copy to the output buffer in edi and do the loop again. And later, when it finds one single byte that contains zero, deals with the remainder data. But..how to do it in SSE2 ? I mean, how to compare if any of the 16 bytes in xmm register, contains, at least 1 zero that represents the null termination byte ?

jj2007 · October 04, 2020, 02:30:40 PM

IIRC I tested that, and it turns out that getting the strlen plus copying was faster than the "integrated" version. However, I used the notoriously fast MasmBasic Len() for that...

Code Select

6983    cycles for 100 * CRT strlen
5460    cycles for 100 * Masm32 StrLen
20044   cycles for 100 * Windows lstrlen
2060    cycles for 100 * MasmBasic Len

(from the Timer example at RichMasm's File/New Masm source menu)

TouEnMasm · October 04, 2020, 06:32:41 PM

Perhaps this one can help
http://masm32.com/board/index.php?topic=3715.msg39270#msg39270

jj2007 · October 04, 2020, 07:21:05 PM

Quote from: guga on October 04, 2020, 08:45:36 AMI tried to simply calculate 1st the strlen and with the resultant value use it in memcpy, but it will be slower then doing it on a direct way.

What makes you think that it will be slower, did you time it?

What are we talking about, small (<100 bytes) or big (>1MB) strings?

For small strings, I bet that the calculate Len() first then copy variant will be faster.
For big strings, I can't imagine a case where you don't know its length...

guga · October 04, 2020, 07:38:24 PM

Hi JJ. No i didn´t timed yet. I was just thinking that could be slow because we will need to use 2 functions rather then one.

Quote
What are we talking about, small (<100 bytes) or big (>1MB) strings?

For small strings, I bet that the calculate Len() first then copy variant will be faster.
For big strings, I can't imagine a case where you don't know its length...

I agree that it could be very rare using strings with 1 Mb, but we can have a situaton on a file that uses strcpy lots of time. (In RosAsm, for example, i use strcpy frequently, for the assembler and disassmbler). On the disassmbler side we don´t know the size of strings that are beeing passed, that´s why i thought in a variation of a strcpy using one single function rather then embedding 2.

I´ll give a try using both functions as you suggested and try to benchmark the result to see how fast it could be :)

guga · October 04, 2020, 09:03:29 PM

Ok, i made a small variation merging both functions. I´ll do the tests to see if it is faster then the normal movsb way

Code Select


Proc strcpy_SSE:
    Arguments @pDest, @pSource
    Local @StrSize
    Uses ecx, edx, esi, edi

    ; 1st we compute the lenght of the string. Guga/JJ strlen/memcpy

    mov edi D@pDest
    mov esi D@pSource

    mov eax D@pSource
    mov ecx eax
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@pSource | jle L5>> ; If smaller or equal to zero, exit
    mov D@StrSize eax ; save the size of the string

    ; 2nd. Now we are ready to copy the data
    ; we are copying a memory from 128 to 128 bytes at once
    mov ecx eax
    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
    jz L0> ; The memory size if smaller then 16 bytes long. Jmp over

        ; No we must compute he remainder, to see how many times we will loop
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
        mov edx 0 ; here it is used as an index
        L1:
            movdqu XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
            movdqu X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            dec ecx
            lea edx D$edx+2
            jnz L1<
        test eax eax | jz L4> ; No remainders ? Exit
        jmp L9> ; jmp to the remainder computation

L0:
   ; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes

L2:

    ; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.
    test eax eax | jz L4>  ; No remainders ? Exit
L9:
        lea edi D$edi+edx*8 ; mul edx by 8 to get the pos
        lea esi D$esi+edx*8 ; mul edx by 8 to get the pos

L3:  movsb | dec eax | jnz L3<

L4:

    mov eax D@StrSize ; return the size of the string in eax

L5:

EndP

guga · October 04, 2020, 09:39:44 PM

Ok, i did the tests.

For small strings chunks (smaller then 16 bytes), there´s no significant difference between the old movsb and the SSE version. In fact, the old movsb is a bit faster
But, for big strings chunks (bigger or equal to 16 bytes) the difference is huge. The SSE variation is about 8 times faster

Small String to copy: "Hello world."
Time on SSE version: 45.09486968871570 clocks
Time on movsb version: 36.25383294288191 clocks

Bigger strings to copy: "Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.", 0
Time on SSE version: 63.40321596420246 clocks
Time on movsb version: 442.87924519310099 clocks

Versions:

SSE2

Code Select


Proc strcpy_SSE:
    Arguments @pDest, @pSource
    Local @StrSize, @EaxPreserve
    Uses ecx, edx, esi, edi

    ; 1st we compute the lenght of the string. Guga/JJ strlen/memcpy

    mov edi D@pDest
    mov esi D@pSource
    mov D@EaxPreserve eax

    mov eax esi
    mov ecx eax
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@pSource | jle L5>> ; If smaller or equal to zero, exit
    mov D@StrSize eax ; save the size of the string

    ; 2nd. Now we are ready to copy the data
    ; we are copying a memory from 128 to 128 bytes at once
    mov ecx eax
    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
    jz L0> ; The memory size if smaller then 16 bytes long. Jmp over

        ; No we must compute he remainder, to see how many times we will loop
        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
        mov edx 0 ; here it is used as an index
        L1:
            movdqu XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
            movdqu X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
            dec ecx
            lea edx D$edx+2
            jnz L1<
        test eax eax | jz L4> ; No remainders ? Exit
        jmp L9> ; jmp to the remainder computation

L0:
   ; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.
   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes

L2:

    ; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.
    test eax eax | jz L4>  ; No remainders ? Exit
L9:
        lea edi D$edi+edx*8 ; mul edx by 8 to get the pos
        lea esi D$esi+edx*8 ; mul edx by 8 to get the pos

L3:  movsb | dec eax | jnz L3<

L4:

    mov edi D@pDest
    mov eax D@StrSize ; return the size of the string in eax
    mov B$edi+eax 0 ; Make sure it will end with a null termination byte

L5:
    mov eax D@EaxPreserve

EndP

Normal Version: (On the same conditions as the SSe version. I.e: Returning the lenght in eax too.

Code Select

Proc strcpy:
    Arguments @OutBound, @InBound
    Uses ecx, esi, edi

    xor ecx ecx
    mov esi D@InBound

    Do
        inc ecx
        inc esi
    Loop_Until B$esi = 0

    mov esi D@InBound | mov edi D@OutBound | rep movsb

    mov B$edi 0 ; Make sure it will end with a null termination byte
    mov eax edi
    sub eax edi ; retuurn string lenght in eax

EndP

Many tks for the tip, JJ. I´ll stay with the SSE variation of strcpy

The MASM Forum

News:

Fast String Copy Algo

guga

jj2007

guga

jj2007

TouEnMasm

jj2007

guga

guga

guga