### Author Topic: Fast String Copy Algo  (Read 1092 times)

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Fast String Copy Algo
« on: October 04, 2020, 01:43:38 AM »
Hi Guys

Someone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?

I have one (for ansi) that i use for copy memory (and also fixed len strings) adapted from JJ but it is for fixed lenght, as this:

Code: [Select]
`Proc memcpy_SSE_V3:    Arguments @pDest, @pSource, @Length    Uses esi, edi, ecx, edx, eax    mov edi D@pDest    mov esi D@pSource    ; we are copying a memory from 128 to 128 bytes at once    mov ecx D@Length    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)    jz L0> ; The memory size if smaller then 16 bytes long. Jmp over        ; No we must compute he remainder, to see how many times we will loop        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes        mov edx 0 ; here it is used as an index        L1:            movdqu XMM1 X\$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM            movdqu X\$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi            dec ecx            lea edx D\$edx+2            jnz L1<        test eax eax | jz L4> ; No remainders ? Exit        jmp L9> ; jmp to the remainder computationL0:   ; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytesL2:    ; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.    test eax eax | jz L4>  ; No remainders ? ExitL9:        lea edi D\$edi+edx*8 ; mul edx by 8 to get the pos        lea esi D\$esi+edx*8 ; mul edx by 8 to get the posL3:  movsb | dec eax | jnz L3<L4:EndP`

I´m trying to do the same, but for null termination string (Ansi and Unicode), but i´m clueless how to make it work :(
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### jj2007

• Member
• Posts: 11440
• Assembler is fun ;-)
##### Re: Fast String Copy Algo
« Reply #1 on: October 04, 2020, 06:56:53 AM »
Someone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Fast String Copy Algo
« Reply #2 on: October 04, 2020, 08:45:36 AM »
Someone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?

Hi JJ.

Yes, i still have those for memcopy and strlen,  but how to make a string copy from those ?

I tried to simply calculate 1st the strlen and with the resultant value use it in memcpy, but it will be slower then doing it on a direct way.  Do you have any idea how to merge both algos to create a Ansi (and later a unicode) nulll terminated string copy? (For unaligned strings, preferentially )

For example, a normal Ansi null terminated copy string can be done like this:

Code: [Select]
`; eax = Size of the copied stringProc CopyStringEx:    Arguments @Input, @Output    Uses esi, edi    mov esi D@Input    mov edi D@Output    .While B\$esi <> 0        movsb    .End_While    mov B\$edi 0    mov eax edi    sub eax D@OutputEndP`
How to do it using SSE2 ?

I thought in load the contents of esi in xmm0, then check if any of the 16 bytes is zero (on a similar way it was done in strlen, perhaps ?). If they are not, copy to the output buffer in edi and do the loop again. And later, when it finds one single byte that contains zero, deals with the remainder data. But..how to do it in SSE2 ? I mean, how to compare if any of the 16 bytes in xmm register, contains, at least 1 zero that represents the null termination byte ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### jj2007

• Member
• Posts: 11440
• Assembler is fun ;-)
##### Re: Fast String Copy Algo
« Reply #3 on: October 04, 2020, 02:30:40 PM »
IIRC I tested that, and it turns out that getting the strlen plus copying was faster than the "integrated" version. However, I used the notoriously fast MasmBasic Len() for that...

Code: [Select]
`6983    cycles for 100 * CRT strlen5460    cycles for 100 * Masm32 StrLen20044   cycles for 100 * Windows lstrlen2060    cycles for 100 * MasmBasic Len`
(from the Timer example at RichMasm's File/New Masm source menu)

#### TouEnMasm

• Member
• Posts: 1650
##### Re: Fast String Copy Algo
« Reply #4 on: October 04, 2020, 06:32:41 PM »
Fa is a musical note to play with CL

#### jj2007

• Member
• Posts: 11440
• Assembler is fun ;-)
##### Re: Fast String Copy Algo
« Reply #5 on: October 04, 2020, 07:21:05 PM »
I tried to simply calculate 1st the strlen and with the resultant value use it in memcpy, but it will be slower then doing it on a direct way.

What makes you think that it will be slower, did you time it?

What are we talking about, small (<100 bytes) or big (>1MB) strings?

For small strings, I bet that the calculate Len() first then copy variant will be faster.
For big strings, I can't imagine a case where you don't know its length...

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Fast String Copy Algo
« Reply #6 on: October 04, 2020, 07:38:24 PM »
Hi JJ. No i didn´t timed yet. I was just thinking that could be slow because we will need to use 2 functions rather then one.

Quote
What are we talking about, small (<100 bytes) or big (>1MB) strings?

For small strings, I bet that the calculate Len() first then copy variant will be faster.
For big strings, I can't imagine a case where you don't know its length...

I agree that it could be very rare using strings with 1 Mb, but we can have a situaton on a file that uses strcpy lots of time. (In RosAsm, for example,  i use strcpy frequently, for the assembler and disassmbler). On the disassmbler side we don´t know the size of strings that are beeing passed, that´s why i thought in a variation of a strcpy using one single function rather then embedding 2.

I´ll give a try using both functions as you suggested and try to benchmark the result to see how fast it could be :)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Fast String Copy Algo
« Reply #7 on: October 04, 2020, 09:03:29 PM »
Ok, i made a small variation merging both functions. I´ll do the tests to see if it is faster then the normal movsb way

Code: [Select]
`Proc strcpy_SSE:    Arguments @pDest, @pSource    Local @StrSize    Uses ecx, edx, esi, edi    ; 1st we compute the lenght of the string. Guga/JJ strlen/memcpy    mov edi D@pDest    mov esi D@pSource    mov eax D@pSource    mov ecx eax    and eax 0-16    and ecx 16-1    or edx 0-1    shl edx cl    xorps xmm0 xmm0    pcmpeqb xmm0 X\$eax    add eax 16    pmovmskb ecx xmm0    xorps xmm0 xmm0    and ecx edx    jnz L2>L1:    movups xmm1 X\$eax    pcmpeqb xmm1 xmm0    pmovmskb ecx xmm1    add eax 16    test ecx ecx    jz L1<L2:    bsf ecx ecx    lea eax D\$eax+ecx-16    sub eax D@pSource | jle L5>> ; If smaller or equal to zero, exit    mov D@StrSize eax ; save the size of the string    ; 2nd. Now we are ready to copy the data    ; we are copying a memory from 128 to 128 bytes at once    mov ecx eax    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)    jz L0> ; The memory size if smaller then 16 bytes long. Jmp over        ; No we must compute he remainder, to see how many times we will loop        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes        mov edx 0 ; here it is used as an index        L1:            movdqu XMM1 X\$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM            movdqu X\$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi            dec ecx            lea edx D\$edx+2            jnz L1<        test eax eax | jz L4> ; No remainders ? Exit        jmp L9> ; jmp to the remainder computationL0:   ; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytesL2:    ; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.    test eax eax | jz L4>  ; No remainders ? ExitL9:        lea edi D\$edi+edx*8 ; mul edx by 8 to get the pos        lea esi D\$esi+edx*8 ; mul edx by 8 to get the posL3:  movsb | dec eax | jnz L3<L4:    mov eax D@StrSize ; return the size of the string in eaxL5:EndP`
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Fast String Copy Algo
« Reply #8 on: October 04, 2020, 09:39:44 PM »
Ok, i did the tests.

For small strings chunks (smaller then 16 bytes), there´s no significant difference between the old movsb and the SSE version. In fact, the old movsb is a bit faster
But, for big strings chunks (bigger or equal to 16 bytes) the difference is huge. The SSE variation is about 8 times faster

Small String to copy: "Hello world."
Time on SSE version: 45.09486968871570 clocks
Time on movsb version:  36.25383294288191 clocks

Bigger strings to copy: "Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.", 0
Time on SSE version: 63.40321596420246 clocks
Time on movsb version:  442.87924519310099 clocks

Versions:

SSE2

Code: [Select]
`Proc strcpy_SSE:    Arguments @pDest, @pSource    Local @StrSize, @EaxPreserve    Uses ecx, edx, esi, edi    ; 1st we compute the lenght of the string. Guga/JJ strlen/memcpy    mov edi D@pDest    mov esi D@pSource    mov D@EaxPreserve eax    mov eax esi    mov ecx eax    and eax 0-16    and ecx 16-1    or edx 0-1    shl edx cl    xorps xmm0 xmm0    pcmpeqb xmm0 X\$eax    add eax 16    pmovmskb ecx xmm0    xorps xmm0 xmm0    and ecx edx    jnz L2>L1:    movups xmm1 X\$eax    pcmpeqb xmm1 xmm0    pmovmskb ecx xmm1    add eax 16    test ecx ecx    jz L1<L2:    bsf ecx ecx    lea eax D\$eax+ecx-16    sub eax D@pSource | jle L5>> ; If smaller or equal to zero, exit    mov D@StrSize eax ; save the size of the string    ; 2nd. Now we are ready to copy the data    ; we are copying a memory from 128 to 128 bytes at once    mov ecx eax    mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)    jz L0> ; The memory size if smaller then 16 bytes long. Jmp over        ; No we must compute he remainder, to see how many times we will loop        mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes        mov edx 0 ; here it is used as an index        L1:            movdqu XMM1 X\$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM            movdqu X\$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi            dec ecx            lea edx D\$edx+2            jnz L1<        test eax eax | jz L4> ; No remainders ? Exit        jmp L9> ; jmp to the remainder computationL0:   ; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.   mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytesL2:    ; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.    test eax eax | jz L4>  ; No remainders ? ExitL9:        lea edi D\$edi+edx*8 ; mul edx by 8 to get the pos        lea esi D\$esi+edx*8 ; mul edx by 8 to get the posL3:  movsb | dec eax | jnz L3<L4:    mov edi D@pDest    mov eax D@StrSize ; return the size of the string in eax    mov B\$edi+eax 0 ; Make sure it will end with a null termination byteL5:    mov eax D@EaxPreserveEndP`
Normal Version: (On the same conditions as the SSe version. I.e: Returning the lenght in eax too.

Code: [Select]
`Proc strcpy:    Arguments @OutBound, @InBound    Uses ecx, esi, edi    xor ecx ecx    mov esi D@InBound    Do        inc ecx        inc esi    Loop_Until B\$esi = 0    mov esi D@InBound | mov edi D@OutBound | rep movsb    mov B\$edi 0 ; Make sure it will end with a null termination byte    mov eax edi    sub eax edi ; retuurn string lenght in eaxEndP`

Many tks for the tip, JJ. I´ll stay with the SSE variation of strcpy
« Last Edit: October 05, 2020, 12:36:08 AM by guga »
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com