Hi Guys
Someone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?
I have one (for ansi) that i use for copy memory (and also fixed len strings) adapted from JJ but it is for fixed lenght, as this:
Proc memcpy_SSE_V3:
Arguments @pDest, @pSource, @Length
Uses esi, edi, ecx, edx, eax
mov edi D@pDest
mov esi D@pSource
; we are copying a memory from 128 to 128 bytes at once
mov ecx D@Length
mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
jz L0> ; The memory size if smaller then 16 bytes long. Jmp over
; No we must compute he remainder, to see how many times we will loop
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
mov edx 0 ; here it is used as an index
L1:
movdqu XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
movdqu X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
dec ecx
lea edx D$edx+2
jnz L1<
test eax eax | jz L4> ; No remainders ? Exit
jmp L9> ; jmp to the remainder computation
L0:
; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
L2:
; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.
test eax eax | jz L4> ; No remainders ? Exit
L9:
lea edi D$edi+edx*8 ; mul edx by 8 to get the pos
lea esi D$esi+edx*8 ; mul edx by 8 to get the pos
L3: movsb | dec eax | jnz L3<
L4:
EndP
I´m trying to do the same, but for null termination string (Ansi and Unicode), but i´m clueless how to make it work :(
Quote from: guga on October 04, 2020, 01:43:38 AMSomeone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?
What about this one? (http://masm32.com/board/index.php?topic=1971.0) Or that endless thread? (http://masm32.com/board/index.php?topic=4067.0)
Quote from: jj2007 on October 04, 2020, 06:56:53 AM
Quote from: guga on October 04, 2020, 01:43:38 AMSomeone has a example (or tested) a SSE2 version of a null terminated string copy function (Both, in Ansi and Unicode) ?
What about this one? (http://masm32.com/board/index.php?topic=1971.0) Or that endless thread? (http://masm32.com/board/index.php?topic=4067.0)
Hi JJ.
Yes, i still have those for memcopy and strlen, but how to make a string copy from those ?
I tried to simply calculate 1st the strlen and with the resultant value use it in memcpy, but it will be slower then doing it on a direct way. Do you have any idea how to merge both algos to create a Ansi (and later a unicode) nulll terminated string copy? (For unaligned strings, preferentially )
For example, a normal Ansi null terminated copy string can be done like this:
; eax = Size of the copied string
Proc CopyStringEx:
Arguments @Input, @Output
Uses esi, edi
mov esi D@Input
mov edi D@Output
.While B$esi <> 0
movsb
.End_While
mov B$edi 0
mov eax edi
sub eax D@Output
EndP
How to do it using SSE2 ?
I thought in load the contents of esi in xmm0, then check if any of the 16 bytes is zero (on a similar way it was done in strlen, perhaps ?). If they are not, copy to the output buffer in edi and do the loop again. And later, when it finds one single byte that contains zero, deals with the remainder data. But..how to do it in SSE2 ? I mean, how to compare if any of the 16 bytes in xmm register, contains, at least 1 zero that represents the null termination byte ?
IIRC I tested that, and it turns out that getting the strlen plus copying was faster than the "integrated" version. However, I used the notoriously fast MasmBasic Len() for that...
6983 cycles for 100 * CRT strlen
5460 cycles for 100 * Masm32 StrLen
20044 cycles for 100 * Windows lstrlen
2060 cycles for 100 * MasmBasic Len
(from the Timer example at RichMasm's File/New Masm source menu)
Perhaps this one can help
http://masm32.com/board/index.php?topic=3715.msg39270#msg39270 (http://masm32.com/board/index.php?topic=3715.msg39270#msg39270)
Quote from: guga on October 04, 2020, 08:45:36 AMI tried to simply calculate 1st the strlen and with the resultant value use it in memcpy, but it will be slower then doing it on a direct way.
What makes you think that it will be slower, did you time it?
What are we talking about, small (<100 bytes) or big (>1MB) strings?
For small strings, I bet that the
calculate Len() first then copy variant will be faster.
For big strings, I can't imagine a case where you don't know its length...
Hi JJ. No i didn´t timed yet. I was just thinking that could be slow because we will need to use 2 functions rather then one.
Quote
What are we talking about, small (<100 bytes) or big (>1MB) strings?
For small strings, I bet that the calculate Len() first then copy variant will be faster.
For big strings, I can't imagine a case where you don't know its length...
I agree that it could be very rare using strings with 1 Mb, but we can have a situaton on a file that uses strcpy lots of time. (In RosAsm, for example, i use strcpy frequently, for the assembler and disassmbler). On the disassmbler side we don´t know the size of strings that are beeing passed, that´s why i thought in a variation of a strcpy using one single function rather then embedding 2.
I´ll give a try using both functions as you suggested and try to benchmark the result to see how fast it could be :)
Ok, i made a small variation merging both functions. I´ll do the tests to see if it is faster then the normal movsb way
Proc strcpy_SSE:
Arguments @pDest, @pSource
Local @StrSize
Uses ecx, edx, esi, edi
; 1st we compute the lenght of the string. Guga/JJ strlen/memcpy
mov edi D@pDest
mov esi D@pSource
mov eax D@pSource
mov ecx eax
and eax 0-16
and ecx 16-1
or edx 0-1
shl edx cl
xorps xmm0 xmm0
pcmpeqb xmm0 X$eax
add eax 16
pmovmskb ecx xmm0
xorps xmm0 xmm0
and ecx edx
jnz L2>
L1:
movups xmm1 X$eax
pcmpeqb xmm1 xmm0
pmovmskb ecx xmm1
add eax 16
test ecx ecx
jz L1<
L2:
bsf ecx ecx
lea eax D$eax+ecx-16
sub eax D@pSource | jle L5>> ; If smaller or equal to zero, exit
mov D@StrSize eax ; save the size of the string
; 2nd. Now we are ready to copy the data
; we are copying a memory from 128 to 128 bytes at once
mov ecx eax
mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
jz L0> ; The memory size if smaller then 16 bytes long. Jmp over
; No we must compute he remainder, to see how many times we will loop
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
mov edx 0 ; here it is used as an index
L1:
movdqu XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
movdqu X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
dec ecx
lea edx D$edx+2
jnz L1<
test eax eax | jz L4> ; No remainders ? Exit
jmp L9> ; jmp to the remainder computation
L0:
; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
L2:
; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.
test eax eax | jz L4> ; No remainders ? Exit
L9:
lea edi D$edi+edx*8 ; mul edx by 8 to get the pos
lea esi D$esi+edx*8 ; mul edx by 8 to get the pos
L3: movsb | dec eax | jnz L3<
L4:
mov eax D@StrSize ; return the size of the string in eax
L5:
EndP
Ok, i did the tests.
For small strings chunks (smaller then 16 bytes), there´s no significant difference between the old movsb and the SSE version. In fact, the old movsb is a bit faster
But, for big strings chunks (bigger or equal to 16 bytes) the difference is huge. The SSE variation is about 8 times faster
Small String to copy: "Hello world."
Time on SSE version: 45.09486968871570 clocks
Time on movsb version: 36.25383294288191 clocks
Bigger strings to copy: "Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.Hello world.", 0
Time on SSE version: 63.40321596420246 clocks
Time on movsb version: 442.87924519310099 clocks
Versions:
SSE2
Proc strcpy_SSE:
Arguments @pDest, @pSource
Local @StrSize, @EaxPreserve
Uses ecx, edx, esi, edi
; 1st we compute the lenght of the string. Guga/JJ strlen/memcpy
mov edi D@pDest
mov esi D@pSource
mov D@EaxPreserve eax
mov eax esi
mov ecx eax
and eax 0-16
and ecx 16-1
or edx 0-1
shl edx cl
xorps xmm0 xmm0
pcmpeqb xmm0 X$eax
add eax 16
pmovmskb ecx xmm0
xorps xmm0 xmm0
and ecx edx
jnz L2>
L1:
movups xmm1 X$eax
pcmpeqb xmm1 xmm0
pmovmskb ecx xmm1
add eax 16
test ecx ecx
jz L1<
L2:
bsf ecx ecx
lea eax D$eax+ecx-16
sub eax D@pSource | jle L5>> ; If smaller or equal to zero, exit
mov D@StrSize eax ; save the size of the string
; 2nd. Now we are ready to copy the data
; we are copying a memory from 128 to 128 bytes at once
mov ecx eax
mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)
jz L0> ; The memory size if smaller then 16 bytes long. Jmp over
; No we must compute he remainder, to see how many times we will loop
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
mov edx 0 ; here it is used as an index
L1:
movdqu XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM
movdqu X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi
dec ecx
lea edx D$edx+2
jnz L1<
test eax eax | jz L4> ; No remainders ? Exit
jmp L9> ; jmp to the remainder computation
L0:
; If we are here, It means that the data is smaller then 16 bytes, and we ned to compute the remainder.
mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only have be 0 to 15 remainders bytes
L2:
; If the memory is not 4 dword aligned we may have some remainder here So, just clean them.
test eax eax | jz L4> ; No remainders ? Exit
L9:
lea edi D$edi+edx*8 ; mul edx by 8 to get the pos
lea esi D$esi+edx*8 ; mul edx by 8 to get the pos
L3: movsb | dec eax | jnz L3<
L4:
mov edi D@pDest
mov eax D@StrSize ; return the size of the string in eax
mov B$edi+eax 0 ; Make sure it will end with a null termination byte
L5:
mov eax D@EaxPreserve
EndP
Normal Version: (On the same conditions as the SSe version. I.e: Returning the lenght in eax too.
Proc strcpy:
Arguments @OutBound, @InBound
Uses ecx, esi, edi
xor ecx ecx
mov esi D@InBound
Do
inc ecx
inc esi
Loop_Until B$esi = 0
mov esi D@InBound | mov edi D@OutBound | rep movsb
mov B$edi 0 ; Make sure it will end with a null termination byte
mov eax edi
sub eax edi ; retuurn string lenght in eax
EndP
Many tks for the tip, JJ. I´ll stay with the SSE variation of strcpy :thumbsup: :thumbsup: :thumbsup: :thumbsup: