My previous thread, "C/C++ vs Assembler", grew very long and drifted off topic, so I am starting a new one.
I never believed that the Win API was slower than the CRT or other libraries.
My previous thread gave me the chance to check that.
So I decided to test some other functions as well.
I wrote a simple function named strCopy.
Here is the code:
strCopy proc pDest:DWORD, pSource:DWORD
mov ecx, pSource
mov edx, pDest
copyLoop:
mov al, byte ptr [ecx]
inc ecx
mov byte ptr [edx],al
inc edx
cmp al, 0
jnz copyLoop
mov al, byte ptr 0
ret
strCopy endp
The function above takes advantage of the processor's two pipelines.
I tested it against szCopy from Hutch's library, the lstrcpy WinAPI function and the CRT strcpy.
Here are my test source and the average results:
.data
szText db "abcdefghijklmnopqrstuvwxyz123456789", 0
szBuffer db 64 dup (0) ; 64 zero bytes (MASM syntax is count dup (value))
includelib \MSVCRT.LIB
strcpy PROTO C :DWORD, :DWORD
strCopy PROTO pDest:DWORD, pSource:DWORD
;......................................
LOCAL dwTime :DWORD
invoke GetTickCount
mov dwTime, eax
push esi
xor esi, esi
TestLoop:
invoke lstrcpy, addr szBuffer, addr szText
; invoke strcpy, addr szBuffer, addr szText
; invoke szCopy, addr szText, addr szBuffer
; invoke strCopy, addr szBuffer, addr szText
inc esi
cmp esi, 10000000
jb TestLoop
pop esi
invoke GetTickCount
sub eax, dwTime
PrintDec eax
Results (average):
lstrcpy (API) Ticks 570
szCopy (Hutch lib) Ticks 515
strCopy (mine) Ticks 359
strcpy (crt) Ticks 172
Manos.
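For scale, the tick counts above can be turned into a per-call cost. This is a minimal C sketch, assuming GetTickCount's millisecond units and the 10,000,000 iterations of the test loop. The helper name ns_per_call is made up for illustration and is not part of the test program. The figure includes loop and call overhead, and GetTickCount typically advances in steps of 10 to 16 ms, so differences of a few ticks mean little.

```c
/* Hypothetical helper (not from the thread): convert a GetTickCount
   delta, in milliseconds, into the average nanoseconds per call. */
double ns_per_call(unsigned long ticks_ms, unsigned long iterations)
{
    /* 1 ms = 1,000,000 ns */
    return (double)ticks_ms * 1e6 / (double)iterations;
}
```

With the averages above, 570 ticks is about 57 ns per lstrcpy call and 172 ticks is about 17 ns per strcpy call, for a 35-character string.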
Nothing beats rep movsd. If you need inspiration, try this (http://www.masmforum.com/board/index.php?topic=1589.0) and this thread (http://www.masmforum.com/board/index.php?topic=10830.0).
Or, even better, the Code location sensitivity of timings (http://www.masmforum.com/board/index.php?topic=11454.0) thread, but beware, it's advanced stuff. Unfortunately the thread no longer has the attachments, but here they are below.
Quote from: jj2007 on May 07, 2013, 02:03:03 AM
Nothing beats rep movsd.
Jochen, I don't get that on my P4 :icon14:
Well, I tested rep movsb
Quote from: RuiLoureiro on May 07, 2013, 02:30:03 AM
Jochen, I don't get that on my P4 :icon14:
No 5, MemCoP4 is best for you ;-)
Manos,
Here is a quick optimisation for your original algo, 1 less instruction per iteration and unrolled by 2.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
StrCpy2 PROTO src:DWORD,dst:DWORD
strCopy PROTO pDest:DWORD, pSource:DWORD
.data
align 4
item db "The game is done, I've won I've won quote she and whistled thrice",0
align 4
buff db 128 dup (0) ; destination buffer, large enough to hold item
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
push esi
REPEAT 8
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke strCopy, ADDR buff,ADDR item
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Original",13,10
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke StrCpy2, ADDR item,ADDR buff
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Modified",13,10
ENDM
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
strCopy proc pDest:DWORD, pSource:DWORD
mov ecx, pSource
mov edx, pDest
copyLoop:
mov al, byte ptr [ecx]
inc ecx
mov byte ptr [edx],al
inc edx
cmp al, 0
jnz copyLoop
mov al, byte ptr 0
ret
strCopy endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
StrCpy2 proc src:DWORD,dst:DWORD
mov ecx, [esp+4] ; src
mov edx, [esp+8] ; dst
push esi
mov esi, -1
@@:
add esi, 1
movzx eax, BYTE PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
test eax, eax
jz @F
add esi, 1
movzx eax, BYTE PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
test eax, eax
jnz @B
@@:
pop esi
ret 8
StrCpy2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Timing on my Core2 Quad.
531 Original
344 Modified
531 Original
344 Modified
516 Original
359 Modified
516 Original
343 Modified
532 Original
343 Modified
532 Original
343 Modified
516 Original
359 Modified
516 Original
344 Modified
Press any key to continue ...
Pentium 4, 3.2 GHz:
1000 Original
844 Modified
984 Original
844 Modified
984 Original
844 Modified
969 Original
859 Modified
969 Original
843 Modified
985 Original
844 Modified
984 Original
844 Modified
984 Original
844 Modified
Press any key to continue ...
Steve,
your function is much faster than mine.
Results (average):
strCopy (mine) 360 ticks
StrCpy2 (yours) 270 ticks
This is because you avoided the separate pointer increments and repeated the loop body twice in the function.
I wrote a new one, named strCopyNew, using the same trick, and its results are identical to those of your function.
strCopyNew proc pDest:DWORD, pSource:DWORD
mov ecx, pSource
mov edx, pDest
push esi
mov esi, -1
@@:
add esi, 1
movzx eax, byte ptr [ecx + esi]
mov byte ptr [edx + esi], al
test eax, eax
jz @F
add esi, 1
movzx eax, byte ptr [ecx + esi]
mov byte ptr [edx + esi], al
test eax, eax
jnz @B
@@:
pop esi
ret
strCopyNew endp
Results, (average):
strCopyNew 270 ticks
Manos.
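For readers who do not read MASM, the unroll-by-two idea can be written in C. This is only a sketch of the technique: one index shared by source and destination, and two byte copies per loop pass. The name copy_unrolled2 and the returned length are additions for illustration. The assembler versions return nothing useful, and a compiler may generate quite different code from this.

```c
#include <stddef.h>

/* Sketch of the unroll-by-two byte copy. A single index i addresses
   both buffers, so only one counter is updated per byte.
   Returns the length of the copied string (an addition for testing). */
size_t copy_unrolled2(char *dst, const char *src)
{
    size_t i = 0;
    for (;;) {
        char c = src[i];            /* first copy of the loop body */
        dst[i] = c;
        if (c == '\0') return i;
        ++i;

        c = src[i];                 /* second copy of the loop body */
        dst[i] = c;
        if (c == '\0') return i;
        ++i;
    }
}
```

As in the assembler routines, the caller must make sure the destination is large enough for the string and its terminator.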
Quote from: hutch-- on May 08, 2013, 10:53:51 PM
Manos,
Here is a quick optimisation for your original algo, 1 less instruction per iteration and unrolled by 2.
Hutch, here is another optimisation: one less instruction per iteration.
Results on my P4
Quote
1016 Original
844 Modified
781 Modified again
969 Original
844 Modified
796 Modified again
985 Original
844 Modified
765 Modified again
1000 Original
844 Modified
781 Modified again
985 Original
859 Modified
781 Modified again
1000 Original
828 Modified
782 Modified again
984 Original
859 Modified
766 Modified again
984 Original
844 Modified
781 Modified again
Press any key to continue ...
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
StrCpy3 PROTO src:DWORD,dst:DWORD
StrCpy2 PROTO src:DWORD,dst:DWORD
strCopy PROTO pDest:DWORD, pSource:DWORD
.data
align 4
item db "The game is done, I've won I've won quote she and whistled thrice",0
align 4
buff db 128 dup (0) ; destination buffer, large enough to hold item
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
push esi
REPEAT 8
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke strCopy, ADDR buff,ADDR item
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Original",13,10
;----------------------------------------
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke StrCpy2, ADDR item,ADDR buff
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Modified",13,10
;----------------------------------------
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke StrCpy3, ADDR item,ADDR buff
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Modified again",13,10
ENDM
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
strCopy proc pDest:DWORD, pSource:DWORD
mov ecx, pSource
mov edx, pDest
copyLoop:
mov al, byte ptr [ecx]
inc ecx
mov byte ptr [edx],al
inc edx
cmp al, 0
jnz copyLoop
mov al, byte ptr 0
ret
strCopy endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
StrCpy2 proc src:DWORD,dst:DWORD
mov ecx, [esp+4] ; src
mov edx, [esp+8] ; dst
push esi
mov esi, -1
@@:
add esi, 1
movzx eax, BYTE PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
test eax, eax
jz @F
add esi, 1
movzx eax, BYTE PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
test eax, eax
jnz @B
@@:
pop esi
ret 8
StrCpy2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
StrCpy3 proc src:DWORD,dst:DWORD
mov ecx, [esp+4] ; src
mov edx, [esp+8] ; dst
push esi
mov esi, -1
@@:
add esi, 1
movzx eax, WORD PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
or al, al
jz @F
add esi, 1
;movzx eax, BYTE PTR [ecx+esi]
mov BYTE PTR [edx+esi], ah
or ah, ah
jnz @B
@@:
pop esi
ret 8
StrCpy3 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Here is another optimisation:
Quote
1015 Original
829 Modified
750 Modified again
671 Modified again 2
954 Original
828 Modified
750 Modified again
672 Modified again 2
953 Original
812 Modified
766 Modified again
703 Modified again 2
937 Original
829 Modified
750 Modified again
703 Modified again 2
937 Original
813 Modified
750 Modified again
672 Modified again 2
937 Original
828 Modified
750 Modified again
656 Modified again 2
938 Original
812 Modified
750 Modified again
656 Modified again 2
969 Original
813 Modified
765 Modified again
656 Modified again 2
Press any key to continue ...
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
StrCpy4 PROTO src:DWORD,dst:DWORD
StrCpy3 PROTO src:DWORD,dst:DWORD
StrCpy2 PROTO src:DWORD,dst:DWORD
strCopy PROTO pDest:DWORD, pSource:DWORD
.data
align 4
item db "The game is done, I've won I've won quote she and whistled thrice",0
align 4
buff db 128 dup (0) ; destination buffer, large enough to hold item
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
push esi
invoke StrCpy3, ADDR item,ADDR buff
print addr buff,13,10
invoke StrCpy4, ADDR item,ADDR buff
print addr buff,13,10
REPEAT 8
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke strCopy, ADDR buff,ADDR item
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Original",13,10
;----------------------------------------
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke StrCpy2, ADDR item,ADDR buff
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Modified",13,10
;----------------------------------------
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke StrCpy3, ADDR item,ADDR buff
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Modified again",13,10
;----------------------------------------
invoke GetTickCount
push eax
mov esi, 10000000
@@:
invoke StrCpy4, ADDR item,ADDR buff
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print ustr$(eax)," Modified again 2",13,10
ENDM
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
strCopy proc pDest:DWORD, pSource:DWORD
mov ecx, pSource
mov edx, pDest
copyLoop:
mov al, byte ptr [ecx]
inc ecx
mov byte ptr [edx],al
inc edx
cmp al, 0
jnz copyLoop
mov al, byte ptr 0
ret
strCopy endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
StrCpy2 proc src:DWORD,dst:DWORD
mov ecx, [esp+4] ; src
mov edx, [esp+8] ; dst
push esi
mov esi, -1
@@:
add esi, 1
movzx eax, BYTE PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
test eax, eax
jz @F
add esi, 1
movzx eax, BYTE PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
test eax, eax
jnz @B
@@:
pop esi
ret 8
StrCpy2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
StrCpy3 proc src:DWORD,dst:DWORD
mov ecx, [esp+4] ; src
mov edx, [esp+8] ; dst
push esi
mov esi, -1
@@:
add esi, 1
movzx eax, WORD PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
or al, al
jz @F
add esi, 1
;movzx eax, BYTE PTR [ecx+esi]
mov BYTE PTR [edx+esi], ah
or ah, ah
jnz @B
@@:
pop esi
ret 8
StrCpy3 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
StrCpy4 proc src:DWORD,dst:DWORD
mov ecx, [esp+4] ; src
mov edx, [esp+8] ; dst
push esi
xor esi, esi
@@:
movzx eax, WORD PTR [ecx+esi]
mov BYTE PTR [edx+esi], al
or al, al
jz @F
mov BYTE PTR [edx+esi+1], ah
add esi, 2
or ah, ah
jnz @B
@@:
pop esi
ret 8
StrCpy4 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
StrCpy5 proc src:DWORD,dst:DWORD
mov ecx, [esp+4] ; src
mov edx, [esp+8] ; dst
push esi
mov esi, -1
@@:
add esi, 1
mov eax, [ecx+esi]
mov BYTE PTR [edx+esi], al
or al, al
jz @F
add esi, 1
mov BYTE PTR [edx+esi], ah
or ah, ah
jz @F
shr eax, 16
add esi, 1
mov BYTE PTR [edx+esi], al
or al, al
jz @F
add esi, 1
mov BYTE PTR [edx+esi], ah
or ah, ah
jnz @B
@@:
pop esi
ret 8
StrCpy5 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Celeron M:
1016 Original
891 Modified
890 Modified again
891 Modified again 2
890 Modified again J
672 MasmBasic copy
"Modified again J" is a dword load plus bswap:
@@:
inc esi
mov eax, [ecx+esi]
mov BYTE PTR [edx+esi], al
test al, al
jz @F
inc esi
mov BYTE PTR [edx+esi], ah
test ah, ah
jz @F
bswap eax
inc esi
mov BYTE PTR [edx+esi], ah
test ah, ah
jz @F
inc esi
mov BYTE PTR [edx+esi], al
test al, al
jnz @B
Using test al, al instead of or al, al is a lot faster on my CPU.
Jochen,
"or" is faster on my P4.
I tried StrCpy5, but it is not faster! But you came up with bswap! ;)
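One point worth keeping in mind with StrCpy3, StrCpy4 and StrCpy5: they fetch two or four source bytes per load, so the load can reach one to three bytes past the terminator. The C sketch below mirrors the StrCpy4 structure to make that visible. The name copy_pairs and the returned length are illustrative only, and the code assumes that at least one readable byte follows the terminator in the source buffer.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the StrCpy4 idea: fetch two source bytes per load.
   ASSUMPTION: at least one readable byte follows the terminator in
   src, because the two-byte fetch can reach one byte past it.
   Returns the length of the copied string (an addition for testing). */
size_t copy_pairs(char *dst, const char *src)
{
    size_t i = 0;
    for (;;) {
        unsigned char pair[2];
        memcpy(pair, src + i, 2);   /* plays the role of the WORD PTR load */

        dst[i] = (char)pair[0];
        if (pair[0] == 0) return i; /* pair[1] was read past the end here */

        dst[i + 1] = (char)pair[1];
        if (pair[1] == 0) return i + 1;
        i += 2;
    }
}
```

When the string has an even length, the terminator is the first byte of a pair, and the second byte of that pair lies outside the string. That is harmless in the test program, but it can fault if the string ends exactly at the end of a readable page.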
I tested Hutch's new function again, along with the CRT function and the API.
.data
szText db "abcdefghijklmnopqrstuvwxyz123456789", 0
szBuffer db 64 dup (0) ; 64 zero bytes (MASM syntax is count dup (value))
includelib \MSVCRT.LIB
StrCpy2 PROTO src:DWORD,dst:DWORD
strcpy PROTO C :DWORD, :DWORD
Results:
lstrcpy (API) 563
StrCpy2 (Hutch) 297
strcpy (crt) 172
Manos.
I'm curious whether anyone is interested in a more statistical approach to measurement, in which the time is taken for a single call to the code under test and several milliseconds are waited between calls, so that the cache is not involved the way it is with the loop-x-thousand-times method? Especially for memory-intensive functions like MemCopy or table-based methods, this might be a better way to measure the speed...
qWord is right here: stabilising the timings with a 100 ms delay gives a more realistic result, even though it does not change much. I did a quick play with the unroll rate and found that 3 was very slightly faster, while 4 and higher made no difference. Changing the proc alignment slowed it down, and aligning the first label also slowed it down.
My own interest in these simple byte copy routines is how fast they are the first time, a factor I call "attack", rather than in streamed tests, as it is not uncommon to perform this kind of copy in the middle of a much more complex algorithm where the call overhead is a problem. Now while a 4 byte copy will usually be faster, it runs into alignment problems with string data, which you cannot guarantee to be 4 byte aligned, whereas the simple byte level copy is insensitive to alignment.
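One common way around the alignment problem is to copy single bytes until the source address is 4 byte aligned and only then switch to 4 byte steps. The C sketch below shows that idea. It is not code from this thread: the name copy_aligned4 is hypothetical, and the zero-byte test is the well-known bit trick for detecting a zero byte in a 32-bit word. It assumes that the aligned 4-byte word containing the terminator is fully readable, since the load can take in up to three bytes past the terminator.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: byte copy until src is 4-byte aligned, then 4 bytes per step.
   ASSUMPTION: the aligned word holding the terminator is readable,
   because the 4-byte load may read up to 3 bytes past the terminator. */
char *copy_aligned4(char *dst, const char *src)
{
    char *d = dst;

    /* Head: single bytes until the source address is a multiple of 4. */
    while (((uintptr_t)src & 3u) != 0) {
        if ((*d++ = *src++) == '\0')
            return dst;
    }

    /* Body: whole words as long as none of the 4 bytes is zero. */
    for (;;) {
        uint32_t v;
        memcpy(&v, src, 4);                          /* aligned load */
        if ((v - 0x01010101u) & ~v & 0x80808080u)    /* any zero byte? */
            break;
        memcpy(d, &v, 4);                            /* dst may be unaligned */
        src += 4;
        d += 4;
    }

    /* Tail: the last word contains the terminator, finish bytewise. */
    while ((*d++ = *src++) != '\0') {
    }
    return dst;
}
```

The extra head and tail handling is exactly the kind of overhead that hurts the first-call "attack" on short strings, which is one reason the plain byte loop stays attractive there.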
Quote from: qWord on May 09, 2013, 08:34:40 AM
I'm curious whether anyone is interested in a more statistical approach to measurement, in which the time is taken for a single call to the code under test and several milliseconds are waited between calls, so that the cache is not involved the way it is with the loop-x-thousand-times method?
If my poor English does not mislead me, you mean that I made only one measurement for each case.
Let me tell you that I know how to take measurements.
I am a physicist, and the first thing I was taught at university is that we must take many measurements and use the average.
If you look at my first post in this thread, you'll see that I refer to the average.
If I posted all my measurements here, they would fill two pages.
I have run a new test with a string of double length.
Here are the results:
szText db "abcdefghijklmnopqrstuvwxyz123456789abcdefghijklmnopqrstuvwxyz123456789", 0
szBuffer db 128 dup (0) ; 128 zero bytes (MASM syntax is count dup (value))
lstrcpy (API) 883
StrCpy2 (Hutch) 383
strcpy (crt) 330
Manos.
Quote from: Manos on May 09, 2013, 06:36:36 PM
I am a physicist, and the first thing I was taught at university is that we must take many measurements and use the average.
No doubt about that.
I was thinking of a method that waits x milliseconds between the calls (in a loop), in the hope that the cache is cleared for the corresponding memory region. With such a method we cannot get millions of measurement values, but I'm confident that several hundred are sufficient. From the recorded results, we discard the outliers and then take the average as usual.
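The "discard the outliers, then average" step could look like the following C sketch. It assumes one particular reading of "outlier", namely a fixed number of the smallest and largest samples, which is a choice made here and not something specified in the thread. The names trimmed_mean and cmp_double are illustrative. Collecting the samples (single timed calls with a sleep in between) is left out because it is platform specific.

```c
#include <stdlib.h>

/* qsort comparison for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place, drop `trim` values from each end and
   average the rest. Returns 0.0 if no samples would be left. */
double trimmed_mean(double *samples, size_t n, size_t trim)
{
    if (n <= 2 * trim)
        return 0.0;

    qsort(samples, n, sizeof samples[0], cmp_double);

    double sum = 0.0;
    for (size_t i = trim; i < n - trim; ++i)
        sum += samples[i];
    return sum / (double)(n - 2 * trim);
}
```

For example, with the samples 10, 11, 9, 10, 250, 10, 1 and trim set to 1, the values 1 and 250 are dropped and the mean of the remaining five is 10.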