One of the sobering facts of writing code for x86 hardware is that the performance of any given procedure varies from one processor to another. On the Haswell I am using, I just ran some speed tests on a simple byte copy, and plain "rep movsb" is the fastest on small memory copy tasks. On very large data the SSE and AVX versions are faster, but interestingly enough the historical "rep movsd" hybrid with "rep movsb" is actually slower on all data sizes.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
bcopy proc
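; rcx = src
; rdx = dst
; r8 = count (bytes)
mov r11, rsi        ; rsi and rdi are non-volatile in Win64, so park them
mov r10, rdi        ; in the volatile scratch registers r11 and r10
mov rsi, rcx        ; rsi = source address
mov rdi, rdx        ; rdi = destination address
mov rcx, r8         ; rcx = byte count for the REP prefix
rep movsb           ; copy rcx bytes from [rsi] to [rdi]
mov rsi, r11        ; restore rsi
mov rdi, r10        ; restore rdi
ret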
bcopy endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
What about the "native" size for x64, rep movsq?
It's worth a try; I will have to do one at some stage. Long ago Intel made special provisions for the old REP MOVS instructions and they were always reasonably fast, but this stuff tends to vary from processor to processor. This is what I am getting timing-wise with the different instructions.
bcopy 7141
mcopy 7281
xmmcopya 5375
ymmcopya 5407
Press any key to continue...
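For anyone who wants to try the "rep movsq" idea raised above, here is a minimal sketch of what such a variant might look like. It follows the same layout as bcopy. The name qcopy is made up for illustration and is not one of the procedures timed above, and no timing is claimed for it. It assumes the usual Win64 conditions: the direction flag is clear on entry and the buffers do not overlap in a way that breaks a forward copy.

```asm
NOSTACKFRAME
qcopy proc              ; hypothetical name, not from the timings above
  ; rcx = src
  ; rdx = dst
  ; r8  = count (bytes)
    mov r11, rsi        ; preserve non-volatile rsi and rdi
    mov r10, rdi        ; in volatile scratch registers
    mov rsi, rcx        ; rsi = source address
    mov rdi, rdx        ; rdi = destination address
    mov rcx, r8
    shr rcx, 3          ; rcx = number of whole 8-byte blocks
    rep movsq           ; copy the bulk 8 bytes at a time
    mov rcx, r8
    and rcx, 7          ; rcx = 0 to 7 leftover bytes
    rep movsb           ; copy the tail one byte at a time
    mov rsi, r11        ; restore rsi
    mov rdi, r10        ; restore rdi
    ret
qcopy endp
STACKFRAME
```

It takes the same arguments as bcopy, so it should drop into the same test harness. Whether the extra shift, mask and second REP pay for themselves is exactly the kind of thing that will vary from processor to processor.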