Runs into a problem if the byte count is 16 or less. If the first jle @F is taken, RAX is never initialised, but the final byte copy loop uses it as its counter.
xmmcopya proc
; *********************
; aligned memory copy
; *********************
; rcx = source address
; rdx = destination address
; r8 = byte count
mov r11, r8
shr r11, 4 ; div by 16 for loop count
xor r10, r10 ; zero r10 to use as index
cmp r8, 16
jle @F
lpst:
movdqa xmm0, [rcx+r10]
movntdq [rdx+r10], xmm0
add r10, 16
sub r11, 1
jnz lpst
mov rax, r8 ; calculate remainder if any
and rax, 15
test rax, rax
jnz @F
ret
@@:
mov r9b, [rcx+r10] ; copy any remainder
mov [rdx+r10], r9b
add r10, 1
sub rax, 1
jnz @B
ret
xmmcopya endp
I have only just pulled my head out of a linear parser. Probably the best way is to test the length and, if it's under 16 bytes, branch to a simple byte copy. I did not envisage it being used for very small byte counts.
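One way the length test could look, as a rough sketch rather than the author's actual fix: work out the remainder in RAX before any branch, and skip the SSE loop when there is no whole 16-byte block. The name xmmcopy_fix and the label names are made up for illustration, the register contract is the same as xmmcopya, and both pointers are still assumed to be 16-byte aligned. The SFENCE is an addition that the original routine does not have.

```asm
; Sketch only: xmmcopy_fix is a hypothetical name, not the original routine.
; rcx = source address, rdx = destination address, r8 = byte count
; Assumes both addresses are 16-byte aligned (movdqa/movntdq require it).

.code

xmmcopy_fix proc
    xor r10, r10                ; r10 = running byte index
    mov rax, r8
    and rax, 15                 ; rax = remainder, set before any branch
    mov r11, r8
    shr r11, 4                  ; r11 = number of whole 16-byte blocks
    jz copy_tail                ; under 16 bytes: skip the SSE loop
copy_blocks:
    movdqa xmm0, xmmword ptr [rcx+r10]
    movntdq xmmword ptr [rdx+r10], xmm0
    add r10, 16
    sub r11, 1
    jnz copy_blocks
    sfence                      ; added here: orders the non-temporal stores
copy_tail:
    test rax, rax
    jz copy_done                ; no remainder (also covers a zero count)
copy_bytes:
    mov r9b, [rcx+r10]          ; copy the remainder one byte at a time
    mov [rdx+r10], r9b
    add r10, 1
    sub rax, 1
    jnz copy_bytes
copy_done:
    ret
xmmcopy_fix endp

end
```

With this layout a count of exactly 16 goes through the SSE loop once and returns, and a count of 0 falls straight through to the ret.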
If you look at the design of the algorithm, the particular choice of mnemonics (MOVDQA and MOVNTDQ) requires 16-byte aligned memory and, as per the Intel manual, will crash without correct alignment, so it is not a choice for memory copies under 16 bytes in length. An algorithm of this type would be used for block streaming of aligned memory, not for small byte counts. On small amounts it's hard to beat REP MOVSB.
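For comparison, a minimal REP MOVSB copy with the same register contract might look like the following. The name bytecopy is hypothetical, and the sketch assumes the Win64 calling convention, where RSI and RDI are non-volatile and the direction flag is clear on entry. Strictly, a routine that changes non-volatile registers should be declared as a frame function with unwind data; that is left out here for brevity.

```asm
; Sketch only: bytecopy is a hypothetical name.
; rcx = source address, rdx = destination address, r8 = byte count
; No alignment requirement on either address.

.code

bytecopy proc
    mov r10, rsi                ; rsi and rdi are non-volatile in Win64,
    mov r11, rdi                ; so park them in volatile registers
    mov rsi, rcx                ; rsi = source
    mov rdi, rdx                ; rdi = destination
    mov rcx, r8                 ; rcx = byte count for rep
    rep movsb                   ; copy rcx bytes; does nothing when rcx is 0
    mov rsi, r10                ; restore the caller's rsi and rdi
    mov rdi, r11
    ret
bytecopy endp

end
```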
I realised later that it wasn't a general purpose routine, but some might see xmm and assume it's super-fast or something :biggrin: