There is no problem using SSE and AVX instructions; you just have to look them up in the Intel manuals. Sad to say, I have to do a lot more of the hacky stuff before I can get into the really fast stuff.
Now as far as register usage goes, I learnt long ago to comply fully with whatever the appropriate ABI happens to be; that way you get reliable code that works across different versions of Windows. From memory, Linux has a slightly different ABI, but it's the same mentality: C compilers generally use the full spread of registers and use them according to the OS register convention, so it's worth playing safe here.
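The algo below is an AVX memory copy procedure, and it is faster than both the legacy version and the SSE version. Note that both the source and the destination address must be 256-bit (32-byte) aligned, and that the YMM form of vmovntdqa needs AVX2 rather than plain AVX.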
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
ymmcopya proc
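; rcx = source address (32-byte aligned)
; rdx = destination address (32-byte aligned)
; r8 = byte count
xor r10, r10 ; zero r10 to use as index
mov r11, r8
shr r11, 5 ; div by 32 for loop count
jz rmdr ; under 32 bytes, skip the YMM loop
lpst:
vmovntdqa ymm0, YMMWORD PTR [rcx+r10]
vmovntdq YMMWORD PTR [rdx+r10], ymm0
add r10, 32
sub r11, 1
jnz lpst
sfence ; order the non-temporal stores before later stores
vzeroupper ; avoid AVX to SSE transition penalties in the caller
rmdr:
mov rax, r8 ; calculate remainder if any
and rax, 31 ; AND sets the zero flag, no TEST needed
jnz @F
ret
@@:
mov r9b, [rcx+r10] ; copy any remainder
mov [rdx+r10], r9b
add r10, 1
sub rax, 1
jnz @B
ret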
ymmcopya endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
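
If you want something to check the procedure above against, a portable C model along these lines may help. It is only a sketch and not part of the original code: the names `is_aligned32` and `ymmcopy_model` are made up for illustration, and the model copies each 32-byte block with plain `memcpy` instead of YMM loads and stores. It splits the byte count the same way (count >> 5 whole blocks, count & 31 remainder bytes) and refuses pointers that are not 32-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero when p sits on a 32-byte boundary. */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p & 31u) == 0;
}

/* Hypothetical portable model of the copy: whole 32-byte blocks first,
   then a byte-by-byte remainder.
   Returns 0 on success, -1 if either pointer is not 32-byte aligned. */
static int ymmcopy_model(const void *src, void *dst, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    size_t blocks = n >> 5;   /* same split as shr r11, 5 */
    size_t tail = n & 31;     /* same split as and rax, 31 */
    size_t i = 0;             /* plays the role of the r10 index */

    if (!is_aligned32(src) || !is_aligned32(dst))
        return -1;

    while (blocks != 0) {     /* never entered when n < 32 */
        memcpy(d + i, s + i, 32);
        i += 32;
        blocks--;
    }
    while (tail != 0) {       /* copy any remainder */
        d[i] = s[i];
        i++;
        tail--;
    }
    return 0;
}
```

Feeding the same aligned buffers and byte count to both the model and the assembler procedure, then comparing the destinations, is a cheap way to catch off-by-one mistakes in the remainder handling. Counts below 32 are worth including since they never enter the block loop.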