movups xmm0,[esi]
movups xmm1,[edi]
subps xmm0,xmm1
movups [eax],xmm0
or
movups xmm0,[esi]
movaps xmm3,xmm0
movups xmm1,[edi]
movaps xmm4,xmm1
subps xmm3,xmm4
movups [eax],xmm3
Test it.
The first version does the same with fewer instructions, so it has a size advantage. In practice it will probably run at nearly the same speed because of register renaming. The critical part is most probably the store operation anyway.
I think such code should be avoided, because its timing varies from case to case, depending on how the CPU cache is loaded before the store operations and on instruction pairing.
What counts more is whether the sources are 16-byte aligned or not. In the tests below, the destination is always aligned.
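As a minimal sketch of what the alignment point means in code (not necessarily what any of the timed routines do): the fragment below checks both source pointers at run time and takes the movaps path only when both are 16-byte aligned. It assumes the same register roles as the snippets above (esi and edi are the sources, eax is the destination, and the destination is 16-byte aligned). The labels SrcMisaligned and StoreResult and the use of ecx as a scratch register are hypothetical choices for illustration.

```asm
; Assumptions: esi, edi = source pointers, eax = 16-byte aligned destination.
; ecx is used as scratch and is overwritten. Labels are hypothetical.
    mov     ecx, esi
    or      ecx, edi            ; merge the low address bits of both sources
    test    ecx, 15             ; any of the low 4 bits set?
    jnz     SrcMisaligned       ; yes: at least one source is not 16-byte aligned

    movaps  xmm0, [esi]         ; both sources aligned: aligned loads are safe
    movaps  xmm1, [edi]
    jmp     StoreResult

SrcMisaligned:
    movups  xmm0, [esi]         ; unaligned loads work for any address
    movups  xmm1, [edi]

StoreResult:
    subps   xmm0, xmm1          ; xmm0 = [esi] - [edi], four packed singles
    movaps  [eax], xmm0         ; destination is assumed aligned
```

Note that movaps faults on an address that is not 16-byte aligned, so the check (or a guarantee from the caller) is required before using it. Whether the extra branch pays off depends on the CPU.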
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
--- sources aligned:
48 cycles for 1 * Farabi A
65 cycles for 1 * Farabi B
38 cycles for 1 * JJ misaligned
32 cycles for 1 * JJ aligned
48 cycles for 1 * Farabi A
65 cycles for 1 * Farabi B
38 cycles for 1 * JJ misaligned
32 cycles for 1 * JJ aligned
--- sources misaligned:
66 cycles for 1 * Farabi A
76 cycles for 1 * Farabi B
57 cycles for 1 * JJ misaligned
65 cycles for 1 * Farabi A
76 cycles for 1 * Farabi B
57 cycles for 1 * JJ misaligned
30 bytes for Farabi A
36 bytes for Farabi B
38 bytes for JJ misaligned
27 bytes for JJ aligned
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
-++---+--+++++--++2 of 20 tests valid, loop overhead is approx. 1/1 cycles
46 cycles for 1 * Farabi A
43 cycles for 1 * Farabi B
70 cycles for 1 * JJ misaligned
34 cycles for 1 * JJ aligned
41 cycles for 1 * Farabi A
43 cycles for 1 * Farabi B
72 cycles for 1 * JJ misaligned
27 cycles for 1 * JJ aligned
41 cycles for 1 * Farabi A
43 cycles for 1 * Farabi B
70 cycles for 1 * JJ misaligned
25 cycles for 1 * JJ aligned
30 bytes for Farabi A
36 bytes for Farabi B
38 bytes for JJ misaligned
27 bytes for JJ aligned
BTW: AVX should be reported as the latest instruction set (instead of SSE4).
So the last listing shows that Farabi A can be slower only when its operands are not cached, and only on some types of processor cache, as the first listing suggests.