Which is faster

Farabi · December 15, 2012, 11:19:10 PM


	movups xmm0,[esi]
	movups xmm1,[edi]
	subps xmm0,xmm1
	movups [eax],xmm0

or


	movups xmm0,[esi]
	movaps xmm3,xmm0
	movups xmm1,[edi]
	movaps xmm4,xmm1
	subps xmm3,xmm4
	movups [eax],xmm3

qWord · December 16, 2012, 05:16:10 AM

Test it :t
The first version does the same with fewer instructions, so it has a size advantage. In practice it will probably run at nearly the same speed because of register renaming. In any case, the critical part is most probably the store operation.

Adamanteus · December 16, 2012, 07:39:40 AM

I think such code is best avoided, because its timing varies from case to case, depending on how the CPU cache is loaded before the store operations and on instruction pairing.

jj2007 · December 16, 2012, 09:38:39 AM

What counts more is whether the sources are 16-byte aligned or not. In the tests below, the destination is always aligned.

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)

--- sources aligned:
48 cycles for 1 * Farabi A
65 cycles for 1 * Farabi B
38 cycles for 1 * JJ misaligned
32 cycles for 1 * JJ aligned

48 cycles for 1 * Farabi A
65 cycles for 1 * Farabi B
38 cycles for 1 * JJ misaligned
32 cycles for 1 * JJ aligned

--- sources misaligned:
66 cycles for 1 * Farabi A
76 cycles for 1 * Farabi B
57 cycles for 1 * JJ misaligned

65 cycles for 1 * Farabi A
76 cycles for 1 * Farabi B
57 cycles for 1 * JJ misaligned

30 bytes for Farabi A
36 bytes for Farabi B
38 bytes for JJ misaligned
27 bytes for JJ aligned
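
The code behind the "JJ aligned" entry is not shown in this thread. As a minimal sketch of what an aligned variant could look like, assuming esi, edi and eax all point to 16-byte aligned data (this is an assumption for illustration, not the code that was actually timed):

```asm
; Sketch only: assumes esi, edi and eax are all 16-byte aligned.
; movaps and the memory operand of subps raise a fault on unaligned addresses.
	movaps xmm0,[esi]	; load 4 floats from the first source
	subps xmm0,[edi]	; subtract 4 floats read directly from the second source
	movaps [eax],xmm0	; store the 4 results
```

Compared with Farabi A this drops one instruction and leaves xmm1 free, but whether it is actually faster should be measured on the target CPU.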

qWord · December 16, 2012, 09:46:01 AM

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
-++---+--+++++--++
2 of 20 tests valid, loop overhead is approx. 1/1 cycles

46 cycles for 1 * Farabi A
43 cycles for 1 * Farabi B
70 cycles for 1 * JJ misaligned
34 cycles for 1 * JJ aligned

41 cycles for 1 * Farabi A
43 cycles for 1 * Farabi B
72 cycles for 1 * JJ misaligned
27 cycles for 1 * JJ aligned

41 cycles for 1 * Farabi A
43 cycles for 1 * Farabi B
70 cycles for 1 * JJ misaligned
25 cycles for 1 * JJ aligned

30 bytes for Farabi A
36 bytes for Farabi B
38 bytes for JJ misaligned
27 bytes for JJ aligned

BTW: AVX should be reported as the latest instruction set (instead of SSE4).

Adamanteus · December 17, 2012, 01:31:46 AM

So the last listing shows that Farabi A could be slower only when the operands are not cached, and, as follows from the first listing, only with some types of processor cache.

The MASM Forum
