Which is faster

Started by Farabi, December 15, 2012, 11:19:10 PM

Farabi


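movups xmm0,[esi]    ; load four floats from the first source (unaligned load)
movups xmm1,[edi]    ; load four floats from the second source
subps xmm0,xmm1      ; xmm0 = xmm0 - xmm1, element by element
movups [eax],xmm0    ; store the four results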


or


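movups xmm0,[esi]    ; load four floats from the first source
movaps xmm3,xmm0     ; extra register-to-register copy
movups xmm1,[edi]    ; load four floats from the second source
movaps xmm4,xmm1     ; extra register-to-register copy
subps xmm3,xmm4      ; xmm3 = xmm3 - xmm4, element by element
movups [eax],xmm3    ; store the four results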

qWord

Test it.
The first version does the same work with fewer instructions, so it has a size advantage. In practice it will probably run at nearly the same speed because of register renaming. In any case, the critical part is most probably the store operation.
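
To illustrate the point that both versions compute the same thing, here is a minimal C sketch using SSE intrinsics. It is not from the thread: the names `sub4_a` and `sub4_b` are made up for this example, and a scalar fallback is included for builds without SSE. It only demonstrates that the results are identical. It cannot show a speed difference, because a C compiler is free to drop the extra copies in version B.

```c
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper mirroring version A: load, load, subtract, store. */
static void sub4_a(float *dst, const float *src1, const float *src2)
{
#if HAVE_SSE
    __m128 x0 = _mm_loadu_ps(src1);          /* movups xmm0,[esi] */
    __m128 x1 = _mm_loadu_ps(src2);          /* movups xmm1,[edi] */
    x0 = _mm_sub_ps(x0, x1);                 /* subps  xmm0,xmm1  */
    _mm_storeu_ps(dst, x0);                  /* movups [eax],xmm0 */
#else
    for (int i = 0; i < 4; ++i)              /* scalar fallback   */
        dst[i] = src1[i] - src2[i];
#endif
}

/* Hypothetical helper mirroring version B: same work plus two copies. */
static void sub4_b(float *dst, const float *src1, const float *src2)
{
#if HAVE_SSE
    __m128 x0 = _mm_loadu_ps(src1);          /* movups xmm0,[esi] */
    __m128 x3 = x0;                          /* movaps xmm3,xmm0  */
    __m128 x1 = _mm_loadu_ps(src2);          /* movups xmm1,[edi] */
    __m128 x4 = x1;                          /* movaps xmm4,xmm1  */
    x3 = _mm_sub_ps(x3, x4);                 /* subps  xmm3,xmm4  */
    _mm_storeu_ps(dst, x3);                  /* movups [eax],xmm3 */
#else
    for (int i = 0; i < 4; ++i) {            /* scalar fallback   */
        float c1 = src1[i], c2 = src2[i];    /* the extra copies  */
        dst[i] = c1 - c2;
    }
#endif
}
```

Calling both helpers on the same two four-float arrays and comparing the outputs with `memcmp` should report no difference. Timing the two variants still has to be done on the assembly itself.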
MREAL macros - when you need floating point arithmetic while assembling!

Adamanteus

I think code like this should be avoided, because its timing varies from case to case: it depends on how the CPU cache is loaded before the store operations and on how the instructions pair.

jj2007

What matters more is whether the sources are 16-byte aligned. In the tests below, the destination is always aligned.
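
As a rough illustration of what "16-byte aligned" means here, the following C sketch (not from the thread) tests a pointer's low four address bits. The names `is_aligned16`, `pick_load` and `sub4_aligned` are hypothetical, and the subtraction is a scalar stand-in for `subps`, not the code that was timed.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p lies on a 16-byte boundary, which is what movaps
   requires for a memory operand; movups has no such requirement. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

enum load_kind { LOAD_ALIGNED, LOAD_UNALIGNED };

/* Decide which kind of load an SSE routine could use for two sources. */
static enum load_kind pick_load(const float *a, const float *b)
{
    return (is_aligned16(a) && is_aligned16(b)) ? LOAD_ALIGNED
                                                : LOAD_UNALIGNED;
}

/* Scalar stand-in for the aligned path: the caller promises that both
   sources are 16-byte aligned, and the assert checks that promise. */
static void sub4_aligned(float *dst, const float *a, const float *b)
{
    assert(is_aligned16(a) && is_aligned16(b));
    for (int i = 0; i < 4; ++i)
        dst[i] = a[i] - b[i];
}
```

With a 16-byte aligned `float buf[8]`, `buf` and `buf + 4` pass the check while `buf + 1` does not, so a routine can choose between an aligned and an unaligned path at run time.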

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

--- sources aligned:
48      cycles for 1 * Farabi A
65      cycles for 1 * Farabi B
38      cycles for 1 * JJ misaligned
32      cycles for 1 * JJ aligned

48      cycles for 1 * Farabi A
65      cycles for 1 * Farabi B
38      cycles for 1 * JJ misaligned
32      cycles for 1 * JJ aligned

--- sources misaligned:
66      cycles for 1 * Farabi A
76      cycles for 1 * Farabi B
57      cycles for 1 * JJ misaligned

65      cycles for 1 * Farabi A
76      cycles for 1 * Farabi B
57      cycles for 1 * JJ misaligned

30      bytes for Farabi A
36      bytes for Farabi B
38      bytes for JJ misaligned
27      bytes for JJ aligned

qWord

Quote:
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
-++---+--+++++--++
2 of 20 tests valid, loop overhead is approx. 1/1 cycles

46      cycles for 1 * Farabi A
43      cycles for 1 * Farabi B
70      cycles for 1 * JJ misaligned
34      cycles for 1 * JJ aligned

41      cycles for 1 * Farabi A
43      cycles for 1 * Farabi B
72      cycles for 1 * JJ misaligned
27      cycles for 1 * JJ aligned

41      cycles for 1 * Farabi A
43      cycles for 1 * Farabi B
70      cycles for 1 * JJ misaligned
25      cycles for 1 * JJ aligned

30      bytes for Farabi A
36      bytes for Farabi B
38      bytes for JJ misaligned
27      bytes for JJ aligned

BTW: AVX should be reported as the latest instruction set (instead of SSE4).

Adamanteus

So the last listing shows that Farabi A can be slower only when the operands are not cached, and, as the first listing suggests, only with some types of processor cache.