There seems to be a wide range of choices for loading a single double value into an xmm register.
movsd
movapd
movupd
movq
The mnemonic "movapd" requires 16-byte alignment even when you only want an 8-byte (REAL8) value, since it reads a full 16-byte pair.
The task here is calculation, not streaming multimedia or similar. I have all of the scalar instructions going, but there are more packed-double instructions that would be useful.
LATER : Nothing meaningful in the difference. :biggrin:
I see some differences - surprisingly, movlps is slow, movups is fast ::)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
95 cycles for 100 * movsd*3
226 cycles for 100 * movlps*3
86 cycles for 100 * movq*3
83 cycles for 100 * movupd*3
57 cycles for 100 * movups
81 cycles for 100 * movapd
88 cycles for 100 * movsd*3
226 cycles for 100 * movlps*3
88 cycles for 100 * movq*3
88 cycles for 100 * movupd*3
65 cycles for 100 * movups
86 cycles for 100 * movapd
90 cycles for 100 * movsd*3
225 cycles for 100 * movlps*3
89 cycles for 100 * movq*3
83 cycles for 100 * movupd*3
59 cycles for 100 * movups
83 cycles for 100 * movapd
86 cycles for 100 * movsd*3
226 cycles for 100 * movlps*3
83 cycles for 100 * movq*3
88 cycles for 100 * movupd*3
66 cycles for 100 * movups
86 cycles for 100 * movapd
83 cycles for 100 * movsd*3
225 cycles for 100 * movlps*3
87 cycles for 100 * movq*3
85 cycles for 100 * movupd*3
59 cycles for 100 * movups
88 cycles for 100 * movapd
24 bytes for movsd*3
21 bytes for movlps*3
24 bytes for movq*3
24 bytes for movupd*3
21 bytes for movups
24 bytes for movapd
All tests are organised like this:
align_64
TestA proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
movsd xmm0, MyDouble
movsd xmm1, MyDouble
movsd xmm2, MyDouble
dec ebx
.Until Sign?
ret
This Haswell is a pig to get any timings on, but here is the result for 5 billion iterations of memory loads to xmm registers, run at high priority class with a 100 ms delay between each test.
1500 movapd
1484 movsd
1500 movq
1516 movapd
1532 movsd
1453 movq
1485 movapd
1484 movsd
1484 movq
1516 movapd
1454 movsd
1437 movq
Press any key to continue...
Interesting. Can you post the exe? I've tried to load xmm0 ... xmm7 in the loop, and the picture changes - the advantage for movups is gone:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
290 cycles for 100 * movsd*8
773 cycles for 100 * movlps*8
234 cycles for 100 * movq*8
222 cycles for 100 * movupd*8
223 cycles for 100 * movups*8
223 cycles for 100 * movapd*8
224 cycles for 100 * movsd*8
636 cycles for 100 * movlps*8
223 cycles for 100 * movq*8
224 cycles for 100 * movupd*8
226 cycles for 100 * movups*8
225 cycles for 100 * movapd*8
224 cycles for 100 * movsd*8
629 cycles for 100 * movlps*8
226 cycles for 100 * movq*8
226 cycles for 100 * movupd*8
223 cycles for 100 * movups*8
223 cycles for 100 * movapd*8
222 cycles for 100 * movsd*8
630 cycles for 100 * movlps*8
222 cycles for 100 * movq*8
221 cycles for 100 * movupd*8
224 cycles for 100 * movups*8
225 cycles for 100 * movapd*8
222 cycles for 100 * movsd*8
631 cycles for 100 * movlps*8
225 cycles for 100 * movq*8
225 cycles for 100 * movupd*8
225 cycles for 100 * movups*8
222 cycles for 100 * movapd*8
64 bytes for movsd*8
56 bytes for movlps*8
64 bytes for movq*8
64 bytes for movupd*8
56 bytes for movups*8
64 bytes for movapd*8
If it is for moving doubles, why are movups and movlps in the test?
movups moves 4 packed singles
movlps moves the 2 low packed singles
movhps moves the 2 high packed singles
These 2 are not in the test:
movlpd moves the low packed double (can be replaced by movsd)
movhpd moves the high packed double
Quote from: Siekmanski on September 10, 2018, 04:04:30 AM
If it is for moving doubles, why are movups and movlps in the test?
movlps moves a double, though not officially ;-)
Quote
these 2 are not in the test,
movlpd move low packed double ( can be replaced by movsd )
movhpd move high packed double
Added movlpd - it behaves like movlps:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
228 cycles for 100 * movsd*8
632 cycles for 100 * movlps*8
218 cycles for 100 * movq*8
218 cycles for 100 * movupd*8
225 cycles for 100 * movups*8
226 cycles for 100 * movapd*8
630 cycles for 100 * movlpd*8
225 cycles for 100 * movsd*8
634 cycles for 100 * movlps*8
217 cycles for 100 * movq*8
218 cycles for 100 * movupd*8
226 cycles for 100 * movups*8
226 cycles for 100 * movapd*8
632 cycles for 100 * movlpd*8
:biggrin:
Do you measure any difference between movaps and movups?
Yes, movaps is a tick slower. Joking aside: on modern CPUs there is no speed difference between the two, except that one of them crashes on unaligned memory access :lol:
Yeah, I noticed the same behavior. :t
I found that movapd is not all that useful either: it requires 16-byte alignment to load a REAL8, whereas movsd and movq are just as fast and not fussy about alignment.