Has anyone timed the instructions for loading a double into an xmm register?

hutch-- · September 09, 2018, 11:36:14 PM

There seems to be a wide range of instructions to choose from when loading a single double value into an xmm register.

movsd
movapd
movupd
movq

The mnemonic "movapd" requires 16-byte alignment, because it reads a full 16 bytes even when only an 8-byte (REAL8) value is wanted.
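
For illustration, a minimal sketch of the alignment point, assuming 32-bit MASM (an ml.exe recent enough to know the `xmmword` type). The label `Vals` and the procedure `LoadDemo` are hypothetical and not from the thread. `movapd` and `movupd` always read 16 bytes, and `movapd` faults unless the address is a multiple of 16; `movsd` and `movq` read only 8 bytes and accept any address.

```asm
.686
.xmm
.model flat, stdcall
option casemap:none

.data
align 16
Vals REAL8 1.5, 2.5, 3.25, 4.0      ; Vals is 16-byte aligned, Vals+8 is only 8-byte aligned

.code
LoadDemo proc
    movapd xmm0, xmmword ptr [Vals]     ; aligned 16-byte load: 1.5 (low), 2.5 (high)
    movupd xmm1, xmmword ptr [Vals+8]   ; unaligned 16-byte load: 2.5 (low), 3.25 (high)
    movsd  xmm2, Vals+8                 ; 8-byte load of 2.5, no alignment requirement
    movq   xmm3, qword ptr [Vals+8]     ; 8-byte load of 2.5, no alignment requirement
    ; movapd xmm4, xmmword ptr [Vals+8] ; would fault: address is not a multiple of 16
    ret
LoadDemo endp
end
```

The `align 16` before the data is what makes the `movapd` line safe; without it the load only works if the assembler happens to place `Vals` on a 16-byte boundary.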

The task here is calculations and not streaming multi-media or similar. I have got all of the scalar instructions going but there are more instructions for packed double operations that would be useful.

LATER: There is no meaningful difference between them.

jj2007 · September 10, 2018, 02:18:21 AM

I see some differences - surprisingly, movlps is slow, movups is fast ::)

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

95      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
86      cycles for 100 * movq*3
83      cycles for 100 * movupd*3
57      cycles for 100 * movups
81      cycles for 100 * movapd

88      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
88      cycles for 100 * movq*3
88      cycles for 100 * movupd*3
65      cycles for 100 * movups
86      cycles for 100 * movapd

90      cycles for 100 * movsd*3
225     cycles for 100 * movlps*3
89      cycles for 100 * movq*3
83      cycles for 100 * movupd*3
59      cycles for 100 * movups
83      cycles for 100 * movapd

86      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
83      cycles for 100 * movq*3
88      cycles for 100 * movupd*3
66      cycles for 100 * movups
86      cycles for 100 * movapd

83      cycles for 100 * movsd*3
225     cycles for 100 * movlps*3
87      cycles for 100 * movq*3
85      cycles for 100 * movupd*3
59      cycles for 100 * movups
88      cycles for 100 * movapd

24      bytes for movsd*3
21      bytes for movlps*3
24      bytes for movq*3
24      bytes for movupd*3
21      bytes for movups
24      bytes for movapd

All tests are organised like this:

Code:

align_64
TestA proc
  mov ebx, AlgoLoops-1	; loop e.g. 100x
  align 4
  .Repeat
	movsd xmm0, MyDouble
	movsd xmm1, MyDouble
	movsd xmm2, MyDouble
	dec ebx
  .Until Sign?
  ret
TestA endp
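
The listing above shows only the loop under test, not how the cycle counts are taken. As a rough sketch only, assuming a plain `rdtsc` measurement rather than the harness actually used for these numbers, one such loop could be timed as below. `TimeLoads`, the data label and the loop count are hypothetical. Note that on most recent CPUs `rdtsc` counts at a fixed reference frequency, so its ticks are not exactly core clock cycles.

```asm
.686
.xmm
.model flat, stdcall
option casemap:none

.data
MyDouble REAL8 1.0

.code
; Returns the elapsed time-stamp ticks for 100 passes in edx:eax.
TimeLoads proc uses ebx esi edi
    xor   eax, eax
    cpuid                   ; serialise before reading the counter (clobbers eax, ebx, ecx, edx)
    rdtsc                   ; edx:eax = start count
    mov   esi, eax
    mov   edi, edx
    mov   ecx, 100-1        ; 100 passes, same exit test as .Until Sign?
  @@:
    movsd xmm0, MyDouble
    movsd xmm1, MyDouble
    movsd xmm2, MyDouble
    dec   ecx
    jns   @B                ; repeat until ecx goes negative
    xor   eax, eax
    cpuid                   ; serialise again so the loads have finished
    rdtsc                   ; edx:eax = end count
    sub   eax, esi
    sbb   edx, edi          ; 64-bit difference: end minus start
    ret
TimeLoads endp
end
```

The result includes the cost of `cpuid`, `rdtsc` and the loop counter, so a run with an empty loop body would be needed as a baseline to subtract.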

hutch-- · September 10, 2018, 02:26:32 AM

Noting that this Haswell is a pig to get any timings on, this is the result for 5 billion iterations of memory loads to xmm registers. This is with a high priority class and a 100 ms delay between each test.

1500 movapd
1484 movsd
1500 movq
1516 movapd
1532 movsd
1453 movq
1485 movapd
1484 movsd
1484 movq
1516 movapd
1454 movsd
1437 movq
Press any key to continue...

jj2007 · September 10, 2018, 03:21:03 AM

Interesting. Can you post the exe? I've tried to load xmm0 ... xmm7 in the loop, and the picture changes - the advantage for movups is gone:

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

290     cycles for 100 * movsd*8
773     cycles for 100 * movlps*8
234     cycles for 100 * movq*8
222     cycles for 100 * movupd*8
223     cycles for 100 * movups*8
223     cycles for 100 * movapd*8

224     cycles for 100 * movsd*8
636     cycles for 100 * movlps*8
223     cycles for 100 * movq*8
224     cycles for 100 * movupd*8
226     cycles for 100 * movups*8
225     cycles for 100 * movapd*8

224     cycles for 100 * movsd*8
629     cycles for 100 * movlps*8
226     cycles for 100 * movq*8
226     cycles for 100 * movupd*8
223     cycles for 100 * movups*8
223     cycles for 100 * movapd*8

222     cycles for 100 * movsd*8
630     cycles for 100 * movlps*8
222     cycles for 100 * movq*8
221     cycles for 100 * movupd*8
224     cycles for 100 * movups*8
225     cycles for 100 * movapd*8

222     cycles for 100 * movsd*8
631     cycles for 100 * movlps*8
225     cycles for 100 * movq*8
225     cycles for 100 * movupd*8
225     cycles for 100 * movups*8
222     cycles for 100 * movapd*8

64      bytes for movsd*8
56      bytes for movlps*8
64      bytes for movq*8
64      bytes for movupd*8
56      bytes for movups*8
64      bytes for movapd*8

Siekmanski · September 10, 2018, 04:04:30 AM

If it is for moving doubles, why are movups and movlps in the test?

movups moves 4 packed singles
movlps moves 2 low packed singles
movhps moves 2 high packed singles

these 2 are not in the test,
movlpd move low packed double ( can be replaced by movsd )
movhpd move high packed double

jj2007 · September 10, 2018, 05:19:18 AM

Quote from: Siekmanski on September 10, 2018, 04:04:30 AM
If it is for moving doubles, why are movups and movlps in the test?

movlps moves a double, though not officially ;-)

Quote: these 2 are not in the test,
movlpd move low packed double ( can be replaced by movsd )
movhpd move high packed double

Added movlpd - it behaves like movlps:

Code:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

228     cycles for 100 * movsd*8
632     cycles for 100 * movlps*8
218     cycles for 100 * movq*8
218     cycles for 100 * movupd*8
225     cycles for 100 * movups*8
226     cycles for 100 * movapd*8
630     cycles for 100 * movlpd*8

225     cycles for 100 * movsd*8
634     cycles for 100 * movlps*8
217     cycles for 100 * movq*8
218     cycles for 100 * movupd*8
226     cycles for 100 * movups*8
226     cycles for 100 * movapd*8
632     cycles for 100 * movlpd*8
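
As background to the `movlps`/`movlpd` numbers, the 8-byte load forms differ in what they do to the upper half of the destination register. `movsd` and `movq` from memory zero bits 64..127, while `movlps` and `movlpd` leave them unchanged, so the load has to be merged with the register's previous contents. Whether that merge accounts for the slower timings is an assumption, not something measured in this thread. A minimal 32-bit MASM sketch, with a hypothetical `UpperHalfDemo` procedure:

```asm
.686
.xmm
.model flat, stdcall
option casemap:none

.data
MyDouble REAL8 1.0

.code
UpperHalfDemo proc
    pcmpeqd xmm0, xmm0                  ; xmm0 = all ones
    movsd   xmm0, MyDouble              ; low = 1.0, upper 64 bits zeroed
    pcmpeqd xmm1, xmm1                  ; xmm1 = all ones
    movq    xmm1, qword ptr [MyDouble]  ; low = 1.0, upper 64 bits zeroed
    pcmpeqd xmm2, xmm2                  ; xmm2 = all ones
    movlpd  xmm2, qword ptr [MyDouble]  ; low = 1.0, upper 64 bits still all ones
    pcmpeqd xmm3, xmm3                  ; xmm3 = all ones
    movlps  xmm3, qword ptr [MyDouble]  ; same bits as movlpd: upper half preserved
    ret
UpperHalfDemo endp
end
```

This is also why `movlpd` and `movsd` are interchangeable only when the upper half of the register does not matter afterwards.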

Siekmanski · September 10, 2018, 07:03:05 AM

Do you measure any difference between movaps and movups?

jj2007 · September 10, 2018, 07:20:29 AM

Yes, movaps is a tick slower. Joking aside: on modern CPUs, there is no difference between the two. Except that one of them crashes on unaligned memory access :lol:

Siekmanski · September 10, 2018, 09:37:24 AM

Yeah, I noticed the same behavior. :t

hutch-- · September 10, 2018, 09:50:17 AM

I found that movapd is not all that useful either: it requires 16-byte alignment to load a REAL8, whereas movsd and movq are just as fast and not fussy to use.

The MASM Forum
