Has anyone timed the instructions loading a double into an xmm register.

Started by hutch--, September 09, 2018, 11:36:14 PM


hutch--

There seems to be a wide range of instructions to choose from when loading a single double value into an xmm register.

movsd
movapd
movupd
movq

The mnemonic "movapd" requires 16-byte alignment when reading an 8-byte (REAL8) value into a register, since it always performs a full 16-byte load.

The task here is calculations, not streaming multimedia or similar. I have all of the scalar instructions going, but there are more instructions for packed double operations that would be useful.
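
For reference, the candidates side by side; a minimal sketch assuming a 16-byte aligned REAL8 variable MyDouble (a hypothetical name) padded to 16 bytes so the packed loads stay in bounds:

.data
align 16
MyDouble REAL8 1234.5678              ; aligned so that movapd is legal
         REAL8 0.0                    ; padding: the packed loads read a full 16 bytes

.code
    movsd  xmm0, MyDouble             ; scalar load, upper qword zeroed
    movq   xmm1, MyDouble             ; same effect as movsd from memory
    movupd xmm2, xmmword ptr MyDouble ; unaligned 16-byte load, value in the low qword
    movapd xmm3, xmmword ptr MyDouble ; aligned 16-byte load, faults if misaligned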

LATER : Nothing meaningful in the difference.  :biggrin:

jj2007

I see some differences - surprisingly, movlps is slow, movups is fast ::)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

95      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
86      cycles for 100 * movq*3
83      cycles for 100 * movupd*3
57      cycles for 100 * movups
81      cycles for 100 * movapd

88      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
88      cycles for 100 * movq*3
88      cycles for 100 * movupd*3
65      cycles for 100 * movups
86      cycles for 100 * movapd

90      cycles for 100 * movsd*3
225     cycles for 100 * movlps*3
89      cycles for 100 * movq*3
83      cycles for 100 * movupd*3
59      cycles for 100 * movups
83      cycles for 100 * movapd

86      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
83      cycles for 100 * movq*3
88      cycles for 100 * movupd*3
66      cycles for 100 * movups
86      cycles for 100 * movapd

83      cycles for 100 * movsd*3
225     cycles for 100 * movlps*3
87      cycles for 100 * movq*3
85      cycles for 100 * movupd*3
59      cycles for 100 * movups
88      cycles for 100 * movapd

24      bytes for movsd*3
21      bytes for movlps*3
24      bytes for movq*3
24      bytes for movupd*3
21      bytes for movups
24      bytes for movapd


All tests are organised like this:

align_64
TestA proc
  mov ebx, AlgoLoops-1          ; loop e.g. 100x
  align 4
  .Repeat
        movsd xmm0, MyDouble    ; the load under test, three times over
        movsd xmm1, MyDouble
        movsd xmm2, MyDouble
        dec ebx
  .Until Sign?                  ; loop until ebx goes negative
  ret
TestA endp
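
The timing wrapper isn't shown; the cycle counts above are presumably taken around the call, roughly like this sketch (not the actual macro):

    rdtsc                       ; read the time-stamp counter
    mov esi, eax                ; keep the low dword of the start count
    call TestA                  ; run the 100-iteration loop
    rdtsc
    sub eax, esi                ; elapsed cycles; the low dword suffices for short runs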

hutch--

Noting that this Haswell is a pig to get any timings on, these are the results for 5 billion iterations of memory loads into xmm registers, run at high priority class with a 100 ms delay between each test.
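
The harness itself wasn't posted; a minimal sketch of that setup, with the loop count and the instruction under test as assumptions, might look like:

    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    invoke Sleep, 100           ; 100 ms settle time before each test
    invoke GetTickCount
    mov esi, eax                ; start time in milliseconds
    mov ecx, 500000000          ; 500 million loops x 10 loads = 5 billion
  align 16
  @@:
    REPEAT 10
      movq xmm0, MyDouble       ; instruction under test
    ENDM
    sub ecx, 1
    jnz @B
    invoke GetTickCount
    sub eax, esi                ; elapsed milliseconds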

1500 movapd
1484 movsd
1500 movq
1516 movapd
1532 movsd
1453 movq
1485 movapd
1484 movsd
1484 movq
1516 movapd
1454 movsd
1437 movq
Press any key to continue...

jj2007

Interesting. Can you post the exe? I've tried to load xmm0 ... xmm7 in the loop (see the sketch after the results below), and the picture changes - the advantage for movups is gone:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

290     cycles for 100 * movsd*8
773     cycles for 100 * movlps*8
234     cycles for 100 * movq*8
222     cycles for 100 * movupd*8
223     cycles for 100 * movups*8
223     cycles for 100 * movapd*8

224     cycles for 100 * movsd*8
636     cycles for 100 * movlps*8
223     cycles for 100 * movq*8
224     cycles for 100 * movupd*8
226     cycles for 100 * movups*8
225     cycles for 100 * movapd*8

224     cycles for 100 * movsd*8
629     cycles for 100 * movlps*8
226     cycles for 100 * movq*8
226     cycles for 100 * movupd*8
223     cycles for 100 * movups*8
223     cycles for 100 * movapd*8

222     cycles for 100 * movsd*8
630     cycles for 100 * movlps*8
222     cycles for 100 * movq*8
221     cycles for 100 * movupd*8
224     cycles for 100 * movups*8
225     cycles for 100 * movapd*8

222     cycles for 100 * movsd*8
631     cycles for 100 * movlps*8
225     cycles for 100 * movq*8
225     cycles for 100 * movupd*8
225     cycles for 100 * movups*8
222     cycles for 100 * movapd*8

64      bytes for movsd*8
56      bytes for movlps*8
64      bytes for movq*8
64      bytes for movupd*8
56      bytes for movups*8
64      bytes for movapd*8
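
The modified loop body, sketched with the same names as the earlier skeleton:

  .Repeat
        movsd xmm0, MyDouble
        movsd xmm1, MyDouble
        movsd xmm2, MyDouble
        movsd xmm3, MyDouble
        movsd xmm4, MyDouble
        movsd xmm5, MyDouble
        movsd xmm6, MyDouble
        movsd xmm7, MyDouble    ; eight loads per iteration instead of three
        dec ebx
  .Until Sign?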

Siekmanski

If it is for moving doubles, why are movups and movlps in the test?

movups moves 4 packed singles
movlps moves 2 low packed singles
movhps moves 2 high packed singles

These two are not in the test:
movlpd  move low packed double ( can be replaced by movsd )
movhpd  move high packed double
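
For illustration, pairing the two fills one register with two doubles (TwoDoubles is a hypothetical variable):

.data
TwoDoubles REAL8 1.0, 2.0

.code
    movlpd xmm0, TwoDoubles     ; low qword  = 1.0, upper half unchanged
    movhpd xmm0, TwoDoubles+8   ; high qword = 2.0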
Creative coders use backward thinking techniques as a strategy.

jj2007

Quote from: Siekmanski on September 10, 2018, 04:04:30 AM
If it is for moving doubles, why are movups and movlps in the test?
movlps moves a double, though not officially ;-)
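
In other words (same hypothetical MyDouble as above):

    movlps xmm0, MyDouble       ; documented for two packed singles, but it simply moves
                                ; 8 bytes into the low qword, upper half unchanged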

Quote from: Siekmanski
These two are not in the test:
movlpd  move low packed double ( can be replaced by movsd )
movhpd  move high packed double
Added movlpd - it behaves like movlps:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

228     cycles for 100 * movsd*8
632     cycles for 100 * movlps*8
218     cycles for 100 * movq*8
218     cycles for 100 * movupd*8
225     cycles for 100 * movups*8
226     cycles for 100 * movapd*8
630     cycles for 100 * movlpd*8

225     cycles for 100 * movsd*8
634     cycles for 100 * movlps*8
217     cycles for 100 * movq*8
218     cycles for 100 * movupd*8
226     cycles for 100 * movups*8
226     cycles for 100 * movapd*8
632     cycles for 100 * movlpd*8


jj2007

Yes, movaps is a tick slower. Jokes aside: on modern CPUs there is no speed difference between the two, except that one of them crashes on unaligned memory access :lol:
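
A two-instruction sketch of the difference, assuming esi holds an address that is not 16-byte aligned:

    movups xmm0, xmmword ptr [esi]  ; works at any address
    movaps xmm0, xmmword ptr [esi]  ; general-protection fault unless 16-byte aligned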


hutch--

I found that movapd is not all that useful either: it requires 16-byte alignment to load a REAL8, whereas movsd and movq are just as fast and not fussy about alignment.