Author Topic: Has anyone timed the instructions loading a double into an xmm register.  (Read 1566 times)

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7528
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
There seems to be a wide range of instructions to choose from when loading a single double value into an xmm register.

movsd
movapd
movupd
movq

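The mnemonic "movapd" always reads 16 bytes and requires a 16-byte-aligned address, so using it to load an 8-byte (REAL8) value into a register means aligning the variable and also reading the 8 bytes that follow it.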

The task here is calculations and not streaming multi-media or similar. I have got all of the scalar instructions going but there are more instructions for packed double operations that would be useful.

LATER : Nothing meaningful in the difference.  :biggrin:
« Last Edit: September 10, 2018, 02:07:29 AM by hutch-- »
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

jj2007

  • Member
  • *****
  • Posts: 10536
  • Assembler is fun ;-)
    • MasmBasic
Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #1 on: September 10, 2018, 02:18:21 AM »
I see some differences - surprisingly, movlps is slow, movups is fast ::)
Code:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

95      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
86      cycles for 100 * movq*3
83      cycles for 100 * movupd*3
57      cycles for 100 * movups
81      cycles for 100 * movapd

88      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
88      cycles for 100 * movq*3
88      cycles for 100 * movupd*3
65      cycles for 100 * movups
86      cycles for 100 * movapd

90      cycles for 100 * movsd*3
225     cycles for 100 * movlps*3
89      cycles for 100 * movq*3
83      cycles for 100 * movupd*3
59      cycles for 100 * movups
83      cycles for 100 * movapd

86      cycles for 100 * movsd*3
226     cycles for 100 * movlps*3
83      cycles for 100 * movq*3
88      cycles for 100 * movupd*3
66      cycles for 100 * movups
86      cycles for 100 * movapd

83      cycles for 100 * movsd*3
225     cycles for 100 * movlps*3
87      cycles for 100 * movq*3
85      cycles for 100 * movupd*3
59      cycles for 100 * movups
88      cycles for 100 * movapd

24      bytes for movsd*3
21      bytes for movlps*3
24      bytes for movq*3
24      bytes for movupd*3
21      bytes for movups
24      bytes for movapd

All tests are organised like this:
Code:
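align_64
TestA proc
  mov ebx, AlgoLoops-1      ; loop counter, e.g. 100 iterations
  align 4
  .Repeat
    movsd xmm0, MyDouble    ; three loads of the same REAL8 per iteration
    movsd xmm1, MyDouble
    movsd xmm2, MyDouble
    dec ebx
  .Until Sign?              ; exit once ebx goes negative
  ret
TestA endp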

hutch--

Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #2 on: September 10, 2018, 02:26:32 AM »
Noting that this Haswell is a pig to get any timings on, this is the result for 5 billion iterations of memory loads to xmm registers. This is with a high priority class and a 100 ms delay between each test.

1500 movapd
1484 movsd
1500 movq
1516 movapd
1532 movsd
1453 movq
1485 movapd
1484 movsd
1484 movq
1516 movapd
1454 movsd
1437 movq
Press any key to continue...

jj2007

Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #3 on: September 10, 2018, 03:21:03 AM »
Interesting. Can you post the exe? I've tried to load xmm0 ... xmm7 in the loop, and the picture changes - the advantage for movups is gone:
Code:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

290     cycles for 100 * movsd*8
773     cycles for 100 * movlps*8
234     cycles for 100 * movq*8
222     cycles for 100 * movupd*8
223     cycles for 100 * movups*8
223     cycles for 100 * movapd*8

224     cycles for 100 * movsd*8
636     cycles for 100 * movlps*8
223     cycles for 100 * movq*8
224     cycles for 100 * movupd*8
226     cycles for 100 * movups*8
225     cycles for 100 * movapd*8

224     cycles for 100 * movsd*8
629     cycles for 100 * movlps*8
226     cycles for 100 * movq*8
226     cycles for 100 * movupd*8
223     cycles for 100 * movups*8
223     cycles for 100 * movapd*8

222     cycles for 100 * movsd*8
630     cycles for 100 * movlps*8
222     cycles for 100 * movq*8
221     cycles for 100 * movupd*8
224     cycles for 100 * movups*8
225     cycles for 100 * movapd*8

222     cycles for 100 * movsd*8
631     cycles for 100 * movlps*8
225     cycles for 100 * movq*8
225     cycles for 100 * movupd*8
225     cycles for 100 * movups*8
222     cycles for 100 * movapd*8

64      bytes for movsd*8
56      bytes for movlps*8
64      bytes for movq*8
64      bytes for movupd*8
56      bytes for movups*8
64      bytes for movapd*8

Siekmanski

  • Member
  • *****
  • Posts: 2311
Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #4 on: September 10, 2018, 04:04:30 AM »
If it is for moving doubles, why are movups and movlps in the test?

movups moves 4 packed singles
movlps moves 2 low packed singles
movhps moves 2 high packed singles

these 2 are not in the test,
movlpd  move low packed double ( can be replaced by movsd )
movhpd  move high packed double
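
A minimal sketch of how these loads differ in what they leave in the register, written in C with SSE2 intrinsics as a stand-in for the mnemonics (the helper names `lanes_t`, `to_lanes` and `load_like_*` are made up for illustration). The movsd and movq memory forms zero the upper 64 bits, while movlpd and movhpd merge the loaded double into the old register contents. That merge may be related to the slower movlps/movlpd timings above, but nothing in this thread confirms it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* The two 64-bit lanes of an xmm register, viewed as doubles. */
typedef struct { double lo, hi; } lanes_t;

static lanes_t to_lanes(__m128d v) {
    double tmp[2];
    lanes_t r;
    _mm_storeu_pd(tmp, v);          /* unaligned store, no alignment needed */
    r.lo = tmp[0];
    r.hi = tmp[1];
    return r;
}

/* movsd xmm, m64: loads the double, zeroes the upper lane. */
lanes_t load_like_movsd(const double *p) {
    return to_lanes(_mm_load_sd(p));
}

/* movq xmm, m64: same 64-bit load with zeroed upper lane, integer flavour. */
lanes_t load_like_movq(const double *p) {
    __m128i q = _mm_loadl_epi64((const __m128i *)p);
    return to_lanes(_mm_castsi128_pd(q));
}

/* movlpd xmm, m64: replaces the low lane, keeps the old upper lane. */
lanes_t load_like_movlpd(double old_lo, double old_hi, const double *p) {
    __m128d old = _mm_set_pd(old_hi, old_lo);   /* arguments are (high, low) */
    return to_lanes(_mm_loadl_pd(old, p));
}

/* movhpd xmm, m64: replaces the upper lane, keeps the old low lane. */
lanes_t load_like_movhpd(double old_lo, double old_hi, const double *p) {
    __m128d old = _mm_set_pd(old_hi, old_lo);
    return to_lanes(_mm_loadh_pd(old, p));
}
```

So movlpd only stands in for movsd when the upper half of the register does not matter, since the two leave different values there.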
Creative coders use backward thinking techniques as a strategy.

jj2007

Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #5 on: September 10, 2018, 05:19:18 AM »
Quote
If it is for moving doubles, why are movups and movlps in the test?
movlps moves a double, though not officially ;-)

Quote
these 2 are not in the test,
movlpd  move low packed double ( can be replaced by movsd )
movhpd  move high packed double
Added movlpd - it behaves like movlps:
Code:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

228     cycles for 100 * movsd*8
632     cycles for 100 * movlps*8
218     cycles for 100 * movq*8
218     cycles for 100 * movupd*8
225     cycles for 100 * movups*8
226     cycles for 100 * movapd*8
630     cycles for 100 * movlpd*8

225     cycles for 100 * movsd*8
634     cycles for 100 * movlps*8
217     cycles for 100 * movq*8
218     cycles for 100 * movupd*8
226     cycles for 100 * movups*8
226     cycles for 100 * movapd*8
632     cycles for 100 * movlpd*8

Siekmanski

Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #6 on: September 10, 2018, 07:03:05 AM »
 :biggrin:

Do you measure any difference between movaps and movups?

jj2007

Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #7 on: September 10, 2018, 07:20:29 AM »
Yes, movaps is a tick slower. Jokes apart: In modern CPUs, there is no difference between the two. Except that one of them crashes for unaligned memory access :lol:

Siekmanski

Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #8 on: September 10, 2018, 09:37:24 AM »
Yeah, I noticed the same behavior.  :t

hutch--

Re: Has anyone timed the instructions loading a double into an xmm register.
« Reply #9 on: September 10, 2018, 09:50:17 AM »
I found that movapd is not all that useful either: it requires 16-byte alignment to load a REAL8, whereas movsd and movq are just as fast and not fussy to use.
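
As a rough illustration of the alignment point, here is a sketch in C with SSE2 intrinsics rather than MASM (the helper names `is_aligned16`, `load_scalar`, `load_pair_aligned` and `load_pair_sum` are hypothetical). It assumes `_mm_load_pd` compiles to an alignment-checking load such as movapd, which is the usual behaviour but depends on the compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Nonzero if p sits on a 16-byte boundary, as an aligned 16-byte load needs. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* movsd-style load of a single double: no 16-byte alignment requirement. */
double load_scalar(const double *p) {
    return _mm_cvtsd_f64(_mm_load_sd(p));
}

/* movapd-style load of two doubles: the caller must guarantee alignment,
   otherwise the instruction faults. The assert catches that in debug builds. */
__m128d load_pair_aligned(const double *p) {
    assert(is_aligned16(p));
    return _mm_load_pd(p);
}

/* Sum of two adjacent doubles, using the aligned load only when the address
   allows it and the unaligned (movupd-style) load otherwise. */
double load_pair_sum(const double *p) {
    __m128d v = is_aligned16(p) ? load_pair_aligned(p) : _mm_loadu_pd(p);
    double out[2];
    _mm_storeu_pd(out, v);
    return out[0] + out[1];
}
```

For a lone REAL8 the scalar load is the simplest option, since it neither needs the alignment nor touches the 8 bytes after the value.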