Performance differences for MOVAPS, MOVDQA and MOVAPD

Started by jj2007, December 23, 2012, 08:29:46 PM


jj2007

Following a friendly exchange of views between Germans, I got curious and set up a little testbed for these three very similar SIMD instructions. Here are the first results:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Testing with 5000000 loops
2311 ms for movapd
2328 ms for movdqa
2298 ms for movaps

2315 ms for movapd
2338 ms for movdqa
2297 ms for movaps

2301 ms for movapd
2299 ms for movdqa
2288 ms for movaps

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
850 ms for movapd
853 ms for movdqa
776 ms for movaps

839 ms for movapd
851 ms for movdqa
776 ms for movaps

840 ms for movapd
851 ms for movdqa
776 ms for movaps

136     bytes for movapd
136     bytes for movdqa
128     bytes for movaps


Since my i5 is already a bit outdated, I'd like to see some results for really modern CPUs ;)

Here is the main loop, identical for all three algos:
  .Repeat
   movapd [edi], xmm0
   paddd xmm0, xmm1   ; 4444
   movapd [edi+16], xmm0
   paddd xmm0, xmm2   ; 44xx
   movapd [edi+32], xmm0
   paddd xmm0, xmm3   ; xx44
   movapd [edi+48], xmm0
   paddd xmm0, xmm1   ; 4444
   movapd [edi+64], xmm0
   paddd xmm0, xmm4   ; xxxx
   inc ecx
   .if Zero?
      psubd xmm0, oword ptr Sub100a
   .elseif ecx==-4
      movapd xmm2, [esi+80]
      movapd xmm3, [esi+96]
      movapd xmm4, [esi+112]
   .elseif ecx==5
      psubd xmm0, oword ptr Sub100b
      xor ecx, ecx
   .endif
   add edi, 80
  .Until edi>=edx
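
The 8-byte difference in code size (136 vs. 128 bytes) matches what the instruction encodings predict: movapd and movdqa carry a 66h prefix that movaps lacks, and the posted loop contains eight such moves (five stores plus three register reloads). The C sketch below only does that arithmetic. The array and function names are made up for illustration, and it assumes the reported byte counts cover the loop as posted.

```c
#include <assert.h>
#include <stddef.h>

/* Machine code for the same store, "mov* [edi], xmm0", in 32-bit mode. */
static const unsigned char MOVAPS_STORE[] = { 0x0F, 0x29, 0x07 };
static const unsigned char MOVAPD_STORE[] = { 0x66, 0x0F, 0x29, 0x07 };
static const unsigned char MOVDQA_STORE[] = { 0x66, 0x0F, 0x7F, 0x07 };

/* Extra code bytes in the posted loop when movaps is replaced by movapd
   or movdqa (hypothetical helper, not part of the original testbed). */
static size_t posted_loop_delta(void)
{
    const size_t moves  = 5 + 3;  /* 5 stores + 3 reloads of xmm2..xmm4 */
    const size_t prefix = sizeof MOVAPD_STORE - sizeof MOVAPS_STORE;

    /* movdqa and movapd are the same length, hence the identical sizes. */
    assert(sizeof MOVDQA_STORE == sizeof MOVAPD_STORE);
    return moves * prefix;
}
```

Calling posted_loop_delta() yields the 136 - 128 bytes seen in the results; the displacement bytes in [edi+16] etc. are the same for all three variants and therefore cancel out.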


hutch--



Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE3, SSSE3, SSE4.1)
Testing with 5000000 loops
856 ms for movapd
856 ms for movdqa
789 ms for movaps

855 ms for movapd
855 ms for movdqa
790 ms for movaps

856 ms for movapd
856 ms for movdqa
789 ms for movaps

136     bytes for movapd
136     bytes for movdqa
128     bytes for movaps


--- ok ---


Gunther

Jochen,

here are the results from my machine (Windows 7 - 64 bit)

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
668 ms for movapd
662 ms for movdqa
600 ms for movaps

667 ms for movapd
658 ms for movdqa
619 ms for movaps

668 ms for movapd
658 ms for movdqa
618 ms for movaps

136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps


--- ok ---


Gunther
You have to know the facts before you can distort them.

jj2007

> 600 ms for movaps
Gunther, that's a factor of 4 faster than my Celeron - nice machine :t

Gunther

Hi Jochen,

Quote from: jj2007 on December 23, 2012, 10:46:12 PM
> 600 ms for movaps
Gunther, that's a factor of 4 faster than my Celeron - nice machine :t

Oh yes, it is. It has enough RAM for the time being and a large hard disk, but installing the operating systems was a bit tricky.

Gunther

qWord

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
830 ms for movapd
806 ms for movdqa
744 ms for movaps

807 ms for movapd
799 ms for movdqa
720 ms for movaps

809 ms for movapd
814 ms for movdqa
746 ms for movaps

136     bytes for movapd
136     bytes for movdqa
128     bytes for movaps

I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions. However, see your test bed here for a counterexample (there is a conflict between ps and sd).
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

Hi qWord,

Quote from: qWord on December 24, 2012, 02:34:56 AM
I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions.

that's right and you can read about that in Agner Fog's manuals.

Gunther

jj2007

Quote from: qWord on December 24, 2012, 02:34:56 AM
799 ms for movdqa
720 ms for movaps

I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions. However, see your test bed here for a counterexample (there is a conflict between ps and sd).

I like your "no disadvantage" ;-)

But you are perfectly right about what you flagged in replies #9+11 of that thread:
      movlps xmm0, R8s1
      mulsd xmm0, R8s2
      movlps R8d, xmm0

is indeed slow on recent Intel CPUs compared to
      movsd xmm0, R8s1
      mulsd xmm0, R8s2
      movsd R8d, xmm0

(but zero difference on most other CPUs...)
So we all learnt something today :biggrin:
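
For readers who work in C rather than MASM, the two sequences above can be approximated with SSE intrinsics. This is only a sketch, not the thread's test code: the function names are invented, and a compiler is free to emit different move instructions than the intrinsics suggest, so the generated assembly should be inspected before timing anything. It assumes an x86 compiler with SSE2 enabled.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also provides the SSE _mm_loadl_pi family */

/* Double-typed path: _mm_load_sd / _mm_store_sd correspond to movsd. */
static double mul_typed(const double *a, const double *b)
{
    __m128d x = _mm_load_sd(a);
    x = _mm_mul_sd(x, _mm_load_sd(b));
    double r;
    _mm_store_sd(&r, x);
    return r;
}

/* Mixed path: _mm_loadl_pi / _mm_storel_pi correspond to movlps, i.e.
   single-precision moves wrapped around a double-precision multiply. */
static double mul_mixed(const double *a, const double *b)
{
    __m128 lo = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)a);
    __m128d x = _mm_mul_sd(_mm_castps_pd(lo), _mm_load_sd(b));
    double r;
    _mm_storel_pi((__m64 *)&r, _mm_castpd_ps(x));
    return r;
}

/* Both paths produce the same value; only the instruction type differs. */
static void check_paths_agree(void)
{
    const double a = 1.5, b = 4.0;
    assert(mul_typed(&a, &b) == 6.0);
    assert(mul_mixed(&a, &b) == 6.0);
}
```

Since both functions return identical results, any timing difference between them comes from the move instructions chosen, which is the effect described in this post.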


frktons

My tests:
Win 7/64
Quote
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE3, SSSE3)
Testing with 5000000 loops
1105 ms for movapd
1100 ms for movdqa
998 ms for movaps

1281 ms for movapd
1080 ms for movdqa
997 ms for movaps

1282 ms for movapd
1079 ms for movdqa
1199 ms for movaps

136     bytes for movapd
136     bytes for movdqa
128     bytes for movaps


--- ok ---

This test resembles the one we did a couple of years ago
for the fastest clear buffer.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

Farabi

For clearing a buffer or moving memory contents, which is faster?
http://farabidatacenter.url.ph/MySoftware/
My 3D Game Engine Demo.

Contact me at Whatsapp: 6283818314165

frktons

Quote from: Farabi on December 24, 2012, 02:44:52 PM
For clearing a buffer or moving memory contents, which is faster?

If the data size is under 1 MB, then 16-byte-aligned data moves faster with
movdqa/movaps. If you want a more detailed test, have a look
at the old forum and search for "fastest clear buffer".
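
As a rough illustration of clearing an aligned buffer with 16-byte stores, here is a minimal C sketch using SSE2 intrinsics. The function name clear_aligned16 is hypothetical, _mm_store_si128 is the intrinsic that corresponds to movdqa (the compiler may pick another aligned move), and the sketch assumes an x86 target and that both the address and the size are multiples of 16. It makes no claim about being the fastest variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Zero `bytes` bytes starting at `dst` using aligned 16-byte stores.
   Requires dst to be 16-byte aligned and bytes to be a multiple of 16. */
static void clear_aligned16(void *dst, size_t bytes)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert((bytes & 15u) == 0);

    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(p + i, zero);  /* aligned store, faults if misaligned */
}
```

A buffer declared as an array of __m128i, or one obtained from an aligned allocator, satisfies the alignment requirement.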

Farabi

Most of the instructions managed about 5 million operations per second on average. I think that, by current standards, Intel's floating-point performance is poor compared to what nVidia can do.

phaap

Nice comparison, but in a heterogeneous context, e.g. in my current project (an MD4 hash algorithm), simply replacing 'movdqa' with 'movaps' leads to a DECREASE in performance...
So it's nice to know what the individual instructions can do, but in real-world applications this knowledge often isn't enough... I'll keep this fact in mind, though ;)

qWord

Quote from: phaap on January 18, 2013, 01:28:10 AM
Nice comparison, but in a heterogeneous context, e.g. in my current project (an MD4 hash algorithm), simply replacing 'movdqa' with 'movaps' leads to a DECREASE in performance...
So it's nice to know what the individual instructions can do, but in real-world applications this knowledge often isn't enough... I'll keep this fact in mind, though ;)

It would be really nice if you could upload those two tests!