Performance differences for MOVAPS, MOVDQA and MOVAPD

jj2007 · December 23, 2012, 08:29:46 PM

Following a friendly exchange of views between Germans, I got curious and set up a little testbed for these three very similar SIMD instructions. Here are the first results:

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Testing with 5000000 loops
2311 ms for movapd
2328 ms for movdqa
2298 ms for movaps

2315 ms for movapd
2338 ms for movdqa
2297 ms for movaps

2301 ms for movapd
2299 ms for movdqa
2288 ms for movaps

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
850 ms for movapd
853 ms for movdqa
776 ms for movaps

839 ms for movapd
851 ms for movdqa
776 ms for movaps

840 ms for movapd
851 ms for movdqa
776 ms for movaps

136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps

Since my i5 is already a bit outdated, I'd like to see some results for really modern CPUs ;)

Here is the main loop, identical for all three algos:
.Repeat
   movapd [edi], xmm0
   paddd xmm0, xmm1   ; 4444
   movapd [edi+16], xmm0
   paddd xmm0, xmm2   ; 44xx
   movapd [edi+32], xmm0
   paddd xmm0, xmm3   ; xx44
   movapd [edi+48], xmm0
   paddd xmm0, xmm1   ; 4444
   movapd [edi+64], xmm0
   paddd xmm0, xmm4   ; xxxx
   inc ecx
   .if Zero?
      psubd xmm0, oword ptr Sub100a
   .elseif ecx==-4
      movapd xmm2, [esi+80]
      movapd xmm3, [esi+96]
      movapd xmm4, [esi+112]
   .elseif ecx==5
      psubd xmm0, oword ptr Sub100b
      xor ecx, ecx
   .endif
   add edi, 80
.Until edi>=edx
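
A note on the size figures: assuming the byte counts are the code size of each algo, the 8-byte gap between 136 and 128 is what the encodings predict. movaps carries no prefix, while movapd and movdqa each need a 66h prefix, and the loop contains eight 128-bit moves (five stores, three loads). The byte values in the comments below are hand-assembled from the opcode tables, not taken from the testbed's listing, so treat them as an illustration only:

```asm
; Hand-assembled encodings (illustration, not output from the testbed)
movaps [edi+16], xmm0     ; 0F 29 47 10        4 bytes
movapd [edi+16], xmm0     ; 66 0F 29 47 10     5 bytes
movdqa [edi+16], xmm0     ; 66 0F 7F 47 10     5 bytes

movaps xmm2, [esi+80]     ; 0F 28 56 50        4 bytes
movapd xmm2, [esi+80]     ; 66 0F 28 56 50     5 bytes
movdqa xmm2, [esi+80]     ; 66 0F 6F 56 50     5 bytes
```

One byte saved per move, times eight moves, gives the 8 bytes.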

hutch-- · December 23, 2012, 09:25:31 PM

Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE3, SSSE3, SSE4.1)
Testing with 5000000 loops
856 ms for movapd
856 ms for movdqa
789 ms for movaps

855 ms for movapd
855 ms for movdqa
790 ms for movaps

856 ms for movapd
856 ms for movdqa
789 ms for movaps

136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps

--- ok ---

Gunther · December 23, 2012, 10:40:39 PM

Jochen,

here are the results from my machine (Windows 7, 64-bit):

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
668 ms for movapd
662 ms for movdqa
600 ms for movaps

667 ms for movapd
658 ms for movdqa
619 ms for movaps

668 ms for movapd
658 ms for movdqa
618 ms for movaps

136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps


--- ok ---

Gunther

jj2007 · December 23, 2012, 10:46:12 PM

> 600 ms for movaps
Gunther, that's a factor of 4 faster than my Celeron - nice machine :t

Gunther · December 23, 2012, 10:51:14 PM

Hi Jochen,

Quote from: jj2007 on December 23, 2012, 10:46:12 PM
> 600 ms for movaps
Gunther, that's a factor of 4 faster than my Celeron - nice machine :t

Oh yes, it is. It has enough RAM for the foreseeable future and a large hard disk, but installing the operating systems was a bit tricky.

Gunther

qWord · December 24, 2012, 02:34:56 AM

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
830 ms for movapd
806 ms for movdqa
744 ms for movaps

807 ms for movapd
799 ms for movdqa
720 ms for movaps

809 ms for movapd
814 ms for movdqa
746 ms for movaps

136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps

I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions. However, see your test bed here for a counterexample (there is a conflict between ps and sd).

Gunther · December 24, 2012, 02:46:05 AM

Hi qWord,

Quote from: qWord on December 24, 2012, 02:34:56 AM
I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions.

that's right and you can read about that in Agner Fog's manuals.

Gunther

jj2007 · December 24, 2012, 03:28:09 AM

Quote from: qWord on December 24, 2012, 02:34:56 AM
799 ms for movdqa
720 ms for movaps
I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions. However, see your test bed here for a counterexample (there is a conflict between ps and sd).

I like your "no disadvantage" ;-)

But you are perfectly right about what you flagged in replies #9+11 of that thread:
      movlps xmm0, R8s1
      mulsd xmm0, R8s2
      movlps R8d, xmm0
is indeed slow on recent Intel CPUs compared to
      movsd xmm0, R8s1
      mulsd xmm0, R8s2
      movsd R8d, xmm0
(but zero difference on most other CPUs...)
So we all learnt something today

dedndave · December 24, 2012, 06:49:57 AM

i thought we wanted to move 128 bits :P

frktons · December 24, 2012, 08:50:09 AM

My tests:
Win 7/64

Quote
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE3, SSSE3)
Testing with 5000000 loops
1105 ms for movapd
1100 ms for movdqa
998 ms for movaps

1281 ms for movapd
1080 ms for movdqa
997 ms for movaps

1282 ms for movapd
1079 ms for movdqa
1199 ms for movaps

136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps

--- ok ---

This test resembles the one we did a couple of years ago
for the fastest clear buffer.

Farabi · December 24, 2012, 02:44:52 PM

For clearing a buffer or moving memory contents, which is faster?

frktons · December 24, 2012, 02:49:10 PM

Quote from: Farabi on December 24, 2012, 02:44:52 PM
For clearing a buffer or moving memory contents, which is faster?

If the data size is under 1 MB, then 16-byte-aligned data moves faster with
movdqa/movaps. If you want a more detailed test, have a look
at the old forum and search for "fastest clear buffer".
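
For reference, a minimal sketch of such an aligned clear loop in MASM syntax. The name ClearBuf and its parameters are made up for illustration; it assumes the buffer is 16-byte aligned and the byte count is a non-zero multiple of 64. It is not the code from the old "fastest clear buffer" thread, and it has not been timed:

```asm
.686
.xmm
.model flat, stdcall
option casemap:none

.code
; ClearBuf (hypothetical): zero nBytes bytes starting at pBuf.
; Assumptions: pBuf is 16-byte aligned, nBytes is a non-zero multiple of 64.
ClearBuf proc uses edi pBuf:DWORD, nBytes:DWORD
    mov edi, pBuf
    mov ecx, nBytes
    xorps xmm0, xmm0            ; xmm0 = all zero bits
@@:
    movaps [edi], xmm0          ; four aligned 16-byte stores per pass
    movaps [edi+16], xmm0
    movaps [edi+32], xmm0
    movaps [edi+48], xmm0
    add edi, 64
    sub ecx, 64
    jnz @B                      ; repeat until the count reaches zero
    ret
ClearBuf endp

end
```

Swapping movaps for movdqa changes only the encoding (one extra prefix byte per store); a misaligned pBuf would fault with either instruction.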

Farabi · December 29, 2012, 04:16:01 PM

Most of the instructions managed about 5 million operations per second on average. I think that, by current standards, Intel's floating-point performance is poor compared to what nVidia can do.

phaap · January 18, 2013, 01:28:10 AM

Nice comparison, but things can look different in a heterogeneous context: in my current project (an MD4 hash algorithm), simply replacing 'movdqa' with 'movaps' leads to a DECREASE in performance...
So it's nice to know what the individual instructions can do, but in real-world applications this knowledge often isn't enough... but I'll keep this fact in mind ;)

qWord · January 18, 2013, 01:55:09 AM

Quote from: phaap on January 18, 2013, 01:28:10 AM
Nice comparison, but things can look different in a heterogeneous context: in my current project (an MD4 hash algorithm), simply replacing 'movdqa' with 'movaps' leads to a DECREASE in performance...
So it's nice to know what the individual instructions can do, but in real-world applications this knowledge often isn't enough... but I'll keep this fact in mind ;)

It would be really nice if you could upload those two tests!

The MASM Forum
