The MASM Forum

General => The Laboratory => Topic started by: jj2007 on December 23, 2012, 08:29:46 PM

Title: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: jj2007 on December 23, 2012, 08:29:46 PM
Following a friendly exchange of views between Germans (http://masm32.com/board/index.php?topic=1123.msg10809#msg10809), I got curious and set up a little testbed for these three very similar SIMD instructions. Here are the first results:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Testing with 5000000 loops
2311 ms for movapd
2328 ms for movdqa
2298 ms for movaps

2315 ms for movapd
2338 ms for movdqa
2297 ms for movaps

2301 ms for movapd
2299 ms for movdqa
2288 ms for movaps

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
850 ms for movapd
853 ms for movdqa
776 ms for movaps

839 ms for movapd
851 ms for movdqa
776 ms for movaps

840 ms for movapd
851 ms for movdqa
776 ms for movaps

136     bytes for movapd
136     bytes for movdqa
128     bytes for movaps


Since my i5 is already a bit outdated, I'd like to see some results for really modern CPUs ;)

Here is the main loop, identical for all three algos apart from the move instruction (shown here with movapd):
  .Repeat
    movapd [edi], xmm0
    paddd xmm0, xmm1   ; 4444
    movapd [edi+16], xmm0
    paddd xmm0, xmm2   ; 44xx
    movapd [edi+32], xmm0
    paddd xmm0, xmm3   ; xx44
    movapd [edi+48], xmm0
    paddd xmm0, xmm1   ; 4444
    movapd [edi+64], xmm0
    paddd xmm0, xmm4   ; xxxx
    inc ecx
    .if Zero?
      psubd xmm0, oword ptr Sub100a
    .elseif ecx==-4
      movapd xmm2, [esi+80]
      movapd xmm3, [esi+96]
      movapd xmm4, [esi+112]
    .elseif ecx==5
      psubd xmm0, oword ptr Sub100b
      xor ecx, ecx
    .endif
    add edi, 80
  .Until edi>=edx
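
A note on the 136 vs. 128 byte figures: the 8-byte gap is consistent with the instruction encodings. movaps has no mandatory prefix (0F 28 / 0F 29), while movapd and movdqa each carry a 66h prefix (66 0F 28 / 66 0F 29 and 66 0F 6F / 66 0F 7F). The small Python sketch below just does that arithmetic. It assumes that the reported sizes cover the loop as posted and that all eight moves in it (five stores, three loads) are swapped between variants; neither assumption is confirmed in the thread.

```python
# Opcode length in bytes, excluding ModRM and displacement (which are the
# same for all three instructions with identical operands):
#   movaps -> 0F 28/29      (2 bytes)
#   movapd -> 66 0F 28/29   (3 bytes)
#   movdqa -> 66 0F 6F/7F   (3 bytes)
OPCODE_LEN = {"movaps": 2, "movapd": 3, "movdqa": 3}

# Assumption: the posted loop holds 5 stores plus 3 loads of the tested move.
MOVES_IN_LOOP = 5 + 3


def extra_bytes(mnemonic, baseline="movaps"):
    """Code-size difference versus the baseline for the whole loop."""
    return (OPCODE_LEN[mnemonic] - OPCODE_LEN[baseline]) * MOVES_IN_LOOP


for name in ("movapd", "movdqa", "movaps"):
    print(name, extra_bytes(name))
```

Under those assumptions the prefix alone accounts for the difference between 136 and 128 bytes.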

Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: hutch-- on December 23, 2012, 09:25:31 PM


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE3, SSSE3, SSE4.1)
Testing with 5000000 loops
856 ms for movapd
856 ms for movdqa
789 ms for movaps

855 ms for movapd
855 ms for movdqa
790 ms for movaps

856 ms for movapd
856 ms for movdqa
789 ms for movaps

136     bytes for movapd
136     bytes for movdqa
128     bytes for movaps


--- ok ---

Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: Gunther on December 23, 2012, 10:40:39 PM
Jochen,

here are the results from my machine (Windows 7, 64-bit):

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
668 ms for movapd
662 ms for movdqa
600 ms for movaps

667 ms for movapd
658 ms for movdqa
619 ms for movaps

668 ms for movapd
658 ms for movdqa
618 ms for movaps

136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps


--- ok ---


Gunther
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: jj2007 on December 23, 2012, 10:46:12 PM
> 600 ms for movaps
Gunther, that's a factor of 4 faster than my Celeron - nice machine :t
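
For reference, the ratio can be checked against the movaps timings posted earlier in this thread. This is only a sketch: the variable names are made up for illustration, and the numbers are copied from the Celeron M 420 and i7-3770 results above.

```python
# movaps timings in ms, copied from the results posted in this thread
celeron_m420_ms = [2298, 2297, 2288]
i7_3770_ms = [600, 619, 618]


def mean(values):
    return sum(values) / len(values)


# How many times faster the i7-3770 ran the movaps loop on average
ratio = mean(celeron_m420_ms) / mean(i7_3770_ms)
print(f"speedup: {ratio:.2f}x")
```

Averaged over the three runs the speedup comes out a little under 4; the single best run (600 ms) gets closest to the quoted factor.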
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: Gunther on December 23, 2012, 10:51:14 PM
Hi Jochen,

Quote from: jj2007 on December 23, 2012, 10:46:12 PM
> 600 ms for movaps
Gunther, that's a factor of 4 faster than my Celeron - nice machine :t

Oh yes, it is. It has enough RAM for the foreseeable future and a large hard disk, but installing the operating systems was a bit tricky.

Gunther
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: qWord on December 24, 2012, 02:34:56 AM
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
830 ms for movapd
806 ms for movdqa
744 ms for movaps

807 ms for movapd
799 ms for movdqa
720 ms for movaps

809 ms for movapd
814 ms for movdqa
746 ms for movaps

136     bytes for movapd
136     bytes for movdqa
128     bytes for movaps

I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions. However, see your test bed here (http://www.masmforum.com/board/index.php?topic=18425.0) for a counterexample (there is a conflict between ps and sd).
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: Gunther on December 24, 2012, 02:46:05 AM
Hi qWord,

Quote from: qWord on December 24, 2012, 02:34:56 AM
I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions.

That's right, and you can read about it in Agner Fog's manuals.

Gunther

Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: jj2007 on December 24, 2012, 03:28:09 AM
Quote from: qWord on December 24, 2012, 02:34:56 AM
799 ms for movdqa
720 ms for movaps

I must admit that integer instructions seem to suffer no disadvantage when combined with wrongly typed move instructions. However, see your test bed here (http://www.masmforum.com/board/index.php?topic=18425.0) for a counterexample (there is a conflict between ps and sd).

I like your "no disadvantage" ;-)

But you are perfectly right about what you flagged in replies #9 and #11 of that thread:
      movlps xmm0, R8s1
      mulsd xmm0, R8s2
      movlps R8d, xmm0

is indeed slow on recent Intel CPUs compared to
      movsd xmm0, R8s1
      mulsd xmm0, R8s2
      movsd R8d, xmm0

(but zero difference on most other CPUs...)
So we all learnt something today :biggrin:
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: dedndave on December 24, 2012, 06:49:57 AM
i thought we wanted to move 128 bits   :P
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: frktons on December 24, 2012, 08:50:09 AM
My tests:
Win 7/64
Quote
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE3, SSSE3)
Testing with 5000000 loops
1105 ms for movapd
1100 ms for movdqa
998 ms for movaps

1281 ms for movapd
1080 ms for movdqa
997 ms for movaps

1282 ms for movapd
1079 ms for movdqa
1199 ms for movaps

136     bytes for movapd
136     bytes for movdqa
128     bytes for movaps


--- ok ---

This test resembles the one we did a couple of years ago
for the fastest clear buffer.
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: Farabi on December 24, 2012, 02:44:52 PM
For clearing a buffer or moving memory contents, which is faster?
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: frktons on December 24, 2012, 02:49:10 PM
Quote from: Farabi on December 24, 2012, 02:44:52 PM
For clearing a buffer or moving memory contents, which is faster?

If the data size is under 1 MB, then 16-byte-aligned data moves faster with
movdqa/movaps. If you want a more detailed test, have a look
at the old forum and search for "fastest clear buffer".
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: Farabi on December 29, 2012, 04:16:01 PM
Most of the instructions averaged about 5 million operations per second. I think that, by current standards, Intel's floating-point performance is poor compared to what nVidia can do.
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: phaap on January 18, 2013, 01:28:10 AM
Nice comparison, but when it comes to a heterogeneous context, e.g. my current project (an MD4 hash algorithm), where I simply replace 'movdqa' with 'movaps', it leads to a DECREASE in performance...
So it's nice to know about the abilities of the individual instructions, but in real-world applications this knowledge often isn't enough... but I'll keep this fact in mind ;)
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: qWord on January 18, 2013, 01:55:09 AM
Quote from: phaap on January 18, 2013, 01:28:10 AM
Nice comparison, but when it comes to a heterogeneous context, e.g. my current project (an MD4 hash algorithm), where I simply replace 'movdqa' with 'movaps', it leads to a DECREASE in performance...
So it's nice to know about the abilities of the individual instructions, but in real-world applications this knowledge often isn't enough... but I'll keep this fact in mind ;)
It would be really nice if you could upload those two tests!
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: phaap on January 18, 2013, 03:36:34 AM
...
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: phaap on January 18, 2013, 03:37:47 AM
...
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: phaap on January 18, 2013, 03:41:11 AM
...
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: qWord on January 18, 2013, 04:24:21 AM
Can you please explain how I should combine those blocks? (Your current code is an infinite loop.) Also, how can I test the algorithm for correctness?
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: phaap on January 18, 2013, 05:26:05 AM
...
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: qWord on January 18, 2013, 05:45:24 AM
zip the source file and attach it  ;)
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: Gunther on January 18, 2013, 07:31:04 AM
Quote from: qWord on January 18, 2013, 05:45:24 AM
zip the source file and attach it  ;)

very good proposal.  :t

Gunther
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: qWord on January 18, 2013, 12:26:24 PM
phaap,
is there any special reason why you "cleared" your posts? Is it maybe that the code doesn't work correctly?
At least I was not able to get any useful result (compared with the test values from your link (http://tools.ietf.org/html/rfc1186)) - see attachment.

(To assemble it you need jwasm, polink and Japhet's WinInc. For the folder structure used, take a look at WinIncRT.inc.)
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: qWord on January 18, 2013, 11:06:21 PM
OK, at least for the case of an empty string I get the correct result in the low-order DWORD of xmm2-5.
It would be much easier if you attached your whole code.

regards, qWord
Title: Re: Performance differences for MOVAPS, MOVDQA and MOVAPD
Post by: frktons on January 19, 2013, 02:16:25 AM
Quote from: qWord on January 18, 2013, 11:06:21 PM
OK, at least for the case of an empty string I get the correct result in the low-order DWORD of xmm2-5.
It would be much easier if you attached your whole code.

regards, qWord
Maybe you'll have to wait until phaap pops up again. His mysterious
disappearance has to mean something. Let's wait and see... :lol: