Following a friendly exchange of views between Germans (http://masm32.com/board/index.php?topic=1123.msg10809#msg10809), I got curious and set up a little testbed for these three very similar SIMD instructions. Here are first results:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Testing with 5000000 loops
2311 ms for movapd
2328 ms for movdqa
2298 ms for movaps
2315 ms for movapd
2338 ms for movdqa
2297 ms for movaps
2301 ms for movapd
2299 ms for movdqa
2288 ms for movaps
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
850 ms for movapd
853 ms for movdqa
776 ms for movaps
839 ms for movapd
851 ms for movdqa
776 ms for movaps
840 ms for movapd
851 ms for movdqa
776 ms for movaps
136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps
Since my i5 is already a bit outdated, I'd like to see some results for really modern CPUs ;)
Here is the main loop, identical for all three algos:
.Repeat
movapd [edi], xmm0
paddd xmm0, xmm1 ; 4444
movapd [edi+16], xmm0
paddd xmm0, xmm2 ; 44xx
movapd [edi+32], xmm0
paddd xmm0, xmm3 ; xx44
movapd [edi+48], xmm0
paddd xmm0, xmm1 ; 4444
movapd [edi+64], xmm0
paddd xmm0, xmm4 ; xxxx
inc ecx
.if Zero?
psubd xmm0, oword ptr Sub100a
.elseif ecx==-4
movapd xmm2, [esi+80]
movapd xmm3, [esi+96]
movapd xmm4, [esi+112]
.elseif ecx==5
psubd xmm0, oword ptr Sub100b
xor ecx, ecx
.endif
add edi, 80
.Until edi>=edx
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE3, SSSE3, SSE4.1)
Testing with 5000000 loops
856 ms for movapd
856 ms for movdqa
789 ms for movaps
855 ms for movapd
855 ms for movdqa
790 ms for movaps
856 ms for movapd
856 ms for movdqa
789 ms for movaps
136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps
--- ok ---
Jochen,
here the results with my machine (Windows 7 - 64 bit)
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
668 ms for movapd
662 ms for movdqa
600 ms for movaps
667 ms for movapd
658 ms for movdqa
619 ms for movaps
668 ms for movapd
658 ms for movdqa
618 ms for movaps
136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps
--- ok ---
Gunther
> 600 ms for movaps
Gunther, that's a factor of 4 faster than my Celeron - nice machine :t
Hi Jochen,
Quote from: jj2007 on December 23, 2012, 10:46:12 PM
> 600 ms for movaps
Gunther, that's a factor of 4 faster than my Celeron - nice machine :t
Oh yes, it is. It has enough RAM for the foreseeable future and a large hard disk, but installing the operating systems was a bit tricky.
Gunther
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Testing with 5000000 loops
830 ms for movapd
806 ms for movdqa
744 ms for movaps
807 ms for movapd
799 ms for movdqa
720 ms for movaps
809 ms for movapd
814 ms for movdqa
746 ms for movaps
136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps
I must admit that integer instructions seem to suffer no disadvantage when used with wrongly typed move instructions. However, see your test bed here (http://www.masmforum.com/board/index.php?topic=18425.0) for a counterexample (there is a conflict between ps and sd).
Hi qWord,
Quote from: qWord on December 24, 2012, 02:34:56 AM
I must admit that integer instructions seem to suffer no disadvantage when used with wrongly typed move instructions.
that's right and you can read about that in Agner Fog's manuals.
Gunther
Quote from: qWord on December 24, 2012, 02:34:56 AM
799 ms for movdqa
720 ms for movaps
I must admit that integer instructions seem to suffer no disadvantage when used with wrongly typed move instructions. However, see your test bed here (http://www.masmforum.com/board/index.php?topic=18425.0) for a counterexample (there is a conflict between ps and sd).
I like your "no disadvantage" ;-)
But you are perfectly right about what you flagged in replies #9+11 of that thread:
movlps xmm0, R8s1
mulsd xmm0, R8s2
movlps R8d, xmm0
is indeed slow on recent Intel CPUs compared to
movsd xmm0, R8s1
mulsd xmm0, R8s2
movsd R8d, xmm0
(but zero difference on most other CPUs...)
So we all learnt something today :biggrin:
I thought we wanted to move 128 bits :P
My tests:
Win 7/64
Quote
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE3, SSSE3)
Testing with 5000000 loops
1105 ms for movapd
1100 ms for movdqa
998 ms for movaps
1281 ms for movapd
1080 ms for movdqa
997 ms for movaps
1282 ms for movapd
1079 ms for movdqa
1199 ms for movaps
136 bytes for movapd
136 bytes for movdqa
128 bytes for movaps
--- ok ---
This test resembles the one we did a couple of years ago for the fastest clear buffer.
For clearing a buffer or moving memory contents, which is faster?
Quote from: Farabi on December 24, 2012, 02:44:52 PM
For clearing a buffer or moving memory contents, which is faster?
If the data size is under 1 MB, then 16-byte-aligned data moves faster with movdqa/movaps. If you want a more detailed test, have a look at the old forum and search for "fastest clear buffer".
Most of the instructions managed about 5 million operations per second on average. I think that, by current standards, Intel's floating-point performance is poor compared to what nVidia can do.
Nice comparison, but things change in a heterogeneous context: in my current project (an MD4 hash algorithm), simply replacing 'movdqa' with 'movaps' leads to a DECREASE in performance...
So it's nice to know what the individual instructions can do, but in real-world applications this knowledge often isn't enough... ...but I'll keep this fact in mind ;)
Quote from: phaap on January 18, 2013, 01:28:10 AM
Nice comparison, but things change in a heterogeneous context: in my current project (an MD4 hash algorithm), simply replacing 'movdqa' with 'movaps' leads to a DECREASE in performance...
So it's nice to know what the individual instructions can do, but in real-world applications this knowledge often isn't enough... ...but I'll keep this fact in mind ;)
It would be really nice if you could upload those two tests!
...
...
...
Can you please explain how I should combine those blocks (your current code is an infinite loop)? Also, how can I test the algorithm for correctness?
...
zip the source file and attach it ;)
Quote from: qWord on January 18, 2013, 05:45:24 AM
zip the source file and attach it ;)
very good proposal. :t
Gunther
phaap,
is there any special reason why you "cleared" your post? Is it perhaps that the code doesn't work correctly?
At least I was not able to get any useful result (compared to the test values from your link (http://tools.ietf.org/html/rfc1186)) - see attachment.
(To assemble it you need jwasm, polink and Japhet's WinInc. For the folder structure used, take a look at WinIncRT.inc.)
OK, at least for the case of an empty string I get the correct result in the low-order DWORD of xmm2-5.
It would be much easier if you attached your whole code.
regards, qWord
Quote from: qWord on January 18, 2013, 11:06:21 PM
OK, at least for the case of an empty string I get the correct result in the low-order DWORD of xmm2-5.
It would be much easier if you attached your whole code.
regards, qWord
Maybe you'll have to wait until phaap pops up again. His mysterious disappearance has to mean something. Let's wait and see... :lol: