Unaligned memory copy test piece.

Started by hutch--, December 06, 2021, 08:34:55 PM


hutch--

JJ,

Keep in mind that the sample size being looped will affect the timing of most of the combinations. If you run a gigabyte sample, you get rid of those effects. The other factor is that different hardware will give different results.
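For reference, a minimal sketch of how such a large sample might be set up with the Win32 API; the buffer size and the pSrc/pDst variables are assumptions for illustration, not the attached test piece:

BUFSIZE equ 256*1024*1024                       ; 256 MB each, sized to fit a 32-bit address space

.data?
  pSrc  dd ?
  pDst  dd ?

.code
    invoke VirtualAlloc, 0, BUFSIZE, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE
    mov    pSrc, eax                            ; source buffer
    invoke VirtualAlloc, 0, BUFSIZE, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE
    mov    pDst, eax                            ; destination buffer
    ; ... time one pass of the copy over the whole buffer here ...
    invoke VirtualFree, pDst, 0, MEM_RELEASE
    invoke VirtualFree, pSrc, 0, MEM_RELEASE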

LiaoMi

Quote from: jj2007 on December 13, 2021, 12:03:37 AM
Quote from: LiaoMi on December 12, 2021, 11:31:29 PM
please add two more examples from here http://masm32.com/board/index.php?topic=9691.msg106286#msg106286

"movntdq + mfence"
@@:
    movdqu  xmm0, [esi]                 ; esi need not be 16-byte aligned
    movntdq [edi], xmm0                 ; edi must be 16-byte aligned
    add     esi, 16
    add     edi, 16
    loop    @B
    mfence

I'm not impressed...

Very curious results  :rolleyes:
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

17782   cycles for 100 * rep movsb
17554   cycles for 100 * rep movsd
115012  cycles for 100 * movlps qword ptr [esi+8*ecx]
52006   cycles for 100 * movaps xmm0, oword ptr [esi]
58101   cycles for 100 * movdqa + movntdq
57415   cycles for 100 * movdqu + movntdq
73437   cycles for 100 * movdqu + movntdq + mfence

18073   cycles for 100 * rep movsb
17701   cycles for 100 * rep movsd
110545  cycles for 100 * movlps qword ptr [esi+8*ecx]
42643   cycles for 100 * movaps xmm0, oword ptr [esi]
56827   cycles for 100 * movdqa + movntdq
58362   cycles for 100 * movdqu + movntdq
72001   cycles for 100 * movdqu + movntdq + mfence

19436   cycles for 100 * rep movsb
17883   cycles for 100 * rep movsd
107491  cycles for 100 * movlps qword ptr [esi+8*ecx]
43259   cycles for 100 * movaps xmm0, oword ptr [esi]
56876   cycles for 100 * movdqa + movntdq
57166   cycles for 100 * movdqu + movntdq
74082   cycles for 100 * movdqu + movntdq + mfence

18036   cycles for 100 * rep movsb
18419   cycles for 100 * rep movsd
106922  cycles for 100 * movlps qword ptr [esi+8*ecx]
42377   cycles for 100 * movaps xmm0, oword ptr [esi]
58547   cycles for 100 * movdqa + movntdq
57797   cycles for 100 * movdqu + movntdq
74547   cycles for 100 * movdqu + movntdq + mfence

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
36      bytes for movdqa + movntdq
36      bytes for movdqu + movntdq
39      bytes for movdqu + movntdq + mfence


--- ok ---

hutch--

Here are two test pieces for 32-bit memory copy: an unaligned SSE2 copy and a normal rep movsb copy. In every instance the SSE2 version is faster, to the extent that the test pieces don't need to be stabilised or run at a higher priority. It is the same source for both, but each has been saved as a separate exe file so that there is no cache interaction between the two.

SSE2 Copy
--------
843 ms
--------
Press any key to continue ...

ByteCopy
--------
1219 ms
--------
Press any key to continue ...
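The attachment isn't reproduced here, but a minimal sketch of the two inner loops being compared (assuming esi, edi and ecx are already set up) would look something like this:

  ; unaligned SSE2 copy, ecx = number of 16-byte blocks
  sse2_loop:
    movdqu  xmm0, [esi]             ; unaligned 16-byte load
    movdqu  [edi], xmm0             ; unaligned 16-byte store
    add     esi, 16
    add     edi, 16
    sub     ecx, 1
    jnz     sse2_loop

  ; plain byte copy, ecx = byte count
    cld
    rep movsb                       ; copy ecx bytes from [esi] to [edi]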


LiaoMi

Quote from: hutch-- on December 13, 2021, 01:39:30 PM
Here are two test pieces for 32-bit memory copy: an unaligned SSE2 copy and a normal rep movsb copy. In every instance the SSE2 version is faster, to the extent that the test pieces don't need to be stabilised or run at a higher priority. It is the same source for both, but each has been saved as a separate exe file so that there is no cache interaction between the two.

SSE2 Copy
--------
843 ms
--------
Press any key to continue ...

ByteCopy
--------
1219 ms
--------
Press any key to continue ...


SSE2 Copy
--------
640 ms
--------
Press any key to continue ...

ByteCopy
--------
719 ms
--------
Press any key to continue ...

jj2007

Similar for me. The interesting bit, though: if you use movups instead of movnt, rep movsb is faster.

My test pieces were made with a shorter copy but more iterations, which implies that they used the cache, and there, apparently, movnt has no advantage.
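For comparison, the movups variant referred to here is essentially the same loop with a normal cacheable store in place of the non-temporal one, something like:

  @@:
    movups  xmm0, [esi]             ; unaligned load
    movups  [edi], xmm0             ; normal store, goes through the cache
    add     esi, 16
    add     edi, 16
    loop    @B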

hutch--

I think that is normally the case with SSE mnemonics: they are designed for streaming and apparently have a wind-up effect that makes them a poor fit for short-duration loop code. In the past, the advice on SSE code was to more or less forget integer code optimisation and set up the SSE code so it did the work. I have never yet got any gain out of SSE code by unrolling it, so it really is a different system built into the CPU.

This much is clear: over time the unaligned mnemonics have come a lot closer in speed to the fully aligned ones, and there is little gain in using the aligned instructions unless your code design requires it.

TimoVJL

ByteCopy
--------
1266 ms
--------
SSE2
--------
875 ms
--------
May the source be with you

mineiro

bytecopy.exe
--------
695 ms
--------
sse2copy.exe
--------
680 ms
--------

Quote from: jj2007 on December 13, 2021, 07:58:53 PM
Similar for me. The interesting bit, though: if you use movups instead of movnt, rep movsb is faster.
I did some tests here: when I align the data to 64 bytes, rep movs(b/d) execution time decreases by roughly 40-50%. The other functions' results remain unchanged.

align 16
db 1
align 16
db 1
align 16
db 1
align 16
somestring db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero"
REPEAT 99
db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero"
ENDM
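A quick way to confirm what alignment the string actually landed on (a hypothetical check, not part of the test above):

    mov     eax, offset somestring
    and     eax, 63                 ; offset within a 64-byte line, 0 = 64-byte aligned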

I'd rather be this ambulant metamorphosis than to have that old opinion about everything

daydreamer

I am interested in timings with DDraw or SDL. For someone with PCI Express VRAM, does movntdqa still have the advantage for system RAM -> VRAM copies, and is the disadvantage that reading back from VRAM is around 100x slower?
I can't test this on a laptop with shared memory.
I wonder if the DX LoadTextureFromMemory API uses movntdqa?
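For readers who haven't used it: movntdqa is the SSE4.1 streaming load, intended for write-combining memory such as mapped VRAM (on ordinary write-back RAM it behaves like a normal aligned load). A minimal sketch of a read loop, assuming both pointers are 16-byte aligned:

  @@:
    movntdqa xmm0, [esi]            ; streaming load, esi must be 16-byte aligned
    movdqa   [edi], xmm0            ; edi assumed 16-byte aligned as well
    add      esi, 16
    add      edi, 16
    loop     @B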
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
++++++++-+++8 of 20 tests valid,
315842  kCycles for 1 * rep movsb
241232  kCycles for 1 * rep movsd
304695  kCycles for 1 * movlps qword ptr [esi+8*ecx]
302014  kCycles for 1 * movaps xmm0, oword ptr [esi]
208018  kCycles for 1 * movdqa + movntdq
207876  kCycles for 1 * movdqu + movntdq
207752  kCycles for 1 * movdqu + movntdq + mfence

249181  kCycles for 1 * rep movsb
239809  kCycles for 1 * rep movsd
304868  kCycles for 1 * movlps qword ptr [esi+8*ecx]
301253  kCycles for 1 * movaps xmm0, oword ptr [esi]
207931  kCycles for 1 * movdqa + movntdq
208272  kCycles for 1 * movdqu + movntdq
207503  kCycles for 1 * movdqu + movntdq + mfence

249727  kCycles for 1 * rep movsb
241799  kCycles for 1 * rep movsd
303516  kCycles for 1 * movlps qword ptr [esi+8*ecx]
301728  kCycles for 1 * movaps xmm0, oword ptr [esi]
207608  kCycles for 1 * movdqa + movntdq
208094  kCycles for 1 * movdqu + movntdq
208854  kCycles for 1 * movdqu + movntdq + mfence

248574  kCycles for 1 * rep movsb
240836  kCycles for 1 * rep movsd
304675  kCycles for 1 * movlps qword ptr [esi+8*ecx]
301674  kCycles for 1 * movaps xmm0, oword ptr [esi]
208379  kCycles for 1 * movdqa + movntdq
207882  kCycles for 1 * movdqu + movntdq
207742  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
+-++++++++++++++++++
261718  kCycles for 1 * rep movsb
238384  kCycles for 1 * rep movsd
269729  kCycles for 1 * movlps qword ptr [esi+8*ecx]
236182  kCycles for 1 * movaps xmm0, oword ptr [esi]
156614  kCycles for 1 * movdqa + movntdq
156042  kCycles for 1 * movdqu + movntdq
156594  kCycles for 1 * movdqu + movntdq + mfence

236577  kCycles for 1 * rep movsb
236369  kCycles for 1 * rep movsd
270908  kCycles for 1 * movlps qword ptr [esi+8*ecx]
236626  kCycles for 1 * movaps xmm0, oword ptr [esi]
156354  kCycles for 1 * movdqa + movntdq
155998  kCycles for 1 * movdqu + movntdq
156243  kCycles for 1 * movdqu + movntdq + mfence

235567  kCycles for 1 * rep movsb
236734  kCycles for 1 * rep movsd
276802  kCycles for 1 * movlps qword ptr [esi+8*ecx]
236012  kCycles for 1 * movaps xmm0, oword ptr [esi]
156233  kCycles for 1 * movdqa + movntdq
156445  kCycles for 1 * movdqu + movntdq
156944  kCycles for 1 * movdqu + movntdq + mfence

237039  kCycles for 1 * rep movsb
238233  kCycles for 1 * rep movsd
270702  kCycles for 1 * movlps qword ptr [esi+8*ecx]
236677  kCycles for 1 * movaps xmm0, oword ptr [esi]
155610  kCycles for 1 * movdqa + movntdq
156935  kCycles for 1 * movdqu + movntdq
156013  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence
May the source be with you

Siekmanski

AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
-----------9 of 20 tests valid,
103297  kCycles for 1 * rep movsb
83453   kCycles for 1 * rep movsd
151305  kCycles for 1 * movlps qword ptr [esi+8*ecx]
140797  kCycles for 1 * movaps xmm0, oword ptr [esi]
85881   kCycles for 1 * movdqa + movntdq
87420   kCycles for 1 * movdqu + movntdq
85314   kCycles for 1 * movdqu + movntdq + mfence

82482   kCycles for 1 * rep movsb
81107   kCycles for 1 * rep movsd
148720  kCycles for 1 * movlps qword ptr [esi+8*ecx]
140735  kCycles for 1 * movaps xmm0, oword ptr [esi]
87417   kCycles for 1 * movdqa + movntdq
85541   kCycles for 1 * movdqu + movntdq
86765   kCycles for 1 * movdqu + movntdq + mfence

83181   kCycles for 1 * rep movsb
81348   kCycles for 1 * rep movsd
149105  kCycles for 1 * movlps qword ptr [esi+8*ecx]
141740  kCycles for 1 * movaps xmm0, oword ptr [esi]
85743   kCycles for 1 * movdqa + movntdq
86256   kCycles for 1 * movdqu + movntdq
87608   kCycles for 1 * movdqu + movntdq + mfence

81210   kCycles for 1 * rep movsb
81663   kCycles for 1 * rep movsd
150942  kCycles for 1 * movlps qword ptr [esi+8*ecx]
140339  kCycles for 1 * movaps xmm0, oword ptr [esi]
86239   kCycles for 1 * movdqa + movntdq
87931   kCycles for 1 * movdqu + movntdq
85408   kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


Hi Jochen,
What does the number of valid tests mean?
Creative coders use backward thinking techniques as a strategy.

FORTRANS

Hi Jochen,

   Two systems.  Tests valid?

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
+++-+++++++++++5 of 20 tests valid,
352139  kCycles for 1 * rep movsb
237663  kCycles for 1 * rep movsd
287198  kCycles for 1 * movlps qword ptr [esi+8*ecx]
279630  kCycles for 1 * movaps xmm0, oword ptr [esi]
181083  kCycles for 1 * movdqa + movntdq
181125  kCycles for 1 * movdqu + movntdq
180909  kCycles for 1 * movdqu + movntdq + mfence

193980  kCycles for 1 * rep movsb
214331  kCycles for 1 * rep movsd
278425  kCycles for 1 * movlps qword ptr [esi+8*ecx]
274866  kCycles for 1 * movaps xmm0, oword ptr [esi]
179814  kCycles for 1 * movdqa + movntdq
179520  kCycles for 1 * movdqu + movntdq
179469  kCycles for 1 * movdqu + movntdq + mfence

192139  kCycles for 1 * rep movsb
210664  kCycles for 1 * rep movsd
279349  kCycles for 1 * movlps qword ptr [esi+8*ecx]
274878  kCycles for 1 * movaps xmm0, oword ptr [esi]
180115  kCycles for 1 * movdqa + movntdq
179785  kCycles for 1 * movdqu + movntdq
179769  kCycles for 1 * movdqu + movntdq + mfence

191809  kCycles for 1 * rep movsb
216878  kCycles for 1 * rep movsd
277053  kCycles for 1 * movlps qword ptr [esi+8*ecx]
277263  kCycles for 1 * movaps xmm0, oword ptr [esi]
178899  kCycles for 1 * movdqa + movntdq
179083  kCycles for 1 * movdqu + movntdq
178847  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


--- ok ---


Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
+++++15 of 20 tests valid,
241261  kCycles for 1 * rep movsb
201928  kCycles for 1 * rep movsd
250473  kCycles for 1 * movlps qword ptr [esi+8*ecx]
245765  kCycles for 1 * movaps xmm0, oword ptr [esi]
182061  kCycles for 1 * movdqa + movntdq
187806  kCycles for 1 * movdqu + movntdq
197382  kCycles for 1 * movdqu + movntdq + mfence

224850  kCycles for 1 * rep movsb
202630  kCycles for 1 * rep movsd
234536  kCycles for 1 * movlps qword ptr [esi+8*ecx]
228211  kCycles for 1 * movaps xmm0, oword ptr [esi]
191152  kCycles for 1 * movdqa + movntdq
188628  kCycles for 1 * movdqu + movntdq
185565  kCycles for 1 * movdqu + movntdq + mfence

206426  kCycles for 1 * rep movsb
206008  kCycles for 1 * rep movsd
233301  kCycles for 1 * movlps qword ptr [esi+8*ecx]
229024  kCycles for 1 * movaps xmm0, oword ptr [esi]
181524  kCycles for 1 * movdqa + movntdq
198103  kCycles for 1 * movdqu + movntdq
177373  kCycles for 1 * movdqu + movntdq + mfence

199886  kCycles for 1 * rep movsb
200050  kCycles for 1 * rep movsd
233793  kCycles for 1 * movlps qword ptr [esi+8*ecx]
228413  kCycles for 1 * movaps xmm0, oword ptr [esi]
177392  kCycles for 1 * movdqa + movntdq
175842  kCycles for 1 * movdqu + movntdq
175220  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


--- ok ---


Regards,

Steve N.

LiaoMi

Quote from: jj2007 on December 14, 2021, 03:54:57 AM
I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
++++---++-+--+++-++1 of 20 tests valid,
111187  kCycles for 1 * rep movsb
86590   kCycles for 1 * rep movsd
131239  kCycles for 1 * movlps qword ptr [esi+8*ecx]
119079  kCycles for 1 * movaps xmm0, oword ptr [esi]
99788   kCycles for 1 * movdqa + movntdq
89096   kCycles for 1 * movdqu + movntdq
92749   kCycles for 1 * movdqu + movntdq + mfence

99740   kCycles for 1 * rep movsb
92438   kCycles for 1 * rep movsd
119977  kCycles for 1 * movlps qword ptr [esi+8*ecx]
111659  kCycles for 1 * movaps xmm0, oword ptr [esi]
76366   kCycles for 1 * movdqa + movntdq
79162   kCycles for 1 * movdqu + movntdq
77279   kCycles for 1 * movdqu + movntdq + mfence

89597   kCycles for 1 * rep movsb
85665   kCycles for 1 * rep movsd
125051  kCycles for 1 * movlps qword ptr [esi+8*ecx]
111892  kCycles for 1 * movaps xmm0, oword ptr [esi]
76149   kCycles for 1 * movdqa + movntdq
76483   kCycles for 1 * movdqu + movntdq
76167   kCycles for 1 * movdqu + movntdq + mfence

86964   kCycles for 1 * rep movsb
85324   kCycles for 1 * rep movsd
121596  kCycles for 1 * movlps qword ptr [esi+8*ecx]
111769  kCycles for 1 * movaps xmm0, oword ptr [esi]
76968   kCycles for 1 * movdqa + movntdq
75970   kCycles for 1 * movdqu + movntdq
75677   kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


--- ok ---



mineiro

Quote from: jj2007 on December 14, 2021, 03:54:57 AM
I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:
Quote from: mineiro on December 14, 2021, 01:24:01 AM
I did some tests here: when I align the data to 64 bytes, rep movs(b/d) execution time decreases by roughly 40-50%. The other functions' results remain unchanged.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything