I have a task where the memory copy cannot be controlled to SSE alignment.
The example has two memory copy techniques: the old rep movsb method as a reference, and the following for unaligned SSE.
movdqu xmm0, [rcx+r10]
movntdq [rdx+r10], xmm0
I have stabilised the timings by running a dummy run before the timed run, and on my old Haswell the unaligned SSE version runs in about 4.7 seconds for a 50 gig copy. As a reference, the rep movsb version runs in about 6.7 seconds for the same 50 gig.
I have not run the two tests together, so that one does not affect the other. If you have time, run the SSE version, then switch to the commented-out rep movsb version.
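In C terms, the two instructions above map directly onto SSE2 intrinsics. The sketch below is only illustrative (the function name and loop shape are mine, not from the test program): it assumes len is a multiple of 16 and that the destination is 16-byte aligned, since movntdq requires an aligned store, while the source may be unaligned.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
#include <stddef.h>

/* Unaligned load / non-temporal store copy, 16 bytes per pass.
   dst must be 16-byte aligned (movntdq faults otherwise); src may
   have any alignment; len is assumed to be a multiple of 16. */
static void copy_nt(void *dst, const void *src, size_t len)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i)); /* movdqu  */
        _mm_stream_si128((__m128i *)(d + i), v);               /* movntdq */
    }
    _mm_sfence(); /* drain the streaming stores before dst is read */
}
```

As noted further down the thread, the 64-bit heap allocator appears to deliver 16-byte aligned buffers, so the aligned-store side of this holds in the intended use.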
This is the result on my machine:
I don't have all the include files and tools; if you can release just the executable of the rep movsb version, I can run it here.
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
wine umc.exe
--------------------------------
50 gig copy in 3338 milliseconds
--------------------------------
i3-10100 not so fast :biggrin:
xmmcopyu:
--------------------------------
50 gig copy in 7531 milliseconds
--------------------------------
ByteCopy:
--------------------------------
50 gig copy in 10563 milliseconds
--------------------------------
Thanks guys, all of these results are very useful to me.
I added the rep movsd version as a zip file.
--------------------------------
50 gig copy in 6625 milliseconds rep movsb
--------------------------------
--------------------------------
50 gig copy in 4578 milliseconds movdqu xmm0, [rcx+r10] : movntdq [rdx+r10], xmm0
--------------------------------
What I am chasing is the ratio between the two, as the SSE version will be used to copy memory that originated from an MMF written to by a 32-bit app.
umcmovsb:
--------------------------------
50 gig copy in 6015 milliseconds
--------------------------------
Press any key to continue...
umc:
--------------------------------
50 gig copy in 4719 milliseconds
--------------------------------
Press any key to continue...
umcmovsb:
--------------------------------
50 gig copy in 9547 milliseconds
--------------------------------
Press any key to continue...
AMD Ryzen 3700X
umcmovsb:
--------------------------------
50 gig copy in 5522 milliseconds
--------------------------------
umc:
--------------------------------
50 gig copy in 2902 milliseconds
--------------------------------
AMD Ryzen 9 5950X 16-Core Processor
umc:
--------------------------------
50 gig copy in 2781 milliseconds
--------------------------------
umcmovsb:
--------------------------------
50 gig copy in 2563 milliseconds
--------------------------------
Quote from: Siekmanski on December 07, 2021, 03:32:51 AM
AMD Ryzen 9 5950X 16-Core Processor
umc:
--------------------------------
50 gig copy in 2781 milliseconds
--------------------------------
umcmovsb:
--------------------------------
50 gig copy in 2563 milliseconds
--------------------------------
Well, the result for movsb is surprising. :thumbsup:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
wine umc.exe
--------------------------------
50 gig copy in 3384 milliseconds
--------------------------------
wine umcmovsb.exe
--------------------------------
50 gig copy in 3450 milliseconds
--------------------------------
--------------------------------
50 gig copy in 9064 milliseconds
--------------------------------
--------------------------------
50 gig copy in 10343 milliseconds
--------------------------------
Thanks all. It seems that, across a wide range of different hardware, the SSE2 version is faster in every instance, and that is useful for the task I have in mind. :biggrin:
Are you sure it's unaligned? My debugger says halloc() delivers a 16-byte aligned buffer. I also wonder whether lodsd would be faster than lodsb.
What I have to do is load data from a 32-bit app via a memory-mapped file into a 64-bit app, which uses HeapAlloc() to store the data. The input source from the 32-bit side can be rough, byte-aligned string data or anything else that will fit into the memory-mapped file size. If alignment was going to be fully controlled at both ends, I would use the faster aligned SSE2 instructions.
"rep movsb" is usually faster than "rep movsd" which seems to be Intel special case circuitry and I have not seen examples of "rep lodsb" being faster so I have used "rep movsb" as a reference to compare the SSE2 version and across multiple CPUs that the folks here have tested on, the SSE2 version is always faster.
I already have prototypes of the task up and running using a 1 GB memory-mapped file as the data transfer window. The idea is for a 32-bit app to be able to store multiple 1 GB blocks in the 64-bit "container" and work on any one of them at a time. I had used "rep movsb" for the unaligned data transfer and it worked OK, but as you start using larger blocks of memory, speed starts to matter.
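For quick experiments, the rep movsb reference technique can also be reproduced from C with GCC/Clang inline assembly on x86 (a hedged, non-portable sketch; the wrapper name is mine). The instruction itself updates RDI, RSI and RCX, which the constraints reflect:

```c
#include <stddef.h>

/* Plain rep movsb: RDI = dst, RSI = src, RCX = count, all advanced
   in place by the instruction. x86 GCC/Clang inline asm only. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     : /* no other inputs */
                     : "memory");
}
```

On recent Intel CPUs this benefits from the fast-string (ERMSB) microcode path, which is presumably the "special case circuitry" referred to above.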
It would be interesting to test the smaller SSE movups/movaps, because if they are the same speed as the SSE2 moves, more instructions can fit in the cache.
Quote from: hutch-- on December 06, 2021, 08:34:55 PM
I have a task where the memory copy cannot be controlled to SSE alignment.
The example has two memory copy techniques: the old rep movsb method as a reference, and the following for unaligned SSE.
movdqu xmm0, [rcx+r10]
movntdq [rdx+r10], xmm0
I have stabilised the timings by running a dummy run before the timed run, and on my old Haswell the unaligned SSE version runs in about 4.7 seconds for a 50 gig copy. As a reference, the rep movsb version runs in about 6.7 seconds for the same 50 gig.
I have not run the two tests together, so that one does not affect the other. If you have time, run the SSE version, then switch to the commented-out rep movsb version.
Hi Hutch,
i7-11800h
--------------------------------
50 gig copy in 3500 milliseconds
--------------------------------
Press any key to continue...
rep movsd
--------------------------------
50 gig copy in 3750 milliseconds
--------------------------------
Press any key to continue...
Quote from: hutch-- on December 07, 2021, 11:27:06 AM
"rep movsb" is usually faster than "rep movsd" which seems to be Intel special case circuitry
Not on my machine, at least with 32-bit code...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
64674 cycles for 100 * rep movsb
64365 cycles for 100 * rep movsd
206043 cycles for 100 * movlps qword ptr [esi+8*ecx]
122243 cycles for 100 * movaps xmm0, oword ptr [esi]
195049 cycles for 100 * movntdq xmm0, oword ptr [esi]
65058 cycles for 100 * rep movsb
63966 cycles for 100 * rep movsd
206036 cycles for 100 * movlps qword ptr [esi+8*ecx]
122348 cycles for 100 * movaps xmm0, oword ptr [esi]
193151 cycles for 100 * movntdq xmm0, oword ptr [esi]
65376 cycles for 100 * rep movsb
64353 cycles for 100 * rep movsd
206087 cycles for 100 * movlps qword ptr [esi+8*ecx]
122278 cycles for 100 * movaps xmm0, oword ptr [esi]
193349 cycles for 100 * movntdq xmm0, oword ptr [esi]
65125 cycles for 100 * rep movsb
63895 cycles for 100 * rep movsd
205977 cycles for 100 * movlps qword ptr [esi+8*ecx]
121872 cycles for 100 * movaps xmm0, oword ptr [esi]
193156 cycles for 100 * movntdq xmm0, oword ptr [esi]
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]
I get much the same on this old Haswell. I have usually used a combination of rep movsd with a rep movsb tail, but rep movsb alone is close enough in speed to the rep movsd version and does not suffer from the switch from DWORD to BYTE.
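That DWORD-bulk-plus-byte-tail combination looks roughly like this as a hedged C sketch (x86 GCC/Clang inline asm; the wrapper and the split are mine, purely to illustrate the DWORD-to-BYTE switch being discussed; note AT&T syntax spells the DWORD string move "movsl"):

```c
#include <stddef.h>

/* Copy n/4 DWORDs with rep movsd, then the 0..3 leftover bytes with
   rep movsb. The "+D"/"+S" constraints write the advanced pointers
   back, so the second instruction continues where the first stopped. */
static void copy_movsd_movsb(void *dst, const void *src, size_t n)
{
    size_t dwords = n >> 2;  /* whole DWORDs */
    size_t tail   = n & 3;   /* leftover bytes */
    __asm__ volatile("rep movsl"   /* rep movsd in Intel syntax */
                     : "+D"(dst), "+S"(src), "+c"(dwords) :: "memory");
    __asm__ volatile("rep movsb"   /* byte tail */
                     : "+D"(dst), "+S"(src), "+c"(tail) :: "memory");
}
```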
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
64649 cycles for 100 * rep movsb
64474 cycles for 100 * rep movsd
158890 cycles for 100 * movlps qword ptr [esi+8*ecx]
80167 cycles for 100 * movaps xmm0, oword ptr [esi]
63923 cycles for 100 * rep movsb
65043 cycles for 100 * rep movsd
158972 cycles for 100 * movlps qword ptr [esi+8*ecx]
82787 cycles for 100 * movaps xmm0, oword ptr [esi]
65293 cycles for 100 * rep movsb
66105 cycles for 100 * rep movsd
158830 cycles for 100 * movlps qword ptr [esi+8*ecx]
81359 cycles for 100 * movaps xmm0, oword ptr [esi]
64850 cycles for 100 * rep movsb
66085 cycles for 100 * rep movsd
159823 cycles for 100 * movlps qword ptr [esi+8*ecx]
81326 cycles for 100 * movaps xmm0, oword ptr [esi]
19 bytes for rep movsb
19 bytes for rep movsd
28 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
--- ok ---
Thanks LiaoMi, interesting result in that they are much closer than earlier hardware. Looks like a nice fast box.
New machine. I added movntdq
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
57654 cycles for 100 * rep movsb
65892 cycles for 100 * rep movsd
112954 cycles for 100 * movlps qword ptr [esi+8*ecx]
58152 cycles for 100 * movaps xmm0, oword ptr [esi]
129800 cycles for 100 * movntdq xmm0, oword ptr [esi]
59723 cycles for 100 * rep movsb
59356 cycles for 100 * rep movsd
113875 cycles for 100 * movlps qword ptr [esi+8*ecx]
57518 cycles for 100 * movaps xmm0, oword ptr [esi]
130509 cycles for 100 * movntdq xmm0, oword ptr [esi]
59061 cycles for 100 * rep movsb
63768 cycles for 100 * rep movsd
112908 cycles for 100 * movlps qword ptr [esi+8*ecx]
57839 cycles for 100 * movaps xmm0, oword ptr [esi]
132310 cycles for 100 * movntdq xmm0, oword ptr [esi]
59031 cycles for 100 * rep movsb
58619 cycles for 100 * rep movsd
129052 cycles for 100 * movlps qword ptr [esi+8*ecx]
57675 cycles for 100 * movaps xmm0, oword ptr [esi]
131438 cycles for 100 * movntdq xmm0, oword ptr [esi]
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
62494 cycles for 100 * rep movsb
63721 cycles for 100 * rep movsd
120658 cycles for 100 * movlps qword ptr [esi+8*ecx]
58538 cycles for 100 * movaps xmm0, oword ptr [esi]
129730 cycles for 100 * movntdq xmm0, oword ptr [esi]
63098 cycles for 100 * rep movsb
63117 cycles for 100 * rep movsd
119950 cycles for 100 * movlps qword ptr [esi+8*ecx]
58756 cycles for 100 * movaps xmm0, oword ptr [esi]
129155 cycles for 100 * movntdq xmm0, oword ptr [esi]
62565 cycles for 100 * rep movsb
62759 cycles for 100 * rep movsd
119914 cycles for 100 * movlps qword ptr [esi+8*ecx]
57471 cycles for 100 * movaps xmm0, oword ptr [esi]
126619 cycles for 100 * movntdq xmm0, oword ptr [esi]
63080 cycles for 100 * rep movsb
62948 cycles for 100 * rep movsd
119847 cycles for 100 * movlps qword ptr [esi+8*ecx]
57513 cycles for 100 * movaps xmm0, oword ptr [esi]
123895 cycles for 100 * movntdq xmm0, oword ptr [esi]
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
umc.exe
--------------------------------
50 gig copy in 4453 milliseconds
--------------------------------
umcmovsb.exe
--------------------------------
50 gig copy in 6469 milliseconds
--------------------------------
--------------------------------
50 gig copy in 7547 milliseconds
--------------------------------
Press any key to continue...
I have 20 GB of memory installed, turbo 3.1 GHz.
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
31040 cycles for 100 * rep movsb
31143 cycles for 100 * rep movsd
117139 cycles for 100 * movlps qword ptr [esi+8*ecx]
72688 cycles for 100 * movaps xmm0, oword ptr [esi]
109706 cycles for 100 * movntdq xmm0, oword ptr [esi]
31381 cycles for 100 * rep movsb
31663 cycles for 100 * rep movsd
116001 cycles for 100 * movlps qword ptr [esi+8*ecx]
71727 cycles for 100 * movaps xmm0, oword ptr [esi]
110933 cycles for 100 * movntdq xmm0, oword ptr [esi]
31644 cycles for 100 * rep movsb
37560 cycles for 100 * rep movsd
114454 cycles for 100 * movlps qword ptr [esi+8*ecx]
72541 cycles for 100 * movaps xmm0, oword ptr [esi]
124899 cycles for 100 * movntdq xmm0, oword ptr [esi]
31097 cycles for 100 * rep movsb
31010 cycles for 100 * rep movsd
115056 cycles for 100 * movlps qword ptr [esi+8*ecx]
72463 cycles for 100 * movaps xmm0, oword ptr [esi]
109951 cycles for 100 * movntdq xmm0, oword ptr [esi]
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]
Quote from: jj2007 on December 07, 2021, 09:05:54 PM
New machine. I added movntdq
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
13916 cycles for 100 * rep movsb
16152 cycles for 100 * rep movsd
106432 cycles for 100 * movlps qword ptr [esi+8*ecx]
42050 cycles for 100 * movaps xmm0, oword ptr [esi]
59500 cycles for 100 * movntdq xmm0, oword ptr [esi]
16298 cycles for 100 * rep movsb
15607 cycles for 100 * rep movsd
109919 cycles for 100 * movlps qword ptr [esi+8*ecx]
41897 cycles for 100 * movaps xmm0, oword ptr [esi]
58949 cycles for 100 * movntdq xmm0, oword ptr [esi]
15691 cycles for 100 * rep movsb
16515 cycles for 100 * rep movsd
108793 cycles for 100 * movlps qword ptr [esi+8*ecx]
41640 cycles for 100 * movaps xmm0, oword ptr [esi]
101036 cycles for 100 * movntdq xmm0, oword ptr [esi]
17390 cycles for 100 * rep movsb
16209 cycles for 100 * rep movsd
117106 cycles for 100 * movlps qword ptr [esi+8*ecx]
42124 cycles for 100 * movaps xmm0, oword ptr [esi]
60058 cycles for 100 * movntdq xmm0, oword ptr [esi]
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]
--- ok ---
Quote from: LiaoMi on December 09, 2021, 05:10:05 AM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
13916 cycles for 100 * rep movsb
...
59500 cycles for 100 * movntdq xmm0, oword ptr [esi]
That looks odd, and I wondered whether my counts were correct. But I can't find an error... how can movntdq be so slow?
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
35139 cycles for 100 * rep movsb
36189 cycles for 100 * rep movsd
161839 cycles for 100 * movlps qword ptr [esi+8*ecx]
82736 cycles for 100 * movaps xmm0, oword ptr [esi]
173215 cycles for 100 * movntdq xmm0, oword ptr [esi]
35248 cycles for 100 * rep movsb
36580 cycles for 100 * rep movsd
160325 cycles for 100 * movlps qword ptr [esi+8*ecx]
82958 cycles for 100 * movaps xmm0, oword ptr [esi]
174700 cycles for 100 * movntdq xmm0, oword ptr [esi]
35392 cycles for 100 * rep movsb
36231 cycles for 100 * rep movsd
160691 cycles for 100 * movlps qword ptr [esi+8*ecx]
83033 cycles for 100 * movaps xmm0, oword ptr [esi]
174148 cycles for 100 * movntdq xmm0, oword ptr [esi]
35310 cycles for 100 * rep movsb
36172 cycles for 100 * rep movsd
162454 cycles for 100 * movlps qword ptr [esi+8*ecx]
83124 cycles for 100 * movaps xmm0, oword ptr [esi]
173325 cycles for 100 * movntdq xmm0, oword ptr [esi]
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]
--- ok ---
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
22808 cycles for 100 * rep movsb
22940 cycles for 100 * rep movsd
82232 cycles for 100 * movlps qword ptr [esi+8*ecx]
55785 cycles for 100 * movaps xmm0, oword ptr [esi]
148033 cycles for 100 * movntdq xmm0, oword ptr [esi]
22471 cycles for 100 * rep movsb
22846 cycles for 100 * rep movsd
82406 cycles for 100 * movlps qword ptr [esi+8*ecx]
57255 cycles for 100 * movaps xmm0, oword ptr [esi]
151683 cycles for 100 * movntdq xmm0, oword ptr [esi]
22507 cycles for 100 * rep movsb
23157 cycles for 100 * rep movsd
82990 cycles for 100 * movlps qword ptr [esi+8*ecx]
55098 cycles for 100 * movaps xmm0, oword ptr [esi]
144060 cycles for 100 * movntdq xmm0, oword ptr [esi]
22462 cycles for 100 * rep movsb
22567 cycles for 100 * rep movsd
82398 cycles for 100 * movlps qword ptr [esi+8*ecx]
54862 cycles for 100 * movaps xmm0, oword ptr [esi]
142853 cycles for 100 * movntdq xmm0, oword ptr [esi]
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]
--- ok ---
Summary table, one column per machine (left to right: AMD Athlon Gold 3150U, Intel i3-4005U, Intel i3-10110U, Intel i5-7200U, Intel i7-11800H; all SSE4):

cycles for 100 *                Athlon 3150U  i3-4005U  i3-10110U  i5-7200U  i7-11800H
rep movsb                              57654     35139      22808     31040      13916
rep movsd                              65892     36189      22940     31143      16152
movlps qword ptr [esi+8*ecx]          112954    161839      82232    117139     106432
movaps xmm0, oword ptr [esi]           58152     82736      55785     72688      42050
movntdq xmm0, oword ptr [esi]         129800    173215     148033    109706      59500
rep movsb                              59723     35248      22471     31381      16298
rep movsd                              59356     36580      22846     31663      15607
movlps qword ptr [esi+8*ecx]          113875    160325      82406    116001     109919
movaps xmm0, oword ptr [esi]           57518     82958      57255     71727      41897
movntdq xmm0, oword ptr [esi]         130509    174700     151683    110933      58949
rep movsb                              59061     35392      22507     31644      15691
rep movsd                              63768     36231      23157     37560      16515
movlps qword ptr [esi+8*ecx]          112908    160691      82990    114454     108793
movaps xmm0, oword ptr [esi]           57839     83033      55098     72541      41640
movntdq xmm0, oword ptr [esi]         132310    174148     144060    124899     101036
rep movsb                              59031     35310      22462     31097      17390
rep movsd                              58619     36172      22567     31010      16209
movlps qword ptr [esi+8*ecx]          129052    162454      82398    115056     117106
movaps xmm0, oword ptr [esi]           57675     83124      54862     72463      42124
movntdq xmm0, oword ptr [esi]         131438    173325     142853    109951      60058

bytes for rep movsb                       19        19         19        19         19
bytes for rep movsd                       19        19         19        19         19
bytes for movlps qword ptr [esi+8*ecx]    29        29         29        29         29
bytes for movaps xmm0, oword ptr [esi]    34        34         34        34         34
bytes for movntdq xmm0, oword ptr [esi]   35        35         35        35         35
Sub ImportValues()
    ' Imports one benchmark result file (as posted above) into the
    ' next free column of the active sheet.
    Dim Row As Long, Col As Long, Pos As Long
    Dim nFileNro As Integer, nClk As Double
    Dim sFileName As String, sTextRow As String

    Row = 1                                 ' title row
    Col = 2                                 ' first value column
    With ActiveSheet                        ' find the first empty column
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With

    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With

    nFileNro = FreeFile
    Row = 0                                 ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow ' title = CPU name
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9)) ' position of tab, if any
            If Pos = 0 Then Pos = 8          ' no tab: label starts at column 8
            If Left(sTextRow, 1) <> "" Then  ' skip empty lines
                If Left(sTextRow, 1) = "?" Then ActiveSheet.Cells(Row, Col) = "??" Else nClk = Val(sTextRow)
                If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos) ' instruction label
            End If
        End If
        If Row = 37 Then Exit Do            ' stop after the last data row
    Loop
    Close #nFileNro
End Sub
Quote from: jj2007 on December 09, 2021, 06:53:58 AM
Quote from: LiaoMi on December 09, 2021, 05:10:05 AM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
13916 cycles for 100 * rep movsb
...
59500 cycles for 100 * movntdq xmm0, oword ptr [esi]
That looks odd, and I wondered whether my counts were correct. But I can't find an error... how can movntdq be so slow?
Hi jj2007,
I would also like to know what is the reason for such slowdowns :undecided:
Random slow downs with AVX2 code - https://community.intel.com/t5/Intel-ISA-Extensions/Random-slow-downs-with-AVX2-code/m-p/1084764
Depending on CPU, 10x is approximately the difference between L1 cache and L3 cache latency. Is your thread pinned?
Jim Dempsey
How L1 and L2 CPU Caches Work, and Why They're an Essential Part of Modern Chips - https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
Reducing Memory Access Times with Caches - https://developers.redhat.com/blog/2016/03/01/reducing-memory-access-times-with-caches#
What is a "cache-friendly" code? - https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code
CS 201 Writing Cache-Friendly Code - Portland State University - http://web.cecs.pdx.edu/~jrb/cs201/lectures/cache.friendly.code.pdf
Very slow performance of VMOVNTDQ instruction - https://community.intel.com/t5/Intel-ISA-Extensions/Very-slow-performance-of-VMOVNTDQ-instruction/td-p/941697
Thanks for the link. The poor performance of AVX instructions intermixed with SSE is a well-known issue. Because the hardware must save and restore the upper context of the YMM registers, it incurs a penalty of a few dozen cycles. AVX 128-bit instructions automatically zero the upper half of the YMM registers; that is not the case when you use legacy SSE instructions, because they have no "knowledge" of the wider 256-bit registers. You can use Intel SDE to detect the penalty of an AVX-to-SSE transition.
AVX transition penalties and OS support - https://community.intel.com/t5/Intel-ISA-Extensions/AVX-transition-penalties-and-OS-support/m-p/931977
Intel Avoiding AVX-SSE Transition Penalties - https://web.archive.org/web/20160409073240/software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
What Every Programmer Should Know About Memory - https://www.akkadia.org/drepper/cpumemory.pdf
Few dozens of cycles... fascinating :rolleyes:
Intel Manual
66 0F E7 /r
MOVNTDQ m128, xmm1
A V/V SSE2 Move packed integer values in xmm1 to m128 using nontemporal hint.
Quote from: hutch-- on December 10, 2021, 10:33:46 PM
Intel Manual
66 0F E7 /r
MOVNTDQ m128, xmm1
A V/V SSE2 Move packed integer values in xmm1 to m128 using nontemporal hint.
Hi Hutch,
exactly :thup: :thup: :thup: :thup: Thanks!
What is the meaning of "non temporal" memory accesses in x86 - https://stackoverflow.com/questions/37070/what-is-the-meaning-of-non-temporal-memory-accesses-in-x86
When are x86 LFENCE, SFENCE and MFENCE instructions required? - https://stackoverflow.com/questions/27595595/when-are-x86-lfence-sfence-and-mfence-instructions-required
The "non temporal" phrase means lacking temporal locality. Caches exploit two kinds of locality - spatial and temporal - and by using a non-temporal instruction you're signaling to the processor that you don't expect the data item to be used in the near future.
Notes on "non-temporal" (aka "streaming") stores - https://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/
Optimizing Cache Usage With Nontemporal Accesses - https://vgatherps.github.io/2018-09-02-nontemporal/
/* Excerpt from the test.c linked below; cache_line, large_buffer,
   mfence() and rdtscp() are defined elsewhere in that project. */
void force_nt_store(cache_line *a) {
    __m128i zeros = {0, 0}; // chosen to use zeroing idiom
    __asm volatile("movntdq %0, (%1)\n\t"
#if BYTES > 16
                   "movntdq %0, 16(%1)\n\t"
#endif
#if BYTES > 32
                   "movntdq %0, 32(%1)\n\t"
#endif
#if BYTES > 48
                   "movntdq %0, 48(%1)"
#endif
                   :
                   : "x" (zeros), "r" (&a->vec_val)
                   : "memory");
}

uint64_t run_timer_loop(void) {
    mfence();
    uint64_t start = rdtscp();
    for (int i = 0; i < 32; i++) {
        force_nt_store(&large_buffer[i]);
    }
    mfence();
    uint64_t end = rdtscp();
    return end - start; // return the elapsed TSC count
}
nontemporal_stores - https://github.com/vgatherps/nontemporal_stores/blob/master/basic_write_allocate/test.c
movntdq + mfence - Example
https://www.felixcloutier.com/x86/mfence
https://www.felixcloutier.com/x86/lfence
https://www.felixcloutier.com/x86/sfence
.686
.model flat,C
.xmm
.code
;------------------------------------------------------------------------------
; VOID *
; InternalMemCopyMem (
; IN VOID *Destination,
; IN VOID *Source,
; IN UINTN Count
; );
;------------------------------------------------------------------------------
InternalMemCopyMem PROC USES esi edi
mov esi, [esp + 16] ; esi <- Source
mov edi, [esp + 12] ; edi <- Destination
mov edx, [esp + 20] ; edx <- Count
lea eax, [esi + edx - 1] ; eax <- End of Source
cmp esi, edi
jae @F
cmp eax, edi ; Overlapped?
jae @CopyBackward ; Copy backward if overlapped
@@:
xor ecx, ecx
sub ecx, edi
and ecx, 15 ; ecx + edi aligns on 16-byte boundary
jz @F
cmp ecx, edx
cmova ecx, edx
sub edx, ecx ; edx <- remaining bytes to copy
rep movsb
@@:
mov ecx, edx
and edx, 15
shr ecx, 4 ; ecx <- # of DQwords to copy
jz @CopyBytes
add esp, -16
movdqu [esp], xmm0 ; save xmm0
@@:
movdqu xmm0, [esi] ; esi may not be 16-bytes aligned
movntdq [edi], xmm0 ; edi should be 16-bytes aligned
add esi, 16
add edi, 16
loop @B
mfence
movdqu xmm0, [esp] ; restore xmm0
add esp, 16 ; stack cleanup
jmp @CopyBytes
@CopyBackward:
mov esi, eax ; esi <- Last byte in Source
lea edi, [edi + edx - 1] ; edi <- Last byte in Destination
std
@CopyBytes:
mov ecx, edx
rep movsb
cld
mov eax, [esp + 12] ; eax <- Destination as return value
ret
InternalMemCopyMem ENDP
END
.686
.model flat,C
.xmm
.code
;------------------------------------------------------------------------------
; VOID *
; EFIAPI
; InternalMemSetMem (
; IN VOID *Buffer,
; IN UINTN Count,
; IN UINT8 Value
; );
;------------------------------------------------------------------------------
InternalMemSetMem PROC USES edi
mov edx, [esp + 12] ; edx <- Count
mov edi, [esp + 8] ; edi <- Buffer
mov al, [esp + 16] ; al <- Value
xor ecx, ecx
sub ecx, edi
and ecx, 15 ; ecx + edi aligns on 16-byte boundary
jz @F
cmp ecx, edx
cmova ecx, edx
sub edx, ecx
rep stosb
@@:
mov ecx, edx
and edx, 15
shr ecx, 4 ; ecx <- # of DQwords to set
jz @SetBytes
mov ah, al ; ax <- Value | (Value << 8)
add esp, -16
movdqu [esp], xmm0 ; save xmm0
movd xmm0, eax
pshuflw xmm0, xmm0, 0 ; xmm0[0..63] <- Value repeats 8 times
movlhps xmm0, xmm0 ; xmm0 <- Value repeats 16 times
@@:
movntdq [edi], xmm0 ; edi should be 16-byte aligned
add edi, 16
loop @B
mfence
movdqu xmm0, [esp] ; restore xmm0
add esp, 16 ; stack cleanup
@SetBytes:
mov ecx, edx
rep stosb
mov eax, [esp + 8] ; eax <- Buffer as return value
ret
InternalMemSetMem ENDP
END
.686
.model flat,C
.xmm
.code
;------------------------------------------------------------------------------
; VOID *
; EFIAPI
; InternalMemZeroMem (
; IN VOID *Buffer,
; IN UINTN Count
; );
;------------------------------------------------------------------------------
InternalMemZeroMem PROC USES edi
mov edi, [esp + 8]
mov edx, [esp + 12]
xor ecx, ecx
sub ecx, edi
xor eax, eax
and ecx, 15
jz @F
cmp ecx, edx
cmova ecx, edx
sub edx, ecx
rep stosb
@@:
mov ecx, edx
and edx, 15
shr ecx, 4
jz @ZeroBytes
pxor xmm0, xmm0
@@:
movntdq [edi], xmm0
add edi, 16
loop @B
mfence
@ZeroBytes:
mov ecx, edx
rep stosb
mov eax, [esp + 8]
ret
InternalMemZeroMem ENDP
END
"cpuid" before "rdtsc" - https://newbedev.com/cpuid-before-rdtsc
It's to prevent out-of-order execution. From a link that has now disappeared from the web (but which was fortuitously copied here before it disappeared), this text is from an article entitled "Performance monitoring" by one John Eckerdal:
The Pentium Pro and Pentium II processors support out-of-order execution: instructions may be executed in a different order than you programmed them. This can be a source of errors if not taken care of.
To prevent this, the programmer must serialize the instruction queue. This can be done by inserting a serializing instruction, like the CPUID instruction, before the RDTSC instruction.
Two reasons:
As paxdiablo says, when the CPU sees a CPUID opcode it makes sure all the previous instructions are executed, then the CPUID is taken, before any subsequent instructions execute. Without such an instruction, the CPU execution pipeline may end up executing RDTSC before the instruction(s) you'd like to time.
A significant proportion of machines fail to synchronise the TSC registers across cores. If you want to read it from the horse's mouth, knock yourself out at http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx. So, when measuring an interval between TSC readings, unless they're taken on the same core, you'll have an effectively random but possibly constant (see below) interval introduced - it can easily be several seconds (yes, seconds) even soon after bootup. This effectively reflects how long the BIOS was running on a single core before kicking off the others, plus - if you've any nasty power-saving options on - increasing drift caused by cores running at different frequencies or shutting down again.
So, if you haven't nailed the threads reading TSC registers to the same core, you'll need to build some kind of cross-core delta table and know the core id (which is returned by CPUID) of each TSC sample in order to compensate for this offset. That's another reason you can see CPUID alongside RDTSC, and indeed a reason why, with the newer RDTSCP, many OSes store core id numbers into the extra TSC_AUX[31:0] data returned. (Available from Core i7 and Athlon 64 X2, RDTSCP is a much better option in all respects - the OS normally gives you the core id as mentioned, it is atomic with the TSC read, and it prevents instruction reordering.)
CPUID is serializing, preventing out-of-order execution of RDTSC.
These days you can safely use LFENCE instead. It's documented as serializing on the instruction stream (but not stores to memory) on Intel CPUs, and now also on AMD after their microcode update for Spectre.
https://hadibrais.wordpress.com/2018/05/14/the-significance-of-the-x86-lfence-instruction/ explains more about LFENCE.
See also https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf for a way to use RDTSCP that keeps CPUID (or LFENCE) out of the timed region:
LFENCE ; (or CPUID) Don't start the timed region until everything above has executed
RDTSC ; EDX:EAX = timestamp
mov ebx, eax ; low 32 bits of start time
code under test
RDTSCP ; built-in one way barrier stops it from running early
LFENCE ; (or CPUID) still use a barrier after to prevent anything weird
sub eax, ebx ; low 32 bits of end-start
I have generally found that the combination of CPUID and RDTSC stabilise timings and improves the accuracy of benchmarking.
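The serialized timing pattern quoted above can be sketched in C with compiler intrinsics (hedged: the function and the toy workload are mine; __rdtsc, __rdtscp and _mm_lfence are the GCC/Clang x86 intrinsics):

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc, __rdtscp, _mm_lfence */

/* Time a small workload with the LFENCE/RDTSC ... RDTSCP/LFENCE
   pattern from the Intel benchmarking paper linked above. */
static uint64_t time_small_loop(void)
{
    unsigned aux;                   /* receives TSC_AUX (core id) */
    volatile uint64_t sink = 0;     /* keeps the loop from being removed */
    _mm_lfence();                   /* don't start the region early */
    uint64_t start = __rdtsc();
    for (int i = 0; i < 1000; i++)  /* code under test */
        sink += (uint64_t)i;
    uint64_t end = __rdtscp(&aux);  /* one-way barrier: won't run early */
    _mm_lfence();                   /* nothing later leaks into the region */
    (void)sink;
    return end - start;
}
```

A dummy call first, as described at the top of the thread, still helps: the first run warms caches and branch predictors, so only the repeat runs are comparable.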
@Hutch: I didn't know before why it was so :thumbsup:
movntps + sfence - Example
xorps macro XMMReg1, XMMReg2
db 0FH, 057H, 0C0H + (XMMReg1 * 8) + XMMReg2
endm
movntps macro GeneralReg, Offset, XMMReg
db 0FH, 02BH, 040H + (XmmReg * 8) + GeneralReg, Offset
endm
sfence macro
db 0FH, 0AEH, 0F8H
endm
movaps_load macro XMMReg, GeneralReg
db 0FH, 028H, (XMMReg * 8) + 4, (4 * 8) + GeneralReg
endm
movaps_store macro GeneralReg, XMMReg
db 0FH, 029H, (XMMReg * 8) + 4, (4 * 8) + GeneralReg
endm
;
; Register Definitions (for instruction macros).
;
rEAX equ 0
rECX equ 1
rEDX equ 2
rEBX equ 3
rESP equ 4
rEBP equ 5
rESI equ 6
rEDI equ 7
Test Proc
sti ; reenable context switching
movaps_store rESP, 0 ; save xmm0
mov ecx, Dest
call XMMZeroPage ; zero MEM
movaps_load 0, rESP ; restore xmm0
ret ; return to caller
Test ENDP
XMMZeroPage Proc
xorps 0, 0 ; zero xmm0 (128 bits)
mov eax, SIZE ; Number of Iterations
inner:
movntps rECX, 0, 0 ; store bytes 0 - 15
movntps rECX, 16, 0 ; 16 - 31
movntps rECX, 32, 0 ; 32 - 47
movntps rECX, 48, 0 ; 48 - 63
add ecx, 64 ; increment base
dec eax ; decrement loop count
jnz short inner
; Force all stores to complete before any other
; stores from this processor.
sfence
ifndef SFENCE_IS_NOT_BUSTED
; the next uncached write to this processor's apic
; may fail unless the store pipes have drained. sfence by
; itself is not enough. Force drainage now by doing an
; interlocked exchange.
xchg [esp-4], eax
endif
ret
XMMZeroPage ENDP
Intel memory ordering, fence instructions, and atomic operations - https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
MFENCE and LFENCE micro-architectural implementation (Patent) - https://patents.google.com/patent/US6678810B1/en or https://patentimages.storage.googleapis.com/d4/fd/41/fd35729a18a3cd/US6678810.pdf
MFENCE and LFENCE micro-architectural implementation method and system - https://patents.google.com/patent/US6651151B2/en or https://patentimages.storage.googleapis.com/fe/41/a3/ddea1fb5732c17/US6651151.pdf
Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?
x86 fence instructions can be briefly described as follows:
MFENCE prevents any later loads or stores from becoming globally observable before any earlier loads or stores. It drains the store buffer before later loads can execute.
LFENCE blocks instruction dispatch (Intel's terminology) until all earlier instructions retire. This is currently implemented by draining the ROB (ReOrder Buffer) before later instructions can issue into the back-end.
SFENCE only orders stores against other stores, i.e. prevents NT stores from committing from the store buffer ahead of SFENCE itself. But otherwise SFENCE is just like a plain store that moves through the store buffer. Think of it like putting a divider on a grocery-store checkout conveyor belt that stops NT stores from getting grabbed early. It does not necessarily force the store buffer to be drained before it retires from the ROB, so putting LFENCE after it doesn't add up to MFENCE.
A "serializing instruction" like CPUID (and IRET, etc) drains everything (ROB, store buffer) before later instructions can issue into the back-end. MFENCE + LFENCE would also do that, but true serializing instructions might also have other effects, I don't know.
Memory Reordering Caught in the Act - https://preshing.com/20120515/memory-reordering-caught-in-the-act/
Does the Intel Memory Model make SFENCE and LFENCE redundant? - https://stackoverflow.com/questions/32705169/does-the-intel-memory-model-make-sfence-and-lfence-redundant/32705560#32705560
If I am timing something really critical, I use the API SleepEx() to pause the thread for about 100 ms to try and get the start of a time slice.
Quote from: hutch-- on December 11, 2021, 02:16:41 AM
I have generally found that the combination of CPUID and RDTSC stabilise timings and improves the accuracy of benchmarking.
\Masm32\macros\timers.asm
xor eax, eax ;; Use same CPUID input value for each call
cpuid ;; Flush pipe & wait for pending ops to finish
rdtsc ;; Read Time Stamp Counter
Michael Webster :thumbsup:
Quote from: hutch-- on December 10, 2021, 10:33:46 PM
Intel Manual
66 0F E7 /r
MOVNTDQ m128, xmm1
A V/V SSE2 Move packed integer values in xmm1 to m128 using nontemporal hint.
the old advice was that for system RAM -> PCI Express -> GPU VRAM transfers it is faster, but I forgot to time the different alternatives when ddraw blends lots of circles; using movaps looked very fast though
note the 66 prefix: most packed SSE2 instructions are one byte bigger than their SSE counterparts, so I wonder how many more instructions fit in a 64-byte cache line with the SSE versions instead?
magnus,
I think you have missed something here. The mnemonic "movntdq" is designed to be used in conjunction with an instruction like "movdqa" or "movdqu": the latter two load memory through the cache, while "movntdq" writes back to memory bypassing the cache. The reduction in cache pollution generally yields an improvement in performance.
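As a sketch of that pairing, here is my own C-intrinsics illustration, not code from this thread; the `nt_copy` name and the alignment constraints in the comments are my assumptions:

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Copy 'bytes' (a multiple of 16) from a possibly-unaligned 'src' to a
   16-byte-aligned 'dst'. The load goes through the cache (movdqu); the
   store bypasses it (movntdq), reducing cache pollution on big copies. */
static void nt_copy(void *dst, const void *src, size_t bytes)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t i = 0; i < bytes; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(s + i)); /* movdqu  */
        _mm_stream_si128((__m128i *)(d + i), x);               /* movntdq */
    }
    _mm_sfence(); /* make the NT stores visible before later stores */
}
```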
Quote from: hutch-- on December 12, 2021, 02:05:43 PM"movntdq" is designed to be used in conjunction with an instruction like either "movdqa" or "movdqu"
I had movaps before, but even with movdqa or movdqu it won't become any faster. Mysterious :rolleyes:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
63929 cycles for 100 * rep movsb
95770 cycles for 100 * rep movsd
209628 cycles for 100 * movlps qword ptr [esi+8*ecx]
120512 cycles for 100 * movaps xmm0, oword ptr [esi]
176887 cycles for 100 * movdqa + movntdq
175955 cycles for 100 * movdqu + movntdq
65768 cycles for 100 * rep movsb
64697 cycles for 100 * rep movsd
206155 cycles for 100 * movlps qword ptr [esi+8*ecx]
122034 cycles for 100 * movaps xmm0, oword ptr [esi]
174827 cycles for 100 * movdqa + movntdq
176240 cycles for 100 * movdqu + movntdq
65109 cycles for 100 * rep movsb
64308 cycles for 100 * rep movsd
208594 cycles for 100 * movlps qword ptr [esi+8*ecx]
120838 cycles for 100 * movaps xmm0, oword ptr [esi]
176082 cycles for 100 * movdqa + movntdq
176391 cycles for 100 * movdqu + movntdq
65057 cycles for 100 * rep movsb
64755 cycles for 100 * rep movsd
206689 cycles for 100 * movlps qword ptr [esi+8*ecx]
121700 cycles for 100 * movaps xmm0, oword ptr [esi]
175600 cycles for 100 * movdqa + movntdq
176981 cycles for 100 * movdqu + movntdq
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
36 bytes for movdqa + movntdq
36 bytes for movdqu + movntdq
Quote from: jj2007 on December 12, 2021, 08:45:19 PM
Quote from: hutch-- on December 12, 2021, 02:05:43 PM"movntdq" is designed to be used in conjunction with an instruction like either "movdqa" or "movdqu"
I had movaps before, but even with movdqa or movdqu it won't become any faster. Mysterious :rolleyes:
Hi jj2007,
please add two more examples from here http://masm32.com/board/index.php?topic=9691.msg106286#msg106286
"movntdq + mfence"
@@:
movdqu xmm0, [esi] ; esi may not be 16-bytes aligned
movntdq [edi], xmm0 ; edi should be 16-bytes aligned
add esi, 16
add edi, 16
loop @B
mfence
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
17005 cycles for 100 * rep movsb
16922 cycles for 100 * rep movsd
106248 cycles for 100 * movlps qword ptr [esi+8*ecx]
41768 cycles for 100 * movaps xmm0, oword ptr [esi]
56037 cycles for 100 * movdqa + movntdq
55746 cycles for 100 * movdqu + movntdq
16797 cycles for 100 * rep movsb
17090 cycles for 100 * rep movsd
105885 cycles for 100 * movlps qword ptr [esi+8*ecx]
42111 cycles for 100 * movaps xmm0, oword ptr [esi]
56001 cycles for 100 * movdqa + movntdq
56026 cycles for 100 * movdqu + movntdq
17075 cycles for 100 * rep movsb
16702 cycles for 100 * rep movsd
107414 cycles for 100 * movlps qword ptr [esi+8*ecx]
41896 cycles for 100 * movaps xmm0, oword ptr [esi]
56205 cycles for 100 * movdqa + movntdq
56293 cycles for 100 * movdqu + movntdq
16736 cycles for 100 * rep movsb
17064 cycles for 100 * rep movsd
105788 cycles for 100 * movlps qword ptr [esi+8*ecx]
41915 cycles for 100 * movaps xmm0, oword ptr [esi]
56349 cycles for 100 * movdqa + movntdq
56819 cycles for 100 * movdqu + movntdq
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
36 bytes for movdqa + movntdq
36 bytes for movdqu + movntdq
--- ok ---
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
62547 cycles for 100 * rep movsb
63471 cycles for 100 * rep movsd
119633 cycles for 100 * movlps qword ptr [esi+8*ecx]
60383 cycles for 100 * movaps xmm0, oword ptr [esi]
120757 cycles for 100 * movdqa + movntdq
115172 cycles for 100 * movdqu + movntdq
63334 cycles for 100 * rep movsb
62718 cycles for 100 * rep movsd
118873 cycles for 100 * movlps qword ptr [esi+8*ecx]
60457 cycles for 100 * movaps xmm0, oword ptr [esi]
112820 cycles for 100 * movdqa + movntdq
116539 cycles for 100 * movdqu + movntdq
62664 cycles for 100 * rep movsb
63786 cycles for 100 * rep movsd
119998 cycles for 100 * movlps qword ptr [esi+8*ecx]
57309 cycles for 100 * movaps xmm0, oword ptr [esi]
118881 cycles for 100 * movdqa + movntdq
112190 cycles for 100 * movdqu + movntdq
63090 cycles for 100 * rep movsb
63073 cycles for 100 * rep movsd
118692 cycles for 100 * movlps qword ptr [esi+8*ecx]
59713 cycles for 100 * movaps xmm0, oword ptr [esi]
117861 cycles for 100 * movdqa + movntdq
117263 cycles for 100 * movdqu + movntdq
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
36 bytes for movdqa + movntdq
36 bytes for movdqu + movntdq
Quote from: LiaoMi on December 12, 2021, 11:31:29 PMplease add two more examples from here http://masm32.com/board/index.php?topic=9691.msg106286#msg106286
"movntdq + mfence"
@@:
movdqu xmm0, [esi] ; esi may not be 16-bytes aligned
movntdq [edi], xmm0 ; edi should be 16-bytes aligned
add esi, 16
add edi, 16
loop @B
mfence
I'm not impressed...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
65904 cycles for 100 * rep movsb
70679 cycles for 100 * rep movsd
207177 cycles for 100 * movlps qword ptr [esi+8*ecx]
121524 cycles for 100 * movaps xmm0, oword ptr [esi]
191206 cycles for 100 * movdqa + movntdq
194912 cycles for 100 * movdqu + movntdq
197640 cycles for 100 * movdqu + movntdq + mfence
66396 cycles for 100 * rep movsb
64295 cycles for 100 * rep movsd
207218 cycles for 100 * movlps qword ptr [esi+8*ecx]
121237 cycles for 100 * movaps xmm0, oword ptr [esi]
192188 cycles for 100 * movdqa + movntdq
193955 cycles for 100 * movdqu + movntdq
195811 cycles for 100 * movdqu + movntdq + mfence
65465 cycles for 100 * rep movsb
63888 cycles for 100 * rep movsd
209074 cycles for 100 * movlps qword ptr [esi+8*ecx]
122465 cycles for 100 * movaps xmm0, oword ptr [esi]
190494 cycles for 100 * movdqa + movntdq
192326 cycles for 100 * movdqu + movntdq
198034 cycles for 100 * movdqu + movntdq + mfence
65560 cycles for 100 * rep movsb
65119 cycles for 100 * rep movsd
206794 cycles for 100 * movlps qword ptr [esi+8*ecx]
121545 cycles for 100 * movaps xmm0, oword ptr [esi]
191100 cycles for 100 * movdqa + movntdq
196902 cycles for 100 * movdqu + movntdq
197136 cycles for 100 * movdqu + movntdq + mfence
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
36 bytes for movdqa + movntdq
36 bytes for movdqu + movntdq
39 bytes for movdqu + movntdq + mfence
JJ,
Keep in mind that the sample size being looped will affect the timing of most of the combinations. If you run a gigabyte sample, you get rid of those effects. The other factor is that different hardware will give different results.
Quote from: jj2007 on December 13, 2021, 12:03:37 AM
Quote from: LiaoMi on December 12, 2021, 11:31:29 PMplease add two more examples from here http://masm32.com/board/index.php?topic=9691.msg106286#msg106286
"movntdq + mfence"
I'm not impressed...
Very curious results :rolleyes:
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
17782 cycles for 100 * rep movsb
17554 cycles for 100 * rep movsd
115012 cycles for 100 * movlps qword ptr [esi+8*ecx]
52006 cycles for 100 * movaps xmm0, oword ptr [esi]
58101 cycles for 100 * movdqa + movntdq
57415 cycles for 100 * movdqu + movntdq
73437 cycles for 100 * movdqu + movntdq + mfence
18073 cycles for 100 * rep movsb
17701 cycles for 100 * rep movsd
110545 cycles for 100 * movlps qword ptr [esi+8*ecx]
42643 cycles for 100 * movaps xmm0, oword ptr [esi]
56827 cycles for 100 * movdqa + movntdq
58362 cycles for 100 * movdqu + movntdq
72001 cycles for 100 * movdqu + movntdq + mfence
19436 cycles for 100 * rep movsb
17883 cycles for 100 * rep movsd
107491 cycles for 100 * movlps qword ptr [esi+8*ecx]
43259 cycles for 100 * movaps xmm0, oword ptr [esi]
56876 cycles for 100 * movdqa + movntdq
57166 cycles for 100 * movdqu + movntdq
74082 cycles for 100 * movdqu + movntdq + mfence
18036 cycles for 100 * rep movsb
18419 cycles for 100 * rep movsd
106922 cycles for 100 * movlps qword ptr [esi+8*ecx]
42377 cycles for 100 * movaps xmm0, oword ptr [esi]
58547 cycles for 100 * movdqa + movntdq
57797 cycles for 100 * movdqu + movntdq
74547 cycles for 100 * movdqu + movntdq + mfence
19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
36 bytes for movdqa + movntdq
36 bytes for movdqu + movntdq
39 bytes for movdqu + movntdq + mfence
--- ok ---
Here are two test pieces for 32-bit memory copy: an unaligned SSE2 copy and a normal rep movsb copy. In every instance the SSE2 version is faster, to the extent that the test pieces don't need to be stabilised or run at a higher priority. It is the same source for both, but each has been saved as a separate exe file so that there is no cache interaction between the two.
SSE2 Copy
--------
843 ms
--------
Press any key to continue ...
ByteCopy
--------
1219 ms
--------
Press any key to continue ...
Quote from: hutch-- on December 13, 2021, 01:39:30 PM
SSE2 Copy
--------
640 ms
--------
Press any key to continue ...
ByteCopy
--------
719 ms
--------
Press any key to continue ...
Similar for me. The interesting bit, though: if you use movups instead of movnt, rep movsb is faster.
My test pieces were made with a shorter copy but more iterations. Which implies that they used the cache, and then, apparently, movnt has no advantage.
I think that is normally the case with SSE mnemonics: they are designed for streaming and apparently have a wind-up cost that works against short-duration loop code. In the past, the advice on SSE code was to more or less forget integer-code optimisation and set up the SSE code so it did the work. I have never yet got any gain out of SSE code by unrolling it, so it really is a different system built into the CPU.
This much is clear: over time the unaligned mnemonics have come a lot closer in speed to the fully aligned ones, and there is little gain in using the aligned instructions unless your code design requires it.
ByteCopy
--------
1266 ms
--------
SSE2
--------
875 ms
--------
bytecopy.exe
--------
695 ms
--------
sse2copy.exe
--------
680 ms
--------
Quote from: jj2007 on December 13, 2021, 07:58:53 PM
Similar for me. The interesting bit, though: if you use movups instead of movnt, rep movsb is faster.
I did some tests here: when I align data to 64, rep movs(bd) execution time decreased by roughly 40%. The other functions' results remain unchanged.
align 16
db 1
align 16
db 1
align 16
db 1
align 16
somestring db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero"
REPEAT 99
db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero"
ENDM
I am interested in timings with ddraw or SDL: for someone with PCI Express, does system RAM -> VRAM still have the advantage with movntdqa, and is reading from VRAM still ~100x slower?
I can't test with a laptop with shared memory.
I wonder if the DX loadtexturefrommemory API uses movntdqa?
I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
++++++++-+++8 of 20 tests valid,
315842 kCycles for 1 * rep movsb
241232 kCycles for 1 * rep movsd
304695 kCycles for 1 * movlps qword ptr [esi+8*ecx]
302014 kCycles for 1 * movaps xmm0, oword ptr [esi]
208018 kCycles for 1 * movdqa + movntdq
207876 kCycles for 1 * movdqu + movntdq
207752 kCycles for 1 * movdqu + movntdq + mfence
249181 kCycles for 1 * rep movsb
239809 kCycles for 1 * rep movsd
304868 kCycles for 1 * movlps qword ptr [esi+8*ecx]
301253 kCycles for 1 * movaps xmm0, oword ptr [esi]
207931 kCycles for 1 * movdqa + movntdq
208272 kCycles for 1 * movdqu + movntdq
207503 kCycles for 1 * movdqu + movntdq + mfence
249727 kCycles for 1 * rep movsb
241799 kCycles for 1 * rep movsd
303516 kCycles for 1 * movlps qword ptr [esi+8*ecx]
301728 kCycles for 1 * movaps xmm0, oword ptr [esi]
207608 kCycles for 1 * movdqa + movntdq
208094 kCycles for 1 * movdqu + movntdq
208854 kCycles for 1 * movdqu + movntdq + mfence
248574 kCycles for 1 * rep movsb
240836 kCycles for 1 * rep movsd
304675 kCycles for 1 * movlps qword ptr [esi+8*ecx]
301674 kCycles for 1 * movaps xmm0, oword ptr [esi]
208379 kCycles for 1 * movdqa + movntdq
207882 kCycles for 1 * movdqu + movntdq
207742 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
+-++++++++++++++++++
261718 kCycles for 1 * rep movsb
238384 kCycles for 1 * rep movsd
269729 kCycles for 1 * movlps qword ptr [esi+8*ecx]
236182 kCycles for 1 * movaps xmm0, oword ptr [esi]
156614 kCycles for 1 * movdqa + movntdq
156042 kCycles for 1 * movdqu + movntdq
156594 kCycles for 1 * movdqu + movntdq + mfence
236577 kCycles for 1 * rep movsb
236369 kCycles for 1 * rep movsd
270908 kCycles for 1 * movlps qword ptr [esi+8*ecx]
236626 kCycles for 1 * movaps xmm0, oword ptr [esi]
156354 kCycles for 1 * movdqa + movntdq
155998 kCycles for 1 * movdqu + movntdq
156243 kCycles for 1 * movdqu + movntdq + mfence
235567 kCycles for 1 * rep movsb
236734 kCycles for 1 * rep movsd
276802 kCycles for 1 * movlps qword ptr [esi+8*ecx]
236012 kCycles for 1 * movaps xmm0, oword ptr [esi]
156233 kCycles for 1 * movdqa + movntdq
156445 kCycles for 1 * movdqu + movntdq
156944 kCycles for 1 * movdqu + movntdq + mfence
237039 kCycles for 1 * rep movsb
238233 kCycles for 1 * rep movsd
270702 kCycles for 1 * movlps qword ptr [esi+8*ecx]
236677 kCycles for 1 * movaps xmm0, oword ptr [esi]
155610 kCycles for 1 * movdqa + movntdq
156935 kCycles for 1 * movdqu + movntdq
156013 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
-----------9 of 20 tests valid,
103297 kCycles for 1 * rep movsb
83453 kCycles for 1 * rep movsd
151305 kCycles for 1 * movlps qword ptr [esi+8*ecx]
140797 kCycles for 1 * movaps xmm0, oword ptr [esi]
85881 kCycles for 1 * movdqa + movntdq
87420 kCycles for 1 * movdqu + movntdq
85314 kCycles for 1 * movdqu + movntdq + mfence
82482 kCycles for 1 * rep movsb
81107 kCycles for 1 * rep movsd
148720 kCycles for 1 * movlps qword ptr [esi+8*ecx]
140735 kCycles for 1 * movaps xmm0, oword ptr [esi]
87417 kCycles for 1 * movdqa + movntdq
85541 kCycles for 1 * movdqu + movntdq
86765 kCycles for 1 * movdqu + movntdq + mfence
83181 kCycles for 1 * rep movsb
81348 kCycles for 1 * rep movsd
149105 kCycles for 1 * movlps qword ptr [esi+8*ecx]
141740 kCycles for 1 * movaps xmm0, oword ptr [esi]
85743 kCycles for 1 * movdqa + movntdq
86256 kCycles for 1 * movdqu + movntdq
87608 kCycles for 1 * movdqu + movntdq + mfence
81210 kCycles for 1 * rep movsb
81663 kCycles for 1 * rep movsd
150942 kCycles for 1 * movlps qword ptr [esi+8*ecx]
140339 kCycles for 1 * movaps xmm0, oword ptr [esi]
86239 kCycles for 1 * movdqa + movntdq
87931 kCycles for 1 * movdqu + movntdq
85408 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
Hi Jochen,
What does the number of valid tests mean?
Hi Jochen,
Two systems. Tests valid?
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
+++-+++++++++++5 of 20 tests valid,
352139 kCycles for 1 * rep movsb
237663 kCycles for 1 * rep movsd
287198 kCycles for 1 * movlps qword ptr [esi+8*ecx]
279630 kCycles for 1 * movaps xmm0, oword ptr [esi]
181083 kCycles for 1 * movdqa + movntdq
181125 kCycles for 1 * movdqu + movntdq
180909 kCycles for 1 * movdqu + movntdq + mfence
193980 kCycles for 1 * rep movsb
214331 kCycles for 1 * rep movsd
278425 kCycles for 1 * movlps qword ptr [esi+8*ecx]
274866 kCycles for 1 * movaps xmm0, oword ptr [esi]
179814 kCycles for 1 * movdqa + movntdq
179520 kCycles for 1 * movdqu + movntdq
179469 kCycles for 1 * movdqu + movntdq + mfence
192139 kCycles for 1 * rep movsb
210664 kCycles for 1 * rep movsd
279349 kCycles for 1 * movlps qword ptr [esi+8*ecx]
274878 kCycles for 1 * movaps xmm0, oword ptr [esi]
180115 kCycles for 1 * movdqa + movntdq
179785 kCycles for 1 * movdqu + movntdq
179769 kCycles for 1 * movdqu + movntdq + mfence
191809 kCycles for 1 * rep movsb
216878 kCycles for 1 * rep movsd
277053 kCycles for 1 * movlps qword ptr [esi+8*ecx]
277263 kCycles for 1 * movaps xmm0, oword ptr [esi]
178899 kCycles for 1 * movdqa + movntdq
179083 kCycles for 1 * movdqu + movntdq
178847 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
--- ok ---
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
+++++15 of 20 tests valid,
241261 kCycles for 1 * rep movsb
201928 kCycles for 1 * rep movsd
250473 kCycles for 1 * movlps qword ptr [esi+8*ecx]
245765 kCycles for 1 * movaps xmm0, oword ptr [esi]
182061 kCycles for 1 * movdqa + movntdq
187806 kCycles for 1 * movdqu + movntdq
197382 kCycles for 1 * movdqu + movntdq + mfence
224850 kCycles for 1 * rep movsb
202630 kCycles for 1 * rep movsd
234536 kCycles for 1 * movlps qword ptr [esi+8*ecx]
228211 kCycles for 1 * movaps xmm0, oword ptr [esi]
191152 kCycles for 1 * movdqa + movntdq
188628 kCycles for 1 * movdqu + movntdq
185565 kCycles for 1 * movdqu + movntdq + mfence
206426 kCycles for 1 * rep movsb
206008 kCycles for 1 * rep movsd
233301 kCycles for 1 * movlps qword ptr [esi+8*ecx]
229024 kCycles for 1 * movaps xmm0, oword ptr [esi]
181524 kCycles for 1 * movdqa + movntdq
198103 kCycles for 1 * movdqu + movntdq
177373 kCycles for 1 * movdqu + movntdq + mfence
199886 kCycles for 1 * rep movsb
200050 kCycles for 1 * rep movsd
233793 kCycles for 1 * movlps qword ptr [esi+8*ecx]
228413 kCycles for 1 * movaps xmm0, oword ptr [esi]
177392 kCycles for 1 * movdqa + movntdq
175842 kCycles for 1 * movdqu + movntdq
175220 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
--- ok ---
Regards,
Steve N.
Quote from: jj2007 on December 14, 2021, 03:54:57 AM
I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
++++---++-+--+++-++1 of 20 tests valid,
111187 kCycles for 1 * rep movsb
86590 kCycles for 1 * rep movsd
131239 kCycles for 1 * movlps qword ptr [esi+8*ecx]
119079 kCycles for 1 * movaps xmm0, oword ptr [esi]
99788 kCycles for 1 * movdqa + movntdq
89096 kCycles for 1 * movdqu + movntdq
92749 kCycles for 1 * movdqu + movntdq + mfence
99740 kCycles for 1 * rep movsb
92438 kCycles for 1 * rep movsd
119977 kCycles for 1 * movlps qword ptr [esi+8*ecx]
111659 kCycles for 1 * movaps xmm0, oword ptr [esi]
76366 kCycles for 1 * movdqa + movntdq
79162 kCycles for 1 * movdqu + movntdq
77279 kCycles for 1 * movdqu + movntdq + mfence
89597 kCycles for 1 * rep movsb
85665 kCycles for 1 * rep movsd
125051 kCycles for 1 * movlps qword ptr [esi+8*ecx]
111892 kCycles for 1 * movaps xmm0, oword ptr [esi]
76149 kCycles for 1 * movdqa + movntdq
76483 kCycles for 1 * movdqu + movntdq
76167 kCycles for 1 * movdqu + movntdq + mfence
86964 kCycles for 1 * rep movsb
85324 kCycles for 1 * rep movsd
121596 kCycles for 1 * movlps qword ptr [esi+8*ecx]
111769 kCycles for 1 * movaps xmm0, oword ptr [esi]
76968 kCycles for 1 * movdqa + movntdq
75970 kCycles for 1 * movdqu + movntdq
75677 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
--- ok ---
Quote from: jj2007 on December 14, 2021, 03:54:57 AM
I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:
Quote from: mineiro on December 14, 2021, 01:24:01 AM
I do some tests here, when I align data to 64, rep movs(bd) execution decreased by a 50% 40% ratio. Others functions results remains unchanged.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
++-++--++++9 of 20 tests valid,
242232 kCycles for 1 * rep movsb
215571 kCycles for 1 * rep movsd
227473 kCycles for 1 * movlps qword ptr [esi+8*ecx]
230151 kCycles for 1 * movaps xmm0, oword ptr [esi]
146063 kCycles for 1 * movdqa + movntdq
146242 kCycles for 1 * movdqu + movntdq
146090 kCycles for 1 * movdqu + movntdq + mfence
208745 kCycles for 1 * rep movsb
214643 kCycles for 1 * rep movsd
227536 kCycles for 1 * movlps qword ptr [esi+8*ecx]
230180 kCycles for 1 * movaps xmm0, oword ptr [esi]
146452 kCycles for 1 * movdqa + movntdq
146127 kCycles for 1 * movdqu + movntdq
146534 kCycles for 1 * movdqu + movntdq + mfence
208675 kCycles for 1 * rep movsb
213978 kCycles for 1 * rep movsd
227433 kCycles for 1 * movlps qword ptr [esi+8*ecx]
230045 kCycles for 1 * movaps xmm0, oword ptr [esi]
146253 kCycles for 1 * movdqa + movntdq
146041 kCycles for 1 * movdqu + movntdq
146361 kCycles for 1 * movdqu + movntdq + mfence
208670 kCycles for 1 * rep movsb
216371 kCycles for 1 * rep movsd
227677 kCycles for 1 * movlps qword ptr [esi+8*ecx]
229852 kCycles for 1 * movaps xmm0, oword ptr [esi]
146070 kCycles for 1 * movdqa + movntdq
146349 kCycles for 1 * movdqu + movntdq
146292 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
Thanks to everybody :thup:
I see a pretty consistent pattern: rep movs* faster than movlps and movaps, but movntdq faster than rep movs*. Makes sense :thumbsup:
Hi,
I tried it on two older machines. One locked up ( kinda / sorta ),
one ran okay.
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
++++++++++++++6 of 20 tests valid,
916706 kCycles for 1 * rep movsb
822971 kCycles for 1 * rep movsd
1043451 kCycles for 1 * movlps qword ptr [esi+8*ecx]
904357 kCycles for 1 * movaps xmm0, oword ptr [esi]
708260 kCycles for 1 * movdqa + movntdq
709339 kCycles for 1 * movdqu + movntdq
683374 kCycles for 1 * movdqu + movntdq + mfence
825066 kCycles for 1 * rep movsb
834842 kCycles for 1 * rep movsd
1048805 kCycles for 1 * movlps qword ptr [esi+8*ecx]
904001 kCycles for 1 * movaps xmm0, oword ptr [esi]
677020 kCycles for 1 * movdqa + movntdq
684107 kCycles for 1 * movdqu + movntdq
683326 kCycles for 1 * movdqu + movntdq + mfence
820533 kCycles for 1 * rep movsb
820807 kCycles for 1 * rep movsd
1033703 kCycles for 1 * movlps qword ptr [esi+8*ecx]
899210 kCycles for 1 * movaps xmm0, oword ptr [esi]
677950 kCycles for 1 * movdqa + movntdq
685388 kCycles for 1 * movdqu + movntdq
682889 kCycles for 1 * movdqu + movntdq + mfence
819905 kCycles for 1 * rep movsb
820346 kCycles for 1 * rep movsd
1033769 kCycles for 1 * movlps qword ptr [esi+8*ecx]
899579 kCycles for 1 * movaps xmm0, oword ptr [esi]
676660 kCycles for 1 * movdqa + movntdq
685048 kCycles for 1 * movdqu + movntdq
682723 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
--- ok ---
Steve N.
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4) AMD Ryzen 9 5950X 16-Core Processor (SSE4) Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4) Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4) Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4) Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4) 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4) Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
kCycles for 1 * rep movsb 261718 103297 352139 241261 315842 242232 111187 916706
kCycles for 1 * rep movsd 238384 83453 237663 201928 241232 215571 86590 822971
kCycles for 1 * movlps qword ptr [esi+8*ecx 269729 151305 287198 250473 304695 227473 131239 1043451
kCycles for 1 * movaps xmm0, oword ptr [esi 236182 140797 279630 245765 302014 230151 119079 904357
kCycles for 1 * movdqa + movntdq 156614 85881 181083 182061 208018 146063 99788 708260
kCycles for 1 * movdqu + movntdq 156042 87420 181125 187806 207876 146242 89096 709339
kCycles for 1 * movdqu + movntdq + mfence 156594 85314 180909 197382 207752 146090 92749 683374
kCycles for 1 * rep movsb 236577 82482 193980 224850 249181 208745 99740 825066
kCycles for 1 * rep movsd 236369 81107 214331 202630 239809 214643 92438 834842
kCycles for 1 * movlps qword ptr [esi+8*ecx 270908 148720 278425 234536 304868 227536 119977 1048805
kCycles for 1 * movaps xmm0, oword ptr [esi 236626 140735 274866 228211 301253 230180 111659 904001
kCycles for 1 * movdqa + movntdq 156354 87417 179814 191152 207931 146452 76366 677020
kCycles for 1 * movdqu + movntdq 155998 85541 179520 188628 208272 146127 79162 684107
kCycles for 1 * movdqu + movntdq + mfence 156243 86765 179469 185565 207503 146534 77279 683326
kCycles for 1 * rep movsb 235567 83181 192139 206426 249727 208675 89597 820533
kCycles for 1 * rep movsd 236734 81348 210664 206008 241799 213978 85665 820807
kCycles for 1 * movlps qword ptr [esi+8*ecx 276802 149105 279349 233301 303516 227433 125051 1033703
kCycles for 1 * movaps xmm0, oword ptr [esi 236012 141740 274878 229024 301728 230045 111892 899210
kCycles for 1 * movdqa + movntdq 156233 85743 180115 181524 207608 146253 76149 677950
kCycles for 1 * movdqu + movntdq 156445 86256 179785 198103 208094 146041 76483 685388
kCycles for 1 * movdqu + movntdq + mfence 156944 87608 179769 177373 208854 146361 76167 682889
kCycles for 1 * rep movsb 237039 81210 191809 199886 248574 208670 86964 819905
kCycles for 1 * rep movsd 238233 81663 216878 200050 240836 216371 85324 820346
kCycles for 1 * movlps qword ptr [esi+8*ecx 270702 150942 277053 233793 304675 227677 121596 1033769
kCycles for 1 * movaps xmm0, oword ptr [esi 236677 140339 277263 228413 301674 229852 111769 899579
kCycles for 1 * movdqa + movntdq 155610 86239 178899 177392 208379 146070 76968 676660
kCycles for 1 * movdqu + movntdq 156935 87931 179083 175842 207882 146349 75970 685048
kCycles for 1 * movdqu + movntdq + mfence 156013 85408 178847 175220 207742 146292 75677 682723
Wow, you put a lot of work into this one, Timo :thumbsup:
Do you have it as a tab-delimited or csv file?
Thanks, Timo, here are the averages - extremely consistent:
All 9 CPUs
276 rep movsb
264 rep movsd
330 movlps qword ptr [esi+8*ecx]
304 movaps xmm0, oword ptr [esi]
216 movdqa + movntdq
217 movdqu + movntdq
216 movdqu + movntdq + mfence
AMD only
165 rep movsb
160 rep movsd
211 movlps qword ptr [esi+8*ecx]
189 movaps xmm0, oword ptr [esi]
121 movdqa + movntdq
122 movdqu + movntdq
121 movdqu + movntdq + mfence
Hi,
Using Timo's data I dropped the first data value in each run
(they looked about 10% too large) and normalized the averages
for each processor.
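The procedure described above (skip the warm-up-inflated first sample, average the rest per method, then divide by the per-CPU minimum so the fastest method reads 1.00) can be sketched in C. This is my own illustration, not code from the thread; the function and parameter names are invented:

```c
#include <stddef.h>
#include <assert.h>

/* runs is nruns x nmethods, row-major: runs[r*nmethods + m] is the
   kCycles reading for method m in run r.  Run 0 (the warm-up) is
   skipped; out[m] ends up as avg(method m) / avg(fastest method),
   so the fastest method normalizes to 1.00. */
static void normalize(const double *runs, size_t nruns, size_t nmethods,
                      double *out)
{
    double min = 0.0;
    for (size_t m = 0; m < nmethods; m++) {
        double sum = 0.0;
        for (size_t r = 1; r < nruns; r++)   /* r = 0 looked ~10% high */
            sum += runs[r * nmethods + m];
        out[m] = sum / (double)(nruns - 1);
        if (m == 0 || out[m] < min)
            min = out[m];
    }
    for (size_t m = 0; m < nmethods; m++)
        out[m] /= min;
}
```

Feeding it one CPU's column of Timo's table would reproduce the 1.00-based ratios shown below.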
MASM Forum example AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
kCycles for 1 * rep movsb 1.51 1.00 1.07 1.15 1.20 1.43 1.15 1.20
kCycles for 1 * rep movsd 1.52 1.00 1.22 1.11 1.16 1.47 1.09 1.21
kCycles for 1 * movlps qword ptr [esi+8*ecx] 1.74 1.83 1.56 1.30 1.46 1.56 1.55 1.52
kCycles for 1 * movaps xmm0, oword ptr [esi] 1.51 1.72 1.54 1.27 1.45 1.57 1.42 1.32
kCycles for 1 * movdqa + movntdq 1.00 1.05 1.00 1.00 1.00 1.00 1.03 1.00
kCycles for 1 * movdqu + movntdq 1.00 1.06 1.00 1.02 1.00 1.00 1.00 1.01
kCycles for 1 * movdqu + movntdq + mfence 1.00 1.05 1.00 1.00 1.00 1.00 1.00 1.00
Cheers,
Steve N.
What does "invalid" mean?
Jochen, if it's cache-size dependent, wouldn't it be good to use some API that shows the cache size?
A Celeron has a smaller cache, a Xeon a bigger one, and newer-generation CPUs with many cores have bigger caches still.
Reminds me of that movie where you are called an "invalid" if you are natural born, compared to the genetically improved.
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
+++++++-+11 of 20 tests valid,
212547 kCycles for 1 * rep movsb
179930 kCycles for 1 * rep movsd
220533 kCycles for 1 * movlps qword ptr [esi+8*ecx]
202840 kCycles for 1 * movaps xmm0, oword ptr [esi]
191639 kCycles for 1 * movdqa + movntdq
184883 kCycles for 1 * movdqu + movntdq
170484 kCycles for 1 * movdqu + movntdq + mfence
178210 kCycles for 1 * rep movsb
178902 kCycles for 1 * rep movsd
218612 kCycles for 1 * movlps qword ptr [esi+8*ecx]
202293 kCycles for 1 * movaps xmm0, oword ptr [esi]
152501 kCycles for 1 * movdqa + movntdq
151305 kCycles for 1 * movdqu + movntdq
151683 kCycles for 1 * movdqu + movntdq + mfence
178193 kCycles for 1 * rep movsb
178480 kCycles for 1 * rep movsd
219341 kCycles for 1 * movlps qword ptr [esi+8*ecx]
207351 kCycles for 1 * movaps xmm0, oword ptr [esi]
158671 kCycles for 1 * movdqa + movntdq
151322 kCycles for 1 * movdqu + movntdq
152135 kCycles for 1 * movdqu + movntdq + mfence
178037 kCycles for 1 * rep movsb
178500 kCycles for 1 * rep movsd
218100 kCycles for 1 * movlps qword ptr [esi+8*ecx]
202977 kCycles for 1 * movaps xmm0, oword ptr [esi]
152367 kCycles for 1 * movdqa + movntdq
151737 kCycles for 1 * movdqu + movntdq
151928 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
-
Thanks, Steve N., for the tiled title idea :thumbsup:
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
kCycles for 1 * rep movsd 238384 83453 237663 201928 241232 179930 215571 86590
kCycles for 1 * movlps qword ptr [esi+8*ecx] 269729 151305 287198 250473 304695 220533 227473 131239
kCycles for 1 * movaps xmm0, oword ptr [esi] 236182 140797 279630 245765 302014 202840 230151 119079
kCycles for 1 * movdqa + movntdq 156614 85881 181083 182061 208018 191639 146063 99788
kCycles for 1 * movdqu + movntdq 156042 87420 181125 187806 207876 184883 146242 89096
kCycles for 1 * movdqu + movntdq + mfence 156594 85314 180909 197382 207752 170484 146090 92749
Excel .prn file looks better
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
kCycles for 1 * rep movsd                    83453 2,86 1,00 2,85 2,42 2,89 2,16 2,58 1,04
kCycles for 1 * movlps qword ptr [esi+8*ecx] 131239 2,06 1,15 2,19 1,91 2,32 1,68 1,73 1,00
kCycles for 1 * movaps xmm0, oword ptr [esi] 119079 1,98 1,18 2,35 2,06 2,54 1,70 1,93 1,00
kCycles for 1 * movdqa + movntdq             85881 1,82 1,00 2,11 2,12 2,42 2,23 1,70 1,16
kCycles for 1 * movdqu + movntdq             87420 1,78 1,00 2,07 2,15 2,38 2,11 1,67 1,02
kCycles for 1 * movdqu + movntdq + mfence    85314 1,84 1,00 2,12 2,31 2,44 2,00 1,71 1,09
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    sTab = Chr(9)
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            'Debug.Print .Cells(Row, Col)
            Row = Row + 1
            Col = Col + 1
        Wend
        .Cells(Row, Col).EntireRow.Insert
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    nRow = 0 ' start
    Open sFileName For Input As #nFileNro
    Line Input #nFileNro, sTextRow
    ActiveSheet.Cells(Row, Col) = sTextRow ' title
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        Row = Row + 1 ' table
        nRow = nRow + 1 ' file
        If nRow >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9)) ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then ActiveSheet.Cells(Row, Col) = "??" Else nClk = Val(sTextRow)
                'Debug.Print nClk
                If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If nRow = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
EDIT: RepMovsd05GB_Results_1.csv
:biggrin:
Hi Timo,
I cheated: I downloaded LibreOffice, set the defaults for Excel and Word, and BINGO, I can now open your "csv" files. :thumbsup:
Quote from: daydreamer on December 15, 2021, 05:23:23 AM
Jochen if its cachesize dependent,wouldnt it be good to use some API that shows cachesize?
Good idea :thumbsup: Which one? Do you have some code?
If we just collect L3 info
How to Check Processor Cache Memory in Windows 10 (https://www.techbout.com/check-processor-cache-memory-windows-10-48655/)
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
4 64 3 4 3 3 15 24
Something I have just tested is that if you use movdqu for both read and write to make the copy algo fully unaligned, it is no faster than rep movsb.
Quote from: hutch-- on December 15, 2021, 10:48:36 PM
Something I have just tested is that if you use movdqu for both read and write to make the copy algo fully unaligned, it is no faster than rep movsb.
movdqu was an attempt to speed up unaligned moves, but AFAIK it now behaves the same as movups, and the latter is no longer slower than movaps on aligned data: three effectively equivalent instructions.
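The movdqu/movntdq pair the thread is benchmarking can be sketched with SSE2 intrinsics in C. This is only an illustration, not the poster's code; the name xmmcopyu is borrowed from the thread. Note that movntdq (_mm_stream_si128) still requires a 16-byte aligned destination, so only the source may be unaligned:

```c
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <assert.h>

/* Unaligned-load / non-temporal-store copy, like
       movdqu  xmm0, [rcx+r10]
       movntdq [rdx+r10], xmm0
   dst must be 16-byte aligned (movntdq faults otherwise);
   src may have any alignment; len must be a multiple of 16. */
static void xmmcopyu(void *dst, const void *src, size_t len)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t i = 0; i < len; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(s + i)); /* movdqu  */
        _mm_stream_si128((__m128i *)(d + i), x);               /* movntdq */
    }
    _mm_sfence(); /* order the non-temporal stores before later reads */
}
```

Swapping the store for _mm_storeu_si128 (movdqu on both sides) drops the non-temporal hint, so every store goes through the cache; that may help explain hutch's observation that the fully unaligned version is no faster than rep movsb on large buffers.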
Quote from: hutch-- on December 15, 2021, 04:16:16 PM
:biggrin:
Hi Timo,
I cheated: I downloaded LibreOffice, set the defaults for Excel and Word, and BINGO, I can now open your "csv" files. :thumbsup:
A fix for Libre Office vba
Option VBASupport 1
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    sTab = Chr(9)
    With ActiveSheet
        While .Cells(Row, Col).Text <> ""
            'Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        '.AllowMultiSelect = False
        '.Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0 ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow ' title
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9)) ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then ActiveSheet.Cells(Row, Col) = "??" Else nClk = Val(sTextRow)
                'Debug.Print nClk
                If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
Question: what format do users want for the tables?
Quote from: TimoVJL on December 15, 2021, 08:32:35 PM
If we just collect L3 info
include \masm32\MasmBasic\MasmBasic.inc
CACHE_DESCRIPTOR STRUCT
    Level BYTE ?
    Associativity BYTE ?
    LineSize WORD ?
    _Size DWORD ?
    _Type dd ? ; PROCESSOR_CACHE_TYPE enum
CACHE_DESCRIPTOR ENDS
SYSTEM_LOGICAL_PROCESSOR_INFORMATION STRUCT
    NodeNumber DWORD ?
    Cache CACHE_DESCRIPTOR <>
    Reserved ULONGLONG 2 dup(?)
SYSTEM_LOGICAL_PROCESSOR_INFORMATION ENDS
Init
Print cfm$("\n\nWmic:\n"), Launch$("wmic cpu get L2CacheSize, L3CacheSize") ; the simple solution
Dll "Kernel32"
Declare void GetLogicalProcessorInformation, 2
Let edi=New$(400)
ClearLastError
push 400
GetLogicalProcessorInformation(edi, esp)
pop edx
pinfo equ [edi.SYSTEM_LOGICAL_PROCESSOR_INFORMATION]
deb 1, "GetLogicalProcessorInformation output:", eax, pinfo.NodeNumber, b:pinfo.Cache.Level, b:pinfo.Cache.LineSize, b:pinfo.Cache._Size, b:pinfo.Cache._Type, $Err$()
EndOfCode
Output:
Wmic:
L2CacheSize L3CacheSize
256 3072
GetLogicalProcessorInformation output:
eax 1
pinfo.NodeNumber 3
b:pinfo.Cache.Level 00000000
b:pinfo.Cache.LineSize 0000000000000000
b:pinfo.Cache._Size 00000000000000000000000000000001
b:pinfo.Cache._Type 00000000000000000000000000000000
$Err$() Operazione completata.__
WMIC works, but GetLogicalProcessorInformation returns rubbish :sad:
Quote from: jj2007 on December 16, 2021, 01:17:53 AM
WMIC works, but GetLogicalProcessorInformation returns rubbish :sad:
---------------------------
GetLogicalProcessorInformation output:
---------------------------
eax 0
pinfo.NodeNumber 0
b:pinfo.Cache.Level 00000000
b:pinfo.Cache.LineSize 0000000000000000
b:pinfo.Cache._Size 00000000000000000000000000000000
b:pinfo.Cache._Type 00000000000000000000000000000000
$Err$() The data area passed to a system call is too small.__
---------------------------
OK Cancel
---------------------------
I've seen this error, but 400 bytes works on Win7-64. Try version 2, with a 1000-byte buffer
Quote from: jj2007 on December 16, 2021, 03:57:49 AM
I've seen this error, but 400 bytes works on Win7-64. Try version 2, with a 1000-byte buffer
The program closes right after starting; the message remains the same under the debugger.
It works on my new machine, Win10:
Wmic:
L2CacheSize L3CacheSize
1024 4096
GetLogicalProcessorInformation output:
eax 1
pinfo.NodeNumber 3
b:pinfo.Cache.Level 00000000
b:pinfo.Cache.LineSize 0000000000000000
b:pinfo.Cache._Size 00000000000000000000000000000001
b:pinfo.Cache._Type 00000000000000000000000000000000
$Err$() Operazione completata.__
Hi JJ!
Could it be a problem with WOW64?
What happens with the program in 64 bits?
LATER:
Apparently the structure is different:
typedef struct _SYSTEM_LOGICAL_PROCESSOR_INFORMATION {
    ULONG_PTR ProcessorMask;
    LOGICAL_PROCESSOR_RELATIONSHIP Relationship;
    union {
        struct {
            BYTE Flags;
        } ProcessorCore;
        struct {
            DWORD NodeNumber;
        } NumaNode;
        CACHE_DESCRIPTOR Cache;
        ULONGLONG Reserved[2];
    } DUMMYUNIONNAME;
} SYSTEM_LOGICAL_PROCESSOR_INFORMATION, *PSYSTEM_LOGICAL_PROCESSOR_INFORMATION;
Quote from: HSE on December 16, 2021, 11:26:56 AM
Can be a problem with wow64?
It does not crash on my Win7-64 and Win10 machines.
Quote: Apparently the structure is different:
I'm afraid my C skills are not sufficient to translate it correctly to MASM syntax :sad:
Quote from: jj2007 on December 16, 2021, 12:40:02 PM
I'm afraid my C skills are not sufficient to translate it correctly to MASM syntax :sad:
:biggrin: :biggrin: I hope somebody can.
Modified code from here:
https://docs.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformation
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL3CacheSize: 3145728
GetLogicalProcessorInformation results:
Number of NUMA nodes: 1
Number of physical processor packages: 1
Number of processor cores: 2
Number of logical processors: 4
Number of processor L1/L2/L3 caches: 4/2/1
EDIT:
32-bit
SYSTEM_LOGICAL_PROCESSOR_INFORMATION 24 18h bytes
Relationship +4h 4h
Cache.Level +8h 1h
Cache.Size +Ch 4h
64-bit
SYSTEM_LOGICAL_PROCESSOR_INFORMATION 32 20h bytes
Relationship +8h 4h
Cache.Level +10h 1h
Cache.Size +14h 4h
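Those 24h/20h-style offsets can be double-checked in plain C with offsetof. This is a sketch with stand-in typedefs so it compiles without windows.h (the anonymous union mirrors DUMMYUNIONNAME; enum members are assumed int-sized, as on MSVC and GCC):

```c
#include <stddef.h>   /* offsetof */
#include <stdint.h>
#include <assert.h>

/* Stand-ins for the Windows types, so no windows.h is needed */
typedef uint8_t   BYTE;
typedef uint16_t  WORD;
typedef uint32_t  DWORD;
typedef uint64_t  ULONGLONG;
typedef uintptr_t ULONG_PTR;                      /* 4 or 8 bytes   */
typedef int       LOGICAL_PROCESSOR_RELATIONSHIP; /* enum -> int    */
typedef int       PROCESSOR_CACHE_TYPE;           /* enum -> int    */

typedef struct {
    BYTE  Level;
    BYTE  Associativity;
    WORD  LineSize;
    DWORD Size;
    PROCESSOR_CACHE_TYPE Type;
} CACHE_DESCRIPTOR;

typedef struct {
    ULONG_PTR ProcessorMask;
    LOGICAL_PROCESSOR_RELATIONSHIP Relationship;
    union {                        /* DUMMYUNIONNAME, anonymous in C11 */
        struct { BYTE Flags; } ProcessorCore;
        struct { DWORD NodeNumber; } NumaNode;
        CACHE_DESCRIPTOR Cache;
        ULONGLONG Reserved[2];     /* forces 8-byte union alignment */
    };
} SYSTEM_LOGICAL_PROCESSOR_INFORMATION;
```

On a 64-bit build, sizeof gives 20h and offsetof yields Relationship +8h, Cache.Level +10h, Cache.Size +14h, matching the table above; a 32-bit build gives the 18h/+4h/+8h/+Ch layout.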
Thanks Timo :thumbsup:
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL3CacheSize: 6291456
GetLogicalProcessorInformation results:
Number of NUMA nodes: 1
Number of physical processor packages: 1
Number of processor cores: 4
Number of logical processors: 8
Number of processor L1/L2/L3 caches: 8/4/1
Biterider's structure translation is:
SYSTEM_LOGICAL_PROCESSOR_INFORMATION struct
    ProcessorMask ULONG_PTR ?
    Relationship LOGICAL_PROCESSOR_RELATIONSHIP ?
    union
        struct ProcessorCore
            Flags BYTE ?
        ends
        struct NumaNode
            NodeNumber DWORD ?
        ends
        Cache CACHE_DESCRIPTOR <>
        Reserved ULONGLONG 2 dup (?)
    ends
SYSTEM_LOGICAL_PROCESSOR_INFORMATION ends
@Greenhorn, please give AMD Ryzen 7 3700X results.
These are interesting:
AMD Ryzen™ 5 5600G L3 16 MB
AMD Ryzen™ 7 5700G L3 16 MB
AMD Ryzen 7 3700X
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
GetLogicalProcessorInformation results:
Number of NUMA nodes: 1
Number of physical processor packages: 1
Number of processor cores: 8
Number of logical processors: 16
Number of processor L1/L2/L3 caches: 32/16/16
I was after this test
http://masm32.com/board/index.php?topic=9691.msg106349#msg106349
EDIT: AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
AMD Ryzen 7 3700X 8-Core Processor (SSE4)
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
kCycles for 1 * rep movsd                    83453 2,86 2,32 1,00 2,85 2,42 2,89 2,16 2,58 1,04
kCycles for 1 * movlps qword ptr [esi+8*ecx] 131239 2,06 1,31 1,15 2,19 1,91 2,32 1,68 1,73 1,00
kCycles for 1 * movaps xmm0, oword ptr [esi] 119079 1,98 1,40 1,18 2,35 2,06 2,54 1,70 1,93 1,00
kCycles for 1 * movdqa + movntdq             85881 1,82 1,17 1,00 2,11 2,12 2,42 2,23 1,70 1,16
kCycles for 1 * movdqu + movntdq             87420 1,78 1,16 1,00 2,07 2,15 2,38 2,11 1,67 1,02
kCycles for 1 * movdqu + movntdq + mfence    85314 1,84 1,18 1,00 2,12 2,31 2,44 2,00 1,71 1,09
Ah, OK ...
AMD Ryzen 7 3700X 8-Core Processor (SSE4)
++++++++-++9 of 20 tests valid,
252562 kCycles for 1 * rep movsb
193705 kCycles for 1 * rep movsd
172087 kCycles for 1 * movlps qword ptr [esi+8*ecx]
167271 kCycles for 1 * movaps xmm0, oword ptr [esi]
100892 kCycles for 1 * movdqa + movntdq
101051 kCycles for 1 * movdqu + movntdq
100887 kCycles for 1 * movdqu + movntdq + mfence
191904 kCycles for 1 * rep movsb
192633 kCycles for 1 * rep movsd
171663 kCycles for 1 * movlps qword ptr [esi+8*ecx]
163020 kCycles for 1 * movaps xmm0, oword ptr [esi]
100933 kCycles for 1 * movdqa + movntdq
100811 kCycles for 1 * movdqu + movntdq
101287 kCycles for 1 * movdqu + movntdq + mfence
193081 kCycles for 1 * rep movsb
192589 kCycles for 1 * rep movsd
171437 kCycles for 1 * movlps qword ptr [esi+8*ecx]
163055 kCycles for 1 * movaps xmm0, oword ptr [esi]
100982 kCycles for 1 * movdqa + movntdq
100927 kCycles for 1 * movdqu + movntdq
101013 kCycles for 1 * movdqu + movntdq + mfence
191832 kCycles for 1 * rep movsb
192769 kCycles for 1 * rep movsd
171349 kCycles for 1 * movlps qword ptr [esi+8*ecx]
163022 kCycles for 1 * movaps xmm0, oword ptr [esi]
100896 kCycles for 1 * movdqa + movntdq
100825 kCycles for 1 * movdqu + movntdq
101005 kCycles for 1 * movdqu + movntdq + mfence
21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence
--- ok ---