The MASM Forum

General => The Laboratory => Topic started by: hutch-- on December 06, 2021, 08:34:55 PM

Title: Unaligned memory copy test piece.
Post by: hutch-- on December 06, 2021, 08:34:55 PM
I have a task where the memory copy cannot be controlled to SSE alignment.

The example has two memory copy techniques, the old rep movsb method as reference and the following for unaligned SSE.

    movdqu xmm0, [rcx+r10]
    movntdq [rdx+r10], xmm0

I have stabilised the timings by running a dummy run before the timed run and on my old Haswell the unaligned SSE version runs in about 4.7 seconds for 50 gig copy. As reference the rep movsb version runs in about 6.7 seconds for the same 50 gig.

I have not run the two tests together, so that one does not affect the other. If you have time, run the SSE version, then switch to the commented-out rep movsb version.
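
For context, this is roughly how such a timed loop might be structured; a minimal sketch where the register roles follow the snippet above, but the loop label, count register and trailing sfence are illustrative assumptions, not necessarily the exact test piece:

    ; rcx = source, rdx = destination, r8 = byte count (assumed a multiple of 16)
    xor r10, r10                    ; running offset
  copy_loop:
    movdqu xmm0, [rcx+r10]          ; unaligned load through the cache
    movntdq [rdx+r10], xmm0         ; non-temporal store, bypasses the cache
    add r10, 16
    cmp r10, r8
    jb copy_loop
    sfence                          ; make the non-temporal stores globally visible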
Title: Re: Unaligned memory copy test piece.
Post by: mineiro on December 06, 2021, 10:43:44 PM
This is the result in my machine:
I don't have all those include files and tools; if you can release just the executable file of the rep movsb version, I can run it here.

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
wine umc.exe
--------------------------------
50 gig copy in 3338 milliseconds
--------------------------------
Title: Re: Unaligned memory copy test piece.
Post by: HSE on December 06, 2021, 11:02:03 PM
i3-10100 not so fast :biggrin:

xmmcopyu:
--------------------------------
50 gig copy in 7531 milliseconds
--------------------------------

ByteCopy:
--------------------------------
50 gig copy in 10563 milliseconds
--------------------------------
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 06, 2021, 11:51:12 PM
Thanks guys, all of these results are very useful to me.
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 07, 2021, 02:36:50 AM
I added the rep movsb version as a zip file.

--------------------------------
50 gig copy in 6625 milliseconds    rep movsb
--------------------------------
--------------------------------
50 gig copy in 4578 milliseconds    movdqu xmm0, [rcx+r10] : movntdq [rdx+r10], xmm0
--------------------------------

What I am chasing is the ratio difference, as the SSE version will be used to copy memory that has originated from a MMF (memory mapped file) written to by a 32 bit app.
Title: Re: Unaligned memory copy test piece.
Post by: avcaballero on December 07, 2021, 03:11:45 AM
umcmovsb:
--------------------------------
50 gig copy in 6015 milliseconds
--------------------------------
Press any key to continue...


umc:
--------------------------------
50 gig copy in 4719 milliseconds
--------------------------------
Press any key to continue...
Title: Re: Unaligned memory copy test piece.
Post by: HSE on December 07, 2021, 03:16:47 AM
umcmovsb:
--------------------------------
50 gig copy in 9547 milliseconds
--------------------------------
Press any key to continue...
Title: Re: Unaligned memory copy test piece.
Post by: Greenhorn on December 07, 2021, 03:26:43 AM
AMD Ryzen 3700X

umcmovsb:
--------------------------------
50 gig copy in 5522 milliseconds
--------------------------------

umc:
--------------------------------
50 gig copy in 2902 milliseconds
--------------------------------
Title: Re: Unaligned memory copy test piece.
Post by: Siekmanski on December 07, 2021, 03:32:51 AM
AMD Ryzen 9 5950X 16-Core Processor

umc:
--------------------------------
50 gig copy in 2781 milliseconds
--------------------------------

umcmovsb:
--------------------------------
50 gig copy in 2563 milliseconds
--------------------------------
Title: Re: Unaligned memory copy test piece.
Post by: Greenhorn on December 07, 2021, 03:50:06 AM
Quote from: Siekmanski on December 07, 2021, 03:32:51 AM
AMD Ryzen 9 5950X 16-Core Processor

umc:
--------------------------------
50 gig copy in 2781 milliseconds
--------------------------------

umcmovsb:
--------------------------------
50 gig copy in 2563 milliseconds
--------------------------------

Well, the result for movsb is surprising.  :thumbsup:
Title: Re: Unaligned memory copy test piece.
Post by: mineiro on December 07, 2021, 03:53:27 AM
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
wine umc.exe
--------------------------------
50 gig copy in 3384 milliseconds
--------------------------------
wine umcmovsb.exe
--------------------------------
50 gig copy in 3450 milliseconds
--------------------------------
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 07, 2021, 03:58:20 AM
--------------------------------
50 gig copy in 9064 milliseconds
--------------------------------

--------------------------------
50 gig copy in 10343 milliseconds
--------------------------------
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 07, 2021, 09:41:29 AM
Thanks all, it seems that in every case, over a wide range of different hardware, the SSE2 version is faster, and that is useful for the task I have in mind.  :biggrin:
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 07, 2021, 11:09:02 AM
Are you sure it's unaligned? My debugger says halloc() delivers a 16-byte aligned buffer. I also wonder whether lodsd would be faster than lodsb.
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 07, 2021, 11:27:06 AM
What I have to do is load data from a 32 bit app, via a memory mapped file, into a 64 bit app which uses HeapAlloc() to store the data. The input source from the 32 bit side can be rough byte-aligned string data or anything else that will fit into the memory mapped file size. If the alignment was going to be fully controlled at both ends, I would use the faster aligned SSE2 instructions.

"rep movsb" is usually faster than "rep movsd", which seems to be Intel special-case circuitry, and I have not seen examples of "rep lodsb" being faster, so I have used "rep movsb" as the reference to compare against the SSE2 version. Across the multiple CPUs that the folks here have tested on, the SSE2 version is always faster.

I already have prototypes of the task up and running using a 1gb memory mapped file as the data transfer window. The idea is that a 32 bit app can store multiple 1gb blocks in the 64 bit "container" and work on any of them, one at a time. I had used "rep movsb" for the unaligned data transfer and it worked OK, but as you start using larger blocks of memory, speed starts to matter.
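
A minimal sketch of the 64 bit side of that arrangement, in case it helps anyone follow along; the mapping name, the variables and the omitted error handling are illustrative assumptions:

    ; open the named mapping created by the 32 bit app, then map the 1gb window
    invoke OpenFileMapping, FILE_MAP_READ, 0, ADDR mmfname
    mov hMap, rax
    invoke MapViewOfFile, hMap, FILE_MAP_READ, 0, 0, 0  ; 0 = map the entire view
    mov pView, rax
    ; pView is now the (possibly unaligned) source for the SSE2 copy into
    ; the HeapAlloc'd block inside the 64 bit "container"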
Title: Re: Unaligned memory copy test piece.
Post by: daydreamer on December 07, 2021, 06:05:48 PM
It would be interesting to test the smaller SSE movups/movaps, because if they are the same speed as the SSE2 moves, more instructions can fit in the cache.
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 07, 2021, 06:26:01 PM
Quote from: hutch-- on December 06, 2021, 08:34:55 PM
I have a task where the memory copy cannot be controlled to SSE alignment.

The example has two memory copy techniques, the old rep movsb method as reference and the following for unaligned SSE.

    movdqu xmm0, [rcx+r10]
    movntdq [rdx+r10], xmm0

I have stabilised the timings by running a dummy run before the timed run and on my old Haswell the unaligned SSE version runs in about 4.7 seconds for 50 gig copy. As reference the rep movsb version runs in about 6.7 seconds for the same 50 gig.

I have not run the two tests together so that one does not effect the other, if you have time, run the SSE version then change the commented out rep movsb version.

Hi Hutch,

i7-11800h
--------------------------------
50 gig copy in 3500 milliseconds
--------------------------------
Press any key to continue...

rep movsd
--------------------------------
50 gig copy in 3750 milliseconds
--------------------------------
Press any key to continue...

Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 07, 2021, 08:22:56 PM
Quote from: hutch-- on December 07, 2021, 11:27:06 AM"rep movsb" is usually faster than "rep movsd" which seems to be Intel special case circuitry

Not on my machine, at least with 32-bit code...

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

64674   cycles for 100 * rep movsb
64365   cycles for 100 * rep movsd
206043  cycles for 100 * movlps qword ptr [esi+8*ecx]
122243  cycles for 100 * movaps xmm0, oword ptr [esi]
195049  cycles for 100 * movntdq xmm0, oword ptr [esi]

65058   cycles for 100 * rep movsb
63966   cycles for 100 * rep movsd
206036  cycles for 100 * movlps qword ptr [esi+8*ecx]
122348  cycles for 100 * movaps xmm0, oword ptr [esi]
193151  cycles for 100 * movntdq xmm0, oword ptr [esi]

65376   cycles for 100 * rep movsb
64353   cycles for 100 * rep movsd
206087  cycles for 100 * movlps qword ptr [esi+8*ecx]
122278  cycles for 100 * movaps xmm0, oword ptr [esi]
193349  cycles for 100 * movntdq xmm0, oword ptr [esi]

65125   cycles for 100 * rep movsb
63895   cycles for 100 * rep movsd
205977  cycles for 100 * movlps qword ptr [esi+8*ecx]
121872  cycles for 100 * movaps xmm0, oword ptr [esi]
193156  cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 07, 2021, 08:40:50 PM
I get much the same on this old Haswell. I have usually used a combination of rep movsd followed by rep movsb, but using rep movsb alone is close enough in speed to the rep movsd version and does not suffer from the switch from DWORD to BYTE.
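
For reference, the combination mentioned above looks something like this; a sketch, with the byte count kept in ebx as an assumption:

    ; esi = source, edi = destination, ebx = byte count
    mov ecx, ebx
    shr ecx, 2                      ; number of whole DWORDs
    rep movsd
    mov ecx, ebx
    and ecx, 3                      ; 0 to 3 trailing bytes
    rep movsb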

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

64649   cycles for 100 * rep movsb
64474   cycles for 100 * rep movsd
158890  cycles for 100 * movlps qword ptr [esi+8*ecx]
80167   cycles for 100 * movaps xmm0, oword ptr [esi]

63923   cycles for 100 * rep movsb
65043   cycles for 100 * rep movsd
158972  cycles for 100 * movlps qword ptr [esi+8*ecx]
82787   cycles for 100 * movaps xmm0, oword ptr [esi]

65293   cycles for 100 * rep movsb
66105   cycles for 100 * rep movsd
158830  cycles for 100 * movlps qword ptr [esi+8*ecx]
81359   cycles for 100 * movaps xmm0, oword ptr [esi]

64850   cycles for 100 * rep movsb
66085   cycles for 100 * rep movsd
159823  cycles for 100 * movlps qword ptr [esi+8*ecx]
81326   cycles for 100 * movaps xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
28      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]


--- ok ---
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 07, 2021, 08:42:21 PM
Thanks LiaoMi, interesting result in that they are much closer than on earlier hardware. Looks like a nice fast box.
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 07, 2021, 09:05:54 PM
New machine. I added movntdq

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

57654   cycles for 100 * rep movsb
65892   cycles for 100 * rep movsd
112954  cycles for 100 * movlps qword ptr [esi+8*ecx]
58152   cycles for 100 * movaps xmm0, oword ptr [esi]
129800  cycles for 100 * movntdq xmm0, oword ptr [esi]

59723   cycles for 100 * rep movsb
59356   cycles for 100 * rep movsd
113875  cycles for 100 * movlps qword ptr [esi+8*ecx]
57518   cycles for 100 * movaps xmm0, oword ptr [esi]
130509  cycles for 100 * movntdq xmm0, oword ptr [esi]

59061   cycles for 100 * rep movsb
63768   cycles for 100 * rep movsd
112908  cycles for 100 * movlps qword ptr [esi+8*ecx]
57839   cycles for 100 * movaps xmm0, oword ptr [esi]
132310  cycles for 100 * movntdq xmm0, oword ptr [esi]

59031   cycles for 100 * rep movsb
58619   cycles for 100 * rep movsd
129052  cycles for 100 * movlps qword ptr [esi+8*ecx]
57675   cycles for 100 * movaps xmm0, oword ptr [esi]
131438  cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 07, 2021, 09:17:56 PM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

62494   cycles for 100 * rep movsb
63721   cycles for 100 * rep movsd
120658  cycles for 100 * movlps qword ptr [esi+8*ecx]
58538   cycles for 100 * movaps xmm0, oword ptr [esi]
129730  cycles for 100 * movntdq xmm0, oword ptr [esi]

63098   cycles for 100 * rep movsb
63117   cycles for 100 * rep movsd
119950  cycles for 100 * movlps qword ptr [esi+8*ecx]
58756   cycles for 100 * movaps xmm0, oword ptr [esi]
129155  cycles for 100 * movntdq xmm0, oword ptr [esi]

62565   cycles for 100 * rep movsb
62759   cycles for 100 * rep movsd
119914  cycles for 100 * movlps qword ptr [esi+8*ecx]
57471   cycles for 100 * movaps xmm0, oword ptr [esi]
126619  cycles for 100 * movntdq xmm0, oword ptr [esi]

63080   cycles for 100 * rep movsb
62948   cycles for 100 * rep movsd
119847  cycles for 100 * movlps qword ptr [esi+8*ecx]
57513   cycles for 100 * movaps xmm0, oword ptr [esi]
123895  cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 07, 2021, 09:30:53 PM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
umc.exe
--------------------------------
50 gig copy in 4453 milliseconds
--------------------------------
umcmovsb.exe
--------------------------------
50 gig copy in 6469 milliseconds
--------------------------------
Title: Re: Unaligned memory copy test piece.
Post by: daydreamer on December 09, 2021, 04:37:29 AM
--------------------------------
50 gig copy in 7547 milliseconds
--------------------------------
Press any key to continue...

I have 20GB memory installed, turbo 3.1 GHz
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

31040   cycles for 100 * rep movsb
31143   cycles for 100 * rep movsd
117139  cycles for 100 * movlps qword ptr [esi+8*ecx]
72688   cycles for 100 * movaps xmm0, oword ptr [esi]
109706  cycles for 100 * movntdq xmm0, oword ptr [esi]

31381   cycles for 100 * rep movsb
31663   cycles for 100 * rep movsd
116001  cycles for 100 * movlps qword ptr [esi+8*ecx]
71727   cycles for 100 * movaps xmm0, oword ptr [esi]
110933  cycles for 100 * movntdq xmm0, oword ptr [esi]

31644   cycles for 100 * rep movsb
37560   cycles for 100 * rep movsd
114454  cycles for 100 * movlps qword ptr [esi+8*ecx]
72541   cycles for 100 * movaps xmm0, oword ptr [esi]
124899  cycles for 100 * movntdq xmm0, oword ptr [esi]

31097   cycles for 100 * rep movsb
31010   cycles for 100 * rep movsd
115056  cycles for 100 * movlps qword ptr [esi+8*ecx]
72463   cycles for 100 * movaps xmm0, oword ptr [esi]
109951  cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]


Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 09, 2021, 05:10:05 AM
Quote from: jj2007 on December 07, 2021, 09:05:54 PM
New machine. I added movntdq

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

13916   cycles for 100 * rep movsb
16152   cycles for 100 * rep movsd
106432  cycles for 100 * movlps qword ptr [esi+8*ecx]
42050   cycles for 100 * movaps xmm0, oword ptr [esi]
59500   cycles for 100 * movntdq xmm0, oword ptr [esi]

16298   cycles for 100 * rep movsb
15607   cycles for 100 * rep movsd
109919  cycles for 100 * movlps qword ptr [esi+8*ecx]
41897   cycles for 100 * movaps xmm0, oword ptr [esi]
58949   cycles for 100 * movntdq xmm0, oword ptr [esi]

15691   cycles for 100 * rep movsb
16515   cycles for 100 * rep movsd
108793  cycles for 100 * movlps qword ptr [esi+8*ecx]
41640   cycles for 100 * movaps xmm0, oword ptr [esi]
101036  cycles for 100 * movntdq xmm0, oword ptr [esi]

17390   cycles for 100 * rep movsb
16209   cycles for 100 * rep movsd
117106  cycles for 100 * movlps qword ptr [esi+8*ecx]
42124   cycles for 100 * movaps xmm0, oword ptr [esi]
60058   cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]


--- ok ---
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 09, 2021, 06:53:58 AM
Quote from: LiaoMi on December 09, 2021, 05:10:05 AM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

13916   cycles for 100 * rep movsb
...
59500   cycles for 100 * movntdq xmm0, oword ptr [esi]

That looks odd, and I wondered whether my counts were correct. But I can't find an error... how can movntdq be so slow?
Title: Re: Unaligned memory copy test piece.
Post by: FORTRANS on December 10, 2021, 01:19:23 AM
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

35139 cycles for 100 * rep movsb
36189 cycles for 100 * rep movsd
161839 cycles for 100 * movlps qword ptr [esi+8*ecx]
82736 cycles for 100 * movaps xmm0, oword ptr [esi]
173215 cycles for 100 * movntdq xmm0, oword ptr [esi]

35248 cycles for 100 * rep movsb
36580 cycles for 100 * rep movsd
160325 cycles for 100 * movlps qword ptr [esi+8*ecx]
82958 cycles for 100 * movaps xmm0, oword ptr [esi]
174700 cycles for 100 * movntdq xmm0, oword ptr [esi]

35392 cycles for 100 * rep movsb
36231 cycles for 100 * rep movsd
160691 cycles for 100 * movlps qword ptr [esi+8*ecx]
83033 cycles for 100 * movaps xmm0, oword ptr [esi]
174148 cycles for 100 * movntdq xmm0, oword ptr [esi]

35310 cycles for 100 * rep movsb
36172 cycles for 100 * rep movsd
162454 cycles for 100 * movlps qword ptr [esi+8*ecx]
83124 cycles for 100 * movaps xmm0, oword ptr [esi]
173325 cycles for 100 * movntdq xmm0, oword ptr [esi]

19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]


--- ok ---


Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

22808 cycles for 100 * rep movsb
22940 cycles for 100 * rep movsd
82232 cycles for 100 * movlps qword ptr [esi+8*ecx]
55785 cycles for 100 * movaps xmm0, oword ptr [esi]
148033 cycles for 100 * movntdq xmm0, oword ptr [esi]

22471 cycles for 100 * rep movsb
22846 cycles for 100 * rep movsd
82406 cycles for 100 * movlps qword ptr [esi+8*ecx]
57255 cycles for 100 * movaps xmm0, oword ptr [esi]
151683 cycles for 100 * movntdq xmm0, oword ptr [esi]

22507 cycles for 100 * rep movsb
23157 cycles for 100 * rep movsd
82990 cycles for 100 * movlps qword ptr [esi+8*ecx]
55098 cycles for 100 * movaps xmm0, oword ptr [esi]
144060 cycles for 100 * movntdq xmm0, oword ptr [esi]

22462 cycles for 100 * rep movsb
22567 cycles for 100 * rep movsd
82398 cycles for 100 * movlps qword ptr [esi+8*ecx]
54862 cycles for 100 * movaps xmm0, oword ptr [esi]
142853 cycles for 100 * movntdq xmm0, oword ptr [esi]

19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]


--- ok ---
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 10, 2021, 05:10:32 AM
Columns, left to right: AMD Athlon Gold 3150U, Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz, Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz, Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz, 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (all SSE4)

cycles for 100 * rep movsb                    57654   35139   22808   31040   13916
cycles for 100 * rep movsd                    65892   36189   22940   31143   16152
cycles for 100 * movlps qword ptr [esi+8*ec  112954  161839   82232  117139  106432
cycles for 100 * movaps xmm0, oword ptr [es   58152   82736   55785   72688   42050
cycles for 100 * movntdq xmm0, oword ptr [e  129800  173215  148033  109706   59500

cycles for 100 * rep movsb                    59723   35248   22471   31381   16298
cycles for 100 * rep movsd                    59356   36580   22846   31663   15607
cycles for 100 * movlps qword ptr [esi+8*ec  113875  160325   82406  116001  109919
cycles for 100 * movaps xmm0, oword ptr [es   57518   82958   57255   71727   41897
cycles for 100 * movntdq xmm0, oword ptr [e  130509  174700  151683  110933   58949

cycles for 100 * rep movsb                    59061   35392   22507   31644   15691
cycles for 100 * rep movsd                    63768   36231   23157   37560   16515
cycles for 100 * movlps qword ptr [esi+8*ec  112908  160691   82990  114454  108793
cycles for 100 * movaps xmm0, oword ptr [es   57839   83033   55098   72541   41640
cycles for 100 * movntdq xmm0, oword ptr [e  132310  174148  144060  124899  101036

cycles for 100 * rep movsb                    59031   35310   22462   31097   17390
cycles for 100 * rep movsd                    58619   36172   22567   31010   16209
cycles for 100 * movlps qword ptr [esi+8*ec  129052  162454   82398  115056  117106
cycles for 100 * movaps xmm0, oword ptr [es   57675   83124   54862   72463   42124
cycles for 100 * movntdq xmm0, oword ptr [e  131438  173325  142853  109951   60058

bytes for rep movsb                              19      19      19      19      19
bytes for rep movsd                              19      19      19      19      19
bytes for movlps qword ptr [esi+8*ecx]           29      29      29      29      29
bytes for movaps xmm0, oword ptr [esi]           34      34      34      34      34
bytes for movntdq xmm0, oword ptr [esi]          35      35      35      35      35
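' Imports one tab-separated timing log (picked via a file dialog) into the
' next free column of the active sheet: title to row 1, cycle counts below.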
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    sTab = Chr(9)
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0    ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow   ' title
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9))    ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then ActiveSheet.Cells(Row, Col) = "??" Else nClk = Val(sTextRow)
                'Debug.Print nClk
                If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 10, 2021, 09:12:15 PM
Quote from: jj2007 on December 09, 2021, 06:53:58 AM
Quote from: LiaoMi on December 09, 2021, 05:10:05 AM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

13916   cycles for 100 * rep movsb
...
59500   cycles for 100 * movntdq xmm0, oword ptr [esi]

That looks odd, and I wondered whether my counts were correct. But I can't find an error... how can movntdq be so slow?

Hi jj2007,

I would also like to know what is the reason for such slowdowns  :undecided:

Random slow downs with AVX2 code - https://community.intel.com/t5/Intel-ISA-Extensions/Random-slow-downs-with-AVX2-code/m-p/1084764
Depending on CPU, 10x is approximately the difference between L1 cache and L3 cache latency. Is your thread pinned?
Jim Dempsey

How L1 and L2 CPU Caches Work, and Why They're an Essential Part of Modern Chips - https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips

Reducing Memory Access Times with Caches - https://developers.redhat.com/blog/2016/03/01/reducing-memory-access-times-with-caches#

What is a "cache-friendly" code? - https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code

CS 201 Writing Cache-Friendly Code - Portland State University - http://web.cecs.pdx.edu/~jrb/cs201/lectures/cache.friendly.code.pdf

Very slow performance of VMOVNTDQ instruction - https://community.intel.com/t5/Intel-ISA-Extensions/Very-slow-performance-of-VMOVNTDQ-instruction/td-p/941697
Thanks for the link. Regarding the poor performance of an AVX instruction intermixed with SSE, it is a well known issue. Because the hardware must save and restore the upper context of the YMMn registers, it will incur a penalty of a few dozen cycles. AVX 128-bit instructions automatically zero the upper half of the YMM registers; that is not the case when you use legacy SSE instructions, because they have no "knowledge" of the wider 256-bit registers. You can use Intel SDE to detect the penalty of an AVX-to-SSE transition. (The recommended fix, vzeroupper, is sketched at the end of this post.)

AVX transition penalties and OS support - https://community.intel.com/t5/Intel-ISA-Extensions/AVX-transition-penalties-and-OS-support/m-p/931977
Intel Avoiding AVX-SSE Transition Penalties - https://web.archive.org/web/20160409073240/software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

What Every Programmer Should Know About Memory - https://www.akkadia.org/drepper/cpumemory.pdf
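
The gist of the Intel "Avoiding AVX-SSE Transition Penalties" paper linked above is to issue vzeroupper between AVX code and any following legacy SSE code, so the dirty upper YMM state is cleared before the SSE instructions run. A minimal sketch, assuming AVX is available and rsi/rdi point at valid buffers:

    vmovdqu ymm0, [rsi]             ; 256-bit AVX work dirties the upper YMM state
    vmovdqu [rdi], ymm0
    vzeroupper                      ; zero the upper halves of all YMM registers
    movdqu  xmm1, [rsi+32]          ; legacy SSE now runs without a transition penalty
    movdqu  [rdi+32], xmm1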
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 10, 2021, 09:49:28 PM
Few dozens of cycles... fascinating :rolleyes:
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 10, 2021, 10:33:46 PM
Intel Manual

66 0F E7 /r
MOVNTDQ m128, xmm1
A V/V SSE2 Move packed integer values in xmm1 to m128 using nontemporal hint.


Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 10, 2021, 10:57:43 PM
Quote from: hutch-- on December 10, 2021, 10:33:46 PM
Intel Manual

66 0F E7 /r
MOVNTDQ m128, xmm1
A V/V SSE2 Move packed integer values in xmm1 to m128 using nontemporal hint.


Hi Hutch,

exactly :thup:  :thup:  :thup:  :thup: Thanks!

What is the meaning of "non temporal" memory accesses in x86 - https://stackoverflow.com/questions/37070/what-is-the-meaning-of-non-temporal-memory-accesses-in-x86
When are x86 LFENCE, SFENCE and MFENCE instructions required? - https://stackoverflow.com/questions/27595595/when-are-x86-lfence-sfence-and-mfence-instructions-required

The "non temporal" phrase means lacking temporal locality. Caches exploit two kinds of locality - spatial and temporal, and by using a non-temporal instruction you're signaling to the processor that you don't expect the data item be used in the near future.

Notes on "non-temporal" (aka "streaming") stores - https://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/
Optimizing Cache Usage With Nontemporal Accesses - https://vgatherps.github.io/2018-09-02-nontemporal/
void force_nt_store(cache_line *a) {
    __m128i zeros = {0, 0}; // chosen to use zeroing idiom;

    __asm volatile("movntdq %0, (%1)\n\t"
#if BYTES > 16
                   "movntdq %0, 16(%1)\n\t"
#endif
#if BYTES > 32
                   "movntdq %0, 32(%1)\n\t"
#endif
#if BYTES > 48
                   "movntdq %0, 48(%1)"
#endif
                   :
                   : "x" (zeros), "r" (&a->vec_val)
                   : "memory");
}

uint64_t run_timer_loop(void) {

    mfence();
    uint64_t start = rdtscp();

    for (int i = 0; i < 32; i++) {
        force_nt_store(&large_buffer[i]);
    }

    mfence();

    uint64_t end = rdtscp();
    return end - start;   /* elapsed TSC ticks; the return was missing in the excerpt */
}


nontemporal_stores
https://github.com/vgatherps/nontemporal_stores/blob/master/basic_write_allocate/test.c
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 10, 2021, 11:14:41 PM
movntdq + mfence - Example
https://www.felixcloutier.com/x86/mfence
https://www.felixcloutier.com/x86/lfence
https://www.felixcloutier.com/x86/sfence

    .686
    .model  flat,C
    .xmm
    .code

;------------------------------------------------------------------------------
;  VOID *
;  InternalMemCopyMem (
;    IN VOID   *Destination,
;    IN VOID   *Source,
;    IN UINTN  Count
;    );
;------------------------------------------------------------------------------
InternalMemCopyMem  PROC    USES    esi edi
    mov     esi, [esp + 16]             ; esi <- Source
    mov     edi, [esp + 12]             ; edi <- Destination
    mov     edx, [esp + 20]             ; edx <- Count
    lea     eax, [esi + edx - 1]        ; eax <- End of Source
    cmp     esi, edi
    jae     @F
    cmp     eax, edi                    ; Overlapped?
    jae     @CopyBackward               ; Copy backward if overlapped
@@:
    xor     ecx, ecx
    sub     ecx, edi
    and     ecx, 15                     ; ecx + edi aligns on 16-byte boundary
    jz      @F
    cmp     ecx, edx
    cmova   ecx, edx
    sub     edx, ecx                    ; edx <- remaining bytes to copy
    rep     movsb
@@:
    mov     ecx, edx
    and     edx, 15
    shr     ecx, 4                      ; ecx <- # of DQwords to copy
    jz      @CopyBytes
    add     esp, -16
    movdqu  [esp], xmm0                 ; save xmm0
@@:
    movdqu  xmm0, [esi]                 ; esi may not be 16-bytes aligned
    movntdq [edi], xmm0                 ; edi should be 16-bytes aligned
    add     esi, 16
    add     edi, 16
    loop    @B
    mfence
    movdqu  xmm0, [esp]                 ; restore xmm0
    add     esp, 16                     ; stack cleanup
    jmp     @CopyBytes
@CopyBackward:
    mov     esi, eax                    ; esi <- Last byte in Source
    lea     edi, [edi + edx - 1]        ; edi <- Last byte in Destination
    std
@CopyBytes:
    mov     ecx, edx
    rep     movsb
    cld
    mov     eax, [esp + 12]             ; eax <- Destination as return value
    ret
InternalMemCopyMem  ENDP

    END


    .686
    .model  flat,C
    .xmm
    .code

;------------------------------------------------------------------------------
;  VOID *
;  EFIAPI
;  InternalMemSetMem (
;    IN VOID   *Buffer,
;    IN UINTN  Count,
;    IN UINT8  Value
;    );
;------------------------------------------------------------------------------
InternalMemSetMem   PROC    USES    edi
    mov     edx, [esp + 12]             ; edx <- Count
    mov     edi, [esp + 8]              ; edi <- Buffer
    mov     al, [esp + 16]              ; al <- Value
    xor     ecx, ecx
    sub     ecx, edi
    and     ecx, 15                     ; ecx + edi aligns on 16-byte boundary
    jz      @F
    cmp     ecx, edx
    cmova   ecx, edx
    sub     edx, ecx
    rep     stosb
@@:
    mov     ecx, edx
    and     edx, 15
    shr     ecx, 4                      ; ecx <- # of DQwords to set
    jz      @SetBytes
    mov     ah, al                      ; ax <- Value | (Value << 8)
    add     esp, -16
    movdqu  [esp], xmm0                 ; save xmm0
    movd    xmm0, eax
    pshuflw xmm0, xmm0, 0               ; xmm0[0..63] <- Value repeats 8 times
    movlhps xmm0, xmm0                  ; xmm0 <- Value repeats 16 times
@@:
    movntdq [edi], xmm0                 ; edi should be 16-byte aligned
    add     edi, 16
    loop    @B
    mfence
    movdqu  xmm0, [esp]                 ; restore xmm0
    add     esp, 16                     ; stack cleanup
@SetBytes:
    mov     ecx, edx
    rep     stosb
    mov     eax, [esp + 8]              ; eax <- Buffer as return value
    ret
InternalMemSetMem   ENDP

    END



    .686
    .model  flat,C
    .xmm
    .code

;------------------------------------------------------------------------------
;  VOID *
;  EFIAPI
;  InternalMemZeroMem (
;    IN VOID   *Buffer,
;    IN UINTN  Count
;    );
;------------------------------------------------------------------------------
InternalMemZeroMem  PROC    USES    edi
    mov     edi, [esp + 8]
    mov     edx, [esp + 12]
    xor     ecx, ecx
    sub     ecx, edi
    xor     eax, eax
    and     ecx, 15
    jz      @F
    cmp     ecx, edx
    cmova   ecx, edx
    sub     edx, ecx
    rep     stosb
@@:
    mov     ecx, edx
    and     edx, 15
    shr     ecx, 4
    jz      @ZeroBytes
    pxor    xmm0, xmm0
@@:
    movntdq [edi], xmm0
    add     edi, 16
    loop    @B
    mfence
@ZeroBytes:
    mov     ecx, edx
    rep     stosb
    mov     eax, [esp + 8]
    ret
InternalMemZeroMem  ENDP

    END
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 11, 2021, 02:09:50 AM
"cpuid" before "rdtsc" - https://newbedev.com/cpuid-before-rdtsc

It's to prevent out-of-order execution. From a link that has now disappeared from the web (but which was fortuitously copied here before it disappeared), this text is from an article entitled "Performance monitoring" by one John Eckerdal:
The Pentium Pro and Pentium II processors support out-of-order execution: instructions may be executed in a different order than you programmed them. This can be a source of errors if not taken care of.
To prevent this the programmer must serialize the instruction queue. This can be done by inserting a serializing instruction, like the CPUID instruction, before the RDTSC instruction.

Two reasons:

As paxdiablo says, when the CPU sees a CPUID opcode it makes sure all the previous instructions are executed, then the CPUID is executed, before any subsequent instructions execute. Without such an instruction, the CPU execution pipeline may end up executing RDTSC before the instruction(s) you'd like to time.
A significant proportion of machines fail to synchronise the TSC registers across cores. If you want to read it from the horse's mouth - knock yourself out at http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx. So, when measuring an interval between TSC readings, unless they're taken on the same core you'll have an effectively random but possibly constant (see below) interval introduced - it can easily be several seconds (yes, seconds) even soon after bootup. This effectively reflects how long the BIOS was running on a single core before kicking off the others, plus - if you have any nasty power saving options on - increasing drift caused by cores running at different frequencies or shutting down again. So, if you haven't nailed the threads reading the TSC registers to the same core, then you'll need to build some kind of cross-core delta table and know the core id (which is returned by CPUID) of each TSC sample in order to compensate for this offset. That's another reason you can see CPUID alongside RDTSC, and indeed a reason why with the newer RDTSCP many OSes are storing core id numbers into the extra TSC_AUX[31:0] data returned. (Available from Core i7 and Athlon 64 X2, RDTSCP is a much better option in all respects - the OS normally gives you the core id as mentioned, it is atomic to the TSC read, and it prevents instruction reordering.)

CPUID is serializing, preventing out-of-order execution of RDTSC.

These days you can safely use LFENCE instead. It's documented as serializing on the instruction stream (but not stores to memory) on Intel CPUs, and now also on AMD after their microcode update for Spectre.

https://hadibrais.wordpress.com/2018/05/14/the-significance-of-the-x86-lfence-instruction/ explains more about LFENCE.

See also https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf for a way to use RDTSCP that keeps CPUID (or LFENCE) out of the timed region:
LFENCE     ; (or CPUID) Don't start the timed region until everything above has executed
RDTSC           ; EDX:EAX = timestamp
mov  ebx, eax   ; low 32 bits of start time

   code under test

RDTSCP     ; built-in one way barrier stops it from running early
LFENCE     ; (or CPUID) still use a barrier after to prevent anything weird
sub  eax, ebx   ; low 32 bits of end-start
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 11, 2021, 02:16:41 AM
I have generally found that the combination of CPUID and RDTSC stabilises timings and improves the accuracy of benchmarking.
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 11, 2021, 05:00:13 AM
@Hutch: I didn't know before why it was so  :thumbsup:

movntps + sfence - Example
xorps           macro   XMMReg1, XMMReg2
                db      0FH, 057H, 0C0H + (XMMReg1 * 8) + XMMReg2
                endm

movntps         macro   GeneralReg, Offset, XMMReg
                db      0FH, 02BH, 040H + (XmmReg * 8) + GeneralReg, Offset
                endm

sfence          macro
                db      0FH, 0AEH, 0F8H
                endm

movaps_load     macro   XMMReg, GeneralReg
                db      0FH, 028H, (XMMReg * 8) + 4, (4 * 8) + GeneralReg
                endm

movaps_store    macro   GeneralReg, XMMReg
                db      0FH, 029H, (XMMReg * 8) + 4, (4 * 8) + GeneralReg
                endm

;
; Register Definitions (for instruction macros).
;

rEAX            equ     0
rECX            equ     1
rEDX            equ     2
rEBX            equ     3
rESP            equ     4
rEBP            equ     5
rESI            equ     6
rEDI            equ     7


Test Proc

        sti                                     ; reenable context switching
        movaps_store rESP, 0                    ; save xmm0
        mov ecx, Dest
        call XMMZeroPage                        ; zero MEM
        movaps_load  0, rESP                    ; restore xmm

Test ENDP


XMMZeroPage Proc

        xorps   0, 0                            ; zero xmm0 (128 bits)
        mov     eax, SIZE                       ; Number of Iterations

inner:

        movntps rECX, 0,  0                     ; store bytes  0 - 15
        movntps rECX, 16, 0                     ;             16 - 31
        movntps rECX, 32, 0                     ;             32 - 47
        movntps rECX, 48, 0                     ;             48 - 63

        add     ecx, 64                         ; increment base
        dec     eax                             ; decrement loop count
        jnz     short inner

        ; Force all stores to complete before any other
        ; stores from this processor.

        sfence

ifndef SFENCE_IS_NOT_BUSTED

        ; the next uncached write to this processor's apic
        ; may fail unless the store pipes have drained.  sfence by
        ; itself is not enough.   Force drainage now by doing an
        ; interlocked exchange.

        xchg    [esp-4], eax

endif

        ret

XMMZeroPage ENDP


Intel memory ordering, fence instructions, and atomic operations - https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
MFENCE and LFENCE micro-architectural implementation (Patent) - https://patents.google.com/patent/US6678810B1/en or https://patentimages.storage.googleapis.com/d4/fd/41/fd35729a18a3cd/US6678810.pdf
MFENCE and LFENCE micro-architectural implementation method and system - https://patents.google.com/patent/US6651151B2/en or https://patentimages.storage.googleapis.com/fe/41/a3/ddea1fb5732c17/US6651151.pdf

Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?

x86 fence instructions can be briefly described as follows:
MFENCE prevents any later loads or stores from becoming globally observable before any earlier loads or stores. It drains the store buffer before later loads can execute.
LFENCE blocks instruction dispatch (Intel's terminology) until all earlier instructions retire. This is currently implemented by draining the ROB (ReOrder Buffer) before later instructions can issue into the back-end.
SFENCE only orders stores against other stores, i.e. prevents NT stores from committing from the store buffer ahead of SFENCE itself. But otherwise SFENCE is just like a plain store that moves through the store buffer. Think of it like putting a divider on a grocery-store checkout conveyor belt that stops NT stores from getting grabbed early. It does not necessarily force the store buffer to be drained before it retires from the ROB, so putting LFENCE after it doesn't add up to MFENCE.
A "serializing instruction" like CPUID (and IRET, etc) drains everything (ROB, store buffer) before later instructions can issue into the back-end. MFENCE + LFENCE would also do that, but true serializing instructions might also have other effects, I don't know.

Memory Reordering Caught in the Act - https://preshing.com/20120515/memory-reordering-caught-in-the-act/
Does the Intel Memory Model make SFENCE and LFENCE redundant? - https://stackoverflow.com/questions/32705169/does-the-intel-memory-model-make-sfence-and-lfence-redundant/32705560#32705560
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 11, 2021, 09:03:40 AM
If I am timing something really critical, I use the API SleepEx() to pause the thread for about 100 ms, to try and get the start of a time slice.
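
In MASM terms that is just the following; the 100 ms figure is the one mentioned above, and the 0 makes the wait non-alertable:

    invoke SleepEx, 100, 0          ; give up the rest of the slice, resume on a fresh one
    ; ... start the timed run here ...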
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 11, 2021, 09:27:13 AM
Quote from: hutch-- on December 11, 2021, 02:16:41 AM
I have generally found that the combination of CPUID and RDTSC stabilise timings and improves the accuracy of benchmarking.

\Masm32\macros\timers.asm
        xor   eax, eax        ;; Use same CPUID input value for each call
        cpuid                 ;; Flush pipe & wait for pending ops to finish
        rdtsc                 ;; Read Time Stamp Counter


Michael Webster :thumbsup:
Title: Re: Unaligned memory copy test piece.
Post by: nidud on December 11, 2021, 09:47:45 AM
deleted
Title: Re: Unaligned memory copy test piece.
Post by: daydreamer on December 12, 2021, 05:50:22 AM
Quote from: hutch-- on December 10, 2021, 10:33:46 PM
Intel Manual

66 0F E7 /r
MOVNTDQ m128, xmm1
A V/V SSE2 Move packed integer values in xmm1 to m128 using nontemporal hint.

The old advice was to use it for VRAM - PCI Express transfer to the GPU is faster - but I forgot to time the different alternatives when ddraw blends lots of circles; using movaps looked very fast, though.
Note the 66 prefix: most packed SSE2 instructions are one byte bigger than the SSE versions, so I wonder how many more instructions could fit in a 64-byte cache line with the SSE versions instead?
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 12, 2021, 02:05:43 PM
magnus,

I think you have missed something here. The mnemonic "movntdq" is designed to be used in conjunction with an instruction like "movdqa" or "movdqu": those load memory through the cache, while "movntdq" writes back to memory bypassing the cache. The reduction in cache pollution generally yields an improvement in performance.
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 12, 2021, 08:45:19 PM
Quote from: hutch-- on December 12, 2021, 02:05:43 PM"movntdq" is designed to be used in conjunction with an instruction like either "movdqa" or "movdqu"

I had movaps before, but even with movdqa or movdqu it won't become any faster. Mysterious :rolleyes:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

63929   cycles for 100 * rep movsb
95770   cycles for 100 * rep movsd
209628  cycles for 100 * movlps qword ptr [esi+8*ecx]
120512  cycles for 100 * movaps xmm0, oword ptr [esi]
176887  cycles for 100 * movdqa + movntdq
175955  cycles for 100 * movdqu + movntdq

65768   cycles for 100 * rep movsb
64697   cycles for 100 * rep movsd
206155  cycles for 100 * movlps qword ptr [esi+8*ecx]
122034  cycles for 100 * movaps xmm0, oword ptr [esi]
174827  cycles for 100 * movdqa + movntdq
176240  cycles for 100 * movdqu + movntdq

65109   cycles for 100 * rep movsb
64308   cycles for 100 * rep movsd
208594  cycles for 100 * movlps qword ptr [esi+8*ecx]
120838  cycles for 100 * movaps xmm0, oword ptr [esi]
176082  cycles for 100 * movdqa + movntdq
176391  cycles for 100 * movdqu + movntdq

65057   cycles for 100 * rep movsb
64755   cycles for 100 * rep movsd
206689  cycles for 100 * movlps qword ptr [esi+8*ecx]
121700  cycles for 100 * movaps xmm0, oword ptr [esi]
175600  cycles for 100 * movdqa + movntdq
176981  cycles for 100 * movdqu + movntdq

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
36      bytes for movdqa + movntdq
36      bytes for movdqu + movntdq
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 12, 2021, 11:31:29 PM
Quote from: jj2007 on December 12, 2021, 08:45:19 PM
Quote from: hutch-- on December 12, 2021, 02:05:43 PM"movntdq" is designed to be used in conjunction with an instruction like either "movdqa" or "movdqu"

I had movaps before, but even with movdqa or movdqu it won't become any faster. Mysterious :rolleyes:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

63929   cycles for 100 * rep movsb
95770   cycles for 100 * rep movsd
209628  cycles for 100 * movlps qword ptr [esi+8*ecx]
120512  cycles for 100 * movaps xmm0, oword ptr [esi]
176887  cycles for 100 * movdqa + movntdq
175955  cycles for 100 * movdqu + movntdq

65768   cycles for 100 * rep movsb
64697   cycles for 100 * rep movsd
206155  cycles for 100 * movlps qword ptr [esi+8*ecx]
122034  cycles for 100 * movaps xmm0, oword ptr [esi]
174827  cycles for 100 * movdqa + movntdq
176240  cycles for 100 * movdqu + movntdq

65109   cycles for 100 * rep movsb
64308   cycles for 100 * rep movsd
208594  cycles for 100 * movlps qword ptr [esi+8*ecx]
120838  cycles for 100 * movaps xmm0, oword ptr [esi]
176082  cycles for 100 * movdqa + movntdq
176391  cycles for 100 * movdqu + movntdq

65057   cycles for 100 * rep movsb
64755   cycles for 100 * rep movsd
206689  cycles for 100 * movlps qword ptr [esi+8*ecx]
121700  cycles for 100 * movaps xmm0, oword ptr [esi]
175600  cycles for 100 * movdqa + movntdq
176981  cycles for 100 * movdqu + movntdq

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
36      bytes for movdqa + movntdq
36      bytes for movdqu + movntdq


Hi jj2007,

please add two more examples from here http://masm32.com/board/index.php?topic=9691.msg106286#msg106286

"movntdq + mfence"
@@:
    movdqu  xmm0, [esi]                 ; esi may not be 16-bytes aligned
    movntdq [edi], xmm0                 ; edi should be 16-bytes aligned
    add     esi, 16
    add     edi, 16
    loop    @B
    mfence


11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

17005   cycles for 100 * rep movsb
16922   cycles for 100 * rep movsd
106248  cycles for 100 * movlps qword ptr [esi+8*ecx]
41768   cycles for 100 * movaps xmm0, oword ptr [esi]
56037   cycles for 100 * movdqa + movntdq
55746   cycles for 100 * movdqu + movntdq

16797   cycles for 100 * rep movsb
17090   cycles for 100 * rep movsd
105885  cycles for 100 * movlps qword ptr [esi+8*ecx]
42111   cycles for 100 * movaps xmm0, oword ptr [esi]
56001   cycles for 100 * movdqa + movntdq
56026   cycles for 100 * movdqu + movntdq

17075   cycles for 100 * rep movsb
16702   cycles for 100 * rep movsd
107414  cycles for 100 * movlps qword ptr [esi+8*ecx]
41896   cycles for 100 * movaps xmm0, oword ptr [esi]
56205   cycles for 100 * movdqa + movntdq
56293   cycles for 100 * movdqu + movntdq

16736   cycles for 100 * rep movsb
17064   cycles for 100 * rep movsd
105788  cycles for 100 * movlps qword ptr [esi+8*ecx]
41915   cycles for 100 * movaps xmm0, oword ptr [esi]
56349   cycles for 100 * movdqa + movntdq
56819   cycles for 100 * movdqu + movntdq

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
36      bytes for movdqa + movntdq
36      bytes for movdqu + movntdq


--- ok ---
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 12, 2021, 11:42:46 PM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

62547   cycles for 100 * rep movsb
63471   cycles for 100 * rep movsd
119633  cycles for 100 * movlps qword ptr [esi+8*ecx]
60383   cycles for 100 * movaps xmm0, oword ptr [esi]
120757  cycles for 100 * movdqa + movntdq
115172  cycles for 100 * movdqu + movntdq

63334   cycles for 100 * rep movsb
62718   cycles for 100 * rep movsd
118873  cycles for 100 * movlps qword ptr [esi+8*ecx]
60457   cycles for 100 * movaps xmm0, oword ptr [esi]
112820  cycles for 100 * movdqa + movntdq
116539  cycles for 100 * movdqu + movntdq

62664   cycles for 100 * rep movsb
63786   cycles for 100 * rep movsd
119998  cycles for 100 * movlps qword ptr [esi+8*ecx]
57309   cycles for 100 * movaps xmm0, oword ptr [esi]
118881  cycles for 100 * movdqa + movntdq
112190  cycles for 100 * movdqu + movntdq

63090   cycles for 100 * rep movsb
63073   cycles for 100 * rep movsd
118692  cycles for 100 * movlps qword ptr [esi+8*ecx]
59713   cycles for 100 * movaps xmm0, oword ptr [esi]
117861  cycles for 100 * movdqa + movntdq
117263  cycles for 100 * movdqu + movntdq

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
36      bytes for movdqa + movntdq
36      bytes for movdqu + movntdq
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 13, 2021, 12:03:37 AM
Quote from: LiaoMi on December 12, 2021, 11:31:29 PMplease add two more examples from here http://masm32.com/board/index.php?topic=9691.msg106286#msg106286

"movntdq + mfence"
@@:
    movdqu  xmm0, [esi]                 ; esi may not be 16-bytes aligned
    movntdq [edi], xmm0                 ; edi should be 16-bytes aligned
    add     esi, 16
    add     edi, 16
    loop    @B
    mfence

I'm not impressed...
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

65904   cycles for 100 * rep movsb
70679   cycles for 100 * rep movsd
207177  cycles for 100 * movlps qword ptr [esi+8*ecx]
121524  cycles for 100 * movaps xmm0, oword ptr [esi]
191206  cycles for 100 * movdqa + movntdq
194912  cycles for 100 * movdqu + movntdq
197640  cycles for 100 * movdqu + movntdq + mfence

66396   cycles for 100 * rep movsb
64295   cycles for 100 * rep movsd
207218  cycles for 100 * movlps qword ptr [esi+8*ecx]
121237  cycles for 100 * movaps xmm0, oword ptr [esi]
192188  cycles for 100 * movdqa + movntdq
193955  cycles for 100 * movdqu + movntdq
195811  cycles for 100 * movdqu + movntdq + mfence

65465   cycles for 100 * rep movsb
63888   cycles for 100 * rep movsd
209074  cycles for 100 * movlps qword ptr [esi+8*ecx]
122465  cycles for 100 * movaps xmm0, oword ptr [esi]
190494  cycles for 100 * movdqa + movntdq
192326  cycles for 100 * movdqu + movntdq
198034  cycles for 100 * movdqu + movntdq + mfence

65560   cycles for 100 * rep movsb
65119   cycles for 100 * rep movsd
206794  cycles for 100 * movlps qword ptr [esi+8*ecx]
121545  cycles for 100 * movaps xmm0, oword ptr [esi]
191100  cycles for 100 * movdqa + movntdq
196902  cycles for 100 * movdqu + movntdq
197136  cycles for 100 * movdqu + movntdq + mfence

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
36      bytes for movdqa + movntdq
36      bytes for movdqu + movntdq
39      bytes for movdqu + movntdq + mfence
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 13, 2021, 12:19:18 AM
JJ,

Keep in mind that the size of the sample being looped will affect the timing of most of the combinations. If you run a gigabyte sample, you get rid of those effects. The other factor is that different hardware will give different results.
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 13, 2021, 01:40:35 AM
Quote from: jj2007 on December 13, 2021, 12:03:37 AM
Quote from: LiaoMi on December 12, 2021, 11:31:29 PMplease add two more examples from here http://masm32.com/board/index.php?topic=9691.msg106286#msg106286

"movntdq + mfence"
@@:
    movdqu  xmm0, [esi]                 ; esi may not be 16-bytes aligned
    movntdq [edi], xmm0                 ; edi should be 16-bytes aligned
    add     esi, 16
    add     edi, 16
    loop    @B
    mfence

I'm not impressed...

Very curious results  :rolleyes:
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

17782   cycles for 100 * rep movsb
17554   cycles for 100 * rep movsd
115012  cycles for 100 * movlps qword ptr [esi+8*ecx]
52006   cycles for 100 * movaps xmm0, oword ptr [esi]
58101   cycles for 100 * movdqa + movntdq
57415   cycles for 100 * movdqu + movntdq
73437   cycles for 100 * movdqu + movntdq + mfence

18073   cycles for 100 * rep movsb
17701   cycles for 100 * rep movsd
110545  cycles for 100 * movlps qword ptr [esi+8*ecx]
42643   cycles for 100 * movaps xmm0, oword ptr [esi]
56827   cycles for 100 * movdqa + movntdq
58362   cycles for 100 * movdqu + movntdq
72001   cycles for 100 * movdqu + movntdq + mfence

19436   cycles for 100 * rep movsb
17883   cycles for 100 * rep movsd
107491  cycles for 100 * movlps qword ptr [esi+8*ecx]
43259   cycles for 100 * movaps xmm0, oword ptr [esi]
56876   cycles for 100 * movdqa + movntdq
57166   cycles for 100 * movdqu + movntdq
74082   cycles for 100 * movdqu + movntdq + mfence

18036   cycles for 100 * rep movsb
18419   cycles for 100 * rep movsd
106922  cycles for 100 * movlps qword ptr [esi+8*ecx]
42377   cycles for 100 * movaps xmm0, oword ptr [esi]
58547   cycles for 100 * movdqa + movntdq
57797   cycles for 100 * movdqu + movntdq
74547   cycles for 100 * movdqu + movntdq + mfence

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
36      bytes for movdqa + movntdq
36      bytes for movdqu + movntdq
39      bytes for movdqu + movntdq + mfence


--- ok ---
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 13, 2021, 01:39:30 PM
Here are 2 test pieces for 32 bit memory copy, an unaligned SSE2 copy and a normal rep movsb copy. In every instance the SSE2 version is faster, to the extent that the test pieces don't need to be stabilised or run with a higher priority. It is the same source for both, but each has been saved as a separate exe file so that there is no cache interaction between the two.

SSE2 Copy
--------
843 ms
--------
Press any key to continue ...

ByteCopy
--------
1219 ms
--------
Press any key to continue ...

Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 13, 2021, 07:55:00 PM
Quote from: hutch-- on December 13, 2021, 01:39:30 PM
Here are 2 test pieces for 32 bit memory copy, an unaligned SSE2 copy and a normal rep movsb copy. In every instance the SSE2 version is faster, to the extent that the test pieces don't need to be stabilised or run with a higher priority. It is the same source for both, but each has been saved as a separate exe file so that there is no cache interaction between the two.

SSE2 Copy
--------
843 ms
--------
Press any key to continue ...

ByteCopy
--------
1219 ms
--------
Press any key to continue ...


SSE2 Copy
--------
640 ms
--------
Press any key to continue ...

ByteCopy
--------
719 ms
--------
Press any key to continue ...
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 13, 2021, 07:58:53 PM
Similar for me. The interesting bit, though: if you use movups instead of movnt, rep movsb is faster.

My test pieces were made with a shorter copy but more iterations, which implies that they used the cache - and there, apparently, movnt has no advantage.
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 13, 2021, 09:23:23 PM
I think that is normally the case with SSE mnemonics: they are designed for streaming and apparently have a wind-up effect that works against short-duration loop code. In the past, the advice on SSE code was to more or less forget integer code optimisation and set up the SSE code so it did the work. I have never yet got any gain out of SSE code by unrolling it, so it really is a different system built into the CPU.

This much is clear: over time the unaligned mnemonics have come a lot closer in speed to the fully aligned ones, and there is little gain in using the aligned instructions unless your code design requires it.
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 13, 2021, 09:37:08 PM
ByteCopy
--------
1266 ms
--------
SSE2
--------
875 ms
--------
Title: Re: Unaligned memory copy test piece.
Post by: mineiro on December 14, 2021, 01:24:01 AM
bytecopy.exe
--------
695 ms
--------
sse2copy.exe
--------
680 ms
--------

Quote from: jj2007 on December 13, 2021, 07:58:53 PM
Similar for me. The interesting bit, though: if you use movups instead of movnt, rep movsb is faster.
I did some tests here: when I align the data to 64, rep movs(b/d) execution time decreased by a 40-50% ratio. The other functions' results remain unchanged.

align 16                ; each align-16/db-1 pair advances the location
db 1                    ; counter exactly 16 bytes past the previous
align 16                ; boundary, shifting where the data falls within
db 1                    ; its 64-byte cache line
align 16
db 1
align 16
somestring db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero"
REPEAT 99               ; 100 copies of the test string in total
db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero"
ENDM

Title: Re: Unaligned memory copy test piece.
Post by: daydreamer on December 14, 2021, 02:00:05 AM
I am interested in timings with ddraw or SDL from someone with PCI Express: does movntdqa still have the advantage for system RAM -> VRAM copies, and is the disadvantage that reading from VRAM is ~100x slower?
I can't test this on a laptop with shared memory.
I wonder if the dx loadtexturefrommemory api uses movntdqa?
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 14, 2021, 03:54:57 AM
I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
++++++++-+++8 of 20 tests valid,
315842  kCycles for 1 * rep movsb
241232  kCycles for 1 * rep movsd
304695  kCycles for 1 * movlps qword ptr [esi+8*ecx]
302014  kCycles for 1 * movaps xmm0, oword ptr [esi]
208018  kCycles for 1 * movdqa + movntdq
207876  kCycles for 1 * movdqu + movntdq
207752  kCycles for 1 * movdqu + movntdq + mfence

249181  kCycles for 1 * rep movsb
239809  kCycles for 1 * rep movsd
304868  kCycles for 1 * movlps qword ptr [esi+8*ecx]
301253  kCycles for 1 * movaps xmm0, oword ptr [esi]
207931  kCycles for 1 * movdqa + movntdq
208272  kCycles for 1 * movdqu + movntdq
207503  kCycles for 1 * movdqu + movntdq + mfence

249727  kCycles for 1 * rep movsb
241799  kCycles for 1 * rep movsd
303516  kCycles for 1 * movlps qword ptr [esi+8*ecx]
301728  kCycles for 1 * movaps xmm0, oword ptr [esi]
207608  kCycles for 1 * movdqa + movntdq
208094  kCycles for 1 * movdqu + movntdq
208854  kCycles for 1 * movdqu + movntdq + mfence

248574  kCycles for 1 * rep movsb
240836  kCycles for 1 * rep movsd
304675  kCycles for 1 * movlps qword ptr [esi+8*ecx]
301674  kCycles for 1 * movaps xmm0, oword ptr [esi]
208379  kCycles for 1 * movdqa + movntdq
207882  kCycles for 1 * movdqu + movntdq
207742  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence
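
About the one big allocation mentioned above: one way to get such a buffer (the testbed's own allocation code is not shown in the thread) is a direct VirtualAlloc, which returns page-aligned, zero-filled memory:

    invoke VirtualAlloc, NULL, 512*1024*1024, MEM_RESERVE or MEM_COMMIT, PAGE_READWRITE
    mov pBuffer, eax            ; base address, at least 4 KB aligned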
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 14, 2021, 04:13:17 AM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
+-++++++++++++++++++
261718  kCycles for 1 * rep movsb
238384  kCycles for 1 * rep movsd
269729  kCycles for 1 * movlps qword ptr [esi+8*ecx]
236182  kCycles for 1 * movaps xmm0, oword ptr [esi]
156614  kCycles for 1 * movdqa + movntdq
156042  kCycles for 1 * movdqu + movntdq
156594  kCycles for 1 * movdqu + movntdq + mfence

236577  kCycles for 1 * rep movsb
236369  kCycles for 1 * rep movsd
270908  kCycles for 1 * movlps qword ptr [esi+8*ecx]
236626  kCycles for 1 * movaps xmm0, oword ptr [esi]
156354  kCycles for 1 * movdqa + movntdq
155998  kCycles for 1 * movdqu + movntdq
156243  kCycles for 1 * movdqu + movntdq + mfence

235567  kCycles for 1 * rep movsb
236734  kCycles for 1 * rep movsd
276802  kCycles for 1 * movlps qword ptr [esi+8*ecx]
236012  kCycles for 1 * movaps xmm0, oword ptr [esi]
156233  kCycles for 1 * movdqa + movntdq
156445  kCycles for 1 * movdqu + movntdq
156944  kCycles for 1 * movdqu + movntdq + mfence

237039  kCycles for 1 * rep movsb
238233  kCycles for 1 * rep movsd
270702  kCycles for 1 * movlps qword ptr [esi+8*ecx]
236677  kCycles for 1 * movaps xmm0, oword ptr [esi]
155610  kCycles for 1 * movdqa + movntdq
156935  kCycles for 1 * movdqu + movntdq
156013  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence
Title: Re: Unaligned memory copy test piece.
Post by: Siekmanski on December 14, 2021, 05:56:35 AM
AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
-----------9 of 20 tests valid,
103297  kCycles for 1 * rep movsb
83453   kCycles for 1 * rep movsd
151305  kCycles for 1 * movlps qword ptr [esi+8*ecx]
140797  kCycles for 1 * movaps xmm0, oword ptr [esi]
85881   kCycles for 1 * movdqa + movntdq
87420   kCycles for 1 * movdqu + movntdq
85314   kCycles for 1 * movdqu + movntdq + mfence

82482   kCycles for 1 * rep movsb
81107   kCycles for 1 * rep movsd
148720  kCycles for 1 * movlps qword ptr [esi+8*ecx]
140735  kCycles for 1 * movaps xmm0, oword ptr [esi]
87417   kCycles for 1 * movdqa + movntdq
85541   kCycles for 1 * movdqu + movntdq
86765   kCycles for 1 * movdqu + movntdq + mfence

83181   kCycles for 1 * rep movsb
81348   kCycles for 1 * rep movsd
149105  kCycles for 1 * movlps qword ptr [esi+8*ecx]
141740  kCycles for 1 * movaps xmm0, oword ptr [esi]
85743   kCycles for 1 * movdqa + movntdq
86256   kCycles for 1 * movdqu + movntdq
87608   kCycles for 1 * movdqu + movntdq + mfence

81210   kCycles for 1 * rep movsb
81663   kCycles for 1 * rep movsd
150942  kCycles for 1 * movlps qword ptr [esi+8*ecx]
140339  kCycles for 1 * movaps xmm0, oword ptr [esi]
86239   kCycles for 1 * movdqa + movntdq
87931   kCycles for 1 * movdqu + movntdq
85408   kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


Hi Jochen,
What does the number of valid tests mean?
Title: Re: Unaligned memory copy test piece.
Post by: FORTRANS on December 14, 2021, 06:32:45 AM
Hi Jochen,

   Two systems.  Tests valid?

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
+++-+++++++++++5 of 20 tests valid,
352139  kCycles for 1 * rep movsb
237663  kCycles for 1 * rep movsd
287198  kCycles for 1 * movlps qword ptr [esi+8*ecx]
279630  kCycles for 1 * movaps xmm0, oword ptr [esi]
181083  kCycles for 1 * movdqa + movntdq
181125  kCycles for 1 * movdqu + movntdq
180909  kCycles for 1 * movdqu + movntdq + mfence

193980  kCycles for 1 * rep movsb
214331  kCycles for 1 * rep movsd
278425  kCycles for 1 * movlps qword ptr [esi+8*ecx]
274866  kCycles for 1 * movaps xmm0, oword ptr [esi]
179814  kCycles for 1 * movdqa + movntdq
179520  kCycles for 1 * movdqu + movntdq
179469  kCycles for 1 * movdqu + movntdq + mfence

192139  kCycles for 1 * rep movsb
210664  kCycles for 1 * rep movsd
279349  kCycles for 1 * movlps qword ptr [esi+8*ecx]
274878  kCycles for 1 * movaps xmm0, oword ptr [esi]
180115  kCycles for 1 * movdqa + movntdq
179785  kCycles for 1 * movdqu + movntdq
179769  kCycles for 1 * movdqu + movntdq + mfence

191809  kCycles for 1 * rep movsb
216878  kCycles for 1 * rep movsd
277053  kCycles for 1 * movlps qword ptr [esi+8*ecx]
277263  kCycles for 1 * movaps xmm0, oword ptr [esi]
178899  kCycles for 1 * movdqa + movntdq
179083  kCycles for 1 * movdqu + movntdq
178847  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


--- ok ---


Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
+++++15 of 20 tests valid,
241261  kCycles for 1 * rep movsb
201928  kCycles for 1 * rep movsd
250473  kCycles for 1 * movlps qword ptr [esi+8*ecx]
245765  kCycles for 1 * movaps xmm0, oword ptr [esi]
182061  kCycles for 1 * movdqa + movntdq
187806  kCycles for 1 * movdqu + movntdq
197382  kCycles for 1 * movdqu + movntdq + mfence

224850  kCycles for 1 * rep movsb
202630  kCycles for 1 * rep movsd
234536  kCycles for 1 * movlps qword ptr [esi+8*ecx]
228211  kCycles for 1 * movaps xmm0, oword ptr [esi]
191152  kCycles for 1 * movdqa + movntdq
188628  kCycles for 1 * movdqu + movntdq
185565  kCycles for 1 * movdqu + movntdq + mfence

206426  kCycles for 1 * rep movsb
206008  kCycles for 1 * rep movsd
233301  kCycles for 1 * movlps qword ptr [esi+8*ecx]
229024  kCycles for 1 * movaps xmm0, oword ptr [esi]
181524  kCycles for 1 * movdqa + movntdq
198103  kCycles for 1 * movdqu + movntdq
177373  kCycles for 1 * movdqu + movntdq + mfence

199886  kCycles for 1 * rep movsb
200050  kCycles for 1 * rep movsd
233793  kCycles for 1 * movlps qword ptr [esi+8*ecx]
228413  kCycles for 1 * movaps xmm0, oword ptr [esi]
177392  kCycles for 1 * movdqa + movntdq
175842  kCycles for 1 * movdqu + movntdq
175220  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


--- ok ---


Regards,

Steve N.
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 14, 2021, 08:09:43 AM
Quote from: jj2007 on December 14, 2021, 03:54:57 AM
I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
++++---++-+--+++-++1 of 20 tests valid,
111187  kCycles for 1 * rep movsb
86590   kCycles for 1 * rep movsd
131239  kCycles for 1 * movlps qword ptr [esi+8*ecx]
119079  kCycles for 1 * movaps xmm0, oword ptr [esi]
99788   kCycles for 1 * movdqa + movntdq
89096   kCycles for 1 * movdqu + movntdq
92749   kCycles for 1 * movdqu + movntdq + mfence

99740   kCycles for 1 * rep movsb
92438   kCycles for 1 * rep movsd
119977  kCycles for 1 * movlps qword ptr [esi+8*ecx]
111659  kCycles for 1 * movaps xmm0, oword ptr [esi]
76366   kCycles for 1 * movdqa + movntdq
79162   kCycles for 1 * movdqu + movntdq
77279   kCycles for 1 * movdqu + movntdq + mfence

89597   kCycles for 1 * rep movsb
85665   kCycles for 1 * rep movsd
125051  kCycles for 1 * movlps qword ptr [esi+8*ecx]
111892  kCycles for 1 * movaps xmm0, oword ptr [esi]
76149   kCycles for 1 * movdqa + movntdq
76483   kCycles for 1 * movdqu + movntdq
76167   kCycles for 1 * movdqu + movntdq + mfence

86964   kCycles for 1 * rep movsb
85324   kCycles for 1 * rep movsd
121596  kCycles for 1 * movlps qword ptr [esi+8*ecx]
111769  kCycles for 1 * movaps xmm0, oword ptr [esi]
76968   kCycles for 1 * movdqa + movntdq
75970   kCycles for 1 * movdqu + movntdq
75677   kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


--- ok ---


Title: Re: Unaligned memory copy test piece.
Post by: mineiro on December 14, 2021, 08:16:35 AM
Quote from: jj2007 on December 14, 2021, 03:54:57 AM
I changed my testbed to one big allocation (0.5GB), in order to minimise cache use. Now movntdq shines, of course, but rep movs is only about 15% slower:
Quote from: mineiro on December 14, 2021, 01:24:01 AM
I did some tests here: when I align the data to 64, rep movs(b/d) execution time decreased by roughly 40-50%. The other functions' results remain unchanged.
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 14, 2021, 09:01:47 AM


Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
++-++--++++9 of 20 tests valid,
242232  kCycles for 1 * rep movsb
215571  kCycles for 1 * rep movsd
227473  kCycles for 1 * movlps qword ptr [esi+8*ecx]
230151  kCycles for 1 * movaps xmm0, oword ptr [esi]
146063  kCycles for 1 * movdqa + movntdq
146242  kCycles for 1 * movdqu + movntdq
146090  kCycles for 1 * movdqu + movntdq + mfence

208745  kCycles for 1 * rep movsb
214643  kCycles for 1 * rep movsd
227536  kCycles for 1 * movlps qword ptr [esi+8*ecx]
230180  kCycles for 1 * movaps xmm0, oword ptr [esi]
146452  kCycles for 1 * movdqa + movntdq
146127  kCycles for 1 * movdqu + movntdq
146534  kCycles for 1 * movdqu + movntdq + mfence

208675  kCycles for 1 * rep movsb
213978  kCycles for 1 * rep movsd
227433  kCycles for 1 * movlps qword ptr [esi+8*ecx]
230045  kCycles for 1 * movaps xmm0, oword ptr [esi]
146253  kCycles for 1 * movdqa + movntdq
146041  kCycles for 1 * movdqu + movntdq
146361  kCycles for 1 * movdqu + movntdq + mfence

208670  kCycles for 1 * rep movsb
216371  kCycles for 1 * rep movsd
227677  kCycles for 1 * movlps qword ptr [esi+8*ecx]
229852  kCycles for 1 * movaps xmm0, oword ptr [esi]
146070  kCycles for 1 * movdqa + movntdq
146349  kCycles for 1 * movdqu + movntdq
146292  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence



Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 14, 2021, 09:31:18 AM
Thanks to everybody :thup:

I see a pretty consistent pattern: rep movs* faster than movlps and movaps, but movntdq faster than rep movs*. Makes sense :thumbsup:
Title: Re: Unaligned memory copy test piece.
Post by: FORTRANS on December 14, 2021, 09:44:49 AM
Hi,

   I tried it on two older machines.  One locked up ( kinda / sorta ),
one ran okay.

Genuine Intel(R) CPU           T2400  @ 1.83GHz (SSE3)
++++++++++++++6 of 20 tests valid,
916706 kCycles for 1 * rep movsb
822971 kCycles for 1 * rep movsd
1043451 kCycles for 1 * movlps qword ptr [esi+8*ecx]
904357 kCycles for 1 * movaps xmm0, oword ptr [esi]
708260 kCycles for 1 * movdqa + movntdq
709339 kCycles for 1 * movdqu + movntdq
683374 kCycles for 1 * movdqu + movntdq + mfence

825066 kCycles for 1 * rep movsb
834842 kCycles for 1 * rep movsd
1048805 kCycles for 1 * movlps qword ptr [esi+8*ecx]
904001 kCycles for 1 * movaps xmm0, oword ptr [esi]
677020 kCycles for 1 * movdqa + movntdq
684107 kCycles for 1 * movdqu + movntdq
683326 kCycles for 1 * movdqu + movntdq + mfence

820533 kCycles for 1 * rep movsb
820807 kCycles for 1 * rep movsd
1033703 kCycles for 1 * movlps qword ptr [esi+8*ecx]
899210 kCycles for 1 * movaps xmm0, oword ptr [esi]
677950 kCycles for 1 * movdqa + movntdq
685388 kCycles for 1 * movdqu + movntdq
682889 kCycles for 1 * movdqu + movntdq + mfence

819905 kCycles for 1 * rep movsb
820346 kCycles for 1 * rep movsd
1033769 kCycles for 1 * movlps qword ptr [esi+8*ecx]
899579 kCycles for 1 * movaps xmm0, oword ptr [esi]
676660 kCycles for 1 * movdqa + movntdq
685048 kCycles for 1 * movdqu + movntdq
682723 kCycles for 1 * movdqu + movntdq + mfence

21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence


--- ok ---


Steve N.
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 14, 2021, 06:39:24 PM

                                             AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
                                                     AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
                                                             Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
                                                                     Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
                                                                             Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
                                                                                     Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
                                                                                             11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
                                                                                                     Genuine Intel(R) CPU           T2400  @ 1.83GHz (SSE3)
kCycles for 1 * rep movsb                    261718  103297  352139  241261  315842  242232  111187  916706
kCycles for 1 * rep movsd                    238384   83453  237663  201928  241232  215571   86590  822971
kCycles for 1 * movlps qword ptr [esi+8*ecx  269729  151305  287198  250473  304695  227473  131239 1043451
kCycles for 1 * movaps xmm0, oword ptr [esi  236182  140797  279630  245765  302014  230151  119079  904357
kCycles for 1 * movdqa + movntdq             156614   85881  181083  182061  208018  146063   99788  708260
kCycles for 1 * movdqu + movntdq             156042   87420  181125  187806  207876  146242   89096  709339
kCycles for 1 * movdqu + movntdq + mfence    156594   85314  180909  197382  207752  146090   92749  683374

kCycles for 1 * rep movsb                    236577   82482  193980  224850  249181  208745   99740  825066
kCycles for 1 * rep movsd                    236369   81107  214331  202630  239809  214643   92438  834842
kCycles for 1 * movlps qword ptr [esi+8*ecx  270908  148720  278425  234536  304868  227536  119977 1048805
kCycles for 1 * movaps xmm0, oword ptr [esi  236626  140735  274866  228211  301253  230180  111659  904001
kCycles for 1 * movdqa + movntdq             156354   87417  179814  191152  207931  146452   76366  677020
kCycles for 1 * movdqu + movntdq             155998   85541  179520  188628  208272  146127   79162  684107
kCycles for 1 * movdqu + movntdq + mfence    156243   86765  179469  185565  207503  146534   77279  683326

kCycles for 1 * rep movsb                    235567   83181  192139  206426  249727  208675   89597  820533
kCycles for 1 * rep movsd                    236734   81348  210664  206008  241799  213978   85665  820807
kCycles for 1 * movlps qword ptr [esi+8*ecx  276802  149105  279349  233301  303516  227433  125051 1033703
kCycles for 1 * movaps xmm0, oword ptr [esi  236012  141740  274878  229024  301728  230045  111892  899210
kCycles for 1 * movdqa + movntdq             156233   85743  180115  181524  207608  146253   76149  677950
kCycles for 1 * movdqu + movntdq             156445   86256  179785  198103  208094  146041   76483  685388
kCycles for 1 * movdqu + movntdq + mfence    156944   87608  179769  177373  208854  146361   76167  682889

kCycles for 1 * rep movsb                    237039   81210  191809  199886  248574  208670   86964  819905
kCycles for 1 * rep movsd                    238233   81663  216878  200050  240836  216371   85324  820346
kCycles for 1 * movlps qword ptr [esi+8*ecx  270702  150942  277053  233793  304675  227677  121596 1033769
kCycles for 1 * movaps xmm0, oword ptr [esi  236677  140339  277263  228413  301674  229852  111769  899579
kCycles for 1 * movdqa + movntdq             155610   86239  178899  177392  208379  146070   76968  676660
kCycles for 1 * movdqu + movntdq             156935   87931  179083  175842  207882  146349   75970  685048
kCycles for 1 * movdqu + movntdq + mfence    156013   85408  178847  175220  207742  146292   75677  682723
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 14, 2021, 07:21:49 PM
Wow, you put a lot of work into this one, Timo :thumbsup:
Do you have it as a tab-delimited or csv file?

Thanks, Timo, here are the averages - extremely consistent:

All 9 CPUs
276 rep movsb
264 rep movsd
330 movlps qword ptr [esi+8*ecx]
304 movaps xmm0, oword ptr [esi]
216 movdqa + movntdq
217 movdqu + movntdq
216 movdqu + movntdq + mfence


AMD only
165 rep movsb
160 rep movsd
211 movlps qword ptr [esi+8*ecx]
189 movaps xmm0, oword ptr [esi]
121 movdqa + movntdq
122 movdqu + movntdq
121 movdqu + movntdq + mfence
Title: Re: Unaligned memory copy test piece.
Post by: FORTRANS on December 15, 2021, 02:58:14 AM
Hi,

   Using Timo's data I dropped the first data value in each run
(they looked about 10% too large) and normalized the averages
for each processor.


                      MASM Forum example     AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
                                                   AMD Ryzen 9 5950X 16-Core Processor (SSE4)
                                                         Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
                                                               Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
                                                                     Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
                                                                           Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
                                                                                 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
                                                                                       Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)

kCycles for 1 * rep movsb                    1.51  1.00  1.07  1.15  1.20  1.43  1.15  1.20
kCycles for 1 * rep movsd                    1.52  1.00  1.22  1.11  1.16  1.47  1.09  1.21
kCycles for 1 * movlps qword ptr [esi+8*ecx] 1.74  1.83  1.56  1.30  1.46  1.56  1.55  1.52
kCycles for 1 * movaps xmm0, oword ptr [esi] 1.51  1.72  1.54  1.27  1.45  1.57  1.42  1.32
kCycles for 1 * movdqa + movntdq             1.00  1.05  1.00  1.00  1.00  1.00  1.03  1.00
kCycles for 1 * movdqu + movntdq             1.00  1.06  1.00  1.02  1.00  1.00  1.00  1.01
kCycles for 1 * movdqu + movntdq + mfence    1.00  1.05  1.00  1.00  1.00  1.00  1.00  1.00


Cheers,

Steve N.
Title: Re: Unaligned memory copy test piece.
Post by: daydreamer on December 15, 2021, 05:23:23 AM
What does invalid mean?
Jochen, if it's cache-size dependent, wouldn't it be good to use some API that shows the cache size?
Celeron has a smaller cache, Xeon a bigger one, and newer-generation CPUs with many cores have bigger caches still.
It reminds me of that movie where you are called an "invalid" if you are natural born, compared to the genetically improved.
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
+++++++-+11 of 20 tests valid,
212547  kCycles for 1 * rep movsb
179930  kCycles for 1 * rep movsd
220533  kCycles for 1 * movlps qword ptr [esi+8*ecx]
202840  kCycles for 1 * movaps xmm0, oword ptr [esi]
191639  kCycles for 1 * movdqa + movntdq
184883  kCycles for 1 * movdqu + movntdq
170484  kCycles for 1 * movdqu + movntdq + mfence

178210  kCycles for 1 * rep movsb
178902  kCycles for 1 * rep movsd
218612  kCycles for 1 * movlps qword ptr [esi+8*ecx]
202293  kCycles for 1 * movaps xmm0, oword ptr [esi]
152501  kCycles for 1 * movdqa + movntdq
151305  kCycles for 1 * movdqu + movntdq
151683  kCycles for 1 * movdqu + movntdq + mfence

178193  kCycles for 1 * rep movsb
178480  kCycles for 1 * rep movsd
219341  kCycles for 1 * movlps qword ptr [esi+8*ecx]
207351  kCycles for 1 * movaps xmm0, oword ptr [esi]
158671  kCycles for 1 * movdqa + movntdq
151322  kCycles for 1 * movdqu + movntdq
152135  kCycles for 1 * movdqu + movntdq + mfence

178037  kCycles for 1 * rep movsb
178500  kCycles for 1 * rep movsd
218100  kCycles for 1 * movlps qword ptr [esi+8*ecx]
202977  kCycles for 1 * movaps xmm0, oword ptr [esi]
152367  kCycles for 1 * movdqa + movntdq
151737  kCycles for 1 * movdqu + movntdq
151928  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 15, 2021, 07:25:30 AM
Thanks Steve N. for the tilted title idea :thumbsup:
                                                      AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
                                                              AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
                                                                      Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
                                                                              Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
                                                                                      Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
                                                                                              Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
                                                                                                      Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
                                                                                                              11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

kCycles for 1 * rep movsd                              238384   83453  237663  201928  241232  179930  215571   86590
kCycles for 1 * movlps qword ptr [esi+8*ecx]           269729  151305  287198  250473  304695  220533  227473  131239
kCycles for 1 * movaps xmm0, oword ptr [esi]           236182  140797  279630  245765  302014  202840  230151  119079
kCycles for 1 * movdqa + movntdq                       156614   85881  181083  182061  208018  191639  146063   99788
kCycles for 1 * movdqu + movntdq                       156042   87420  181125  187806  207876  184883  146242   89096
kCycles for 1 * movdqu + movntdq + mfence              156594   85314  180909  197382  207752  170484  146090   92749
Excel .prn file looks better
                                         AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
                                                 AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
                                                         Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
                                                                 Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
                                                                         Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
                                                                                 Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
                                                                                         Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
                                                                                                 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

kCycles for 1 * rep movsd          83453    2,86    1,00    2,85    2,42    2,89    2,16    2,58    1,04
kCycles for 1 * movlps qword ptr  131239    2,06    1,15    2,19    1,91    2,32    1,68    1,73    1,00
kCycles for 1 * movaps xmm0, owo  119079    1,98    1,18    2,35    2,06    2,54    1,70    1,93    1,00
kCycles for 1 * movdqa + movntdq   85881    1,82    1,00    2,11    2,12    2,42    2,23    1,70    1,16
kCycles for 1 * movdqu + movntdq   87420    1,78    1,00    2,07    2,15    2,38    2,11    1,67    1,02
kCycles for 1 * movdqu + movntdq   85314    1,84    1,00    2,12    2,31    2,44    2,00    1,71    1,09


Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    sTab = Chr(9)
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            'Debug.Print .Cells(Row, Col)
            Row = Row + 1
            Col = Col + 1
        Wend
        .Cells(Row, Col).EntireRow.Insert
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    nRow = 0    ' start
    Open sFileName For Input As #nFileNro
    Line Input #nFileNro, sTextRow
    ActiveSheet.Cells(Row, Col) = sTextRow   ' title
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        Row = Row + 1   ' table
        nRow = nRow + 1 ' file
        If nRow >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9))    ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then ActiveSheet.Cells(Row, Col) = "??" Else nClk = Val(sTextRow)
                'Debug.Print nClk
                If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If nRow = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub

EDIT: RepMovsd05GB_Results_1.csv
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 15, 2021, 04:16:16 PM
 :biggrin:

Hi Timo,

I cheated, I downloaded Libre Office, set the defaults for Excel and Word and BINGO, I can now open your "csv" files.  :thumbsup:
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 15, 2021, 08:16:18 PM
Quote from: daydreamer on December 15, 2021, 05:23:23 AM
Jochen if its cachesize dependent,wouldnt it be good to use some API that shows cachesize?

Good idea :thumbsup: Which one? Do you have some code?
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 15, 2021, 08:32:35 PM
If we just collect L3 info
How to Check Processor Cache Memory in Windows 10 (https://www.techbout.com/check-processor-cache-memory-windows-10-48655/)
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
L3 size (MB):  4  64  3  4  3  3  15  24
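
For a quick look without writing any code, the wmic query that appears later in the thread works from any command prompt:

    wmic cpu get L2CacheSize, L3CacheSize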
Title: Re: Unaligned memory copy test piece.
Post by: hutch-- on December 15, 2021, 10:48:36 PM
Something I have just tested is that if you use movdqu for both read and write to make the copy algo fully unaligned, it is no faster than rep movsb.
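
A sketch of that fully unaligned variant (registers assumed as in the earlier sketches):

      unacopy:
        movdqu xmm0, [esi+edx]      ; unaligned load
        movdqu [edi+edx], xmm0      ; unaligned store, no alignment requirement and no streaming hint
        add edx, 16
        cmp edx, ecx
        jb unacopy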
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 16, 2021, 12:03:13 AM
Quote from: hutch-- on December 15, 2021, 10:48:36 PM
Something I have just tested is that if you use movdqu for both read and write to make the copy algo fully unaligned, it is no faster than rep movsb.

movdqu was an attempt to speed up unaligned moves, but afaik they merged it with movups; and the latter is no longer slower than movaps. Three equivalent instructions.
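
Side by side, the three instructions being treated as equivalent here:

    movaps xmm0, [esi]          ; aligned FP load, faults if esi is not 16-byte aligned
    movups xmm0, [esi]          ; unaligned FP load
    movdqu xmm0, [esi]          ; unaligned integer load

On older cores the unaligned forms carried a penalty even on aligned addresses; on recent CPUs they do not, which is what makes the three interchangeable for copying.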
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 16, 2021, 12:16:32 AM
Quote from: hutch-- on December 15, 2021, 04:16:16 PM
:biggrin:

Hi Timo,

I cheated, I downloaded Libre Office, set the defaults for Excel and Word and BINGO, I can now open your "csv" files.  :thumbsup:
A fix for the Libre Office VBA:
Option VBASupport 1
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    sTab = Chr(9)
    With ActiveSheet
        While .Cells(Row, Col).Text <> ""
            'Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        '.AllowMultiSelect = False
        '.Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0    ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow   ' title
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9))    ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then ActiveSheet.Cells(Row, Col) = "??" Else nClk = Val(sTextRow)
                'Debug.Print nClk
                If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub

Question: what format do users want for the tables?
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 16, 2021, 01:17:53 AM
Quote from: TimoVJL on December 15, 2021, 08:32:35 PM
If we just collect L3 info

include \masm32\MasmBasic\MasmBasic.inc
CACHE_DESCRIPTOR STRUCT
Level      BYTE ?
Associativity BYTE ?
LineSize  WORD ?
_Size      DWORD ?
_Type      dd ? ; PROCESSOR_CACHE_TYPE enum
CACHE_DESCRIPTOR ENDS

SYSTEM_LOGICAL_PROCESSOR_INFORMATION STRUCT
NodeNumber DWORD ?
Cache      CACHE_DESCRIPTOR <>
Reserved  ULONGLONG 2 dup(?)
SYSTEM_LOGICAL_PROCESSOR_INFORMATION ENDS

  Init
  Print cfm$("\n\nWmic:\n"), Launch$("wmic cpu get L2CacheSize, L3CacheSize") ; the simple solution
  Dll "Kernel32"
  Declare void GetLogicalProcessorInformation, 2
  Let edi=New$(400)
  ClearLastError
  push 400
  GetLogicalProcessorInformation(edi, esp)
  pop edx
  pinfo equ [edi.SYSTEM_LOGICAL_PROCESSOR_INFORMATION]
  deb 1, "GetLogicalProcessorInformation output:", eax, pinfo.NodeNumber, b:pinfo.Cache.Level, b:pinfo.Cache.LineSize, b:pinfo.Cache._Size, b:pinfo.Cache._Type, $Err$()
EndOfCode


Output:
Wmic:
L2CacheSize  L3CacheSize
256          3072


GetLogicalProcessorInformation output:
eax             1
pinfo.NodeNumber        3
b:pinfo.Cache.Level     00000000
b:pinfo.Cache.LineSize  0000000000000000
b:pinfo.Cache._Size     00000000000000000000000000000001
b:pinfo.Cache._Type     00000000000000000000000000000000
$Err$()         Operazione completata.__


WMIC works, but GetLogicalProcessorInformation returns rubbish :sad:
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 16, 2021, 02:05:04 AM
Quote from: jj2007 on December 16, 2021, 01:17:53 AM
WMIC works, but GetLogicalProcessorInformation returns rubbish :sad:

---------------------------
GetLogicalProcessorInformation output:
---------------------------
eax 0

pinfo.NodeNumber 0

b:pinfo.Cache.Level 00000000

b:pinfo.Cache.LineSize 0000000000000000

b:pinfo.Cache._Size 00000000000000000000000000000000

b:pinfo.Cache._Type 00000000000000000000000000000000

$Err$() The data area passed to a system call is too small.__


---------------------------
OK   Cancel   
---------------------------
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 16, 2021, 03:57:49 AM
I've seen this error, but 400 bytes works on Win7-64. Try version 2, with a 1000-byte buffer
Title: Re: Unaligned memory copy test piece.
Post by: LiaoMi on December 16, 2021, 06:13:06 AM
Quote from: jj2007 on December 16, 2021, 03:57:49 AM
I've seen this error, but 400 bytes works on Win7-64. Try version 2, with a 1000-byte buffer

The program closes right after it starts; under the debugger the message remains the same.
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 16, 2021, 11:08:46 AM
It works on my new machine, Win10:

Wmic:
L2CacheSize  L3CacheSize
1024         4096


GetLogicalProcessorInformation output:
eax             1
pinfo.NodeNumber        3
b:pinfo.Cache.Level     00000000
b:pinfo.Cache.LineSize  0000000000000000
b:pinfo.Cache._Size     00000000000000000000000000000001
b:pinfo.Cache._Type     00000000000000000000000000000000
$Err$()         Operazione completata.__
Title: Re: Unaligned memory copy test piece.
Post by: HSE on December 16, 2021, 11:26:56 AM
Hi JJ!

Could it be a problem with WoW64?

What happens with the program in 64 bits?

LATER:
Apparently the structure is different:
typedef struct _SYSTEM_LOGICAL_PROCESSOR_INFORMATION {
  ULONG_PTR                      ProcessorMask;
  LOGICAL_PROCESSOR_RELATIONSHIP Relationship;
  union {
    struct {
      BYTE Flags;
    } ProcessorCore;
    struct {
      DWORD NodeNumber;
    } NumaNode;
    CACHE_DESCRIPTOR Cache;
    ULONGLONG        Reserved[2];
  } DUMMYUNIONNAME;
} SYSTEM_LOGICAL_PROCESSOR_INFORMATION, *PSYSTEM_LOGICAL_PROCESSOR_INFORMATION;
Title: Re: Unaligned memory copy test piece.
Post by: jj2007 on December 16, 2021, 12:40:02 PM
Quote from: HSE on December 16, 2021, 11:26:56 AM
Can be a problem with wow64?

It does not crash on my Win7-64 and Win10 machines.

QuoteApparently structure is different:

I'm afraid my C skills are not sufficient to translate it correctly to MASM syntax :sad:
Title: Re: Unaligned memory copy test piece.
Post by: HSE on December 16, 2021, 12:42:38 PM
Quote from: jj2007 on December 16, 2021, 12:40:02 PM
I'm afraid my C skills are not sufficient to translate it correctly to MASM syntax :sad:

:biggrin: :biggrin: I hope somebody can.
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 16, 2021, 01:27:53 PM
Modified code from here:
https://docs.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformation
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL3CacheSize: 3145728

GetLogicalProcessorInformation results:
Number of NUMA nodes: 1
Number of physical processor packages: 1
Number of processor cores: 2
Number of logical processors: 4
Number of processor L1/L2/L3 caches: 4/2/1

EDIT:

32-bit
SYSTEM_LOGICAL_PROCESSOR_INFORMATION 24 18h bytes
Relationship     +4h 4h
Cache.Level      +8h 1h
Cache.Size       +Ch 4h
64-bit
SYSTEM_LOGICAL_PROCESSOR_INFORMATION 32 20h bytes
Relationship     +8h 4h
Cache.Level      +10h 1h
Cache.Size       +14h 4h
Title: Re: Unaligned memory copy test piece.
Post by: HSE on December 16, 2021, 10:41:15 PM
Thanks Timo  :thumbsup:

processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 262144
processorL3CacheSize: 6291456

GetLogicalProcessorInformation results:
Number of NUMA nodes: 1
Number of physical processor packages: 1
Number of processor cores: 4
Number of logical processors: 8
Number of processor L1/L2/L3 caches: 8/4/1


Biterider's structure translation is:
  SYSTEM_LOGICAL_PROCESSOR_INFORMATION struct
    ProcessorMask ULONG_PTR ?
    Relationship LOGICAL_PROCESSOR_RELATIONSHIP ?
    union
      struct ProcessorCore
        Flags BYTE ?
      ends
      struct NumaNode
        NodeNumber DWORD ?
      ends
      Cache CACHE_DESCRIPTOR <>
      Reserved ULONGLONG 2 dup (?)
    ends
  SYSTEM_LOGICAL_PROCESSOR_INFORMATION ends
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 21, 2021, 12:53:32 AM
@Greenhorn, please give AMD Ryzen 7 3700X results.

These are interesting:
AMD Ryzen™ 5 5600G L3 16 MB
AMD Ryzen™ 7 5700G L3 16 MB

Title: Re: Unaligned memory copy test piece.
Post by: Greenhorn on December 21, 2021, 10:07:26 AM
AMD Ryzen 7 3700X

processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432
processorL1CacheSize: 32768
processorL1CacheSize: 32768
processorL2CacheSize: 524288
processorL3CacheSize: 33554432

GetLogicalProcessorInformation results:
Number of NUMA nodes: 1
Number of physical processor packages: 1
Number of processor cores: 8
Number of logical processors: 16
Number of processor L1/L2/L3 caches: 32/16/16
Title: Re: Unaligned memory copy test piece.
Post by: TimoVJL on December 21, 2021, 10:12:14 AM
I was after this test
http://masm32.com/board/index.php?topic=9691.msg106349#msg106349

EDIT:
                                                       AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
                                                          AMD Ryzen 7 3700X 8-Core Processor              (SSE4)
                                                                  AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
                                                                          Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
                                                                                  Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
                                                                                          Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
                                                                                                  Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
                                                                                                          Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
                                                                                                                  11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

kCycles for 1 * rep movsd                   83453    2,86    2,32    1,00    2,85    2,42    2,89    2,16    2,58    1,04
kCycles for 1 * movlps qword ptr [esi+8*e  131239    2,06    1,31    1,15    2,19    1,91    2,32    1,68    1,73    1,00
kCycles for 1 * movaps xmm0, oword ptr [e  119079    1,98    1,40    1,18    2,35    2,06    2,54    1,70    1,93    1,00
kCycles for 1 * movdqa + movntdq            85881    1,82    1,17    1,00    2,11    2,12    2,42    2,23    1,70    1,16
kCycles for 1 * movdqu + movntdq            87420    1,78    1,16    1,00    2,07    2,15    2,38    2,11    1,67    1,02
kCycles for 1 * movdqu + movntdq + mfence   85314    1,84    1,18    1,00    2,12    2,31    2,44    2,00    1,71    1,09
Title: Re: Unaligned memory copy test piece.
Post by: Greenhorn on December 21, 2021, 10:19:31 AM
Ah, OK ...

AMD Ryzen 7 3700X 8-Core Processor              (SSE4)
++++++++-++9 of 20 tests valid,
252562 kCycles for 1 * rep movsb
193705 kCycles for 1 * rep movsd
172087 kCycles for 1 * movlps qword ptr [esi+8*ecx]
167271 kCycles for 1 * movaps xmm0, oword ptr [esi]
100892 kCycles for 1 * movdqa + movntdq
101051 kCycles for 1 * movdqu + movntdq
100887 kCycles for 1 * movdqu + movntdq + mfence

191904 kCycles for 1 * rep movsb
192633 kCycles for 1 * rep movsd
171663 kCycles for 1 * movlps qword ptr [esi+8*ecx]
163020 kCycles for 1 * movaps xmm0, oword ptr [esi]
100933 kCycles for 1 * movdqa + movntdq
100811 kCycles for 1 * movdqu + movntdq
101287 kCycles for 1 * movdqu + movntdq + mfence

193081 kCycles for 1 * rep movsb
192589 kCycles for 1 * rep movsd
171437 kCycles for 1 * movlps qword ptr [esi+8*ecx]
163055 kCycles for 1 * movaps xmm0, oword ptr [esi]
100982 kCycles for 1 * movdqa + movntdq
100927 kCycles for 1 * movdqu + movntdq
101013 kCycles for 1 * movdqu + movntdq + mfence

191832 kCycles for 1 * rep movsb
192769 kCycles for 1 * rep movsd
171349 kCycles for 1 * movlps qword ptr [esi+8*ecx]
163022 kCycles for 1 * movaps xmm0, oword ptr [esi]
100896 kCycles for 1 * movdqa + movntdq
100825 kCycles for 1 * movdqu + movntdq
101005 kCycles for 1 * movdqu + movntdq + mfence

21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence


--- ok ---