Unaligned memory copy test piece.

Started by hutch--, December 06, 2021, 08:34:55 PM

daydreamer

It would be interesting to test the smaller SSE movups/movaps encodings, because if they run at the same speed as the SSE2 moves, more instructions fit in the cache.
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

LiaoMi

Quote from: hutch-- on December 06, 2021, 08:34:55 PM
I have a task where the memory copy cannot be controlled to SSE alignment.

The example has two memory copy techniques, the old rep movsb method as reference and the following for unaligned SSE.

    movdqu xmm0, [rcx+r10]
    movntdq [rdx+r10], xmm0

I have stabilised the timings by running a dummy run before the timed run, and on my old Haswell the unaligned SSE version runs in about 4.7 seconds for a 50 gig copy. As a reference, the rep movsb version runs in about 6.7 seconds for the same 50 gig.

I have not run the two tests together, so that one does not affect the other. If you have time, run the SSE version, then switch to the commented-out rep movsb version.

Hi Hutch,

i7-11800h
--------------------------------
50 gig copy in 3500 milliseconds
--------------------------------
Press any key to continue...

rep movsd
--------------------------------
50 gig copy in 3750 milliseconds
--------------------------------
Press any key to continue...
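
For readers following along in C, the two-instruction loop quoted above (movdqu load, movntdq store) can be sketched with SSE2 intrinsics. This is not hutch--'s test piece: the function name copy_unaligned_src_nt is hypothetical, and the sketch assumes that only the source is unaligned, since movntdq requires a 16-byte aligned store address. On builds without SSE2 it falls back to a plain memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: copy n bytes from a source of any alignment to a
   16-byte aligned destination, 16 bytes per iteration. */
static void copy_unaligned_src_nt(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    /* movntdq faults unless the store address is 16-byte aligned. */
    assert(((uintptr_t)d & 15u) == 0);

#ifdef HAVE_SSE2
    for (; i + 16 <= n; i += 16) {
        /* Unaligned load; typically compiles to movdqu. */
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        /* Non-temporal store; typically compiles to movntdq. */
        _mm_stream_si128((__m128i *)(d + i), v);
    }
    _mm_sfence();   /* order the streaming stores before later stores */
#endif
    /* Tail shorter than 16 bytes (or everything, on non-SSE2 builds). */
    memcpy(d + i, s + i, n - i);
}
```

The non-temporal store bypasses the cache, which tends to help on copies far larger than the cache (such as the 50 gig run here) and can hurt on small buffers that are read again soon afterwards.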


jj2007

Quote from: hutch-- on December 07, 2021, 11:27:06 AM
"rep movsb" is usually faster than "rep movsd", which seems to be Intel special-case circuitry

Not on my machine, at least with 32-bit code...

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

64674   cycles for 100 * rep movsb
64365   cycles for 100 * rep movsd
206043  cycles for 100 * movlps qword ptr [esi+8*ecx]
122243  cycles for 100 * movaps xmm0, oword ptr [esi]
195049  cycles for 100 * movntdq xmm0, oword ptr [esi]

65058   cycles for 100 * rep movsb
63966   cycles for 100 * rep movsd
206036  cycles for 100 * movlps qword ptr [esi+8*ecx]
122348  cycles for 100 * movaps xmm0, oword ptr [esi]
193151  cycles for 100 * movntdq xmm0, oword ptr [esi]

65376   cycles for 100 * rep movsb
64353   cycles for 100 * rep movsd
206087  cycles for 100 * movlps qword ptr [esi+8*ecx]
122278  cycles for 100 * movaps xmm0, oword ptr [esi]
193349  cycles for 100 * movntdq xmm0, oword ptr [esi]

65125   cycles for 100 * rep movsb
63895   cycles for 100 * rep movsd
205977  cycles for 100 * movlps qword ptr [esi+8*ecx]
121872  cycles for 100 * movaps xmm0, oword ptr [esi]
193156  cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]

hutch--

I get much the same on this old Haswell. I have usually used a combination of rep movsd followed by rep movsb, but a single rep movsb is close enough to the speed of the rep movsd version and does not suffer from the switch from DWORD to BYTE.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

64649   cycles for 100 * rep movsb
64474   cycles for 100 * rep movsd
158890  cycles for 100 * movlps qword ptr [esi+8*ecx]
80167   cycles for 100 * movaps xmm0, oword ptr [esi]

63923   cycles for 100 * rep movsb
65043   cycles for 100 * rep movsd
158972  cycles for 100 * movlps qword ptr [esi+8*ecx]
82787   cycles for 100 * movaps xmm0, oword ptr [esi]

65293   cycles for 100 * rep movsb
66105   cycles for 100 * rep movsd
158830  cycles for 100 * movlps qword ptr [esi+8*ecx]
81359   cycles for 100 * movaps xmm0, oword ptr [esi]

64850   cycles for 100 * rep movsb
66085   cycles for 100 * rep movsd
159823  cycles for 100 * movlps qword ptr [esi+8*ecx]
81326   cycles for 100 * movaps xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
28      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]


--- ok ---

hutch--

Thanks LiaoMi, an interesting result in that the two are much closer than on earlier hardware. Looks like a nice fast box.

jj2007

New machine. I added movntdq

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

57654   cycles for 100 * rep movsb
65892   cycles for 100 * rep movsd
112954  cycles for 100 * movlps qword ptr [esi+8*ecx]
58152   cycles for 100 * movaps xmm0, oword ptr [esi]
129800  cycles for 100 * movntdq xmm0, oword ptr [esi]

59723   cycles for 100 * rep movsb
59356   cycles for 100 * rep movsd
113875  cycles for 100 * movlps qword ptr [esi+8*ecx]
57518   cycles for 100 * movaps xmm0, oword ptr [esi]
130509  cycles for 100 * movntdq xmm0, oword ptr [esi]

59061   cycles for 100 * rep movsb
63768   cycles for 100 * rep movsd
112908  cycles for 100 * movlps qword ptr [esi+8*ecx]
57839   cycles for 100 * movaps xmm0, oword ptr [esi]
132310  cycles for 100 * movntdq xmm0, oword ptr [esi]

59031   cycles for 100 * rep movsb
58619   cycles for 100 * rep movsd
129052  cycles for 100 * movlps qword ptr [esi+8*ecx]
57675   cycles for 100 * movaps xmm0, oword ptr [esi]
131438  cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

62494   cycles for 100 * rep movsb
63721   cycles for 100 * rep movsd
120658  cycles for 100 * movlps qword ptr [esi+8*ecx]
58538   cycles for 100 * movaps xmm0, oword ptr [esi]
129730  cycles for 100 * movntdq xmm0, oword ptr [esi]

63098   cycles for 100 * rep movsb
63117   cycles for 100 * rep movsd
119950  cycles for 100 * movlps qword ptr [esi+8*ecx]
58756   cycles for 100 * movaps xmm0, oword ptr [esi]
129155  cycles for 100 * movntdq xmm0, oword ptr [esi]

62565   cycles for 100 * rep movsb
62759   cycles for 100 * rep movsd
119914  cycles for 100 * movlps qword ptr [esi+8*ecx]
57471   cycles for 100 * movaps xmm0, oword ptr [esi]
126619  cycles for 100 * movntdq xmm0, oword ptr [esi]

63080   cycles for 100 * rep movsb
62948   cycles for 100 * rep movsd
119847  cycles for 100 * movlps qword ptr [esi+8*ecx]
57513   cycles for 100 * movaps xmm0, oword ptr [esi]
123895  cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]
May the source be with you

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
umc.exe
--------------------------------
50 gig copy in 4453 milliseconds
--------------------------------
umcmovsb.exe
--------------------------------
50 gig copy in 6469 milliseconds
--------------------------------
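
To compare these runs across machines it can help to turn each "50 gig copy in N milliseconds" line into a rate. A minimal sketch; gig_per_second is a hypothetical helper, and whether "gig" means GB or GiB depends on the test program, which this thread does not state:

```c
/* Hypothetical helper: copy rate in gig per second for a copy of
   `gig` units that took `milliseconds`. */
static double gig_per_second(double gig, double milliseconds)
{
    return gig * 1000.0 / milliseconds;
}
```

For the two figures above, gig_per_second(50, 4453) is about 11.2 and gig_per_second(50, 6469) is about 7.7, so the SSE version is roughly 1.45 times faster than rep movsb on this machine.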

daydreamer

--------------------------------
50 gig copy in 7547 milliseconds
--------------------------------
Press any key to continue...

I have 20 GB of memory installed, turbo 3.1 GHz
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

31040   cycles for 100 * rep movsb
31143   cycles for 100 * rep movsd
117139  cycles for 100 * movlps qword ptr [esi+8*ecx]
72688   cycles for 100 * movaps xmm0, oword ptr [esi]
109706  cycles for 100 * movntdq xmm0, oword ptr [esi]

31381   cycles for 100 * rep movsb
31663   cycles for 100 * rep movsd
116001  cycles for 100 * movlps qword ptr [esi+8*ecx]
71727   cycles for 100 * movaps xmm0, oword ptr [esi]
110933  cycles for 100 * movntdq xmm0, oword ptr [esi]

31644   cycles for 100 * rep movsb
37560   cycles for 100 * rep movsd
114454  cycles for 100 * movlps qword ptr [esi+8*ecx]
72541   cycles for 100 * movaps xmm0, oword ptr [esi]
124899  cycles for 100 * movntdq xmm0, oword ptr [esi]

31097   cycles for 100 * rep movsb
31010   cycles for 100 * rep movsd
115056  cycles for 100 * movlps qword ptr [esi+8*ecx]
72463   cycles for 100 * movaps xmm0, oword ptr [esi]
109951  cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]



LiaoMi

Quote from: jj2007 on December 07, 2021, 09:05:54 PM
New machine. I added movntdq

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

13916   cycles for 100 * rep movsb
16152   cycles for 100 * rep movsd
106432  cycles for 100 * movlps qword ptr [esi+8*ecx]
42050   cycles for 100 * movaps xmm0, oword ptr [esi]
59500   cycles for 100 * movntdq xmm0, oword ptr [esi]

16298   cycles for 100 * rep movsb
15607   cycles for 100 * rep movsd
109919  cycles for 100 * movlps qword ptr [esi+8*ecx]
41897   cycles for 100 * movaps xmm0, oword ptr [esi]
58949   cycles for 100 * movntdq xmm0, oword ptr [esi]

15691   cycles for 100 * rep movsb
16515   cycles for 100 * rep movsd
108793  cycles for 100 * movlps qword ptr [esi+8*ecx]
41640   cycles for 100 * movaps xmm0, oword ptr [esi]
101036  cycles for 100 * movntdq xmm0, oword ptr [esi]

17390   cycles for 100 * rep movsb
16209   cycles for 100 * rep movsd
117106  cycles for 100 * movlps qword ptr [esi+8*ecx]
42124   cycles for 100 * movaps xmm0, oword ptr [esi]
60058   cycles for 100 * movntdq xmm0, oword ptr [esi]

19      bytes for rep movsb
19      bytes for rep movsd
29      bytes for movlps qword ptr [esi+8*ecx]
34      bytes for movaps xmm0, oword ptr [esi]
35      bytes for movntdq xmm0, oword ptr [esi]


--- ok ---

jj2007

Quote from: LiaoMi on December 09, 2021, 05:10:05 AM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

13916   cycles for 100 * rep movsb
...
59500   cycles for 100 * movntdq xmm0, oword ptr [esi]

That looks odd, and I wondered whether my counts were correct. But I can't find an error... how can movntdq be so slow?

FORTRANS

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

35139 cycles for 100 * rep movsb
36189 cycles for 100 * rep movsd
161839 cycles for 100 * movlps qword ptr [esi+8*ecx]
82736 cycles for 100 * movaps xmm0, oword ptr [esi]
173215 cycles for 100 * movntdq xmm0, oword ptr [esi]

35248 cycles for 100 * rep movsb
36580 cycles for 100 * rep movsd
160325 cycles for 100 * movlps qword ptr [esi+8*ecx]
82958 cycles for 100 * movaps xmm0, oword ptr [esi]
174700 cycles for 100 * movntdq xmm0, oword ptr [esi]

35392 cycles for 100 * rep movsb
36231 cycles for 100 * rep movsd
160691 cycles for 100 * movlps qword ptr [esi+8*ecx]
83033 cycles for 100 * movaps xmm0, oword ptr [esi]
174148 cycles for 100 * movntdq xmm0, oword ptr [esi]

35310 cycles for 100 * rep movsb
36172 cycles for 100 * rep movsd
162454 cycles for 100 * movlps qword ptr [esi+8*ecx]
83124 cycles for 100 * movaps xmm0, oword ptr [esi]
173325 cycles for 100 * movntdq xmm0, oword ptr [esi]

19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]


--- ok ---


Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

22808 cycles for 100 * rep movsb
22940 cycles for 100 * rep movsd
82232 cycles for 100 * movlps qword ptr [esi+8*ecx]
55785 cycles for 100 * movaps xmm0, oword ptr [esi]
148033 cycles for 100 * movntdq xmm0, oword ptr [esi]

22471 cycles for 100 * rep movsb
22846 cycles for 100 * rep movsd
82406 cycles for 100 * movlps qword ptr [esi+8*ecx]
57255 cycles for 100 * movaps xmm0, oword ptr [esi]
151683 cycles for 100 * movntdq xmm0, oword ptr [esi]

22507 cycles for 100 * rep movsb
23157 cycles for 100 * rep movsd
82990 cycles for 100 * movlps qword ptr [esi+8*ecx]
55098 cycles for 100 * movaps xmm0, oword ptr [esi]
144060 cycles for 100 * movntdq xmm0, oword ptr [esi]

22462 cycles for 100 * rep movsb
22567 cycles for 100 * rep movsd
82398 cycles for 100 * movlps qword ptr [esi+8*ecx]
54862 cycles for 100 * movaps xmm0, oword ptr [esi]
142853 cycles for 100 * movntdq xmm0, oword ptr [esi]

19 bytes for rep movsb
19 bytes for rep movsd
29 bytes for movlps qword ptr [esi+8*ecx]
34 bytes for movaps xmm0, oword ptr [esi]
35 bytes for movntdq xmm0, oword ptr [esi]


--- ok ---

TimoVJL

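Columns, left to right:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)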

cycles for 100 * rep movsb                    57654   35139   22808   31040   13916
cycles for 100 * rep movsd                    65892   36189   22940   31143   16152
cycles for 100 * movlps qword ptr [esi+8*ec  112954  161839   82232  117139  106432
cycles for 100 * movaps xmm0, oword ptr [es   58152   82736   55785   72688   42050
cycles for 100 * movntdq xmm0, oword ptr [e  129800  173215  148033  109706   59500

cycles for 100 * rep movsb                    59723   35248   22471   31381   16298
cycles for 100 * rep movsd                    59356   36580   22846   31663   15607
cycles for 100 * movlps qword ptr [esi+8*ec  113875  160325   82406  116001  109919
cycles for 100 * movaps xmm0, oword ptr [es   57518   82958   57255   71727   41897
cycles for 100 * movntdq xmm0, oword ptr [e  130509  174700  151683  110933   58949

cycles for 100 * rep movsb                    59061   35392   22507   31644   15691
cycles for 100 * rep movsd                    63768   36231   23157   37560   16515
cycles for 100 * movlps qword ptr [esi+8*ec  112908  160691   82990  114454  108793
cycles for 100 * movaps xmm0, oword ptr [es   57839   83033   55098   72541   41640
cycles for 100 * movntdq xmm0, oword ptr [e  132310  174148  144060  124899  101036

cycles for 100 * rep movsb                    59031   35310   22462   31097   17390
cycles for 100 * rep movsd                    58619   36172   22567   31010   16209
cycles for 100 * movlps qword ptr [esi+8*ec  129052  162454   82398  115056  117106
cycles for 100 * movaps xmm0, oword ptr [es   57675   83124   54862   72463   42124
cycles for 100 * movntdq xmm0, oword ptr [e  131438  173325  142853  109951   60058

bytes for rep movsb                              19      19      19      19      19
bytes for rep movsd                              19      19      19      19      19
bytes for movlps qword ptr [esi+8*ecx]           29      29      29      29      29
bytes for movaps xmm0, oword ptr [esi]           34      34      34      34      34
bytes for movntdq xmm0, oword ptr [esi]          35      35      35      35      35
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    sTab = Chr(9)
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0    ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow   ' title
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9))    ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    ActiveSheet.Cells(Row, Col) = "??"   ' unknown timing, keep the marker
                Else
                    nClk = Val(sTextRow)   ' leading number is the cycle or byte count
                    If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                End If
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)   ' row label
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub

LiaoMi

Quote from: jj2007 on December 09, 2021, 06:53:58 AM
Quote from: LiaoMi on December 09, 2021, 05:10:05 AM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

13916   cycles for 100 * rep movsb
...
59500   cycles for 100 * movntdq xmm0, oword ptr [esi]

That looks odd, and I wondered whether my counts were correct. But I can't find an error... how can movntdq be so slow?

Hi jj2007,

I would also like to know the reason for such slowdowns.

Random slow downs with AVX2 code - https://community.intel.com/t5/Intel-ISA-Extensions/Random-slow-downs-with-AVX2-code/m-p/1084764
Depending on CPU, 10x is approximately the difference between L1 cache and L3 cache latency. Is your thread pinned?
Jim Dempsey


How L1 and L2 CPU Caches Work, and Why They're an Essential Part of Modern Chips - https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips

Reducing Memory Access Times with Caches - https://developers.redhat.com/blog/2016/03/01/reducing-memory-access-times-with-caches#

What is a "cache-friendly" code? - https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code

CS 201 Writing Cache-Friendly Code - Portland State University - http://web.cecs.pdx.edu/~jrb/cs201/lectures/cache.friendly.code.pdf

Very slow performance of VMOVNTDQ instruction - https://community.intel.com/t5/Intel-ISA-Extensions/Very-slow-performance-of-VMOVNTDQ-instruction/td-p/941697
Thanks for the link. The poor performance of AVX instructions intermixed with SSE is a well-known issue. Because the hardware must save and restore the upper context of the YMMn registers, it incurs a penalty of a few dozen cycles. AVX 128-bit instructions automatically zero the upper half of the YMM registers; this is not the case with legacy SSE instructions, because they have no "knowledge" of the wider 256-bit registers. You can use Intel SDE to detect the penalty of an AVX-to-SSE transition.

AVX transition penalties and OS support - https://community.intel.com/t5/Intel-ISA-Extensions/AVX-transition-penalties-and-OS-support/m-p/931977
Intel Avoiding AVX-SSE Transition Penalties - https://web.archive.org/web/20160409073240/software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

What Every Programmer Should Know About Memory - https://www.akkadia.org/drepper/cpumemory.pdf
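
The AVX-to-SSE transition penalty described in the quoted Intel forum reply is usually avoided by executing vzeroupper after 256-bit AVX code and before any legacy SSE code runs. A minimal C sketch of that pattern, assuming a build with AVX enabled (for example -mavx on GCC or Clang); sum_floats is a hypothetical name, and without AVX the function simply runs the scalar loop:

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>   /* AVX intrinsics */
#endif

/* Hypothetical example: sum n floats, 8 at a time when AVX is enabled. */
static float sum_floats(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    float lanes[8];
    int k;

    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    _mm256_storeu_ps(lanes, acc);
    for (k = 0; k < 8; ++k)
        total += lanes[k];

    /* vzeroupper: clear the upper YMM halves so that legacy SSE code
       called afterwards does not pay the transition penalty. */
    _mm256_zeroupper();
#endif
    for (; i < n; ++i)   /* remaining elements, or all of them without AVX */
        total += p[i];
    return total;
}
```

Recent GCC and Clang typically insert vzeroupper on their own in AVX builds, so the explicit call matters most for hand-written assembly or when calling into separately built SSE code.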

jj2007