Unaligned memory copy test piece.

Started by hutch--, December 06, 2021, 08:34:55 PM


hutch--



Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
++-++--++++9 of 20 tests valid,
242232  kCycles for 1 * rep movsb
215571  kCycles for 1 * rep movsd
227473  kCycles for 1 * movlps qword ptr [esi+8*ecx]
230151  kCycles for 1 * movaps xmm0, oword ptr [esi]
146063  kCycles for 1 * movdqa + movntdq
146242  kCycles for 1 * movdqu + movntdq
146090  kCycles for 1 * movdqu + movntdq + mfence

208745  kCycles for 1 * rep movsb
214643  kCycles for 1 * rep movsd
227536  kCycles for 1 * movlps qword ptr [esi+8*ecx]
230180  kCycles for 1 * movaps xmm0, oword ptr [esi]
146452  kCycles for 1 * movdqa + movntdq
146127  kCycles for 1 * movdqu + movntdq
146534  kCycles for 1 * movdqu + movntdq + mfence

208675  kCycles for 1 * rep movsb
213978  kCycles for 1 * rep movsd
227433  kCycles for 1 * movlps qword ptr [esi+8*ecx]
230045  kCycles for 1 * movaps xmm0, oword ptr [esi]
146253  kCycles for 1 * movdqa + movntdq
146041  kCycles for 1 * movdqu + movntdq
146361  kCycles for 1 * movdqu + movntdq + mfence

208670  kCycles for 1 * rep movsb
216371  kCycles for 1 * rep movsd
227677  kCycles for 1 * movlps qword ptr [esi+8*ecx]
229852  kCycles for 1 * movaps xmm0, oword ptr [esi]
146070  kCycles for 1 * movdqa + movntdq
146349  kCycles for 1 * movdqu + movntdq
146292  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


-


jj2007

Thanks to everybody :thup:

I see a pretty consistent pattern: rep movs* faster than movlps and movaps, but movntdq faster than rep movs*. Makes sense :thumbsup:
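
For anyone reading along without the attachment, here is roughly what the movdqu + movntdq variant does, written with C intrinsics rather than MASM. This is only a minimal sketch, not the code from the test piece: the function name copy_nt is made up, and it uses sfence where the third variant in the results uses mfence.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the forum test piece: copy n bytes using
   unaligned 16-byte loads (movdqu) and non-temporal stores (movntdq). */
static void copy_nt(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(dst != NULL && src != NULL);

    /* movntdq needs a 16-byte aligned destination, so copy a short
       head bytewise until d is aligned. */
    size_t head = (size_t)(-(uintptr_t)d & 15u);
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Main loop: the source may be unaligned, the destination is aligned. */
    while (n >= 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)s);  /* movdqu  */
        _mm_stream_si128((__m128i *)d, x);                /* movntdq */
        d += 16; s += 16; n -= 16;
    }

    /* Order the streaming stores before anything that follows. */
    _mm_sfence();

    /* Tail of fewer than 16 bytes. */
    memcpy(d, s, n);
}
```

The streaming store bypasses the cache on the way out, which is the usual explanation for it beating rep movs* on buffers far larger than the cache; on small buffers the ranking may well differ.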

FORTRANS

Hi,

   I tried it on two older machines.  One locked up ( kinda / sorta ),
one ran okay.

Genuine Intel(R) CPU           T2400  @ 1.83GHz (SSE3)
++++++++++++++6 of 20 tests valid,
916706 kCycles for 1 * rep movsb
822971 kCycles for 1 * rep movsd
1043451 kCycles for 1 * movlps qword ptr [esi+8*ecx]
904357 kCycles for 1 * movaps xmm0, oword ptr [esi]
708260 kCycles for 1 * movdqa + movntdq
709339 kCycles for 1 * movdqu + movntdq
683374 kCycles for 1 * movdqu + movntdq + mfence

825066 kCycles for 1 * rep movsb
834842 kCycles for 1 * rep movsd
1048805 kCycles for 1 * movlps qword ptr [esi+8*ecx]
904001 kCycles for 1 * movaps xmm0, oword ptr [esi]
677020 kCycles for 1 * movdqa + movntdq
684107 kCycles for 1 * movdqu + movntdq
683326 kCycles for 1 * movdqu + movntdq + mfence

820533 kCycles for 1 * rep movsb
820807 kCycles for 1 * rep movsd
1033703 kCycles for 1 * movlps qword ptr [esi+8*ecx]
899210 kCycles for 1 * movaps xmm0, oword ptr [esi]
677950 kCycles for 1 * movdqa + movntdq
685388 kCycles for 1 * movdqu + movntdq
682889 kCycles for 1 * movdqu + movntdq + mfence

819905 kCycles for 1 * rep movsb
820346 kCycles for 1 * rep movsd
1033769 kCycles for 1 * movlps qword ptr [esi+8*ecx]
899579 kCycles for 1 * movaps xmm0, oword ptr [esi]
676660 kCycles for 1 * movdqa + movntdq
685048 kCycles for 1 * movdqu + movntdq
682723 kCycles for 1 * movdqu + movntdq + mfence

21 bytes for rep movsb
21 bytes for rep movsd
37 bytes for movlps qword ptr [esi+8*ecx]
42 bytes for movaps xmm0, oword ptr [esi]
44 bytes for movdqa + movntdq
44 bytes for movdqu + movntdq
47 bytes for movdqu + movntdq + mfence


--- ok ---


Steve N.

TimoVJL

#63

Columns, left to right:
1. AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
2. AMD Ryzen 9 5950X 16-Core Processor (SSE4)
3. Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
4. Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
5. Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6. Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
7. 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
8. Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)

kCycles for 1 * rep movsb                    261718  103297  352139  241261  315842  242232  111187  916706
kCycles for 1 * rep movsd                    238384   83453  237663  201928  241232  215571   86590  822971
kCycles for 1 * movlps qword ptr [esi+8*ecx] 269729  151305  287198  250473  304695  227473  131239 1043451
kCycles for 1 * movaps xmm0, oword ptr [esi] 236182  140797  279630  245765  302014  230151  119079  904357
kCycles for 1 * movdqa + movntdq             156614   85881  181083  182061  208018  146063   99788  708260
kCycles for 1 * movdqu + movntdq             156042   87420  181125  187806  207876  146242   89096  709339
kCycles for 1 * movdqu + movntdq + mfence    156594   85314  180909  197382  207752  146090   92749  683374

kCycles for 1 * rep movsb                    236577   82482  193980  224850  249181  208745   99740  825066
kCycles for 1 * rep movsd                    236369   81107  214331  202630  239809  214643   92438  834842
kCycles for 1 * movlps qword ptr [esi+8*ecx] 270908  148720  278425  234536  304868  227536  119977 1048805
kCycles for 1 * movaps xmm0, oword ptr [esi] 236626  140735  274866  228211  301253  230180  111659  904001
kCycles for 1 * movdqa + movntdq             156354   87417  179814  191152  207931  146452   76366  677020
kCycles for 1 * movdqu + movntdq             155998   85541  179520  188628  208272  146127   79162  684107
kCycles for 1 * movdqu + movntdq + mfence    156243   86765  179469  185565  207503  146534   77279  683326

kCycles for 1 * rep movsb                    235567   83181  192139  206426  249727  208675   89597  820533
kCycles for 1 * rep movsd                    236734   81348  210664  206008  241799  213978   85665  820807
kCycles for 1 * movlps qword ptr [esi+8*ecx] 276802  149105  279349  233301  303516  227433  125051 1033703
kCycles for 1 * movaps xmm0, oword ptr [esi] 236012  141740  274878  229024  301728  230045  111892  899210
kCycles for 1 * movdqa + movntdq             156233   85743  180115  181524  207608  146253   76149  677950
kCycles for 1 * movdqu + movntdq             156445   86256  179785  198103  208094  146041   76483  685388
kCycles for 1 * movdqu + movntdq + mfence    156944   87608  179769  177373  208854  146361   76167  682889

kCycles for 1 * rep movsb                    237039   81210  191809  199886  248574  208670   86964  819905
kCycles for 1 * rep movsd                    238233   81663  216878  200050  240836  216371   85324  820346
kCycles for 1 * movlps qword ptr [esi+8*ecx] 270702  150942  277053  233793  304675  227677  121596 1033769
kCycles for 1 * movaps xmm0, oword ptr [esi] 236677  140339  277263  228413  301674  229852  111769  899579
kCycles for 1 * movdqa + movntdq             155610   86239  178899  177392  208379  146070   76968  676660
kCycles for 1 * movdqu + movntdq             156935   87931  179083  175842  207882  146349   75970  685048
kCycles for 1 * movdqu + movntdq + mfence    156013   85408  178847  175220  207742  146292   75677  682723
May the source be with you

jj2007

#64
Wow, you put a lot of work into this one, Timo :thumbsup:
Do you have it as a tab-delimited or csv file?

Thanks, Timo, here are the averages - extremely consistent:

All 9 CPUs
276 rep movsb
264 rep movsd
330 movlps qword ptr [esi+8*ecx]
304 movaps xmm0, oword ptr [esi]
216 movdqa + movntdq
217 movdqu + movntdq
216 movdqu + movntdq + mfence


AMD only
165 rep movsb
160 rep movsd
211 movlps qword ptr [esi+8*ecx]
189 movaps xmm0, oword ptr [esi]
121 movdqa + movntdq
122 movdqu + movntdq
121 movdqu + movntdq + mfence

FORTRANS

Hi,

   Using Timo's data, I dropped the first value in each series of runs
(they looked about 10% too large), averaged the rest, and normalized
the averages for each processor so that the fastest method reads 1.00.


                      MASM Forum example     AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
                                                   AMD Ryzen 9 5950X 16-Core Processor (SSE4)
                                                         Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
                                                               Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
                                                                     Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
                                                                           Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
                                                                                 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
                                                                                       Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)

kCycles for 1 * rep movsb                    1.51  1.00  1.07  1.15  1.20  1.43  1.15  1.20
kCycles for 1 * rep movsd                    1.52  1.00  1.22  1.11  1.16  1.47  1.09  1.21
kCycles for 1 * movlps qword ptr [esi+8*ecx] 1.74  1.83  1.56  1.30  1.46  1.56  1.55  1.52
kCycles for 1 * movaps xmm0, oword ptr [esi] 1.51  1.72  1.54  1.27  1.45  1.57  1.42  1.32
kCycles for 1 * movdqa + movntdq             1.00  1.05  1.00  1.00  1.00  1.00  1.03  1.00
kCycles for 1 * movdqu + movntdq             1.00  1.06  1.00  1.02  1.00  1.00  1.00  1.01
kCycles for 1 * movdqu + movntdq + mfence    1.00  1.05  1.00  1.00  1.00  1.00  1.00  1.00


Cheers,

Steve N.
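
The normalization above can be sketched in C as follows, assuming it works by dropping the first run, averaging the remaining three, and dividing by the fastest average for that CPU. The names normalize_cpu and i7_5820k are made up for illustration; the numbers are the i7-5820K figures posted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define RUNS    4   /* runs per method in the posted output */
#define METHODS 7   /* copy variants in the test piece */

/* Hypothetical helper: for one CPU, skip run 0 (the outlier), average
   the remaining runs per method, then divide every average by the
   smallest one so that the fastest method reads 1.00. */
static void normalize_cpu(const double kcycles[METHODS][RUNS],
                          double out[METHODS])
{
    double best = 0.0;

    assert(kcycles != NULL && out != NULL);

    for (size_t m = 0; m < METHODS; m++) {
        double sum = 0.0;
        for (size_t r = 1; r < RUNS; r++)   /* skip the first run */
            sum += kcycles[m][r];
        out[m] = sum / (RUNS - 1);
        if (m == 0 || out[m] < best)
            best = out[m];
    }
    for (size_t m = 0; m < METHODS; m++)
        out[m] /= best;
}

/* i7-5820K kCycles copied from the thread, one row per method. */
static const double i7_5820k[METHODS][RUNS] = {
    {242232, 208745, 208675, 208670},  /* rep movsb */
    {215571, 214643, 213978, 216371},  /* rep movsd */
    {227473, 227536, 227433, 227677},  /* movlps qword ptr [esi+8*ecx] */
    {230151, 230180, 230045, 229852},  /* movaps xmm0, oword ptr [esi] */
    {146063, 146452, 146253, 146070},  /* movdqa + movntdq */
    {146242, 146127, 146041, 146349},  /* movdqu + movntdq */
    {146090, 146534, 146361, 146292},  /* movdqu + movntdq + mfence */
};
```

Calling normalize_cpu(i7_5820k, out) gives values that round to 1.43, 1.47, 1.56, 1.57, 1.00, 1.00, 1.00, which matches the i7-5820K column in the table.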

daydreamer

What does "invalid" mean?
Jochen, if it's cache-size dependent, wouldn't it be good to use some API that shows the cache size?
A Celeron has a smaller cache, a Xeon a bigger one, and newer-generation CPUs with many cores have bigger caches.
It reminds me of that movie where you are called "invalid" if you are only naturally born, as opposed to genetically improved.
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
+++++++-+11 of 20 tests valid,
212547  kCycles for 1 * rep movsb
179930  kCycles for 1 * rep movsd
220533  kCycles for 1 * movlps qword ptr [esi+8*ecx]
202840  kCycles for 1 * movaps xmm0, oword ptr [esi]
191639  kCycles for 1 * movdqa + movntdq
184883  kCycles for 1 * movdqu + movntdq
170484  kCycles for 1 * movdqu + movntdq + mfence

178210  kCycles for 1 * rep movsb
178902  kCycles for 1 * rep movsd
218612  kCycles for 1 * movlps qword ptr [esi+8*ecx]
202293  kCycles for 1 * movaps xmm0, oword ptr [esi]
152501  kCycles for 1 * movdqa + movntdq
151305  kCycles for 1 * movdqu + movntdq
151683  kCycles for 1 * movdqu + movntdq + mfence

178193  kCycles for 1 * rep movsb
178480  kCycles for 1 * rep movsd
219341  kCycles for 1 * movlps qword ptr [esi+8*ecx]
207351  kCycles for 1 * movaps xmm0, oword ptr [esi]
158671  kCycles for 1 * movdqa + movntdq
151322  kCycles for 1 * movdqu + movntdq
152135  kCycles for 1 * movdqu + movntdq + mfence

178037  kCycles for 1 * rep movsb
178500  kCycles for 1 * rep movsd
218100  kCycles for 1 * movlps qword ptr [esi+8*ecx]
202977  kCycles for 1 * movaps xmm0, oword ptr [esi]
152367  kCycles for 1 * movdqa + movntdq
151737  kCycles for 1 * movdqu + movntdq
151928  kCycles for 1 * movdqu + movntdq + mfence

21      bytes for rep movsb
21      bytes for rep movsd
37      bytes for movlps qword ptr [esi+8*ecx]
42      bytes for movaps xmm0, oword ptr [esi]
44      bytes for movdqa + movntdq
44      bytes for movdqu + movntdq
47      bytes for movdqu + movntdq + mfence


-
my non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

TimoVJL

#67
Thanks, Steve N., for the tiled title idea :thumbsup:
                                                      AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
                                                              AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
                                                                      Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
                                                                              Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
                                                                                      Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
                                                                                              Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
                                                                                                      Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
                                                                                                              11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

kCycles for 1 * rep movsd                              238384   83453  237663  201928  241232  179930  215571   86590
kCycles for 1 * movlps qword ptr [esi+8*ecx]           269729  151305  287198  250473  304695  220533  227473  131239
kCycles for 1 * movaps xmm0, oword ptr [esi]           236182  140797  279630  245765  302014  202840  230151  119079
kCycles for 1 * movdqa + movntdq                       156614   85881  181083  182061  208018  191639  146063   99788
kCycles for 1 * movdqu + movntdq                       156042   87420  181125  187806  207876  184883  146242   89096
kCycles for 1 * movdqu + movntdq + mfence              156594   85314  180909  197382  207752  170484  146090   92749
Excel .prn file looks better
                                         AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
                                                 AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
                                                         Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
                                                                 Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
                                                                         Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
                                                                                 Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
                                                                                         Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
                                                                                                 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

method                          min kCycles, then the ratio to that minimum for each CPU above
rep movsd                          83453    2.86    1.00    2.85    2.42    2.89    2.16    2.58    1.04
movlps qword ptr [esi+8*ecx]      131239    2.06    1.15    2.19    1.91    2.32    1.68    1.73    1.00
movaps xmm0, oword ptr [esi]      119079    1.98    1.18    2.35    2.06    2.54    1.70    1.93    1.00
movdqa + movntdq                   85881    1.82    1.00    2.11    2.12    2.42    2.23    1.70    1.16
movdqu + movntdq                   87420    1.78    1.00    2.07    2.15    2.38    2.11    1.67    1.02
movdqu + movntdq + mfence          85314    1.84    1.00    2.12    2.31    2.44    2.00    1.71    1.09


Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    sTab = Chr(9)
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            'Debug.Print .Cells(Row, Col)
            Row = Row + 1
            Col = Col + 1
        Wend
        .Cells(Row, Col).EntireRow.Insert
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    nRow = 0    ' start
    Open sFileName For Input As #nFileNro
    Line Input #nFileNro, sTextRow
    ActiveSheet.Cells(Row, Col) = sTextRow   ' title
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        Row = Row + 1   ' table
        nRow = nRow + 1 ' file
        If nRow >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9))    ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    ActiveSheet.Cells(Row, Col) = "??"   ' no valid timing on this line
                Else
                    nClk = Val(sTextRow)   ' leading number is the kCycles count
                    If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                End If
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If nRow = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub

EDIT: RepMovsd05GB_Results_1.csv

hutch--

 :biggrin:

Hi Timo,

I cheated: I downloaded Libre Office, set the defaults for Excel and Word and BINGO, I can now open your "csv" files.  :thumbsup:

jj2007

Quote from: daydreamer on December 15, 2021, 05:23:23 AM
Jochen, if it's cache-size dependent, wouldn't it be good to use some API that shows the cache size?

Good idea :thumbsup: Which one? Do you have some code?

TimoVJL

If we just collect L3 info
How to Check Processor Cache Memory in Windows 10
CPU                                                     L3 (MB)
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)            4
AMD Ryzen 9 5950X 16-Core Processor (SSE4)                   64
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)               3
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)              4
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)               3
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)               3
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)              15
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)        24

hutch--

Something I have just tested is that if you use movdqu for both read and write to make the copy algo fully unaligned, it is no faster than rep movsb.

jj2007

Quote from: hutch-- on December 15, 2021, 10:48:36 PM
Something I have just tested is that if you use movdqu for both read and write to make the copy algo fully unaligned, it is no faster than rep movsb.

movdqu was an attempt to speed up unaligned moves, but afaik they merged it with movups, and the latter is no longer slower than movaps. Three instructions with equivalent speed (though movaps still faults on a misaligned address).

TimoVJL

Quote from: hutch-- on December 15, 2021, 04:16:16 PM
:biggrin:

Hi Timo,

I cheated: I downloaded Libre Office, set the defaults for Excel and Word and BINGO, I can now open your "csv" files.  :thumbsup:

A fix for Libre Office VBA:
Option VBASupport 1
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    sTab = Chr(9)
    With ActiveSheet
        While .Cells(Row, Col).Text <> ""
            'Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        '.AllowMultiSelect = False
        '.Filters.Add "Text Files", "*.txt", 1
        .Show
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0    ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow   ' title
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9))    ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    ActiveSheet.Cells(Row, Col) = "??"   ' no valid timing on this line
                Else
                    nClk = Val(sTextRow)   ' leading number is the kCycles count
                    If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                End If
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub

Question: what format do users want for tables?

jj2007

Quote from: TimoVJL on December 15, 2021, 08:32:35 PM
If we just collect L3 info

include \masm32\MasmBasic\MasmBasic.inc
CACHE_DESCRIPTOR STRUCT
Level      BYTE ?
Associativity BYTE ?
LineSize  WORD ?
_Size      DWORD ?
_Type      dd ? ; PROCESSOR_CACHE_TYPE enum
CACHE_DESCRIPTOR ENDS

SYSTEM_LOGICAL_PROCESSOR_INFORMATION STRUCT
NodeNumber DWORD ?
Cache      CACHE_DESCRIPTOR <>
Reserved  ULONGLONG 2 dup(?)
SYSTEM_LOGICAL_PROCESSOR_INFORMATION ENDS

  Init
  Print cfm$("\n\nWmic:\n"), Launch$("wmic cpu get L2CacheSize, L3CacheSize") ; the simple solution
  Dll "Kernel32"
  Declare void GetLogicalProcessorInformation, 2
  Let edi=New$(400)
  ClearLastError
  push 400
  GetLogicalProcessorInformation(edi, esp)
  pop edx
  pinfo equ [edi.SYSTEM_LOGICAL_PROCESSOR_INFORMATION]
  deb 1, "GetLogicalProcessorInformation output:", eax, pinfo.NodeNumber, b:pinfo.Cache.Level, b:pinfo.Cache.LineSize, b:pinfo.Cache._Size, b:pinfo.Cache._Type, $Err$()
EndOfCode


Output:
Wmic:
L2CacheSize  L3CacheSize
256          3072


GetLogicalProcessorInformation output:
eax             1
pinfo.NodeNumber        3
b:pinfo.Cache.Level     00000000
b:pinfo.Cache.LineSize  0000000000000000
b:pinfo.Cache._Size     00000000000000000000000000000001
b:pinfo.Cache._Type     00000000000000000000000000000000
$Err$()         Operazione completata.__ (Italian for "Operation completed.")


WMIC works, but GetLogicalProcessorInformation returns rubbish :sad:
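
A possible explanation, offered as a guess: the numbers look like a valid first record read through the wrong layout. In the Windows SDK, SYSTEM_LOGICAL_PROCESSOR_INFORMATION starts with ProcessorMask and Relationship, and NodeNumber, Cache and Reserved share one union instead of following each other. Read that way, the dump is ProcessorMask = 3, Relationship = 0 (RelationProcessorCore) and Flags = 1, i.e. a core record, not a cache record. The call also fills an array of such records, so the cache sizes would sit in later entries with Relationship = RelationCache. The C sketch below mirrors the layout with portable types (the Lp* names and find_cache_size are made up; on Windows the real definitions come from windows.h) and walks an array looking for a given cache level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable mirrors of the Windows SDK layouts. The Lp* names are
   hypothetical; on Windows use the definitions from <windows.h>. */
enum {
    LpRelationProcessorCore = 0,
    LpRelationNumaNode      = 1,
    LpRelationCache         = 2
};

typedef struct {
    uint8_t  Level;          /* 1, 2 or 3 */
    uint8_t  Associativity;
    uint16_t LineSize;       /* bytes */
    uint32_t Size;           /* bytes */
    uint32_t Type;           /* PROCESSOR_CACHE_TYPE */
} LpCacheDescriptor;

typedef struct {
    uintptr_t ProcessorMask;     /* ULONG_PTR: which logical CPUs */
    uint32_t  Relationship;      /* LOGICAL_PROCESSOR_RELATIONSHIP */
    union {                      /* members overlap; anonymous in the SDK */
        struct { uint8_t Flags; } ProcessorCore;
        struct { uint32_t NodeNumber; } NumaNode;
        LpCacheDescriptor Cache;
        uint64_t Reserved[2];
    } u;
} LpInfo;

/* Walk all returned entries and report the size in bytes of the first
   cache entry at the requested level, or 0 if none is found. */
static uint32_t find_cache_size(const LpInfo *info, size_t count,
                                unsigned level)
{
    assert(info != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (info[i].Relationship == LpRelationCache &&
            info[i].u.Cache.Level == level)
            return info[i].u.Cache.Size;
    }
    return 0;
}
```

On Windows the array would come from GetLogicalProcessorInformation itself, with count = returned length / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION).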