Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..

Started by jj2007, November 16, 2021, 08:57:25 AM


jj2007

Quote from: nidud on November 17, 2021, 01:00:25 AM
You also need to know a few basic things about programming and batch files, so better just ignore it.

:greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp:

Why don't you just post stuff that works with a standard Masm32 SDK?

HSE

Quote from: jj2007 on November 16, 2021, 11:05:52 PM
Please post the executable, as testit.bat doesn't work, sorry

Works well with the latest AsmC64 :thumbsup:

total [0 .. 3], 1++
   108562 cycles 3.asm: mov   reg32,reg32
   201478 cycles 2.asm: bswap reg32
   409369 cycles 1.asm: imul  reg32,reg32,imm
   415621 cycles 0.asm: lea   reg64,mem64 * 2
   534630 cycles 4.asm: push  reg64
  1676828 cycles 6.asm: cld
  2252969 cycles 5.asm: std

Equations in Assembly: SmplMath

jj2007

Quote from: LiaoMi on November 17, 2021, 03:57:54 AM
Quote from: jj2007 on November 16, 2021, 08:57:25 AM
Following frequent discussions about old and slow instructions (such as bswap), here is a little testbed.

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

Thanks. Timings look a bit unstable (that happens often), but yours confirm that bswap is very fast, and that lodsd runs as fast as the 5-byter mov eax, [esi] + add esi, 4 - on Intel CPUs.
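
For anyone reading along who does not have the testbed source at hand, here is a rough C model of what the two competing sequences do. This is only a sketch: the names `lodsd_model`, `mov_add_model` and `lodsd_demo` are made up for this post, and the code says nothing about timing; it only shows that both forms load a dword and advance the pointer by one element.

```c
#include <assert.h>
#include <stdint.h>

/* Model of LODSD: load the dword at *cursor, then step the cursor by one
   dword. The step is forward when the direction flag is clear (after cld)
   and backward when it is set (after std). Hypothetical helper name. */
static uint32_t lodsd_model(const uint32_t **cursor, int direction_flag)
{
    uint32_t value = **cursor;
    *cursor += direction_flag ? -1 : 1;
    return value;
}

/* Model of the explicit pair mov eax, [esi] + add esi, 4. */
static uint32_t mov_add_model(const uint32_t **cursor)
{
    uint32_t value = **cursor;   /* mov eax, [esi] */
    *cursor += 1;                /* add esi, 4 (one dword) */
    return value;
}

/* Both forms read the same values in the same order. */
static void lodsd_demo(void)
{
    const uint32_t data[3] = {10u, 20u, 30u};
    const uint32_t *a = data, *b = data;
    for (int i = 0; i < 3; i++)
        assert(lodsd_model(&a, 0) == mov_add_model(&b));
    assert(a == b);
}
```

With the direction flag clear, the two models are interchangeable, which is why the comparison in the testbed comes down to cycles and code size.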

hutch--

I confess I have never had any problems with BSWAP over the years, from my 486 DX onwards. It was originally an endian swap instruction, and at the hardware level it does not have to do much, so it has always been fast enough. From memory, there is an SSE masked instruction that was reasonably fast and will do the same task, but it's more work to do something that probably does not matter.
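
As a small illustration of how little the instruction has to do, here is a C sketch of the 32-bit byte reversal that BSWAP performs, next to the 16-bit rotate from the testbed. The helper names `bswap32`, `ror32` and `bswap_demo` are invented for this example; note that `ror 16` only swaps the two 16-bit halves, so the two results differ.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the four bytes of x, which is what BSWAP does to a 32-bit
   register. Hypothetical helper name, not taken from the testbed. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Rotate right by n bits, n in 1..31; ror32(x, 16) swaps the 16-bit halves. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Worked example on one value. */
static void bswap_demo(void)
{
    assert(bswap32(0x11223344u) == 0x44332211u);   /* full byte reversal */
    assert(ror32(0x11223344u, 16) == 0x33441122u); /* halves swapped only */
    assert(bswap32(bswap32(0xDEADBEEFu)) == 0xDEADBEEFu);
}
```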

nidud


daydreamer

Quote from: nidud on November 17, 2021, 09:50:05 AM
Think the original point here was, as often is the case with BSWAP, that you didn't need to use it.
It's the same with some SIMD instructions if you rearrange the data in the .data section between an array of structures (AoS) and a structure of arrays (SoA) layout,
i.e. between xyzw, xyzw, xyzw, xyzw and xxxx, yyyy, zzzz, wwww.

@hutch, you mean the SSSE3 pshufb byte shuffle: with the right constant, it performs four 32-bit bswaps at once.
My non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding
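
To make the "right constant" for the byte shuffle concrete, here is a scalar C model of pshufb applied to 16 bytes, together with a control mask that reverses the bytes inside each 32-bit lane. It is a sketch, not SIMD code: `pshufb_model`, `bswap32x4`, `BSWAP32X4_CTL` and `shuffle_demo` are names made up for this post. On real hardware the same mask would presumably be passed to the `_mm_shuffle_epi8` intrinsic (SSSE3).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Control bytes that reverse the byte order inside each 32-bit lane. */
static const uint8_t BSWAP32X4_CTL[16] = {
     3,  2,  1,  0,    7,  6,  5,  4,
    11, 10,  9,  8,   15, 14, 13, 12
};

/* Scalar model of PSHUFB on a 16-byte register: destination byte i takes
   source byte ctl[i] & 15, or becomes 0 when bit 7 of ctl[i] is set. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t ctl[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0F];
    memcpy(dst, tmp, 16);   /* tmp lets dst and src be the same buffer */
}

/* Byte-swap four packed 32-bit values with a single shuffle. */
static void bswap32x4(uint8_t dst[16], const uint8_t src[16])
{
    pshufb_model(dst, src, BSWAP32X4_CTL);
}

/* Worked example: bytes 0..15 come back reversed within each group of four. */
static void shuffle_demo(void)
{
    uint8_t in[16], out[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)i;
    bswap32x4(out, in);
    assert(out[0] == 3 && out[1] == 2 && out[2] == 1 && out[3] == 0);
    assert(out[12] == 15 && out[15] == 12);
}
```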

TimoVJL

AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)

3       cycles for 100 * imul 10
95      cycles for 100 * lea: *10
7998    cycles for 100 * lodsd (25 DWORDs)
6440    cycles for 100 * mov eax, [esi] + add esi, 4
91      cycles for 100 * lea10, add eax
91      cycles for 100 * lea10, shl eax, 1
91      cycles for 100 * bswap
??      cycles for 100 * ror 16

5       cycles for 100 * imul 10
95      cycles for 100 * lea: *10
8274    cycles for 100 * lodsd (25 DWORDs)
6525    cycles for 100 * mov eax, [esi] + add esi, 4
93      cycles for 100 * lea10, add eax
95      cycles for 100 * lea10, shl eax, 1
94      cycles for 100 * bswap
??      cycles for 100 * ror 16

2       cycles for 100 * imul 10
92      cycles for 100 * lea: *10
8117    cycles for 100 * lodsd (25 DWORDs)
6529    cycles for 100 * mov eax, [esi] + add esi, 4
92      cycles for 100 * lea10, add eax
90      cycles for 100 * lea10, shl eax, 1
90      cycles for 100 * bswap
0       cycles for 100 * ror 16

2       cycles for 100 * imul 10
92      cycles for 100 * lea: *10
8058    cycles for 100 * lodsd (25 DWORDs)
6591    cycles for 100 * mov eax, [esi] + add esi, 4
93      cycles for 100 * lea10, add eax
93      cycles for 100 * lea10, shl eax, 1
93      cycles for 100 * bswap
0       cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


--- ok ---
May the source be with you

hutch--


Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

??      cycles for 100 * imul 10
11      cycles for 100 * lea: *10
2315    cycles for 100 * lodsd (25 DWORDs)
2860    cycles for 100 * mov eax, [esi] + add esi, 4
24      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

0       cycles for 100 * imul 10
14      cycles for 100 * lea: *10
2305    cycles for 100 * lodsd (25 DWORDs)
2823    cycles for 100 * mov eax, [esi] + add esi, 4
24      cycles for 100 * lea10, add eax
14      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

??      cycles for 100 * imul 10
14      cycles for 100 * lea: *10
2332    cycles for 100 * lodsd (25 DWORDs)
2857    cycles for 100 * mov eax, [esi] + add esi, 4
25      cycles for 100 * lea10, add eax
13      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

??      cycles for 100 * imul 10
16      cycles for 100 * lea: *10
2302    cycles for 100 * lodsd (25 DWORDs)
2837    cycles for 100 * mov eax, [esi] + add esi, 4
26      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16



hutch--


Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (SSE4)

2       cycles for 100 * imul 10
15      cycles for 100 * lea: *10
2160    cycles for 100 * lodsd (25 DWORDs)
2232    cycles for 100 * mov eax, [esi] + add esi, 4
28      cycles for 100 * lea10, add eax
10      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

2       cycles for 100 * imul 10
18      cycles for 100 * lea: *10
2117    cycles for 100 * lodsd (25 DWORDs)
2287    cycles for 100 * mov eax, [esi] + add esi, 4
25      cycles for 100 * lea10, add eax
13      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

1       cycles for 100 * imul 10
12      cycles for 100 * lea: *10
2086    cycles for 100 * lodsd (25 DWORDs)
2171    cycles for 100 * mov eax, [esi] + add esi, 4
24      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

0       cycles for 100 * imul 10
21      cycles for 100 * lea: *10
2143    cycles for 100 * lodsd (25 DWORDs)
2127    cycles for 100 * mov eax, [esi] + add esi, 4
25      cycles for 100 * lea10, add eax
28      cycles for 100 * lea10, shl eax, 1
1       cycles for 100 * bswap
??      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16



TimoVJL

Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
AMD Ryzen 9 5950X 16-Core Processor (SSE4)

cycles for 100 * imul 10 | 2 ?? ?? 23 ?? ?? 13 26 ??
cycles for 100 * lea: *10 | 15 11 21 54 21 11 29 9 72
cycles for 100 * lodsd (25 DWORDs) | 2160 2315 2606 4871 2563 2315 2929 1765 6976 5630
cycles for 100 * mov eax, [esi] + add | 2232 2860 2507 4863 2653 2860 2834 1689 2599 2036
cycles for 100 * lea10, add eax | 28 24 30 58 28 24 34 4 28
cycles for 100 * lea10, shl eax, 1 | 10 17 21 57 31 17 29 16 38
cycles for 100 * bswap | 5 27 102 2 6 38
cycles for 100 * ror 16 | ?? 6 62 14 ?? 19

cycles for 100 * imul 10 | 2 ?? 26 22 36
cycles for 100 * lea: *10 | 18 14 24 55 121 14 28 8 28
cycles for 100 * lodsd (25 DWORDs) | 2117 2305 2651 4881 4502 2305 2808 1566 6910 5610
cycles for 100 * mov eax, [esi] + add | 2287 2823 2532 4890 4379 2823 2931 1764 2593 2062
cycles for 100 * lea10, add eax | 25 24 20 56 57 24 28 2 28
cycles for 100 * lea10, shl eax, 1 | 13 14 36 55 64 14 30 8 31
cycles for 100 * bswap | 14 1 20 14 13 2 31
cycles for 100 * ror 16 | 14 2 55 6 14 7 9 2

cycles for 100 * imul 10 | 1 ?? ?? 24 3 ?? 30 ??
cycles for 100 * lea: *10 | 12 14 26 53 26 14 27 1 32
cycles for 100 * lodsd (25 DWORDs) | 2086 2332 2688 4829 2708 2332 2853 2126 6913 5787
cycles for 100 * mov eax, [esi] + add | 2171 2857 2459 4894 3182 2857 2832 1734 2582 2020
cycles for 100 * lea10, add eax | 24 25 29 56 27 25 28 15 35
cycles for 100 * lea10, shl eax, 1 | 16 13 23 54 23 13 30 26 28
cycles for 100 * bswap | 13 20 13 3 28
cycles for 100 * ror 16 | 13 5 55 13 13 7 ?? 6

cycles for 100 * imul 10 | ?? 3 30 ?? 53 29 3
cycles for 100 * lea: *10 | 21 16 30 60 32 16 27 4 46
cycles for 100 * lodsd (25 DWORDs) | 2143 2302 2544 4921 2617 2302 2808 1551 6928 5600
cycles for 100 * mov eax, [esi] + add | 2127 2837 2604 4810 2599 2837 2846 1667 2594 2051
cycles for 100 * lea10, add eax | 25 26 34 55 21 26 31 39 32
cycles for 100 * lea10, shl eax, 1 | 28 17 27 54 29 17 30 5 34 4
cycles for 100 * bswap | 1 17 20 5 17 2 5 34
cycles for 100 * ror 16 | 17 1 55 2 17 9 5 34

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16

jj2007

Impressive work, Timo :thumbsup:

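Sub ImportValues()
    Dim Row As Long, Col As Long
    Dim nFileNro As Integer, nCurRow As Long, nClk As Long
    Dim sFileName As String, sTextRow As String

    ' Find the first free column, starting at row 3, column 2
    Row = 3
    Col = 2
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Col = Col + 1
        Wend
    End With
    ' Let the user pick the text file that holds the timings
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub   ' dialog was cancelled
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    nCurRow = 0
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        ' The first line holds the CPU name: use it as the column header
        If nCurRow = 0 Then ActiveSheet.Cells(1, Col) = sTextRow
        nCurRow = nCurRow + 1
        If nCurRow >= 3 Then
            nClk = 0   ' reset, so a "??" cell is not overwritten by the previous value
            If Left(sTextRow, 1) = "?" Then
                ActiveSheet.Cells(Row, Col) = "??"
            Else
                nClk = Val(Left(sTextRow, 5))
            End If
            ' Blank lines and zero timings leave the cell empty
            If nClk > 0 Then ActiveSheet.Cells(Row, Col) = nClk
            Row = Row + 1
        End If
        If nCurRow = 38 Then Exit Do
    Loop
    Close #nFileNro
End Sub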

hutch--


jj2007

Quote from: hutch-- on November 24, 2021, 08:33:15 PM
Sad to say I have nothing that will open an Excel file.

This one should work; it usually also installs an advanced RichEd20 version somewhere in C:\Program Files (x86)\Common Files\microsoft shared\OFFICE*\RICHED20.DLL

https://filehippo.com/download_microsoft-excel-viewer/

No idea how useful that is, but here are the averages from Timo's table:

19 imul 10
28 lea * 10
3454 lodsd (25 DWORDs)
2769 mov eax, [esi] + add
30 lea10, add eax
26 lea10, shl eax, 1
18 bswap
18 ror 16


I always liked imul :cool:
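
For reference, the three "times ten" variants in the table boil down to the arithmetic below. This is a C sketch under the assumption that the lea tests use the usual lea eax, [eax+eax*4] step to get x*5 and then double it; I have not copied the exact instruction sequences from the testbed, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Plain multiplication, the arithmetic behind imul eax, eax, 10. */
static uint32_t mul10_imul(uint32_t x)
{
    return x * 10u;
}

/* Assumed shape of "lea10, add eax": x*5 via lea, then add eax, eax. */
static uint32_t mul10_lea_add(uint32_t x)
{
    uint32_t x5 = x + (x << 2);   /* lea eax, [eax+eax*4] */
    return x5 + x5;               /* add eax, eax */
}

/* Assumed shape of "lea10, shl eax, 1": x*5 via lea, then shift left by one. */
static uint32_t mul10_lea_shl(uint32_t x)
{
    uint32_t x5 = x + (x << 2);   /* lea eax, [eax+eax*4] */
    return x5 << 1;               /* shl eax, 1 */
}

/* All three agree, including when the 32-bit result wraps around. */
static void mul10_demo(void)
{
    assert(mul10_imul(123u) == 1230u);
    assert(mul10_lea_add(123u) == 1230u);
    assert(mul10_lea_shl(123u) == 1230u);
    assert(mul10_lea_add(0xFFFFFFFFu) == mul10_imul(0xFFFFFFFFu));
    assert(mul10_lea_shl(0xFFFFFFFFu) == mul10_imul(0xFFFFFFFFu));
}
```

Since the results are identical, the choice between the variants is purely a matter of the cycle counts and byte sizes listed above.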

TimoVJL

May the source be with you

hutch--

OK, that worked after I renamed the extension back to XLS.

The real use of the data was the comparison between the old i7 and the Xeon. Both have Haswell cores, but there are some interesting variations.