Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..

Started by jj2007, November 16, 2021, 08:57:25 AM

TimoVJL

Tabular data.
May the source be with you

jj2007

First timings with the new machine :cool:

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

0       cycles for 100 * imul 10
30      cycles for 100 * lea: *10
488     cycles for 100 * lodsd (25 DWORDs) - 100*
162     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
29      cycles for 100 * lea10, add eax
38      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

3       cycles for 100 * imul 10
29      cycles for 100 * lea: *10
489     cycles for 100 * lodsd (25 DWORDs) - 100*
162     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
30      cycles for 100 * lea10, add eax
29      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
1       cycles for 100 * xor eax, ecx

4       cycles for 100 * imul 10
28      cycles for 100 * lea: *10
487     cycles for 100 * lodsd (25 DWORDs) - 100*
170     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32      cycles for 100 * lea10, add eax
30      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs) - 100*
18      bytes for mov eax, [esi] + add esi, 4 - 100*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx

hutch--

This is on my old i7.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

0       cycles for 100 * imul 10
18      cycles for 100 * lea: *10
213     cycles for 100 * lodsd (25 DWORDs) - 100*
196     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
7       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
19      cycles for 100 * lea: *10
213     cycles for 100 * lodsd (25 DWORDs) - 100*
193     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
6       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
19      cycles for 100 * lea: *10
214     cycles for 100 * lodsd (25 DWORDs) - 100*
193     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
7       cycles for 100 * xor eax, ecx

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs) - 100*
18      bytes for mov eax, [esi] + add esi, 4 - 100*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx


-

LiaoMi

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

41      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
116     cycles for 100 * lodsd (25 DWORDs) - 100*
120     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
16      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
18      cycles for 100 * bswap+nop
11      cycles for 100 * xor eax, ecx

45      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
116     cycles for 100 * lodsd (25 DWORDs) - 100*
118     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
17      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
18      cycles for 100 * bswap+nop
12      cycles for 100 * xor eax, ecx

44      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
117     cycles for 100 * lodsd (25 DWORDs) - 100*
123     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
16      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
16      cycles for 100 * xor eax, ecx

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs) - 100*
18      bytes for mov eax, [esi] + add esi, 4 - 100*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx


--- ok ---

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

0       cycles for 100 * imul 10
38      cycles for 100 * lea: *10
591     cycles for 100 * lodsd (25 DWORDs) - 100*
208     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
38      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
36      cycles for 100 * lea: *10
591     cycles for 100 * lodsd (25 DWORDs) - 100*
202     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
37      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
37      cycles for 100 * lea: *10
590     cycles for 100 * lodsd (25 DWORDs) - 100*
200     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
37      cycles for 100 * lea10, add eax
40      cycles for 100 * lea10, shl eax, 1
3       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs) - 100*
18      bytes for mov eax, [esi] + add esi, 4 - 100*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx

3150U  = AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
3400G  = AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
X2 220 = AMD Athlon(tm) II X2 220 Processor (SSE3)
11800H = 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
5820K  = Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

                                            3150U  3400G  X2 220  11800H   5820K

cycles for 100 * imul 10                        0      0       3      41       0
cycles for 100 * lea: *10                      30     38      27      15      18
cycles for 100 * lodsd (25 DWORDs) - 100*     488    591     744     116     213
cycles for 100 * mov eax, [esi] + add esi,    162    208     459     120     196
cycles for 100 * lea10, add eax                29     38      16      16      32
cycles for 100 * lea10, shl eax, 1             38     37       3      17      17
cycles for 100 * bswap+nop                      0      0      27      18      16
cycles for 100 * xor eax, ecx                   0      0       5      11       7

cycles for 100 * imul 10                        3      0       3      45       0
cycles for 100 * lea: *10                      29     36      31      15      19
cycles for 100 * lodsd (25 DWORDs) - 100*     489    591     771     116     213
cycles for 100 * mov eax, [esi] + add esi,    162    202     479     118     193
cycles for 100 * lea10, add eax                30     37       4      17      32
cycles for 100 * lea10, shl eax, 1             29     37       4      16      16
cycles for 100 * bswap+nop                      0      0      29      18      16
cycles for 100 * xor eax, ecx                   1      0       0      12       6

cycles for 100 * imul 10                        4      0       2      44       0
cycles for 100 * lea: *10                      28     37      29      15      19
cycles for 100 * lodsd (25 DWORDs) - 100*     487    590     753     117     214
cycles for 100 * mov eax, [esi] + add esi,    170    200     459     123     193
cycles for 100 * lea10, add eax                32     37       6      16      32
cycles for 100 * lea10, shl eax, 1             30     40       1      16      16
cycles for 100 * bswap+nop                      0      3      25      16      16
cycles for 100 * xor eax, ecx                   0      0       3      16       7

@jj2007 Excel

Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub   ' dialog cancelled: nothing was selected
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0    ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow   ' title
        Row = Row + 1
        If Row >= 3 Then
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    nClk = -1   ' "??" line: do not let the previous row's value overwrite the cell
                    ActiveSheet.Cells(Row, Col) = "??"
                Else
                    nClk = Val(Left(sTextRow, 5))
                End If
                'Debug.Print nClk
                If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, 8)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
May the source be with you

jj2007

Quote from: LiaoMi on December 01, 2021, 06:01:02 PM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

41      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
116     cycles for 100 * lodsd (25 DWORDs) - 100*
120     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*

The only CPU that performs badly for imul :cool:
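
For readers comparing the imul and lea rows: the "lea10, add eax" and "lea10, shl eax, 1" labels presumably stand for a shift-and-add replacement of a multiply by 10, such as `lea eax, [eax+eax*4]` followed by `add eax, eax` or `shl eax, 1`. The exact instruction sequences of the testbed are not shown in this thread, so that reading is an assumption. The sketch below only checks the arithmetic identity in Python, with 32-bit wraparound emulated by masking; the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit register wraparound

def mul10_imul(x):
    """Model of `imul eax, eax, 10`: low 32 bits of x * 10."""
    return (x * 10) & MASK32

def mul10_lea_add(x):
    """Model of `lea eax, [eax+eax*4]` followed by `add eax, eax` (assumed sequence)."""
    x5 = (x + x * 4) & MASK32      # lea: x * 5
    return (x5 + x5) & MASK32      # add: (x * 5) * 2

def mul10_lea_shl(x):
    """Model of `lea eax, [eax+eax*4]` followed by `shl eax, 1` (assumed sequence)."""
    x5 = (x + x * 4) & MASK32      # lea: x * 5
    return (x5 << 1) & MASK32      # shl: (x * 5) * 2

# All three variants agree, including when the product overflows 32 bits.
for value in (0, 1, 7, 12345, 0x7FFFFFFF, 0xFFFFFFFF):
    assert mul10_imul(value) == mul10_lea_add(value) == mul10_lea_shl(value)
```

The results are identical, so any difference in the tables comes from how each CPU executes the instructions, not from what they compute.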

hutch--

I do wonder at the virtue of testing "lodsb" without the "rep" prefix. It is generally known to be a very slow instruction when used by itself.
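
As a reminder of what is being compared: with the direction flag clear, `lodsd` loads the DWORD at [esi] into eax and advances esi by 4, so `mov eax, [esi]` plus `add esi, 4` is its plain-instruction equivalent. With a `rep` prefix the load repeats ecx times and each load overwrites eax, so only the last value survives. The Python model below is a minimal sketch of those semantics only (it says nothing about speed); the helper names and the test buffer are hypothetical, not taken from the testbed.

```python
import struct

def lodsd(memory, esi):
    """Model of `lodsd` with DF=0: returns (eax, new_esi)."""
    (eax,) = struct.unpack_from("<I", memory, esi)  # little-endian DWORD at [esi]
    return eax, esi + 4

def mov_add(memory, esi):
    """Model of `mov eax, [esi]` followed by `add esi, 4`."""
    (eax,) = struct.unpack_from("<I", memory, esi)
    esi += 4
    return eax, esi

def rep_lodsd(memory, esi, ecx):
    """Model of `rep lodsd`: ecx loads, each one overwriting eax."""
    eax = None  # the real eax stays unchanged when ecx == 0; None stands in for that
    while ecx:
        eax, esi = lodsd(memory, esi)
        ecx -= 1
    return eax, esi, ecx

# 25 DWORDs, matching the "(25 DWORDs)" test labels; the values are arbitrary.
buffer = struct.pack("<25I", *range(100, 125))
```

Here `lodsd(buffer, 0)` and `mov_add(buffer, 0)` both give `(100, 4)`, while `rep_lodsd(buffer, 0, 25)` gives `(124, 100, 0)`: the whole buffer is walked but only the final DWORD is left in eax.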

LiaoMi

Quote from: jj2007 on December 01, 2021, 09:54:49 PM
Quote from: LiaoMi on December 01, 2021, 06:01:02 PM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

41      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
116     cycles for 100 * lodsd (25 DWORDs) - 100*
120     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*

The only CPU that performs badly for imul :cool:

:biggrin: I noticed that a lot of instructions are slower than on much older processors, and some of them are extremely slow :rolleyes: That's the price of progress; most likely the reason lies in the microcode implementation. It's very sad :undecided:

jj2007

Quote from: hutch-- on December 01, 2021, 11:47:29 PM
I do wonder at the virtue of testing "lodsb" without the "rep" prefix. It is generally known to be a very slow instruction when used by itself.

Very good point :thumbsup:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

28      cycles for 100 * imul 10
688     cycles for 100 * rep lodsd (25 DWORDs) - 10*
405     cycles for 100 *     lodsd (25 DWORDs) - 10*
317     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
57      cycles for 100 * lea10, add eax
56      cycles for 100 * lea10, shl eax, 1
5       cycles for 100 * bswap+nop
19      cycles for 100 * xor eax, ecx

27      cycles for 100 * imul 10
688     cycles for 100 * rep lodsd (25 DWORDs) - 10*
406     cycles for 100 *     lodsd (25 DWORDs) - 10*
316     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
58      cycles for 100 * lea10, add eax
56      cycles for 100 * lea10, shl eax, 1
5       cycles for 100 * bswap+nop
16      cycles for 100 * xor eax, ecx

28      cycles for 100 * imul 10
686     cycles for 100 * rep lodsd (25 DWORDs) - 10*
406     cycles for 100 *     lodsd (25 DWORDs) - 10*
316     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
57      cycles for 100 * lea10, add eax
59      cycles for 100 * lea10, shl eax, 1
5       cycles for 100 * bswap+nop
16      cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx

TimoVJL

AMD Athlon(tm) II X2 220 Processor (SSE3)

2       cycles for 100 * imul 10
458     cycles for 100 * rep lodsd (25 DWORDs) - 10*
785     cycles for 100 *     lodsd (25 DWORDs) - 10*
470     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
3       cycles for 100 * lea10, add eax
2       cycles for 100 * lea10, shl eax, 1
29      cycles for 100 * bswap+nop
1       cycles for 100 * xor eax, ecx

1       cycles for 100 * imul 10
458     cycles for 100 * rep lodsd (25 DWORDs) - 10*
799     cycles for 100 *     lodsd (25 DWORDs) - 10*
469     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
0       cycles for 100 * lea10, add eax
1       cycles for 100 * lea10, shl eax, 1
24      cycles for 100 * bswap+nop
5       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
465     cycles for 100 * rep lodsd (25 DWORDs) - 10*
764     cycles for 100 *     lodsd (25 DWORDs) - 10*
465     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
2       cycles for 100 * lea10, add eax
0       cycles for 100 * lea10, shl eax, 1
26      cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

0       cycles for 100 * imul 10
544     cycles for 100 * rep lodsd (25 DWORDs) - 10*
657     cycles for 100 *     lodsd (25 DWORDs) - 10*
199     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
37      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
546     cycles for 100 * rep lodsd (25 DWORDs) - 10*
657     cycles for 100 *     lodsd (25 DWORDs) - 10*
204     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
38      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
1       cycles for 100 * bswap+nop
4       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
546     cycles for 100 * rep lodsd (25 DWORDs) - 10*
655     cycles for 100 *     lodsd (25 DWORDs) - 10*
201     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
37      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
8       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx
handle tabs

Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub   ' dialog cancelled: nothing was selected
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0    ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow   ' title
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9))    ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    nClk = -1   ' "??" line: do not let the previous row's value overwrite the cell
                    ActiveSheet.Cells(Row, Col) = "??"
                Else
                    nClk = Val(Left(sTextRow, 5))
                End If
                'Debug.Print nClk
                If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
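
For anyone without Excel, the same kind of import can be sketched in Python. This is not a port of the macro above, only an illustration under assumptions: each result line starts with a cycle count (or `??`), followed by spaces or a tab and then the label, and a blank line separates runs. The function name `parse_timings` is made up, and the sample reuses lines posted in this thread, with one tab inserted to show that both separators are accepted.

```python
import re

# A result line: a count or "??", whitespace (spaces or a tab), then the label.
LINE_RE = re.compile(r"^(\d+|\?\?)\s+(cycles for .+)$")

def parse_timings(text):
    """Return a list of runs; each run maps label -> cycle count (None for '??')."""
    runs, current = [], {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            count, label = match.groups()
            current[label] = None if count == "??" else int(count)
        elif current:          # a blank or non-timing line ends the current run
            runs.append(current)
            current = {}
    if current:
        runs.append(current)
    return runs

sample = """Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10

?? cycles for 100 * imul 10
13\tcycles for 100 * lea: *10

8 bytes for imul 10
"""
runs = parse_timings(sample)
```

The CPU name line and the "bytes for" lines are ignored because they do not match the pattern; `??` entries are kept as `None` so they cannot be mistaken for a measured zero.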
May the source be with you

FORTRANS

Hi,

   Four systems with timings.

Cheers,

Steve N.

pre-P4 (SSE1)

101 cycles for 100 * imul 10
100 cycles for 100 * lea: *10
10717 cycles for 100 * lodsd (25 DWORDs)
8402 cycles for 100 * mov eax, [esi] + add esi, 4
100 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
111 cycles for 100 * ror 16

101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8391 cycles for 100 * mov eax, [esi] + add esi, 4
99 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
99 cycles for 100 * bswap
102 cycles for 100 * ror 16

101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8390 cycles for 100 * mov eax, [esi] + add esi, 4
99 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
100 cycles for 100 * ror 16

101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8401 cycles for 100 * mov eax, [esi] + add esi, 4
100 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
100 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
7437 cycles for 100 * lodsd (25 DWORDs)
5961 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
168 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
104 cycles for 100 * ror 16

164 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
7382 cycles for 100 * lodsd (25 DWORDs)
6006 cycles for 100 * mov eax, [esi] + add esi, 4
176 cycles for 100 * lea10, add eax
166 cycles for 100 * lea10, shl eax, 1
160 cycles for 100 * bswap
104 cycles for 100 * ror 16

163 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
7426 cycles for 100 * lodsd (25 DWORDs)
5964 cycles for 100 * mov eax, [esi] + add esi, 4
172 cycles for 100 * lea10, add eax
163 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
104 cycles for 100 * ror 16

154 cycles for 100 * imul 10
163 cycles for 100 * lea: *10
7390 cycles for 100 * lodsd (25 DWORDs)
5961 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
164 cycles for 100 * lea10, shl eax, 1
156 cycles for 100 * bswap
100 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2867 cycles for 100 * lodsd (25 DWORDs)
2990 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2872 cycles for 100 * lodsd (25 DWORDs)
2947 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
13 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
13 cycles for 100 * lea: *10
2843 cycles for 100 * lodsd (25 DWORDs)
3185 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
28 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2874 cycles for 100 * lodsd (25 DWORDs)
3088 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
7 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

1 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2322 cycles for 100 * lodsd (25 DWORDs)
2160 cycles for 100 * mov eax, [esi] + add esi, 4
21 cycles for 100 * lea10, add eax
21 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
0 cycles for 100 * ror 16

0 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2250 cycles for 100 * lodsd (25 DWORDs)
2167 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
6 cycles for 100 * ror 16

1 cycles for 100 * imul 10
30 cycles for 100 * lea: *10
2452 cycles for 100 * lodsd (25 DWORDs)
2134 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
22 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16

7 cycles for 100 * imul 10
25 cycles for 100 * lea: *10
2408 cycles for 100 * lodsd (25 DWORDs)
2158 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---


pre-P4 (SSE1)

101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
9606 cycles for 100 * rep lodsd (25 DWORDs)
8392 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

102 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
9604 cycles for 100 * rep lodsd (25 DWORDs)
8400 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

103 cycles for 100 * imul 10
113 cycles for 100 * lea: *10
9590 cycles for 100 * rep lodsd (25 DWORDs)
8403 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

102 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
9594 cycles for 100 * rep lodsd (25 DWORDs)
8391 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

163 cycles for 100 * imul 10
159 cycles for 100 * lea: *10
9053 cycles for 100 * rep lodsd (25 DWORDs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
167 cycles for 100 * lea10, add eax
160 cycles for 100 * lea10, shl eax, 1
159 cycles for 100 * bswap
104 cycles for 100 * ror 16

163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
9013 cycles for 100 * rep lodsd (25 DWORDs)
6009 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
163 cycles for 100 * lea10, shl eax, 1
159 cycles for 100 * bswap
104 cycles for 100 * ror 16

155 cycles for 100 * imul 10
159 cycles for 100 * lea: *10
9051 cycles for 100 * rep lodsd (25 DWORDs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
180 cycles for 100 * lea10, add eax
166 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
105 cycles for 100 * ror 16

164 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
9008 cycles for 100 * rep lodsd (25 DWORDs)
6009 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
165 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
97 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
9492 cycles for 100 * rep lodsd (25 DWORDs)
3181 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
14 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
19 cycles for 100 * lea: *10
10011 cycles for 100 * rep lodsd (25 DWORDs)
3240 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
7 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
9746 cycles for 100 * rep lodsd (25 DWORDs)
3137 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
12 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
11 cycles for 100 * lea: *10
9658 cycles for 100 * rep lodsd (25 DWORDs)
3242 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
9 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

52 cycles for 100 * imul 10
85 cycles for 100 * lea: *10
10853 cycles for 100 * rep lodsd (25 DWORDs)
4703 cycles for 100 * mov eax, [esi] + add esi, 4
71 cycles for 100 * lea10, add eax
67 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap
37 cycles for 100 * ror 16

39 cycles for 100 * imul 10
97 cycles for 100 * lea: *10
10803 cycles for 100 * rep lodsd (25 DWORDs)
3588 cycles for 100 * mov eax, [esi] + add esi, 4
18 cycles for 100 * lea10, add eax
2 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

32 cycles for 100 * imul 10
68 cycles for 100 * lea: *10
8343 cycles for 100 * rep lodsd (25 DWORDs)
2578 cycles for 100 * mov eax, [esi] + add esi, 4
8 cycles for 100 * lea10, add eax
3 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
3 cycles for 100 * lea: *10
8712 cycles for 100 * rep lodsd (25 DWORDs)
3064 cycles for 100 * mov eax, [esi] + add esi, 4
0 cycles for 100 * lea10, add eax
5 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---


pre-P4 (SSE1)

101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34862 cycles for 100 * rep lodsB (100 BYTEs)
8403 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
103 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34869 cycles for 100 * rep lodsB (100 BYTEs)
8400 cycles for 100 * mov eax, [esi] + add esi, 4
102 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
103 cycles for 100 * bswap
102 cycles for 100 * ror 16

102 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

102 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34864 cycles for 100 * rep lodsB (100 BYTEs)
8392 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
101 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
32678 cycles for 100 * rep lodsB (100 BYTEs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
164 cycles for 100 * lea10, shl eax, 1
156 cycles for 100 * bswap
100 cycles for 100 * ror 16

155 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
32653 cycles for 100 * rep lodsB (100 BYTEs)
5966 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
167 cycles for 100 * lea10, shl eax, 1
166 cycles for 100 * bswap
104 cycles for 100 * ror 16

163 cycles for 100 * imul 10
169 cycles for 100 * lea: *10
32663 cycles for 100 * rep lodsB (100 BYTEs)
5979 cycles for 100 * mov eax, [esi] + add esi, 4
176 cycles for 100 * lea10, add eax
168 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
104 cycles for 100 * ror 16

162 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
32673 cycles for 100 * rep lodsB (100 BYTEs)
5981 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
159 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
96 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
25205 cycles for 100 * rep lodsB (100 BYTEs)
3292 cycles for 100 * mov eax, [esi] + add esi, 4
51 cycles for 100 * lea10, add eax
38 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
31 cycles for 100 * lea: *10
25332 cycles for 100 * rep lodsB (100 BYTEs)
3306 cycles for 100 * mov eax, [esi] + add esi, 4
30 cycles for 100 * lea10, add eax
70 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
12 cycles for 100 * lea: *10
25632 cycles for 100 * rep lodsB (100 BYTEs)
3298 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
10 cycles for 100 * lea: *10
25643 cycles for 100 * rep lodsB (100 BYTEs)
3205 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

0 cycles for 100 * imul 10
26 cycles for 100 * lea: *10
16730 cycles for 100 * rep lodsB (100 BYTEs)
2255 cycles for 100 * mov eax, [esi] + add esi, 4
91 cycles for 100 * lea10, add eax
71 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

0 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
16499 cycles for 100 * rep lodsB (100 BYTEs)
2099 cycles for 100 * mov eax, [esi] + add esi, 4
23 cycles for 100 * lea10, add eax
18 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 cycles for 100 * imul 10
20 cycles for 100 * lea: *10
16543 cycles for 100 * rep lodsB (100 BYTEs)
2269 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
8 cycles for 100 * bswap
0 cycles for 100 * ror 16

1 cycles for 100 * imul 10
38 cycles for 100 * lea: *10
19057 cycles for 100 * rep lodsB (100 BYTEs)
2247 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
3 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

LiaoMi

Quote from: jj2007 on December 02, 2021, 01:49:15 AM
Quote from: hutch-- on December 01, 2021, 11:47:29 PM
I do wonder at the virtue of testing for "lodsb" without the "rep" prefix. It is generally seen that it's a very slow mnemonic used by itself.

Very good point :thumbsup:

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

43      cycles for 100 * imul 10
511     cycles for 100 * rep lodsd (25 DWORDs) - 10*
123     cycles for 100 *     lodsd (25 DWORDs) - 10*
122     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
17      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
18      cycles for 100 * bswap+nop
13      cycles for 100 * xor eax, ecx

43      cycles for 100 * imul 10
508     cycles for 100 * rep lodsd (25 DWORDs) - 10*
122     cycles for 100 *     lodsd (25 DWORDs) - 10*
121     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
18      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
18      cycles for 100 * bswap+nop
14      cycles for 100 * xor eax, ecx

44      cycles for 100 * imul 10
506     cycles for 100 * rep lodsd (25 DWORDs) - 10*
121     cycles for 100 *     lodsd (25 DWORDs) - 10*
122     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
18      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
19      cycles for 100 * bswap+nop
13      cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx


--- ok ---

hutch--


Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

0       cycles for 100 * imul 10
701     cycles for 100 * rep lodsd (25 DWORDs) - 10*
238     cycles for 100 *     lodsd (25 DWORDs) - 10*
199     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
32      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
14      cycles for 100 * bswap+nop
7       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
701     cycles for 100 * rep lodsd (25 DWORDs) - 10*
238     cycles for 100 *     lodsd (25 DWORDs) - 10*
196     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
31      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
3       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
702     cycles for 100 * rep lodsd (25 DWORDs) - 10*
238     cycles for 100 *     lodsd (25 DWORDs) - 10*
194     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
32      cycles for 100 * lea10, add eax
18      cycles for 100 * lea10, shl eax, 1
15      cycles for 100 * bswap+nop
2       cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx


-


jj2007

Quote from: FORTRANS on December 02, 2021, 03:02:47 AM
   Four systems with timings.

34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4


The second row should be mov al, [esi] + inc esi; otherwise it's an unfair comparison.

FORTRANS

Hi Jochen,

Quote from: jj2007 on December 02, 2021, 09:41:23 AM
Quote from: FORTRANS on December 02, 2021, 03:02:47 AM
   Four systems with timings.

34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4


The second row should be mov al, [esi] + inc esi; otherwise it's an unfair comparison.

   Well, the idea was that Intel had improved LODSB in some of their CPUs.
Or maybe not. Anyway, here are the four systems again with the byte-wise loop.

pre-P4 (SSE1)

101 cycles for 100 * imul 10
34863 cycles for 100 * rep lodsB (100 BYTEs)
28500 cycles for 100 * mov AL, [esi] + INC esi

101 cycles for 100 * imul 10
34882 cycles for 100 * rep lodsB (100 BYTEs)
28482 cycles for 100 * mov AL, [esi] + INC esi

101 cycles for 100 * imul 10
34861 cycles for 100 * rep lodsB (100 BYTEs)
28502 cycles for 100 * mov AL, [esi] + INC esi

101 cycles for 100 * imul 10
34876 cycles for 100 * rep lodsB (100 BYTEs)
28492 cycles for 100 * mov AL, [esi] + INC esi

8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

162 cycles for 100 * imul 10
32603 cycles for 100 * rep lodsB (100 BYTEs)
24696 cycles for 100 * mov AL, [esi] + INC esi

162 cycles for 100 * imul 10
32611 cycles for 100 * rep lodsB (100 BYTEs)
24699 cycles for 100 * mov AL, [esi] + INC esi

156 cycles for 100 * imul 10
32592 cycles for 100 * rep lodsB (100 BYTEs)
24726 cycles for 100 * mov AL, [esi] + INC esi

162 cycles for 100 * imul 10
32608 cycles for 100 * rep lodsB (100 BYTEs)
24705 cycles for 100 * mov AL, [esi] + INC esi

8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
24755 cycles for 100 * rep lodsB (100 BYTEs)
14615 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
24772 cycles for 100 * rep lodsB (100 BYTEs)
14738 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
24759 cycles for 100 * rep lodsB (100 BYTEs)
14621 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
24760 cycles for 100 * rep lodsB (100 BYTEs)
14622 cycles for 100 * mov AL, [esi] + INC esi

8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi


--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

?? cycles for 100 * imul 10
17094 cycles for 100 * rep lodsB (100 BYTEs)
8491 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
16367 cycles for 100 * rep lodsB (100 BYTEs)
8473 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
16269 cycles for 100 * rep lodsB (100 BYTEs)
8559 cycles for 100 * mov AL, [esi] + INC esi

1 cycles for 100 * imul 10
16111 cycles for 100 * rep lodsB (100 BYTEs)
8466 cycles for 100 * mov AL, [esi] + INC esi

8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi


--- ok ---


Regards,

Steve
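
To make the four byte-wise results above easier to compare, here is a rough sketch that turns the raw cycle counts into cycles per byte and into a rep lodsB / mov AL ratio. It uses the first run of each system as posted. The 10,000-byte total (100 passes over 100 bytes) is an assumption for the mov AL row, since the output states the byte count only for rep lodsB; the names BYTES_TIMED, per_byte and summarize are made up for this illustration.

```python
# Cycle counts copied from the first run of each system in the post above:
# (rep lodsB, mov AL,[esi] + INC esi)
timings = {
    "pre-P4": (34863, 28500),
    "Pentium M 1.70GHz": (32603, 24696),
    "Core i3-4005U": (24755, 14615),
    "Core i3-10110U": (17094, 8491),
}

# ASSUMPTION: each reported figure covers 100 passes over 100 bytes.
# The output says "100 BYTEs" only for rep lodsB; the mov AL loop is
# assumed to walk the same buffer.
BYTES_TIMED = 100 * 100


def per_byte(cycles, bytes_timed=BYTES_TIMED):
    """Average cycles spent per byte loaded."""
    return cycles / bytes_timed


def summarize(data):
    """Return (cpu, rep lodsB cycles/byte, mov AL cycles/byte, ratio) rows."""
    rows = []
    for cpu, (rep_lodsb, mov_inc) in data.items():
        rows.append((cpu, per_byte(rep_lodsb), per_byte(mov_inc),
                     rep_lodsb / mov_inc))
    return rows


if __name__ == "__main__":
    for cpu, lods, mov, ratio in summarize(timings):
        print(f"{cpu:20s} rep lodsB {lods:5.2f} c/B   "
              f"mov AL {mov:5.2f} c/B   ratio {ratio:4.2f}")
```

If the assumption holds, the ratio column shows how much slower rep lodsB is than the plain mov AL loop on each machine, which is the comparison Jochen asked for.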