Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..

Started by jj2007, November 16, 2021, 08:57:25 AM

jj2007

Following frequent discussions about old and slow instructions (such as bswap), here is a little testbed.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23      cycles for 100 * imul 10
54      cycles for 100 * lea: *10
4871    cycles for 100 * lodsd (25 DWORDs)
4863    cycles for 100 * mov eax, [esi] + add esi, 4
58      cycles for 100 * lea10, add eax
57      cycles for 100 * lea10, shl eax, 1
27      cycles for 100 * bswap
62      cycles for 100 * ror 16

26      cycles for 100 * imul 10
55      cycles for 100 * lea: *10
4881    cycles for 100 * lodsd (25 DWORDs)
4890    cycles for 100 * mov eax, [esi] + add esi, 4
56      cycles for 100 * lea10, add eax
55      cycles for 100 * lea10, shl eax, 1
20      cycles for 100 * bswap
55      cycles for 100 * ror 16

24      cycles for 100 * imul 10
53      cycles for 100 * lea: *10
4829    cycles for 100 * lodsd (25 DWORDs)
4894    cycles for 100 * mov eax, [esi] + add esi, 4
56      cycles for 100 * lea10, add eax
54      cycles for 100 * lea10, shl eax, 1
20      cycles for 100 * bswap
55      cycles for 100 * ror 16

30      cycles for 100 * imul 10
60      cycles for 100 * lea: *10
4921    cycles for 100 * lodsd (25 DWORDs)
4810    cycles for 100 * mov eax, [esi] + add esi, 4
55      cycles for 100 * lea10, add eax
54      cycles for 100 * lea10, shl eax, 1
20      cycles for 100 * bswap
55      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16
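
For readers wondering what the multiply rows in the table compute, here is a minimal C sketch. It is not the testbed's source, which is not shown in this post: the instruction sequences named in the comments are assumptions inferred from the row labels, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* imul eax, eax, 10: a single multiply instruction */
uint32_t mul10_imul(uint32_t x) {
    return x * 10u;
}

/* Assumed sequence: lea eax, [eax+eax*4] followed by add eax, eax */
uint32_t mul10_lea_add(uint32_t x) {
    uint32_t times5 = x + x * 4u;   /* lea computes x*5 */
    return times5 + times5;         /* add doubles it to x*10 */
}

/* Assumed sequence: lea eax, [eax+eax*4] followed by shl eax, 1 */
uint32_t mul10_lea_shl(uint32_t x) {
    uint32_t times5 = x + (x << 2); /* lea computes x*5 */
    return times5 << 1;             /* shl doubles it to x*10 */
}

/* Asserts that the three variants agree for one input (results wrap modulo 2^32) */
void check_mul10(uint32_t x) {
    assert(mul10_lea_add(x) == mul10_imul(x));
    assert(mul10_lea_shl(x) == mul10_imul(x));
}
```

All three functions return the same value, including on overflow, so the timings compare different encodings of one operation rather than different results.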


In a similar test, I get 0 cycles for bswap and very few cycles for xor eax, ecx:
26      cycles for 100 * imul 10
56      cycles for 100 * lea: *10
0       cycles for 100 * bswap
14      cycles for 100 * xor eax, ecx


Zero cycles probably means that a bswap gets executed in parallel, at no cost.
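
A side note on the two byte-shuffling rows: bswap and ror 16 are timed side by side, but they are not interchangeable. The C sketch below shows what each one computes on a 32-bit value; the helper names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* bswap eax: reverse the order of all four bytes */
uint32_t byte_swap32(uint32_t x) {
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* ror eax, 16: swap the two 16-bit halves, byte order inside each half unchanged */
uint32_t rotate_right16(uint32_t x) {
    return (x >> 16) | (x << 16);
}

/* Asserts the expected results for one sample value */
void check_byte_shuffles(void) {
    assert(byte_swap32(0x11223344u) == 0x44332211u);
    assert(rotate_right16(0x11223344u) == 0x33441122u);
}
```

Applying either function twice gives back the original value.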

Siekmanski

AMD Ryzen 9 5950X 16-Core Processor             (SSE4)

0       cycles for 100 * imul 10
0       cycles for 100 * lea: *10
5630    cycles for 100 * lodsd (25 DWORDs)
2036    cycles for 100 * mov eax, [esi] + add esi, 4
0       cycles for 100 * lea10, add eax
0       cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
0       cycles for 100 * ror 16

0       cycles for 100 * imul 10
0       cycles for 100 * lea: *10
5610    cycles for 100 * lodsd (25 DWORDs)
2062    cycles for 100 * mov eax, [esi] + add esi, 4
0       cycles for 100 * lea10, add eax
0       cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
0       cycles for 100 * ror 16

0       cycles for 100 * imul 10
0       cycles for 100 * lea: *10
5787    cycles for 100 * lodsd (25 DWORDs)
2020    cycles for 100 * mov eax, [esi] + add esi, 4
0       cycles for 100 * lea10, add eax
0       cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
0       cycles for 100 * ror 16

0       cycles for 100 * imul 10
0       cycles for 100 * lea: *10
5600    cycles for 100 * lodsd (25 DWORDs)
2051    cycles for 100 * mov eax, [esi] + add esi, 4
0       cycles for 100 * lea10, add eax
4       cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
0       cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


--- ok ---
Creative coders use backward thinking techniques as a strategy.

HSE

Nice to test new machine  :thumbsup:

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

?? cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2606 cycles for 100 * lodsd (25 DWORDs)
2507 cycles for 100 * mov eax, [esi] + add esi, 4
30 cycles for 100 * lea10, add eax
21 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap
6 cycles for 100 * ror 16

?? cycles for 100 * imul 10
24 cycles for 100 * lea: *10
2651 cycles for 100 * lodsd (25 DWORDs)
2532 cycles for 100 * mov eax, [esi] + add esi, 4
20 cycles for 100 * lea10, add eax
36 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
2 cycles for 100 * ror 16

?? cycles for 100 * imul 10
26 cycles for 100 * lea: *10
2688 cycles for 100 * lodsd (25 DWORDs)
2459 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
23 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
5 cycles for 100 * ror 16

3 cycles for 100 * imul 10
30 cycles for 100 * lea: *10
2544 cycles for 100 * lodsd (25 DWORDs)
2604 cycles for 100 * mov eax, [esi] + add esi, 4
34 cycles for 100 * lea10, add eax
27 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
1 cycles for 100 * ror 16



Well, the new box is not so slow :biggrin:
Equations in Assembly: SmplMath

jj2007

Very interesting. How should we interpret the 0 cycles: parallel execution, or an error in the overhead calculation?

So lodsd shines on Intel but sucks on AMD... I've just ordered a new notebook with an AMD Athlon Gold 3150U :rolleyes:
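
On the 0 cycles question: the testbed's timing macros are not shown in this thread, so the following is only a sketch of one common scheme, assumed rather than confirmed. The idea is to time the loop containing the instruction, time an empty reference loop, and subtract. If out-of-order execution hides the instruction behind the loop overhead, the difference lands at zero or slightly below. That "??" marks a negative difference is a guess. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Net cost = measured loop time minus empty-loop overhead.
   Signed on purpose: timing jitter can push the result below zero. */
int64_t net_cycles(uint64_t measured, uint64_t overhead) {
    return (int64_t)measured - (int64_t)overhead;
}

/* Value a report might print if negative results are clamped to zero */
uint64_t reported_cycles(uint64_t measured, uint64_t overhead) {
    int64_t net = net_cycles(measured, overhead);
    return net > 0 ? (uint64_t)net : 0u;
}

/* Asserts the behaviour on made-up cycle counts */
void check_overhead_subtraction(void) {
    assert(net_cycles(1020u, 1000u) == 20);     /* instruction adds visible cost */
    assert(net_cycles(1000u, 1000u) == 0);      /* cost fully hidden by the loop */
    assert(net_cycles(998u, 1000u) == -2);      /* jitter: payload run was faster */
    assert(reported_cycles(998u, 1000u) == 0u); /* clamped for display */
}
```

With this scheme a reading of 0 does not mean the instruction is free in general, only that its cost did not show up on top of this particular loop.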

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

??      cycles for 100 * imul 10
72      cycles for 100 * lea: *10
6976    cycles for 100 * lodsd (25 DWORDs)
2599    cycles for 100 * mov eax, [esi] + add esi, 4
28      cycles for 100 * lea10, add eax
38      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
0       cycles for 100 * ror 16

0       cycles for 100 * imul 10
28      cycles for 100 * lea: *10
6910    cycles for 100 * lodsd (25 DWORDs)
2593    cycles for 100 * mov eax, [esi] + add esi, 4
28      cycles for 100 * lea10, add eax
31      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
2       cycles for 100 * ror 16

??      cycles for 100 * imul 10
32      cycles for 100 * lea: *10
6913    cycles for 100 * lodsd (25 DWORDs)
2582    cycles for 100 * mov eax, [esi] + add esi, 4
35      cycles for 100 * lea10, add eax
28      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
6       cycles for 100 * ror 16

3       cycles for 100 * imul 10
46      cycles for 100 * lea: *10
6928    cycles for 100 * lodsd (25 DWORDs)
2594    cycles for 100 * mov eax, [esi] + add esi, 4
32      cycles for 100 * lea10, add eax
34      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


-
May the source be with you

hutch--

JJ,

I had a look at the specs for that AMD CPU, you will really appreciate the difference in performance.  :thumbsup:

mineiro

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

13 cycles for 100 * imul 10
29 cycles for 100 * lea: *10
2929 cycles for 100 * lodsd (25 DWORDs)
2834 cycles for 100 * mov eax, [esi] + add esi, 4
34 cycles for 100 * lea10, add eax
29 cycles for 100 * lea10, shl eax, 1
2 cycles for 100 * bswap
19 cycles for 100 * ror 16

0 cycles for 100 * imul 10
28 cycles for 100 * lea: *10
2808 cycles for 100 * lodsd (25 DWORDs)
2931 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
13 cycles for 100 * bswap
7 cycles for 100 * ror 16

0 cycles for 100 * imul 10
27 cycles for 100 * lea: *10
2853 cycles for 100 * lodsd (25 DWORDs)
2832 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
3 cycles for 100 * bswap
7 cycles for 100 * ror 16

53 cycles for 100 * imul 10
27 cycles for 100 * lea: *10
2808 cycles for 100 * lodsd (25 DWORDs)
2846 cycles for 100 * mov eax, [esi] + add esi, 4
31 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
2 cycles for 100 * bswap
9 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Quote from: hutch-- on November 16, 2021, 07:22:36 PM
I had a look at the specs for that AMD CPU, you will really appreciate the difference in performance.  :thumbsup:

I hope so! Single thread performance is only about 43% better (1815 vs. 1273 in the benchmark below), which is not so much. But I'll have a 256GB SSD instead of a ten year old traditional disk, and 12GB of RAM instead of 6GB.

Cpu benchmark/single thread performance
2046/1273  Intel Core i5-2450M
4112/1815  AMD Athlon Gold 3150U (2 cores)

daydreamer

Cool results, Marinus. Seems Ryzen comes with the AD&D time stop spell :thumbsup: :greenclp:

Why does it only say SSE4 caps? Insufficient testing for AVX and AVX2 caps?

JJ, don't forget the better/bigger cache and faster RAM on your new computer as well; it will be faster in memory-intensive tasks.
Mine is great for fast 3D rendering, and for having fun learning AVX and AVX2.

Shouldn't it be spelled testIT instead? :bgrin:
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

??      cycles for 100 * imul 10
21      cycles for 100 * lea: *10
2563    cycles for 100 * lodsd (25 DWORDs)
2653    cycles for 100 * mov eax, [esi] + add esi, 4
28      cycles for 100 * lea10, add eax
31      cycles for 100 * lea10, shl eax, 1
102     cycles for 100 * bswap
14      cycles for 100 * ror 16

22      cycles for 100 * imul 10
121     cycles for 100 * lea: *10
4502    cycles for 100 * lodsd (25 DWORDs)
4379    cycles for 100 * mov eax, [esi] + add esi, 4
57      cycles for 100 * lea10, add eax
64      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
6       cycles for 100 * ror 16

3       cycles for 100 * imul 10
26      cycles for 100 * lea: *10
2708    cycles for 100 * lodsd (25 DWORDs)
3182    cycles for 100 * mov eax, [esi] + add esi, 4
27      cycles for 100 * lea10, add eax
23      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
13      cycles for 100 * ror 16

0       cycles for 100 * imul 10
32      cycles for 100 * lea: *10
2617    cycles for 100 * lodsd (25 DWORDs)
2599    cycles for 100 * mov eax, [esi] + add esi, 4
21      cycles for 100 * lea10, add eax
29      cycles for 100 * lea10, shl eax, 1
5       cycles for 100 * bswap
2       cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


-
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

TimoVJL

Interesting numbers

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
2606 cycles for 100 * lodsd (25 DWORDs)

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
2929 cycles for 100 * lodsd (25 DWORDs)

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
2563    cycles for 100 * lodsd (25 DWORDs)

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
4871    cycles for 100 * lodsd (25 DWORDs)

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
1765    cycles for 100 * lodsd (25 DWORDs)

AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
5630    cycles for 100 * lodsd (25 DWORDs)

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
6976    cycles for 100 * lodsd (25 DWORDs)

AMD Athlon(tm) II X2 220 Processor (SSE3)
9202    cycles for 100 * lodsd (25 DWORDs)

jj2007

Quote from: TimoVJL on November 17, 2021, 12:19:16 AM
Interesting numbers

Yes. It seems that Intel has invested in lodsd, while AMD hasn't. I use lodsd whenever speed is not an issue, because it's a one-byter, while mov eax, [esi] + add esi, 4 occupies 5 bytes in the instruction cache.
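
To spell out the size argument, here is a small C sketch of the two load-and-advance patterns, with the 32-bit encodings in the comments: lodsd is the single byte AD, while mov eax, [esi] is 8B 06 and add esi, 4 is 83 C6 04. Summing the values is only there to give the loops an observable result; it is an assumption for the example, not what the testbed does, and a C compiler will not necessarily emit lodsd for the first loop. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* lodsd semantics (direction flag clear): eax = [esi], then esi += 4.
   Encoding: AD (1 byte). */
uint32_t sum_lodsd_style(const uint32_t *src, size_t count) {
    uint32_t sum = 0;
    while (count--) {
        sum += *src++;          /* load and advance in one step */
    }
    return sum;
}

/* mov eax, [esi] (8B 06, 2 bytes) + add esi, 4 (83 C6 04, 3 bytes) = 5 bytes */
uint32_t sum_mov_add_style(const uint32_t *src, size_t count) {
    uint32_t sum = 0;
    while (count--) {
        uint32_t value = *src;  /* mov eax, [esi] */
        src += 1;               /* add esi, 4: advance by one DWORD */
        sum += value;
    }
    return sum;
}

/* Asserts that both styles read the same 25 DWORDs */
void check_load_styles(void) {
    uint32_t data[25];
    for (uint32_t i = 0; i < 25; i++) {
        data[i] = i + 1;        /* 1, 2, ..., 25 */
    }
    assert(sum_lodsd_style(data, 25) == 325u);
    assert(sum_mov_add_style(data, 25) == 325u);
}
```

The 4-byte difference matches the byte counts in the tables above: 14 bytes for the lodsd test against 18 for the mov + add test.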

LiaoMi

Quote from: jj2007 on November 16, 2021, 08:57:25 AM
Following frequent discussions about old and slow instructions (such as bswap), here is a little testbed.

Hi jj2007,

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

26      cycles for 100 * imul 10
9       cycles for 100 * lea: *10
1765    cycles for 100 * lodsd (25 DWORDs)
1689    cycles for 100 * mov eax, [esi] + add esi, 4
4       cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
6       cycles for 100 * bswap
0       cycles for 100 * ror 16

36      cycles for 100 * imul 10
8       cycles for 100 * lea: *10
1566    cycles for 100 * lodsd (25 DWORDs)
1764    cycles for 100 * mov eax, [esi] + add esi, 4
2       cycles for 100 * lea10, add eax
8       cycles for 100 * lea10, shl eax, 1
2       cycles for 100 * bswap
9       cycles for 100 * ror 16

30      cycles for 100 * imul 10
1       cycles for 100 * lea: *10
2126    cycles for 100 * lodsd (25 DWORDs)
1734    cycles for 100 * mov eax, [esi] + add esi, 4
15      cycles for 100 * lea10, add eax
26      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

29      cycles for 100 * imul 10
4       cycles for 100 * lea: *10
1551    cycles for 100 * lodsd (25 DWORDs)
1667    cycles for 100 * mov eax, [esi] + add esi, 4
39      cycles for 100 * lea10, add eax
5       cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


--- ok ---