The MASM Forum

General => The Laboratory => Topic started by: jj2007 on November 16, 2021, 08:57:25 AM

Title: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 16, 2021, 08:57:25 AM
Following frequent discussions about old and slow instructions (such as bswap (http://masm32.com/board/index.php?topic=9624.msg105534#msg105534)), here is a little testbed.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23      cycles for 100 * imul 10
54      cycles for 100 * lea: *10
4871    cycles for 100 * lodsd (25 DWORDs)
4863    cycles for 100 * mov eax, [esi] + add esi, 4
58      cycles for 100 * lea10, add eax
57      cycles for 100 * lea10, shl eax, 1
27      cycles for 100 * bswap
62      cycles for 100 * ror 16

26      cycles for 100 * imul 10
55      cycles for 100 * lea: *10
4881    cycles for 100 * lodsd (25 DWORDs)
4890    cycles for 100 * mov eax, [esi] + add esi, 4
56      cycles for 100 * lea10, add eax
55      cycles for 100 * lea10, shl eax, 1
20      cycles for 100 * bswap
55      cycles for 100 * ror 16

24      cycles for 100 * imul 10
53      cycles for 100 * lea: *10
4829    cycles for 100 * lodsd (25 DWORDs)
4894    cycles for 100 * mov eax, [esi] + add esi, 4
56      cycles for 100 * lea10, add eax
54      cycles for 100 * lea10, shl eax, 1
20      cycles for 100 * bswap
55      cycles for 100 * ror 16

30      cycles for 100 * imul 10
60      cycles for 100 * lea: *10
4921    cycles for 100 * lodsd (25 DWORDs)
4810    cycles for 100 * mov eax, [esi] + add esi, 4
55      cycles for 100 * lea10, add eax
54      cycles for 100 * lea10, shl eax, 1
20      cycles for 100 * bswap
55      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


In a similar test, I get 0 cycles for bswap and very few cycles for xor eax, ecx:
26      cycles for 100 * imul 10
56      cycles for 100 * lea: *10
0       cycles for 100 * bswap
14      cycles for 100 * xor eax, ecx


Zero cycles probably means that a bswap gets executed in parallel, at no cost.
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: Siekmanski on November 16, 2021, 09:04:16 AM
AMD Ryzen 9 5950X 16-Core Processor             (SSE4)

0       cycles for 100 * imul 10
0       cycles for 100 * lea: *10
5630    cycles for 100 * lodsd (25 DWORDs)
2036    cycles for 100 * mov eax, [esi] + add esi, 4
0       cycles for 100 * lea10, add eax
0       cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
0       cycles for 100 * ror 16

0       cycles for 100 * imul 10
0       cycles for 100 * lea: *10
5610    cycles for 100 * lodsd (25 DWORDs)
2062    cycles for 100 * mov eax, [esi] + add esi, 4
0       cycles for 100 * lea10, add eax
0       cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
0       cycles for 100 * ror 16

0       cycles for 100 * imul 10
0       cycles for 100 * lea: *10
5787    cycles for 100 * lodsd (25 DWORDs)
2020    cycles for 100 * mov eax, [esi] + add esi, 4
0       cycles for 100 * lea10, add eax
0       cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
0       cycles for 100 * ror 16

0       cycles for 100 * imul 10
0       cycles for 100 * lea: *10
5600    cycles for 100 * lodsd (25 DWORDs)
2051    cycles for 100 * mov eax, [esi] + add esi, 4
0       cycles for 100 * lea10, add eax
4       cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
0       cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


--- ok ---
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: HSE on November 16, 2021, 09:06:23 AM
Nice to test new machine :thumbsup:

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)

?? cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2606 cycles for 100 * lodsd (25 DWORDs)
2507 cycles for 100 * mov eax, [esi] + add esi, 4
30 cycles for 100 * lea10, add eax
21 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap
6 cycles for 100 * ror 16

?? cycles for 100 * imul 10
24 cycles for 100 * lea: *10
2651 cycles for 100 * lodsd (25 DWORDs)
2532 cycles for 100 * mov eax, [esi] + add esi, 4
20 cycles for 100 * lea10, add eax
36 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
2 cycles for 100 * ror 16

?? cycles for 100 * imul 10
26 cycles for 100 * lea: *10
2688 cycles for 100 * lodsd (25 DWORDs)
2459 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
23 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
5 cycles for 100 * ror 16

3 cycles for 100 * imul 10
30 cycles for 100 * lea: *10
2544 cycles for 100 * lodsd (25 DWORDs)
2604 cycles for 100 * mov eax, [esi] + add esi, 4
34 cycles for 100 * lea10, add eax
27 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
1 cycles for 100 * ror 16



Well, the new box is not so slow :biggrin:
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 16, 2021, 09:53:45 AM
Very interesting. How to interpret the 0 cycles? Parallel execution, or an error in the overhead calculation?

So lodsd shines on Intel but sucks on AMD... I've just ordered a new notebook with an AMD Athlon Gold 3150U (https://www.cpubenchmark.net/cpu.php?cpu=AMD+Athlon+Gold+3150U&id=3777) :rolleyes:
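
On the 0-cycle question, here is a minimal sketch. It assumes the testbed reports the raw cycle count minus the cost of an empty reference loop. The harness source is not shown in this thread, so `net_cycles`, `report`, the sample counts and the reading of `??` as a negative result are all assumptions. If the tested instruction fits into execution slots that the loop overhead leaves idle, the raw count equals the baseline and the net result is 0; a little measurement noise is then enough to push it below zero.

```python
def net_cycles(raw, baseline):
    """Cycles attributed to the tested code: raw count minus the empty-loop baseline."""
    return raw - baseline


def report(raw, baseline):
    """Format one result; negative results are shown as '??' (assumed convention)."""
    net = net_cycles(raw, baseline)
    return "??" if net < 0 else str(net)


# Hypothetical counts with a baseline of 500 cycles for the empty loop
print(report(527, 500))  # tested code adds cycles on top of the loop
print(report(500, 500))  # tested code is hidden behind the loop overhead
print(report(498, 500))  # noise makes the run faster than the baseline
```

Under these assumptions, both explanations can be true at once: the instruction runs in parallel, and the overhead subtraction cannot resolve anything smaller than its own noise.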
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: TimoVJL on November 16, 2021, 10:36:30 AM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

??      cycles for 100 * imul 10
72      cycles for 100 * lea: *10
6976    cycles for 100 * lodsd (25 DWORDs)
2599    cycles for 100 * mov eax, [esi] + add esi, 4
28      cycles for 100 * lea10, add eax
38      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
0       cycles for 100 * ror 16

0       cycles for 100 * imul 10
28      cycles for 100 * lea: *10
6910    cycles for 100 * lodsd (25 DWORDs)
2593    cycles for 100 * mov eax, [esi] + add esi, 4
28      cycles for 100 * lea10, add eax
31      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
2       cycles for 100 * ror 16

??      cycles for 100 * imul 10
32      cycles for 100 * lea: *10
6913    cycles for 100 * lodsd (25 DWORDs)
2582    cycles for 100 * mov eax, [esi] + add esi, 4
35      cycles for 100 * lea10, add eax
28      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
6       cycles for 100 * ror 16

3       cycles for 100 * imul 10
46      cycles for 100 * lea: *10
6928    cycles for 100 * lodsd (25 DWORDs)
2594    cycles for 100 * mov eax, [esi] + add esi, 4
32      cycles for 100 * lea10, add eax
34      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


-
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on November 16, 2021, 07:22:36 PM
JJ,

I had a look at the specs for that AMD CPU, you will really appreciate the difference in performance.  :thumbsup:
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: mineiro on November 16, 2021, 10:01:53 PM
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

13 cycles for 100 * imul 10
29 cycles for 100 * lea: *10
2929 cycles for 100 * lodsd (25 DWORDs)
2834 cycles for 100 * mov eax, [esi] + add esi, 4
34 cycles for 100 * lea10, add eax
29 cycles for 100 * lea10, shl eax, 1
2 cycles for 100 * bswap
19 cycles for 100 * ror 16

0 cycles for 100 * imul 10
28 cycles for 100 * lea: *10
2808 cycles for 100 * lodsd (25 DWORDs)
2931 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
13 cycles for 100 * bswap
7 cycles for 100 * ror 16

0 cycles for 100 * imul 10
27 cycles for 100 * lea: *10
2853 cycles for 100 * lodsd (25 DWORDs)
2832 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
3 cycles for 100 * bswap
7 cycles for 100 * ror 16

53 cycles for 100 * imul 10
27 cycles for 100 * lea: *10
2808 cycles for 100 * lodsd (25 DWORDs)
2846 cycles for 100 * mov eax, [esi] + add esi, 4
31 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
2 cycles for 100 * bswap
9 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 16, 2021, 10:49:24 PM
Quote from: hutch-- on November 16, 2021, 07:22:36 PM
I had a look at the specs for that AMD CPU, you will really appreciate the difference in performance.  :thumbsup:

I hope so! Single thread performance is only 50% better, which is not so much. But I'll have a 256GB SSD instead of a ten year old traditional disk, and 12GB of RAM instead of 6GB.

Cpu benchmark/single thread performance
2046/1273  Intel Core i5-2450M
4112/1815  AMD Athlon Gold 3150U (2 cores)
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: nidud on November 16, 2021, 10:53:28 PM
deleted
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 16, 2021, 11:05:52 PM
Please post the executable, as testit.bat doesn't work, sorry
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: daydreamer on November 16, 2021, 11:30:53 PM
Cool results Marinus, it seems Ryzen comes with the AD&D timestop spell :thumbsup: :greenclp:

Why does it only say SSE4 caps? Insufficient testing vs AVX and AVX2 caps?

JJ, don't forget the better/bigger cache and the faster RAM speed on your new computer; it will be faster in memory-intensive tasks as well.
Mine is great both for fast 3D rendering and for having fun learning AVX2 and AVX.

Shouldn't it be spelled testIT instead? :bgrin:
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

??      cycles for 100 * imul 10
21      cycles for 100 * lea: *10
2563    cycles for 100 * lodsd (25 DWORDs)
2653    cycles for 100 * mov eax, [esi] + add esi, 4
28      cycles for 100 * lea10, add eax
31      cycles for 100 * lea10, shl eax, 1
102     cycles for 100 * bswap
14      cycles for 100 * ror 16

22      cycles for 100 * imul 10
121     cycles for 100 * lea: *10
4502    cycles for 100 * lodsd (25 DWORDs)
4379    cycles for 100 * mov eax, [esi] + add esi, 4
57      cycles for 100 * lea10, add eax
64      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
6       cycles for 100 * ror 16

3       cycles for 100 * imul 10
26      cycles for 100 * lea: *10
2708    cycles for 100 * lodsd (25 DWORDs)
3182    cycles for 100 * mov eax, [esi] + add esi, 4
27      cycles for 100 * lea10, add eax
23      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
13      cycles for 100 * ror 16

0       cycles for 100 * imul 10
32      cycles for 100 * lea: *10
2617    cycles for 100 * lodsd (25 DWORDs)
2599    cycles for 100 * mov eax, [esi] + add esi, 4
21      cycles for 100 * lea10, add eax
29      cycles for 100 * lea10, shl eax, 1
5       cycles for 100 * bswap
2       cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


-
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: TimoVJL on November 17, 2021, 12:19:16 AM
Interesting numbers

Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
2606 cycles for 100 * lodsd (25 DWORDs)

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
2929 cycles for 100 * lodsd (25 DWORDs)

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
2563    cycles for 100 * lodsd (25 DWORDs)

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
4871    cycles for 100 * lodsd (25 DWORDs)

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
1765    cycles for 100 * lodsd (25 DWORDs)

AMD Ryzen 9 5950X 16-Core Processor             (SSE4)
5630    cycles for 100 * lodsd (25 DWORDs)

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
6976    cycles for 100 * lodsd (25 DWORDs)

AMD Athlon(tm) II X2 220 Processor (SSE3)
9202    cycles for 100 * lodsd (25 DWORDs)
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 17, 2021, 12:29:03 AM
Quote from: TimoVJL on November 17, 2021, 12:19:16 AM
Interesting numbers

Yes. It seems that Intel has invested in lodsd, while AMD hasn't. I use lodsd whenever speed is not an issue, because it's a one-byter, while mov eax, [esi] + add esi, 4 occupies 5 bytes in the instruction cache.
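
For reference, a small Python model of the two variants, assuming the direction flag is clear. The function names are made up for illustration; the opcode bytes in the comments are the standard 32-bit encodings. Both variants load the DWORD at [esi] and advance esi by 4. The differences are that lodsd leaves the flags alone and honours the direction flag, while add esi, 4 modifies the flags.

```python
MASK32 = 0xFFFFFFFF


def lodsd(mem, esi):
    """lodsd (opcode AD, 1 byte): eax = [esi], then esi += 4 (direction flag clear)."""
    eax = int.from_bytes(mem[esi:esi + 4], "little")
    return eax, (esi + 4) & MASK32


def mov_add(mem, esi):
    """mov eax, [esi] (8B 06, 2 bytes) followed by add esi, 4 (83 C6 04, 3 bytes)."""
    eax = int.from_bytes(mem[esi:esi + 4], "little")  # mov eax, [esi]
    esi = (esi + 4) & MASK32                          # add esi, 4 (also changes the flags)
    return eax, esi


# Both variants return the same (eax, esi) pair for the same input
data = bytes(range(16))
print(lodsd(data, 0) == mov_add(data, 0))
```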
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: nidud on November 17, 2021, 01:00:25 AM
deleted
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: LiaoMi on November 17, 2021, 03:57:54 AM
Quote from: jj2007 on November 16, 2021, 08:57:25 AM
Following frequent discussions about old and slow instructions (such as bswap (http://masm32.com/board/index.php?topic=9624.msg105534#msg105534)), here is a little testbed.

Hi jj2007,

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

26      cycles for 100 * imul 10
9       cycles for 100 * lea: *10
1765    cycles for 100 * lodsd (25 DWORDs)
1689    cycles for 100 * mov eax, [esi] + add esi, 4
4       cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
6       cycles for 100 * bswap
0       cycles for 100 * ror 16

36      cycles for 100 * imul 10
8       cycles for 100 * lea: *10
1566    cycles for 100 * lodsd (25 DWORDs)
1764    cycles for 100 * mov eax, [esi] + add esi, 4
2       cycles for 100 * lea10, add eax
8       cycles for 100 * lea10, shl eax, 1
2       cycles for 100 * bswap
9       cycles for 100 * ror 16

30      cycles for 100 * imul 10
1       cycles for 100 * lea: *10
2126    cycles for 100 * lodsd (25 DWORDs)
1734    cycles for 100 * mov eax, [esi] + add esi, 4
15      cycles for 100 * lea10, add eax
26      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

29      cycles for 100 * imul 10
4       cycles for 100 * lea: *10
1551    cycles for 100 * lodsd (25 DWORDs)
1667    cycles for 100 * mov eax, [esi] + add esi, 4
39      cycles for 100 * lea10, add eax
5       cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


--- ok ---
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 17, 2021, 04:10:12 AM
Quote from: nidud on November 17, 2021, 01:00:25 AM
You also need to know a few basic things about programming and batch files, so better just ignore it.

:greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp:

Why don't you just post stuff that works with a standard Masm32 SDK?
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: HSE on November 17, 2021, 04:10:48 AM
Quote from: jj2007 on November 16, 2021, 11:05:52 PM
Please post the executable, as testit.bat doesn't work, sorry

Works well with the latest AsmC64  :thumbsup:

total [0 .. 3], 1++
   108562 cycles 3.asm: mov   reg32,reg32
   201478 cycles 2.asm: bswap reg32
   409369 cycles 1.asm: imul  reg32,reg32,imm
   415621 cycles 0.asm: lea   reg64,mem64 * 2
   534630 cycles 4.asm: push  reg64
  1676828 cycles 6.asm: cld
  2252969 cycles 5.asm: std

Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 17, 2021, 04:13:31 AM
Quote from: LiaoMi on November 17, 2021, 03:57:54 AM
Quote from: jj2007 on November 16, 2021, 08:57:25 AM
Following frequent discussions about old and slow instructions (such as bswap (http://masm32.com/board/index.php?topic=9624.msg105534#msg105534)), here is a little testbed.

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

Thanks. Timings look a bit unstable (that happens often), but yours confirm that bswap is very fast, and that lodsd runs as fast as the 5-byter mov eax, [esi] + add esi, 4 - on Intel CPUs.
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on November 17, 2021, 09:18:04 AM
I confess I have never had any problems with BSWAP over the years, from my 486 DX onwards. It's originally an endian swap instruction and at a hardware level it does not have to do much, so it has always been fast enough. From memory, there is an SSE masked instruction that was reasonably fast and will do the same task, but it's more work to do something that probably does not matter.
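
To illustrate how little bswap has to do, here is a Python model (the function names are hypothetical). It also shows that a single ror eax, 16 only swaps the two 16-bit halves; a full byte reversal built from rotates needs ror ax, 8 / ror eax, 16 / ror ax, 8. Whether the "ror 16" entry in the testbed uses one rotate or three is not shown in this thread.

```python
MASK32 = 0xFFFFFFFF


def bswap32(x):
    """bswap eax: reverse the order of the four bytes."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")


def ror32(x, n):
    """ror eax, n: rotate a 32-bit value right by n bits."""
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def ror_ax_8(x):
    """ror ax, 8: swap the two low bytes, leave the upper 16 bits unchanged."""
    ax = x & 0xFFFF
    ax = ((ax >> 8) | (ax << 8)) & 0xFFFF
    return (x & 0xFFFF0000) | ax


def bswap_via_rotates(x):
    """ror ax, 8 / ror eax, 16 / ror ax, 8: same result as bswap eax."""
    return ror_ax_8(ror32(ror_ax_8(x), 16))


value = 0x12345678
print(hex(bswap32(value)), hex(bswap_via_rotates(value)), hex(ror32(value, 16)))
```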
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: nidud on November 17, 2021, 09:50:05 AM
deleted
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: daydreamer on November 17, 2021, 10:27:28 AM
Quote from: nidud on November 17, 2021, 09:50:05 AM
Think the original point here was, as often is the case with BSWAP, that you didn't need to use it.
It's the same with some SIMD instructions if you rearrange the data in the .data section between array of structures (AoS) and structure of arrays (SoA),
i.e. between xyzw, xyzw, xyzw, xyzw and xxxx, yyyy, zzzz, wwww.

@hutch, you mean pshufb (byte shuffle, SSSE3): with the right constant, you can perform 4 x 32-bit bswap
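
A sketch of that idea in Python, modelling a byte shuffle on one 16-byte register. The helper names and the constant name `BSWAP4_MASK` are made up for illustration; the shuffle rule (bit 7 of a mask byte zeroes the result byte, otherwise the low 4 bits select a source byte) is the documented behaviour of the 128-bit pshufb.

```python
def pshufb(src, mask):
    """Model of pshufb on 16 bytes: each mask byte selects one source byte or zero."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("both operands must be 16 bytes long")
    return bytes(0 if m & 0x80 else src[m & 0x0F] for m in mask)


# Shuffle constant that reverses the bytes inside each of the four DWORDs
BSWAP4_MASK = bytes([3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12])


def bswap4(src):
    """Byte-swap four packed 32-bit values with one shuffle."""
    return pshufb(src, BSWAP4_MASK)


packed = b"".join(v.to_bytes(4, "little") for v in (0x11223344, 1, 2, 0xAABBCCDD))
swapped = bswap4(packed)
print([hex(int.from_bytes(swapped[i:i + 4], "little")) for i in range(0, 16, 4)])
```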
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: TimoVJL on November 20, 2021, 11:37:49 PM
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)

3       cycles for 100 * imul 10
95      cycles for 100 * lea: *10
7998    cycles for 100 * lodsd (25 DWORDs)
6440    cycles for 100 * mov eax, [esi] + add esi, 4
91      cycles for 100 * lea10, add eax
91      cycles for 100 * lea10, shl eax, 1
91      cycles for 100 * bswap
??      cycles for 100 * ror 16

5       cycles for 100 * imul 10
95      cycles for 100 * lea: *10
8274    cycles for 100 * lodsd (25 DWORDs)
6525    cycles for 100 * mov eax, [esi] + add esi, 4
93      cycles for 100 * lea10, add eax
95      cycles for 100 * lea10, shl eax, 1
94      cycles for 100 * bswap
??      cycles for 100 * ror 16

2       cycles for 100 * imul 10
92      cycles for 100 * lea: *10
8117    cycles for 100 * lodsd (25 DWORDs)
6529    cycles for 100 * mov eax, [esi] + add esi, 4
92      cycles for 100 * lea10, add eax
90      cycles for 100 * lea10, shl eax, 1
90      cycles for 100 * bswap
0       cycles for 100 * ror 16

2       cycles for 100 * imul 10
92      cycles for 100 * lea: *10
8058    cycles for 100 * lodsd (25 DWORDs)
6591    cycles for 100 * mov eax, [esi] + add esi, 4
93      cycles for 100 * lea10, add eax
93      cycles for 100 * lea10, shl eax, 1
93      cycles for 100 * bswap
0       cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


--- ok ---
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on November 24, 2021, 05:56:59 PM

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

??      cycles for 100 * imul 10
11      cycles for 100 * lea: *10
2315    cycles for 100 * lodsd (25 DWORDs)
2860    cycles for 100 * mov eax, [esi] + add esi, 4
24      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

0       cycles for 100 * imul 10
14      cycles for 100 * lea: *10
2305    cycles for 100 * lodsd (25 DWORDs)
2823    cycles for 100 * mov eax, [esi] + add esi, 4
24      cycles for 100 * lea10, add eax
14      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

??      cycles for 100 * imul 10
14      cycles for 100 * lea: *10
2332    cycles for 100 * lodsd (25 DWORDs)
2857    cycles for 100 * mov eax, [esi] + add esi, 4
25      cycles for 100 * lea10, add eax
13      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

??      cycles for 100 * imul 10
16      cycles for 100 * lea: *10
2302    cycles for 100 * lodsd (25 DWORDs)
2837    cycles for 100 * mov eax, [esi] + add esi, 4
26      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
??      cycles for 100 * bswap
??      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


-
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on November 24, 2021, 06:00:57 PM

Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (SSE4)

2       cycles for 100 * imul 10
15      cycles for 100 * lea: *10
2160    cycles for 100 * lodsd (25 DWORDs)
2232    cycles for 100 * mov eax, [esi] + add esi, 4
28      cycles for 100 * lea10, add eax
10      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

2       cycles for 100 * imul 10
18      cycles for 100 * lea: *10
2117    cycles for 100 * lodsd (25 DWORDs)
2287    cycles for 100 * mov eax, [esi] + add esi, 4
25      cycles for 100 * lea10, add eax
13      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

1       cycles for 100 * imul 10
12      cycles for 100 * lea: *10
2086    cycles for 100 * lodsd (25 DWORDs)
2171    cycles for 100 * mov eax, [esi] + add esi, 4
24      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap
??      cycles for 100 * ror 16

0       cycles for 100 * imul 10
21      cycles for 100 * lea: *10
2143    cycles for 100 * lodsd (25 DWORDs)
2127    cycles for 100 * mov eax, [esi] + add esi, 4
25      cycles for 100 * lea10, add eax
28      cycles for 100 * lea10, shl eax, 1
1       cycles for 100 * bswap
??      cycles for 100 * ror 16

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs)
18      bytes for mov eax, [esi] + add esi, 4
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
7       bytes for bswap
8       bytes for ror 16


-
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: TimoVJL on November 24, 2021, 07:49:13 PM
Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
AMD Ryzen 9 5950X 16-Core Processor             (SSE4)

cycles for 100 * imul 10 2 ?? ?? 23 ?? ?? 13 26 ??
cycles for 100 * lea: *10 15 11 21 54 21 11 29 9 72
cycles for 100 * lodsd (25 DWORDs) 2160 2315 2606 4871 2563 2315 2929 1765 6976 5630
cycles for 100 * mov eax, [esi] + add 2232 2860 2507 4863 2653 2860 2834 1689 2599 2036
cycles for 100 * lea10, add eax 28 24 30 58 28 24 34 4 28
cycles for 100 * lea10, shl eax, 1 10 17 21 57 31 17 29 16 38
cycles for 100 * bswap 5 27 102 2 6 38
cycles for 100 * ror 16 ?? 6 62 14 ?? 19

cycles for 100 * imul 10 2 ?? 26 22 36
cycles for 100 * lea: *10 18 14 24 55 121 14 28 8 28
cycles for 100 * lodsd (25 DWORDs) 2117 2305 2651 4881 4502 2305 2808 1566 6910 5610
cycles for 100 * mov eax, [esi] + add 2287 2823 2532 4890 4379 2823 2931 1764 2593 2062
cycles for 100 * lea10, add eax 25 24 20 56 57 24 28 2 28
cycles for 100 * lea10, shl eax, 1 13 14 36 55 64 14 30 8 31
cycles for 100 * bswap 14 1 20 14 13 2 31
cycles for 100 * ror 16 14 2 55 6 14 7 9 2

cycles for 100 * imul 10 1 ?? ?? 24 3 ?? 30 ??
cycles for 100 * lea: *10 12 14 26 53 26 14 27 1 32
cycles for 100 * lodsd (25 DWORDs) 2086 2332 2688 4829 2708 2332 2853 2126 6913 5787
cycles for 100 * mov eax, [esi] + add 2171 2857 2459 4894 3182 2857 2832 1734 2582 2020
cycles for 100 * lea10, add eax 24 25 29 56 27 25 28 15 35
cycles for 100 * lea10, shl eax, 1 16 13 23 54 23 13 30 26 28
cycles for 100 * bswap 13 20 13 3 28
cycles for 100 * ror 16 13 5 55 13 13 7 ?? 6

cycles for 100 * imul 10 ?? 3 30 ?? 53 29 3
cycles for 100 * lea: *10 21 16 30 60 32 16 27 4 46
cycles for 100 * lodsd (25 DWORDs) 2143 2302 2544 4921 2617 2302 2808 1551 6928 5600
cycles for 100 * mov eax, [esi] + add 2127 2837 2604 4810 2599 2837 2846 1667 2594 2051
cycles for 100 * lea10, add eax 25 26 34 55 21 26 31 39 32
cycles for 100 * lea10, shl eax, 1 28 17 27 54 29 17 30 5 34 4
cycles for 100 * bswap 1 17 20 5 17 2 5 34
cycles for 100 * ror 16 17 1 55 2 17 9 5 34

bytes for imul 10 8
bytes for lea: *10 11
bytes for lodsd (25 DWORDs) 14
bytes for mov eax, [esi] + add esi, 4 18
bytes for lea10, add eax 10
bytes for lea10, shl eax, 1 10
bytes for bswap 7
bytes for ror 16 8
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 24, 2021, 08:11:12 PM
Impressive work, Timo :thumbsup:

Sub ImportValues()
    Dim Row As Long, Col As Long, nCurRow As Long, nClk As Long
    Dim nFileNro As Integer
    Dim sFileName As String, sTextRow As String

    Row = 3
    Col = 2
    ' Find the first empty column in row 3
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Col = Col + 1
        Wend
    End With
    ' Let the user pick the text file with the timings
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub      ' dialog was cancelled
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    nCurRow = 0
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        ' The first line of the file goes into the column header
        If nCurRow = 0 Then ActiveSheet.Cells(1, Col) = sTextRow
        nCurRow = nCurRow + 1
        If nCurRow >= 3 Then
            nClk = 0                    ' reset, so a stale value is never written
            If Left(sTextRow, 1) = "?" Then
                ActiveSheet.Cells(Row, Col) = "??"
            Else
                nClk = Val(Left(sTextRow, 5))
            End If
            ' Results of 0 cycles and blank lines leave the cell empty
            If nClk > 0 Then ActiveSheet.Cells(Row, Col) = nClk
            Row = Row + 1
        End If
        If nCurRow = 38 Then Exit Do
    Loop
    Close #nFileNro
End Sub
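
For readers without Excel, the line-parsing rule of the macro can be sketched in Python. `parse_timing_line` is a hypothetical helper; it assumes result lines look like those posted in this thread (a cycle count or `??`, followed by the description). Unlike the macro, it keeps a measured 0 as 0 instead of leaving the cell empty.

```python
PREFIX = "cycles for 100 * "


def parse_timing_line(line):
    """Return (cycles, label) for one result line, or None for a blank line.

    cycles is None when the testbed printed '??' instead of a number.
    """
    text = line.strip()
    if not text:
        return None
    value, _, label = text.partition(" ")
    cycles = None if value.startswith("?") else int(value)
    label = label.strip()
    if label.startswith(PREFIX):
        label = label[len(PREFIX):]  # keep only the description of the test
    return cycles, label


for sample in ("23      cycles for 100 * imul 10", "?? cycles for 100 * bswap"):
    print(parse_timing_line(sample))
```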
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on November 24, 2021, 08:33:15 PM
Sad to say I have nothing that will open an Excel file.
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on November 24, 2021, 09:20:18 PM
Quote from: hutch-- on November 24, 2021, 08:33:15 PM
Sad to say I have nothing that will open an Excel file.

This one should work, and it usually also adds an advanced RichEd20 version somewhere in C:\Program Files (x86)\Common Files\microsoft shared\OFFICE*\RICHED20.DLL

https://filehippo.com/download_microsoft-excel-viewer/

No idea how useful that is, but here are the averages from Timo's table:

19 imul 10
28 lea * 10
3454 lodsd (25 DWORDs)
2769 mov eax, [esi] + add
30 lea10, add eax
26 lea10, shl eax, 1
18 bswap
18 ror 16


I always liked imul :cool:
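
For reference, the multiply-by-ten variants all compute the same 32-bit result. Below is a Python model, assuming that the lea variants are lea eax, [eax+eax*4] followed by add eax, eax or shl eax, 1. The testbed source is not shown in this thread, so the exact instruction sequences and the function names are assumptions.

```python
MASK32 = 0xFFFFFFFF


def imul10(eax):
    """imul eax, eax, 10: low 32 bits of the product."""
    return (eax * 10) & MASK32


def lea5(eax):
    """lea eax, [eax+eax*4]: eax * 5, computed by the address unit."""
    return (eax + eax * 4) & MASK32


def lea10_add(eax):
    """lea eax, [eax+eax*4] followed by add eax, eax."""
    eax = lea5(eax)
    return (eax + eax) & MASK32


def lea10_shl(eax):
    """lea eax, [eax+eax*4] followed by shl eax, 1."""
    return (lea5(eax) << 1) & MASK32


for x in (7, 123456789, 0xFFFFFFFF):
    print(imul10(x), lea10_add(x), lea10_shl(x))
```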
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: TimoVJL on November 24, 2021, 10:23:40 PM
Free Excel Viewer 2.2 (https://www.majorgeeks.com/files/details/free_excel_viewer.html)

Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on November 25, 2021, 04:04:03 AM
OK, that worked after I renamed the extension back to XLS.

The real use of the data was the comparison between the old i7 and the Xeon. Both have Haswell cores but there are some interesting variations.
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: TimoVJL on November 25, 2021, 06:53:31 PM
Tabular data.
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on December 01, 2021, 11:14:37 AM
First timings with the new machine :cool:

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

0       cycles for 100 * imul 10
30      cycles for 100 * lea: *10
488     cycles for 100 * lodsd (25 DWORDs) - 100*
162     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
29      cycles for 100 * lea10, add eax
38      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

3       cycles for 100 * imul 10
29      cycles for 100 * lea: *10
489     cycles for 100 * lodsd (25 DWORDs) - 100*
162     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
30      cycles for 100 * lea10, add eax
29      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
1       cycles for 100 * xor eax, ecx

4       cycles for 100 * imul 10
28      cycles for 100 * lea: *10
487     cycles for 100 * lodsd (25 DWORDs) - 100*
170     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32      cycles for 100 * lea10, add eax
30      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs) - 100*
18      bytes for mov eax, [esi] + add esi, 4 - 100*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on December 01, 2021, 12:24:45 PM
This is on my old i7.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

0       cycles for 100 * imul 10
18      cycles for 100 * lea: *10
213     cycles for 100 * lodsd (25 DWORDs) - 100*
196     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
7       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
19      cycles for 100 * lea: *10
213     cycles for 100 * lodsd (25 DWORDs) - 100*
193     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
6       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
19      cycles for 100 * lea: *10
214     cycles for 100 * lodsd (25 DWORDs) - 100*
193     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
7       cycles for 100 * xor eax, ecx

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs) - 100*
18      bytes for mov eax, [esi] + add esi, 4 - 100*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx


-
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: LiaoMi on December 01, 2021, 06:01:02 PM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

41      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
116     cycles for 100 * lodsd (25 DWORDs) - 100*
120     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
16      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
18      cycles for 100 * bswap+nop
11      cycles for 100 * xor eax, ecx

45      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
116     cycles for 100 * lodsd (25 DWORDs) - 100*
118     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
17      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
18      cycles for 100 * bswap+nop
12      cycles for 100 * xor eax, ecx

44      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
117     cycles for 100 * lodsd (25 DWORDs) - 100*
123     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
16      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
16      cycles for 100 * xor eax, ecx

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs) - 100*
18      bytes for mov eax, [esi] + add esi, 4 - 100*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx


--- ok ---
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: TimoVJL on December 01, 2021, 09:13:20 PM
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

0       cycles for 100 * imul 10
38      cycles for 100 * lea: *10
591     cycles for 100 * lodsd (25 DWORDs) - 100*
208     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
38      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
36      cycles for 100 * lea: *10
591     cycles for 100 * lodsd (25 DWORDs) - 100*
202     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
37      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
37      cycles for 100 * lea: *10
590     cycles for 100 * lodsd (25 DWORDs) - 100*
200     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
37      cycles for 100 * lea10, add eax
40      cycles for 100 * lea10, shl eax, 1
3       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

8       bytes for imul 10
11      bytes for lea: *10
14      bytes for lodsd (25 DWORDs) - 100*
18      bytes for mov eax, [esi] + add esi, 4 - 100*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx

Columns, left to right:
AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
AMD Athlon(tm) II X2 220 Processor (SSE3)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

                                            3150U  3400G  X2 220  11800H   5820K

cycles for 100 * imul 10                        0      0       3      41       0
cycles for 100 * lea: *10                      30     38      27      15      18
cycles for 100 * lodsd (25 DWORDs) - 100*     488    591     744     116     213
cycles for 100 * mov eax, [esi] + add esi,    162    208     459     120     196
cycles for 100 * lea10, add eax                29     38      16      16      32
cycles for 100 * lea10, shl eax, 1             38     37       3      17      17
cycles for 100 * bswap+nop                      0      0      27      18      16
cycles for 100 * xor eax, ecx                   0      0       5      11       7

cycles for 100 * imul 10                        3      0       3      45       0
cycles for 100 * lea: *10                      29     36      31      15      19
cycles for 100 * lodsd (25 DWORDs) - 100*     489    591     771     116     213
cycles for 100 * mov eax, [esi] + add esi,    162    202     479     118     193
cycles for 100 * lea10, add eax                30     37       4      17      32
cycles for 100 * lea10, shl eax, 1             29     37       4      16      16
cycles for 100 * bswap+nop                      0      0      29      18      16
cycles for 100 * xor eax, ecx                   1      0       0      12       6

cycles for 100 * imul 10                        4      0       2      44       0
cycles for 100 * lea: *10                      28     37      29      15      19
cycles for 100 * lodsd (25 DWORDs) - 100*     487    590     753     117     214
cycles for 100 * mov eax, [esi] + add esi,    170    200     459     123     193
cycles for 100 * lea10, add eax                32     37       6      16      32
cycles for 100 * lea10, shl eax, 1             30     40       1      16      16
cycles for 100 * bswap+nop                      0      3      25      16      16
cycles for 100 * xor eax, ecx                   0      0       3      16       7

@jj2007 Excel
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub   ' user cancelled the dialog
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0    ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow   ' title
        Row = Row + 1
        If Row >= 3 Then
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    ActiveSheet.Cells(Row, Col) = "??"   ' keep the marker; do not overwrite it with a stale nClk
                Else
                    nClk = Val(Left(sTextRow, 5))
                    'Debug.Print nClk
                    If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                End If
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, 8)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on December 01, 2021, 09:54:49 PM
Quote from: LiaoMi on December 01, 2021, 06:01:02 PM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

41      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
116     cycles for 100 * lodsd (25 DWORDs) - 100*
120     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*

The only CPU that performs badly for imul :cool:
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on December 01, 2021, 11:47:29 PM
I do wonder at the virtue of testing for "lodsb" without the "rep" prefix. It is generally seen that it's a very slow mnemonic when used by itself.
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: LiaoMi on December 02, 2021, 12:05:28 AM
Quote from: jj2007 on December 01, 2021, 09:54:49 PM
Quote from: LiaoMi on December 01, 2021, 06:01:02 PM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

41      cycles for 100 * imul 10
15      cycles for 100 * lea: *10
116     cycles for 100 * lodsd (25 DWORDs) - 100*
120     cycles for 100 * mov eax, [esi] + add esi, 4 - 100*

The only CPU that performs badly for imul :cool:

:biggrin: I noticed that a lot of instructions are slower than on the oldest processors, and some of them are extremely slow :rolleyes: That's the price of progress; most likely the reason lies in the microcode implementation. It's very sad :undecided:
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on December 02, 2021, 01:49:15 AM
Quote from: hutch-- on December 01, 2021, 11:47:29 PM
I do wonder at the virtue of testing for "lodsb" without the "rep" prefix. It is generally seen that it's a very slow mnemonic when used by itself.

Very good point :thumbsup:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

28      cycles for 100 * imul 10
688     cycles for 100 * rep lodsd (25 DWORDs) - 10*
405     cycles for 100 *     lodsd (25 DWORDs) - 10*
317     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
57      cycles for 100 * lea10, add eax
56      cycles for 100 * lea10, shl eax, 1
5       cycles for 100 * bswap+nop
19      cycles for 100 * xor eax, ecx

27      cycles for 100 * imul 10
688     cycles for 100 * rep lodsd (25 DWORDs) - 10*
406     cycles for 100 *     lodsd (25 DWORDs) - 10*
316     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
58      cycles for 100 * lea10, add eax
56      cycles for 100 * lea10, shl eax, 1
5       cycles for 100 * bswap+nop
16      cycles for 100 * xor eax, ecx

28      cycles for 100 * imul 10
686     cycles for 100 * rep lodsd (25 DWORDs) - 10*
406     cycles for 100 *     lodsd (25 DWORDs) - 10*
316     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
57      cycles for 100 * lea10, add eax
59      cycles for 100 * lea10, shl eax, 1
5       cycles for 100 * bswap+nop
16      cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: TimoVJL on December 02, 2021, 02:14:47 AM
AMD Athlon(tm) II X2 220 Processor (SSE3)

2       cycles for 100 * imul 10
458     cycles for 100 * rep lodsd (25 DWORDs) - 10*
785     cycles for 100 *     lodsd (25 DWORDs) - 10*
470     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
3       cycles for 100 * lea10, add eax
2       cycles for 100 * lea10, shl eax, 1
29      cycles for 100 * bswap+nop
1       cycles for 100 * xor eax, ecx

1       cycles for 100 * imul 10
458     cycles for 100 * rep lodsd (25 DWORDs) - 10*
799     cycles for 100 *     lodsd (25 DWORDs) - 10*
469     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
0       cycles for 100 * lea10, add eax
1       cycles for 100 * lea10, shl eax, 1
24      cycles for 100 * bswap+nop
5       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
465     cycles for 100 * rep lodsd (25 DWORDs) - 10*
764     cycles for 100 *     lodsd (25 DWORDs) - 10*
465     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
2       cycles for 100 * lea10, add eax
0       cycles for 100 * lea10, shl eax, 1
26      cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

0       cycles for 100 * imul 10
544     cycles for 100 * rep lodsd (25 DWORDs) - 10*
657     cycles for 100 *     lodsd (25 DWORDs) - 10*
199     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
37      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
0       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
546     cycles for 100 * rep lodsd (25 DWORDs) - 10*
657     cycles for 100 *     lodsd (25 DWORDs) - 10*
204     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
38      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
1       cycles for 100 * bswap+nop
4       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
546     cycles for 100 * rep lodsd (25 DWORDs) - 10*
655     cycles for 100 *     lodsd (25 DWORDs) - 10*
201     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
37      cycles for 100 * lea10, add eax
37      cycles for 100 * lea10, shl eax, 1
8       cycles for 100 * bswap+nop
0       cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx
handle tabs
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub   ' user cancelled the dialog
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0    ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow   ' title
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9))    ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    ActiveSheet.Cells(Row, Col) = "??"   ' keep the marker; do not overwrite it with a stale nClk
                Else
                    nClk = Val(Left(sTextRow, 5))
                    'Debug.Print nClk
                    If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                End If
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
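
For anyone without Excel at hand, the same import idea (take the leading cycle count and the test label from each line, skip "??" entries) can be sketched in Python rather than VBA. This is only a rough sketch: the names parse_timings, summarize and SAMPLE are made up for illustration and are not part of the testbed, and taking the median over the runs is my own assumption about how one might condense them. The sample text is a shortened piece of the i3-10110U output posted in this thread.

```python
import re
from collections import defaultdict
from statistics import median

# Matches lines such as "4871    cycles for 100 * lodsd (25 DWORDs)".
# The count may be separated by spaces or a tab; "??" is accepted and skipped later.
LINE_RE = re.compile(r"^\s*(\d+|\?\?)\s+cycles for 100 \*\s+(.+?)\s*$")

def parse_timings(text):
    """Return {test label: [cycle counts]} collected over all runs in text."""
    results = defaultdict(list)
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # CPU name, "bytes for" lines, blank lines, "--- ok ---"
        count, label = m.groups()
        if count != "??":  # unreadable timing: keep nothing for this run
            results[label].append(int(count))
    return dict(results)

def summarize(results):
    """Return {label: median cycles} for labels with at least one valid count."""
    return {label: median(values) for label, values in results.items() if values}

# Shortened excerpt of one posted result (hypothetical selection of rows).
SAMPLE = """\
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

1 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
1 cycles for 100 * bswap

0 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
?? cycles for 100 * bswap

8 bytes for imul 10
"""

if __name__ == "__main__":
    for label, cycles in summarize(parse_timings(SAMPLE)).items():
        print(f"{cycles:>8} {label}")
```

Matching on the "cycles for 100 *" text instead of a fixed column position means the sketch does not care whether the count is followed by spaces or a tab, which is the case the "handle tabs" version of the macro deals with.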
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: FORTRANS on December 02, 2021, 03:02:47 AM
Hi,

   Four systems with timings.

Cheers,

Steve N.

pre-P4 (SSE1)

101 cycles for 100 * imul 10
100 cycles for 100 * lea: *10
10717 cycles for 100 * lodsd (25 DWORDs)
8402 cycles for 100 * mov eax, [esi] + add esi, 4
100 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
111 cycles for 100 * ror 16

101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8391 cycles for 100 * mov eax, [esi] + add esi, 4
99 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
99 cycles for 100 * bswap
102 cycles for 100 * ror 16

101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8390 cycles for 100 * mov eax, [esi] + add esi, 4
99 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
100 cycles for 100 * ror 16

101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8401 cycles for 100 * mov eax, [esi] + add esi, 4
100 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
100 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
7437 cycles for 100 * lodsd (25 DWORDs)
5961 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
168 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
104 cycles for 100 * ror 16

164 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
7382 cycles for 100 * lodsd (25 DWORDs)
6006 cycles for 100 * mov eax, [esi] + add esi, 4
176 cycles for 100 * lea10, add eax
166 cycles for 100 * lea10, shl eax, 1
160 cycles for 100 * bswap
104 cycles for 100 * ror 16

163 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
7426 cycles for 100 * lodsd (25 DWORDs)
5964 cycles for 100 * mov eax, [esi] + add esi, 4
172 cycles for 100 * lea10, add eax
163 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
104 cycles for 100 * ror 16

154 cycles for 100 * imul 10
163 cycles for 100 * lea: *10
7390 cycles for 100 * lodsd (25 DWORDs)
5961 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
164 cycles for 100 * lea10, shl eax, 1
156 cycles for 100 * bswap
100 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2867 cycles for 100 * lodsd (25 DWORDs)
2990 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2872 cycles for 100 * lodsd (25 DWORDs)
2947 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
13 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
13 cycles for 100 * lea: *10
2843 cycles for 100 * lodsd (25 DWORDs)
3185 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
28 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2874 cycles for 100 * lodsd (25 DWORDs)
3088 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
7 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

1 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2322 cycles for 100 * lodsd (25 DWORDs)
2160 cycles for 100 * mov eax, [esi] + add esi, 4
21 cycles for 100 * lea10, add eax
21 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
0 cycles for 100 * ror 16

0 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2250 cycles for 100 * lodsd (25 DWORDs)
2167 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
6 cycles for 100 * ror 16

1 cycles for 100 * imul 10
30 cycles for 100 * lea: *10
2452 cycles for 100 * lodsd (25 DWORDs)
2134 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
22 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16

7 cycles for 100 * imul 10
25 cycles for 100 * lea: *10
2408 cycles for 100 * lodsd (25 DWORDs)
2158 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---


pre-P4 (SSE1)

101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
9606 cycles for 100 * rep lodsd (25 DWORDs)
8392 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

102 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
9604 cycles for 100 * rep lodsd (25 DWORDs)
8400 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

103 cycles for 100 * imul 10
113 cycles for 100 * lea: *10
9590 cycles for 100 * rep lodsd (25 DWORDs)
8403 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

102 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
9594 cycles for 100 * rep lodsd (25 DWORDs)
8391 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

163 cycles for 100 * imul 10
159 cycles for 100 * lea: *10
9053 cycles for 100 * rep lodsd (25 DWORDs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
167 cycles for 100 * lea10, add eax
160 cycles for 100 * lea10, shl eax, 1
159 cycles for 100 * bswap
104 cycles for 100 * ror 16

163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
9013 cycles for 100 * rep lodsd (25 DWORDs)
6009 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
163 cycles for 100 * lea10, shl eax, 1
159 cycles for 100 * bswap
104 cycles for 100 * ror 16

155 cycles for 100 * imul 10
159 cycles for 100 * lea: *10
9051 cycles for 100 * rep lodsd (25 DWORDs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
180 cycles for 100 * lea10, add eax
166 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
105 cycles for 100 * ror 16

164 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
9008 cycles for 100 * rep lodsd (25 DWORDs)
6009 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
165 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
97 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
9492 cycles for 100 * rep lodsd (25 DWORDs)
3181 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
14 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
19 cycles for 100 * lea: *10
10011 cycles for 100 * rep lodsd (25 DWORDs)
3240 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
7 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
9746 cycles for 100 * rep lodsd (25 DWORDs)
3137 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
12 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
11 cycles for 100 * lea: *10
9658 cycles for 100 * rep lodsd (25 DWORDs)
3242 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
9 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

52 cycles for 100 * imul 10
85 cycles for 100 * lea: *10
10853 cycles for 100 * rep lodsd (25 DWORDs)
4703 cycles for 100 * mov eax, [esi] + add esi, 4
71 cycles for 100 * lea10, add eax
67 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap
37 cycles for 100 * ror 16

39 cycles for 100 * imul 10
97 cycles for 100 * lea: *10
10803 cycles for 100 * rep lodsd (25 DWORDs)
3588 cycles for 100 * mov eax, [esi] + add esi, 4
18 cycles for 100 * lea10, add eax
2 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

32 cycles for 100 * imul 10
68 cycles for 100 * lea: *10
8343 cycles for 100 * rep lodsd (25 DWORDs)
2578 cycles for 100 * mov eax, [esi] + add esi, 4
8 cycles for 100 * lea10, add eax
3 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
3 cycles for 100 * lea: *10
8712 cycles for 100 * rep lodsd (25 DWORDs)
3064 cycles for 100 * mov eax, [esi] + add esi, 4
0 cycles for 100 * lea10, add eax
5 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---


pre-P4 (SSE1)

101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34862 cycles for 100 * rep lodsB (100 BYTEs)
8403 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
103 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34869 cycles for 100 * rep lodsB (100 BYTEs)
8400 cycles for 100 * mov eax, [esi] + add esi, 4
102 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
103 cycles for 100 * bswap
102 cycles for 100 * ror 16

102 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16

102 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34864 cycles for 100 * rep lodsB (100 BYTEs)
8392 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
101 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
32678 cycles for 100 * rep lodsB (100 BYTEs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
164 cycles for 100 * lea10, shl eax, 1
156 cycles for 100 * bswap
100 cycles for 100 * ror 16

155 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
32653 cycles for 100 * rep lodsB (100 BYTEs)
5966 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
167 cycles for 100 * lea10, shl eax, 1
166 cycles for 100 * bswap
104 cycles for 100 * ror 16

163 cycles for 100 * imul 10
169 cycles for 100 * lea: *10
32663 cycles for 100 * rep lodsB (100 BYTEs)
5979 cycles for 100 * mov eax, [esi] + add esi, 4
176 cycles for 100 * lea10, add eax
168 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
104 cycles for 100 * ror 16

162 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
32673 cycles for 100 * rep lodsB (100 BYTEs)
5981 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
159 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
96 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
25205 cycles for 100 * rep lodsB (100 BYTEs)
3292 cycles for 100 * mov eax, [esi] + add esi, 4
51 cycles for 100 * lea10, add eax
38 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
31 cycles for 100 * lea: *10
25332 cycles for 100 * rep lodsB (100 BYTEs)
3306 cycles for 100 * mov eax, [esi] + add esi, 4
30 cycles for 100 * lea10, add eax
70 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
12 cycles for 100 * lea: *10
25632 cycles for 100 * rep lodsB (100 BYTEs)
3298 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

?? cycles for 100 * imul 10
10 cycles for 100 * lea: *10
25643 cycles for 100 * rep lodsB (100 BYTEs)
3205 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

0 cycles for 100 * imul 10
26 cycles for 100 * lea: *10
16730 cycles for 100 * rep lodsB (100 BYTEs)
2255 cycles for 100 * mov eax, [esi] + add esi, 4
91 cycles for 100 * lea10, add eax
71 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

0 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
16499 cycles for 100 * rep lodsB (100 BYTEs)
2099 cycles for 100 * mov eax, [esi] + add esi, 4
23 cycles for 100 * lea10, add eax
18 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16

8 cycles for 100 * imul 10
20 cycles for 100 * lea: *10
16543 cycles for 100 * rep lodsB (100 BYTEs)
2269 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
8 cycles for 100 * bswap
0 cycles for 100 * ror 16

1 cycles for 100 * imul 10
38 cycles for 100 * lea: *10
19057 cycles for 100 * rep lodsB (100 BYTEs)
2247 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
3 cycles for 100 * ror 16

8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16


--- ok ---
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: LiaoMi on December 02, 2021, 05:17:55 AM
Quote from: jj2007 on December 02, 2021, 01:49:15 AM
Quote from: hutch-- on December 01, 2021, 11:47:29 PM
I do wonder at the virtue of testing for "lodsb" without the "rep" prefix. It is generally seen that it's a very slow mnemonic when used by itself.

Very good point :thumbsup:

11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)

43      cycles for 100 * imul 10
511     cycles for 100 * rep lodsd (25 DWORDs) - 10*
123     cycles for 100 *     lodsd (25 DWORDs) - 10*
122     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
17      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
18      cycles for 100 * bswap+nop
13      cycles for 100 * xor eax, ecx

43      cycles for 100 * imul 10
508     cycles for 100 * rep lodsd (25 DWORDs) - 10*
122     cycles for 100 *     lodsd (25 DWORDs) - 10*
121     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
18      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
18      cycles for 100 * bswap+nop
14      cycles for 100 * xor eax, ecx

44      cycles for 100 * imul 10
506     cycles for 100 * rep lodsd (25 DWORDs) - 10*
121     cycles for 100 *     lodsd (25 DWORDs) - 10*
122     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
18      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
19      cycles for 100 * bswap+nop
13      cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx


--- ok ---
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: hutch-- on December 02, 2021, 07:16:44 AM

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

0       cycles for 100 * imul 10
701     cycles for 100 * rep lodsd (25 DWORDs) - 10*
238     cycles for 100 *     lodsd (25 DWORDs) - 10*
199     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
32      cycles for 100 * lea10, add eax
16      cycles for 100 * lea10, shl eax, 1
14      cycles for 100 * bswap+nop
7       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
701     cycles for 100 * rep lodsd (25 DWORDs) - 10*
238     cycles for 100 *     lodsd (25 DWORDs) - 10*
196     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
31      cycles for 100 * lea10, add eax
17      cycles for 100 * lea10, shl eax, 1
16      cycles for 100 * bswap+nop
3       cycles for 100 * xor eax, ecx

0       cycles for 100 * imul 10
702     cycles for 100 * rep lodsd (25 DWORDs) - 10*
238     cycles for 100 *     lodsd (25 DWORDs) - 10*
194     cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
32      cycles for 100 * lea10, add eax
18      cycles for 100 * lea10, shl eax, 1
15      cycles for 100 * bswap+nop
2       cycles for 100 * xor eax, ecx

8       bytes for imul 10
12      bytes for rep lodsd (25 DWORDs) - 10*
14      bytes for     lodsd (25 DWORDs) - 10*
18      bytes for mov eax, [esi] + add esi, 4 - 10*
10      bytes for lea10, add eax
10      bytes for lea10, shl eax, 1
3       bytes for bswap+nop
2       bytes for xor eax, ecx



Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on December 02, 2021, 09:41:23 AM
Quote from: FORTRANS on December 02, 2021, 03:02:47 AM
   Four systems with timings.

34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4


The second row should be mov al, [esi] + inc esi; otherwise it's an unfair comparison (one byte per step versus four).
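To make the size mismatch concrete, here is a rough C model (not the testbed's code; the function names are made up for illustration) that simply counts how many load steps each variant needs for the same 100-byte buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of "mov al, [esi] + inc esi": one load per byte. */
size_t loads_bytewise(const uint8_t *src, size_t len, uint8_t *last)
{
    size_t loads = 0;
    for (size_t i = 0; i < len; i++) {
        *last = src[i];          /* mov al, [esi] */
        loads++;                 /* inc esi is the i++ above */
    }
    return loads;
}

/* Hypothetical model of "mov eax, [esi] + add esi, 4": one load per DWORD. */
size_t loads_dwordwise(const uint8_t *src, size_t len, uint32_t *last)
{
    assert(len % 4 == 0);        /* this sketch only handles whole DWORDs */
    size_t loads = 0;
    for (size_t i = 0; i < len; i += 4) {
        memcpy(last, src + i, 4); /* mov eax, [esi] */
        loads++;                  /* add esi, 4 is the i += 4 above */
    }
    return loads;
}
```

For a 100-byte buffer the byte version performs 100 loads and the DWORD version 25, so a byte-wise rep lodsB can only be compared fairly with a byte-wise loop.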
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: FORTRANS on December 03, 2021, 12:51:20 AM
Hi Jochen,

Quote from: jj2007 on December 02, 2021, 09:41:23 AM
Quote from: FORTRANS on December 02, 2021, 03:02:47 AM
   Four systems with timings.

34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4


The second row should be mov al, [esi] + inc esi; otherwise it's an unfair comparison (one byte per step versus four).

   Well, the idea was that Intel had improved LODSB in some of their CPUs.
Or maybe not.  Anyway.

pre-P4 (SSE1)

101 cycles for 100 * imul 10
34863 cycles for 100 * rep lodsB (100 BYTEs)
28500 cycles for 100 * mov AL, [esi] + INC esi

101 cycles for 100 * imul 10
34882 cycles for 100 * rep lodsB (100 BYTEs)
28482 cycles for 100 * mov AL, [esi] + INC esi

101 cycles for 100 * imul 10
34861 cycles for 100 * rep lodsB (100 BYTEs)
28502 cycles for 100 * mov AL, [esi] + INC esi

101 cycles for 100 * imul 10
34876 cycles for 100 * rep lodsB (100 BYTEs)
28492 cycles for 100 * mov AL, [esi] + INC esi

8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

162 cycles for 100 * imul 10
32603 cycles for 100 * rep lodsB (100 BYTEs)
24696 cycles for 100 * mov AL, [esi] + INC esi

162 cycles for 100 * imul 10
32611 cycles for 100 * rep lodsB (100 BYTEs)
24699 cycles for 100 * mov AL, [esi] + INC esi

156 cycles for 100 * imul 10
32592 cycles for 100 * rep lodsB (100 BYTEs)
24726 cycles for 100 * mov AL, [esi] + INC esi

162 cycles for 100 * imul 10
32608 cycles for 100 * rep lodsB (100 BYTEs)
24705 cycles for 100 * mov AL, [esi] + INC esi

8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi


--- ok ---

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

?? cycles for 100 * imul 10
24755 cycles for 100 * rep lodsB (100 BYTEs)
14615 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
24772 cycles for 100 * rep lodsB (100 BYTEs)
14738 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
24759 cycles for 100 * rep lodsB (100 BYTEs)
14621 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
24760 cycles for 100 * rep lodsB (100 BYTEs)
14622 cycles for 100 * mov AL, [esi] + INC esi

8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi


--- ok ---

Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)

?? cycles for 100 * imul 10
17094 cycles for 100 * rep lodsB (100 BYTEs)
8491 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
16367 cycles for 100 * rep lodsB (100 BYTEs)
8473 cycles for 100 * mov AL, [esi] + INC esi

?? cycles for 100 * imul 10
16269 cycles for 100 * rep lodsB (100 BYTEs)
8559 cycles for 100 * mov AL, [esi] + INC esi

1 cycles for 100 * imul 10
16111 cycles for 100 * rep lodsB (100 BYTEs)
8466 cycles for 100 * mov AL, [esi] + INC esi

8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi


--- ok ---


Regards,

Steve
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on December 03, 2021, 02:37:47 AM
Quote from: FORTRANS on December 03, 2021, 12:51:20 AM
  Well, the idea was that Intel had improved LODSB in some of their CPUs.

Hi Steve,

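I haven't tested lodsb, but in the timings above lodsd is about as fast as the expanded mov eax, [esi] plus add esi, 4 on the i7-11800H (roughly 122 cycles each), and somewhat slower on the i7-5820K (238 vs. about 196 cycles). What is striking, though, is that rep lodsd is several times slower than the equivalent loop on both machines (about 510 and 700 cycles).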
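As a side note on what "expanded" means here, the following is a small C model (an illustration only; the Cpu struct and function names are assumptions, not part of the testbed). It shows that lodsd and mov eax, [esi] + add esi, 4 do the same work as long as the direction flag is clear; with DF set, lodsd steps backwards, which the expanded pair does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register model: just the state these instructions touch. */
typedef struct {
    uint32_t eax;        /* accumulator */
    const uint8_t *esi;  /* source pointer */
    bool df;             /* direction flag: false = forward (cld) */
} Cpu;

/* lodsd: eax = [esi], then esi moves by 4 in the direction given by DF. */
void lodsd(Cpu *c)
{
    assert(c->esi != NULL);
    memcpy(&c->eax, c->esi, sizeof c->eax);
    c->esi += c->df ? -4 : 4;
}

/* Expanded form: mov eax, [esi] followed by add esi, 4 (always forward). */
void mov_then_add(Cpu *c)
{
    assert(c->esi != NULL);
    memcpy(&c->eax, c->esi, sizeof c->eax);
    c->esi += 4;
}
```

Starting both from the same pointer with df = false leaves eax and esi identical, which is why the two rows in the timings are directly comparable.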
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: FORTRANS on December 03, 2021, 08:22:44 AM
Hi,

   My memory was wrong.  It is faster REP MOVS/STOS, not LODS, that can
be identified with CPUID.  REP LODS is a bit useless anyway, since each
iteration overwrites the previously loaded value.  Oh well, maybe next
time.
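For reference, a minimal C sketch of such a check. It assumes the feature meant is Intel's "Enhanced REP MOVSB/STOSB" (ERMSB) flag, which is reported in CPUID leaf 7, sub-leaf 0, EBX bit 9; the post itself does not name the flag. The helper names are made up, and the register values are passed in as plain numbers so the logic can be read without executing cpuid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: ERMSB is the flag meant above (CPUID.(EAX=7,ECX=0):EBX bit 9). */
#define ERMSB_BIT 9u

/* Return one bit of a 32-bit register value returned by cpuid. */
bool cpuid_bit(uint32_t reg, unsigned bit)
{
    assert(bit < 32);            /* shifting by 32 or more is undefined */
    return ((reg >> bit) & 1u) != 0;
}

/* max_basic_leaf: EAX after cpuid with eax=0.
   leaf7_ebx:      EBX after cpuid with eax=7, ecx=0. */
bool has_ermsb(uint32_t max_basic_leaf, uint32_t leaf7_ebx)
{
    if (max_basic_leaf < 7)      /* leaf 7 not implemented on this CPU */
        return false;
    return cpuid_bit(leaf7_ebx, ERMSB_BIT);
}
```

In real code the two inputs would come from executing cpuid; the flag covers REP MOVSB and REP STOSB only and says nothing about REP LODS.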

Regards,

Steve N.
Title: Re: Timings for bswap, ror, imul, push+pop vs mov [esp+x], nnn, lodsd vs mov eax,..
Post by: jj2007 on December 03, 2021, 09:33:31 AM
Quote from: FORTRANS on December 03, 2021, 08:22:44 AM
REP LODS is a bit useless

Good point :biggrin:
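Why it is useless can be seen from a small C model of rep lodsb (a sketch with made-up names, assuming the direction flag is clear): every iteration writes to the same AL, so only the last byte survives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the registers rep lodsb touches. */
typedef struct {
    uint8_t al;          /* destination of every single load */
    const uint8_t *esi;  /* source pointer */
    uint32_t ecx;        /* repeat count */
} RepState;

/* rep lodsb with DF clear: repeat "al = [esi]; esi++; ecx--" until ecx is 0. */
void rep_lodsb(RepState *s)
{
    assert(s->esi != NULL || s->ecx == 0);
    while (s->ecx != 0) {
        s->al = *s->esi;  /* overwrites whatever the previous step loaded */
        s->esi++;
        s->ecx--;
    }
}
```

After running it over the bytes 10, 20, 30, 40 with ecx = 4, al holds 40, ecx is 0 and esi points past the buffer; the first three loads leave no trace.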