Following frequent discussions about old and slow instructions (such as bswap (http://masm32.com/board/index.php?topic=9624.msg105534#msg105534)), here is a little testbed.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
23 cycles for 100 * imul 10
54 cycles for 100 * lea: *10
4871 cycles for 100 * lodsd (25 DWORDs)
4863 cycles for 100 * mov eax, [esi] + add esi, 4
58 cycles for 100 * lea10, add eax
57 cycles for 100 * lea10, shl eax, 1
27 cycles for 100 * bswap
62 cycles for 100 * ror 16
26 cycles for 100 * imul 10
55 cycles for 100 * lea: *10
4881 cycles for 100 * lodsd (25 DWORDs)
4890 cycles for 100 * mov eax, [esi] + add esi, 4
56 cycles for 100 * lea10, add eax
55 cycles for 100 * lea10, shl eax, 1
20 cycles for 100 * bswap
55 cycles for 100 * ror 16
24 cycles for 100 * imul 10
53 cycles for 100 * lea: *10
4829 cycles for 100 * lodsd (25 DWORDs)
4894 cycles for 100 * mov eax, [esi] + add esi, 4
56 cycles for 100 * lea10, add eax
54 cycles for 100 * lea10, shl eax, 1
20 cycles for 100 * bswap
55 cycles for 100 * ror 16
30 cycles for 100 * imul 10
60 cycles for 100 * lea: *10
4921 cycles for 100 * lodsd (25 DWORDs)
4810 cycles for 100 * mov eax, [esi] + add esi, 4
55 cycles for 100 * lea10, add eax
54 cycles for 100 * lea10, shl eax, 1
20 cycles for 100 * bswap
55 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
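A side note on the "lea: *10" and "lea10" rows. The post does not show the instruction sequences, so this is an assumption: the usual trick is to build the factor 10 as (x + 4*x) * 2, i.e. something like lea eax, [eax+eax*4] followed by add eax, eax or shl eax, 1. A minimal Python sketch (function names are made up for illustration) that checks this arithmetic against a plain multiply, with 32-bit wraparound:

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register


def imul10(x):
    """imul eax, eax, 10 (low 32 bits of the product)."""
    return (x * 10) & MASK32


def lea10_add(x):
    """Assumed sequence: lea eax, [eax+eax*4] then add eax, eax."""
    x5 = (x + x * 4) & MASK32
    return (x5 + x5) & MASK32


def lea10_shl(x):
    """Assumed sequence: lea eax, [eax+eax*4] then shl eax, 1."""
    x5 = (x + x * 4) & MASK32
    return (x5 << 1) & MASK32


# All three variants agree, including when the result wraps around.
for x in (0, 1, 7, 123456789, 0xFFFFFFFF):
    assert imul10(x) == lea10_add(x) == lea10_shl(x)
```

If the testbed uses a different sequence, the identity x*10 = (x*5)*2 still holds; only the comments above would change.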
In a similar test, I get 0 cycles for bswap and very few cycles for xor eax, ecx:
26 cycles for 100 * imul 10
56 cycles for 100 * lea: *10
0 cycles for 100 * bswap
14 cycles for 100 * xor eax, ecx
Zero cycles probably means that a bswap gets executed in parallel, at no cost.
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
0 cycles for 100 * imul 10
0 cycles for 100 * lea: *10
5630 cycles for 100 * lodsd (25 DWORDs)
2036 cycles for 100 * mov eax, [esi] + add esi, 4
0 cycles for 100 * lea10, add eax
0 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16
0 cycles for 100 * imul 10
0 cycles for 100 * lea: *10
5610 cycles for 100 * lodsd (25 DWORDs)
2062 cycles for 100 * mov eax, [esi] + add esi, 4
0 cycles for 100 * lea10, add eax
0 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16
0 cycles for 100 * imul 10
0 cycles for 100 * lea: *10
5787 cycles for 100 * lodsd (25 DWORDs)
2020 cycles for 100 * mov eax, [esi] + add esi, 4
0 cycles for 100 * lea10, add eax
0 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16
0 cycles for 100 * imul 10
0 cycles for 100 * lea: *10
5600 cycles for 100 * lodsd (25 DWORDs)
2051 cycles for 100 * mov eax, [esi] + add esi, 4
0 cycles for 100 * lea10, add eax
4 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Nice to test new machine :thumbsup:
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
?? cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2606 cycles for 100 * lodsd (25 DWORDs)
2507 cycles for 100 * mov eax, [esi] + add esi, 4
30 cycles for 100 * lea10, add eax
21 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap
6 cycles for 100 * ror 16
?? cycles for 100 * imul 10
24 cycles for 100 * lea: *10
2651 cycles for 100 * lodsd (25 DWORDs)
2532 cycles for 100 * mov eax, [esi] + add esi, 4
20 cycles for 100 * lea10, add eax
36 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
2 cycles for 100 * ror 16
?? cycles for 100 * imul 10
26 cycles for 100 * lea: *10
2688 cycles for 100 * lodsd (25 DWORDs)
2459 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
23 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
5 cycles for 100 * ror 16
3 cycles for 100 * imul 10
30 cycles for 100 * lea: *10
2544 cycles for 100 * lodsd (25 DWORDs)
2604 cycles for 100 * mov eax, [esi] + add esi, 4
34 cycles for 100 * lea10, add eax
27 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
1 cycles for 100 * ror 16
Well, new box is not so slow :biggrin:
Very interesting. How should the 0 cycles be interpreted? Parallel execution, or an error in the overhead calculation?
So lodsd shines on Intel but sucks on AMD... I've just ordered a new notebook with an AMD Athlon Gold 3150U (https://www.cpubenchmark.net/cpu.php?cpu=AMD+Athlon+Gold+3150U&id=3777) :rolleyes:
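On the question of how to read the 0 cycles and the ?? entries: the thread does not show how the testbed computes its figures, so the following is only a guess at the mechanism. If the harness times an empty loop first and subtracts that overhead from each measurement, then an instruction that is completely hidden by parallel execution yields 0, and measurement jitter can push the difference below zero. Printing "??" for a negative result is an assumption here, as are the function name and the raw numbers.

```python
def net_cycles(measured, overhead):
    """Reported figure = measured loop time minus empty-loop overhead.

    Assumption: a negative difference is shown as '??' rather than a number.
    """
    net = measured - overhead
    return "??" if net < 0 else net


# Hypothetical raw counts for a loop whose empty version takes 1000 cycles.
print(net_cycles(1027, 1000))  # 27: the instruction adds visible cost
print(net_cycles(1000, 1000))  # 0: the instruction is fully hidden
print(net_cycles(998, 1000))   # ??: jitter made the test loop look faster
```

Under this model both explanations from the question can be true at once: the instruction really is free, and the overhead subtraction is what turns small timing noise into 0 or ??.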
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
?? cycles for 100 * imul 10
72 cycles for 100 * lea: *10
6976 cycles for 100 * lodsd (25 DWORDs)
2599 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
38 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
0 cycles for 100 * ror 16
0 cycles for 100 * imul 10
28 cycles for 100 * lea: *10
6910 cycles for 100 * lodsd (25 DWORDs)
2593 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
31 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
2 cycles for 100 * ror 16
?? cycles for 100 * imul 10
32 cycles for 100 * lea: *10
6913 cycles for 100 * lodsd (25 DWORDs)
2582 cycles for 100 * mov eax, [esi] + add esi, 4
35 cycles for 100 * lea10, add eax
28 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
6 cycles for 100 * ror 16
3 cycles for 100 * imul 10
46 cycles for 100 * lea: *10
6928 cycles for 100 * lodsd (25 DWORDs)
2594 cycles for 100 * mov eax, [esi] + add esi, 4
32 cycles for 100 * lea10, add eax
34 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
-
JJ,
I had a look at the specs for that AMD CPU, you will really appreciate the difference in performance. :thumbsup:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
13 cycles for 100 * imul 10
29 cycles for 100 * lea: *10
2929 cycles for 100 * lodsd (25 DWORDs)
2834 cycles for 100 * mov eax, [esi] + add esi, 4
34 cycles for 100 * lea10, add eax
29 cycles for 100 * lea10, shl eax, 1
2 cycles for 100 * bswap
19 cycles for 100 * ror 16
0 cycles for 100 * imul 10
28 cycles for 100 * lea: *10
2808 cycles for 100 * lodsd (25 DWORDs)
2931 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
13 cycles for 100 * bswap
7 cycles for 100 * ror 16
0 cycles for 100 * imul 10
27 cycles for 100 * lea: *10
2853 cycles for 100 * lodsd (25 DWORDs)
2832 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
3 cycles for 100 * bswap
7 cycles for 100 * ror 16
53 cycles for 100 * imul 10
27 cycles for 100 * lea: *10
2808 cycles for 100 * lodsd (25 DWORDs)
2846 cycles for 100 * mov eax, [esi] + add esi, 4
31 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
2 cycles for 100 * bswap
9 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
Quote from: hutch-- on November 16, 2021, 07:22:36 PM
I had a look at the specs for that AMD CPU, you will really appreciate the difference in performance. :thumbsup:
I hope so! Single thread performance is only 50% better, which is not so much. But I'll have a 256GB SSD instead of a ten year old traditional disk, and 12GB of RAM instead of 6GB.
Cpu benchmark/single thread performance
2046/1273 Intel Core i5-2450M
4112/1815 AMD Athlon Gold 3150U (2 cores)
deleted
Please post the executable, as testit.bat doesn't work, sorry
Cool results, Marinus. Seems Ryzen comes with the AD&D Time Stop spell :thumbsup: :greenclp:
Why does it only say SSE4 caps? Insufficient testing for AVX and AVX2 caps?
JJ, don't forget the better/bigger cache and faster RAM on your new computer: it will be faster in memory-intensive tasks.
Mine is great both for fast 3D rendering and for having fun learning AVX and AVX2.
Shouldn't it be spelled testIT instead? :bgrin:
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
?? cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2563 cycles for 100 * lodsd (25 DWORDs)
2653 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
31 cycles for 100 * lea10, shl eax, 1
102 cycles for 100 * bswap
14 cycles for 100 * ror 16
22 cycles for 100 * imul 10
121 cycles for 100 * lea: *10
4502 cycles for 100 * lodsd (25 DWORDs)
4379 cycles for 100 * mov eax, [esi] + add esi, 4
57 cycles for 100 * lea10, add eax
64 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
6 cycles for 100 * ror 16
3 cycles for 100 * imul 10
26 cycles for 100 * lea: *10
2708 cycles for 100 * lodsd (25 DWORDs)
3182 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
23 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
13 cycles for 100 * ror 16
0 cycles for 100 * imul 10
32 cycles for 100 * lea: *10
2617 cycles for 100 * lodsd (25 DWORDs)
2599 cycles for 100 * mov eax, [esi] + add esi, 4
21 cycles for 100 * lea10, add eax
29 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap
2 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
-
Interesting numbers
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
2606 cycles for 100 * lodsd (25 DWORDs)
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
2929 cycles for 100 * lodsd (25 DWORDs)
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
2563 cycles for 100 * lodsd (25 DWORDs)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
4871 cycles for 100 * lodsd (25 DWORDs)
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
1765 cycles for 100 * lodsd (25 DWORDs)
AMD Ryzen 9 5950X 16-Core Processor (SSE4)
5630 cycles for 100 * lodsd (25 DWORDs)
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
6976 cycles for 100 * lodsd (25 DWORDs)
AMD Athlon(tm) II X2 220 Processor (SSE3)
9202 cycles for 100 * lodsd (25 DWORDs)
Quote from: TimoVJL on November 17, 2021, 12:19:16 AM
Interesting numbers
Yes. It seems that Intel has invested in lodsd, while AMD hasn't. I use
lodsd whenever speed is not an issue, because it's a one-byter, while
mov eax, [esi] + add esi, 4 occupies 5 bytes in the instruction cache.
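The size argument can be checked against the standard 32-bit encodings: lodsd is the single opcode AD, mov eax, [esi] is 8B 06, and add esi, 4 with an 8-bit immediate is 83 C6 04. A small Python sketch that just counts the bytes (the variable names are only for illustration):

```python
# Machine code bytes in 32-bit mode
LODSD = bytes([0xAD])                # lodsd
MOV_EAX_ESI = bytes([0x8B, 0x06])    # mov eax, [esi]
ADD_ESI_4 = bytes([0x83, 0xC6, 0x04])  # add esi, 4 (imm8 form)

two_instruction_form = MOV_EAX_ESI + ADD_ESI_4

print(len(LODSD))                 # 1
print(len(two_instruction_form))  # 5
```

The 14 and 18 bytes reported by the testbed include the surrounding loop code, which is why they differ by 4, exactly the gap between the 1-byte and 5-byte forms.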
deleted
Quote from: jj2007 on November 16, 2021, 08:57:25 AM
Following frequent discussions about old and slow instructions (such as bswap (http://masm32.com/board/index.php?topic=9624.msg105534#msg105534)), here is a little testbed.
Hi jj2007,
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
26 cycles for 100 * imul 10
9 cycles for 100 * lea: *10
1765 cycles for 100 * lodsd (25 DWORDs)
1689 cycles for 100 * mov eax, [esi] + add esi, 4
4 cycles for 100 * lea10, add eax
16 cycles for 100 * lea10, shl eax, 1
6 cycles for 100 * bswap
0 cycles for 100 * ror 16
36 cycles for 100 * imul 10
8 cycles for 100 * lea: *10
1566 cycles for 100 * lodsd (25 DWORDs)
1764 cycles for 100 * mov eax, [esi] + add esi, 4
2 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
2 cycles for 100 * bswap
9 cycles for 100 * ror 16
30 cycles for 100 * imul 10
1 cycles for 100 * lea: *10
2126 cycles for 100 * lodsd (25 DWORDs)
1734 cycles for 100 * mov eax, [esi] + add esi, 4
15 cycles for 100 * lea10, add eax
26 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
?? cycles for 100 * ror 16
29 cycles for 100 * imul 10
4 cycles for 100 * lea: *10
1551 cycles for 100 * lodsd (25 DWORDs)
1667 cycles for 100 * mov eax, [esi] + add esi, 4
39 cycles for 100 * lea10, add eax
5 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Quote from: nidud on November 17, 2021, 01:00:25 AM
You also need to know a few basic things about programming and batch files, so better just ignore it.
:greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp: :greenclp:
Why don't you just post stuff that works with a standard Masm32 SDK?
Quote from: jj2007 on November 16, 2021, 11:05:52 PM
Please post the executable, as testit.bat doesn't work, sorry
Works well with the latest AsmC64 :thumbsup:
total [0 .. 3], 1++
108562 cycles 3.asm: mov reg32,reg32
201478 cycles 2.asm: bswap reg32
409369 cycles 1.asm: imul reg32,reg32,imm
415621 cycles 0.asm: lea reg64,mem64 * 2
534630 cycles 4.asm: push reg64
1676828 cycles 6.asm: cld
2252969 cycles 5.asm: std
Quote from: LiaoMi on November 17, 2021, 03:57:54 AM
Quote from: jj2007 on November 16, 2021, 08:57:25 AM
Following frequent discussions about old and slow instructions (such as bswap (http://masm32.com/board/index.php?topic=9624.msg105534#msg105534)), here is a little testbed.
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
Thanks. Timings look a bit unstable (that happens often), but yours confirm that bswap is very fast, and that
lodsd runs as fast as the 5-byter mov eax, [esi] + add esi, 4 - on Intel CPUs.
I confess I have never had any problems with BSWAP over the years, from my 486 DX onwards. It's originally an endian swap instruction, and at a hardware level it does not have to do much, so it has always been fast enough. From memory, there is an SSE masked instruction that was reasonably fast and will do the same task, but it's more work to do something that probably does not matter.
deleted
Quote from: nidud on November 17, 2021, 09:50:05 AM
Think the original point here was, as often is the case with BSWAP, that you didn't need to use it.
It's the same with some SIMD instructions: you can avoid them if you rearrange the data in the .data section from an array of structures (AoS) to a structure of arrays (SoA),
i.e. from xyzw, xyzw, xyzw, xyzw to xxxx, yyyy, zzzz, wwww.
@hutch, you mean pshufb (SSSE3), the byte shuffle: with the right constant you can perform 4 x 32-bit bswap in one go.
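The claim that one byte shuffle (pshufb, SSSE3) with the right constant does four 32-bit bswaps can be modeled in a few lines. This is a plain Python simulation of the documented behaviour, not SIMD code: each result byte is taken from the source byte selected by the low 4 bits of the mask byte, or set to zero if the mask byte has bit 7 set. The sample values are arbitrary.

```python
import struct


def bswap32(x):
    """Reverse the byte order of a 32-bit value, like bswap eax."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")


def pshufb(src, mask):
    """Model of pshufb on a 16-byte register."""
    return bytes(0 if m & 0x80 else src[m & 0x0F] for m in mask)


# Shuffle constant that reverses the bytes inside each of the four DWORDs.
BSWAP_MASK = bytes([3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12])

words = [0x12345678, 0xDEADBEEF, 0x00000001, 0xCAFEBABE]
packed = struct.pack("<4I", *words)  # four DWORDs as they sit in memory
swapped = struct.unpack("<4I", pshufb(packed, BSWAP_MASK))

print([hex(w) for w in swapped])
assert list(swapped) == [bswap32(w) for w in words]
```

Whether this is worth it depends on having four values to swap at once; for a single register, bswap remains the obvious choice.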
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)
3 cycles for 100 * imul 10
95 cycles for 100 * lea: *10
7998 cycles for 100 * lodsd (25 DWORDs)
6440 cycles for 100 * mov eax, [esi] + add esi, 4
91 cycles for 100 * lea10, add eax
91 cycles for 100 * lea10, shl eax, 1
91 cycles for 100 * bswap
?? cycles for 100 * ror 16
5 cycles for 100 * imul 10
95 cycles for 100 * lea: *10
8274 cycles for 100 * lodsd (25 DWORDs)
6525 cycles for 100 * mov eax, [esi] + add esi, 4
93 cycles for 100 * lea10, add eax
95 cycles for 100 * lea10, shl eax, 1
94 cycles for 100 * bswap
?? cycles for 100 * ror 16
2 cycles for 100 * imul 10
92 cycles for 100 * lea: *10
8117 cycles for 100 * lodsd (25 DWORDs)
6529 cycles for 100 * mov eax, [esi] + add esi, 4
92 cycles for 100 * lea10, add eax
90 cycles for 100 * lea10, shl eax, 1
90 cycles for 100 * bswap
0 cycles for 100 * ror 16
2 cycles for 100 * imul 10
92 cycles for 100 * lea: *10
8058 cycles for 100 * lodsd (25 DWORDs)
6591 cycles for 100 * mov eax, [esi] + add esi, 4
93 cycles for 100 * lea10, add eax
93 cycles for 100 * lea10, shl eax, 1
93 cycles for 100 * bswap
0 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
?? cycles for 100 * imul 10
11 cycles for 100 * lea: *10
2315 cycles for 100 * lodsd (25 DWORDs)
2860 cycles for 100 * mov eax, [esi] + add esi, 4
24 cycles for 100 * lea10, add eax
17 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
?? cycles for 100 * ror 16
0 cycles for 100 * imul 10
14 cycles for 100 * lea: *10
2305 cycles for 100 * lodsd (25 DWORDs)
2823 cycles for 100 * mov eax, [esi] + add esi, 4
24 cycles for 100 * lea10, add eax
14 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
14 cycles for 100 * lea: *10
2332 cycles for 100 * lodsd (25 DWORDs)
2857 cycles for 100 * mov eax, [esi] + add esi, 4
25 cycles for 100 * lea10, add eax
13 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
16 cycles for 100 * lea: *10
2302 cycles for 100 * lodsd (25 DWORDs)
2837 cycles for 100 * mov eax, [esi] + add esi, 4
26 cycles for 100 * lea10, add eax
17 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
-
Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (SSE4)
2 cycles for 100 * imul 10
15 cycles for 100 * lea: *10
2160 cycles for 100 * lodsd (25 DWORDs)
2232 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
10 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
?? cycles for 100 * ror 16
2 cycles for 100 * imul 10
18 cycles for 100 * lea: *10
2117 cycles for 100 * lodsd (25 DWORDs)
2287 cycles for 100 * mov eax, [esi] + add esi, 4
25 cycles for 100 * lea10, add eax
13 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
?? cycles for 100 * ror 16
1 cycles for 100 * imul 10
12 cycles for 100 * lea: *10
2086 cycles for 100 * lodsd (25 DWORDs)
2171 cycles for 100 * mov eax, [esi] + add esi, 4
24 cycles for 100 * lea10, add eax
16 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
?? cycles for 100 * ror 16
0 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2143 cycles for 100 * lodsd (25 DWORDs)
2127 cycles for 100 * mov eax, [esi] + add esi, 4
25 cycles for 100 * lea10, add eax
28 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
?? cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
-
Columns, left to right:
 1. Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (SSE4)
 2. Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
 3. Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz (SSE4)
 4. Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
 5. Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
 6. Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
 7. Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
 8. 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
 9. AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
10. AMD Ryzen 9 5950X 16-Core Processor (SSE4)
cycles for 100 * imul 10 2 ?? ?? 23 ?? ?? 13 26 ??
cycles for 100 * lea: *10 15 11 21 54 21 11 29 9 72
cycles for 100 * lodsd (25 DWORDs) 2160 2315 2606 4871 2563 2315 2929 1765 6976 5630
cycles for 100 * mov eax, [esi] + add 2232 2860 2507 4863 2653 2860 2834 1689 2599 2036
cycles for 100 * lea10, add eax 28 24 30 58 28 24 34 4 28
cycles for 100 * lea10, shl eax, 1 10 17 21 57 31 17 29 16 38
cycles for 100 * bswap 5 27 102 2 6 38
cycles for 100 * ror 16 ?? 6 62 14 ?? 19
cycles for 100 * imul 10 2 ?? 26 22 36
cycles for 100 * lea: *10 18 14 24 55 121 14 28 8 28
cycles for 100 * lodsd (25 DWORDs) 2117 2305 2651 4881 4502 2305 2808 1566 6910 5610
cycles for 100 * mov eax, [esi] + add 2287 2823 2532 4890 4379 2823 2931 1764 2593 2062
cycles for 100 * lea10, add eax 25 24 20 56 57 24 28 2 28
cycles for 100 * lea10, shl eax, 1 13 14 36 55 64 14 30 8 31
cycles for 100 * bswap 14 1 20 14 13 2 31
cycles for 100 * ror 16 14 2 55 6 14 7 9 2
cycles for 100 * imul 10 1 ?? ?? 24 3 ?? 30 ??
cycles for 100 * lea: *10 12 14 26 53 26 14 27 1 32
cycles for 100 * lodsd (25 DWORDs) 2086 2332 2688 4829 2708 2332 2853 2126 6913 5787
cycles for 100 * mov eax, [esi] + add 2171 2857 2459 4894 3182 2857 2832 1734 2582 2020
cycles for 100 * lea10, add eax 24 25 29 56 27 25 28 15 35
cycles for 100 * lea10, shl eax, 1 16 13 23 54 23 13 30 26 28
cycles for 100 * bswap 13 20 13 3 28
cycles for 100 * ror 16 13 5 55 13 13 7 ?? 6
cycles for 100 * imul 10 ?? 3 30 ?? 53 29 3
cycles for 100 * lea: *10 21 16 30 60 32 16 27 4 46
cycles for 100 * lodsd (25 DWORDs) 2143 2302 2544 4921 2617 2302 2808 1551 6928 5600
cycles for 100 * mov eax, [esi] + add 2127 2837 2604 4810 2599 2837 2846 1667 2594 2051
cycles for 100 * lea10, add eax 25 26 34 55 21 26 31 39 32
cycles for 100 * lea10, shl eax, 1 28 17 27 54 29 17 30 5 34 4
cycles for 100 * bswap 1 17 20 5 17 2 5 34
cycles for 100 * ror 16 17 1 55 2 17 9 5 34
bytes for imul 10 8
bytes for lea: *10 11
bytes for lodsd (25 DWORDs) 14
bytes for mov eax, [esi] + add esi, 4 18
bytes for lea10, add eax 10
bytes for lea10, shl eax, 1 10
bytes for bswap 7
bytes for ror 16 8
Impressive work, Timo :thumbsup:
Sub ImportValues()
    ' Find the first free column in row 3 (column 2 holds the first values)
    Row = 3
    Col = 2
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            'Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    ' Let the user pick the text file with the timings
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub ' dialog cancelled
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    nCurRow = 0
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If nCurRow = 0 Then ActiveSheet.Cells(1, Col) = sTextRow ' CPU name as column title
        nCurRow = nCurRow + 1
        If nCurRow >= 3 Then
            If Left(sTextRow, 1) = "?" Then
                ActiveSheet.Cells(Row, Col) = "??"
            Else
                ' Only numeric rows write a count, so "??" is never overwritten by the previous value
                nClk = Val(Left(sTextRow, 5))
                'Debug.Print nClk
                If nClk > 0 Then ActiveSheet.Cells(Row, Col) = nClk
            End If
            Row = Row + 1
        End If
        If nCurRow = 38 Then Exit Do
    Loop
    Close #nFileNro
End Sub
Sad to say I have nothing that will open an Excel file.
Quote from: hutch-- on November 24, 2021, 08:33:15 PM
Sad to say I have nothing that will open an Excel file.
This one should work, and it usually adds also an advanced RichEd20 version somewhere in C:\Program Files (x86)\Common Files\microsoft shared\OFFICE*\RICHED20.DLL
https://filehippo.com/download_microsoft-excel-viewer/
No idea how useful that is, but here are the averages from Timo's table:
19 imul 10
28 lea * 10
3454 lodsd (25 DWORDs)
2769 mov eax, [esi] + add
30 lea10, add eax
26 lea10, shl eax, 1
18 bswap
18 ror 16
I always liked imul :cool:
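For anyone who wants to redo such averages without Excel, here is a minimal Python sketch. It takes the cells of one instruction across all four runs and all ten columns of Timo's table, skips "??" entries, and rounds the mean. Only the two complete rows (lodsd and mov/add) are used; the helper name is made up.

```python
def mean_cycles(cells):
    """Average the numeric cells, skipping '??' and anything non-numeric."""
    values = [int(c) for c in cells if str(c).strip().isdigit()]
    return round(sum(values) / len(values)) if values else None


# lodsd (25 DWORDs): four runs x ten columns, copied from the table
lodsd = """
2160 2315 2606 4871 2563 2315 2929 1765 6976 5630
2117 2305 2651 4881 4502 2305 2808 1566 6910 5610
2086 2332 2688 4829 2708 2332 2853 2126 6913 5787
2143 2302 2544 4921 2617 2302 2808 1551 6928 5600
""".split()

# mov eax, [esi] + add esi, 4: same layout
mov_add = """
2232 2860 2507 4863 2653 2860 2834 1689 2599 2036
2287 2823 2532 4890 4379 2823 2931 1764 2593 2062
2171 2857 2459 4894 3182 2857 2832 1734 2582 2020
2127 2837 2604 4810 2599 2837 2846 1667 2594 2051
""".split()

print(mean_cycles(lodsd))    # 3454
print(mean_cycles(mov_add))  # 2769
```

Note that the i7-5820K appears twice in the table header with identical values, so that machine counts double in these averages.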
Free Excel Viewer 2.2 (https://www.majorgeeks.com/files/details/free_excel_viewer.html)
OK, that worked after I renamed the extension back to XLS.
The real use of the data was the comparison between the old i7 and the Xeon. Both have Haswell cores, but there are some interesting variations.
Tabular data.
First timings with the new machine :cool:
AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
0 cycles for 100 * imul 10
30 cycles for 100 * lea: *10
488 cycles for 100 * lodsd (25 DWORDs) - 100*
162 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
29 cycles for 100 * lea10, add eax
38 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap+nop
0 cycles for 100 * xor eax, ecx
3 cycles for 100 * imul 10
29 cycles for 100 * lea: *10
489 cycles for 100 * lodsd (25 DWORDs) - 100*
162 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
30 cycles for 100 * lea10, add eax
29 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap+nop
1 cycles for 100 * xor eax, ecx
4 cycles for 100 * imul 10
28 cycles for 100 * lea: *10
487 cycles for 100 * lodsd (25 DWORDs) - 100*
170 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32 cycles for 100 * lea10, add eax
30 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap+nop
0 cycles for 100 * xor eax, ecx
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs) - 100*
18 bytes for mov eax, [esi] + add esi, 4 - 100*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
This is on my old i7.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
0 cycles for 100 * imul 10
18 cycles for 100 * lea: *10
213 cycles for 100 * lodsd (25 DWORDs) - 100*
196 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32 cycles for 100 * lea10, add eax
17 cycles for 100 * lea10, shl eax, 1
16 cycles for 100 * bswap+nop
7 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
19 cycles for 100 * lea: *10
213 cycles for 100 * lodsd (25 DWORDs) - 100*
193 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32 cycles for 100 * lea10, add eax
16 cycles for 100 * lea10, shl eax, 1
16 cycles for 100 * bswap+nop
6 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
19 cycles for 100 * lea: *10
214 cycles for 100 * lodsd (25 DWORDs) - 100*
193 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
32 cycles for 100 * lea10, add eax
16 cycles for 100 * lea10, shl eax, 1
16 cycles for 100 * bswap+nop
7 cycles for 100 * xor eax, ecx
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs) - 100*
18 bytes for mov eax, [esi] + add esi, 4 - 100*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
-
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
41 cycles for 100 * imul 10
15 cycles for 100 * lea: *10
116 cycles for 100 * lodsd (25 DWORDs) - 100*
120 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
16 cycles for 100 * lea10, add eax
17 cycles for 100 * lea10, shl eax, 1
18 cycles for 100 * bswap+nop
11 cycles for 100 * xor eax, ecx
45 cycles for 100 * imul 10
15 cycles for 100 * lea: *10
116 cycles for 100 * lodsd (25 DWORDs) - 100*
118 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
17 cycles for 100 * lea10, add eax
16 cycles for 100 * lea10, shl eax, 1
18 cycles for 100 * bswap+nop
12 cycles for 100 * xor eax, ecx
44 cycles for 100 * imul 10
15 cycles for 100 * lea: *10
117 cycles for 100 * lodsd (25 DWORDs) - 100*
123 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
16 cycles for 100 * lea10, add eax
16 cycles for 100 * lea10, shl eax, 1
16 cycles for 100 * bswap+nop
16 cycles for 100 * xor eax, ecx
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs) - 100*
18 bytes for mov eax, [esi] + add esi, 4 - 100*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
--- ok ---
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
0 cycles for 100 * imul 10
38 cycles for 100 * lea: *10
591 cycles for 100 * lodsd (25 DWORDs) - 100*
208 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
38 cycles for 100 * lea10, add eax
37 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap+nop
0 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
36 cycles for 100 * lea: *10
591 cycles for 100 * lodsd (25 DWORDs) - 100*
202 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
37 cycles for 100 * lea10, add eax
37 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap+nop
0 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
37 cycles for 100 * lea: *10
590 cycles for 100 * lodsd (25 DWORDs) - 100*
200 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
37 cycles for 100 * lea10, add eax
40 cycles for 100 * lea10, shl eax, 1
3 cycles for 100 * bswap+nop
0 cycles for 100 * xor eax, ecx
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs) - 100*
18 bytes for mov eax, [esi] + add esi, 4 - 100*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
Columns, left to right:
1. AMD Athlon Gold 3150U with Radeon Graphics (SSE4)
2. AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
3. AMD Athlon(tm) II X2 220 Processor (SSE3)
4. 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
5. Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
cycles for 100 * imul 10 0 0 3 41 0
cycles for 100 * lea: *10 30 38 27 15 18
cycles for 100 * lodsd (25 DWORDs) - 100* 488 591 744 116 213
cycles for 100 * mov eax, [esi] + add esi, 162 208 459 120 196
cycles for 100 * lea10, add eax 29 38 16 16 32
cycles for 100 * lea10, shl eax, 1 38 37 3 17 17
cycles for 100 * bswap+nop 0 0 27 18 16
cycles for 100 * xor eax, ecx 0 0 5 11 7
cycles for 100 * imul 10 3 0 3 45 0
cycles for 100 * lea: *10 29 36 31 15 19
cycles for 100 * lodsd (25 DWORDs) - 100* 489 591 771 116 213
cycles for 100 * mov eax, [esi] + add esi, 162 202 479 118 193
cycles for 100 * lea10, add eax 30 37 4 17 32
cycles for 100 * lea10, shl eax, 1 29 37 4 16 16
cycles for 100 * bswap+nop 0 0 29 18 16
cycles for 100 * xor eax, ecx 1 0 0 12 6
cycles for 100 * imul 10 4 0 2 44 0
cycles for 100 * lea: *10 28 37 29 15 19
cycles for 100 * lodsd (25 DWORDs) - 100* 487 590 753 117 214
cycles for 100 * mov eax, [esi] + add esi, 170 200 459 123 193
cycles for 100 * lea10, add eax 32 37 6 16 32
cycles for 100 * lea10, shl eax, 1 30 40 1 16 16
cycles for 100 * bswap+nop 0 3 25 16 16
cycles for 100 * xor eax, ecx 0 0 3 16 7
@jj2007 Excel
Sub ImportValues()
    Row = 1 ' check title
    Col = 2 ' first value
    ' Find the first column that has no title yet
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub ' dialog cancelled
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0 ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow ' title
        Row = Row + 1
        If Row >= 3 Then
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    ActiveSheet.Cells(Row, Col) = "??"
                Else
                    ' Only numeric rows write a count, so "??" is never overwritten by the previous value
                    nClk = Val(Left(sTextRow, 5))
                    'Debug.Print nClk
                    If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                End If
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, 8) ' row label
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
Quote from: LiaoMi on December 01, 2021, 06:01:02 PM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
41 cycles for 100 * imul 10
15 cycles for 100 * lea: *10
116 cycles for 100 * lodsd (25 DWORDs) - 100*
120 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
The only cpu that performs badly for imul :cool:
I do wonder at the virtue of testing for "lodsb" without the "rep" prefix. It is generally seen that it's a very slow mnemonic when used by itself.
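For reference, a rough model of what the rep prefix does with a lods instruction, written as a Python sketch under the assumption that the direction flag is clear. The function name and the test buffer are made up. Each iteration loads one DWORD into eax, advances esi by 4 and decrements ecx, so after the loop only the last value loaded is left in eax.

```python
import struct


def rep_lodsd(memory, esi, ecx, eax=0):
    """Model of rep lodsd with DF=0: returns (eax, esi, ecx) afterwards."""
    while ecx:
        eax = int.from_bytes(memory[esi:esi + 4], "little")  # lodsd
        esi += 4
        ecx -= 1
    return eax, esi, ecx


# 25 DWORDs holding 0..24, like the buffer size used in the timings
buffer = struct.pack("<25I", *range(25))
print(rep_lodsd(buffer, 0, 25))  # (24, 100, 0)
```

So rep lodsd walks the buffer but throws away everything except the final element, which is why it mostly shows up in timing tests and rarely in real code.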
Quote from: jj2007 on December 01, 2021, 09:54:49 PM
Quote from: LiaoMi on December 01, 2021, 06:01:02 PM
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
41 cycles for 100 * imul 10
15 cycles for 100 * lea: *10
116 cycles for 100 * lodsd (25 DWORDs) - 100*
120 cycles for 100 * mov eax, [esi] + add esi, 4 - 100*
The only cpu that performs badly for imul :cool:
:biggrin: I noticed that a lot of instructions are slower than on the oldest processors, and some of them are extremely slow :rolleyes: That's the price of progress; most likely the reason lies in the microcode implementation. It's very sad :undecided:
Quote from: hutch-- on December 01, 2021, 11:47:29 PM
I do wonder at the virtue of testing for "lodsb" without the "rep" prefix. It is generally seen that it's a very slow mnemonic when used by itself.
Very good point :thumbsup:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
28 cycles for 100 * imul 10
688 cycles for 100 * rep lodsd (25 DWORDs) - 10*
405 cycles for 100 * lodsd (25 DWORDs) - 10*
317 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
57 cycles for 100 * lea10, add eax
56 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap+nop
19 cycles for 100 * xor eax, ecx
27 cycles for 100 * imul 10
688 cycles for 100 * rep lodsd (25 DWORDs) - 10*
406 cycles for 100 * lodsd (25 DWORDs) - 10*
316 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
58 cycles for 100 * lea10, add eax
56 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap+nop
16 cycles for 100 * xor eax, ecx
28 cycles for 100 * imul 10
686 cycles for 100 * rep lodsd (25 DWORDs) - 10*
406 cycles for 100 * lodsd (25 DWORDs) - 10*
316 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
57 cycles for 100 * lea10, add eax
59 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap+nop
16 cycles for 100 * xor eax, ecx
8 bytes for imul 10
12 bytes for rep lodsd (25 DWORDs) - 10*
14 bytes for lodsd (25 DWORDs) - 10*
18 bytes for mov eax, [esi] + add esi, 4 - 10*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
AMD Athlon(tm) II X2 220 Processor (SSE3)
2 cycles for 100 * imul 10
458 cycles for 100 * rep lodsd (25 DWORDs) - 10*
785 cycles for 100 * lodsd (25 DWORDs) - 10*
470 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
3 cycles for 100 * lea10, add eax
2 cycles for 100 * lea10, shl eax, 1
29 cycles for 100 * bswap+nop
1 cycles for 100 * xor eax, ecx
1 cycles for 100 * imul 10
458 cycles for 100 * rep lodsd (25 DWORDs) - 10*
799 cycles for 100 * lodsd (25 DWORDs) - 10*
469 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
0 cycles for 100 * lea10, add eax
1 cycles for 100 * lea10, shl eax, 1
24 cycles for 100 * bswap+nop
5 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
465 cycles for 100 * rep lodsd (25 DWORDs) - 10*
764 cycles for 100 * lodsd (25 DWORDs) - 10*
465 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
2 cycles for 100 * lea10, add eax
0 cycles for 100 * lea10, shl eax, 1
26 cycles for 100 * bswap+nop
0 cycles for 100 * xor eax, ecx
8 bytes for imul 10
12 bytes for rep lodsd (25 DWORDs) - 10*
14 bytes for lodsd (25 DWORDs) - 10*
18 bytes for mov eax, [esi] + add esi, 4 - 10*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
0 cycles for 100 * imul 10
544 cycles for 100 * rep lodsd (25 DWORDs) - 10*
657 cycles for 100 * lodsd (25 DWORDs) - 10*
199 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
37 cycles for 100 * lea10, add eax
37 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap+nop
0 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
546 cycles for 100 * rep lodsd (25 DWORDs) - 10*
657 cycles for 100 * lodsd (25 DWORDs) - 10*
204 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
38 cycles for 100 * lea10, add eax
37 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap+nop
4 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
546 cycles for 100 * rep lodsd (25 DWORDs) - 10*
655 cycles for 100 * lodsd (25 DWORDs) - 10*
201 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
37 cycles for 100 * lea10, add eax
37 cycles for 100 * lea10, shl eax, 1
8 cycles for 100 * bswap+nop
0 cycles for 100 * xor eax, ecx
8 bytes for imul 10
12 bytes for rep lodsd (25 DWORDs) - 10*
14 bytes for lodsd (25 DWORDs) - 10*
18 bytes for mov eax, [esi] + add esi, 4 - 10*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
handle tabs
Sub ImportValues()
    ' Find the first empty column in the title row
    Row = 1 ' check title
    Col = 2 ' first value
    With ActiveSheet
        While .Cells(Row, Col) <> Empty
            Debug.Print .Cells(Row, Col)
            Col = Col + 1
        Wend
    End With
    ' Let the user pick the text file with the timings
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Text Files", "*.txt", 1
        If .Show = 0 Then Exit Sub ' dialog cancelled
        sFileName = .SelectedItems.Item(1)
    End With
    nFileNro = FreeFile
    Row = 0 ' start
    Open sFileName For Input As #nFileNro
    Do While Not EOF(nFileNro)
        Line Input #nFileNro, sTextRow
        'Debug.Print sTextRow
        If Row = 0 Then ActiveSheet.Cells(1, Col) = sTextRow ' title
        Row = Row + 1
        If Row >= 3 Then
            Pos = InStr(1, sTextRow, Chr(9)) ' is tab in line
            If Pos = 0 Then Pos = 8 ' no tab
            If Left(sTextRow, 1) <> "" Then
                If Left(sTextRow, 1) = "?" Then
                    ' keep the "??" marker instead of a stale value
                    ActiveSheet.Cells(Row, Col) = "??"
                Else
                    nClk = Val(Left(sTextRow, 5))
                    'Debug.Print nClk
                    If nClk >= 0 Then ActiveSheet.Cells(Row, Col) = nClk
                End If
                ActiveSheet.Cells(Row, 1) = Mid(sTextRow, Pos)
            End If
        End If
        If Row = 37 Then Exit Do
    Loop
    Close #nFileNro
End Sub
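For anyone who wants to process the timing files outside Excel, the per-line parsing step can be sketched in C as well. This is not a port of the macro: it keys on the text " cycles for " rather than on the tab position, and the name `parse_timing` and its return codes are made up for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Parse one testbed line such as "41 cycles for 100 * imul 10".
   Returns  1 and stores the count in *cycles for a numeric line,
            0 for a "??" line (no count stored),
           -1 if the line is not a "cycles" line at all.
   On 1 or 0, *label points at the text after " cycles for ". */
static int parse_timing(const char *line, long *cycles, const char **label)
{
    static const char marker[] = " cycles for ";
    const char *p = strstr(line, marker);

    if (p == NULL)
        return -1;
    *label = p + strlen(marker);

    /* Skip leading blanks or tabs before the number. */
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '?')
        return 0;
    return sscanf(line, "%ld", cycles) == 1 ? 1 : -1;
}
```

Lines like "8 bytes for imul 10" return -1, so the size table at the end of each run is skipped rather than mixed into the cycle counts.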
Hi,
Four systems with timings.
Cheers,
Steve N.
pre-P4 (SSE1)
101 cycles for 100 * imul 10
100 cycles for 100 * lea: *10
10717 cycles for 100 * lodsd (25 DWORDs)
8402 cycles for 100 * mov eax, [esi] + add esi, 4
100 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
111 cycles for 100 * ror 16
101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8391 cycles for 100 * mov eax, [esi] + add esi, 4
99 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
99 cycles for 100 * bswap
102 cycles for 100 * ror 16
101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8390 cycles for 100 * mov eax, [esi] + add esi, 4
99 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
100 cycles for 100 * ror 16
101 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
10729 cycles for 100 * lodsd (25 DWORDs)
8401 cycles for 100 * mov eax, [esi] + add esi, 4
100 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
100 cycles for 100 * bswap
100 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
7437 cycles for 100 * lodsd (25 DWORDs)
5961 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
168 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
104 cycles for 100 * ror 16
164 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
7382 cycles for 100 * lodsd (25 DWORDs)
6006 cycles for 100 * mov eax, [esi] + add esi, 4
176 cycles for 100 * lea10, add eax
166 cycles for 100 * lea10, shl eax, 1
160 cycles for 100 * bswap
104 cycles for 100 * ror 16
163 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
7426 cycles for 100 * lodsd (25 DWORDs)
5964 cycles for 100 * mov eax, [esi] + add esi, 4
172 cycles for 100 * lea10, add eax
163 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
104 cycles for 100 * ror 16
154 cycles for 100 * imul 10
163 cycles for 100 * lea: *10
7390 cycles for 100 * lodsd (25 DWORDs)
5961 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
164 cycles for 100 * lea10, shl eax, 1
156 cycles for 100 * bswap
100 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2867 cycles for 100 * lodsd (25 DWORDs)
2990 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2872 cycles for 100 * lodsd (25 DWORDs)
2947 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
13 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
13 cycles for 100 * lea: *10
2843 cycles for 100 * lodsd (25 DWORDs)
3185 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
28 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
2874 cycles for 100 * lodsd (25 DWORDs)
3088 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
7 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
1 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2322 cycles for 100 * lodsd (25 DWORDs)
2160 cycles for 100 * mov eax, [esi] + add esi, 4
21 cycles for 100 * lea10, add eax
21 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
0 cycles for 100 * ror 16
0 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
2250 cycles for 100 * lodsd (25 DWORDs)
2167 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
6 cycles for 100 * ror 16
1 cycles for 100 * imul 10
30 cycles for 100 * lea: *10
2452 cycles for 100 * lodsd (25 DWORDs)
2134 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
22 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16
7 cycles for 100 * imul 10
25 cycles for 100 * lea: *10
2408 cycles for 100 * lodsd (25 DWORDs)
2158 cycles for 100 * mov eax, [esi] + add esi, 4
27 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
0 cycles for 100 * bswap
0 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
14 bytes for lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
pre-P4 (SSE1)
101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
9606 cycles for 100 * rep lodsd (25 DWORDs)
8392 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16
102 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
9604 cycles for 100 * rep lodsd (25 DWORDs)
8400 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16
103 cycles for 100 * imul 10
113 cycles for 100 * lea: *10
9590 cycles for 100 * rep lodsd (25 DWORDs)
8403 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16
102 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
9594 cycles for 100 * rep lodsd (25 DWORDs)
8391 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
163 cycles for 100 * imul 10
159 cycles for 100 * lea: *10
9053 cycles for 100 * rep lodsd (25 DWORDs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
167 cycles for 100 * lea10, add eax
160 cycles for 100 * lea10, shl eax, 1
159 cycles for 100 * bswap
104 cycles for 100 * ror 16
163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
9013 cycles for 100 * rep lodsd (25 DWORDs)
6009 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
163 cycles for 100 * lea10, shl eax, 1
159 cycles for 100 * bswap
104 cycles for 100 * ror 16
155 cycles for 100 * imul 10
159 cycles for 100 * lea: *10
9051 cycles for 100 * rep lodsd (25 DWORDs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
180 cycles for 100 * lea10, add eax
166 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
105 cycles for 100 * ror 16
164 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
9008 cycles for 100 * rep lodsd (25 DWORDs)
6009 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
165 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
97 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
9492 cycles for 100 * rep lodsd (25 DWORDs)
3181 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
14 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
19 cycles for 100 * lea: *10
10011 cycles for 100 * rep lodsd (25 DWORDs)
3240 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
7 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
9746 cycles for 100 * rep lodsd (25 DWORDs)
3137 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
12 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
11 cycles for 100 * lea: *10
9658 cycles for 100 * rep lodsd (25 DWORDs)
3242 cycles for 100 * mov eax, [esi] + add esi, 4
28 cycles for 100 * lea10, add eax
9 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
52 cycles for 100 * imul 10
85 cycles for 100 * lea: *10
10853 cycles for 100 * rep lodsd (25 DWORDs)
4703 cycles for 100 * mov eax, [esi] + add esi, 4
71 cycles for 100 * lea10, add eax
67 cycles for 100 * lea10, shl eax, 1
5 cycles for 100 * bswap
37 cycles for 100 * ror 16
39 cycles for 100 * imul 10
97 cycles for 100 * lea: *10
10803 cycles for 100 * rep lodsd (25 DWORDs)
3588 cycles for 100 * mov eax, [esi] + add esi, 4
18 cycles for 100 * lea10, add eax
2 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
32 cycles for 100 * imul 10
68 cycles for 100 * lea: *10
8343 cycles for 100 * rep lodsd (25 DWORDs)
2578 cycles for 100 * mov eax, [esi] + add esi, 4
8 cycles for 100 * lea10, add eax
3 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
3 cycles for 100 * lea: *10
8712 cycles for 100 * rep lodsd (25 DWORDs)
3064 cycles for 100 * mov eax, [esi] + add esi, 4
0 cycles for 100 * lea10, add eax
5 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsd (25 DWORDs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
pre-P4 (SSE1)
101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34862 cycles for 100 * rep lodsB (100 BYTEs)
8403 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
103 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16
101 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34869 cycles for 100 * rep lodsB (100 BYTEs)
8400 cycles for 100 * mov eax, [esi] + add esi, 4
102 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
103 cycles for 100 * bswap
102 cycles for 100 * ror 16
102 cycles for 100 * imul 10
101 cycles for 100 * lea: *10
34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
101 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
102 cycles for 100 * ror 16
102 cycles for 100 * imul 10
102 cycles for 100 * lea: *10
34864 cycles for 100 * rep lodsB (100 BYTEs)
8392 cycles for 100 * mov eax, [esi] + add esi, 4
101 cycles for 100 * lea10, add eax
102 cycles for 100 * lea10, shl eax, 1
101 cycles for 100 * bswap
101 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
163 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
32678 cycles for 100 * rep lodsB (100 BYTEs)
5963 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
164 cycles for 100 * lea10, shl eax, 1
156 cycles for 100 * bswap
100 cycles for 100 * ror 16
155 cycles for 100 * imul 10
167 cycles for 100 * lea: *10
32653 cycles for 100 * rep lodsB (100 BYTEs)
5966 cycles for 100 * mov eax, [esi] + add esi, 4
175 cycles for 100 * lea10, add eax
167 cycles for 100 * lea10, shl eax, 1
166 cycles for 100 * bswap
104 cycles for 100 * ror 16
163 cycles for 100 * imul 10
169 cycles for 100 * lea: *10
32663 cycles for 100 * rep lodsB (100 BYTEs)
5979 cycles for 100 * mov eax, [esi] + add esi, 4
176 cycles for 100 * lea10, add eax
168 cycles for 100 * lea10, shl eax, 1
163 cycles for 100 * bswap
104 cycles for 100 * ror 16
162 cycles for 100 * imul 10
166 cycles for 100 * lea: *10
32673 cycles for 100 * rep lodsB (100 BYTEs)
5981 cycles for 100 * mov eax, [esi] + add esi, 4
168 cycles for 100 * lea10, add eax
159 cycles for 100 * lea10, shl eax, 1
155 cycles for 100 * bswap
96 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
?? cycles for 100 * imul 10
9 cycles for 100 * lea: *10
25205 cycles for 100 * rep lodsB (100 BYTEs)
3292 cycles for 100 * mov eax, [esi] + add esi, 4
51 cycles for 100 * lea10, add eax
38 cycles for 100 * lea10, shl eax, 1
1 cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
31 cycles for 100 * lea: *10
25332 cycles for 100 * rep lodsB (100 BYTEs)
3306 cycles for 100 * mov eax, [esi] + add esi, 4
30 cycles for 100 * lea10, add eax
70 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
12 cycles for 100 * lea: *10
25632 cycles for 100 * rep lodsB (100 BYTEs)
3298 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
?? cycles for 100 * imul 10
10 cycles for 100 * lea: *10
25643 cycles for 100 * rep lodsB (100 BYTEs)
3205 cycles for 100 * mov eax, [esi] + add esi, 4
29 cycles for 100 * lea10, add eax
8 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
0 cycles for 100 * imul 10
26 cycles for 100 * lea: *10
16730 cycles for 100 * rep lodsB (100 BYTEs)
2255 cycles for 100 * mov eax, [esi] + add esi, 4
91 cycles for 100 * lea10, add eax
71 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
0 cycles for 100 * imul 10
21 cycles for 100 * lea: *10
16499 cycles for 100 * rep lodsB (100 BYTEs)
2099 cycles for 100 * mov eax, [esi] + add esi, 4
23 cycles for 100 * lea10, add eax
18 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
?? cycles for 100 * ror 16
8 cycles for 100 * imul 10
20 cycles for 100 * lea: *10
16543 cycles for 100 * rep lodsB (100 BYTEs)
2269 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
8 cycles for 100 * bswap
0 cycles for 100 * ror 16
1 cycles for 100 * imul 10
38 cycles for 100 * lea: *10
19057 cycles for 100 * rep lodsB (100 BYTEs)
2247 cycles for 100 * mov eax, [esi] + add esi, 4
22 cycles for 100 * lea10, add eax
20 cycles for 100 * lea10, shl eax, 1
?? cycles for 100 * bswap
3 cycles for 100 * ror 16
8 bytes for imul 10
11 bytes for lea: *10
12 bytes for rep lodsB (100 BYTEs)
18 bytes for mov eax, [esi] + add esi, 4
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
7 bytes for bswap
8 bytes for ror 16
--- ok ---
Quote from: jj2007 on December 02, 2021, 01:49:15 AM
Quote from: hutch-- on December 01, 2021, 11:47:29 PM
I do wonder at the virtue of testing for "lodsb" without the "rep" prefix. It is generally seen that it's a very slow mnemonic used by itself.
Very good point :thumbsup:
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz (SSE4)
43 cycles for 100 * imul 10
511 cycles for 100 * rep lodsd (25 DWORDs) - 10*
123 cycles for 100 * lodsd (25 DWORDs) - 10*
122 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
17 cycles for 100 * lea10, add eax
17 cycles for 100 * lea10, shl eax, 1
18 cycles for 100 * bswap+nop
13 cycles for 100 * xor eax, ecx
43 cycles for 100 * imul 10
508 cycles for 100 * rep lodsd (25 DWORDs) - 10*
122 cycles for 100 * lodsd (25 DWORDs) - 10*
121 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
18 cycles for 100 * lea10, add eax
17 cycles for 100 * lea10, shl eax, 1
18 cycles for 100 * bswap+nop
14 cycles for 100 * xor eax, ecx
44 cycles for 100 * imul 10
506 cycles for 100 * rep lodsd (25 DWORDs) - 10*
121 cycles for 100 * lodsd (25 DWORDs) - 10*
122 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
18 cycles for 100 * lea10, add eax
17 cycles for 100 * lea10, shl eax, 1
19 cycles for 100 * bswap+nop
13 cycles for 100 * xor eax, ecx
8 bytes for imul 10
12 bytes for rep lodsd (25 DWORDs) - 10*
14 bytes for lodsd (25 DWORDs) - 10*
18 bytes for mov eax, [esi] + add esi, 4 - 10*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
--- ok ---
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
0 cycles for 100 * imul 10
701 cycles for 100 * rep lodsd (25 DWORDs) - 10*
238 cycles for 100 * lodsd (25 DWORDs) - 10*
199 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
32 cycles for 100 * lea10, add eax
16 cycles for 100 * lea10, shl eax, 1
14 cycles for 100 * bswap+nop
7 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
701 cycles for 100 * rep lodsd (25 DWORDs) - 10*
238 cycles for 100 * lodsd (25 DWORDs) - 10*
196 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
31 cycles for 100 * lea10, add eax
17 cycles for 100 * lea10, shl eax, 1
16 cycles for 100 * bswap+nop
3 cycles for 100 * xor eax, ecx
0 cycles for 100 * imul 10
702 cycles for 100 * rep lodsd (25 DWORDs) - 10*
238 cycles for 100 * lodsd (25 DWORDs) - 10*
194 cycles for 100 * mov eax, [esi] + add esi, 4 - 10*
32 cycles for 100 * lea10, add eax
18 cycles for 100 * lea10, shl eax, 1
15 cycles for 100 * bswap+nop
2 cycles for 100 * xor eax, ecx
8 bytes for imul 10
12 bytes for rep lodsd (25 DWORDs) - 10*
14 bytes for lodsd (25 DWORDs) - 10*
18 bytes for mov eax, [esi] + add esi, 4 - 10*
10 bytes for lea10, add eax
10 bytes for lea10, shl eax, 1
3 bytes for bswap+nop
2 bytes for xor eax, ecx
Quote from: FORTRANS on December 02, 2021, 03:02:47 AM
Four systems with timings.
34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4
Second row should be mov al, [esi] + inc esi, otherwise it's an unfair comparison
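To make the mismatch concrete, the totals can be normalised to cycles per single load. The sketch below assumes that "100 *" in each label is the outer repetition count and that the number in parentheses is the number of loads per repetition; the helper name `cycles_per_load` is illustrative only.

```c
/* Average cycles per single load:
   total cycles / (outer repetitions * loads per repetition). */
static double cycles_per_load(double total_cycles, int reps, int loads_per_rep)
{
    return total_cycles / ((double)reps * (double)loads_per_rep);
}
```

With the two quoted rows, `cycles_per_load(34858, 100, 100)` gives about 3.49 cycles per byte load, while `cycles_per_load(8394, 100, 25)` gives about 3.36 cycles per DWORD load. Under that assumption the totals differ by a factor of four mostly because the byte version issues four times as many loads.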
Hi Jochen,
Quote from: jj2007 on December 02, 2021, 09:41:23 AM
Quote from: FORTRANS on December 02, 2021, 03:02:47 AM
Four systems with timings.
34858 cycles for 100 * rep lodsB (100 BYTEs)
8394 cycles for 100 * mov eax, [esi] + add esi, 4
Second row should be mov al, [esi] + inc esi, otherwise it's an unfair comparison
Well, the idea was that Intel had improved LODSB in some of their CPUs.
Or maybe not. Anyway.
pre-P4 (SSE1)
101 cycles for 100 * imul 10
34863 cycles for 100 * rep lodsB (100 BYTEs)
28500 cycles for 100 * mov AL, [esi] + INC esi
101 cycles for 100 * imul 10
34882 cycles for 100 * rep lodsB (100 BYTEs)
28482 cycles for 100 * mov AL, [esi] + INC esi
101 cycles for 100 * imul 10
34861 cycles for 100 * rep lodsB (100 BYTEs)
28502 cycles for 100 * mov AL, [esi] + INC esi
101 cycles for 100 * imul 10
34876 cycles for 100 * rep lodsB (100 BYTEs)
28492 cycles for 100 * mov AL, [esi] + INC esi
8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
162 cycles for 100 * imul 10
32603 cycles for 100 * rep lodsB (100 BYTEs)
24696 cycles for 100 * mov AL, [esi] + INC esi
162 cycles for 100 * imul 10
32611 cycles for 100 * rep lodsB (100 BYTEs)
24699 cycles for 100 * mov AL, [esi] + INC esi
156 cycles for 100 * imul 10
32592 cycles for 100 * rep lodsB (100 BYTEs)
24726 cycles for 100 * mov AL, [esi] + INC esi
162 cycles for 100 * imul 10
32608 cycles for 100 * rep lodsB (100 BYTEs)
24705 cycles for 100 * mov AL, [esi] + INC esi
8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi
--- ok ---
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
?? cycles for 100 * imul 10
24755 cycles for 100 * rep lodsB (100 BYTEs)
14615 cycles for 100 * mov AL, [esi] + INC esi
?? cycles for 100 * imul 10
24772 cycles for 100 * rep lodsB (100 BYTEs)
14738 cycles for 100 * mov AL, [esi] + INC esi
?? cycles for 100 * imul 10
24759 cycles for 100 * rep lodsB (100 BYTEs)
14621 cycles for 100 * mov AL, [esi] + INC esi
?? cycles for 100 * imul 10
24760 cycles for 100 * rep lodsB (100 BYTEs)
14622 cycles for 100 * mov AL, [esi] + INC esi
8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi
--- ok ---
Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz (SSE4)
?? cycles for 100 * imul 10
17094 cycles for 100 * rep lodsB (100 BYTEs)
8491 cycles for 100 * mov AL, [esi] + INC esi
?? cycles for 100 * imul 10
16367 cycles for 100 * rep lodsB (100 BYTEs)
8473 cycles for 100 * mov AL, [esi] + INC esi
?? cycles for 100 * imul 10
16269 cycles for 100 * rep lodsB (100 BYTEs)
8559 cycles for 100 * mov AL, [esi] + INC esi
1 cycles for 100 * imul 10
16111 cycles for 100 * rep lodsB (100 BYTEs)
8466 cycles for 100 * mov AL, [esi] + INC esi
8 bytes for imul 10
12 bytes for rep lodsB (100 BYTEs)
16 bytes for mov AL, [esi] + INC esi
--- ok ---
Regards,
Steve
Quote from: FORTRANS on December 03, 2021, 12:51:20 AM
Well, the idea was that Intel had improved LODSB in some of their CPUs.
Hi Steve,
I haven't tested lodsb, but lodsd is clearly a fast instruction on recent Intel CPUs, in comparison with the expanded mov eax, [esi] plus add esi, 4. What is striking, though, is that rep lodsd is often slower than the equivalent loop.
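As a reminder of what "equivalent" means here, this is a rough C model of both forms, assuming the direction flag is clear. It says nothing about speed; it only shows that the architectural result is the same. The function names and the use of a `uint32_t` pointer for esi are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of "rep lodsd": repeat ecx times { eax = [esi]; esi += 4 }.
   eax is passed in so that a count of zero leaves it unchanged. */
static uint32_t rep_lodsd(const uint32_t **esi, size_t ecx, uint32_t eax)
{
    while (ecx-- > 0) {
        eax = **esi;   /* load one DWORD into eax */
        *esi += 1;     /* advance by one DWORD, i.e. 4 bytes */
    }
    return eax;
}

/* Model of the explicit loop: mov eax, [esi] / add esi, 4, ecx times. */
static uint32_t loop_mov_add(const uint32_t **esi, size_t ecx, uint32_t eax)
{
    for (size_t i = 0; i < ecx; i++)
        eax = (*esi)[i];
    *esi += ecx;
    return eax;
}
```

Both leave esi advanced by ecx DWORDs and eax holding only the last element, since every earlier load is overwritten.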
Hi,
Memory error on my part. It is faster REP MOVS/STOS, not LODS,
that can be identified with CPUID. REP LODS is a bit useless anyway,
since each load overwrites the previously loaded value. Oh well,
maybe next time.
Regards,
Steve N.
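For reference, the CPUID flag in question is ERMS (Enhanced REP MOVSB/STOSB), reported in bit 9 of EBX for leaf 7, sub-leaf 0. A minimal sketch of the bit test follows; it takes the EBX value as a parameter so it stays portable, and the names `CPUID7_EBX_ERMS` and `has_erms` are made up for this example.

```c
#include <stdint.h>

/* CPUID leaf 7, sub-leaf 0 returns structured feature flags in EBX;
   bit 9 is ERMS (Enhanced REP MOVSB/STOSB). */
#define CPUID7_EBX_ERMS (UINT32_C(1) << 9)

/* Returns 1 if the given leaf-7 EBX value has the ERMS bit set, else 0. */
static int has_erms(uint32_t cpuid7_ebx)
{
    return (cpuid7_ebx & CPUID7_EBX_ERMS) != 0;
}
```

With GCC or Clang on x86, the EBX value could be obtained with `__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`; in assembly it is simply mov eax, 7 / xor ecx, ecx / cpuid, after checking that leaf 7 is supported.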