Just for fun :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
14426 kCycles for 10 * CRT _swab
12648 kCycles for 10 * MbSwap16a
13391 kCycles for 10 * MbSwap16b
8941 kCycles for 10 * MbSwap16c
13733 kCycles for 10 * CRT _swab
13109 kCycles for 10 * MbSwap16a
12523 kCycles for 10 * MbSwap16b
8302 kCycles for 10 * MbSwap16c
14025 kCycles for 10 * CRT _swab
12489 kCycles for 10 * MbSwap16a
24203 kCycles for 10 * MbSwap16b
8027 kCycles for 10 * MbSwap16c
13954 kCycles for 10 * CRT _swab
12714 kCycles for 10 * MbSwap16a
12000 kCycles for 10 * MbSwap16b
9119 kCycles for 10 * MbSwap16c
I guess pshufb is a tick faster, in case someone is eager to humiliate the crt even more ;-)
pshufb is indeed a little bit faster:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
13124 kCycles for 10 * CRT _swab
7806 kCycles for 10 * MbSwap16c
1757 kCycles for 10 * pshufb
13576 kCycles for 10 * CRT _swab
7778 kCycles for 10 * MbSwap16c
1828 kCycles for 10 * pshufb
13093 kCycles for 10 * CRT _swab
7828 kCycles for 10 * MbSwap16c
1793 kCycles for 10 * pshufb
13183 kCycles for 10 * CRT _swab
7919 kCycles for 10 * MbSwap16c
1759 kCycles for 10 * pshufb
13132 kCycles for 10 * CRT _swab
7789 kCycles for 10 * MbSwap16c
1783 kCycles for 10 * pshufb
13153 kCycles for 10 * CRT _swab
7919 kCycles for 10 * MbSwap16c
1782 kCycles for 10 * pshufb
One more, a frameless pshufb version, on average a factor 7.64 faster than the CRT:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
13198 kCycles for 10 * CRT _swab
1743 kCycles for 10 * pshufb
1747 kCycles for 10 * pshufb frameless
13104 kCycles for 10 * CRT _swab
1779 kCycles for 10 * pshufb
1681 kCycles for 10 * pshufb frameless
13142 kCycles for 10 * CRT _swab
1765 kCycles for 10 * pshufb
1749 kCycles for 10 * pshufb frameless
13105 kCycles for 10 * CRT _swab
1761 kCycles for 10 * pshufb
1683 kCycles for 10 * pshufb frameless
13157 kCycles for 10 * CRT _swab
1739 kCycles for 10 * pshufb
1775 kCycles for 10 * pshufb frameless
13104 kCycles for 10 * CRT _swab
1740 kCycles for 10 * pshufb
1680 kCycles for 10 * pshufb frameless
Results in the order presented above:
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
15717 kCycles for 10 * CRT _swab
20633 kCycles for 10 * MbSwap16a
68825 kCycles for 10 * MbSwap16b
8013 kCycles for 10 * MbSwap16c
14801 kCycles for 10 * CRT _swab
20048 kCycles for 10 * MbSwap16a
68100 kCycles for 10 * MbSwap16b
8328 kCycles for 10 * MbSwap16c
14827 kCycles for 10 * CRT _swab
24427 kCycles for 10 * MbSwap16a
68029 kCycles for 10 * MbSwap16b
8155 kCycles for 10 * MbSwap16c
14904 kCycles for 10 * CRT _swab
19972 kCycles for 10 * MbSwap16a
68120 kCycles for 10 * MbSwap16b
9155 kCycles for 10 * MbSwap16c
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
15418 kCycles for 10 * CRT _swab
8617 kCycles for 10 * MbSwap16c
7965 kCycles for 10 * pshufb
15489 kCycles for 10 * CRT _swab
8722 kCycles for 10 * MbSwap16c
6458 kCycles for 10 * pshufb
15746 kCycles for 10 * CRT _swab
8742 kCycles for 10 * MbSwap16c
6756 kCycles for 10 * pshufb
15215 kCycles for 10 * CRT _swab
8656 kCycles for 10 * MbSwap16c
7140 kCycles for 10 * pshufb
15222 kCycles for 10 * CRT _swab
8685 kCycles for 10 * MbSwap16c
7293 kCycles for 10 * pshufb
15247 kCycles for 10 * CRT _swab
8679 kCycles for 10 * MbSwap16c
7319 kCycles for 10 * pshufb
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
16496 kCycles for 10 * CRT _swab
8640 kCycles for 10 * pshufb
6507 kCycles for 10 * pshufb frameless
15281 kCycles for 10 * CRT _swab
7073 kCycles for 10 * pshufb
6630 kCycles for 10 * pshufb frameless
15528 kCycles for 10 * CRT _swab
7046 kCycles for 10 * pshufb
6448 kCycles for 10 * pshufb frameless
15344 kCycles for 10 * CRT _swab
6905 kCycles for 10 * pshufb
6497 kCycles for 10 * pshufb frameless
15418 kCycles for 10 * CRT _swab
6829 kCycles for 10 * pshufb
6523 kCycles for 10 * pshufb frameless
15362 kCycles for 10 * CRT _swab
6880 kCycles for 10 * pshufb
7595 kCycles for 10 * pshufb frameless
All on Win7 Pro 32-bit.
Thanks :thumbsup:
Not a big improvement on the AMD :sad: (pshufb is only a bit over a factor 2 faster than the CRT there, versus more than 7 on the i5)
Quote from: jj2007 on February 12, 2021, 07:53:12 AM
Just for fun :biggrin:
I guess pshufb is a tick faster, in case someone is eager to humiliate the crt even more ;-)
How about the UCRT too, as it gives even more fun sometimes?
_byteswap_ushort
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
++18 of 20 tests valid,
10071 kCycles for 10 * CRT _swab
12143 kCycles for 10 * MbSwap16a
10233 kCycles for 10 * MbSwap16b
6200 kCycles for 10 * MbSwap16c
9861 kCycles for 10 * CRT _swab
12155 kCycles for 10 * MbSwap16a
10794 kCycles for 10 * MbSwap16b
5850 kCycles for 10 * MbSwap16c
9817 kCycles for 10 * CRT _swab
11763 kCycles for 10 * MbSwap16a
10209 kCycles for 10 * MbSwap16b
5942 kCycles for 10 * MbSwap16c
9412 kCycles for 10 * CRT _swab
11800 kCycles for 10 * MbSwap16a
10208 kCycles for 10 * MbSwap16b
5784 kCycles for 10 * MbSwap16c
27 bytes for CRT _swab
55 bytes for MbSwap16a
59 bytes for MbSwap16b
59 bytes for MbSwap16c
65 023EF020 3B 3B 3B 3B 68 20 61 65 024DDA14 2D 2D 2D 2D 2D 2D 0D 2D
65 023EF020 3B 3B 3B 3B 68 20 61 65 024DDA14 2D 2D 2D 2D 2D 2D 0D 2D
65 023EF020 3B 3B 3B 3B 68 20 61 65 024DDA14 2D 2D 2D 2D 2D 2D 0D 2D
65 023EF020 3B 3B 3B 3B 68 20 61 65 024DDA14 2D 2D 2D 2D 2D 2D 0D 2D
--- ok ---
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
+-18 of 20 tests valid,
9796 kCycles for 10 * CRT _swab
6027 kCycles for 10 * MbSwap16c
1684 kCycles for 10 * pshufb
9607 kCycles for 10 * CRT _swab
6072 kCycles for 10 * MbSwap16c
1649 kCycles for 10 * pshufb
10079 kCycles for 10 * CRT _swab
5981 kCycles for 10 * MbSwap16c
1660 kCycles for 10 * pshufb
9609 kCycles for 10 * CRT _swab
6105 kCycles for 10 * MbSwap16c
1661 kCycles for 10 * pshufb
9833 kCycles for 10 * CRT _swab
6386 kCycles for 10 * MbSwap16c
1695 kCycles for 10 * pshufb
9971 kCycles for 10 * CRT _swab
6457 kCycles for 10 * MbSwap16c
1725 kCycles for 10 * pshufb
27 bytes for CRT _swab
59 bytes for MbSwap16c
119 bytes for pshufb
37 0249F020 32 31 34 33 36 35 38 37 0258DA14 2D 2D 2D 2D 2D 2D 0D 2D
37 0249F020 32 31 34 33 36 35 38 37 0258DA14 2D 2D 2D 2D 2D 2D 0D 2D
37 0249F020 32 31 34 33 36 35 38 37 0258DA14 2D 2D 2D 2D 2D 2D 0D 2D
37 0249F020 32 31 34 33 36 35 38 37 0258DA14 2D 2D 2D 2D 2D 2D 0D 2D
--- ok ---
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
9667 kCycles for 10 * CRT _swab
1730 kCycles for 10 * pshufb
1657 kCycles for 10 * pshufb frameless
9896 kCycles for 10 * CRT _swab
1671 kCycles for 10 * pshufb
1621 kCycles for 10 * pshufb frameless
9588 kCycles for 10 * CRT _swab
1668 kCycles for 10 * pshufb
1781 kCycles for 10 * pshufb frameless
9611 kCycles for 10 * CRT _swab
1655 kCycles for 10 * pshufb
1779 kCycles for 10 * pshufb frameless
9669 kCycles for 10 * CRT _swab
1663 kCycles for 10 * pshufb
1611 kCycles for 10 * pshufb frameless
9559 kCycles for 10 * CRT _swab
1639 kCycles for 10 * pshufb
1631 kCycles for 10 * pshufb frameless
27 bytes for CRT _swab
119 bytes for pshufb
99 bytes for pshufb frameless
37 02417020 32 31 34 33 36 35 38 37 02505A14 2D 2D 2D 2D 2D 2D 0D 2D
37 02417020 32 31 34 33 36 35 38 37 02505A14 2D 2D 2D 2D 2D 2D 0D 2D
37 02417020 32 31 34 33 36 35 38 37 02505A14 2D 2D 2D 2D 2D 2D 0D 2D
37 02417020 32 31 34 33 36 35 38 37 02505A14 2D 2D 2D 2D 2D 2D 0D 2D
--- ok ---
@Liaomi: Thanks for testing :thup:
Quote from: TimoVJL on February 12, 2021, 08:27:50 PM
How about the UCRT too, as it gives even more fun sometimes?
_byteswap_ushort
Looks like overkill: the UCRT's _byteswap_ulong is a factor 31 slower than a plain bswap (688 vs. 22 cycles):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
688 cycles for 100 * UCRT _byteswap_ulong
22 cycles for 100 * bswap eax
688 cycles for 100 * UCRT _byteswap_ulong
22 cycles for 100 * bswap eax
686 cycles for 100 * UCRT _byteswap_ulong
22 cycles for 100 * bswap eax
include \masm32\MasmBasic\MasmBasic.inc
  Init
  Dll "C:\Windows\System32\ucrtbase"
  Declare void _byteswap_ulong, C:1 ; C: means use C calling convention
  mov ebx, 12345678h                ; test value
  PrintLine "12345678"
  _byteswap_ulong(ebx)              ; UCRT version, result returned in eax
  PrintLine Hex$(eax)
  mov eax, ebx
  bswap eax                         ; the single-instruction equivalent
  PrintLine Hex$(eax)
EndOfCode
Output:
12345678
78563412
78563412
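Disassembly (immediate operands are hexadecimal):
_byteswap_ulong
mov edi, edi        ; two-byte no-op (hot-patch point)
push ebp
mov ebp, esp
mov edx, [ebp+8]    ; edx = argument x
mov eax, edx
push esi
mov esi, 0FF00      ; mask 0FF00h
mov ecx, edx
and eax, esi        ; eax = x and 0FF00h
shl ecx, 10         ; 10h = 16: ecx = x shl 16
add eax, ecx
mov ecx, edx
shr ecx, 8
and ecx, esi        ; ecx = (x shr 8) and 0FF00h
shl eax, 8
add eax, ecx
shr edx, 18         ; 18h = 24: edx = x shr 24
add eax, edx        ; eax = byte-swapped x
pop esi
pop ebp
retn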
Testing a new box over here:
Quote
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
14543 kCycles for 10 * CRT _swab
3682 kCycles for 10 * pshufb
3675 kCycles for 10 * pshufb frameless
14521 kCycles for 10 * CRT _swab
3679 kCycles for 10 * pshufb
3676 kCycles for 10 * pshufb frameless
14543 kCycles for 10 * CRT _swab
3681 kCycles for 10 * pshufb
3676 kCycles for 10 * pshufb frameless
14529 kCycles for 10 * CRT _swab
3679 kCycles for 10 * pshufb
3679 kCycles for 10 * pshufb frameless
14554 kCycles for 10 * CRT _swab
3706 kCycles for 10 * pshufb
3674 kCycles for 10 * pshufb frameless
14513 kCycles for 10 * CRT _swab
3679 kCycles for 10 * pshufb
3678 kCycles for 10 * pshufb frameless
27 bytes for CRT _swab
119 bytes for pshufb
99 bytes for pshufb frameless
37 01320020 32 31 34 33 36 35 38 37 0140EA14 2D 2D 2D 2D 2D 2D 0D 2D
37 01320020 32 31 34 33 36 35 38 37 0140EA14 2D 2D 2D 2D 2D 2D 0D 2D
37 01320020 32 31 34 33 36 35 38 37 0140EA14 2D 2D 2D 2D 2D 2D 0D 2D
37 01320020 32 31 34 33 36 35 38 37 0140EA14 2D 2D 2D 2D 2D 2D 0D 2D
--- ok ---
On my old Haswell.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
-19 of 20 tests valid,
9665 kCycles for 10 * CRT _swab
1954 kCycles for 10 * pshufb
1953 kCycles for 10 * pshufb frameless
9661 kCycles for 10 * CRT _swab
1950 kCycles for 10 * pshufb
1940 kCycles for 10 * pshufb frameless
9679 kCycles for 10 * CRT _swab
1947 kCycles for 10 * pshufb
1942 kCycles for 10 * pshufb frameless
9669 kCycles for 10 * CRT _swab
1945 kCycles for 10 * pshufb
2283 kCycles for 10 * pshufb frameless
9661 kCycles for 10 * CRT _swab
1945 kCycles for 10 * pshufb
1946 kCycles for 10 * pshufb frameless
9809 kCycles for 10 * CRT _swab
1949 kCycles for 10 * pshufb
1941 kCycles for 10 * pshufb frameless
27 bytes for CRT _swab
119 bytes for pshufb
99 bytes for pshufb frameless
37 023A9020 32 31 34 33 36 35 38 37 02497A14 2D 2D 2D 2D 2D 2D 0D 2D
37 023A9020 32 31 34 33 36 35 38 37 02497A14 2D 2D 2D 2D 2D 2D 0D 2D
37 023A9020 32 31 34 33 36 35 38 37 02497A14 2D 2D 2D 2D 2D 2D 0D 2D
37 023A9020 32 31 34 33 36 35 38 37 02497A14 2D 2D 2D 2D 2D 2D 0D 2D
--- ok ---