Little to Big Endian: Assembly beats CRT _swab hands down

Started by jj2007, February 12, 2021, 07:53:12 AM

jj2007

Just for fun :biggrin:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

14426   kCycles for 10 * CRT _swab
12648   kCycles for 10 * MbSwap16a
13391   kCycles for 10 * MbSwap16b
8941    kCycles for 10 * MbSwap16c

13733   kCycles for 10 * CRT _swab
13109   kCycles for 10 * MbSwap16a
12523   kCycles for 10 * MbSwap16b
8302    kCycles for 10 * MbSwap16c

14025   kCycles for 10 * CRT _swab
12489   kCycles for 10 * MbSwap16a
24203   kCycles for 10 * MbSwap16b
8027    kCycles for 10 * MbSwap16c

13954   kCycles for 10 * CRT _swab
12714   kCycles for 10 * MbSwap16a
12000   kCycles for 10 * MbSwap16b
9119    kCycles for 10 * MbSwap16c


I guess pshufb is a tick faster, in case someone is eager to humiliate the crt even more ;-)
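For readers who have not used it: _swab copies a buffer while exchanging the two bytes of every 16-bit word. The following is a minimal portable sketch of that behaviour in C. It is not the CRT source and not one of the MbSwap16 variants; the name swap_pairs and the handling of a trailing odd byte (it is left untouched) are assumptions of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper, not the CRT implementation: copies n bytes from
   src to dst, exchanging the two bytes of every 16-bit word.
   Assumption of this sketch: a trailing odd byte is neither copied
   nor modified. */
void swap_pairs(const unsigned char *src, unsigned char *dst, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        unsigned char first  = src[i];      /* read both bytes before      */
        unsigned char second = src[i + 1];  /* writing, so src == dst works */
        dst[i]     = second;
        dst[i + 1] = first;
    }
}
```

Called on the eight ASCII bytes "12345678", this produces "21436587".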

jj2007

pshufb is indeed a little bit faster:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

13124   kCycles for 10 * CRT _swab
7806    kCycles for 10 * MbSwap16c
1757    kCycles for 10 * pshufb

13576   kCycles for 10 * CRT _swab
7778    kCycles for 10 * MbSwap16c
1828    kCycles for 10 * pshufb

13093   kCycles for 10 * CRT _swab
7828    kCycles for 10 * MbSwap16c
1793    kCycles for 10 * pshufb

13183   kCycles for 10 * CRT _swab
7919    kCycles for 10 * MbSwap16c
1759    kCycles for 10 * pshufb

13132   kCycles for 10 * CRT _swab
7789    kCycles for 10 * MbSwap16c
1783    kCycles for 10 * pshufb

13153   kCycles for 10 * CRT _swab
7919    kCycles for 10 * MbSwap16c
1782    kCycles for 10 * pshufb
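The pshufb source is not shown in the thread, so the following is only a scalar model in C of what a 128-bit pshufb does, plus the shuffle mask one would presumably use to swap the bytes of eight 16-bit words at once. The names pshufb_model, SWAP16_MASK and swap16_blocks are hypothetical, and the real routine would of course use the SSSE3 instruction rather than a loop.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the 128-bit PSHUFB: every destination byte i takes the
   source byte selected by the low four bits of mask[i], or becomes zero
   when bit 7 of mask[i] is set. */
void pshufb_model(uint8_t reg[16], const uint8_t mask[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : reg[mask[i] & 0x0F];
    memcpy(reg, tmp, 16);
}

/* Assumed mask: exchanges the two bytes of each of the eight words. */
static const uint8_t SWAP16_MASK[16] =
    { 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 };

/* Swaps a buffer in place, 16 bytes per step. Assumption of this sketch:
   n is a multiple of 16; any remainder is left untouched. */
void swap16_blocks(uint8_t *buf, size_t n)
{
    for (size_t off = 0; off + 16 <= n; off += 16)
        pshufb_model(buf + off, SWAP16_MASK);
}
```

One instruction handling 16 bytes per iteration, instead of one word at a time, is the plausible reason for the gap between MbSwap16c and pshufb in the timings above.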

jj2007

One more: the frameless pshufb version is on average a factor of 7.64 faster than the CRT:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

13198   kCycles for 10 * CRT _swab
1743    kCycles for 10 * pshufb
1747    kCycles for 10 * pshufb frameless

13104   kCycles for 10 * CRT _swab
1779    kCycles for 10 * pshufb
1681    kCycles for 10 * pshufb frameless

13142   kCycles for 10 * CRT _swab
1765    kCycles for 10 * pshufb
1749    kCycles for 10 * pshufb frameless

13105   kCycles for 10 * CRT _swab
1761    kCycles for 10 * pshufb
1683    kCycles for 10 * pshufb frameless

13157   kCycles for 10 * CRT _swab
1739    kCycles for 10 * pshufb
1775    kCycles for 10 * pshufb frameless

13104   kCycles for 10 * CRT _swab
1740    kCycles for 10 * pshufb
1680    kCycles for 10 * pshufb frameless
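The post does not say how the factor 7.64 was computed. As a check, the sketch below takes the six runs listed above and divides the summed CRT timings by the summed timings of a faster routine; the function name speedup and the ratio-of-sums method are assumptions.

```c
#include <stddef.h>

/* Ratio of summed timings: how many times faster `fast` is than `slow`. */
double speedup(const int *slow, const int *fast, size_t n)
{
    long sum_slow = 0, sum_fast = 0;
    for (size_t i = 0; i < n; i++) {
        sum_slow += slow[i];
        sum_fast += fast[i];
    }
    return (double)sum_slow / (double)sum_fast;
}

/* kCycles copied from the six i5-2450M runs above. */
static const int CRT_SWAB[6]         = { 13198, 13104, 13142, 13105, 13157, 13104 };
static const int PSHUFB[6]           = {  1743,  1779,  1765,  1761,  1739,  1740 };
static const int PSHUFB_FRAMELESS[6] = {  1747,  1681,  1749,  1683,  1775,  1680 };
```

speedup(CRT_SWAB, PSHUFB_FRAMELESS, 6) gives about 7.64, matching the figure quoted in the post; speedup(CRT_SWAB, PSHUFB, 6) gives about 7.49 for the version with a stack frame.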

quarantined

in the order presented above...

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

15717   kCycles for 10 * CRT _swab
20633   kCycles for 10 * MbSwap16a
68825   kCycles for 10 * MbSwap16b
8013    kCycles for 10 * MbSwap16c

14801   kCycles for 10 * CRT _swab
20048   kCycles for 10 * MbSwap16a
68100   kCycles for 10 * MbSwap16b
8328    kCycles for 10 * MbSwap16c

14827   kCycles for 10 * CRT _swab
24427   kCycles for 10 * MbSwap16a
68029   kCycles for 10 * MbSwap16b
8155    kCycles for 10 * MbSwap16c

14904   kCycles for 10 * CRT _swab
19972   kCycles for 10 * MbSwap16a
68120   kCycles for 10 * MbSwap16b
9155    kCycles for 10 * MbSwap16c


AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

15418   kCycles for 10 * CRT _swab
8617    kCycles for 10 * MbSwap16c
7965    kCycles for 10 * pshufb

15489   kCycles for 10 * CRT _swab
8722    kCycles for 10 * MbSwap16c
6458    kCycles for 10 * pshufb

15746   kCycles for 10 * CRT _swab
8742    kCycles for 10 * MbSwap16c
6756    kCycles for 10 * pshufb

15215   kCycles for 10 * CRT _swab
8656    kCycles for 10 * MbSwap16c
7140    kCycles for 10 * pshufb

15222   kCycles for 10 * CRT _swab
8685    kCycles for 10 * MbSwap16c
7293    kCycles for 10 * pshufb

15247   kCycles for 10 * CRT _swab
8679    kCycles for 10 * MbSwap16c
7319    kCycles for 10 * pshufb


AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

16496   kCycles for 10 * CRT _swab
8640    kCycles for 10 * pshufb
6507    kCycles for 10 * pshufb frameless

15281   kCycles for 10 * CRT _swab
7073    kCycles for 10 * pshufb
6630    kCycles for 10 * pshufb frameless

15528   kCycles for 10 * CRT _swab
7046    kCycles for 10 * pshufb
6448    kCycles for 10 * pshufb frameless

15344   kCycles for 10 * CRT _swab
6905    kCycles for 10 * pshufb
6497    kCycles for 10 * pshufb frameless

15418   kCycles for 10 * CRT _swab
6829    kCycles for 10 * pshufb
6523    kCycles for 10 * pshufb frameless

15362   kCycles for 10 * CRT _swab
6880    kCycles for 10 * pshufb
7595    kCycles for 10 * pshufb frameless


all win7 pro 32 bit

TimoVJL

Quote from: jj2007 on February 12, 2021, 07:53:12 AM
Just for fun :biggrin:
I guess pshufb is a tick faster, in case someone is eager to humiliate the crt even more ;-)

How about the UCRT too, as it gives even more fun sometimes:
_byteswap_ushort
May the source be with you
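For reference, the operation _byteswap_ushort performs on a single 16-bit value can be written portably as below. This is a sketch of the semantics only, not the UCRT code; swap_u16 is a hypothetical name.

```c
#include <stdint.h>

/* Exchanges the high and low byte of a 16-bit value. */
uint16_t swap_u16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}
```

swap_u16(0x1234) returns 0x3412.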

LiaoMi

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
++18 of 20 tests valid,
10071   kCycles for 10 * CRT _swab
12143   kCycles for 10 * MbSwap16a
10233   kCycles for 10 * MbSwap16b
6200    kCycles for 10 * MbSwap16c

9861    kCycles for 10 * CRT _swab
12155   kCycles for 10 * MbSwap16a
10794   kCycles for 10 * MbSwap16b
5850    kCycles for 10 * MbSwap16c

9817    kCycles for 10 * CRT _swab
11763   kCycles for 10 * MbSwap16a
10209   kCycles for 10 * MbSwap16b
5942    kCycles for 10 * MbSwap16c

9412    kCycles for 10 * CRT _swab
11800   kCycles for 10 * MbSwap16a
10208   kCycles for 10 * MbSwap16b
5784    kCycles for 10 * MbSwap16c

27      bytes for CRT _swab
55      bytes for MbSwap16a
59      bytes for MbSwap16b
59      bytes for MbSwap16c

65      023EF020  3B 3B 3B 3B 68 20 61 65   024DDA14  2D 2D 2D 2D 2D 2D 0D 2D
65      023EF020  3B 3B 3B 3B 68 20 61 65   024DDA14  2D 2D 2D 2D 2D 2D 0D 2D
65      023EF020  3B 3B 3B 3B 68 20 61 65   024DDA14  2D 2D 2D 2D 2D 2D 0D 2D
65      023EF020  3B 3B 3B 3B 68 20 61 65   024DDA14  2D 2D 2D 2D 2D 2D 0D 2D

--- ok ---


Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
+-18 of 20 tests valid,
9796    kCycles for 10 * CRT _swab
6027    kCycles for 10 * MbSwap16c
1684    kCycles for 10 * pshufb

9607    kCycles for 10 * CRT _swab
6072    kCycles for 10 * MbSwap16c
1649    kCycles for 10 * pshufb

10079   kCycles for 10 * CRT _swab
5981    kCycles for 10 * MbSwap16c
1660    kCycles for 10 * pshufb

9609    kCycles for 10 * CRT _swab
6105    kCycles for 10 * MbSwap16c
1661    kCycles for 10 * pshufb

9833    kCycles for 10 * CRT _swab
6386    kCycles for 10 * MbSwap16c
1695    kCycles for 10 * pshufb

9971    kCycles for 10 * CRT _swab
6457    kCycles for 10 * MbSwap16c
1725    kCycles for 10 * pshufb

27      bytes for CRT _swab
59      bytes for MbSwap16c
119     bytes for pshufb

37      0249F020  32 31 34 33 36 35 38 37   0258DA14  2D 2D 2D 2D 2D 2D 0D 2D
37      0249F020  32 31 34 33 36 35 38 37   0258DA14  2D 2D 2D 2D 2D 2D 0D 2D
37      0249F020  32 31 34 33 36 35 38 37   0258DA14  2D 2D 2D 2D 2D 2D 0D 2D
37      0249F020  32 31 34 33 36 35 38 37   0258DA14  2D 2D 2D 2D 2D 2D 0D 2D

--- ok ---


Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

9667    kCycles for 10 * CRT _swab
1730    kCycles for 10 * pshufb
1657    kCycles for 10 * pshufb frameless

9896    kCycles for 10 * CRT _swab
1671    kCycles for 10 * pshufb
1621    kCycles for 10 * pshufb frameless

9588    kCycles for 10 * CRT _swab
1668    kCycles for 10 * pshufb
1781    kCycles for 10 * pshufb frameless

9611    kCycles for 10 * CRT _swab
1655    kCycles for 10 * pshufb
1779    kCycles for 10 * pshufb frameless

9669    kCycles for 10 * CRT _swab
1663    kCycles for 10 * pshufb
1611    kCycles for 10 * pshufb frameless

9559    kCycles for 10 * CRT _swab
1639    kCycles for 10 * pshufb
1631    kCycles for 10 * pshufb frameless

27      bytes for CRT _swab
119     bytes for pshufb
99      bytes for pshufb frameless

37      02417020  32 31 34 33 36 35 38 37   02505A14  2D 2D 2D 2D 2D 2D 0D 2D
37      02417020  32 31 34 33 36 35 38 37   02505A14  2D 2D 2D 2D 2D 2D 0D 2D
37      02417020  32 31 34 33 36 35 38 37   02505A14  2D 2D 2D 2D 2D 2D 0D 2D
37      02417020  32 31 34 33 36 35 38 37   02505A14  2D 2D 2D 2D 2D 2D 0D 2D

--- ok ---

jj2007

@LiaoMi: Thanks for testing :thup:

Quote from: TimoVJL on February 12, 2021, 08:27:50 PM
How about the UCRT too, as it gives even more fun sometimes:
_byteswap_ushort

Looks like overkill, it's a factor 31 slower than bswap (tested with the 32-bit _byteswap_ulong):

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

688     cycles for 100 * UCRT _byteswap_ulong
22      cycles for 100 * bswap eax

688     cycles for 100 * UCRT _byteswap_ulong
22      cycles for 100 * bswap eax

686     cycles for 100 * UCRT _byteswap_ulong
22      cycles for 100 * bswap eax


include \masm32\MasmBasic\MasmBasic.inc
  Init
  Dll "C:\Windows\System32\ucrtbase"
  Declare void _byteswap_ulong, C:1 ; C: means use C calling convention
  mov ebx, 12345678h
  PrintLine "12345678"
  _byteswap_ulong(ebx)
  PrintLine Hex$(eax)
  mov eax, ebx
  bswap eax
  PrintLine Hex$(eax)
EndOfCode


Output:
12345678
78563412
78563412


Disassembly:
_byteswap_ulong
mov edi, edi
push ebp
mov ebp, esp
mov edx, [ebp+8]
mov eax, edx
push esi
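mov esi, 0FF00h        ; mask for byte 1
mov ecx, edx
and eax, esi           ; eax = x and 0FF00h
shl ecx, 10h           ; ecx = x shl 16
add eax, ecx
mov ecx, edx
shr ecx, 8
and ecx, esi           ; ecx = (x shr 8) and 0FF00h
shl eax, 8
add eax, ecx
shr edx, 18h           ; edx = x shr 24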
add eax, edx
pop esi
pop ebp
retn
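To make the listing easier to follow, here is a step-by-step C transcription of it. The immediates in the disassembly are hexadecimal (10 = 16 decimal, 18 = 24 decimal). The function name byteswap_ulong_model is hypothetical; this is a reading of the listing above, not the UCRT source.

```c
#include <stdint.h>

/* Mirrors the disassembled _byteswap_ulong, one group of instructions
   per statement; x corresponds to edx. */
uint32_t byteswap_ulong_model(uint32_t x)
{
    uint32_t eax = x & 0xFF00u;     /* and eax, esi (esi = 0FF00 hex)    */
    eax += x << 16;                 /* shl ecx, 10 (hex); add eax, ecx   */
    eax <<= 8;                      /* shl eax, 8                        */
    eax += (x >> 8) & 0xFF00u;      /* shr ecx, 8; and ecx, esi; add     */
    eax += x >> 24;                 /* shr edx, 18 (hex); add eax, edx   */
    return eax;
}
```

byteswap_ulong_model(0x12345678) returns 0x78563412, the same value printed by the test program. The function does with roughly a dozen arithmetic instructions plus call overhead what bswap eax does in one, which fits the measured difference.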

quarantined

testing a new box over here

Quote
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

14543   kCycles for 10 * CRT _swab
3682    kCycles for 10 * pshufb
3675    kCycles for 10 * pshufb frameless

14521   kCycles for 10 * CRT _swab
3679    kCycles for 10 * pshufb
3676    kCycles for 10 * pshufb frameless

14543   kCycles for 10 * CRT _swab
3681    kCycles for 10 * pshufb
3676    kCycles for 10 * pshufb frameless

14529   kCycles for 10 * CRT _swab
3679    kCycles for 10 * pshufb
3679    kCycles for 10 * pshufb frameless

14554   kCycles for 10 * CRT _swab
3706    kCycles for 10 * pshufb
3674    kCycles for 10 * pshufb frameless

14513   kCycles for 10 * CRT _swab
3679    kCycles for 10 * pshufb
3678    kCycles for 10 * pshufb frameless

27      bytes for CRT _swab
119     bytes for pshufb
99      bytes for pshufb frameless

37      01320020  32 31 34 33 36 35 38 37   0140EA14  2D 2D 2D 2D 2D 2D 0D 2D
37      01320020  32 31 34 33 36 35 38 37   0140EA14  2D 2D 2D 2D 2D 2D 0D 2D
37      01320020  32 31 34 33 36 35 38 37   0140EA14  2D 2D 2D 2D 2D 2D 0D 2D
37      01320020  32 31 34 33 36 35 38 37   0140EA14  2D 2D 2D 2D 2D 2D 0D 2D

--- ok ---

hutch--

On my old Haswell.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
-19 of 20 tests valid,
9665    kCycles for 10 * CRT _swab
1954    kCycles for 10 * pshufb
1953    kCycles for 10 * pshufb frameless

9661    kCycles for 10 * CRT _swab
1950    kCycles for 10 * pshufb
1940    kCycles for 10 * pshufb frameless

9679    kCycles for 10 * CRT _swab
1947    kCycles for 10 * pshufb
1942    kCycles for 10 * pshufb frameless

9669    kCycles for 10 * CRT _swab
1945    kCycles for 10 * pshufb
2283    kCycles for 10 * pshufb frameless

9661    kCycles for 10 * CRT _swab
1945    kCycles for 10 * pshufb
1946    kCycles for 10 * pshufb frameless

9809    kCycles for 10 * CRT _swab
1949    kCycles for 10 * pshufb
1941    kCycles for 10 * pshufb frameless

27      bytes for CRT _swab
119     bytes for pshufb
99      bytes for pshufb frameless

37      023A9020  32 31 34 33 36 35 38 37   02497A14  2D 2D 2D 2D 2D 2D 0D 2D
37      023A9020  32 31 34 33 36 35 38 37   02497A14  2D 2D 2D 2D 2D 2D 0D 2D
37      023A9020  32 31 34 33 36 35 38 37   02497A14  2D 2D 2D 2D 2D 2D 0D 2D
37      023A9020  32 31 34 33 36 35 38 37   02497A14  2D 2D 2D 2D 2D 2D 0D 2D

--- ok ---