Intel SSE and ARM NEON hexadecimal string

Started by ttribelli, November 17, 2023, 03:24:42 AM

ttribelli

This repository is mainly for readers of Randy Hyde's various Art of Assembly books, but I thought I'd share it here just in case someone else might find it interesting. The code is a follow-up to a book example. Creating a hexadecimal string just begs for SIMD implementations. Here are Intel SSE and ARM NEON implementations, SSE and NEON C intrinsic implementations, x86-64 and AArch64 implementations, and an ordinary C implementation for comparison. Windows, macOS, and Linux.
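To give a rough idea of why this problem suits SIMD, here is a minimal SSE2 intrinsics sketch. It is not taken from the repository: the function name `hex32_sse2`, the 32-bit input width, and the uppercase output are all assumptions made for illustration. It splits every byte into two nibbles, interleaves them, and computes all eight ASCII digits at once. A scalar fallback is included only so the sketch still builds on non-x86 targets.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper (not the repository's code): 32-bit value -> 8 hex chars. */
void hex32_sse2(uint32_t value, char out[9]) {
    /* Reverse the byte order so the most significant byte sits in lane 0. */
    uint32_t be = (value >> 24) | ((value >> 8) & 0x0000FF00u) |
                  ((value << 8) & 0x00FF0000u) | (value << 24);

    __m128i x    = _mm_cvtsi32_si128((int)be);
    __m128i mask = _mm_set1_epi8(0x0F);

    /* High and low nibble of every byte, each in its own byte lane. */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(x, 4), mask);
    __m128i lo = _mm_and_si128(x, mask);

    /* Interleave: hi0, lo0, hi1, lo1, ... gives the digits in print order. */
    __m128i nib = _mm_unpacklo_epi8(hi, lo);

    /* Computed digits: add '0', plus 7 more for values 10..15 ('A' - '0' - 10). */
    __m128i gt9   = _mm_cmpgt_epi8(nib, _mm_set1_epi8(9));
    __m128i ascii = _mm_add_epi8(nib, _mm_set1_epi8('0'));
    ascii = _mm_add_epi8(ascii, _mm_and_si128(gt9, _mm_set1_epi8(7)));

    /* Store the low 8 bytes (the 8 digits) and terminate the string. */
    _mm_storel_epi64((__m128i *)out, ascii);
    out[8] = '\0';
}
#else
/* Scalar fallback so the sketch still builds where SSE2 is unavailable. */
void hex32_sse2(uint32_t value, char out[9]) {
    for (int i = 0; i < 8; i++) {
        unsigned nib = (value >> (28 - 4 * i)) & 0xFu;
        out[i] = (char)(nib < 10 ? '0' + nib : 'A' + (nib - 10));
    }
    out[8] = '\0';
}
#endif
```

Calling `hex32_sse2(0x12345600u, buf)` with a 9-byte `buf` fills it with the string `12345600`.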

The repository is a testbed; it's not trying for the fastest code possible. The "audience" is students, so I went for readability at times. Different algorithms may be faster depending on the underlying hardware: table lookup versus computed hex digits, or copying individual digits to the output buffer versus collecting digits in a register and copying to the output buffer only when the register is full. The same goes for assembly vs intrinsics: if a specific compiler does not optimize intrinsics well, then even my not-fully-optimized assembly can beat them.
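The algorithm choices mentioned above can be sketched in plain C. These are illustrative stand-ins, not the repository's code: the function names, the 32-bit input, and the uppercase digits are assumptions, and the third variant additionally assumes a little-endian target (true for x86-64 and for AArch64 as normally configured).

```c
#include <stdint.h>
#include <string.h>

/* Table lookup: index a 16-entry table with each 4-bit nibble,
   writing one digit at a time to the output buffer. */
void hex32_lookup(uint32_t value, char out[9]) {
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 8; i++) {
        out[i] = digits[(value >> (28 - 4 * i)) & 0xFu];
    }
    out[8] = '\0';
}

/* Computed digits: no table, just arithmetic on the nibble. */
void hex32_computed(uint32_t value, char out[9]) {
    for (int i = 0; i < 8; i++) {
        unsigned nib = (value >> (28 - 4 * i)) & 0xFu;
        out[i] = (char)(nib < 10 ? '0' + nib : 'A' + (nib - 10));
    }
    out[8] = '\0';
}

/* Collect digits in a 64-bit register and copy once when it is full.
   Assumes little-endian: the lowest byte of 'packed' becomes out[0]. */
void hex32_packed(uint32_t value, char out[9]) {
    static const char digits[] = "0123456789ABCDEF";
    uint64_t packed = 0;
    /* Feed the least significant nibble first; each shift pushes earlier
       digits toward the high bytes, so the most significant digit ends
       up in the lowest byte. */
    for (int i = 0; i < 8; i++) {
        packed = (packed << 8) | (unsigned char)digits[(value >> (4 * i)) & 0xFu];
    }
    memcpy(out, &packed, 8);  /* one 8-byte copy instead of eight 1-byte stores */
    out[8] = '\0';
}
```

All three produce the same string for the same input; which one is faster depends on the hardware and the compiler, so it has to be measured.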

Forgive me for a repository that has both masm and gas code, but I am not a complete heathen and configured gas for Intel syntax so that Mac programmers may see the light. :-)

https://github.com/atribelli/hexstr

jj2007

Thanks. I see two asm files in hexstr-main.zip, hexstr-sse.asm and hexstr-sse.s. What are their respective roles?

ttribelli

There are source and make files for both Windows masm and nmake, and macOS/Linux gas and make. The gas .s files are using Intel syntax mode. Note that the code is slightly different between Windows and macOS/Linux. Different registers are used for passing parameters from C to assembly.
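As a rough illustration of the register difference: the Microsoft x64 convention passes the first integer or pointer arguments in RCX, RDX, R8, R9, the System V AMD64 convention used by macOS and Linux passes them in RDI, RSI, RDX, RCX, R8, R9, and AArch64 uses X0 through X7. The sketch below reports the first two registers for the target it is compiled for. The function name and the prototype in the comment are hypothetical, not taken from the repository, and the preprocessor checks are a simplification.

```c
#include <stdint.h>

/* For a hypothetical prototype such as
 *     void to_hex_str(char *buffer, uint32_t value);
 * the assembly side finds 'buffer' and 'value' in the first two
 * argument registers of the platform's calling convention. */
const char *first_two_arg_registers(void) {
#if defined(_M_X64) || defined(__x86_64__)
  #if defined(_WIN64)
    return "RCX, RDX";   /* Microsoft x64 calling convention */
  #else
    return "RDI, RSI";   /* System V AMD64 ABI (macOS, Linux) */
  #endif
#elif defined(__aarch64__) || defined(_M_ARM64)
    return "X0, X1";     /* AArch64 procedure call standard */
#else
    return "unknown";    /* target not covered by this sketch */
#endif
}
```

This is why the masm and gas sources cannot share the same function prologue even though both use Intel syntax.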

Files

makefile - macOS and Linux based builds.
hexstr.mak - Windows based builds.
main.cpp - Timing code.
hexstr.h - Prototypes for hex string conversion functions.
hexstr.c - C and SSE and NEON intrinsic implementations.
hexstr-x64.s - x86-64 assembly implementation (gas).
hexstr-sse.s - SSE implementation (gas).
hexstr-a64.s - AArch64 assembly implementation.
hexstr-neon.s - NEON implementation.
hexstr-x64.asm - x86-64 assembly implementation (masm).
hexstr-sse.asm - SSE implementation (masm).

Building

make - Creates C and intrinsics based code, hexstr-c hexstr-intrin.
make intel - Creates assembly and SSE code, hexstr-x64 hexstr-sse.
make arm - Creates assembly and NEON code, hexstr-a64 hexstr-neon.
make clean - Removes executable and build files.
nmake /f hexstr.mak - Creates all executables for Windows.
nmake /f hexstr.mak clean - Removes executable and build files under Windows.

Vortex

Hi ttribelli,

Thanks for the code. I could not manage to assemble this module :

D:\FreeBASIC\bin\win64>as.exe --version
GNU assembler (Binutils for MinGW-W64 x86_64, built by Brecht Sanders) 2.34
Copyright (C) 2020 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-w64-mingw32'.

D:\FreeBASIC\bin\win64>as -o hexstr-x64.o hexstr-x64.s
Conditional Assembly: Lookup digits
Conditional Assembly: Output bytes
hexstr-x64.s: Assembler messages:
hexstr-x64.s:118: Error: bad expression
hexstr-x64.s:118: Error: invalid use of register
hexstr-x64.s:121: Error: bad expression
hexstr-x64.s:121: Error: invalid use of register
hexstr-x64.s:123: Error: bad expression
hexstr-x64.s:123: Error: invalid use of register
hexstr-x64.s:123: Error: bad expression
hexstr-x64.s:123: Error: invalid use of register
hexstr-x64.s:128: Error: bad expression
hexstr-x64.s:128: Error: invalid use of register
hexstr-x64.s:128: Error: bad expression
hexstr-x64.s:128: Error: invalid use of register
hexstr-x64.s:133: Error: bad expression
hexstr-x64.s:133: Error: invalid use of register
hexstr-x64.s:133: Error: bad expression
hexstr-x64.s:133: Error: invalid use of register
hexstr-x64.s:138: Error: bad expression
hexstr-x64.s:138: Error: invalid use of register
hexstr-x64.s:138: Error: bad expression
hexstr-x64.s:138: Error: invalid use of register
hexstr-x64.s:143: Error: bad expression
hexstr-x64.s:143: Error: invalid use of register

Trying to assemble in the Msys2 environment :

# as -o hexstr-x64.o hexstr-x64.s
Conditional Assembly: Lookup digits
Conditional Assembly: Output bytes
hexstr-x64.s: Assembler messages:
hexstr-x64.s:118: Error: bad expression
hexstr-x64.s:118: Error: invalid use of register
hexstr-x64.s:121: Error: bad expression
hexstr-x64.s:121: Error: invalid use of register
hexstr-x64.s:58: Error: bad expression
hexstr-x64.s:91:  Info: macro invoked from here
hexstr-x64.s:123:   Info: macro invoked from here
hexstr-x64.s:58: Error: invalid use of register
hexstr-x64.s:91:  Info: macro invoked from here
hexstr-x64.s:123:   Info: macro invoked from here
hexstr-x64.s:59: Error: bad expression
hexstr-x64.s:91:  Info: macro invoked from here
hexstr-x64.s:123:   Info: macro invoked from here

ttribelli

Quote:
Thanks for the code. I could not manage to assemble this module :
D:\FreeBASIC\bin\win64>as.exe --version
GNU assembler (Binutils for MinGW-W64 x86_64, built by Brecht Sanders) 2.34
This assembler was configured for a target of `x86_64-w64-mingw32'.

I replicated the problem on Debian. I screwed up and only tested the Intel .s under macOS and not Linux. All my Linux testing was ARM.

While waiting for a repository update, if you have Windows, try nmake and hexstr.mak.

Thanks for your patience.

jj2007

Quote from: ttribelli on November 17, 2023, 06:34:34 AM
hexstr-sse.asm - SSE implementation (masm)

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

417     cycles for 100 * Hex$ (MasmBasic)
550     cycles for 100 * ToHexStr
5907    cycles for 100 * hex$ (Masm32 SDK)
46498   cycles for 100 * crt Printf()

374     cycles for 100 * Hex$ (MasmBasic)
527     cycles for 100 * ToHexStr
5820    cycles for 100 * hex$ (Masm32 SDK)
46328   cycles for 100 * crt Printf()

361     cycles for 100 * Hex$ (MasmBasic)
524     cycles for 100 * ToHexStr
5934    cycles for 100 * hex$ (Masm32 SDK)
46904   cycles for 100 * crt Printf()

371     cycles for 100 * Hex$ (MasmBasic)
558     cycles for 100 * ToHexStr
5673    cycles for 100 * hex$ (Masm32 SDK)
46998   cycles for 100 * crt Printf()

Hex$ (MasmBasic)                        12345600
ToHexStr                                12345600
hex$ (Masm32 SDK)                       12345600
crt Printf()                            12345600

Congrats, it's really fast :thumbsup:

ttribelli

Quote from: jj2007 on November 17, 2023, 11:58:57 AM
Quote from: ttribelli on November 17, 2023, 06:34:34 AM
hexstr-sse.asm - SSE implementation (masm)

Congrats, it's really fast :thumbsup:

Thank you. It could get faster. Removing the macro and inlining the code would provide more interleaving opportunities.

NoCforMe

Quote from: jj2007 on November 17, 2023, 11:58:57 AM
Congrats, it's really fast :thumbsup:

Wow. (Almost) two orders of magnitude compared to the CRT routine. (Maybe that shouldn't surprise us ...)

jj2007

Quote from: NoCforMe on November 17, 2023, 07:20:02 PM
Wow. (Almost) two orders of magnitude compared to the CRT routine. (Maybe that shouldn't surprise us ...)

Remember that nowadays C/C++ compilers are faster than hand-made Assembly (according to C/C++ programmers) :cool:

Vortex

It's not easy to compete with C/C++ compilers, especially on large projects.

jj2007

I wonder why they are not using their fast compilers for the common good :cool:
361     cycles for 100 * Hex$ (MasmBasic)
46904   cycles for 100 * crt Printf()

TimoVJL

printf and sprintf are standard, versatile C runtime functions, not purpose-specific like yours.
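A small sketch of that difference, with hypothetical function names and an assumed 32-bit input: the general-purpose route has to parse a format string at run time and handle every conversion the standard defines, while a dedicated routine does only the one job.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* General purpose: snprintf interprets "%08X" at run time. */
void hex32_via_snprintf(uint32_t value, char out[9]) {
    snprintf(out, 9, "%08" PRIX32, value);
}

/* Purpose-specific: fixed width, fixed base, no format parsing. */
void hex32_dedicated(uint32_t value, char out[9]) {
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 7; i >= 0; i--) {
        out[i] = digits[value & 0xFu];  /* lowest nibble goes to the last slot */
        value >>= 4;
    }
    out[8] = '\0';
}
```

Both write the same eight characters for a given value, so a cycle-count comparison between them measures the cost of generality rather than the quality of the compiler.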

ttribelli

Quote from: Vortex on November 18, 2023, 12:58:32 AM
Not easy to compete with the C\C++ compilers especially if the case is large projects.

That is an unknown until you have profiled the code and discovered any hot spots. It's not about size; it's about the percentage of time spent in hot spots.

Vortex

Hi ttribelli,

This has nothing to do with unknown cases and hot spots. Maybe I should be clearer: imagine that I am tasked with writing a very, very big application in assembly. Trying to optimize the whole application would be a very tiring task. On the other hand, you could write the same application with an optimizing C/C++ compiler. The optimizing engine does not get tired like me or any other human. From a practical point of view, the C/C++ compiler can be the winner.

ttribelli

Quote from: Vortex on November 18, 2023, 06:25:09 AM
imagine that I am tasked to write a very very big application in assembly.

In that case, advise HR that your manager may need medical intervention. :-)