Hi,
https://www.asmcommunity.net/forums/topic/?id=30185Ok, I've started working on my own version of h2incn - a program that will parse C header files and convert them into Nasm compatible .inc files. After creating the initial program outline I am now ready to begin adding in the preprocessing support.
One of the things that I need is a fast way of performing lookups of defines and typedefs. I've settled on using hash maps and the hash function algorithm FNV-1 which, according to the authors, provides for a nice dispersion across a hash map. Thus I begin by defining a hash map to contain the key/value pairs and develop the hashing function.
I want the hashing function and the hash map to be capable of holding arbitrary types and data sizes. Thus the routines being developed do not depend on simple character strings (although you can indeed supply them to the functions). That way, the routines may be used to hold objects of all types for any purpose ( token processing, game world objects, in-memory databases, caching, etc. ).
The following is my Nasm modified version of the FNV-1 hash algorithm for use in both 32-bit and 64-bit systems. Note that the 32-bit version uses the standard C calling convention while the 64-bit version uses either Windows or Linux fastcall calling conventions. It may be optimized further (an exersize left for the reader) but I'm quite happy with it. I would love to know if others find use for it and what their collision findings are...
nasm - FNV1HASH.ASM;
; FNV1HASH.ASM
;
; Copyright (C)2010 Rob Neff
; Source code is licensed under the new/simplified 2-clause BSD OSI license.
;
; This function implements the FNV-1 hash algorithm.
; This source file is formatted for Nasm compatibility although it
; is small enough to be easily converted into another assembler format.
;
; Example C/C++ call:
;
; #ifdef __cplusplus
; extern "C" {
; #endif
;
; unsigned int FNV1Hash(char *buffer, unsigned int len, unsigned int offset_basis);
;
; #ifdef __cplusplus
; }
; #endif
;
; int hash;
;
; /* obtain 32-bit FNV1 hash */
; hash = FNV1Hash(buffer, len, 2166136261);
;
; /* if desired - convert from a 32-bit to 16-bit hash */
; hash = ((hash >> 16) ^ (hash & 0xFFFF));
;
%ifidni __BITS__,32
;
; 32-bit C calling convention
;
%define buffer
%define len
%define offset_basis
global _FNV1Hash
_FNV1Hash:
push ebp ; set up stack frame
mov ebp, esp
push esi ; save registers used
push edi
push ebx
push ecx
push edx
mov esi, buffer ;esi = ptr to buffer
mov ecx, len ;ecx = length of buffer (counter)
mov eax, offset_basis ;set to 2166136261 for FNV-1
mov edi, 1000193h ;FNV_32_PRIME = 16777619
xor ebx, ebx ;ebx = 0
nextbyte:
mul edi ;eax = eax * FNV_32_PRIME
mov bl, ;bl = byte from esi
xor eax, ebx ;al = al xor bl
inc esi ;esi = esi + 1 (buffer pos)
dec ecx ;ecx = ecx - 1 (counter)
jnz nextbyte ;if ecx != 0, jmp to NextByte
pop edx ; restore registers
pop ecx
pop ebx
pop edi
pop esi
mov esp, ebp ; restore stack frame
pop ebp
ret ; eax = fnv1 hash
%elifidni __BITS__,64
;
; 64-bit function
;
%ifidni __OUTPUT_FORMAT__,win64
;
; 64-bit Windows fastcall convention:
; ints/longs/ptrs: RCX, RDX, R8, R9
; floats/doubles: XMM0 to XMM3
;
global FNV1Hash
FNV1Hash:
xchg rcx, rdx ;rcx = length of buffer
xchg r8, rdx ;r8 = ptr to buffer
%elifidni __OUTPUT_FORMAT__,elf64
;
; 64-bit Linux fastcall convention
; ints/longs/ptrs: RDI, RSI, RDX, RCX, R8, R9
; floats/doubles: XMM0 to XMM7
;
global _FNV1Hash
_FNV1Hash:
mov rcx, rsi
mov r8, rdi
%endif
mov rax, rdx ;rax = offset_basis - set to 14695981039346656037 for FNV-1
mov r9, 100000001B3h ;r9 = FNV_64_PRIME = 1099511628211
mov r10, rbx ;r10 = saved copy of rbx
xor rbx, rbx ;rbx = 0
nextbyte:
mul r9 ;rax = rax * FNV_64_PRIME
mov bl, ;bl = byte from r8
xor rax, rbx ;al = al xor bl
inc r8 ;inc buffer pos
dec rcx ;rcx = rcx - 1 (counter)
jnz nextbyte ;if rcx != 0, jmp to nextbyte
mov rbx, r10 ;restore rbx
ret ;rax = fnv1 hash
%endif
Robin Hood hashingRobin hood hashing is an open addressing based collision resolution algorithm; the collisions are resolved through favouring the displacement of the element that is farthest—or longest probe sequence length (PSL)—from its "home location" i.e. the bucket to which the item was hashed into. Although robin hood hashing does not change the theoretical search cost, it significantly affects the variance of the distribution of the items on the buckets, i.e. dealing with cluster formation in the hash table.
Leveraging SIMD parallelism for accelerating network -
https://conferences.sigcomm.org/events/apnet2020/material/apnet20-final20.pdfxxHash - Extremely fast non-cryptographic hash algorithm
BenchmarksThe reference system uses an Intel i7-9700K cpu, and runs Ubuntu x64 20.04. The open source benchmark program is compiled with clang v10.0 using -O3 flag.
Hash Name Width Bandwidth (GB/s) Small Data Velocity Quality Comment
XXH3 (SSE2) 64 31.5 GB/s 133.1 10
XXH128 (SSE2) 128 29.6 GB/s 118.1 10
RAM sequential read N/A 28.0 GB/s N/A N/A for reference
City64 64 22.0 GB/s 76.6 10
T1ha2 64 22.0 GB/s 99.0 9 Slightly worse collisions
City128 128 21.7 GB/s 57.7 10
XXH64 64 19.4 GB/s 71.0 10
SpookyHash 64 19.3 GB/s 53.2 10
Mum 64 18.0 GB/s 67.0 9 Slightly worse collisions
XXH32 32 9.7 GB/s 71.9 10
City32 32 9.1 GB/s 66.0 10
Murmur3 32 3.9 GB/s 56.1 10
SipHash 64 3.0 GB/s 43.2 10
FNV64 64 1.2 GB/s 62.7 5 Poor avalanche properties
Blake2 256 1.1 GB/s 5.1 10 Cryptographic
SHA1 160 0.8 GB/s 5.6 10 Cryptographic but broken
MD5 128 0.6 GB/s 7.8 10 Cryptographic but broken
* | Hash Name | ISA ext | Width | Large Data Speed | Small Data Velocity |
* | -------------------- | ------- | ----: | ---------------: | ------------------: |
* | XXH3_64bits() | @b AVX2 | 64 | 59.4 GB/s | 133.1 |
* | MeowHash | AES-NI | 128 | 58.2 GB/s | 52.5 |
* | XXH3_128bits() | @b AVX2 | 128 | 57.9 GB/s | 118.1 |
* | CLHash | PCLMUL | 64 | 37.1 GB/s | 58.1 |
* | XXH3_64bits() | @b SSE2 | 64 | 31.5 GB/s | 133.1 |
* | XXH3_128bits() | @b SSE2 | 128 | 29.6 GB/s | 118.1 |
* | RAM sequential read | | N/A | 28.0 GB/s | N/A |
* | ahash | AES-NI | 64 | 22.5 GB/s | 107.2 |
* | City64 | | 64 | 22.0 GB/s | 76.6 |
* | T1ha2 | | 64 | 22.0 GB/s | 99.0 |
* | City128 | | 128 | 21.7 GB/s | 57.7 |
* | FarmHash | AES-NI | 64 | 21.3 GB/s | 71.9 |
* | XXH64() | | 64 | 19.4 GB/s | 71.0 |
* | SpookyHash | | 64 | 19.3 GB/s | 53.2 |
* | Mum | | 64 | 18.0 GB/s | 67.0 |
* | CRC32C | SSE4.2 | 32 | 13.0 GB/s | 57.9 |
* | XXH32() | | 32 | 9.7 GB/s | 71.9 |
* | City32 | | 32 | 9.1 GB/s | 66.0 |
* | Blake3* | @b AVX2 | 256 | 4.4 GB/s | 8.1 |
* | Murmur3 | | 32 | 3.9 GB/s | 56.1 |
* | SipHash* | | 64 | 3.0 GB/s | 43.2 |
* | Blake3* | @b SSE2 | 256 | 2.4 GB/s | 8.1 |
* | HighwayHash | | 64 | 1.4 GB/s | 6.0 |
* | FNV64 | | 64 | 1.2 GB/s | 62.7 |
* | Blake2* | | 256 | 1.1 GB/s | 5.1 |
* | SHA1* | | 160 | 0.8 GB/s | 5.6 |
* | MD5* | | 128 | 0.6 GB/s | 7.8 |
note 1: Small data velocity is a rough evaluation of algorithm's efficiency on small data. For more detailed analysis, please refer to next paragraph.
note 2: some algorithms feature faster than RAM speed. In which case, they can only reach their full speed when input data is already in CPU cache (L3 or better). Otherwise, they max out on RAM speed limit.
Small dataPerformance on large data is only one part of the picture. Hashing is also very useful in constructions like hash tables and bloom filters. In these use cases, it's frequent to hash a lot of small data (starting at a few bytes). Algorithm's performance can be very different for such scenarios, since parts of the algorithm, such as initialization or finalization, become fixed cost. The impact of branch mis-prediction also becomes much more present.
XXH3 has been designed for excellent performance on both long and small inputs, which can be observed in the
following graph: -
https://user-images.githubusercontent.com/750081/61976089-aedeab00-af9f-11e9-9239-e5375d6c080f.pngVectorizing xxHash for Fun and Profit -
https://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/