Hash tables for ultra fast dictionaries

Started by Biterider, January 10, 2022, 01:09:58 AM


Biterider

Hi
For the past 2 years I have been involved in an industrial-scale Python project.
One of the things I've learned is how to use Python dictionaries really efficiently. The backend is a combination of an index table and a hash table.
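Roughly, the idea behind that backend (as in CPython's "compact dict") is a sparse index array that points into a dense, insertion-ordered entry array. A minimal C sketch of the layout, not the actual CPython structs:

/* Sketch of the compact-dict idea: a sparse index table maps hash slots
   to positions in a dense entry array. Names here are illustrative only. */
#include <stdint.h>
#include <stddef.h>

#define DK_EMPTY  (-1)          /* index slot never used          */
#define DK_DUMMY  (-2)          /* index slot used, entry deleted */

typedef struct {
    uint64_t hash;              /* cached hash of the key         */
    const char *key;
    void *value;
} DictEntry;

typedef struct {
    intptr_t  *indices;         /* sparse: size is a power of two         */
    size_t     mask;            /* size - 1, for cheap modulo by masking  */
    DictEntry *entries;         /* dense: preserves insertion order       */
    size_t     used;            /* number of live entries                 */
} CompactDict;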
While there are many implementations in other languages, using a variety of hash algorithms and strategies, searching for a hash table implementation in MASM did not produce a useful result.  :sad:

Now it is time again to get my hands dirty and write some code that is flexible to use.
As a proof of concept, I started using an OOP approach, but it can easily be turned into a procedural approach.
I will be working on it over the next few weeks, and I would be happy if someone wants to join me.  :cool:

A good description, which I used as a starting point, can be found here: https://www.data-structures-in-practice.com/hash-tables/

Biterider

LiaoMi

Hi Biterider,

I was interested in this topic earlier: I was looking for a method in which I could use a Bidirectional Associative Memory with high search speed on large volumes, up to 200 gigabytes. For other purposes, you can look at a combination of AVL trees and hashing. In any case, a lot depends on the initial requirements for data structures and speed.

A C++ hash map/table which utilizes SIMD (specifically Intel x86 SSE/AVX) - https://github.com/NathanSWard/simd_hash_map

six_L

Hi Biterider,
Interesting...
(2^N - 1) & X = X mod 2^N
;-----------------------------
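In C the identity is just a mask; a minimal illustration, assuming the table size is a power of two:

#include <stdint.h>
#include <assert.h>

/* Reduce a hash to a bucket index when the table size is a power of two:
   x & (2^n - 1) == x % 2^n, but without the division. */
static uint32_t bucket_index(uint32_t hash, uint32_t table_size)
{
    assert((table_size & (table_size - 1)) == 0);   /* power of two */
    return hash & (table_size - 1);
}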
What's your next plan? Translate Python's code into asm?
Say you, Say me, Say the codes together for ever.

Biterider

Hi LiaoMi
Of course, collision handling is a big deal when it comes to hash tables.
As you know, one way is open addressing; another is to append some sort of binary tree (RBT, AVL, etc.) or linked list to store items with the same hash.
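For the linked-list (chaining) variant, a minimal C sketch of the lookup, with hypothetical names and one singly linked list per bucket:

#include <stdint.h>
#include <string.h>

/* Separate chaining: each bucket holds a linked list of entries
   whose hashes map to the same slot. */
typedef struct ChainNode {
    uint32_t hash;
    const char *key;
    void *value;
    struct ChainNode *next;
} ChainNode;

static ChainNode *chain_find(ChainNode **buckets, uint32_t mask,
                             uint32_t hash, const char *key)
{
    ChainNode *node = buckets[hash & mask];
    while (node) {
        if (node->hash == hash && strcmp(node->key, key) == 0)
            return node;
        node = node->next;
    }
    return NULL;                 /* not found */
}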
Thanks for the link, I'll check it  :thumbsup:

Biterider

Biterider

Hi six_L
From what I've seen using Python (I'm not a Python fanboy!), it has extremely useful constructs. One of them is the dictionaries discussed, but there are others too.
My intention is to bring the good things I've seen to asm with a native implementation.

In fact, Python's core is written in C, which in many ways comes close to asm. I'm not talking here about Python's VM, which is a completely different thing!
You can check out the latest implementation details in the CPython repository on GitHub.

Biterider
 

mineiro

Quote from: six_L on January 11, 2022, 06:47:10 AM
interesting...
(2^N-1)&X =  Mod(X/2^N)
I used this approach here; it seems OK even without optimized code: an element frequency count over 2 million random numbers executed in 0.1439990000 sec. To reduce collisions, more memory is needed.
https://masm32.com/board/index.php?topic=8811.msg96213#msg96213

I think the first question is what the data looks like: is it fixed size or dynamic size?

The GLib hash for strings uses:
"This function implements the widely used "djb" hash apparently posted by Daniel Bernstein to comp.lang.c some time ago. The 32 bit unsigned hash value starts at 5381 and for each byte 'c' in the string, is updated: hash = hash * 33 + c. This function uses the signed value of each byte."

A good reference book about hashing is Algorithms by Robert Sedgewick.
You can see a lot of hash code if you search for the dictionary approach (Lempel-Ziv) to data compression: LZ, LZW, LZO, LZP, LZARI, ...
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

mineiro

Quote from: LiaoMi on January 10, 2022, 09:06:21 PM
I was looking for a method in which I could use - Bidirectional-Associative-memory -
A.I. with Hopfield? To restore damaged images?
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

Biterider

Hi mineiro
If you look at the different implementations and strategies you will find that they all look very similar. They differ mainly in the hash algorithm and in the collision handler.
IMHO it is possible to build a common skeleton and change the hash algo or the collision handler according to individual needs.

There are of course details like resizing (e.g. only powers of 2) or deleting (tombstones, shifting, etc.), but that can certainly be worked out.
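A rough C sketch of what such a skeleton could look like, with hypothetical names (roughly what a derived object would provide in the OOP version):

#include <stdint.h>
#include <stddef.h>

/* Common skeleton: the hash algorithm and the collision handler are the
   only exchangeable parts; table management stays the same. */
typedef uint32_t (*HashFn)(const void *key, size_t key_len);
typedef size_t   (*ProbeFn)(size_t slot, size_t attempt, size_t mask);

typedef struct {
    void   **slots;       /* table storage, size is a power of two */
    size_t   mask;        /* capacity - 1                          */
    size_t   used;
    HashFn   hash;        /* e.g. FNV-1a, djb2, Murmur3, ...       */
    ProbeFn  probe;       /* e.g. linear or quadratic probing      */
} HashTable;

/* One possible collision handler: simple linear probing. */
static size_t probe_linear(size_t slot, size_t attempt, size_t mask)
{
    return (slot + attempt) & mask;
}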

Biterider

mikeburr

I think you really need to understand your data, and hence have a specific application, for the use of hash tables .. trees wouldn't be almost ubiquitous otherwise  ??? .. one thought though is .....
It is often thought that in modular arithmetic [which includes x-bit computers] you need to divide to get a modular result ..... but you can multiply by the inverse to get the equivalent .. what is not quite so easy is calculating the inverses, which would seem at first sight to be trivial
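As a side note, for an odd constant the inverse modulo 2^32 can be found with a few Newton steps; a small C sketch (only odd values have an inverse modulo a power of two):

#include <stdint.h>

/* Multiplicative inverse of an odd 32-bit constant modulo 2^32, via
   Newton's iteration x' = x * (2 - a * x). Each step doubles the number
   of correct low bits; four steps take the initial 3 bits past 32. */
static uint32_t inverse_mod_2_32(uint32_t a)   /* a must be odd */
{
    uint32_t x = a;                /* a*a == 1 (mod 8): 3 bits correct */
    x *= 2 - a * x;                /*  6 bits */
    x *= 2 - a * x;                /* 12 bits */
    x *= 2 - a * x;                /* 24 bits */
    x *= 2 - a * x;                /* 48 -> all 32 bits */
    return x;                      /* a * x == 1 (mod 2^32) */
}

In plain integer code this only replaces a division when the division is exact, i.e. when the dividend is a multiple of the constant.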
regards mike b

Biterider

Hi
It took some effort but finally I got it working.  :biggrin:

The HashTable is defined such that the hashing and collision handler are implemented in a derived object. This has the advantage that different classes can be created using different strategies.
As I mentioned before, I took some inspiration from Python, but in the end I came up with a slightly different approach, which may not be better, but is good enough as a starting point.

In this case I created a Dictionary from the HashTable using open addressing. It handles string keys of any length and a payload of any kind. I've chosen a DWORD that I've initialized with an incrementing index for demonstration purposes.
I intentionally left aside all the necessary memory allocations to just focus on inserting, searching and deleting the dictionary entries.
My first attempt was to use the well-known FNV-1a hashing algorithm and a naive deletion strategy using tombstones, which are widely used in the literature. This combination results in 126 collisions for 512 items.
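For reference, the FNV-1a loop looks like this in C (standard 32-bit parameters; the constants in the attached asm may differ):

#include <stdint.h>
#include <stddef.h>

/* FNV-1a, 32-bit: xor the byte first, then multiply by the prime
   (FNV-1 does it the other way around). */
static uint32_t fnv1a_32(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t hash = 2166136261u;            /* offset basis */
    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];
        hash *= 16777619u;                  /* FNV_32_PRIME */
    }
    return hash;
}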

As a performance reference, I used a binary tree in the form of a highly optimized object called SortedStrCollection(W).
In theory, a dictionary should perform at O(1), while the SortedStrCollection using binary search should perform at O(log n). Reality comes close to theory; the reasons for the remaining gap are:

  • Hash collisions degrade the performance of the HashTable.
  • The overhead of the hash calculation is significant when there are only a few elements in the containers.

Still, looking at the results, searching and deleting outperform the binary search, even for very small element counts.

A major disadvantage comes into play when a HashTable needs to be resized, which requires all elements to be reinserted by hashing them again. The SortedCollection does not suffer from this effect, since all item pointers remain in place when the internal table is resized.
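To illustrate that resize cost, a sketch with hypothetical names and linear probing: every live entry runs through the hash function again because the bucket index depends on the new mask, while tombstones are simply dropped.

#include <stdint.h>
#include <stddef.h>

/* Resizing an open-addressing table: every live entry must be hashed and
   probed again. new_slots must be zero-initialized and large enough to
   hold all live entries. */
static void rehash_all(void **old_slots, size_t old_count,
                       void **new_slots, size_t new_mask,
                       uint32_t (*hash_entry)(const void *entry),
                       int (*is_live)(const void *entry))
{
    for (size_t i = 0; i < old_count; i++) {
        void *e = old_slots[i];
        if (!e || !is_live(e))
            continue;                           /* skip empty slots and tombstones */
        size_t slot = hash_entry(e) & new_mask;
        while (new_slots[slot] != NULL)         /* linear probing */
            slot = (slot + 1) & new_mask;
        new_slots[slot] = e;
    }
}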

There are many possible improvements, like testing other hashing algorithms, in-place allocation, closed addressing, etc., which I'll try over the next few days.
Attached are the sources and a log-log chart comparing performance between the dictionary and the SortedStrCollection.

Biterider

LiaoMi

Quote from: mineiro on January 11, 2022, 08:29:41 AM
Quote from: LiaoMi on January 10, 2022, 09:06:21 PM
I was looking for a method in which I could use - Bidirectional-Associative-memory -
A.I. with Hopfield? To restore damaged images?

Hi mineiro,

It was a database search; there, the data is located randomly and cannot be sorted by structural conditions. The search must be equally fast in both directions.

Biterider

Hi
While searching for hashing algorithms, I discovered that both Uasm and AsmC use the well-known FNV-1 hashing routine for their symbol tables.
A much more elaborate hash routine is MurmurHash3, which should be fine for general hash-based searching.
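For a feel of why MurmurHash3 mixes better, here is its well-known 32-bit finalizer (the "fmix32" avalanche step), which is often borrowed on its own for integer hashing:

#include <stdint.h>

/* MurmurHash3 32-bit finalizer: makes every input bit affect every
   output bit, which is the main avalanche step of the algorithm. */
static uint32_t murmur3_fmix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}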

There are a few other interesting algorithms to test.  :cool:

Biterider


six_L

Hi Biterider,
here is a hash in which collisions hardly ever happen.
Say you, Say me, Say the codes together for ever.

Biterider

Hi six_L
Looks promising. I'll check it.

Regards,
Biterider

LiaoMi

Hi,

https://www.asmcommunity.net/forums/topic/?id=30185
Quote: Ok, I've started working on my own version of h2incn - a program that will parse C header files and convert them into Nasm compatible .inc files. After creating the initial program outline I am now ready to begin adding in the preprocessing support.

One of the things that I need is a fast way of performing lookups of defines and typedefs. I've settled on using hash maps and the hash function algorithm FNV-1 which, according to the authors, provides for a nice dispersion across a hash map. Thus I begin by defining a hash map to contain the key/value pairs and develop the hashing function.

I want the hashing function and the hash map to be capable of holding arbitrary types and data sizes. Thus the routines being developed do not depend on simple character strings (although you can indeed supply them to the functions). That way, the routines may be used to hold objects of all types for any purpose ( token processing, game world objects, in-memory databases, caching, etc. ).

The following is my Nasm modified version of the FNV-1 hash algorithm for use in both 32-bit and 64-bit systems. Note that the 32-bit version uses the standard C calling convention while the 64-bit version uses either Windows or Linux fastcall calling conventions. It may be optimized further (an exercise left for the reader) but I'm quite happy with it. I would love to know if others find use for it and what their collision findings are...

nasm - FNV1HASH.ASM
;
; FNV1HASH.ASM
;
; Copyright (C)2010 Rob Neff
; Source code is licensed under the new/simplified 2-clause BSD OSI license.
;
; This function implements the FNV-1 hash algorithm.
; This source file is formatted for Nasm compatibility although it
; is small enough to be easily converted into another assembler format.
;
; Example C/C++ call:
;
; #ifdef __cplusplus
; extern "C" {
; #endif
;
; unsigned int FNV1Hash(char *buffer, unsigned int len, unsigned int offset_basis);
;
; #ifdef __cplusplus
; }
; #endif
;
; int hash;
;
; /* obtain 32-bit FNV1 hash */
; hash = FNV1Hash(buffer, len, 2166136261);
;
; /* if desired - convert from a 32-bit to 16-bit hash */
; hash = ((hash >> 16) ^ (hash & 0xFFFF));
;



%ifidni __BITS__,32
;
; 32-bit C calling convention
;

%define buffer        [ebp+8]
%define len           [ebp+12]
%define offset_basis  [ebp+16]

global _FNV1Hash

_FNV1Hash:
  push ebp                    ; set up stack frame
  mov  ebp, esp
  push esi                    ; save registers used
  push edi
  push ebx
  push ecx
  push edx

  mov  esi, buffer            ;esi = ptr to buffer
  mov  ecx, len              ;ecx = length of buffer (counter)
  mov  eax, offset_basis      ;set to 2166136261 for FNV-1
  mov  edi, 1000193h          ;FNV_32_PRIME = 16777619
  xor  ebx, ebx              ;ebx = 0
nextbyte:
  mul  edi                    ;eax = eax * FNV_32_PRIME
  mov  bl, [esi]              ;bl = byte from esi
  xor  eax, ebx              ;al = al xor bl
  inc  esi                    ;esi = esi + 1 (buffer pos)
  dec  ecx                    ;ecx = ecx - 1 (counter)
  jnz  nextbyte              ;if ecx != 0, jmp to NextByte

  pop  edx                    ; restore registers
  pop  ecx
  pop  ebx
  pop  edi
  pop  esi
  mov  esp, ebp              ; restore stack frame
  pop  ebp
  ret                        ; eax = fnv1 hash

%elifidni __BITS__,64
;
; 64-bit function
;

%ifidni __OUTPUT_FORMAT__,win64
;
; 64-bit Windows fastcall convention:
;    ints/longs/ptrs: RCX, RDX, R8, R9
;    floats/doubles: XMM0 to XMM3
;
global FNV1Hash

FNV1Hash:
  xchg rcx, rdx              ;rcx = length of buffer
  xchg r8, rdx                ;r8 = ptr to buffer

%elifidni __OUTPUT_FORMAT__,elf64
;
; 64-bit Linux fastcall convention
;    ints/longs/ptrs: RDI, RSI, RDX, RCX, R8, R9
;    floats/doubles: XMM0 to XMM7
;
global _FNV1Hash

_FNV1Hash:
  mov  rcx, rsi
  mov  r8, rdi

%endif

  mov  rax, rdx              ;rax = offset_basis - set to 14695981039346656037 for FNV-1
  mov  r9, 100000001B3h      ;r9 = FNV_64_PRIME = 1099511628211
  mov  r10, rbx              ;r10 = saved copy of rbx
  xor  rbx, rbx              ;rbx = 0
nextbyte:
  mul  r9                    ;rax = rax * FNV_64_PRIME
  mov  bl, [r8]              ;bl = byte from r8
  xor  rax, rbx              ;al = al xor bl
  inc  r8                    ;inc buffer pos
  dec  rcx                    ;rcx = rcx - 1 (counter)
  jnz  nextbyte              ;if rcx != 0, jmp to nextbyte
  mov  rbx, r10              ;restore rbx
  ret                        ;rax = fnv1 hash

%endif


Robin Hood hashing
Robin Hood hashing is an open-addressing-based collision resolution algorithm; collisions are resolved by favouring the displacement of the element that is farthest (has the longest probe sequence length, PSL) from its "home location", i.e. the bucket to which the item was hashed. Although Robin Hood hashing does not change the theoretical search cost, it significantly reduces the variance of the distribution of the items over the buckets, i.e. it deals with cluster formation in the hash table.
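A minimal C sketch of the insertion step (hypothetical struct, linear probing, power-of-two table): the incoming element steals the slot of any resident with a shorter probe sequence length, and the evicted resident continues probing.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t hash;
    int      used;        /* 0 = empty slot */
    void    *payload;
} RhSlot;

/* Robin Hood insertion: if the element we carry is farther from its home
   bucket (larger PSL) than the resident, swap them and keep probing with
   the evicted resident. Assumes the table is not full. */
static void rh_insert(RhSlot *table, size_t mask, RhSlot elem)
{
    size_t slot = elem.hash & mask;
    size_t psl  = 0;                              /* our probe distance   */
    for (;;) {
        if (!table[slot].used) {                  /* empty: take it       */
            table[slot] = elem;
            table[slot].used = 1;
            return;
        }
        size_t resident_psl = (slot - (table[slot].hash & mask)) & mask;
        if (resident_psl < psl) {                 /* resident is "richer" */
            RhSlot tmp = table[slot];             /* steal its slot       */
            table[slot] = elem;
            elem = tmp;
            psl = resident_psl;
        }
        slot = (slot + 1) & mask;
        psl++;
    }
}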

Leveraging SIMD parallelism for accelerating network - https://conferences.sigcomm.org/events/apnet2020/material/apnet20-final20.pdf

xxHash - Extremely fast non-cryptographic hash algorithm
Benchmarks
The reference system uses an Intel i7-9700K CPU and runs Ubuntu x64 20.04. The open source benchmark program is compiled with clang v10.0 using the -O3 flag.

| Hash Name            | Width | Bandwidth (GB/s) | Small Data Velocity | Quality | Comment                   |
| -------------------- | ----: | ---------------: | ------------------: | ------: | ------------------------- |
| XXH3 (SSE2)          |    64 |        31.5 GB/s |               133.1 |      10 |                           |
| XXH128 (SSE2)        |   128 |        29.6 GB/s |               118.1 |      10 |                           |
| RAM sequential read  |   N/A |        28.0 GB/s |                 N/A |     N/A | for reference             |
| City64               |    64 |        22.0 GB/s |                76.6 |      10 |                           |
| T1ha2                |    64 |        22.0 GB/s |                99.0 |       9 | Slightly worse collisions |
| City128              |   128 |        21.7 GB/s |                57.7 |      10 |                           |
| XXH64                |    64 |        19.4 GB/s |                71.0 |      10 |                           |
| SpookyHash           |    64 |        19.3 GB/s |                53.2 |      10 |                           |
| Mum                  |    64 |        18.0 GB/s |                67.0 |       9 | Slightly worse collisions |
| XXH32                |    32 |         9.7 GB/s |                71.9 |      10 |                           |
| City32               |    32 |         9.1 GB/s |                66.0 |      10 |                           |
| Murmur3              |    32 |         3.9 GB/s |                56.1 |      10 |                           |
| SipHash              |    64 |         3.0 GB/s |                43.2 |      10 |                           |
| FNV64                |    64 |         1.2 GB/s |                62.7 |       5 | Poor avalanche properties |
| Blake2               |   256 |         1.1 GB/s |                 5.1 |      10 | Cryptographic             |
| SHA1                 |   160 |         0.8 GB/s |                 5.6 |      10 | Cryptographic but broken  |
| MD5                  |   128 |         0.6 GB/s |                 7.8 |      10 | Cryptographic but broken  |

| Hash Name            | ISA ext | Width | Large Data Speed | Small Data Velocity |
| -------------------- | ------- | ----: | ---------------: | ------------------: |
| XXH3_64bits()        | AVX2    |    64 |        59.4 GB/s |               133.1 |
| MeowHash             | AES-NI  |   128 |        58.2 GB/s |                52.5 |
| XXH3_128bits()       | AVX2    |   128 |        57.9 GB/s |               118.1 |
| CLHash               | PCLMUL  |    64 |        37.1 GB/s |                58.1 |
| XXH3_64bits()        | SSE2    |    64 |        31.5 GB/s |               133.1 |
| XXH3_128bits()       | SSE2    |   128 |        29.6 GB/s |               118.1 |
| RAM sequential read  |         |   N/A |        28.0 GB/s |                 N/A |
| ahash                | AES-NI  |    64 |        22.5 GB/s |               107.2 |
| City64               |         |    64 |        22.0 GB/s |                76.6 |
| T1ha2                |         |    64 |        22.0 GB/s |                99.0 |
| City128              |         |   128 |        21.7 GB/s |                57.7 |
| FarmHash             | AES-NI  |    64 |        21.3 GB/s |                71.9 |
| XXH64()              |         |    64 |        19.4 GB/s |                71.0 |
| SpookyHash           |         |    64 |        19.3 GB/s |                53.2 |
| Mum                  |         |    64 |        18.0 GB/s |                67.0 |
| CRC32C               | SSE4.2  |    32 |        13.0 GB/s |                57.9 |
| XXH32()              |         |    32 |         9.7 GB/s |                71.9 |
| City32               |         |    32 |         9.1 GB/s |                66.0 |
| Blake3*              | AVX2    |   256 |         4.4 GB/s |                 8.1 |
| Murmur3              |         |    32 |         3.9 GB/s |                56.1 |
| SipHash*             |         |    64 |         3.0 GB/s |                43.2 |
| Blake3*              | SSE2    |   256 |         2.4 GB/s |                 8.1 |
| HighwayHash          |         |    64 |         1.4 GB/s |                 6.0 |
| FNV64                |         |    64 |         1.2 GB/s |                62.7 |
| Blake2*              |         |   256 |         1.1 GB/s |                 5.1 |
| SHA1*                |         |   160 |         0.8 GB/s |                 5.6 |
| MD5*                 |         |   128 |         0.6 GB/s |                 7.8 |

note 1: Small data velocity is a rough evaluation of an algorithm's efficiency on small data. For a more detailed analysis, please refer to the next paragraph.
note 2: some algorithms are faster than RAM speed. In that case, they can only reach their full speed when the input data is already in the CPU cache (L3 or better); otherwise, they max out at the RAM speed limit.

Small data
Performance on large data is only one part of the picture. Hashing is also very useful in constructions like hash tables and Bloom filters. In these use cases, it is common to hash a lot of small data (starting at a few bytes). An algorithm's performance can be very different in such scenarios, since parts of the algorithm, such as initialization or finalization, become a fixed cost. The impact of branch misprediction also becomes much more pronounced.

XXH3 has been designed for excellent performance on both long and small inputs, which can be observed in the following graph: - https://user-images.githubusercontent.com/750081/61976089-aedeab00-af9f-11e9-9239-e5375d6c080f.png

Vectorizing xxHash for Fun and Profit - https://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/