News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Hash tables for ultra fast dictionaries

Started by Biterider, January 10, 2022, 01:09:58 AM

Previous topic - Next topic

six_L

Hi,
i tested the FNV1 hash algorithm. it's faster and happenning collision hardly.
DictHashValue_FNV1 proc uses rbx pData:qword,dqSize:qword ;hashValue
; retValue=rax

_FNV1Hash:
mov rax,14695981039346656037;rax = offset_basis - set to 14695981039346656037 for FNV-1
mov rcx, dqSize
mov r8, pData

mov r9, 0100000001B3h     ;r9 = FNV_64_PRIME = 1099511628211
xor rbx, rbx              ;rbx = 0
nextbyte:
mul r9                    ;rax = rax * FNV_64_PRIME
mov bl, [r8]   ;bl = byte from r8
xor rax, rbx              ;al = al xor bl
inc r8                    ;inc buffer pos
dec rcx                   ;rcx = rcx - 1 (counter)
jnz nextbyte              ;if rcx != 0, jmp to nextbyte
ret                       ;rax = fnv1 hash

DictHashValue_FNV1 endp
Say you, Say me, Say the codes together for ever.

Biterider

Hi six_L
That is cool  :thumbsup:
Since that you have the test setup ready, can you check and compare with the FNV-1a algorithm?
it is only necessary to change the order of operations. This algo is recommended in the literature over the FNV-1 because it is slightly better and requires the same computing time.

https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function

Thanks, Biterider

six_L

Hi,Biterider
i can't access the [https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function], could you convert it into PDF and post here?
i'll know what's the hash algorithm which has been recommended by literature.

regard.
Say you, Say me, Say the codes together for ever.

jj2007


six_L

Say you, Say me, Say the codes together for ever.

six_L

Hi,Biterider
we'll use some names as key, so FNV-1 and FNV-1a has a same effects on avalanche or random distribution of hash values. about speed, FNV-1 is as fast as FNV-1a on calculating hash values.
Quotecan you check and compare with the FNV-1a algorithm?
the attachmnet is the effects of FNV-1a on random distribution of hash values.
Say you, Say me, Say the codes together for ever.

Biterider

Hi six_L
Thanks for testing  :cool:
Here are some interesting comparisons between hash algos. In particular, FNV-1 vs. FNV-1a.
If you look around, CRC32 is another interesting candidate.

https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Biterider

six_L

#22
Hi,Biterider
QuoteDo collisions actually happen?
Yes. I started writing my test program to see if hash collisions actually happen - and are not just a theoretical construct. They do indeed happen:

FNV-1 collisions
    creamwove collides with quists

FNV-1a collisions
    costarring collides with liquid
    declinate collides with macallums
    altarage collides with zinke
    altarages collides with zinkes

Murmur2 collisions
i doubt the results of his test. Maybe he tested on a 32bit system.
Say you, Say me, Say the codes together for ever.

LiaoMi

#23
Hi,

I have compiled a library for this type of hashes - xxHashLib_x64  :rolleyes:

It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical across all platforms (little / big endian).
Here's results of various hash functions, hashing data of different lengths, with performance in MB/s: https://aras-p.info/img/blog/2016-08/hash-perf.png (https://aras-p.info/blog/2016/08/02/Hash-Functions-all-the-way-down/)

Where can I find xxhash64 and md5 collision probability statistics? - https://stackoverflow.com/questions/45788721/where-can-i-find-xxhash64-and-md5-collision-probability-statistics

If you use xxhash64,
Assuming that xxhash64 produce a 64-bit hash. You will get this graph:

LiaoMi

Hi  :biggrin:,

ISCSI CRC 32 Implementation with crc32 and pclmulqdq Instructions.

"Hardware-accelerated CRC (labeled iSCSI CRC in the table) is the fastest hash function on the recent Core i5/i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors"


Biterider

Hi LiaoMi
Thanks for all the usefull information.  :thumbsup:
I took a look into smhasher at https://github.com/aappleby/smhasher.
It has a lot of usefull things, including src of the hash algos!

Biterider


LiaoMi

Quote from: Biterider on January 21, 2022, 10:28:41 PM
Hi LiaoMi
Thanks for all the usefull information.  :thumbsup:
I took a look into smhasher at https://github.com/aappleby/smhasher.
It has a lot of usefull things, including src of the hash algos!

Biterider

Hi Biterider,

I would like to know exactly the spread of execution speed, in the material above there are many graphs with different comparisons. In general terms, I derived the following rules, speed depends:

1. on the type of processor
2. on the data itself
3. on the amount of data
3. on key length (length affects collision balance)
4. on hardware acceleration (crc32 and pclmulqdq etc)
5. on parallel execution (Can slow down execution in special cases)
6. on seed (?!)

xxHash - Extremely fast hash algorithm - already contains a collision test - https://github.com/Cyan4973/xxHash/blob/dev/tests/collisions/main.c (https://en.wikipedia.org/wiki/Avalanche_effect)

Judging by the graphs, xxHash has very good performance, it has different sub-versions for other tasks  :rolleyes:

Today in the previous post I added the x86 version of xxHash

hutch--

I have never bothered to port my basic hash table to masm, it uses a basic dynamic string array which is allocate as an empty array of placeholders which you then fill with data based on the the hash calculation. I use a reverse table of prime numbers to get the initial array location and the array length must also be a prime number so that it can never be divided by an number larger than 1.

The advantage of the dynamic array is that you can make massive arrays as each placeholder is only 4 bytes. You pay a penalty loading the array full of placeholders but you only notice it on very large arrays. Once loaded with values, the technique is genuinely fast and it does not slow down until well over 80% of the table length.

You should only really saturate the table up to about 50% but having a lot of head room makes it a lot easier to use.

six_L

Hi,LiaoMi
thank you providid a lot of the usefull information and file.
QuotexxHash has very good performance.
i want to test. but I have assembled your xxHashLib_x64,show me "fatal error LNK1104: can't open the file 'LIBCMT.lib'" 
Say you, Say me, Say the codes together for ever.

LiaoMi

Quote from: six_L on January 22, 2022, 07:31:45 PM
Hi,LiaoMi
thank you providid a lot of the usefull information and file.
QuotexxHash has very good performance.
i want to test. but I have assembled your xxHashLib_x64,show me "fatal error LNK1104: can't open the file 'LIBCMT.lib'"

Hi six_L,

the problem is in the functions that are not included in this library  :sad:

I added a part, but again something is missing
includelib xxHashLib_x86.lib
includelib ucrt.lib
includelib libucrt.lib
includelib vcruntime.lib
includelib vc.lib


Here is the build log:

Searching libraries
    Searching xxHashLib_x86.lib:
      Found _XXH32@12
        Referenced in test_bugs.obj
        Loaded xxHashLib_x86.lib(xxhash.obj)
Processed /DEFAULTLIB:LIBCMT
Processed /DEFAULTLIB:OLDNAMES
    Searching ucrt.lib:
      Found _free
        Referenced in xxHashLib_x86.lib(xxhash.obj)
        Loaded ucrt.lib(msvcrt.dll)
      Found _malloc
        Referenced in xxHashLib_x86.lib(xxhash.obj)
        Loaded ucrt.lib(msvcrt.dll)
      Found __IMPORT_DESCRIPTOR_msvcrt
        Referenced in ucrt.lib(msvcrt.dll)
        Referenced in ucrt.lib(msvcrt.dll)
        Loaded ucrt.lib(msvcrt.dll)
      Found __NULL_IMPORT_DESCRIPTOR
        Referenced in ucrt.lib(msvcrt.dll)
        Loaded ucrt.lib(msvcrt.dll)
      Found msvcrt_NULL_THUNK_DATA
        Referenced in ucrt.lib(msvcrt.dll)
        Loaded ucrt.lib(msvcrt.dll)
    Searching libucrt.lib:
    Searching vcruntime.lib:
      Found _memcpy
        Referenced in xxHashLib_x86.lib(xxhash.obj)
        Loaded vcruntime.lib(msvcrt.dll)
      Found _memset
        Referenced in xxHashLib_x86.lib(xxhash.obj)
        Loaded vcruntime.lib(msvcrt.dll)
    Searching vc.lib:
      Found __allmul
        Referenced in xxHashLib_x86.lib(xxhash.obj)
        Loaded vc.lib(llmul.obj)
      Found __aullshr
        Referenced in xxHashLib_x86.lib(xxhash.obj)
        Loaded vc.lib(ullshr.obj)
LINK : fatal error LNK1104: cannot open file 'LIBCMT.lib'


They pull everything else with them, I did not find other dependencies

Severity Code Description Project File Line Suppression State
Error LNK2019 unresolved external symbol _free referenced in function _XXH32_freeState@4 xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxhash.obj 1
Error LNK2019 unresolved external symbol _malloc referenced in function _XXH32_createState@0 xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxhash.obj 1
Error LNK2019 unresolved external symbol __allmul referenced in function _XXH3_128bits@8 xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxhash.obj 1
Error LNK2001 unresolved external symbol __allmul xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxh_x86dispatch.obj 1
Error LNK2001 unresolved external symbol __aullshr xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxhash.obj 1
Error LNK2001 unresolved external symbol __aullshr xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxh_x86dispatch.obj 1
Error LNK2019 unresolved external symbol _memcpy referenced in function _XXH32_update@12 xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxhash.obj 1
Error LNK2001 unresolved external symbol _memcpy xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxh_x86dispatch.obj 1
Error LNK2019 unresolved external symbol _memset referenced in function _XXH64_reset@12 xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\xxhash.obj 1
Error LNK2001 unresolved external symbol __DllMainCRTStartup@12 xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\xxHashLib\LINK 1
Error LNK1120 7 unresolved externals xxHashLib E:\DATA\Hashing Table - Dictionary - Hash Maps - High Dispersion\xxHashLib\xxHashLib\Release\xxHashLib.dll 1