Hi
I recently worked on a HashMap/HashTable and compared some hashing algorithms http://masm32.com/board/index.php?topic=9754.0 (http://masm32.com/board/index.php?topic=9754.0).
One of them is the well-known CRC32 algorithm. I was tempted to use the hardware support, so I dug deeper to understand how it works and how to use it.
After reading some technical information and the Intel optimization https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf (https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf), I started programming and did some timing tests to hash regular strings in a range up to 100 characters (ANSI/Wide).
The result is somewhat surprising. The naive implementation is not so bad.
mov xdx, pKey
mov eax, -1
.while TRUE
movzx ecx, CHR ptr [xdx]
CRC32 eax, $SubReg(ecx, sizeof(CHR))
.break .if ecx == 0
add xdx, sizeof(CHR)
.endw
xor eax, -1
Intel's proposed parallel operation to overcome the 3-cycle latency requires too much setup and extra computation to recombine the 3 streams, which is an overkill for smaller buffer sizes.
For this reason I wrote 2 procedures that I want to share, one for an aligned and one for an unaligned memory buffer.
Both compile and work fine for 32 and 64 bit.
Biterider