Author Topic: CRC32 ala Castagnoli  (Read 527 times)

Biterider

  • Member
  • *****
  • Posts: 1043
  • ObjAsm Developer
    • ObjAsm
CRC32 ala Castagnoli
« on: March 07, 2022, 02:56:03 AM »
Hi
I recently worked on a HashMap/HashTable and compared some hashing algorithms http://masm32.com/board/index.php?topic=9754.0.
One of them is the well-known CRC32 algorithm. I was tempted to use the hardware support, so I dug deeper to understand how it works and how to use it.

After reading some technical information and the Intel optimization https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf, I started programming and did some timing tests to hash regular strings in a range up to 100 characters (ANSI/Wide).

The result is somewhat surprising. The naive implementation is not so bad.
Code: [Select]
      mov xdx, pKey
      mov eax, -1
      .while TRUE
        movzx ecx, CHR ptr [xdx]
        CRC32 eax, $SubReg(ecx, sizeof(CHR))
        .break .if ecx == 0
        add xdx, sizeof(CHR)
      .endw
      xor eax, -1
Intel's proposed parallel operation to overcome the 3-cycle latency requires too much setup and extra computation to recombine the 3 streams, which is an overkill for smaller buffer sizes.

For this reason I wrote 2 procedures that I want to share, one for an aligned and one for an unaligned memory buffer.
Both compile and work fine for 32 and 64 bit.

Biterider
« Last Edit: March 07, 2022, 09:29:31 PM by Biterider »