On my POS Windows 8.1-64, 2.16 GHz Celeron laptop, passing the test buffer one byte at a time, the CRC32 instruction showed no speed advantage over the table version. I didn't have time to test this or to work out how to encode the instruction, but presumably it can take 4 bytes at a time, which should make the instruction version roughly 4 times faster.
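For anyone who wants to try the 4-bytes-at-a-time idea without hand-encoding the opcode, here is a minimal sketch in C using the SSE4.2 intrinsics `_mm_crc32_u32` and `_mm_crc32_u8` from `<nmmintrin.h>`. It is not the code from the test program in this thread. The function names `crc32c_sw`, `crc32c_hw` and `crc32c` are made up for illustration, and the GCC/Clang extensions `__attribute__((target))` and `__builtin_cpu_supports` are an assumption about the compiler. Note that the CRC32 instruction computes CRC-32C (Castagnoli polynomial), not the zip/PNG CRC-32. The E3069283h printed in the results is the standard CRC-32C check value for the ASCII string "123456789", so the test buffer is presumably that string.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h>          /* SSE4.2: _mm_crc32_u8, _mm_crc32_u32 */
#define HAVE_X86 1
#endif

/* Portable bitwise CRC-32C (reflected polynomial 0x82F63B78).
   Used as a reference and as a fallback when SSE4.2 is missing. */
static uint32_t crc32c_sw(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#ifdef HAVE_X86
/* Hardware version: 4 bytes per CRC32 instruction, then a byte-wise tail. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len >= 4) {
        uint32_t word;
        memcpy(&word, p, 4);            /* unaligned-safe load */
        crc = _mm_crc32_u32(crc, word); /* consumes 4 bytes at once */
        p += 4;
        len -= 4;
    }
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);  /* remaining 0..3 bytes */
    return ~crc;
}
#endif

/* Picks the hardware path when the CPU reports SSE4.2. */
uint32_t crc32c(const void *data, size_t len)
{
#ifdef HAVE_X86
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(data, len);
#endif
    return crc32c_sw(data, len);
}
```

Calling `crc32c("123456789", 9)` should give the same E3069283h as the results posted here. Whether the 4-byte form is really 4 times faster depends on the CPU, since the instruction's latency rather than the byte count tends to become the limit. On 64-bit builds there is also `_mm_crc32_u64`, which takes 8 bytes per step.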
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
E3069283h
E3069283h
100 cycles, crc_reflected
29 cycles, crc_32
i7-4790
E3069283h
E3069283h
37 cycles, crc_reflected
15 cycles, crc_32
Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
Microsoft Windows 10 Home Version: 10.0.10240
E3069283h
E3069283h
93 cycles, crc_reflected
11 cycles, crc_32
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Windows 8.1
E3069283h
E3069283h
44 cycles, crc_reflected
26 cycles, crc_32