Btw, how old a CPU do you really want to test on? I am certain an emulator could emulate a 486DX, or whatever the lowest CPU is that can perform bswap.
I was more concerned about popcnt ;-)
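For CPUs without popcnt, a software fallback is the usual workaround. Below is a minimal C sketch of the well-known SWAR bit-counting trick, useful as a reference to compare results against. The function name `popcount32_swar` is made up for illustration and is not from the masm32 SDK.

```c
#include <stdint.h>

/* Hypothetical helper: counts set bits without the popcnt instruction,
   using the classic SWAR (parallel partial sums) method. */
uint32_t popcount32_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit partial sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit partial sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit partial sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}
```

A test bed could run this on the same inputs as the popcnt version and check that both agree before timing anything.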
Is there any practical use for getting the parity bit in some PROC?
Probably not - since when do we need a reason for testing algos in the Lab? 
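For reference, here is a minimal C sketch of what a parity routine would compute, assuming the goal is to match the x86 parity flag, which covers only the low 8 bits of a result and is set when that byte has an even number of 1 bits. Both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if x has an odd number of set bits,
   0 otherwise, by XOR-folding all bits down into bit 0. */
unsigned parity32(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Hypothetical helper: mimics the x86 parity flag convention,
   returning 1 when the low byte has an EVEN number of set bits. */
unsigned pf_low_byte(uint32_t x)
{
    return parity32(x & 0xFFu) ^ 1u;
}
```

Note the two conventions are inverted relative to each other, which is easy to get wrong when comparing against a flag-based version.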
I ran some xchg vs mov benchmarks earlier, from the masm32 SDK, so if you swap xchg out for mov, does it become a little faster?
Testing algos in the Lab is kind of similar to testing an alternative algo that solves the problem using different "untouched" mnemonics/opcodes.
I already tested an alternative using shufb, so maybe benchmark shufb vs bswap?
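Whichever instruction wins, both variants should produce the same byte-reversed result, so a plain reference is handy for checking them before timing. A minimal C sketch with shifts and masks follows; the name `bswap32_ref` is made up for illustration, and this is not meant as a benchmark candidate itself.

```c
#include <stdint.h>

/* Hypothetical reference: reverses the byte order of a 32-bit value,
   which is the result bswap produces on a 32-bit register. */
uint32_t bswap32_ref(uint32_t x)
{
    return (x >> 24)                  /* byte 3 -> byte 0 */
         | ((x >> 8) & 0x0000FF00u)   /* byte 2 -> byte 1 */
         | ((x << 8) & 0x00FF0000u)   /* byte 1 -> byte 2 */
         | (x << 24);                 /* byte 0 -> byte 3 */
}
```

Applying it twice should give back the original value, which makes for a cheap sanity check on any alternative version.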