Check the parity of a DWORD

Started by jj2007, February 21, 2021, 12:08:24 PM

jj2007

Can I have some timings, please, especially from old CPUs? Thanks :thup:

Note that the popcnt instruction and the PopCount macro do a bit more work than strictly needed: they count the actual number of set bits, and the parity is then extracted from bit 0 of that count with and al, 1 or and eax, 1 (see below).

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

854     cycles for 100 * setnp
767     cycles for 100 * popcnt
1279    cycles for 100 * PopCount

851     cycles for 100 * setnp
766     cycles for 100 * popcnt
1276    cycles for 100 * PopCount

855     cycles for 100 * setnp
766     cycles for 100 * popcnt
1283    cycles for 100 * PopCount

858     cycles for 100 * setnp
768     cycles for 100 * popcnt
1285    cycles for 100 * PopCount

45      = eax setnp
45      = eax popcnt
45      = eax PopCount


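setnp:
void Rand(-1)           ; get a random DWORD in eax, 0 ... -1 (0FFFFFFFFh)
xor al, ah              ; al = byte 0 xor byte 1
mov edx, eax            ; keep a copy with the folded low byte
bswap eax               ; byte 3 is now in al, byte 2 in ah
xor al, ah              ; al = byte 3 xor byte 2
xor eax, edx            ; al = xor of all four bytes; PF reflects this low byte
setnp al                ; al = 1 if the DWORD has an odd number of set bits


popcnt:
popcnt eax, Rand(-1)    ; eax = number of set bits, 0 ... 32
and al, 1               ; keep bit 0: 1 = odd number of set bits


PopCount:
void PopCount(Rand(-1)) ; MasmBasic macro, bit count returned in eax
and eax, 1              ; keep bit 0: 1 = odd number of set bits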
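
For readers who do not have MasmBasic at hand, the two ideas above can be modelled in plain C. This is only a sketch: the function names are made up for illustration, the parity flag is imitated by a small helper, and the software bit count is a generic loop, not the actual PopCount macro (whose source is not shown in this thread).

```c
#include <stdint.h>

/* Imitates "setnp" after an instruction that set PF from byte b:
   returns 1 if b has an odd number of set bits. (Hypothetical helper.) */
static unsigned byte_is_odd(uint8_t b)
{
    b ^= (uint8_t)(b >> 4);
    b ^= (uint8_t)(b >> 2);
    b ^= (uint8_t)(b >> 1);
    return b & 1u;
}

/* Model of the bswap sequence: xor the four bytes into one byte,
   then ask for the parity of that single byte. */
static unsigned parity_bswap_model(uint32_t x)
{
    uint8_t b0 = (uint8_t)x;
    uint8_t b1 = (uint8_t)(x >> 8);
    uint8_t b2 = (uint8_t)(x >> 16);
    uint8_t b3 = (uint8_t)(x >> 24);
    uint8_t low  = (uint8_t)(b0 ^ b1);          /* xor al, ah            */
    uint8_t high = (uint8_t)(b3 ^ b2);          /* bswap eax; xor al, ah */
    return byte_is_odd((uint8_t)(low ^ high));  /* xor eax, edx; setnp al */
}

/* Model of the popcnt route: count the set bits, keep bit 0 of the count.
   The loop clears the lowest set bit on each pass. */
static unsigned parity_popcount_model(uint32_t x)
{
    unsigned bits = 0;
    while (x != 0) {
        x &= x - 1;
        bits++;
    }
    return bits & 1u;
}
```

Both functions return 1 for an odd number of set bits and 0 for an even number, which is why the three test pieces print the same eax value.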

LiaoMi

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

759     cycles for 100 * setnp
758     cycles for 100 * popcnt
1056    cycles for 100 * PopCount

876     cycles for 100 * setnp
915     cycles for 100 * popcnt
1143    cycles for 100 * PopCount

761     cycles for 100 * setnp
712     cycles for 100 * popcnt
1051    cycles for 100 * PopCount

769     cycles for 100 * setnp
713     cycles for 100 * popcnt
1038    cycles for 100 * PopCount

45      = eax setnp
45      = eax popcnt
45      = eax PopCount

--- ok ---

TimoVJL

AMD Athlon(tm) II X2 220 Processor (SSE3)

1003    cycles for 100 * setnp
831     cycles for 100 * popcnt
1611    cycles for 100 * PopCount

1003    cycles for 100 * setnp
831     cycles for 100 * popcnt
1611    cycles for 100 * PopCount

1018    cycles for 100 * setnp
831     cycles for 100 * popcnt
1611    cycles for 100 * PopCount

1003    cycles for 100 * setnp
831     cycles for 100 * popcnt
1708    cycles for 100 * PopCount

45      = eax setnp
45      = eax popcnt
45      = eax PopCount
May the source be with you

jj2007

Thanks, LiaoMi and Timo :thumbsup:

I've cooked up a variant (setnp sar, 5% faster than setnp using bswap), and rearranged the testbed so that the (pretty fast) Rand(-1) no longer influences the results. Timings:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

739     cycles for 400 * setnp sar
932     cycles for 400 * popcnt
2357    cycles for 400 * PopCount
780     cycles for 400 * setnp bswap

735     cycles for 400 * setnp sar
934     cycles for 400 * popcnt
2364    cycles for 400 * PopCount
781     cycles for 400 * setnp bswap

736     cycles for 400 * setnp sar
931     cycles for 400 * popcnt
2354    cycles for 400 * PopCount
777     cycles for 400 * setnp bswap

736     cycles for 400 * setnp sar
936     cycles for 400 * popcnt
2340    cycles for 400 * PopCount
777     cycles for 400 * setnp bswap
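
The source of the rearranged testbed is not posted here, so the following is only a guess at the general idea: generate all random inputs before the measured loop, then let every algorithm run over the same buffer. Everything in it is an assumption made for illustration, including the xorshift32 generator standing in for Rand(-1) and the checksum that sums the parity results (presumably what the "= eax" lines report).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Rand(-1): a xorshift32 generator. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill the input buffer once, outside of any timed region. */
static void fill_inputs(uint32_t *buf, size_t n, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;  /* xorshift32 must not start at 0 */
    for (size_t i = 0; i < n; i++)
        buf[i] = xorshift32(&state);
}

/* Two parity routines to compare: shift/xor folding and bit counting. */
static unsigned parity_fold(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

static unsigned parity_count(uint32_t x)
{
    unsigned bits = 0;
    while (x != 0) {
        x &= x - 1;  /* clear the lowest set bit */
        bits++;
    }
    return bits & 1u;
}

typedef unsigned (*parity_fn)(uint32_t);

/* The loop that would be timed: it only loads inputs and runs the algorithm.
   The returned sum doubles as a check that all algorithms agree. */
static unsigned checksum(parity_fn fn, const uint32_t *buf, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += fn(buf[i]);
    return sum;
}
```

With this layout the cost of the random generator is paid once up front, and equal checksums across algorithms confirm that they computed the same parities.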

FORTRANS

Pentium III doesn't run MasmBasic, as per usual.

= = =
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

1786 cycles for 100 * setnp

Parity32 has encountered a problem and needs to close.
We are sorry for the inconvenience.

{Stuff...}
Exception information...
code: 0xc000001d
address: 0x0...0401fa
...
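
Exception code 0xc000001d is Windows' STATUS_ILLEGAL_INSTRUCTION, which fits a CPU that lacks popcnt; the Pentium M above reports SSE2 only, so that is most likely what happened after the setnp test finished. Support for popcnt is reported in bit 23 of ECX after CPUID leaf 1. The sketch below shows how a test program could pick a fallback; it takes the ECX value as a parameter to stay portable, and all names in it are hypothetical.

```c
#include <stdint.h>

/* CPUID leaf 1 reports POPCNT support in ECX bit 23. */
#define CPUID1_ECX_POPCNT (1u << 23)

/* cpuid1_ecx is assumed to be the ECX value returned by CPUID with eax = 1;
   obtaining it (cpuid instruction or a compiler intrinsic) is left out here. */
static int ecx_reports_popcnt(uint32_t cpuid1_ecx)
{
    return (cpuid1_ecx & CPUID1_ECX_POPCNT) != 0;
}

/* Fallback that needs no special instruction: fold with shifts and xors. */
static unsigned parity_portable(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

typedef unsigned (*parity_fn)(uint32_t);

/* Return the popcnt-based routine only when the CPU reports support,
   otherwise the portable one, so old CPUs never execute popcnt. */
static parity_fn choose_parity(uint32_t cpuid1_ecx, parity_fn popcnt_version)
{
    return ecx_reports_popcnt(cpuid1_ecx) ? popcnt_version : parity_portable;
}
```

A testbed built this way could skip or replace the popcnt test piece on old CPUs instead of crashing.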

= = =
I had to monkey about with Windows Defender to get the following to run.

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

1074 cycles for 100 * setnp
984 cycles for 100 * popcnt
1501 cycles for 100 * PopCount

949 cycles for 100 * setnp
986 cycles for 100 * popcnt
1430 cycles for 100 * PopCount

1327 cycles for 100 * setnp
937 cycles for 100 * popcnt
1327 cycles for 100 * PopCount

1255 cycles for 100 * setnp
897 cycles for 100 * popcnt
1447 cycles for 100 * PopCount

45 = eax setnp
45 = eax popcnt
45 = eax PopCount

--- ok ---


hutch--


Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)        ; about 5 years old

780     cycles for 100 * setnp
737     cycles for 100 * popcnt
1064    cycles for 100 * PopCount

774     cycles for 100 * setnp
735     cycles for 100 * popcnt
1065    cycles for 100 * PopCount

776     cycles for 100 * setnp
740     cycles for 100 * popcnt
1066    cycles for 100 * PopCount

776     cycles for 100 * setnp
735     cycles for 100 * popcnt
1062    cycles for 100 * PopCount

45      = eax setnp
45      = eax popcnt
45      = eax PopCount

--- ok ---

Version 2

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

651     cycles for 400 * setnp sar
961     cycles for 400 * popcnt
2179    cycles for 400 * PopCount
568     cycles for 400 * setnp bswap

651     cycles for 400 * setnp sar
955     cycles for 400 * popcnt
2177    cycles for 400 * PopCount
572     cycles for 400 * setnp bswap

651     cycles for 400 * setnp sar
950     cycles for 400 * popcnt
2182    cycles for 400 * PopCount
572     cycles for 400 * setnp bswap

656     cycles for 400 * setnp sar
951     cycles for 400 * popcnt
2178    cycles for 400 * PopCount
569     cycles for 400 * setnp bswap

205     = eax setnp sar
205     = eax popcnt
205     = eax PopCount
205     = eax setnp bswap

--- ok ---

TimoVJL

AMD Athlon(tm) II X2 220 Processor (SSE3)

813     cycles for 400 * setnp sar
407     cycles for 400 * popcnt
3213    cycles for 400 * PopCount
810     cycles for 400 * setnp bswap

808     cycles for 400 * setnp sar
405     cycles for 400 * popcnt
3210    cycles for 400 * PopCount
807     cycles for 400 * setnp bswap

810     cycles for 400 * setnp sar
406     cycles for 400 * popcnt
3209    cycles for 400 * PopCount
927     cycles for 400 * setnp bswap

807     cycles for 400 * setnp sar
405     cycles for 400 * popcnt
3329    cycles for 400 * PopCount
810     cycles for 400 * setnp bswap

205     = eax setnp sar
205     = eax popcnt
205     = eax PopCount
205     = eax setnp bswap

jj2007

Timo's Athlon has an ultra-fast popcnt instruction. The other surprise is Hutch's i7, where the "bswap" algo is about 15% faster than the "sar" one; on my i5 it is 5% slower :cool:

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

651     cycles for 400 * setnp sar
961     cycles for 400 * popcnt
2179    cycles for 400 * PopCount
568     cycles for 400 * setnp bswap


Thanks to all :thup:

HSE

popcnt is faster on old AMD CPUs (or the timing has a problem :biggrin:)

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

786     cycles for 400 * setnp sar
396     cycles for 400 * popcnt
3750    cycles for 400 * PopCount
787     cycles for 400 * setnp bswap

787     cycles for 400 * setnp sar
398     cycles for 400 * popcnt
3749    cycles for 400 * PopCount
784     cycles for 400 * setnp bswap

785     cycles for 400 * setnp sar
395     cycles for 400 * popcnt
3748    cycles for 400 * PopCount
785     cycles for 400 * setnp bswap

786     cycles for 400 * setnp sar
398     cycles for 400 * popcnt
3750    cycles for 400 * PopCount
783     cycles for 400 * setnp bswap
Equations in Assembly: SmplMath

LiaoMi

V.2

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

644     cycles for 400 * setnp sar
937     cycles for 400 * popcnt
2141    cycles for 400 * PopCount
592     cycles for 400 * setnp bswap

645     cycles for 400 * setnp sar
930     cycles for 400 * popcnt
2118    cycles for 400 * PopCount
565     cycles for 400 * setnp bswap

635     cycles for 400 * setnp sar
957     cycles for 400 * popcnt
2208    cycles for 400 * PopCount
871     cycles for 400 * setnp bswap

757     cycles for 400 * setnp sar
1096    cycles for 400 * popcnt
2162    cycles for 400 * PopCount
615     cycles for 400 * setnp bswap

205     = eax setnp sar
205     = eax popcnt
205     = eax PopCount
205     = eax setnp bswap

--- ok ---

jj2007

Quote from: HSE on February 22, 2021, 05:58:09 AM
popcnt is faster on old AMD CPUs (or the timing has a problem :biggrin:)

The timings are not ultra-stable, but it's pretty clear from yours and Timo's CPUs that AMD has a much faster popcnt implementation. The i7s, in contrast, perform better on the setnp bswap version.
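
To put numbers on that comparison, the reported totals can be turned into per-call figures. The helpers below are only illustrative arithmetic (the names are made up), and the per-call values still include whatever loop overhead the testbed has.

```c
/* Average cost of one call, given a reported total and the repeat count. */
static double cycles_per_call(unsigned total_cycles, unsigned repeats)
{
    return (double)total_cycles / (double)repeats;
}

/* How much faster a is than b, expressed as a percentage of b's time. */
static double percent_faster(unsigned a_cycles, unsigned b_cycles)
{
    return 100.0 * ((double)b_cycles - (double)a_cycles) / (double)b_cycles;
}
```

With the figures posted above, cycles_per_call(405, 400) gives roughly 1.0 cycle per popcnt call on Timo's Athlon II, while cycles_per_call(950, 400) gives roughly 2.4 on Hutch's i7-5820K. On that i7, percent_faster(568, 651) comes out at just under 13% for bswap over sar; taking the bswap time as the baseline instead, sar is about 15% slower.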

quarantined

For the first test piece...


AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

783     cycles for 100 * setnp
685     cycles for 100 * popcnt
920     cycles for 100 * PopCount

758     cycles for 100 * setnp
683     cycles for 100 * popcnt
918     cycles for 100 * PopCount

770     cycles for 100 * setnp
686     cycles for 100 * popcnt
915     cycles for 100 * PopCount

755     cycles for 100 * setnp
682     cycles for 100 * popcnt
915     cycles for 100 * PopCount

45      = eax setnp
45      = eax popcnt
45      = eax PopCount

--- ok ---



And the second...


AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4)

609     cycles for 400 * setnp sar
317     cycles for 400 * popcnt
2406    cycles for 400 * PopCount
511     cycles for 400 * setnp bswap

610     cycles for 400 * setnp sar
360     cycles for 400 * popcnt
2420    cycles for 400 * PopCount
512     cycles for 400 * setnp bswap

605     cycles for 400 * setnp sar
313     cycles for 400 * popcnt
2403    cycles for 400 * PopCount
511     cycles for 400 * setnp bswap

609     cycles for 400 * setnp sar
310     cycles for 400 * popcnt
2401    cycles for 400 * PopCount
512     cycles for 400 * setnp bswap

205     = eax setnp sar
205     = eax popcnt
205     = eax PopCount
205     = eax setnp bswap

--- ok ---


I'm a little late to the party.  :cool:

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

948     cycles for 100 * setnp
918     cycles for 100 * popcnt
1082    cycles for 100 * PopCount

951     cycles for 100 * setnp
905     cycles for 100 * popcnt
1090    cycles for 100 * PopCount

948     cycles for 100 * setnp
914     cycles for 100 * popcnt
1104    cycles for 100 * PopCount

957     cycles for 100 * setnp
917     cycles for 100 * popcnt
1114    cycles for 100 * PopCount

45      = eax setnp
45      = eax popcnt
45      = eax PopCount

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

694     cycles for 400 * setnp sar
307     cycles for 400 * popcnt
2853    cycles for 400 * PopCount
587     cycles for 400 * setnp bswap

699     cycles for 400 * setnp sar
272     cycles for 400 * popcnt
2861    cycles for 400 * PopCount
584     cycles for 400 * setnp bswap

697     cycles for 400 * setnp sar
292     cycles for 400 * popcnt
2851    cycles for 400 * PopCount
581     cycles for 400 * setnp bswap

700     cycles for 400 * setnp sar
269     cycles for 400 * popcnt
2857    cycles for 400 * PopCount
585     cycles for 400 * setnp bswap

205     = eax setnp sar
205     = eax popcnt
205     = eax PopCount
205     = eax setnp bswap

daydreamer

My newest:
I wonder whether bswap has about the same performance as pshufb? Or is pshufb capable of handling 4 x dwords inside xmm regs?
I knew old AMDs were known for faster FPU code; maybe that also includes some other opcodes.

This one has turbo up to 3.1 GHz. I am not sure the spin-up is enough, because I don't hear the fans, which is usually a sign of running at full speed. Edit: oops, I had it in a special energy-saving mode made for a steady "fixed" clock frequency when running an emulator, so that turbo doesn't go up and down and mess up a steady frame rate.
Btw, how old a CPU do you really want to test on? I am certain an emulator could emulate a 486DX, or whatever the lowest CPU is that can perform bswap  :tongue: :badgrin:

Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

849     cycles for 100 * setnp
647     cycles for 100 * popcnt
1157    cycles for 100 * PopCount

823     cycles for 100 * setnp
662     cycles for 100 * popcnt
1148    cycles for 100 * PopCount

829     cycles for 100 * setnp
656     cycles for 100 * popcnt
1124    cycles for 100 * PopCount

822     cycles for 100 * setnp
652     cycles for 100 * popcnt
1137    cycles for 100 * PopCount

45      = eax setnp
45      = eax popcnt
45      = eax PopCount

-
-

second version
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

698     cycles for 400 * setnp sar
1023    cycles for 400 * popcnt
1720    cycles for 400 * PopCount
600     cycles for 400 * setnp bswap

715     cycles for 400 * setnp sar
1024    cycles for 400 * popcnt
1714    cycles for 400 * PopCount
609     cycles for 400 * setnp bswap

706     cycles for 400 * setnp sar
1018    cycles for 400 * popcnt
1730    cycles for 400 * PopCount
594     cycles for 400 * setnp bswap

709     cycles for 400 * setnp sar
1014    cycles for 400 * popcnt
1762    cycles for 400 * PopCount
611     cycles for 400 * setnp bswap

205     = eax setnp sar
205     = eax popcnt
205     = eax PopCount
205     = eax setnp bswap

-

Is there any practical use for getting the parity bit in some PROC? Modifying a SEED depending on the parity bit?
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Quote from: daydreamer on February 23, 2021, 04:56:48 AM
Btw, how old a CPU do you really want to test on? I am certain an emulator could emulate a 486DX, or whatever the lowest CPU is that can perform bswap  :tongue: :badgrin:

I was more concerned about popcnt ;-)

Quote
Is there any practical use for getting the parity bit in some PROC?

Probably not - since when do we need a reason for testing algos in the Lab? :tongue: