Can I have some timings, please, especially from old CPUs? Thanks :thup:
Note that the popcnt instruction and the PopCount macro do a bit more work: they count the actual number of set bits, and the parity is then extracted with and eax, 1 (see below).
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
854 cycles for 100 * setnp
767 cycles for 100 * popcnt
1279 cycles for 100 * PopCount
851 cycles for 100 * setnp
766 cycles for 100 * popcnt
1276 cycles for 100 * PopCount
855 cycles for 100 * setnp
766 cycles for 100 * popcnt
1283 cycles for 100 * PopCount
858 cycles for 100 * setnp
768 cycles for 100 * popcnt
1285 cycles for 100 * PopCount
45 = eax setnp
45 = eax popcnt
45 = eax PopCount
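setnp:
void Rand(-1)   ; get a random DWORD, 0 ... -1
xor al, ah      ; al = byte 0 xor byte 1
mov edx, eax    ; keep a copy
bswap eax       ; bytes 3 and 2 are now in al and ah
xor al, ah      ; al = byte 3 xor byte 2
xor eax, edx    ; al = xor of all four bytes; the parity flag reflects this low byte
setnp al        ; al = 1 if the number of set bits is odd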
popcnt:
popcnt eax, Rand(-1)
and al, 1
PopCount (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1029):
void PopCount(Rand(-1))
and eax, 1
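For readers who want to check the logic outside of an assembler, here is a minimal C sketch of what the snippets above compute. It is a model, not the benchmarked code: byte_parity_odd() stands in for the CPU's parity flag, popcount32() is a plain loop rather than the popcnt instruction or the MasmBasic macro, and all function names are made up for illustration.

```c
#include <stdint.h>

/* Stand-in for "setnp": returns 1 if the byte has an odd number of set bits.
   The CPU's parity flag only ever looks at the low byte of a result. */
unsigned byte_parity_odd(uint8_t b) {
    unsigned odd = 0;
    while (b) {
        odd ^= 1u;
        b &= (uint8_t)(b - 1);   /* clear the lowest set bit */
    }
    return odd;
}

/* Model of the setnp snippet: xor the four bytes together, then test that byte. */
unsigned parity_setnp(uint32_t x) {
    uint8_t folded = (uint8_t)(x ^ (x >> 8) ^ (x >> 16) ^ (x >> 24));
    return byte_parity_odd(folded);
}

/* Portable bit count, a stand-in for popcnt / PopCount. */
unsigned popcount32(uint32_t x) {
    unsigned n = 0;
    while (x) {
        n++;
        x &= x - 1;              /* clear the lowest set bit */
    }
    return n;
}

/* Model of the popcnt snippets: count the bits, keep only bit 0. */
unsigned parity_popcnt(uint32_t x) {
    return popcount32(x) & 1u;
}
```

Both parity_setnp() and parity_popcnt() should return the same 0 or 1 for any input, which is what the identical "= eax" check values in the timing output suggest for the real code.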
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
759 cycles for 100 * setnp
758 cycles for 100 * popcnt
1056 cycles for 100 * PopCount
876 cycles for 100 * setnp
915 cycles for 100 * popcnt
1143 cycles for 100 * PopCount
761 cycles for 100 * setnp
712 cycles for 100 * popcnt
1051 cycles for 100 * PopCount
769 cycles for 100 * setnp
713 cycles for 100 * popcnt
1038 cycles for 100 * PopCount
45 = eax setnp
45 = eax popcnt
45 = eax PopCount
--- ok ---
AMD Athlon(tm) II X2 220 Processor (SSE3)
1003 cycles for 100 * setnp
831 cycles for 100 * popcnt
1611 cycles for 100 * PopCount
1003 cycles for 100 * setnp
831 cycles for 100 * popcnt
1611 cycles for 100 * PopCount
1018 cycles for 100 * setnp
831 cycles for 100 * popcnt
1611 cycles for 100 * PopCount
1003 cycles for 100 * setnp
831 cycles for 100 * popcnt
1708 cycles for 100 * PopCount
45 = eax setnp
45 = eax popcnt
45 = eax PopCount
Thanks, LiaoMi and Timo :thumbsup:
I've cooked up a variant (setnp sar, 5% faster than setnp using bswap), and rearranged the testbed so that the (pretty fast) Rand(-1) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1030) no longer influences the results. Timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
739 cycles for 400 * setnp sar
932 cycles for 400 * popcnt
2357 cycles for 400 * PopCount
780 cycles for 400 * setnp bswap
735 cycles for 400 * setnp sar
934 cycles for 400 * popcnt
2364 cycles for 400 * PopCount
781 cycles for 400 * setnp bswap
736 cycles for 400 * setnp sar
931 cycles for 400 * popcnt
2354 cycles for 400 * PopCount
777 cycles for 400 * setnp bswap
736 cycles for 400 * setnp sar
936 cycles for 400 * popcnt
2340 cycles for 400 * PopCount
777 cycles for 400 * setnp bswap
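The code of the setnp sar variant is not listed in this post. Purely as a guess at its shape, a shift-based fold could look like the following C sketch: bring the upper half down with a shift, xor it into the lower half, combine the two remaining bytes, then test the parity of the low byte. The function names are hypothetical, and the unsigned shift used here only stands in for sar (the sign bits that sar copies in end up in the upper bits, which the final byte test ignores).

```c
#include <stdint.h>

/* Stand-in for the parity flag plus setnp: 1 if the low byte of x
   has an odd number of set bits. */
unsigned low_byte_parity_odd(uint32_t x) {
    uint8_t b = (uint8_t)x;
    unsigned odd = 0;
    while (b) {
        odd ^= 1u;
        b &= (uint8_t)(b - 1);   /* clear the lowest set bit */
    }
    return odd;
}

/* Assumed shape of a shift-based fold (not taken from the thread). */
unsigned parity_shift_fold(uint32_t x) {
    x ^= x >> 16;   /* fold the upper 16 bits into the lower 16 */
    x ^= x >> 8;    /* fold the two remaining bytes into the low byte */
    return low_byte_parity_odd(x);
}
```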
Pentium III doesn't run MasmBasic, as per usual.
= = =
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
1786 cycles for 100 * setnp
Parity32 has encountered a problem and needs to close.
We are sorry for the inconvenience.
{Stuff...}
Exception information...
code: 0xc000001d
address: 0x0...0401fa
...
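Exception code 0xc000001d is STATUS_ILLEGAL_INSTRUCTION, which fits a Pentium M reaching the popcnt test right after the setnp one. A program that wants to survive on such CPUs can query CPUID first: leaf 1 reports popcnt support in bit 23 of ECX. The sketch below is C for GCC or Clang (it assumes their <cpuid.h> header); the helper names are made up, and on non-x86 builds it simply reports no support.

```c
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang wrapper around the cpuid instruction */
#endif

/* CPUID leaf 1: bit 23 of ECX is the POPCNT feature flag. */
int ecx_has_popcnt(uint32_t ecx) {
    return (int)((ecx >> 23) & 1u);
}

/* Returns 1 if the running CPU reports popcnt, otherwise 0. */
int cpu_has_popcnt(void) {
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                /* leaf 1 not available */
    return ecx_has_popcnt(ecx);
#else
    return 0;                    /* not an x86 build */
#endif
}
```

A testbed could call cpu_has_popcnt() once at startup and skip the popcnt test when it returns 0, so the other timings still get printed.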
= = =
I had to monkey about with Windows Defender to get the following to run.
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
1074 cycles for 100 * setnp
984 cycles for 100 * popcnt
1501 cycles for 100 * PopCount
949 cycles for 100 * setnp
986 cycles for 100 * popcnt
1430 cycles for 100 * PopCount
1327 cycles for 100 * setnp
937 cycles for 100 * popcnt
1327 cycles for 100 * PopCount
1255 cycles for 100 * setnp
897 cycles for 100 * popcnt
1447 cycles for 100 * PopCount
45 = eax setnp
45 = eax popcnt
45 = eax PopCount
--- ok ---
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4) ; about 5 years old
780 cycles for 100 * setnp
737 cycles for 100 * popcnt
1064 cycles for 100 * PopCount
774 cycles for 100 * setnp
735 cycles for 100 * popcnt
1065 cycles for 100 * PopCount
776 cycles for 100 * setnp
740 cycles for 100 * popcnt
1066 cycles for 100 * PopCount
776 cycles for 100 * setnp
735 cycles for 100 * popcnt
1062 cycles for 100 * PopCount
45 = eax setnp
45 = eax popcnt
45 = eax PopCount
--- ok ---
Version 2
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
651 cycles for 400 * setnp sar
961 cycles for 400 * popcnt
2179 cycles for 400 * PopCount
568 cycles for 400 * setnp bswap
651 cycles for 400 * setnp sar
955 cycles for 400 * popcnt
2177 cycles for 400 * PopCount
572 cycles for 400 * setnp bswap
651 cycles for 400 * setnp sar
950 cycles for 400 * popcnt
2182 cycles for 400 * PopCount
572 cycles for 400 * setnp bswap
656 cycles for 400 * setnp sar
951 cycles for 400 * popcnt
2178 cycles for 400 * PopCount
569 cycles for 400 * setnp bswap
205 = eax setnp sar
205 = eax popcnt
205 = eax PopCount
205 = eax setnp bswap
--- ok ---
AMD Athlon(tm) II X2 220 Processor (SSE3)
813 cycles for 400 * setnp sar
407 cycles for 400 * popcnt
3213 cycles for 400 * PopCount
810 cycles for 400 * setnp bswap
808 cycles for 400 * setnp sar
405 cycles for 400 * popcnt
3210 cycles for 400 * PopCount
807 cycles for 400 * setnp bswap
810 cycles for 400 * setnp sar
406 cycles for 400 * popcnt
3209 cycles for 400 * PopCount
927 cycles for 400 * setnp bswap
807 cycles for 400 * setnp sar
405 cycles for 400 * popcnt
3329 cycles for 400 * PopCount
810 cycles for 400 * setnp bswap
205 = eax setnp sar
205 = eax popcnt
205 = eax PopCount
205 = eax setnp bswap
Timo's Athlon has an ultra-fast popcnt instruction. The other surprise is Hutch's i7, with a 15% faster "bswap" algo - on my i5 it is 5% slower :cool:
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
651 cycles for 400 * setnp sar
961 cycles for 400 * popcnt
2179 cycles for 400 * PopCount
568 cycles for 400 * setnp bswap
Thanks to all :thup:
popcnt is faster on old AMD CPUs (or the timings have a problem :biggrin:)
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
786 cycles for 400 * setnp sar
396 cycles for 400 * popcnt
3750 cycles for 400 * PopCount
787 cycles for 400 * setnp bswap
787 cycles for 400 * setnp sar
398 cycles for 400 * popcnt
3749 cycles for 400 * PopCount
784 cycles for 400 * setnp bswap
785 cycles for 400 * setnp sar
395 cycles for 400 * popcnt
3748 cycles for 400 * PopCount
785 cycles for 400 * setnp bswap
786 cycles for 400 * setnp sar
398 cycles for 400 * popcnt
3750 cycles for 400 * PopCount
783 cycles for 400 * setnp bswap
V.2
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
644 cycles for 400 * setnp sar
937 cycles for 400 * popcnt
2141 cycles for 400 * PopCount
592 cycles for 400 * setnp bswap
645 cycles for 400 * setnp sar
930 cycles for 400 * popcnt
2118 cycles for 400 * PopCount
565 cycles for 400 * setnp bswap
635 cycles for 400 * setnp sar
957 cycles for 400 * popcnt
2208 cycles for 400 * PopCount
871 cycles for 400 * setnp bswap
757 cycles for 400 * setnp sar
1096 cycles for 400 * popcnt
2162 cycles for 400 * PopCount
615 cycles for 400 * setnp bswap
205 = eax setnp sar
205 = eax popcnt
205 = eax PopCount
205 = eax setnp bswap
--- ok ---
Quote from: HSE on February 22, 2021, 05:58:09 AM
popcnt is faster on old AMD CPUs (or the timings have a problem :biggrin:)
The timings are not ultra-stable, but it's pretty clear from your and Timo's CPUs that AMD has a much faster popcnt implementation. The i7s, in contrast, perform better on the setnp bswap version.
For the first test piece...
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
783 cycles for 100 * setnp
685 cycles for 100 * popcnt
920 cycles for 100 * PopCount
758 cycles for 100 * setnp
683 cycles for 100 * popcnt
918 cycles for 100 * PopCount
770 cycles for 100 * setnp
686 cycles for 100 * popcnt
915 cycles for 100 * PopCount
755 cycles for 100 * setnp
682 cycles for 100 * popcnt
915 cycles for 100 * PopCount
45 = eax setnp
45 = eax popcnt
45 = eax PopCount
--- ok ---
And the second...
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G (SSE4)
609 cycles for 400 * setnp sar
317 cycles for 400 * popcnt
2406 cycles for 400 * PopCount
511 cycles for 400 * setnp bswap
610 cycles for 400 * setnp sar
360 cycles for 400 * popcnt
2420 cycles for 400 * PopCount
512 cycles for 400 * setnp bswap
605 cycles for 400 * setnp sar
313 cycles for 400 * popcnt
2403 cycles for 400 * PopCount
511 cycles for 400 * setnp bswap
609 cycles for 400 * setnp sar
310 cycles for 400 * popcnt
2401 cycles for 400 * PopCount
512 cycles for 400 * setnp bswap
205 = eax setnp sar
205 = eax popcnt
205 = eax PopCount
205 = eax setnp bswap
--- ok ---
I'm a little late to the party. :cool:
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
948 cycles for 100 * setnp
918 cycles for 100 * popcnt
1082 cycles for 100 * PopCount
951 cycles for 100 * setnp
905 cycles for 100 * popcnt
1090 cycles for 100 * PopCount
948 cycles for 100 * setnp
914 cycles for 100 * popcnt
1104 cycles for 100 * PopCount
957 cycles for 100 * setnp
917 cycles for 100 * popcnt
1114 cycles for 100 * PopCount
45 = eax setnp
45 = eax popcnt
45 = eax PopCount
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
694 cycles for 400 * setnp sar
307 cycles for 400 * popcnt
2853 cycles for 400 * PopCount
587 cycles for 400 * setnp bswap
699 cycles for 400 * setnp sar
272 cycles for 400 * popcnt
2861 cycles for 400 * PopCount
584 cycles for 400 * setnp bswap
697 cycles for 400 * setnp sar
292 cycles for 400 * popcnt
2851 cycles for 400 * PopCount
581 cycles for 400 * setnp bswap
700 cycles for 400 * setnp sar
269 cycles for 400 * popcnt
2857 cycles for 400 * PopCount
585 cycles for 400 * setnp bswap
205 = eax setnp sar
205 = eax popcnt
205 = eax PopCount
205 = eax setnp bswap
My newest.
I wonder whether bswap and pshufb have about the same performance? Or is pshufb capable of processing 4 dwords at once inside xmm regs?
I knew old AMDs were known for faster FPU code; maybe that also includes some other opcodes.
This one has turbo up to 3.1 GHz. I am not sure the spin-up is enough, because I don't hear the fans, which is usually a sign of running at full speed. Edit: oops, I had it in a special energy-saving mode made for a steady "fixed" clock frequency when running an emulator, so that turbo doesn't go up and down and mess up a steady frame rate.
Btw, how old a CPU do you really want to test on? I am certain an emulator could emulate a 486DX, or whatever the lowest CPU is that can perform bswap :tongue: :badgrin:
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
849 cycles for 100 * setnp
647 cycles for 100 * popcnt
1157 cycles for 100 * PopCount
823 cycles for 100 * setnp
662 cycles for 100 * popcnt
1148 cycles for 100 * PopCount
829 cycles for 100 * setnp
656 cycles for 100 * popcnt
1124 cycles for 100 * PopCount
822 cycles for 100 * setnp
652 cycles for 100 * popcnt
1137 cycles for 100 * PopCount
45 = eax setnp
45 = eax popcnt
45 = eax PopCount
-
-
second version
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
698 cycles for 400 * setnp sar
1023 cycles for 400 * popcnt
1720 cycles for 400 * PopCount
600 cycles for 400 * setnp bswap
715 cycles for 400 * setnp sar
1024 cycles for 400 * popcnt
1714 cycles for 400 * PopCount
609 cycles for 400 * setnp bswap
706 cycles for 400 * setnp sar
1018 cycles for 400 * popcnt
1730 cycles for 400 * PopCount
594 cycles for 400 * setnp bswap
709 cycles for 400 * setnp sar
1014 cycles for 400 * popcnt
1762 cycles for 400 * PopCount
611 cycles for 400 * setnp bswap
205 = eax setnp sar
205 = eax popcnt
205 = eax PopCount
205 = eax setnp bswap
-
Is there any practical use for getting the parity bit in some PROC? Modifying a SEED depending on the parity bit?
Quote from: daydreamer on February 23, 2021, 04:56:48 AM
Btw, how old a CPU do you really want to test on? I am certain an emulator could emulate a 486DX, or whatever the lowest CPU is that can perform bswap :tongue: :badgrin:
I was more concerned about popcnt ;-)
Quote
Is there any practical use for getting the parity bit in some PROC?
Probably not - since when do we need a reason for testing algos in the Lab? :tongue:
Quote from: jj2007 on February 23, 2021, 05:54:09 AM
Probably not - since when do we need a reason for testing algos in the Lab? :tongue:
I ran a benchmark of xchg vs mov earlier, from the masm32 SDK; so if you swap in xchg, does it become a little faster?
Testing algos in the Lab is kind of similar to testing an alternative algo that solves the problem using different, "untouched" mnemonics/opcodes.
I already tested an alternative using pshufb, so maybe benchmark pshufb vs bswap?
Quote from: daydreamer on February 23, 2021, 06:17:36 AM
I ran a benchmark of xchg vs mov earlier, from the masm32 SDK; so if you swap in xchg, does it become a little faster?
Testing algos in the Lab is kind of similar to testing an alternative algo that solves the problem using different, "untouched" mnemonics/opcodes.
I already tested an alternative using pshufb, so maybe benchmark pshufb vs bswap?
xchg and bswap do different things, not useful here.
pshufb might be a tick faster, but I'd like to remain on the safe side: this is a recent instruction not supported by older CPUs.
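On the earlier question of whether pshufb can handle four dwords at once: it can, since it permutes all 16 bytes of an xmm register under control of a mask; the instruction arrived with SSSE3. To stay runnable on any machine, the following C sketch only models the instruction's byte selection in plain code rather than calling an intrinsic. model_pshufb, bswap32, BSWAP4_MASK and shuffle_matches_bswap are names chosen for illustration.

```c
#include <stdint.h>

/* Plain-C model of pshufb: each mask byte picks a source byte by its low
   4 bits; a mask byte with bit 7 set yields zero instead. */
void model_pshufb(uint8_t dst[16], const uint8_t src[16], const uint8_t mask[16]) {
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Mask that reverses the bytes inside each of the four dwords. */
const uint8_t BSWAP4_MASK[16] = { 3, 2, 1, 0,  7, 6, 5, 4,  11, 10, 9, 8,  15, 14, 13, 12 };

/* What bswap does to one 32-bit register. */
uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Byte-swaps four dwords through the shuffle model and returns 1 if every
   lane equals bswap32() of the corresponding input, otherwise 0. */
int shuffle_matches_bswap(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    const uint32_t in[4] = { a, b, c, d };
    uint8_t src[16], dst[16];
    for (int i = 0; i < 4; i++)              /* store little-endian, as x86 does */
        for (int k = 0; k < 4; k++)
            src[4 * i + k] = (uint8_t)(in[i] >> (8 * k));
    model_pshufb(dst, src, BSWAP4_MASK);
    for (int i = 0; i < 4; i++) {
        uint32_t out = 0;
        for (int k = 0; k < 4; k++)
            out |= (uint32_t)dst[4 * i + k] << (8 * k);
        if (out != bswap32(in[i]))
            return 0;
    }
    return 1;
}
```

This says nothing about speed; it only shows that one shuffle with this mask does the work of four bswap instructions.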
Does BSWAP lock the bus as XCHG does?
regards mikeb
Quote from: jj2007 on February 23, 2021, 07:49:04 AM
xchg and bswap do different things, not useful here.
pshufb might be a tick faster, but I'd like to remain on the safe side: this is a recent instruction not supported by older CPUs.
I found the xchg_test.exe timing in the masm32 examples, which times xchg vs mov; feel free to ignore this timing: :tongue: :bgrin:
727 cycles, (xchg reg,reg)*100
8148 cycles, (xchg reg,mem)*100
7062 cycles, (xchg mem,reg)*100
430 cycles, (exchange reg,reg)*100 using mov
952 cycles, (exchange reg,mem)*100 using mov
Press any key to exit...
Wouldn't it be more standard today to use a UTF-16 string reverse?