News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

PopCount

Started by MichaelW, August 16, 2014, 02:31:46 AM

Previous topic - Next topic

MichaelW

The attachment contains a test of the POPCNT instruction against the sort of code previously used for the purpose.

The very repeatable results running on my Core-i3 G3220:

9 cycles, popcnt*10
36 cycles, popcnt1*10
11 cycles, popcnt2*10

9 cycles, popcnt*10
36 cycles, popcnt1*10
11 cycles, popcnt2*10

9 cycles, popcnt*10
36 cycles, popcnt1*10
11 cycles, popcnt2*10


For timing single instructions, and for recent processors even short sequences of instructions, the one-cycle resolution of the TSC presents a problem. It's difficult to know what constitutes a representative test. 

BTW, I assembled with JWASM.

Well Microsoft, here's another nice mess you've gotten us into.

jj2007

For comparison, PopCount() on a CPU that doesn't support the native popcnt

Intel(R) Celeron(R) M CPU
76 cycles, popcntMB*10
91 cycles, popcnt1*10
22 cycles, popcnt2*10

Gunther

Hi Michael,

here are your timings:
Quote
8 cycles, popcnt*10
40 cycles, popcnt1*10
13 cycles, popcnt2*10

8 cycles, popcnt*10
40 cycles, popcnt1*10
12 cycles, popcnt2*10

8 cycles, popcnt*10
40 cycles, popcnt1*10
12 cycles, popcnt2*10

Press any key to continue ...

Gunther
You have to know the facts before you can distort them.

Siekmanski

 i7-4930K CPU @ 3.40GHz
win 8.1 x64

Quote

8 cycles, popcnt*10
45 cycles, popcnt1*10
13 cycles, popcnt2*10

9 cycles, popcnt*10
45 cycles, popcnt1*10
13 cycles, popcnt2*10

9 cycles, popcnt*10
45 cycles, popcnt1*10
14 cycles, popcnt2*10

Press any key to continue ...
Creative coders use backward thinking techniques as a strategy.

hutch--

This is the result on my i7 64 bit box.




4 cycles, popcnt*10
58 cycles, popcnt1*10
14 cycles, popcnt2*10

4 cycles, popcnt*10
58 cycles, popcnt1*10
14 cycles, popcnt2*10

4 cycles, popcnt*10
58 cycles, popcnt1*10
14 cycles, popcnt2*10

Press any key to continue ...

jj2007

Not an easy choice:

- "true" popcnt excludes many older CPUs
- popcnt1 is kind of slow
- popcnt2 adds 256k of bloat

Attached the code I used for the test above.

jj2007

There is a 2010 thread here with code that uses this popcnt implementation by Alex/Antariy:
@popcnt:
movzx ecx,al
shr eax,8
imul ecx,08040201h
imul eax,08040201h
shr ecx,3
shr eax,3
and ecx,11111111h
and eax,11111111h
imul ecx,11111111h
imul eax,11111111h
shr ecx,28
shr eax,28
add eax, ecx  ; <<< not in the source
ret


Problem is that a) I am quite sure it ran fine in 2010, and very fast, b) it crashes now.
I have stared at it for quite a while, but the fog doesn't go away. Any ideas?

P.S.: Alex's popcount is 16-bit only, and would need an add eax, ecx to work (which in the source is present but higher up).

Here are some timings (the last one is the native popcnt instruction):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
653 ms, sum alex=799979533
543 ms, sum mb32=799979533
475 ms, sum mb16=799979533
446 ms, sum popc=799979533

631 ms, sum alex=799979533
549 ms, sum mb32=799979533
478 ms, sum mb16=799979533
482 ms, sum popc=799979533

649 ms, sum alex=799979533
560 ms, sum mb32=799979533
472 ms, sum mb16=799979533
449 ms, sum popc=799979533

jj2007

Can I have some timings please?

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Counting 1000* the lines in Windows.inc
naive   MasmB   w/frame lfcnt
1154    75      84 ms   1145 ms
1111    77      84 ms   1143 ms
1122    79      86 ms   1138 ms
1113    78      86 ms   1147 ms
1217    76      84 ms   1142 ms
1245    75      88 ms   1152 ms
1107    75      86 ms   1146 ms
1106    75      84 ms   1145 ms
26902 lines found

dedndave

Intel(R) Pentium(R) 4 CPU 3.00GHz
Counting 1000* the lines in Windows.inc
naive   MasmB   w/frame lfcnt
1515    296     290 ms  1937 ms
1528    292     309 ms  1925 ms
1521    299     291 ms  1934 ms
1518    290     298 ms  1926 ms
1518    296     291 ms  1934 ms
1524    287     295 ms  1934 ms
1605    292     291 ms  1933 ms
1531    286     297 ms  1930 ms
26894 lines found

jj2007

Thanks, Dave, and welcome back :icon14:

Btw your windows.inc seems 8 lines shorter than mine, which is 977412 bytes of 10 Jan 2012.

P.S.: We just elected a new Mayor of Rome:

sinsi

Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Counting 1000* the lines in Windows.inc
naive   MasmB   w/frame lfcnt
754     50      54 ms   669 ms
754     50      55 ms   693 ms
772     51      53 ms   675 ms
750     50      54 ms   665 ms
768     50      53 ms   669 ms
776     50      55 ms   667 ms
743     50      53 ms   670 ms
744     50      53 ms   668 ms
26902 lines found

jj2007

i7 likes algos developed on little brother :bgrin:

mabdelouahab

QuoteIntel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Counting 1000* the lines in Windows.inc
naive   MasmB   w/frame lfcnt
1140    96      109 ms  1114 ms
1163    85      88 ms   1053 ms
1209    80      83 ms   1037 ms
1144    77      82 ms   1057 ms
1156    77      82 ms   1059 ms
1161    80      84 ms   1017 ms
1140    78      83 ms   1041 ms
1140    77      83 ms   1015 ms
26902 lines found

LiaoMi

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
384 ms, sum alex=799979533
346 ms, sum mb32=799979533
311 ms, sum mb16=799979533
297 ms, sum popc=799979533

395 ms, sum alex=799979533
347 ms, sum mb32=799979533
302 ms, sum mb16=799979533
345 ms, sum popc=799979533

384 ms, sum alex=799979533
348 ms, sum mb32=799979533
318 ms, sum mb16=799979533
295 ms, sum popc=799979533


Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Counting 1000* the lines in Windows.inc
naive   MasmB   w/frame lfcnt
723     48      52 ms   648 ms
714     50      52 ms   647 ms
743     48      52 ms   648 ms
774     49      52 ms   652 ms
724     48      53 ms   649 ms
734     49      52 ms   640 ms
716     50      53 ms   646 ms
722     49      52 ms   643 ms
22274 lines found



jj2007

Thanks, LiaoMi :icon14:

Your line count is low - is that the latest Masm32 SDK?