News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Prime Numbers, faster than primesieve ...

Started by rrr314159, November 20, 2015, 02:17:50 PM

Previous topic - Next topic

rrr314159

This 64-bit version of the Prime Number Generator (png) is a bit faster than primesieve (current net record holder) under 2 billion primes, multi-threaded. It does 2 billion in about 113 milliseconds vs. primesieve's approx 117. Primesieve (ps) uses ticks to count and gets 110 and 125 about equally, so estimate 117.5. These are the times on my 15 3330, using Chris Wallich's latest console version (Nov 8, 2015).

The margin is very slim, and ps beats png with one thread; my advantage is in the threading algorithm. Much more important, he goes up to a billion billion; so png is just a toy by comparison - call it a Proof-of-Concept. Since I've converted to 64-bits I can go up to the same 2^64 without too much trouble. But png will be very slow up there until 2 more critical algo's are added (for medium and large primes). When all that is done (doubt I'll ever get to it) the 2 billion timing might suffer ... Bottom line, Chris Wallich has nothing to worry about yet!

png build size is about 4% that of ps; SLOC count, about 30% of ps. Assembler takes fewer lines than C++, due to macros; and of course far fewer bytes.

As usual this is all my own work, although primesieve has been invaluable for inspiration. About 75% of my algos are roughly the same - there are only so many ways to skin the cat - but a comparison will clearly show they're independently generated. in fact he has a couple of wrinkles I didn't think of, which I didn't steal; next time I will. OTOH he's missing a couple things also, particularly for threading.

To run it: "png" will do up to 2 billion by default. "png 12 67" does from 12 to 67. "png 12 67 -p" prints the primes, "png -t2" uses only two threads. BTW it takes at least 30 seconds to generate a 1-gig file with 100 million primes (or so), using "png -p > primes.txt".

Here are the results of some test cases:

C:\png1>test

C:\png1>png
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 112.9
from 1 up to 2000000010, 98222287 primes

C:\png1>png -t1
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 416.3
from 0 up to 2000000010, 98222287 primes

C:\png1>png 0 1000000000
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 49.8
from 0 up to 1000000000, 50847534 primes

C:\png1>png 1000000000 2000000000
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 69.07
from 1000000000 up to 2000000000, 47374753 primes

C:\png1>png 777 2000000000
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 112.5
from 777 up to 2000000000, 98222150 primes

C:\png1>png 777777 2000000000
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 112.5
from 777777 up to 2000000000, 98159986 primes

C:\png1>png 1234567 123456789
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 5.379
from 1234567 up to 123456789, 6931900 primes

C:\png1>png 15485163
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 0.0001095
from 15485163 up to 15485163, 0 primes

C:\png1>png 15485863
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 0.1597
from 15485863 up to 15485863, 1 primes

C:\png1>png 0 960960
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 0.1498
from 0 up to 960960, 75681 primes


The first zip includes:

png.exe
png.asm          ; main program, does first segment
threads.asm     ; thread algo
segments.asm  ; segments above the first one
wheel.asm       ; wheel algorithm (2 * 3 * 5 = 30)
utility.asm       ; prime-counting, sys info, command line
png_macros     ; png-specific utility macros
png.inc            ; png-specific, plus all required 64-bit includes
my_macros.asm ; generic macros: print, timing, etc
makeit.bat       ; make file
test.bat           ; generates test results
test_results.txt ; correct test results

The second zip has the two required 64-bit libs, which I got from WinInc, kernel32.lib and msvcrt.lib, because there are incompatible versions around. With these two zips you can compile png, all you need is any JWasm (>= 2.11) or asmc.
I am NaN ;)

jj2007

Now that explains why you have been so silent for a while ... compliments :t

Siekmanski

Cool!  8)

d:\RadASM2212\Masm\Projects\test>png
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 19.42
from 1 up to 2000000010, 98222287 primes

d:\RadASM2212\Masm\Projects\test>png 0 0 -t1
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 90.12
from 0 up to 2000000010, 98222287 primes

d:\RadASM2212\Masm\Projects\test>png 0 1000000000
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 11.27
from 0 up to 1000000000, 50847534 primes

d:\RadASM2212\Masm\Projects\test>png 1000000000 2000000
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 9.538
from 1000000000 up to 2000000000, 47374753 primes

d:\RadASM2212\Masm\Projects\test>png 777 2000000000
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 20.63
from 777 up to 2000000000, 98222150 primes

d:\RadASM2212\Masm\Projects\test>png 777777 2000000000
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 19.54
from 777777 up to 2000000000, 98159986 primes

d:\RadASM2212\Masm\Projects\test>png 1234567 123456789
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 1.082
from 1234567 up to 123456789, 6931900 primes

d:\RadASM2212\Masm\Projects\test>png 15485163 15485163
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 2.43e-005
from 15485163 up to 15485163, 0 primes

d:\RadASM2212\Masm\Projects\test>png 15485863
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 0.03574
from 15485863 up to 1, 1 primes

d:\RadASM2212\Masm\Projects\test>png 0 960960
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 0.032
from 0 up to 960960, 75681 primes
Creative coders use backward thinking techniques as a strategy.

rrr314159

Thanks siekmanski, 19.42 ms is pretty good ... I want an i7-4930K! About 6 times faster than my i5 ... :)

I notice a slight problem. If you want to know if a single number is prime u can use "png 103 103", and it will respond "from 103 up to 103, 1 primes". You can also use "png 103" but then it says "from 103 up to 1, 1 primes". A last-minute cosmetic fix didn't make it into the final post. If anybody cares I'll correct it after I see if any other details surface ...

@jj2007, thanks! ... but you're wrong, this project hasn't consumed so much time, that I can't discuss "10000 refugees" or whatever. It's just that I've said what I have to say, no point in beating it to death. Minds are changed (if at all) by events not words. If anyone wants to know my thoughts read previous posts; all those issues have been covered there.
I am NaN ;)

TWell

>png
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 56.58
from 1 up to 2000000010, 98222287 primes

>png 0 0 -t1
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 94.79
from 0 up to 2000000010, 98222287 primes

>png 0 1000000000
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 22.93
from 0 up to 1000000000, 50847534 primes

>png 1000000000 2000000000
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 28.48
from 1000000000 up to 2000000000, 47374753 primes

>png 777 2000000000
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 51.86
from 777 up to 2000000000, 98222150 primes

>png 777777 2000000000
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 51.43
from 777777 up to 2000000000, 98159986 primes

>png 1234567 123456789
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 3.101
from 1234567 up to 123456789, 6931900 primes

>png 15485163 15485163
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 7.983e-005
from 15485163 up to 15485163, 0 primes

>png 15485863
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 0.07936
from 15485863 up to 1, 1 primes

>png 0 960960
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 0.07161
from 0 up to 960960, 75681 primes

rrr314159

Thanks Tim,

appears to be working well on your AMD except those numbers are too fast. My AMD, later than yours, does well but not that well! They appear to be off by factor of 10. No doubt the problem is "QueryPerformanceFrequency". Someday I should fix this but my main concern is the algorithm, hate to waste time wrestling with Windows. Maybe that's why Chris Wallich uses ticks instead of rdtsc, ticks more compatible on different platforms; but they're not accurate enough for my purposes. Anyway you could tell by running "png -t1". Instead of 94.79 it may be 947.9, almost a second, and you could tell by the hesitation.

BTW if anyone with the code wants to fix the cosmetic problem I mentioned above, (since I don't think I'll be posting another version until / unless I improve something) here's how. On line 283 of png.asm change

indextoprime

to

indextoprime ips.Top_Index
I am NaN ;)

sinsi

A newer AMD with hopefully false timings :lol:

C:\Users\tester\Desktop>png
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 1.182e+009
from 1 up to 2000000010, 98222287 primes

C:\Users\tester\Desktop>png 0 0 -t1
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 2.516e+009
from 0 up to 2000000010, 98222287 primes

C:\Users\tester\Desktop>png 0 1000000000
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 6.057e+008
from 0 up to 1000000000, 50847534 primes

C:\Users\tester\Desktop>png 1000000000 2000000000
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 6.095e+008
from 1000000000 up to 2000000000, 47374753 primes

C:\Users\tester\Desktop>png 777 2000000000
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 1.1e+009
from 777 up to 2000000000, 98222150 primes

C:\Users\tester\Desktop>png 777777 2000000000
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 1.199e+009
from 777777 up to 2000000000, 98159986 primes

C:\Users\tester\Desktop>png 1234567 123456789
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 5.545e+007
from 1234567 up to 123456789, 6931900 primes

C:\Users\tester\Desktop>png 15485163 15485163
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 1350
from 15485163 up to 15485163, 0 primes

C:\Users\tester\Desktop>png 15485863
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 7.875e+005
from 15485863 up to 1, 1 primes

C:\Users\tester\Desktop>png 0 960960
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 7.514e+005
from 0 up to 960960, 75681 primes


rrr314159

Thanks Sinsi,

going by the off-by-powers-of-10 theory your 1.182e+009 means 118 ms; which definitely makes sense. And, all the other timings are consistent.

I should take more care with this but my attitude is, I'm not into production code but math algorithms. If there's anything wrong with the prime computations I care a lot, but if Windows timing insists on not-working for various versions, it's not a major concern. After all, the timing does work fine on my AMD A6 Windows 8.1, and all my older Intel computers. It's also good with every modern Intel tested here.

I'm hoping one of you guys - who know all these Windows-version details - will go ahead and fix it. I'm also hoping to win the lottery, without buying a ticket :)

My other plan is to go up to a billion billion. Then the timing for 100 billion (say) will be on the order of a minute, and the user can tell by his watch what factor it's off by. I could let him set that factor, save it in a .ini file.

Anyway, thanks again ...
I am NaN ;)

six_L

QuoteD:\masm32\user\internet1\1\primes\png>png
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 113.7
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t1
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 340.6
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png 0 1000000000
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 52.52
from 0 up to 1000000000, 50847534 primes

D:\masm32\user\internet1\1\primes\png>png 1000000000 2000000000
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 59.09
from 1000000000 up to 2000000000, 47374753 primes

D:\masm32\user\internet1\1\primes\png>png 777 2000000000
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 110.9
from 777 up to 2000000000, 98222150 primes

D:\masm32\user\internet1\1\primes\png>png 777777 2000000000
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 117.3
from 777777 up to 2000000000, 98159986 primes

D:\masm32\user\internet1\1\primes\png>png 1234567 123456789
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 4.636
from 1234567 up to 123456789, 6931900 primes

D:\masm32\user\internet1\1\primes\png>png 15485163 15485163
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 0.000297
from 15485163 up to 15485163, 0 primes

D:\masm32\user\internet1\1\primes\png>png 15485863
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 0.1506
from 15485863 up to 1, 1 primes

D:\masm32\user\internet1\1\primes\png>png 0 960960
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 0.1388
from 0 up to 960960, 75681 primes
Say you, Say me, Say the codes together for ever.

rrr314159

Thanks six_L,

Appears to work correctly, your machine about 20% faster than my i5, except for one thing. Apparently logical processors are no good with this math-intensive code which doesn't wait on memory or I/O very much. Normally two threads sharing one core do well because each is often waiting on I/O so the other can run without loss of performance, but not for prime number generator. So if you run "png -t4" restricting to four (real) cores, it will probably improve that 113.7 ms to about 95 milliseconds.

Same comment probably applies to siekmanski with 12 logical cores; probably get faster numbers restricting to 6 cores. I need to change it so that's done automatically.
I am NaN ;)

Siekmanski

Result with 6 cores,

d:\test>png -t6
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 15.37
from 1 up to 2000000010, 98222287 primes

d:\test>png
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 19.27
from 1 up to 2000000010, 98222287 primes
Creative coders use backward thinking techniques as a strategy.

rrr314159

Thanks siekmanksi,

as expected, hyper-threading not effective with CPU-intensive code. 15.37 is almost 6 times better than your 1-core time 90.12 - in fact threading is costing only 2.3%, which is very good. primesieve loses as much as 20% to multi-threading
I am NaN ;)

six_L

QuoteD:\masm32\user\internet1\1\primes\png>png
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 139.3
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t4
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 162.9
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t6
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 106.2
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t2
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 174.2
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t8
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 114.9
from 1 up to 2000000010, 98222287 primes
Say you, Say me, Say the codes together for ever.

rrr314159

thanks Six_L,

Don't understand why 6 threads is best? Your computer performs as modern i7 should with one thread: more than 20% better than my i5. But the multi-thread performance is not so improved, maybe 10%.

I wonder if this has anything to do with it? ...

Quote from: pcmag.comIt's [the i7 4770] also missing the new Transactional Synchronization Extensions (TSX), which is unfortunate. TSX is a new feature, introduced in other Haswell chips, that offers programmers a more efficient way to manage certain multi-threading performance problems. It's not a feature that we expect to make much difference in the short run, but long term, the capability could be vital to improving multi-core scaling.

Meanwhile I'm working on the full 64-bit version, pushing it up to a billion billion. Hope to get 100 billion down to 8 or better (since primesieve is at 8.86). With these larger numbers it will be possible to get better timing. In the 10-second range it doesn't bounce around like 100 ms timings do; less affected by transitory background processing. So hopefully if you try that when it's ready, should see more predictable behavior.

BTW if anyone cares, and wants any features added to the prime number generator, now's the time to mention them. One new addition is an "API" (you might call it that) allowing an asm programmer to use png to process prime numbers with his own routines; for instance, to generate Goldbach numbers, count k-tuplets, print different formats like 8 across the page, or whatever
I am NaN ;)

six_L

QuoteDon't understand why 6 threads is best?
i quited other applications,the result of tested is as following.
Quote
D:\masm32\user\internet1\1\primes\png>png
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 113.3
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t2
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 196
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t4
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 93.75
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t6
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 118.2
from 1 up to 2000000010, 98222287 primes

D:\masm32\user\internet1\1\primes\png>png -t8
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 113.9
from 1 up to 2000000010, 98222287 primes
Say you, Say me, Say the codes together for ever.