The MASM Forum
64 bit assembler => 64 Bit Assembler => Topic started by: rrr314159 on November 20, 2015, 02:17:50 PM

This 64bit version of the Prime Number Generator (png) is a bit faster than primesieve (current net record holder) under 2 billion primes, multithreaded. It does 2 billion in about 113 milliseconds vs. primesieve's approx 117. Primesieve (ps) uses ticks to count and gets 110 and 125 about equally, so estimate 117.5. These are the times on my 15 3330, using Chris Wallich's latest console version (Nov 8, 2015).
The margin is very slim, and ps beats png with one thread; my advantage is in the threading algorithm. Much more important, he goes up to a billion billion; so png is just a toy by comparison  call it a ProofofConcept. Since I've converted to 64bits I can go up to the same 2^64 without too much trouble. But png will be very slow up there until 2 more critical algo's are added (for medium and large primes). When all that is done (doubt I'll ever get to it) the 2 billion timing might suffer ... Bottom line, Chris Wallich has nothing to worry about yet!
png build size is about 4% that of ps; SLOC count, about 30% of ps. Assembler takes fewer lines than C++, due to macros; and of course far fewer bytes.
As usual this is all my own work, although primesieve has been invaluable for inspiration. About 75% of my algos are roughly the same  there are only so many ways to skin the cat  but a comparison will clearly show they're independently generated. in fact he has a couple of wrinkles I didn't think of, which I didn't steal; next time I will. OTOH he's missing a couple things also, particularly for threading.
To run it: "png" will do up to 2 billion by default. "png 12 67" does from 12 to 67. "png 12 67 p" prints the primes, "png t2" uses only two threads. BTW it takes at least 30 seconds to generate a 1gig file with 100 million primes (or so), using "png p > primes.txt".
Here are the results of some test cases:
C:\png1>test
C:\png1>png
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 112.9
from 1 up to 2000000010, 98222287 primes
C:\png1>png t1
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 416.3
from 0 up to 2000000010, 98222287 primes
C:\png1>png 0 1000000000
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 49.8
from 0 up to 1000000000, 50847534 primes
C:\png1>png 1000000000 2000000000
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 69.07
from 1000000000 up to 2000000000, 47374753 primes
C:\png1>png 777 2000000000
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 112.5
from 777 up to 2000000000, 98222150 primes
C:\png1>png 777777 2000000000
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 112.5
from 777777 up to 2000000000, 98159986 primes
C:\png1>png 1234567 123456789
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 5.379
from 1234567 up to 123456789, 6931900 primes
C:\png1>png 15485163
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 0.0001095
from 15485163 up to 15485163, 0 primes
C:\png1>png 15485863
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 0.1597
from 15485863 up to 15485863, 1 primes
C:\png1>png 0 960960
Intel(R) Core(TM) i53330 CPU @ 3.00GHz; Windows 8; 4 threads
prime time ms 0.1498
from 0 up to 960960, 75681 primes
The first zip includes:
png.exe
png.asm ; main program, does first segment
threads.asm ; thread algo
segments.asm ; segments above the first one
wheel.asm ; wheel algorithm (2 * 3 * 5 = 30)
utility.asm ; primecounting, sys info, command line
png_macros ; pngspecific utility macros
png.inc ; pngspecific, plus all required 64bit includes
my_macros.asm ; generic macros: print, timing, etc
makeit.bat ; make file
test.bat ; generates test results
test_results.txt ; correct test results
The second zip has the two required 64bit libs, which I got from WinInc, kernel32.lib and msvcrt.lib, because there are incompatible versions around. With these two zips you can compile png, all you need is any JWasm (>= 2.11) or asmc.

Now that explains why you have been so silent for a while ... compliments :t

Cool! 8)
d:\RadASM2212\Masm\Projects\test>png
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 19.42
from 1 up to 2000000010, 98222287 primes
d:\RadASM2212\Masm\Projects\test>png 0 0 t1
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 90.12
from 0 up to 2000000010, 98222287 primes
d:\RadASM2212\Masm\Projects\test>png 0 1000000000
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 11.27
from 0 up to 1000000000, 50847534 primes
d:\RadASM2212\Masm\Projects\test>png 1000000000 2000000
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 9.538
from 1000000000 up to 2000000000, 47374753 primes
d:\RadASM2212\Masm\Projects\test>png 777 2000000000
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 20.63
from 777 up to 2000000000, 98222150 primes
d:\RadASM2212\Masm\Projects\test>png 777777 2000000000
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 19.54
from 777777 up to 2000000000, 98159986 primes
d:\RadASM2212\Masm\Projects\test>png 1234567 123456789
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 1.082
from 1234567 up to 123456789, 6931900 primes
d:\RadASM2212\Masm\Projects\test>png 15485163 15485163
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 2.43e005
from 15485163 up to 15485163, 0 primes
d:\RadASM2212\Masm\Projects\test>png 15485863
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 0.03574
from 15485863 up to 1, 1 primes
d:\RadASM2212\Masm\Projects\test>png 0 960960
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 0.032
from 0 up to 960960, 75681 primes

Thanks siekmanski, 19.42 ms is pretty good ... I want an i74930K! About 6 times faster than my i5 ... :)
I notice a slight problem. If you want to know if a single number is prime u can use "png 103 103", and it will respond "from 103 up to 103, 1 primes". You can also use "png 103" but then it says "from 103 up to 1, 1 primes". A lastminute cosmetic fix didn't make it into the final post. If anybody cares I'll correct it after I see if any other details surface ...
@jj2007, thanks! ... but you're wrong, this project hasn't consumed so much time, that I can't discuss "10000 refugees" or whatever. It's just that I've said what I have to say, no point in beating it to death. Minds are changed (if at all) by events not words. If anyone wants to know my thoughts read previous posts; all those issues have been covered there.

>png
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 56.58
from 1 up to 2000000010, 98222287 primes
>png 0 0 t1
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 94.79
from 0 up to 2000000010, 98222287 primes
>png 0 1000000000
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 22.93
from 0 up to 1000000000, 50847534 primes
>png 1000000000 2000000000
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 28.48
from 1000000000 up to 2000000000, 47374753 primes
>png 777 2000000000
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 51.86
from 777 up to 2000000000, 98222150 primes
>png 777777 2000000000
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 51.43
from 777777 up to 2000000000, 98159986 primes
>png 1234567 123456789
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 3.101
from 1234567 up to 123456789, 6931900 primes
>png 15485163 15485163
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 7.983e005
from 15485163 up to 15485163, 0 primes
>png 15485863
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 0.07936
from 15485863 up to 1, 1 primes
>png 0 960960
AMD Athlon(tm) II X2 220 Processor; Windows 8; 2 threads
prime time ms 0.07161
from 0 up to 960960, 75681 primes

Thanks Tim,
appears to be working well on your AMD except those numbers are too fast. My AMD, later than yours, does well but not that well! They appear to be off by factor of 10. No doubt the problem is "QueryPerformanceFrequency". Someday I should fix this but my main concern is the algorithm, hate to waste time wrestling with Windows. Maybe that's why Chris Wallich uses ticks instead of rdtsc, ticks more compatible on different platforms; but they're not accurate enough for my purposes. Anyway you could tell by running "png t1". Instead of 94.79 it may be 947.9, almost a second, and you could tell by the hesitation.
BTW if anyone with the code wants to fix the cosmetic problem I mentioned above, (since I don't think I'll be posting another version until / unless I improve something) here's how. On line 283 of png.asm change
indextoprime
to
indextoprime ips.Top_Index

A newer AMD with hopefully false timings :lol:
C:\Users\tester\Desktop>png
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 1.182e+009
from 1 up to 2000000010, 98222287 primes
C:\Users\tester\Desktop>png 0 0 t1
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 2.516e+009
from 0 up to 2000000010, 98222287 primes
C:\Users\tester\Desktop>png 0 1000000000
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 6.057e+008
from 0 up to 1000000000, 50847534 primes
C:\Users\tester\Desktop>png 1000000000 2000000000
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 6.095e+008
from 1000000000 up to 2000000000, 47374753 primes
C:\Users\tester\Desktop>png 777 2000000000
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 1.1e+009
from 777 up to 2000000000, 98222150 primes
C:\Users\tester\Desktop>png 777777 2000000000
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 1.199e+009
from 777777 up to 2000000000, 98159986 primes
C:\Users\tester\Desktop>png 1234567 123456789
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 5.545e+007
from 1234567 up to 123456789, 6931900 primes
C:\Users\tester\Desktop>png 15485163 15485163
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 1350
from 15485163 up to 15485163, 0 primes
C:\Users\tester\Desktop>png 15485863
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 7.875e+005
from 15485863 up to 1, 1 primes
C:\Users\tester\Desktop>png 0 960960
AMD A107850K Radeon R7, 12 Compute Cores 4C+8G; Windows 7; 4 threads
prime time ms 7.514e+005
from 0 up to 960960, 75681 primes

Thanks Sinsi,
going by the offbypowersof10 theory your 1.182e+009 means 118 ms; which definitely makes sense. And, all the other timings are consistent.
I should take more care with this but my attitude is, I'm not into production code but math algorithms. If there's anything wrong with the prime computations I care a lot, but if Windows timing insists on notworking for various versions, it's not a major concern. After all, the timing does work fine on my AMD A6 Windows 8.1, and all my older Intel computers. It's also good with every modern Intel tested here.
I'm hoping one of you guys  who know all these Windowsversion details  will go ahead and fix it. I'm also hoping to win the lottery, without buying a ticket :)
My other plan is to go up to a billion billion. Then the timing for 100 billion (say) will be on the order of a minute, and the user can tell by his watch what factor it's off by. I could let him set that factor, save it in a .ini file.
Anyway, thanks again ...

D:\masm32\user\internet1\1\primes\png>png
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 113.7
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t1
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 340.6
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png 0 1000000000
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 52.52
from 0 up to 1000000000, 50847534 primes
D:\masm32\user\internet1\1\primes\png>png 1000000000 2000000000
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 59.09
from 1000000000 up to 2000000000, 47374753 primes
D:\masm32\user\internet1\1\primes\png>png 777 2000000000
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 110.9
from 777 up to 2000000000, 98222150 primes
D:\masm32\user\internet1\1\primes\png>png 777777 2000000000
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 117.3
from 777777 up to 2000000000, 98159986 primes
D:\masm32\user\internet1\1\primes\png>png 1234567 123456789
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 4.636
from 1234567 up to 123456789, 6931900 primes
D:\masm32\user\internet1\1\primes\png>png 15485163 15485163
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 0.000297
from 15485163 up to 15485163, 0 primes
D:\masm32\user\internet1\1\primes\png>png 15485863
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 0.1506
from 15485863 up to 1, 1 primes
D:\masm32\user\internet1\1\primes\png>png 0 960960
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 0.1388
from 0 up to 960960, 75681 primes

Thanks six_L,
Appears to work correctly, your machine about 20% faster than my i5, except for one thing. Apparently logical processors are no good with this mathintensive code which doesn't wait on memory or I/O very much. Normally two threads sharing one core do well because each is often waiting on I/O so the other can run without loss of performance, but not for prime number generator. So if you run "png t4" restricting to four (real) cores, it will probably improve that 113.7 ms to about 95 milliseconds.
Same comment probably applies to siekmanski with 12 logical cores; probably get faster numbers restricting to 6 cores. I need to change it so that's done automatically.

Result with 6 cores,
d:\test>png t6
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 15.37
from 1 up to 2000000010, 98222287 primes
d:\test>png
Intel(R) Core(TM) i74930K CPU @ 3.40GHz; Windows 8; 12 threads
prime time ms 19.27
from 1 up to 2000000010, 98222287 primes

Thanks siekmanksi,
as expected, hyperthreading not effective with CPUintensive code. 15.37 is almost 6 times better than your 1core time 90.12  in fact threading is costing only 2.3%, which is very good. primesieve loses as much as 20% to multithreading

D:\masm32\user\internet1\1\primes\png>png
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 139.3
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t4
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 162.9
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t6
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 106.2
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t2
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 174.2
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t8
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 114.9
from 1 up to 2000000010, 98222287 primes

thanks Six_L,
Don't understand why 6 threads is best? Your computer performs as modern i7 should with one thread: more than 20% better than my i5. But the multithread performance is not so improved, maybe 10%.
I wonder if this has anything to do with it? ...
It's [the i7 4770] also missing the new Transactional Synchronization Extensions (TSX), which is unfortunate. TSX is a new feature, introduced in other Haswell chips, that offers programmers a more efficient way to manage certain multithreading performance problems. It's not a feature that we expect to make much difference in the short run, but long term, the capability could be vital to improving multicore scaling.
Meanwhile I'm working on the full 64bit version, pushing it up to a billion billion. Hope to get 100 billion down to 8 or better (since primesieve is at 8.86). With these larger numbers it will be possible to get better timing. In the 10second range it doesn't bounce around like 100 ms timings do; less affected by transitory background processing. So hopefully if you try that when it's ready, should see more predictable behavior.
BTW if anyone cares, and wants any features added to the prime number generator, now's the time to mention them. One new addition is an "API" (you might call it that) allowing an asm programmer to use png to process prime numbers with his own routines; for instance, to generate Goldbach numbers, count ktuplets, print different formats like 8 across the page, or whatever

Don't understand why 6 threads is best?
i quited other applications,the result of tested is as following.
D:\masm32\user\internet1\1\primes\png>png
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 113.3
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t2
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 196
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t4
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 93.75
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t6
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 118.2
from 1 up to 2000000010, 98222287 primes
D:\masm32\user\internet1\1\primes\png>png t8
Intel(R) Core(TM) i74770 CPU @ 3.40GHz; Windows 8; 8 threads
prime time ms 113.9
from 1 up to 2000000010, 98222287 primes

Thanks! ... ok, now the numbers make sense.
As you see 4 threads is best. Compared to 1 thread you lose about 10% to thread management. 2 threads take twice as long as 4, of course; all this behavior is similar to my older i5, but faster. 6 and 8 threads do worse because, as mentioned b4, hyperthreading not good for CPUintensive code.
siekmanksi's computer is faster, but also has better thread efficiency than anyone else. The i74930K has TSX; presumably that's why.

BTW I did make the full 64bit version, using 64bit registers to go up to a few hundred billion, and including the necessary largenumber algos. Best I could do at higher numbers (primes up to 100 billion for instance) was 50% slower than Chris Wallich's. So I gave up on figuring it out myself, and studied his code. He uses various clever tricks to get tiny improvements (5% per trick or so). Every one requires quite a lot of code.
So I asked myself, "Self, do you want to spend months implementing these PITA algo's, just to speed up prime number generation by a few seconds?"  the answer came back, "No".
Anyway if anyone cares I could post this "full" 64bit version. It actually does the 2 billion even faster than previous, less than 110 ms (so much for the theory that 64bit is always slower than 32!). Plus a few other minor improvements.
Actually I hope you don't want it because it's been a long time since I looked at it and it'll be some trouble to remember how it works ... but it would give me a project to ease back into programming (took a few months off to smell the roses), and of course I'm always glad to be of use.