News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Prime numbers

Started by jj2007, August 12, 2015, 07:19:26 AM

Previous topic - Next topic

hutch--

I think I will stick with Monica Bellucci being an Italian national treasure. When I want prime numbers I look them up in a massive table I have.

jj2007

Quote from: rrr314159 on August 12, 2015, 01:22:58 PMMonica Bellucci is somebody!

Quote from: hutch-- on August 12, 2015, 04:15:55 PMI think I will stick with Monica Bellucci being an Italian national treasure.

Nice to see that two philosophers can agree on some body :icon_mrgreen:

Will look into the SIMD stuff when I find time. Maybe.

herge


Hello jj2007:
C:\masm32\bin\PrimesBitField>PrimesBitField.exe
Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz
It took 69.97 seconds to find 98222287 primes among 2000000000 numbers
2       3       5       7       11      13      17      19      23      29
31      37      41      43      47      53      59      61      67      71
73      79      83      89      97      101     103     107     109     113
127     131     137     139     149     151     157     163     167     173
179     181     191     193     197     199     211     223     227     229
233     239     241     251     257     263     269     271     277     281
283     293     307     311     313     317     331     337     347     349
353     359     367     373     379     383     389     397     401     409
419     421     431     433     439     443     449     457     461     463
467     479     487     491     499
...
1999999553      1999999559      1999999571      1999999609      1999999613
1999999621      1999999643      1999999649      1999999657      1999999747
1999999763      1999999777      1999999811      1999999817      1999999829
1999999853      1999999861      1999999871      1999999873      1999999913
1999999927      1999999943      1999999973


It took almost 70 seconds.

Regards Herge
Regards herge
Read "Slow Death by Rubber Duck"
for chemical Laughs.

Zen

I LOVE THREADS LIKE THIS !!!
...As long as we're NOT MAKING ANY SENSE AT ALL,...



...Oh,...and, HUTCH,...good point,...

jj2007

Quote from: Zen on August 13, 2015, 05:30:19 AM
I LOVE THREADS LIKE THIS !!!
...As long as we're NOT MAKING ANY SENSE AT ALL,...0

Shovels, dead bodies...?  :dazzled:

You are hijacking this thread! Until here, we were arguing about prime alive bodies 8)

rrr314159

Quote from: primesieve.orgprimesieve is a free software program and C/C++ library that generates primes using a highly optimized sieve of Eratosthenes implementation. It counts the primes below 10^10 in just 0.57 seconds on an Intel Core i7-4770 CPU (4 x 3.4GHz) from 2013.

- This is about as fast as they get, AFAIK. It's written in C++ and doesn't use SIMD, so must be beatable. compared to jj2007's, it's only about 20 times faster (taking into account multiple threads, etc) - with maybe 100 times as much code.
I am NaN ;)

jj2007

Quote from: rrr314159 on August 13, 2015, 06:30:45 AMcompared to jj2007's, it's only about 20 times faster

You made my day :bgrin:

rrr314159

Quote from: jj2007 on August 13, 2015, 06:55:54 AM
Quote from: rrr314159 on August 13, 2015, 06:30:45 AMcompared to jj2007's, it's only about 20 times faster

You made my day :bgrin:

Well it might be only factor of 10, hard to say. There are a lot of tricks to speed up priming, such as marking small primes all at once using bit-map of length 210 (leaving out details). If it makes your day better, I found my prime routine and timed it. It's about the same as yours except with threads
I am NaN ;)

jj2007

Hey, just kidding. I had also noticed primesieve & company, there seems to be a big community dedicated to chasing speed records, so we are just amateurs... if I had more time, I'd try the SIMD road, though.

rrr314159

#24
Quote from: jjwe are just amateurs

- Every pro started as an amateur .. ! If we had to figure out their tricks a defeatist attitude might be appropriate, but we don't, it's all public knowledge.

- BTW I've returned to my main thing - math programming - for the last couple months. I'm no longer learning assembler as such; it's a tool to do math with. You may have noticed my answers to questions have been vague and often along the lines of "d**ned if I know!". For 6 months I studied every issue that came up - goal was to get up to speed on assembler and I was interested in every detail. Now however I feel I've accomplished the objective (well enough).

- Therefore I may tackle prime numbers. To you it's a side issue (the main issue is assembler); to me, this sort of math is what it's all about.

- BTW I don't at the moment see much use for SIMD. Seems prime num algo does not lend itself to vector programming; unlike FFT's and graphics. Probably that's why primesieve author doesn't use it
I am NaN ;)

rrr314159

#25
Just for fun I made this Eratosthenes prime generator, does 2 billion in 10.5 seconds, on one thread. With 4 threads it gets under four seconds, but not 2.625, since threading is not very efficient. It has a few tricks (out of dozens one could put in):

- map only has odd numbers, similar to sieve of Sundaram
- uses "wheel" to mark 3 and 5
- starts priming at the square, skips even numbered multiples
- couple other even more minor items

But Chris Wallich's is 10 times faster. Apparently the reason is mainly cache management. With regular Eratosthenes (as I'm using) you mark one prime all the way through the map, then go back for the next one, etc. So the entire map must be moved to cache once for each prime (up to the square root); that's a few thousand times. The "Wallich technique" (AFAIK he invented it) takes just one piece of the map at a time, which fits into cache, and marks all the primes. Then it gets the next one: about 30,000 little segments. It's a lot more trouble, but this way the entire map is brought into cache only once. Apparently that speeds it up an order of magnitude! So I'd like to go ahead and try that, if I don't get bored. The other nice thing about his technique, it's more convenient to multi-thread. However, none of these approaches seems to use SIMD effectively (scatter / gather commands would help).

Anyway FWIW here it is. The main routine, primes.asm:

include primes.inc

SIZEOFMAP = 1000000000              ; this is highest index
MAX_NUMBER = SIZEOFMAP*2 + 1        ; highest odd number is 2*index + 1

indextoprime MACRO                  ; uses eax, converts any odd num not just prime
    shl eax, 1
    add eax, 1
ENDM

primetoindex MACRO                  ; reverse computation
    sub eax, 1
    shr eax, 1
ENDM

.data
    primesmap dd 0
    tenm3 real8 0.001
    timerResults real8 0.0
.code

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
primes:

    call GetSystemInformation
    InitCycles                      ; start timer

; allocate memory
    invoke GlobalAlloc, 0, SIZEOFMAP + 100h    ; buffer in case SIZEOFMAP not div by 4
    mov edi, eax                    ; edi => the primes map

; do the wheel "mask"
    CALL MasktheMap                 ; fill in "wheel" of 60 (30 times 2 for DWORDs)

    dec edi                         ; below here edi always adds 1, edi + 0 is never used,
                                    ; so just back it up 1 because ecx = 1 is index of 3
                                    ; thus edi + 1 is on a dword boundary
    mov primesmap, edi

; main work here
    call MarkPrimes                 ; sieve of eratosthenes
    call CountThem                  ; return count in ebx

; get time, print results
    movflt timerResults, @ElapsedCycles()
    movflt timerResults, @CyclestoTime(timerResults)
    fld timerResults
    fmul tenm3
    fstp timerResults
    pp "number of primes %u to %u\n", ebx, MAX_NUMBER
    pp "prime time seconds %.4g\n", timerResults


; finally, print out last few primes
    mov ecx, SIZEOFMAP
    mov edx, ecx
    sub edx, 100
    xor ebx, ebx
    printloop:
        cmp BYTE PTR [edi + ecx], 0
        je @F
            mov eax, ecx
            indextoprime
            pp "prime %u\n", eax
            inc ebx
        @@:
        dec ecx
        cmp ecx, edx
        jge printloop
    pp "number of primes %i above %i\n", ebx, edx

ret
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
MasktheMap:

    mov esi, edi
    mov ecx, 0                      ; threes/fives mask (nonzero ultimately means it's prime)
    @@:
        mov DWORD PTR [esi + ecx], 00010000h        ; this knocks out (2 and) 3 and 5, 60 bytes at a time
        mov DWORD PTR [esi + ecx + 4], 01000101h    ; don't forget they're backwards
        mov DWORD PTR [esi + ecx + 8], 00010001h

        mov DWORD PTR [esi + ecx + 12], 00010100h
        mov DWORD PTR [esi + ecx + 16], 01000100h
        mov DWORD PTR [esi + ecx + 20], 01010001h

        mov DWORD PTR [esi + ecx + 24], 00000100h
        mov DWORD PTR [esi + ecx + 28], 00000101h
        mov DWORD PTR [esi + ecx + 32], 01010001h

        mov DWORD PTR [esi + ecx + 36], 00010100h
        mov DWORD PTR [esi + ecx + 40], 01000001h
        mov DWORD PTR [esi + ecx + 44], 01000001h

        mov DWORD PTR [esi + ecx + 48], 00010100h
        mov DWORD PTR [esi + ecx + 52], 01000101h
        mov DWORD PTR [esi + ecx + 56], 01010000h

        add ecx, 60
        cmp ecx, SIZEOFMAP
        jle @B

    mov DWORD PTR [edi], 00010101h  ; the first word set 3, 5, 7 to 1
ret
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
MarkPrimes:

    xor esi, esi
    mov edi, primesmap
    mov ecx, 2                      ; below it incs, so it starts with 7

    beginnextprimeloop:
        inc ecx
        cmp BYTE PTR [edi + ecx], 1
        jne beginnextprimeloop
   
        ; found a prime
        mov eax, ecx                ; ecx is the prime's index
        mul ecx
        add eax, ecx                ; now eax has the index ^2 + index
        shl eax, 1                  ; index of the square = (x^2 + x) * 2
        xchg eax, ecx               ; eax again has prime's index
        cmp ecx, SIZEOFMAP          ; square has reached top ?
        jg donepriming
       
        indextoprime                ; double eax, add 1, to check only odd multiples
       
        @@:
            mov BYTE PTR [edi + ecx], 0
            add ecx, eax
            cmp ecx, SIZEOFMAP
            jle @B
       
        primetoindex                ; eax back to the prime's index (eg, for 11 it's 5)
        mov ecx, eax
    jmp beginnextprimeloop

donepriming:
ret
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
CountThem:

    mov ebx, 1                      ; for "2"
    mov esi, primesmap
    mov ecx, 1                      ; ecx = 1 is index of 3

    @@:
        mov edx, [esi + ecx]
        xor eax, eax
        add al, dl
        add al, dh
        bswap edx
        add al, dl
        add al, dh
        add ebx, eax
        add ecx, 4
        cmp ecx, SIZEOFMAP
        jl @B
ret
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end primes
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««


The zip includes:

primes.exe
primes.asm
primes.inc ; support routines like timer, and includes etc
makeit.bat ; batch make file
primesrun.txt ; runs for this routine and also jj2007's on my machine
I am NaN ;)

rrr314159

It works!

Put in a very rough implementation of cache segmentation; cut the time for 2 billion from 10.5 to 2 seconds (!) on one thread. This is NOT optimized, lots of obvious improvements (other than threads) can be made (for instance I've got variables in memory which could be in registers) so it can be a lot faster. That's within striking distance of Chris Wallich, maybe half the speed or so. Then there are more difficult improvements one can steal from his program (like better wheeling) - so it seems likely he can be beat - which is expected: assembler vs. C++ = no contest. Whether anyone feels like doing all that work is another matter! But I think I'll be within a factor of 2 without much trouble.

After a while I'll clean it up and post it. Just wanted to share the good news: this caching technique does, in fact, work :eusa_dance:
I am NaN ;)

hutch--

On an i7 quad, running it with 8 threads should move it along a fair bit faster.

jj2007

Compliments :t
Have no time right now to participate, but I'm watching with awe ;-)

rrr314159

#29
Here is a pretty simple, pretty fast version of the segmented sieve of Eratosthenes prime number algorithm - only single-thread.

This version does 2 billion in about 1.7 seconds. It's hard to say how it compares to Chris Wallich's "primesieve"; not much slower (ignoring threading). It beats his "Simple (single-core) Segmented sieve of Eratosthenes" by 25%. Of course it's much smaller build, 4K (about 350 lines of asm).

This is very simple 32-bit code, uses SSE, compiles under ml.exe or JWasm.

In primes.asm, you may need to set the SEGMENT_SIZE (on my machine it's 32768) and the MAX_INDEX which is half the top number, in this case, 1 billion. It requires a bit more ram than MAX_INDEX, in this case, a gig.

Here are sample runs:

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads (only 1 used)
prime time seconds 1.715
number of primes 98222287 to 2000000001
1st prime 2
2nd prime 3
3rd prime 5
4th prime 7
5th prime 11
6th prime 13
7th prime 17
8th prime 19
9th prime 23
10th prime 29
11th prime 31
12th prime 37
number of primes 12 to 39
prime 1999999973
prime 1999999943
prime 1999999927
prime 1999999913
prime 1999999873
prime 1999999871
prime 1999999861
prime 1999999853
prime 1999999829
prime 1999999817
prime 1999999811
number of primes 11 above 1999999801

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz; Windows 8; 4 threads (only 1 used)
prime time seconds 0.7966
number of primes 50847534 to 1000000001
1st prime 2
2nd prime 3
3rd prime 5
4th prime 7
5th prime 11
6th prime 13
7th prime 17
8th prime 19
9th prime 23
10th prime 29
11th prime 31
12th prime 37
number of primes 12 to 39
prime 999999937
prime 999999929
prime 999999893
prime 999999883
number of primes 4 above 999999801


The zip includes:

primes.exe
primes.asm      ; main routine
primes.inc      ;  includes and support routines: print, timer, sys-info
makeit.bat      ; "make" file

[edit] made another version about 15% faster, updated the .zip file, apologies to those who already downloaded it
I am NaN ;)