News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Prime Sieve again - stumbling through quite a bit of memory

Started by flaschenpost, August 27, 2013, 03:58:06 AM

Previous topic - Next topic

flaschenpost

Hi,

I have discussed it a bit in the end of another longer thread, but now I tried to give the full sieve.

I am a real, real beginner in asm and probably doing a lot of silly stuff in the attempt of writing out data.

Since I use jwasm with linux, I can't use the jj2007 macros, I'm afraid.

But that is not my main point. My main point is: how could I improve the speed of memory-walkthrough like that on Sieve of Erastothenes.

jj2007 mentioned that shifting is slow, and with a few Tests I could see that it costs some time. So I was lucky to find bts, which can set a bit in memory without the c-stylisch mem[i/64] |= 1UL << i%64.

But still I have to add to my bit-position a number and then look if I got over 64-bit, in order to jump i/64 "lines" forward.

I tried several things, like "decomposing" the stepsize into Lines and Bits, like precalculating the number of steps, thus invoking a division - but that just made things slower.

I do not need that stuff, it is just for fun, so no pressure. Just curious how fast it could get.

SSSE3 I do have, but no further - and AVX does not yet have shift operations, and no "set bit" in the xmm- registers.

So: Have Fun!

Marco

Gunther

You have to know the facts before you can distort them.

flaschenpost

Hi Gunther,

Quote from: Gunther on August 27, 2013, 05:19:00 AM

did you check that link?


thanks for the nice tip!

I haven't seen that link before, but quite a bit of it fits with other ressources and with my own experiments/thoughts. It is cute how this small problem leeds to similar kind of solutions! ;-)

There also is a faster Sieve of Atkins, so really this is just pure fun: get the fastest instructions to "cross out" every p-th Bit in a Gig of RAM, and do it for a lot of increasing primes p.

I also tried to be more "memory-local" and I got rid of those tables mentioned in page 6 by some calculations, but at least my implementations were slower than the plain straight solution.

The first few primes could fill a small number of 64-bit-values (210 in the 2-3-5-7-Case) and I will definitely add that precalculated Pattern in the end.

I do think that storing the property of being prime or not prime with one bit per number is quite efficient, and that other data structures have much more overhead. Normally crossing out a bit in RAM is seen as very, very fast. Up to 2^29 it takes just 20 seconds in Assembler and 30s in optimized GCC.

The usual sieve challenge is getting out of the RAM-size and calculating the property blockwise.

Here I want to search for the fastest combination of instructions to achieve setting every p-th bit for a big number of increasing values of p.

The kernel loop inner_loop in the following code. r9 holds the current prime number itself, r10 the current byte position, rbx the current bit position, rcx the loop counter. rbx may have a 64-bit-"Overflow", I mean it will get greater than 63. This I have to check and then I need to add the value of 8*(rbx/64) to my Byte-Pointer r10.

The nice thing about x86/64-Bit-Assembler is that everything fits into registers, everything but the write-to-RAM itself.


    ; how far should we go?
    mov rcx, endmem
    sub rcx, r10
    jc inner_break
    jz inner_break

    sub rcx,1
    ; run (or walk) through the Megabits!
    inner_loop:
        ; set bit
        bts [r10], bx

        add rbx, r9
        mov rax, rbx
        shr rax, 3
        and rax, -8
        and rbx,63
        add r10, rax
        sub rcx, rax
        jc inner_break

    jmp inner_loop



Good night  -  Marco.

KeepingRealBusy

Marco,

There are several things you can do.

1. Use an odd bit sieve -  bits represent only the odd numbers like 1, 3, 5, 7 ...  This gives you twice the numbers for the same memory block.

2. Set all bits, then reset all multiples of the prime numbers. Since 2 is not present, you save the time of resetting half of all of the bits.

3. Scan for a set bit, double and add 1 for the prime number, Square that prime for the first bit to reset, divide by 2 (shift right 1) to get the reset index, then reset the bit and add the prime number for the next bit to reset.

To calculate the last prime number to test for, take the bit number of the first bit outside of the block of bits you are working with, double +1 and take the square root (this only has to be done once - when the size of the bit block is defined).

When the end of the block is detected while resetting, save the bit index and the current block number you are working on in a table indexed by the bit index of the prime number. Get the next prime and repeat. When the last prime has been processed for this block, allocate another block, set all bits, then re-process with the saved table of bit index and block, possibly extending the saved table values.

I did this in 32 bit mode, with about 3 3/4 GB of memory (14 blocks of 256 MB), after 14 blocks were processed, repeated with 13 new blocks (same blocks, re0initialized) 5 times giving 66 blocks of prime bits.  The prime bit scan did not have to go outside of the first block for the entire process.


The Total Count of the Primes in all of the blocks is 16,028,906,205.
The current time is: 11:23:40.00
The current time is:  9:17:42.04
                     -----------
                     02:05:57.96


Dave.

KeepingRealBusy

Marco,

Let me correct one part of that explanation.

"When the end of the block is detected while resetting, save the bit index and the current block number you are working on in a table indexed by the bit index of the prime number. "

The index for the save table is just an incrementing offset in the table. In 32 bit mode with just 66 blocks to process, I only had to save 1 BYTE and 1 DWORD (the block number, and the bit index), thus 5 BYTES per entry. The next iteration will repeat the same old prime numbers and add some new ones, i.e. 3,5,7,11,13 ...  Overlay the table entries when you get to the end of the block, and keep on trucking through the table, then add new entries at the end of the table, then repeat for the next iteration.

Dave.

Farabi

My, people has switched to 64 bits. I better wait until Im Up with another system. Or stick with my android PCTablets. Still unsure they had a 1 Ghz processors. But on some few tests, yes it did. It was the OOP thingy and bad programming concept that making everything is slow. Or sabotage.
http://farabidatacenter.url.ph/MySoftware/
My 3D Game Engine Demo.

Contact me at Whatsapp: 6283818314165

flaschenpost

Hi Dave,

my Code still uses only odd numbers, it doesn't need to set all bits because "1" means "is composite" and in my code the marking as "is composite" starts with p*p, so if p*p > max I break the outer loop.

My Code goes up to 2^29 in 20 Seconds.

I tried to comment my asm; if something is unclear, please ask.

Quote from: KeepingRealBusy on August 29, 2013, 09:13:47 AM
Marco,

There are several things you can do.

1. Use an odd bit sieve -  bits represent only the odd numbers like 1, 3, 5, 7 ...  This gives you twice the numbers for the same memory block.

2. Set all bits, then reset all multiples of the prime numbers. Since 2 is not present, you save the time of resetting half of all of the bits.

3. Scan for a set bit, double and add 1 for the prime number, Square that prime for the first bit to reset, divide by 2 (shift right 1) to get the reset index, then reset the bit and add the prime number for the next bit to reset.

The Total Count of the Primes in all of the blocks is 16,028,906,205.
The current time is: 11:23:40.00
The current time is:  9:17:42.04
                     -----------
                     02:05:57.96


Dave.

flaschenpost

Quote from: Farabi on August 31, 2013, 12:18:10 PM
My, people has switched to 64 bits.

Well, the extra registers *are* tempting, aren't they? Still you could try beating a greenhorn like me with even better 32-Bit-Assembler, since I didn't even use a single SSE3 instruction. ;-)

If you want I can try to write the algorithm in a clearer way.

Rockoon

Quote from: flaschenpost on August 27, 2013, 03:58:06 AM
jj2007 mentioned that shifting is slow, and with a few Tests I could see that it costs some time. So I was lucky to find bts, which can set a bit in memory without the c-stylisch mem[i/64] |= 1UL << i%64.

But still I have to add to my bit-position a number and then look if I got over 64-bit, in order to jump i/64 "lines" forward.

BTS [base], index

aka

BTS [reg32], reg32

works on a memory base pointer, with up to 32-bits of displacement index in 32-bit mode (so thats 512BM of memory, might treat the index as signed so care must be taken)

In 64-bit mode it could well be able to access all available memory using a single base, but you might want to double-check that.

Its quite a bit slower in latency than the register equivalent BTS reg32, reg32 according to agner fog, but you arent waiting for the result anyways.  basically, I see no reason to be recalculating index/64 .. the base can remain constant for literally billions of delta.

K_F

Do a search for 'Prime Number Circle' (Peter Plitcha)
This method seems to be simpler, and maybe faster.
  8)
'Sire, Sire!... the peasants are Revolting !!!'
'Yes, they are.. aren't they....'

Gunther

Hi K_F,

Quote from: K_F on November 02, 2013, 04:44:54 PM
Do a search for 'Prime Number Circle' (Peter Plitcha)
This method seems to be simpler, and maybe faster.
  8)

his name is Peter Plichta and it seems to me that he's a weirdo.

Gunther
You have to know the facts before you can distort them.

dedndave

Quote from: Gunther on November 02, 2013, 11:57:33 PM
his nama is Peter Plichta and it seems to me that he's a weirdo.

i'm a little surprised to hear you say that, Gunther   :biggrin:

it sounds to me like a case of people not listening to someone who's found something new
not the first time people laughed at the smart guy

as for the prime number thing - not particularly fond of the prime circle
but it's an interesting pattern
what i get from it - possibly a base-6 counting system to do the sieve   :t
all the primes end in 1 or 5 (base-6)
so, once you get started....
add 2, prime test, add 4, prime test, add 2, prime test, add 4.....
conversely, you have eliminated 2/3 of all the integers
so - each bit in the stored data might represent only the base-6 1's and 5's
then - use ling long kai fang to access the info and do the base conversion

dedndave


Gunther

Dave,

Quote from: dedndave on November 03, 2013, 01:06:43 AM
i'm a little surprised to hear you say that, Gunther   :biggrin:

it sounds to me like a case of people not listening to someone who's found something new
not the first time people laughed at the smart guy

it's not surprising. No doubt the man is a good chemist. But I've read his books (only available in German) and I think that the world around us is not so easy build like Dr. Plichta might think. Furthermore, he's vain and taken from him. I've observed that in a panel discussion with him. So, I think there are good reasons for my point of view.

Gunther
You have to know the facts before you can distort them.

dedndave

well - i saw a video on youtube
didn't watch the whole thing because i don't speak German
but - he was talking about building a saucer-shaped spaceship - lol
he may have some unique ideas on fuel
so - let's allow him to build it - and see how it goes   :biggrin: