The MASM Forum
General => The Workshop => Topic started by: zedd151 on July 30, 2018, 05:14:55 PM
-
Invigorated by working with the fibonacci generator, I decided to code a program that generates prime numbers using a 'sieve' method.
This code is not fast, and was not really designed to be, just demonstrating the basic principles.
; --------------------------------------------------------------------------------
; Prime Sieve example. Fill array with '1'. For each multiple of a given
; number, zero it out - to exclude (as not prime as it is divisible by other
; than '1' and itself). We exclude the first number. I.e., for '2' we don't
; zero out the '2', only multiples of it. Then 3 and so on...
; If a number is already zeroed, we know it is not prime, and can remove it
; from further consideration, and increment the counter to the next value.
; We continue all the way to n = array size, for a given sized array.
; This will be dependent on the available memory on any given computer.
; --------------------------------------------------------------------------------
include \masm32\include\masm32rt.inc
.data
pTbl dd 0 ; <-- pointer to byte array
arrsize dd 6000h ; <-- size of byte array - adjust depending on available memory; time is also a consideration.
curnumber dd 0 ; <-- current number for our sieve algo
.code
start:
; --------------------------------------------------------------------------------
invoke GlobalAlloc, GPTR, arrsize ; <-- allocate memory for table
cmp eax, 0 ; <-- if not enough memory for array size, exit
jz noarray
mov pTbl, eax
; --------------------------------------------------------------------------------
mov esi, pTbl ; <-- initialize the table to '1'
xor eax, eax
mov cl, 1
@@:
mov byte ptr [esi+eax], cl
inc eax
cmp eax, arrsize ; <-- we are done when array is filled with '1'
jl @b
; --------------------------------------------------------------------------------
xor eax, eax
mov curnumber, 2 ; <-- exluding '0' and '1', as neither are prime
add eax, curnumber ; <-- exclude the first iteration of sequence
; <-- else we will have no primes listed
@@:
add eax, curnumber ; <-- get multiples for this iteration
cmp eax, arrsize ; <-- check if end of array is reached, exit
jae done1 ; loop if yes.
cmp byte ptr [esi+eax], 0 ; <-- check if array location is already marked
jz @b
mov byte ptr [esi+eax], 0 ; <-- if not marked, we mark it now
jmp @b
; --------------------------------------------------------------------------------
done1:
inc curnumber ; <-- increment current number
mov eax, curnumber ; <-- move it into register
cmp eax, arrsize ; <-- check if at array end
jae done2 ; <-- jump to next process, if array end
xor eax, eax ; else increment to next number
add eax, curnumber
jmp @b ; <-- jump back into the processing loop
; --------------------------------------------------------------------------------
; Here is where we process the data into readable strings, skipping '0' and '1'
; --------------------------------------------------------------------------------
done2:
mov edi, 2 ; <--- skip over '0' and '1'
@@:
cmp byte ptr [esi+edi], 1 ; <--- if byte == zero, not prime, don't print
jnz noprint
mov eax, ustr$(edi) ; <--- convert our current prime to ascii decimal string
pushad ; <--- save all the registers
print eax, 0Dh, 0Ah, 0
popad ; <--- restore the registers
noprint:
inc edi
cmp edi, arrsize
jl @b ; <--- loop back to get more primes to print
invoke GlobalFree, pTbl ; <--- free the memory used
noarray:
invoke ExitProcess, 0
end start
edit = add checking if memory allocated
-
Slight modification, now using 'rep stosd' to initialize array.
; --------------------------------------------------------------------------------
push edi
mov eax, 01010101h
mov ecx, arrsize
shr ecx, 2
mov edi, pTbl ; temporarily put ptr to array in edi, for stosb
rep stosd ; <-- initialize the table to '1'
mov esi, pTbl ; <-- we put ptr to array back into esi :)
pop edi ; <-- restore edi to previous value
; --------------------------------------------------------------------------------
:P
-
If you fill 4tb with primes,do you achieve new world record?
-
If you fill 4tb with primes,do you achieve new world record?
No. 8)
But that surely would take an awful lot of time, using my primitive methods. :P
-
This code is not fast, and was not really designed to be, ...
You may want to have a glance at the following code to possibly find some tricks to improve your algo for speed. A sieve up to 1,000,000 gets completed in less than 2 ms with this algo on my laptop. Any question and/or clarification is welcome.
; Prime sieve
; Each byte with a value of 1 in the final sieve will represent a non-prime.
; Bytes with a value of 0 will indicate a prime.
; #########################################################################
.686 ; minimum processor needed for 32 bit
.model flat, stdcall ; FLAT memory model & STDCALL calling
option casemap :none ; set code to case sensitive
; #########################################################################
include \masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
.data?
hmem dd ?
MAX EQU 1000000 ;at your discretion
.code
start:
invoke GlobalAlloc,0,MAX
mov hmem,eax
mov edi,eax
; Prepare allocated memory
mov eax,101h
stosd ;marks 0 and 1 as non-primes
;2 and 3 become indicated as primes
mov ecx,MAX/4-1 ;remaining number of dwords
mov eax,10001h
rep stosd ;all remaining even numbers in the allocated memory
;get marked as non-primes, while all odd numbers
;are left as 'potential' primes
; Define and mark all other non-primes with a 1
mov edi,hmem
mov ebx,2 ;EBX will keep track of the latest prime
; All multiples of 2 have already been marked during memory initialization
; Find the next prime to be processed
nextprime:
inc ebx
cmp byte ptr[edi+ebx],0 ;is it a prime?
jnz nextprime ;continue looking if not prime
mov eax,ebx
mul eax
cmp eax,MAX
ja done ;if the square of that prime exceeds the upper limit,
;all multiples of that prime would already have been
;marked as non-primes (processing of larger primes
;would become useless)
@@:
mov byte ptr[edi+eax],1 ;mark it as non-prime
lea eax,[eax+ebx*2] ;this will skip all even multiples of
;that prime (which have already been marked
;when initializing the allocated memory)
cmp eax,MAX
jbe @B
jmp nextprime
done:
;At this point, the sieve up to the specified MAX is now completed.
;You can save it as such for future use,
;or extract all primes into another table as binary numbers,
;or convert them to ascii for immediate display or saving in a separate file,
;etc.
invoke GlobalFree,hmem
invoke ExitProcess,0
end start
-
Hi, Raymond! Thanks for the tips, I really should have thought of marking all even numbers as non-prime automatically. ::)
This was a little coding exercise for me, as I have been out of the coding loop (no pun intended 8) ) for some time.
Any tips on Fibanacci?? :biggrin: I reused some of jimg's code; that, for all intents and purposes, works well.
Thanks again, for the algo tips.
Later -->> I just looked at yor code again, rofl. Your table is exactly the inverse of mine.. :redface:
I ran it in ollydbg, and looked at the data, I said to myself, 'there can't be that many prime numbers!! Then I realized the inversion.
I mark primes with '1', non primes with '0'. you say 'po-taay-to', I say 'po-taah-to'. :P
-
That was interesting raymond
I wonder if use a reciprocal table of numbers and mulps thru them to test evenly divided?
-
Many years ago, I wrote a sieve for up to 10^8. I then extracted all the primes into another file. From there, I got a count of primes for whichever limit I chose. Here were my results with 10^x limits. You can observe the relative percentage of integers being prime decreasing regularly with increasing limits.
Limit # primes
10 4
100 25
1000 168
10000 1229
100000 9592
1000000 78498
10000000 664579
100000000 5761455
-
Many years ago, I wrote a sieve for up to 10^8. I then extracted all the primes into another file. From there, I got a count of primes for whichever limit I chose. Here were my results with 10^x limits. You can observe the relative percentage of integers being prime decreasing regularly with increasing limits.
Limit # primes
10 4
100 25
1000 168
10000 1229
100000 9592
1000000 78498
10000000 664579
100000000 5761455
thats interesting, a tv documentary about numbers had different themes every episode,one episode he met the man who achieved discover second biggest prime number in the World,but he left a computer on for 20 years crunching numbers,what about creating a workerthread that works in the background while you use your newer multicore computer?
-
Using Rep Stosd to initialize array.
Modified my prime sieve to use registers for the main loop compares.
Also added error checking, to see if memory is allocated for array,
display error message and exit otherwise.
; zedd151's prime sieve example - 32 bit code
; assemble with console assemble and link.
; batch file included is to create txt file with prime numbers up to desired value.
include \masm32\include\masm32rt.inc
.data
pTbl dd 0
arrsize dd 1000000 ; adjust according to your needs
.code
start:
invoke GlobalAlloc, GPTR, arrsize
cmp eax, 0
jz noarray
mov pTbl, eax
mov edi, pTbl
mov eax, 01000100h ; mark all non even numbers as potential prime
mov ecx, arrsize
shr ecx, 2
rep stosd
mov esi, pTbl
mov byte ptr [esi+2], 1
xor eax, eax
mov ecx, 3
add eax, ecx
mov edx, arrsize
align 4
@@:
add eax, ecx
cmp eax, edx
jae done1
cmp byte ptr [esi+eax], 0
jz @b
mov byte ptr [esi+eax], 0
jmp @b
done1:
inc ecx
mov eax, ecx
cmp eax, edx
jae done2
jmp @b
done2:
mov edi, 2
mov ebx, arrsize
align 4
@@:
cmp byte ptr [esi+edi], 1 ; skip printing non primes :)
jnz noprint
mov eax, ustr$(edi)
pushad
print eax, 0Dh, 0Ah, 0
popad
noprint:
inc edi
cmp edi, ebx
jl @b
; inkey ; uncomment to pause command prompt when finished
invoke GlobalFree, pTbl
jmp outt
noarray:
fn MessageBoxA, 0, "Array size too big for available memory", "ERROR", 0
outt:
invoke ExitProcess, 0
end start
Attached!
-
This prime sieve of mine uses a byte array for marking non primes. Once the array of a given size is created, all the bytes that have NOT been marked (the offset of unmarked bytes) represents valid prime numbers up to the size of the array.
Is it possible (I'm sure it is), to use a bit array instead? Or will the code be too cumbersome? My left field idea is that the bit array would be 1/32 the size of the byte array and can be more easily stored for a large number of primes. As I really have NO clue how to even implement this, just wondering if anyone has some insights on what I propose to (try to) do. If the code would be too cumbersome, or time consuming (at runtime) it would defeat the purpose of trying to use 'smaller data' to achieve the same results. Any thoughts on this proposal?
edited for clarity.
-
advice is post a timings in laboratory and we help you test it on many different cpus
so you get clock cycles count for both solutions
-
so you get clock cycles count for both solutions
If I ever get that far, I shirley would. But it's (bit array) just in my imagination at the moment. With the bit masking and conversation from byte etc., I'm sure it would be a much slower algo than the original.
-
so you get clock cycles count for both solutions
If I ever get that far, I shirley would. But it's (bit array) just in my imagination at the moment. With the bit masking and conversation from byte etc., I'm sure it would be a much slower algo than the original.
Better go 64bit with bigger byte array, you can allocate 64GB instead of 3.5GB with largaware flag set in linker
-
But it's [a] (bit array) just in my imagination at the moment.
I know it's just a fantasy at this point, but I like the idea of using bits instead of byte. So if you have, what? 3 values per #, that means you get 4 times the storage (you need 2 bits/entry). And I'm pretty sure you can get that going pretty fast with some clever coding.
-
Bit manipulation is a pain and its slow. Do yourself a favour and use normal data sizes for any speed related tasks. BYTE if you are space limited, DWORD is memory is not a problem as its faster on native x86 hardware.
-
Bit manipulation is a pain and its slow.
I kind of thought so. So basically not worth the trouble, especially since I wouldnt even know where to begin. :tongue:
... you get 4 times the storage
lol yes. For some odd reason I was thinking 32 x the storage. Would be true if replacing a dword array, but not a byte. :eusa_naughty:
hutch already told me what I had suspected (Thats it's too much trouble and too slow for the small gains).
Anyway, thats what I get for seriously thinking about something while still foggy headed (I just woke up this morning when I started thinking about this) The original works well enough.
-
So Hutch: Whoa, not so fast. Of course you may be right that bit manipulation would be too slow, though you can't argue that it wouldn't reduce storage space. But maybe it could work.
Just speculating here, but how about if you stored your # flags in an array of 2-bit entries (could be in bytes, words, whatever). You could access them thus:
; Masks for 4 #s, for 0, 1 & 2:
$num1MaskZ EQU 00000011B
$num2MaskZ EQU 00001100B
$num3MaskZ EQU 00110000B
$num4MaskZ EQU 11000000B
$num1Mask1 EQU 00000001B
$num2Mask1 EQU 00000100B
$num3Mask1 EQU 00010000B
$num4Mask1 EQU 01000000B
; etc.
MOV ESI, BigAssNumberBuffer
loop1: LODSB ;Get 4 more #s.
MOV AH, AL ;Make 3 copies.
MOV CL, AL
MOV CH, AL
; Mask off each #, jump somewhere if it's zero:
TEST AL, $num1MaskZ
JZ xxx
TEST AH, $num2MaskZ
JZ yyy
TEST CL, $num3maskZ
JZ zzz
TEST CH, $num4maskZ
JZ aaa
; Test for the other values:
TEST AL, $num1Mask1
JZ bbb
TEST AH, $num2Mask1
JZ ccc
TEST CL, $num3mask1
JZ ddd
TEST CH, $num4mask1
JZ eee
; etc.
You'd just have to keep track of which 2 bits you're looking at. (In case anyone misses it, what I'm doing here is packing 4 sets of flags (2 bits) into a byte, and getting 4 at a time and looking at them.)
Oh, and one more thing: don't store any even #s in the array! Total waste of space, since we know they're not prime. Just skip from odd to odd. (You'd have to make a special case for 2, but that's trivial.)
OK, so whaddya think? Somebody shoot this idea down ...
-
Hi David,
While I would not wish to hinder anyone's creativity, its a pragmatic response to most modern hardware. In the very early days, memory was so scarce that you did anything to minimise its usage. On a 32 bit machine, you could usually manage 1 gigabyte of memory and on a 64 bit OS machine, much more than that.
Working in BYTE data is straight forward enough and in 32 bit code, working in DWORD data is easy and generally faster and in most cases, the memory usage is no big deal for almost any ordinary machine so while bit manipulation can be made to work OK, it is accessed via masking and a few direct mnemonics and its slow and complicated to work with.
The current uses are generally legacy code in the OS but the compression folks have a direct use for it so it can be done but you need good reason to bother. My comment is if you can perform the task in BYTE or DWORD, you save yourself a lot of hassle and it is generally faster.
-
So Hutch: Whoa, not so fast. Of course you may be right that bit manipulation would be too slow, though you can't argue that it wouldn't reduce storage space. But maybe it could work.
Just speculating here, but how about if you stored your # flags in an array of 2-bit entries (could be in bytes, words, whatever). You could access them
If your goal is 1k,4k,64k demo,or you enjoy size reduction I have 2bit vector code somewhere in 16bit forum( my previous goal was 1k demo)
,3.5 GB allocated *4 is very big primes table
-
beat this if you can
http://masm32.com/board/index.php?topic=7938.msg103017#msg103017 (http://masm32.com/board/index.php?topic=7938.msg103017#msg103017)
is based on HLL students SIMT task
generate 100M random numbers
test if primes
IF prime add primes together
print sum of primes and time
try to get fastest time
-
beat this if you can
I'll have to pass on that daydreamer. :cool:
-
beat this if you can
test if primes
IF prime add primes together
print sum of primes and time
try to get fastest time
I get around ~94 seconds for 100M.
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <iostream>
extern "C" unsigned long long ASM_Masm_Test32(unsigned int* count, unsigned int num);
int main()
{
unsigned int p_c = 0;
_mm_mfence(); _mm_lfence(); unsigned long long t1 = GetTickCount64(); //do not use rdtsc because c++ re-orders it and AMD doesn't have the crystal frequency numbers.
unsigned long long rand_total = ASM_Masm_Test32(&p_c,1000000);
_mm_lfence(); t1 = GetTickCount64() - t1;
std::cout << "Sum of primes of 100M random 32-bit numbers = " << rand_total << ". Count = " << p_c << ". Time = " << t1 << " ms.\r\n";
Sleep(120000);
return(0);
}
MASM
.data
.code
ASM_Masm_Test32 proc ;RCX = prime count. RDX = rand count.
push rbx
push rdi
push rsi ;get some more registers
push r12
push r13
mov eax,10
disable_page_guard:
sub rsp, 4096
mov qword ptr [rsp], 1 ;Windows "page guard" is asinine
dec eax
jnz disable_page_guard
mov r12d,edx
mov r13, rcx
;----generate 3000 fast% numbers---- (x*m)*d>>64 ==> x*m <= m-1. 8+ 8bytes
mov r8d, 3*12
mov rax,6148914691236517206
mov rcx,2635249153387078803
mov rdx,2049638230412172402
mov dword ptr [rsp],3
mov qword ptr [rsp+4],rax
mov dword ptr [rsp+12],7
mov qword ptr [rsp+16],rcx
mov dword ptr [rsp+24],9
mov qword ptr [rsp+28],rdx
mov ecx,9 ;start at 11. pre-add 3,7,9
mov ebx, 3000-3
magic_make: ;------------
add ecx,2
mov rax, 1844674407370955162 ;%10
mov edx,10
imul rax,rcx
mulx rdx,rdx,rax
cmp edx,9
je pass_10
cmp edx,7
je pass_10
cmp edx,3
je pass_10
cmp edx,1
je pass_10
jmp magic_make
pass_10:
mov rax,4841369599423283200 ;2^52 in double format
vmovq xmm1,rax
vmovd xmm0,ecx
vpor xmm0,xmm0,xmm1
vsubsd xmm0,xmm0,xmm1
vsqrtsd xmm0, xmm0, xmm0
vroundsd xmm0,xmm0,xmm0,11b
vaddpd xmm0,xmm0,xmm1
vpxor xmm0,xmm0,xmm1
vmovd esi,xmm0
mov edi, 1
prime_check: ;------------
add edi,2
xor edx,edx
mov eax,ecx
div edi ;edx:eax => edx rem
test edx,edx
jz magic_make
cmp edi,esi
jle prime_check
is_prime:
xor edx,edx
mov rax, 0ffffffffffffffffH
div rcx ;rdx:rax => rdx rem
inc rax
mov dword ptr [rsp+r8],ecx
mov qword ptr [rsp+r8+4],rax
add r8d,12
dec ebx
jnz magic_make
;----find primes with magic---- 33211^2... 33217
xor r8d,r8d ;sum
xor esi,esi
outer_mod:
rdrand ecx ;not even worth using xorshift
mov rax, 1844674407370955162 ;%10
mov edx,10
imul rax,rcx
mulx rdx,rdx,rax
cmp edx,9
je pass_10_2
cmp edx,7
je pass_10_2
cmp edx,3
je pass_10_2
cmp edx,1
je pass_10_2
jmp next_num
pass_10_2:
mov rax,4841369599423283200 ;2^52 in double format
vmovq xmm1,rax
vmovd xmm0,ecx
vpor xmm0,xmm0,xmm1
vsubsd xmm0,xmm0,xmm1
vsqrtsd xmm0, xmm0, xmm0
vroundsd xmm0,xmm0,xmm0,11b
vaddpd xmm0,xmm0,xmm1
vpxor xmm0,xmm0,xmm1
vmovd r9d,xmm0
xor edi,edi
test_mod:
mov ebx, dword ptr [rsp+rdi]
cmp r9d,ebx
jl done_test
mov rax, qword ptr [rsp+rdi+4]
mov rdx,rcx
imul rdx,rax
inc rdx ;(m*x)%2^64 +1 <= m
cmp rax,rdx ;rdx<=rax
jnc next_num ;jle not working. weird. jge not working either. possible damage from thermalright frame to cpu socket area.
add rdi, 12
cmp edi, 3000*12
jl test_mod
slow_test:
mov r10d, 33215 ;start at +2
div_test:
add r10d,2
cmp r9d,r10d
jl done_test
xor edx,edx
mov eax,ecx
div r10d ;edx:eax => edx rem
test edx,edx
jz next_num
jmp div_test
done_test:
add r8,rcx ;summation
inc esi
next_num:
dec r12d
jnz outer_mod
vzeroupper
mov dword ptr [r13], esi
add rsp, 10*4096
pop r13
pop r12
pop rsi
pop rdi
pop rbx
mov rax,r8
ret
ASM_Masm_Test32 endp
end
-
Hi InfiniteLoop,
The original goal of this thread was to show a simple way of generating a list of prime numbers up to n.
While I appreciate your contribution, it was never my intention to have a faster or even a fast algorithm. Just a simple algorithm using 32 bit code.
Others may find your algorithm useful though. Have you checked that the results are all primes? I cannot currently test your algo as I don't have a 64 bit OS set up on this computer right now.
-
I just clicked continue in debug and copied ecx from the breakpoint into "is prime" google. I kept getting "yes".
32-bit is far faster than 64-bit. It if was 64-bit randoms it would still be running...
-
I just clicked continue in debug and copied ecx from the breakpoint into "is prime" google. I kept getting "yes".
Sweet! Makes me want to install Windows 7 64 to check it out.
-
@InfiniteLoop
Pardon asking this stupid question, but am I right supposing that your asm code needs some very advanced CPU?
I've created a simple VS project and added both modules +fine tuned compiling using ml64 from VS but debug build is getting me this exception: https://prnt.sc/4I9QQ2tRWv1o
Note: I guess the C++ module needs this include:
#include <immintrin.h>
so I'd like to ask you what are the minimal CPU requirements to run your code...
-
InfiniteLoop,
Something you need to learn is the difference between 32 bit STDCALL and 64 bit FASTCALL. Manually pushing and popping registers is extremely unreliable as it risks messing up the stack alignment. 64 bit FASTCALL writes arguments in reverse order to stack addresses and also write the first 4 arguments to shadow space.
-
@InfiniteLoop
Pardon asking this stupid question, but am I right supposing that your asm code needs some very advanced CPU?
I've created a simple VS project and added both modules +fine tuned compiling using ml64 from VS but debug build is getting me this exception: https://prnt.sc/4I9QQ2tRWv1o
Note: I guess the C++ module needs this include:
#include <immintrin.h>
so I'd like to ask you what are the minimal CPU requirements to run your code...
Its just AVX2 or "skylake" code. I need AVX to get the square root. mulx is needed for the fast remainder method I found on Daniel Lemire's blog.
-
Thanks for confirmation
would you recommend some replacement in order to run it on older CPU set? eg just AVX