N-Queen Problem

aw27 · May 11, 2019, 08:15:09 PM

May be someone wants to produce a good ASM routine to outperform the C/C++ compiler on this problem. ::)


#include <stdio.h>
#include <windows.h>

//#define SHOWSOLUTIONS
#define N 15
int numsolutions;

_inline int isplaceok(int* a, int n, int c)
{
	for (int i = 0; i < n; i++)
	{
		if ((a[i] == c) || (a[i] - i == c - n) || (a[i] + i == c + n))
			return 0;
	}
	return 1;
}

_inline void printsolution(int* a)
{
	for (int i = 0; i <N; i++)
	{
		for (int j = 0; j < N; j++)
		{
			printf("%s ", a[i] == j ? "X" : "-");
		}
		printf("\n");
	}
	printf("\n");
}

_inline void addqueen(int* a, int n)
{
	if (n >= N)
#ifdef SHOWSOLUTIONS
		printsolution(a);
#endif
		numsolutions++;
	else
	{
		for (int c = 0; c < N;c++)
		{
			if (isplaceok(a, n, c))
			{
				a[n] = c;
				addqueen(a, n + 1);
			}
		}
	}
}

int main()
{
	int allrows[N] = { 0 };
	LARGE_INTEGER frequency;
	LARGE_INTEGER t1, t2;
	double elapsedTime;

	numsolutions = 0;
	QueryPerformanceFrequency(&frequency);
	QueryPerformanceCounter(&t1);
	addqueen(allrows, 0);
	QueryPerformanceCounter(&t2);
	elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
	printf("%d-Queens Problem - Number of solutions: %d Elapsed Time: %.0f milliseconds", N, numsolutions, elapsedTime);
	getchar();
	return 0;
}

Attached compiled .exe for 15-queens.
My result was:
15-Queens Problem - Number of solutions: 2279184 Elapsed Time: 19764 milliseconds

PS: If you want to test with more than 18 queens, a small change in the code is necessary. Of course, the calculation will take a few hours for 19 queens.

LiaoMi · May 11, 2019, 10:21:47 PM

Hi AW,

large calculations need to be done only on the GPU

Famous AI problem of N number of Queens solved in CUDA with brute-force for N=10 https://github.com/mamenyaka/CUDA-N-Queens
CUDA Toolkit 10.1 Download - https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal

NQueens Problem with CUDA https://www12.cs.fau.de/edu/map/ss13/talks/NQueens_Problem_with_CUDA.pdf

aw27 · May 11, 2019, 11:40:25 PM

Hi LiaoMi,

Let's see:

Code Select



#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <Windows.h>

#define QUEENS 10

__global__

void queen(int *db, const int n)
{
	{
		bool B[QUEENS];
		for (int i = 0; i < QUEENS; i++)
			B[i] = 0;

		B[threadIdx.x] = 1;
		B[threadIdx.y] = 1;
		B[threadIdx.z] = 1;
		B[blockIdx.x / 10] = 1;
		B[blockIdx.x % 10] = 1;
		B[blockIdx.y / 10] = 1;
		B[blockIdx.y % 10] = 1;
		B[blockIdx.z / 10] = 1;
		B[blockIdx.z % 10] = 1;
		B[n] = 1;

		for (int i = 0; i < QUEENS; i++)
			if (B[i] == 0)
				return;
	}
	char A[QUEENS];

	A[0] = threadIdx.x;
	A[1] = threadIdx.y;
	A[2] = threadIdx.z;
	A[3] = blockIdx.x / 10;
	A[4] = blockIdx.x % 10;
	A[5] = blockIdx.y / 10;
	A[6] = blockIdx.y % 10;
	A[7] = blockIdx.z / 10;
	A[8] = blockIdx.z % 10;
	A[9] = n;

	for (int i = 0; i < QUEENS - 1; i++)
		for (int j = i + 1; j < QUEENS; j++)
			if (abs(i - j) == abs(A[i] - A[j]))
				return;

	atomicAdd(db, 1);
}

int main()
{
	LARGE_INTEGER frequency;
	LARGE_INTEGER t1, t2;
	double elapsedTime;
	int h = 0, *d;

	cudaMalloc((void**)&d, sizeof(int));
	cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);

	dim3 blocksPerGrid(100, 100, 100);
	dim3 threadsPerBlock(10, 10, 10);
	QueryPerformanceFrequency(&frequency);
	QueryPerformanceCounter(&t1);
	for (int i = 0; i < QUEENS; i++)
		queen << <blocksPerGrid, threadsPerBlock >> > (d, i);
	QueryPerformanceCounter(&t2);
	elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
	cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
	cudaFree(d);
	cudaDeviceReset();

	cudaError_t error = cudaGetLastError();
	if (error != cudaSuccess)
	{
		fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(error));
		return -1;
	}

	//fprintf(stderr, "Solutions: %d\n", h);
	printf("CUDA %d-Queens Problem - Number of solutions: %d Elapsed Time: %.3f milliseconds", QUEENS, h, elapsedTime);

	fprintf(stderr, "\nDone\n");
	getchar();
	return 0;
}

CUDA 10-Queens Problem - Number of solutions: 724 Elapsed Time: 0.030 milliseconds

Without CUDA:
10-Queens Problem - Number of solutions: 724 Elapsed Time: 3.316 milliseconds

However, I am not including the time CUDA takes to start calculating. If I do it will be a few seconds.

I don't know how to generalize to other number of QUEENS - this only works for 10 QUEENS. :(

aw27 · May 12, 2019, 12:52:09 AM

All right, I found some generic solution here:
https://devtalk.nvidia.com/default/topic/397284/cuda-programming-and-performance/n-queen-solver-for-cuda/2

After massaging it a bit I got this:
CUDA 18-Queens Problem - Number of solutions: 666090624 Elapsed Time: 5052.286 milliseconds

CUDA 19-Queens Problem - Number of solutions: 673090552 Elapsed Time: 43697.321 milliseconds

CUDA 20-Queens Problem - Number of solutions: 374483220 Elapsed Time: 360850.035 milliseconds

It is really impressive. Six minutes for the 20 Queens problem versus 11 days mentioned in the picture above.
I know, the number of solutions is not correct for 19 and 20 queens, its is probably due to an overflow and wraparound somewhere in the code. For 18 queens it is correct and well as for values I tested below 18. FIXED

I may not spend more time with this but will follow if anyone wants to dig deeper. This is the code I downloaded and includes the massaging.

Code Select


// N-queen for CUDA
//
// Copyright(c) 2008 Ping-Che Chen


#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

#define QUEENS 20
#define THREAD_NUM		96


int bunk = 0;		// this is a dummy variable used for making sure clock() are not optimized out

/*
 * ----------------------------------------------------------------
 * This is a recursive version of n-queen backtracking solver.
 * A non-recursive version is used instead.
 * ----------------------------------------------------------------

long long solve_nqueen_internal(int n, unsigned int mask, unsigned int l_mask, unsigned int r_mask, unsigned int t_mask)
{
	if(mask == t_mask) {
		return 1;
	}

	unsigned int m = (mask | l_mask | r_mask);
	if((m & t_mask) == t_mask) {
		return 0;
	}

	long long total = 0;
	unsigned int index = (m + 1) & ~m;
	while((index & t_mask) != 0) {
		total += solve_nqueen_internal(mask | index, (l_mask | index) << 1, (r_mask | index) >> 1, t_mask);
		m |= index;
		index = (m + 1) & ~m;
	}

	return total;
}


long long solve_nqueen(int n)
{
	return solve_nqueen_internal(0, 0, 0, (1 << n) - 1);
}
*/

/* -------------------------------------------------------------------
 * This is a non-recursive version of n-queen backtracking solver.
 * This provides the basis for the CUDA version.
 * -------------------------------------------------------------------
 */

long long solve_nqueen(int n)
{
	unsigned int mask[32];
	unsigned int l_mask[32];
	unsigned int r_mask[32];
	unsigned int m[32];

	if (n <= 0 || n > 32) {
		return 0;
	}

	const unsigned int t_mask = (1 << n) - 1;
	long long total = 0;
	long long upper_total = 0;
	int i = 0, j;
	unsigned int index;

	mask[0] = 0;
	l_mask[0] = 0;
	r_mask[0] = 0;
	m[0] = 0;

	for (j = 0; j < (n + 1) / 2; j++) {
		index = (1 << j);
		m[0] |= index;

		mask[1] = index;
		l_mask[1] = index << 1;
		r_mask[1] = index >> 1;
		m[1] = (mask[1] | l_mask[1] | r_mask[1]);
		i = 1;

		if (n % 2 == 1 && j == (n + 1) / 2 - 1) {
			upper_total = total;
			total = 0;
		}

		while (i > 0) {
			if ((m[i] & t_mask) == t_mask) {
				i--;
			}
			else {
				index = ((m[i] + 1) ^ m[i]) & ~m[i];
				m[i] |= index;
				if ((index & t_mask) != 0) {
					if (i + 1 == n) {
						total++;
						i--;
					}
					else {
						mask[i + 1] = mask[i] | index;
						l_mask[i + 1] = (l_mask[i] | index) << 1;
						r_mask[i + 1] = (r_mask[i] | index) >> 1;
						m[i + 1] = (mask[i + 1] | l_mask[i + 1] | r_mask[i + 1]);
						i++;
					}
				}
				else {
					i--;
				}
			}
		}
	}

	bunk = 2;

	if (n % 2 == 0) {
		return total * 2;
	}
	else {
		return upper_total * 2 + total;
	}
}


/* -------------------------------------------------------------------
 * This is a non-recursive version of n-queen backtracking solver
 * with multi-thread support.
 * -------------------------------------------------------------------
 */

struct thread_context
{
	HANDLE thread;
	bool stop;

	long long total;
	int n;
	unsigned int mask;
	unsigned int l_mask;
	unsigned int r_mask;
	unsigned int t_mask;

	HANDLE ready;
	HANDLE complete;
};

DWORD WINAPI solve_nqueen_proc(LPVOID param)
{
	thread_context* ctx = (thread_context*)param;

	unsigned int mask[32];
	unsigned int l_mask[32];
	unsigned int r_mask[32];
	unsigned int m[32];
	unsigned int t_mask;
	long long total;
	unsigned int index;
	unsigned int mark;

	for (;;) {
		WaitForSingleObject(ctx->ready, INFINITE);
		if (ctx->stop) {
			break;
		}

		int i = 0;

		mask[0] = ctx->mask;
		l_mask[0] = ctx->l_mask;
		r_mask[0] = ctx->r_mask;
		m[0] = mask[0] | l_mask[0] | r_mask[0];
		total = 0;
		t_mask = ctx->t_mask;
		mark = ctx->n;

		while (i >= 0) {
			if ((m[i] & t_mask) == t_mask) {
				i--;
			}
			else {
				index = (m[i] + 1) & ~m[i];
				m[i] |= index;
				if ((index & t_mask) != 0) {
					if (i + 1 == mark) {
						total++;
						i--;
					}
					else {
						mask[i + 1] = mask[i] | index;
						l_mask[i + 1] = (l_mask[i] | index) << 1;
						r_mask[i + 1] = (r_mask[i] | index) >> 1;
						m[i + 1] = (mask[i + 1] | l_mask[i + 1] | r_mask[i + 1]);
						i++;
					}
				}
				else {
					i--;
				}
			}
		}

		ctx->total = total;

		SetEvent(ctx->complete);
	}

	return 0;
}

long long solve_nqueen_mcpu(int n)
{
	if (n <= 0 || n > 32) {
		return 0;
	}

	SYSTEM_INFO info;
	thread_context* threads;
	int num_threads;

	GetSystemInfo(&info);
	num_threads = info.dwNumberOfProcessors;
	if (num_threads == 1) {
		// only one cpu found, use single thread version
		return solve_nqueen(n);
	}

	threads = new thread_context[num_threads];
	int j;
	for (j = 0; j < num_threads; j++) {
		threads[j].stop = false;
		threads[j].ready = CreateEvent(0, FALSE, FALSE, 0);
		threads[j].complete = CreateEvent(0, FALSE, TRUE, 0);
		threads[j].thread = CreateThread(0, 0, solve_nqueen_proc, threads + j, 0, 0);
		threads[j].total = 0;
	}

	int thread_idx = 0;

	const unsigned int t_mask = (1 << n) - 1;
	long long total = 0;
	unsigned int index;

	unsigned int m_mask = 0;
	if (n % 2 == 1) {
		m_mask = 1 << ((n + 1) / 2 - 1);
	}

	for (j = 0; j < (n + 1) / 2; j++) {
		index = 1 << j;

		WaitForSingleObject(threads[thread_idx].complete, INFINITE);

		if (threads[thread_idx].mask != m_mask) {
			total += threads[thread_idx].total * 2;
		}
		else {
			total += threads[thread_idx].total;
		}

		threads[thread_idx].mask = index;
		threads[thread_idx].l_mask = index << 1;
		threads[thread_idx].r_mask = index >> 1;
		threads[thread_idx].t_mask = t_mask;
		threads[thread_idx].total = 0;
		threads[thread_idx].n = n - 1;

		SetEvent(threads[thread_idx].ready);

		thread_idx = (thread_idx + 1) % num_threads;
	}

	// collect all threads...
	HANDLE* events = new HANDLE[num_threads];
	for (j = 0; j < num_threads; j++) {
		events[j] = threads[j].complete;
	}
	WaitForMultipleObjects(num_threads, events, TRUE, INFINITE);
	for (j = 0; j < num_threads; j++) {
		if (threads[j].mask != m_mask) {
			total += threads[j].total * 2;
		}
		else {
			total += threads[j].total;
		}

		threads[j].stop = true;
		SetEvent(threads[j].ready);

		events[j] = threads[j].thread;
	}

	WaitForMultipleObjects(num_threads, events, TRUE, INFINITE);

	for (j = 0; j < num_threads; j++) {
		CloseHandle(threads[j].thread);
		CloseHandle(threads[j].ready);
		CloseHandle(threads[j].complete);
	}
	delete[] threads;
	delete[] events;

	bunk = 3;

	return total;
}



/* --------------------------------------------------------------------------
 * This is a non-recursive version of n-queen backtracking solver for CUDA.
 * It receives multiple initial conditions from a CPU iterator, and count
 * each conditions.
 * --------------------------------------------------------------------------
 */

__global__ void solve_nqueen_cuda_kernel(int n, int mark, unsigned int* total_masks, unsigned int* total_l_masks, unsigned int* total_r_masks, unsigned int* results, int total_conditions)
{
	const int tid = threadIdx.x;
	const int bid = blockIdx.x;
	const int idx = bid * blockDim.x + tid;

	__shared__ unsigned int mask[THREAD_NUM][10];
	__shared__ unsigned int l_mask[THREAD_NUM][10];
	__shared__ unsigned int r_mask[THREAD_NUM][10];
	__shared__ unsigned int m[THREAD_NUM][10];

	__shared__ unsigned int sum[THREAD_NUM];

	const unsigned int t_mask = (1 << n) - 1;
	int total = 0;
	int i = 0;
	unsigned int index;

	if (idx < total_conditions) {
		mask[tid][i] = total_masks[idx];
		l_mask[tid][i] = total_l_masks[idx];
		r_mask[tid][i] = total_r_masks[idx];
		m[tid][i] = mask[tid][i] | l_mask[tid][i] | r_mask[tid][i];

		while (i >= 0) {
			if ((m[tid][i] & t_mask) == t_mask) {
				i--;
			}
			else {
				index = (m[tid][i] + 1) & ~m[tid][i];
				m[tid][i] |= index;
				if ((index & t_mask) != 0) {
					if (i + 1 == mark) {
						total++;
						i--;
					}
					else {
						mask[tid][i + 1] = mask[tid][i] | index;
						l_mask[tid][i + 1] = (l_mask[tid][i] | index) << 1;
						r_mask[tid][i + 1] = (r_mask[tid][i] | index) >> 1;
						m[tid][i + 1] = (mask[tid][i + 1] | l_mask[tid][i + 1] | r_mask[tid][i + 1]);
						i++;
					}
				}
				else {
					i--;
				}
			}
		}

		sum[tid] = total;
	}
	else {
		sum[tid] = 0;
	}

	__syncthreads();

	// reduction
	if (tid < 64 && tid + 64 < THREAD_NUM) { sum[tid] += sum[tid + 64]; } __syncthreads();
	if (tid < 32) { sum[tid] += sum[tid + 32]; } __syncthreads();
	if (tid < 16) { sum[tid] += sum[tid + 16]; } __syncthreads();
	if (tid < 8) { sum[tid] += sum[tid + 8]; } __syncthreads();
	if (tid < 4) { sum[tid] += sum[tid + 4]; } __syncthreads();
	if (tid < 2) { sum[tid] += sum[tid + 2]; } __syncthreads();
	if (tid < 1) { sum[tid] += sum[tid + 1]; } __syncthreads();

	if (tid == 0) {
		results[bid] = sum[0];
	}
}


long long solve_nqueen_cuda(int n, int steps)
{
	// generating start conditions
	unsigned int mask[32];
	unsigned int l_mask[32];
	unsigned int r_mask[32];
	unsigned int m[32];
	unsigned int index;

	if (n <= 0 || n > 32) {
		return 0;
	}

	unsigned int* total_masks = new unsigned int[steps];
	unsigned int* total_l_masks = new unsigned int[steps];
	unsigned int* total_r_masks = new unsigned int[steps];
	unsigned int* results = new unsigned int[steps];

	unsigned int* masks_cuda;
	unsigned int* l_masks_cuda;
	unsigned int* r_masks_cuda;
	unsigned int* results_cuda;

	cudaMalloc((void**)&masks_cuda, sizeof(size_t) * steps);
	cudaMalloc((void**)&l_masks_cuda, sizeof(size_t) * steps);
	cudaMalloc((void**)&r_masks_cuda, sizeof(size_t) * steps);
	cudaMalloc((void**)&results_cuda, sizeof(size_t) * steps / THREAD_NUM);

	const unsigned int t_mask = (1 << n) - 1;
	const unsigned int mark = n > 11 ? n - 10 : 2;
	long long total = 0;
	int total_conditions = 0;
	int i = 0, j;

	mask[0] = 0;
	l_mask[0] = 0;
	r_mask[0] = 0;
	m[0] = 0;

	bool computed = false;

	for (j = 0; j < n / 2; j++) {
		index = (1 << j);
		m[0] |= index;

		mask[1] = index;
		l_mask[1] = index << 1;
		r_mask[1] = index >> 1;
		m[1] = (mask[1] | l_mask[1] | r_mask[1]);
		i = 1;

		while (i > 0) {
			if ((m[i] & t_mask) == t_mask) {
				i--;
			}
			else {
				index = (m[i] + 1) & ~m[i];
				m[i] |= index;
				if ((index & t_mask) != 0) {
					mask[i + 1] = mask[i] | index;
					l_mask[i + 1] = (l_mask[i] | index) << 1;
					r_mask[i + 1] = (r_mask[i] | index) >> 1;
					m[i + 1] = (mask[i + 1] | l_mask[i + 1] | r_mask[i + 1]);
					i++;
					if (i == mark) {
						total_masks[total_conditions] = mask[i];
						total_l_masks[total_conditions] = l_mask[i];
						total_r_masks[total_conditions] = r_mask[i];
						total_conditions++;
						if (total_conditions == steps) {
							if (computed) {
								cudaMemcpy(results, results_cuda, sizeof(size_t) * steps / THREAD_NUM, cudaMemcpyDeviceToHost);

								for (int j = 0; j < steps / THREAD_NUM; j++) {
									total += results[j];
								}

								computed = false;
							}

							// start computation
							cudaMemcpy(masks_cuda, total_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);
							cudaMemcpy(l_masks_cuda, total_l_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);
							cudaMemcpy(r_masks_cuda, total_r_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);

							solve_nqueen_cuda_kernel << <steps / THREAD_NUM, THREAD_NUM >> > (n, n - mark, masks_cuda, l_masks_cuda, r_masks_cuda, results_cuda, total_conditions);

							computed = true;

							total_conditions = 0;
						}
						i--;
					}
				}
				else {
					i--;
				}
			}
		}
	}


	if (computed) {
		cudaMemcpy(results, results_cuda, sizeof(size_t) * steps / THREAD_NUM, cudaMemcpyDeviceToHost);

		for (int j = 0; j < steps / THREAD_NUM; j++) {
			total += results[j];
		}

		computed = false;
	}

	cudaMemcpy(masks_cuda, total_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);
	cudaMemcpy(l_masks_cuda, total_l_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);
	cudaMemcpy(r_masks_cuda, total_r_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);

	solve_nqueen_cuda_kernel << <steps / THREAD_NUM, THREAD_NUM >> > (n, n - mark, masks_cuda, l_masks_cuda, r_masks_cuda, results_cuda, total_conditions);

	cudaMemcpy(results, results_cuda, sizeof(size_t) * steps / THREAD_NUM, cudaMemcpyDeviceToHost);

	for (int j = 0; j < steps / THREAD_NUM; j++) {
		total += results[j];
	}

	total *= 2;

	if (n % 2 == 1) {
		computed = false;
		total_conditions = 0;

		index = (1 << (n - 1) / 2);
		m[0] |= index;

		mask[1] = index;
		l_mask[1] = index << 1;
		r_mask[1] = index >> 1;
		m[1] = (mask[1] | l_mask[1] | r_mask[1]);
		i = 1;

		while (i > 0) {
			if ((m[i] & t_mask) == t_mask) {
				i--;
			}
			else {
				index = (m[i] + 1) & ~m[i];
				m[i] |= index;
				if ((index & t_mask) != 0) {
					mask[i + 1] = mask[i] | index;
					l_mask[i + 1] = (l_mask[i] | index) << 1;
					r_mask[i + 1] = (r_mask[i] | index) >> 1;
					m[i + 1] = (mask[i + 1] | l_mask[i + 1] | r_mask[i + 1]);
					i++;
					if (i == mark) {
						total_masks[total_conditions] = mask[i];
						total_l_masks[total_conditions] = l_mask[i];
						total_r_masks[total_conditions] = r_mask[i];
						total_conditions++;
						if (total_conditions == steps) {
							if (computed) {
								cudaMemcpy(results, results_cuda, sizeof(size_t) * steps / THREAD_NUM, cudaMemcpyDeviceToHost);

								for (int j = 0; j < steps / THREAD_NUM; j++) {
									total += results[j];
								}

								computed = false;
							}

							// start computation
							cudaMemcpy(masks_cuda, total_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);
							cudaMemcpy(l_masks_cuda, total_l_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);
							cudaMemcpy(r_masks_cuda, total_r_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);

							solve_nqueen_cuda_kernel << <steps / THREAD_NUM, THREAD_NUM >> > (n, n - mark, masks_cuda, l_masks_cuda, r_masks_cuda, results_cuda, total_conditions);

							computed = true;

							total_conditions = 0;
						}
						i--;
					}
				}
				else {
					i--;
				}
			}
		}

		if (computed) {
			cudaMemcpy(results, results_cuda, sizeof(size_t) * steps / THREAD_NUM, cudaMemcpyDeviceToHost);

			for (int j = 0; j < steps / THREAD_NUM; j++) {
				total += results[j];
			}

			computed = false;
		}

		cudaMemcpy(masks_cuda, total_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);
		cudaMemcpy(l_masks_cuda, total_l_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);
		cudaMemcpy(r_masks_cuda, total_r_masks, sizeof(size_t) * total_conditions, cudaMemcpyHostToDevice);

		solve_nqueen_cuda_kernel << <steps / THREAD_NUM, THREAD_NUM >> > (n, n - mark, masks_cuda, l_masks_cuda, r_masks_cuda, results_cuda, total_conditions);

		cudaMemcpy(results, results_cuda, sizeof(size_t) * steps / THREAD_NUM, cudaMemcpyDeviceToHost);

		for (int j = 0; j < steps / THREAD_NUM; j++) {
			total += results[j];
		}
	}

	cudaFree(masks_cuda);
	cudaFree(l_masks_cuda);
	cudaFree(r_masks_cuda);
	cudaFree(results_cuda);

	delete[] total_masks;
	delete[] total_l_masks;
	delete[] total_r_masks;
	delete[] results;

	bunk = 1;

	return total;
}


bool InitCUDA()
{
	int count;

	cudaGetDeviceCount(&count);
	if (count == 0) {
		fprintf(stderr, "There is no device.\n");
		return false;
	}

	int i;
	for (i = 0; i < count; i++) {
		cudaDeviceProp prop;
		if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
			if (prop.major >= 1) {
				break;
			}
		}
	}

	if (i == count) {
		fprintf(stderr, "There is no device supporting CUDA 1.x.\n");
		return false;
	}

	cudaSetDevice(i);

	return true;
}


int main(int argc, char** argv)
{
	int n = 18;
	//clock_t start, end;
	LARGE_INTEGER frequency;
	LARGE_INTEGER t1, t2;
	double elapsedTime;
	long long solution;
	bool cpu = true, gpu = true;
	int argstart = 1, steps = 24576;

	/*
	if (argc >= 2 && argv[1][0] == '-') {
		if (argv[1][1] == 'c' || argv[1][1] == 'C') {
			gpu = false;
		}
		else if (argv[1][1] == 'g' || argv[1][1] == 'G') {
			cpu = false;
		}

		argstart = 2;
	}

	if (argc < argstart + 1) {
		printf("Usage: %s [-c|-g] n steps\n", argv[0]);
		printf("  -c: CPU only\n");
		printf("  -g: GPU only\n");
		printf("  n: n-queen\n");
		printf("  steps: step for GPU\n");
		printf("Default to 8 queen\n");
	}
	else {
		n = atoi(argv[argstart]);
		if (n <= 1 || n > 32) {
			printf("Invalid n, n should be > 1 and <= 32\n");
			printf("Note: n > 18 will require a very very long time to compute!\n");
			return 0;
		}

		if (argc >= argstart + 2) {
			steps = atoi(argv[argstart + 1]);
			if (steps <= THREAD_NUM || steps % THREAD_NUM != 0) {
				printf("Invalid step, step should be multiple of %d\n", THREAD_NUM);
				return 0;
			}
		}
	}
	*/
	if (gpu) {
		if (!InitCUDA()) {
			return 0;
		}

		printf("CUDA initialized.\n");
	}
	/*
	if (cpu) {
		start = clock();
		solution = solve_nqueen_mcpu(n);
		//solution = solve_nqueen(n);
		end = clock();

		printf("CPU: %d queen = %lld  time = %d\n", n, solution, end - start);
	}
	*/
	if (gpu) {
		//start = clock();
		QueryPerformanceFrequency(&frequency);
		QueryPerformanceCounter(&t1);
		solution = solve_nqueen_cuda(QUEENS, steps);
		//end = clock();
		QueryPerformanceCounter(&t2);
		elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;

		//printf("GPU: %d queen = %lld  time = %d\n", n, solution, end - start);
		printf("CUDA %d-Queens Problem - Number of solutions: %lld Elapsed Time: %.3f milliseconds", QUEENS, solution, elapsedTime);
	}
	getchar();
	return 0;
}

Edit:
The code was fixed, now it reports correctly the number of solutions:

CUDA 20-Queens Problem - Number of solutions: 39029188884 Elapsed Time: 363900.660 milliseconds

Over 39 billion solutions in 6 minutes :t

aw27 · May 12, 2019, 02:11:13 AM

Attaching .exe.
You can select the number of queens on the command line. Default is 12
For example for 18 queens:
Cuda2 18

You need a Nvidia graphics card. I built for XP from Visual Studio but don't know about NVidia SDK restrictions on supported OS.

LiaoMi · May 12, 2019, 03:43:23 AM

Quote from: AW on May 12, 2019, 02:11:13 AM
Attaching .exe.
You can select the number of queens on the command line. Default is 12
For example for 18 queens:
Cuda2 18

You need a Nvidia graphics card. I built for XP from Visual Studio but don't know about NVidia SDK restrictions on supported OS.

I thought on my video card NVIDIA Quadro P4000 8Gb the calculations should be faster ::) anyway, the results are impressive!!!

Code Select

Note: n >= 20 will require a very long time to compute!
Number of queens set to 20

CUDA initialized. Calculating number of solutions
CUDA 20-Queens Problem - Number of solutions: 39029188884 Elapsed Time: 444264.628 milliseconds

aw27 · May 12, 2019, 06:34:04 AM

Quote from: LiaoMi on May 12, 2019, 03:43:23 AM
I thought on my video card NVIDIA Quadro P4000 8Gb the calculations should be faster ::) anyway, the results are impressive!!!

You have a better card, I have a GeForce GTX 1060 6GB mid-range game card.
The explanation for the differences may be elsewhere.
https://technical.city/en/video/Quadro-P4000-vs-GeForce-GTX-1060

daydreamer · May 12, 2019, 09:16:49 AM

days???,what happened to measure things in clock cycles???
maybe with 500 polys todays highend nvidia will result in 500 clock cycles?just curious how fast

aw27 · May 16, 2019, 04:43:59 PM

This is the 10 queens problems solved in MASM. There are no significant changes from the template used in this exercise, so it was easy to produce.

The CUDA kernel was taken from https://github.com/mamenyaka/CUDA-N-Queens/blob/master/queen.cu and compiled to PTX.

People can test and if it works say "yea" if it does not say "nay". :lol:

At the moment, I am not seeing ways to produce CUDA programs in ASM without resorting to PTX. This is a bit restrictive and there are aspects like managed memory (same memory for CPU and GPU) that I don't know how to deal with using PTX.

TimoVJL · May 16, 2019, 07:42:36 PM

cicc is a compiler for .ptx

Code Select

cicc.exe -arch compute_20 Add2.ci -o Add2.ptx

nvcc option --dryrun show it's usage

Code Select

nvcc Add2.cu -ptx --keep --dryrun --use-local-env --cl-version 2013 -o Add2.ptx

Siekmanski · May 16, 2019, 07:44:33 PM

Hi AW,

It doesn't work on my graphics card.
Just some thinking outside of the box;

What if you use pixelshader code to do the calculations.
- Write the input data to texture(s) memory.
- Let the pixelshader do the calculations.
- Write the results back to CPU memory.

Don't know if this is as fast as a CUDA program or even a logical approach?

LiaoMi · May 16, 2019, 07:52:51 PM

Quote from: AW on May 16, 2019, 04:43:59 PM
This is the 10 queens problems solved in MASM. There are no significant changes from the template used in this exercise, so it was easy to produce.

The CUDA kernel was taken from https://github.com/mamenyaka/CUDA-N-Queens/blob/master/queen.cu and compiled to PTX.

People can test and if it works say "yea" if it does not say "nay". :lol:

At the moment, I am not seeing ways to produce CUDA programs in ASM without resorting to PTX. This is a bit restrictive and there are aspects like managed memory (same memory for CPU and GPU) that I don't know how to deal with using PTX.

Hi AW,

works like a charm!

QuoteAt the moment, I am not seeing ways to produce CUDA programs in ASM without resorting to PTX

There are not many options ..
1.PTX (PTX quasi-assembly language) intrinsics (Macro intrinsics) - need to be written
2.SIMD or Asm to PTX Translator - need to be written
3.Use ready-made functionality

NVIDIA CUDA Math API
The reference guide for the CUDA Math API.
cuFFT
The API reference guide for cuFFT, the CUDA Fast Fourier Transform library.
cuBLAS
The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows the user to access the computational resources of NVIDIA Graphical Processing Unit (GPU), but does not auto-parallelize across multiple GPUs.
NVBLAS
The NVBLAS library is a multi-GPUs accelerated drop-in BLAS (Basic Linear Algebra Subprograms) built on top of the NVIDIA cuBLAS Library.
nvJPEG
The nvJPEG Library provides high-performance GPU accelerated JPEG decoding functionality for image formats commonly used in deep learning and hyperscale multimedia applications
NPP
NVIDIA NPP is a library of functions for performing CUDA accelerated processing. The initial set of functionality in the library focuses on imaging and video processing and is widely applicable for developers in these areas. NPP will evolve over time to encompass more of the compute heavy tasks in a variety of problem domains. The NPP library is written to maximize flexibility, while maintaining high performance.
NVRTC (Runtime Compilation)
NVRTC is a runtime compilation library for CUDA C++. It accepts CUDA C++ source code in character string form and creates handles that can be used to obtain the PTX. The PTX string generated by NVRTC can be loaded by cuModuleLoadData and cuModuleLoadDataEx, and linked with other modules by cuLinkAddData of the CUDA Driver API. This facility can often provide optimizations and performance not possible in a purely offline static compilation.
cuSOLVER
The API reference guide for cuSolver, the CUDA sparse matix library.

+ etc

4. Use standard compiler
5. Combine manual PTX writing and automatic translator CudaPAD https://www.codeproject.com/Articles/999744/CudaPAD

QuoteThis is a bit restrictive and there are aspects like managed memory

One of the options to find examples on C++ :icon_exclaim:

aw27 · May 16, 2019, 09:05:37 PM

Quote from: TimoVJL on May 16, 2019, 07:42:36 PM
nvcc option --dryrun show it's usage

I have been using:
nvcc -ptx --machine 32 kernel.cu
or
nvcc -ptx --machine 64 kernel.cu

Quote from: Siekmanski on May 16, 2019, 07:44:33 PM
What if you use pixelshader code to do the calculations.

I feel sorry, I am not in a position to answer that question (the reason is igorance :().
CUDA is not only about number crunching there are stunning graphics samples. However, concepts like pixel shaders are more related to specialized Graphics Engines like DirectX. CUDA can also play together with DirectX, there are samples for that.

@LiaoMi
Thank you LiaoMi. CUDA is a real Universe to explore.
BTW, OpenCL appears half dead - the NVidia samples I have seen are for VS 2008 and VS 2010.

aw27 · November 13, 2019, 12:00:45 AM

I had the opportunity to solve the 20-queens problem using a Tesla V100. The time is about 1/3 of what I got previously. But a Tesla costs a fortune.

Yeah, GPU-Z is broken, does not report that CUDA is available.

hutch-- · November 13, 2019, 12:22:08 AM

My earlier version of GPUZ (1.16.0) reports on my now aging NVIDIA GeForce GTX 960 that OpenCL, CUDA, Physx and DirectCompute 5.0 are present.

The MASM Forum

News:

N-Queen Problem

aw27

LiaoMi

aw27

aw27

aw27

LiaoMi

aw27

daydreamer

aw27

TimoVJL

Siekmanski

LiaoMi

aw27

aw27

hutch--