This is an OpenCL GPGPU application that computes the sum of the floats from 0 to 63. As we know from high school, there is a formula attributed to Gauss for that, n(n+1)/2, so the expected result is 63*64/2 = 2016.0
- The code is closely based on this Dr. Dobb's article (http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854?pgno=3). However, the kernel source is embedded in the resources, not loaded from disk as in the Dr. Dobb's example.
- It was tested successfully on Intel, AMD and NVIDIA GPUs.
- It was built with UASM.
Computed sum = 2016.0.
Check passed.
<Press any key to Exit>
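For reference, the "Check passed" test can be reproduced on the CPU alone. The following is a minimal sketch, not the code from the attachment: the function names are made up for illustration, and it assumes the check is a plain comparison against the closed form.

```c
/* CPU reference: add the floats 0.0f .. (count - 1) one by one. */
float sequential_sum(int count)
{
    float sum = 0.0f;
    for (int i = 0; i < count; i++)
        sum += (float)i;
    return sum;
}

/* Closed form for 0 + 1 + ... + (count - 1), i.e. n(n+1)/2 with n = count - 1. */
float gauss_sum(int count)
{
    return (float)(count - 1) * (float)count / 2.0f;
}

/* Returns 1 when a computed sum (e.g. read back from the GPU) matches the closed form. */
int check_sum(float computed, int count)
{
    return computed == gauss_sum(count);
}
```

With count = 64 both functions return 2016.0f. All intermediate values are small integers, so the float arithmetic is exact and the equality test is safe here; for larger inputs a tolerance would be the safer choice.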
:biggrin:
Damn, I was expecting a graphics extravaganza.
Computed sum = 2016.0.
Check passed.
<Press any key to Exit>
Quote from: hutch-- on June 25, 2019, 01:40:49 PM
:biggrin:
Damn, I was expecting a graphics extravaganza.
Maybe someone will at least do a Mandelbrot? :icon_idea:
Hi Atelier!
The kernel doesn't work in 32-bit. How did you build it?
LATER: the .res I obtained is not exactly identical but looks good. Anyway, clCreateProgramWithSource fails.
LATER2: my mistake in creating the context :thumbsup:
Quote from: AW on June 25, 2019, 05:30:30 PM
Quote from: hutch-- on June 25, 2019, 01:40:49 PM
:biggrin:
Damn, I was expecting a graphics extravaganza.
Maybe someone will at least do a Mandelbrot? :icon_idea:
It works OK here, but I expected something more blurry (like a Gaussian blur) :biggrin:
"Check passed" - but why does it take ages?
Quote from: HSE on June 25, 2019, 08:39:43 PM
LATER2: my mistake creating context :thumbsup:
I know it works in 32-bit, but the movq bug is still there in 32-bit and causes problems when printing the result.
Quote from: jj2007 on June 25, 2019, 09:48:00 PM
"Check passed" - but why does it take ages?
It does not take ages here, any idea?
Quote from: daydreamer on June 25, 2019, 09:46:35 PM
It works OK here, but I expected something more blurry (like a Gaussian blur) :biggrin:
I am not an expert in blurry stuff. :sad:
Quote from: AW on June 25, 2019, 11:09:03 PM
I know it works in 32-bit but the movq/movsd bug is still there in 32-bit causing problems to print the result.
Well! Then there are 2 bugs, because UASM32 builds the VMOVD instruction with EVEX disabled.
Because in 32-bit the general registers cannot hold a real8, I used Kusswurm's technique:
movups reg2, xmm2
printf("Computed sum = %.1f.\n", reg2.r8[0*8]);
where:
RegXMM union 16
i8 sbyte 16 dup (0)
i16 sword 8 dup (0)
i32 sdword 4 dup (0)
i64 qword 2 dup (0)
r4 real4 4 dup (0.0)
r8 real8 2 dup (0.0)
RegXMM ends
.data
reg2 RegXMM <>
Computed sum = 2016.0.
Check passed.
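For readers who think in C rather than assembly, the RegXMM union above corresponds roughly to the sketch below. It is only an illustration of the idea (one 16-byte block viewed through several lane types); the helper names are invented here and are not part of UASM or of the posted source.

```c
#include <stdint.h>
#include <string.h>

/* Rough C counterpart of RegXMM: 16 bytes seen as different lane types. */
typedef union {
    int8_t  i8[16];
    int16_t i16[8];
    int32_t i32[4];
    int64_t i64[2];
    float   r4[4];
    double  r8[2];
} RegXMM;

/* Mimics "movups reg2, xmm2" when xmm2 holds a double in its low 8 bytes
   and zeros in the high 8 bytes. */
RegXMM regxmm_from_low_double(double value)
{
    RegXMM reg;
    memset(&reg, 0, sizeof reg);
    reg.r8[0] = value;
    return reg;
}

/* Reads the low lane back as a real8, ready to be passed to printf("%.1f"). */
double regxmm_low_double(const RegXMM *reg)
{
    return reg->r8[0];
}
```

The point of the technique is that the value goes from the XMM register to memory and from memory to the printf argument list, so no general-purpose register ever has to hold the 64-bit float.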
This is a 32-bit version (without using the Kusswurm union).
In 32-bit, C-style calling does not appear to support assignment of return values (or maybe I am missing something).
Fantastic :thumbsup:
32-bit return values perhaps rely on another secret option :biggrin:
Quote from: AW on June 25, 2019, 11:13:29 PM
Quote from: jj2007 on June 25, 2019, 09:48:00 PM
"Check passed" - but why does it take ages?
It does not take ages here, any idea?
Some initialisation required maybe? It takes about 1.5 seconds until I see the result, while launching a complex graphics application takes only a few milliseconds. Note this is nothing against your code (compliments), I am just curious why there is a delay. You are not generating Millions of numbers somewhere afaics...
Quote from: AW on June 25, 2019, 05:30:30 PM
Quote from: hutch-- on June 25, 2019, 01:40:49 PM
:biggrin:
Damn, I was expecting a graphics extravaganza.
Maybe someone will at least do a Mandelbrot? :icon_idea:
Hi AW,
the program says that the platform is invalid. Apparently I am doing something wrong; on the web I read that function calls on the NVIDIA system look different ... most likely this is the problem :undecided:
Quote
OpenCLMandelBrot>ucl
Found 2 platform(s)
*** Platforms Information ***
Platform 0 - Name: NVIDIA CUDA
Vendor: NVIDIA Corporation
Version: OpenCL 1.2 CUDA 10.1.120
Profile: FULL_PROFILE
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer
Platform 1 - Name: Intel(R) OpenCL
Vendor: Intel(R) Corporation
Version: OpenCL 1.2
Profile: FULL_PROFILE
Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_dx9_media_sharing cl_intel_dx9_media_sharing cl_khr_d3d11_sharing cl_khr_gl_sharing cl_khr_fp64
*** Devices Information ***
Platform 0
Device Name: Quadro P4000
Driver Version: 430.39
Device Profile: FULL_PROFILE
Device Version: OpenCL 1.2 CUDA
clGetDeviceIDs failed.
Quote
Computed sum = 2016.0.
Check passed.
<Press any key to Exit>
Quote
the program says that the platform is invalid,
That makes two of us getting an invalid platform :sad:
I will have a look tomorrow and see if I can figure out what's going on.
Quote from: jj2007 on June 26, 2019, 03:51:15 AM
Some initialisation required maybe? It takes about 1.5 seconds until I see the result, while launching a complex graphics application takes only a few milliseconds. Note this is nothing against your code (compliments), I am just curious why there is a delay. You are not generating Millions of numbers somewhere afaics...
Some people report delays. Try searching Google for "why opencl slow to initialize"...
My Win 10 64 Pro does not like it either.
A:\OpenCL\Release>openclmandelbrot
Getting device IDs: Invalid platform
Creating context: Invalid device
Getting context info: Invalid context
Getting device info: Invalid device
Device 0:
x64 version
A:\OpenCL\x64\Release>openclmandelbrot
Getting device IDs: Invalid platform
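The messages in the output above are the text form of OpenCL status codes. A minimal sketch of such a translation follows; it is not the sample's own check_succeeded helper (which is not shown in this thread), the function names are hypothetical, and the numeric values are repeated from the Khronos CL/cl.h header only so the snippet compiles without it.

```c
#include <stdio.h>

/* Values as defined in CL/cl.h; defined here only if the header is absent. */
#ifndef CL_SUCCESS
#define CL_SUCCESS            0
#define CL_DEVICE_NOT_FOUND  -1
#define CL_INVALID_VALUE     -30
#define CL_INVALID_PLATFORM  -32
#define CL_INVALID_DEVICE    -33
#define CL_INVALID_CONTEXT   -34
#define CL_INVALID_PROGRAM   -44
#define CL_INVALID_KERNEL    -48
#endif

/* Map a few common OpenCL status codes to readable text. */
const char *cl_error_name(int err)
{
    switch (err) {
    case CL_SUCCESS:          return "Success";
    case CL_DEVICE_NOT_FOUND: return "Device not found";
    case CL_INVALID_VALUE:    return "Invalid value";
    case CL_INVALID_PLATFORM: return "Invalid platform";
    case CL_INVALID_DEVICE:   return "Invalid device";
    case CL_INVALID_CONTEXT:  return "Invalid context";
    case CL_INVALID_PROGRAM:  return "Invalid program";
    case CL_INVALID_KERNEL:   return "Invalid kernel";
    default:                  return "Unknown error";
    }
}

/* Print "<step>: <error text>" to stderr when a call did not succeed. */
void report_cl_error(const char *step, int err)
{
    if (err != CL_SUCCESS)
        fprintf(stderr, "%s: %s\n", step, cl_error_name(err));
}
```

Note that in the output above only the first message is informative: once the device query fails, the later calls receive invalid handles, so "Invalid device" and "Invalid context" are follow-on errors.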
The mandelbrot sample can be made to work after some surgery, in particular removing the attempt to use the CPU for the calculations.
What it does is produce a .bmp called output.bmp in the folder of the .exe
I attach what is required to see the mandelbrot.
Quote from: AW on June 26, 2019, 06:53:37 AM
The mandelbrot sample can be made to work after some surgery, in particular removing the attempt to use the CPU for the calculations.
What it does is produce a .bmp called output.bmp in the folder of the .exe
I attach what is required to see the mandelbrot.
:azn: :thumbsup:
one more example (Source Code) .. MatrixMultiplication_OpenCL_cpp
Jose,
The last one worked OK, I deleted the existing bitmap and ran it and it produced the new one.
Quick note on HSE's RegXMM union: UASM has built-in types for XMM/YMM/ZMM to match the C intrinsic types. So if you don't need it to be cross-asm compatible, or if you put the definition in an IFDEF, you can use the built-in __m128 type, which is already a union of structs with each element type (byte/word/dword/qword/real4/real8).
With regard to the VMOVD/Q issue and the C call return, I will check these out.
Quote from: johnsa on June 26, 2019, 06:08:51 PM
... to match the C intrinsic types. ... to be cross-asm compatible
I don't know C :biggrin:, and the idea always is compatibility :thumbsup:.
Thanks.
Works great with the mandel, AW. BTW, Gaussian blur is a 2D paint program function, so I was thinking it was related to the Gaussian math, but applied in a different way to an image.
Everything you always wanted to know about your GPU, CUDA, OpenCL, Vulkan etc and were afraid to ask is in this little application:
(https://www.dropbox.com/s/m23xdussz24c0uu/opencl.jpg?dl=1)
Found 2 platform(s)
*** Platforms Information ***
Platform 0 - Name: NVIDIA CUDA
Vendor: NVIDIA Corporation
Version: OpenCL 1.2 CUDA 10.2.120
Profile: FULL_PROFILE
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer
Platform 1 - Name: Intel(R) CPU Runtime for OpenCL(TM) Applications
Vendor: Intel(R) Corporation
Version: OpenCL 2.1 WINDOWS
Profile: FULL_PROFILE
Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_dx9_media_sharing cl_intel_dx9_media_sharing cl_khr_d3d11_sharing cl_khr_gl_sharing cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint
*** Devices Information ***
Platform 0
Device Name: GeForce GTX 1060 6GB
Driver Version: 430.86
Device Profile: FULL_PROFILE
Device Version: OpenCL 1.2 CUDA
Platform 1
Device Name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Driver Version: 18.1.0.0920
Device Profile: FULL_PROFILE
Device Version: OpenCL 2.1 (Build 0)
<Press any key to Exit>
My NVIDIA card has no OpenCL CPU capability (it appears that the NVIDIA OpenCL platform never exposes a CPU device). However, we can install the Intel OpenCL runtime for CPUs from their site.
WARNING: The procedure is convoluted and may fail for unforeseen reasons. You have been warned.
I proceeded this way, you may need to proceed differently:
1) Download from https://software.intel.com/en-us/articles/opencl-drivers the Intel® CPU Runtime for OpenCL™ Applications 18.1 for Windows* OS (64bit or 32bit). If you have an AMD CPU, this will not work for you, of course, but a similar course may be available on the AMD site.
2) Install it.
3) This will disable the CUDA OpenCL drivers. Don't worry.
4) In the Registry, under Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors, add a DWORD value named intelocl64.dll with data 0. Do the same for the WOW6432 hive under Computer\HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Khronos\OpenCL\Vendors, adding a DWORD value named intelocl32.dll with data 0.
For further information on this Registry procedure search for OpenCL Khronos Registry Key.
5) Now reinstall the CUDA drivers.
You now have another OpenCL platform. Wow!
(https://www.dropbox.com/s/8demhluzqy1zs1d/opencl2.jpg?dl=1)
I attach the updated detection application, which now detects all OpenCL devices (previously it detected only GPUs).
:thumbsup:
Found 2 platform(s)
*** Platforms Information ***
Platform 0 - Name: NVIDIA CUDA
Vendor: NVIDIA Corporation
Version: OpenCL 1.2 CUDA 10.1.120
Profile: FULL_PROFILE
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing
cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer
Platform 1 - Name: Intel(R) OpenCL
Vendor: Intel(R) Corporation
Version: OpenCL 1.2
Profile: FULL_PROFILE
Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread
cl_khr_spir cl_khr_dx9_media_sharing cl_intel_dx9_media_sharing cl_khr_d3d11_sharing cl_khr_gl_sharing cl_khr_fp64
*** Devices Information ***
Platform 0
Device Name: Quadro P4000
Driver Version: 430.39
Device Profile: FULL_PROFILE
Device Version: OpenCL 1.2 CUDA
Platform 1
Device Name: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Driver Version: 5.2.0.10094
Device Profile: FULL_PROFILE
Device Version: OpenCL 1.2 (Build 10094)
Quote
Found 2 platform(s)
*** Platforms Information ***
Platform 0 - Name: NVIDIA CUDA
Vendor: NVIDIA Corporation
Version: OpenCL 1.2 CUDA 8.0.0
Profile: FULL_PROFILE
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer
Platform 1 - Name: Intel(R) OpenCL
Vendor: Intel(R) Corporation
Version: OpenCL 2.0
Profile: FULL_PROFILE
Extensions: cl_intel_dx9_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_d3d11_sharing cl_khr_depth_images cl_khr_dx9_media_sharing cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir
*** Devices Information ***
Platform 0
Device Name: GeForce 940M
Driver Version: 382.05
Device Profile: FULL_PROFILE
Device Version: OpenCL 1.2 CUDA
Platform 1
Device Name: Intel(R) HD Graphics 5500
Driver Version: 20.19.15.4835
Device Profile: FULL_PROFILE
Device Version: OpenCL 2.0
Device Name: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
Driver Version: 5.2.0.10094
Device Profile: FULL_PROFILE
Device Version: OpenCL 2.0 (Build 10094)
<Press any key to Exit>
I tested GPU Caps Viewer and it shows that OpenCL has 10+ compute units, but a new NVIDIA card lists 1500+ CUDA cores in its specs. Does this mean it runs more than 100 times faster?
It also shows disappointing figures for usable RAM: there is a big difference from the installed RAM.
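On the compute units question: on NVIDIA hardware an OpenCL compute unit corresponds to a whole streaming multiprocessor, and each multiprocessor contains many CUDA cores, so the two figures count different things rather than indicating a 100x speed gap. The sketch below is just that arithmetic; the function name is made up, and the cores-per-unit figure is an assumption that depends on the GPU architecture (128 is the figure for consumer Pascal cards such as the GTX 1060 6GB, which has 10 multiprocessors).

```c
/* Estimate the CUDA-core count from the OpenCL compute-unit count.
   cores_per_unit is architecture dependent and must be supplied by the caller. */
unsigned estimated_cores(unsigned compute_units, unsigned cores_per_unit)
{
    return compute_units * cores_per_unit;
}
```

For example, estimated_cores(10, 128) gives 1280, which matches the core count advertised for the GTX 1060 6GB.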
Found 2 platform(s)
*** Platforms Information ***
Platform 0 - Name: NVIDIA CUDA
Vendor: NVIDIA Corporation
Version: OpenCL 1.2 CUDA 10.0.132
Profile: FULL_PROFILE
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer
Platform 1 - Name: Experimental OpenCL 2.1 CPU Only Platform
Vendor: Intel(R) Corporation
Version: OpenCL 2.1
Profile: FULL_PROFILE
Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_dx9_media_sharing cl_intel_dx9_media_sharing cl_khr_d3d11_sharing cl_khr_gl_sharing cl_khr_fp64 cl_khr_image2d_from_buffer
*** Devices Information ***
Platform 0
Device Name: GeForce GTX 980 Ti
Driver Version: 416.81
Device Profile: FULL_PROFILE
Device Version: OpenCL 1.2 CUDA
Platform 1
Device Name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Driver Version: 6.3.0.1904
Device Profile: FULL_PROFILE
Device Version: OpenCL 2.1 (Build 18)
So, a lot of people have OpenCL available for their CPUs (many more than I thought, and now I have joined the gang too :biggrin:).
I modified the mandelbrot C/C++ sample provided by LiaoMi again, this time to make it prefer an OpenCL CPU device if one is available and fall back to the GPU otherwise.
BTW, the modifications are all in a single function, so I will leave it here:
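#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* check_succeeded() is the error-reporting helper defined elsewhere in the sample. */
cl_context create_context(cl_uint* num_devices) {
    cl_platform_id *platforms;
    cl_uint num_platforms = 0;
    cl_int err;
    cl_device_id *devices = NULL;
    cl_uint num_cpus = 0;
    cl_uint n_devices = 0;
    cl_context context;
    cl_uint i;

    *num_devices = 0;
    if (clGetPlatformIDs(0, NULL, &num_platforms) != CL_SUCCESS || num_platforms == 0)
    {
        fprintf(stderr, "No platforms\n");
        exit(1);
    }
    platforms = malloc(sizeof(cl_platform_id) * num_platforms);
    if (platforms == NULL)
    {
        fprintf(stderr, "Out of memory\n");
        exit(1);
    }
    err = clGetPlatformIDs(num_platforms, platforms, NULL);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "Couldn't identify a platform\n");
        exit(1);
    }

    // First choice: the first platform that exposes a CL_DEVICE_TYPE_CPU device
    for (i = 0; i < num_platforms && devices == NULL; i++)
    {
        if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_CPU, 0, NULL, &num_cpus) == CL_SUCCESS
            && num_cpus > 0)
        {
            *num_devices = num_cpus;
            devices = malloc(num_cpus * sizeof(cl_device_id));
            if (devices != NULL)
                clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_CPU, num_cpus, devices, NULL);
        }
    }

    // Fallback: the first platform that exposes a CL_DEVICE_TYPE_GPU device
    for (i = 0; i < num_platforms && devices == NULL; i++)
    {
        if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &n_devices) == CL_SUCCESS
            && n_devices > 0)
        {
            *num_devices = n_devices;
            devices = malloc(n_devices * sizeof(cl_device_id));
            if (devices != NULL)
                clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, n_devices, devices, NULL);
        }
    }

    if (devices == NULL)
    {
        fprintf(stderr, "No CPU or GPU OpenCL device found\n");
        exit(1);
    }

    context = clCreateContext(NULL, *num_devices, devices, NULL, NULL, &err);
    check_succeeded((char*)"Creating context", err);

    // The context keeps its own references, so the temporary arrays can go
    free(devices);
    free(platforms);
    return context;
}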
I also attach the built .exe
This is what I got:
Device 0: Intel(R) Corporation Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Conclusion: it works and produces the expected mandelbrot.
@daydreamer
Well, from what I have read, CUDA may be 10 to 25% faster in most cases. But I have not run any tests on that (yet).
GPUcaps is a nice toy.
An older ATI/AMD driver might not have OpenCL, or perhaps the AMD Crimson driver doesn't provide it for older cards?
https://forums.guru3d.com/threads/non-gcn-crimson-and-opencl.407531/
EDIT: OclInfoDyn.c dynamically loads OpenCL.dll, or a DLL whose name is given on the command line.
OclInfoDyn64.exe Intelocl64.dll
Number of platforms: 1
Platform: 0
Platform Vendor: Intel(R) Corporation
Number of devices: 1
Device: 0
Type: 2 CL_DEVICE_TYPE_CPU
Name: AMD Athlon(tm) II X2 220 Processor
Vendor: Intel(R) Corporation
Available: Yes
Compute Units: 2
Clock Frequency: 0 MHz
Global Memory: 8191 mb
Max Allocateable Memory: 2048 mb
Local Memory: 32768 kb
Hi AW,
here are my results. Does this example work for everyone?
Quote
New folder>OpenCLMandelBrot
Device 0: Intel(R) Corporation Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Loading kernel: Invalid value
Building program: Invalid program
Setting kernel arg: Invalid kernel
Running kernel: Invalid kernel
I forgot to include the kernel file in the ZIP file, sorry. :sad:
It is attached; please place it in the same folder as the .exe.
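The "Loading kernel: Invalid value" error is what one would expect when the source file is missing, because clCreateProgramWithSource returns CL_INVALID_VALUE when it is handed a NULL source string. A hedged sketch of a loader that fails early instead is shown below; the function name is hypothetical and this is not the code from the attachment.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a NUL-terminated heap buffer.
   Returns NULL if the file is missing or unreadable; the caller frees the buffer. */
char *read_kernel_source(const char *path, size_t *length)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (got != (size_t)size) { free(buf); return NULL; }

    buf[size] = '\0';
    if (length != NULL)
        *length = got;
    return buf;
}
```

Checking the return value for NULL before calling clCreateProgramWithSource would turn the cascade of "Invalid program" and "Invalid kernel" messages into a single clear "kernel file not found" report.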