Hi
I would like to ask if anyone has succeeded in interfacing the GPU from asm.
Using CUDA or DirectCompute seems the right way to go. :idea:
It would be nice if you could provide an example in asm. ;)
Regards, Biterider
I've done some conversion of CUDA and NVML for 64-bit asm:
https://github.com/mrfearless/libraries/tree/master/Cuda
Updated my repo to include a CudaDetection RadASM project I used whilst testing some stuff out, mainly using NVML to detect the GPU. Haven't got round to the CUDA part or testing any of the functions yet.
You have to use the appropriate CUDA SDK version libraries and DLLs. I have folders for the v8.0 and v9.1 SDKs; just copy from the appropriate folder and overwrite in the CudaDetection folder. Currently the CudaDetection folder contains the v9.1 libs and DLL files.
I also played around with OpenCL a little (collected a few libs, converted an include file): https://github.com/mrfearless/libraries/tree/master/OpenCL - updated to include an OpenCLDetection project. I haven't actually tested anything yet; I was going to use the project to help convert the include file, I think.
Done some stuff with Vulkan as well: converted some of the includes and did some minor testing using the examples/samples/tutorials from the SDK as a basis to see if it would work. Only got so far before getting distracted, so I haven't got back to it. I got up to the 4th example (04-init_command_buffer), but can't recall if that is working. https://github.com/mrfearless/libraries/tree/master/Vulkan
Hi fearless
Thank you very much. It is a beginning... :t
Anyone else?
Biterider
Quote from: Biterider on April 30, 2019, 09:57:23 PM
Hi fearless
Thank you very much. It is a beginning... :t
Anyone else?
Biterider
Take a look at Hitchhiker's 1k demos in the old forum: he had just a minimal D3D or OpenGL setup with a quad and a pixel shader in clear text. Pixel shaders can be used together with texture maps as arrays.
Quote from: Biterider on April 30, 2019, 07:26:14 PM
Hi
I would like to ask if anyone has succeeded to interface the GPU from asm.
Using CUDA or DirectCompute seems the right way to go. :idea:
It would be nice if you can provide example in asm. ;)
Regards, Biterider
Cuda.. PTX assembly... etc..
http://masm32.com/board/index.php?topic=4711.0
Translating GPU Binaries to Tiered SIMD Architectures: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.147.1333&rep=rep1&type=pdf
(https://i.ibb.co/hRSG6XK/CudaPTX.png)
CudaPAD is a PTX/SASS viewer for NVIDIA CUDA kernels and provides an on-the-fly view of the assembly.
https://github.com/SunsetQuest/CudaPAD
CudaPAD
http://www.codeproject.com/Articles/999744/CudaPAD
(https://cloud.githubusercontent.com/assets/10804507/21625199/711f6e1e-d1bf-11e6-888f-d30ddf585231.png)
Hi,
PTX ISA (PDF) - v10.1.105 (older) - Last updated March 26, 2019: https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf
PTX Writer's Guide to Interoperability: https://docs.nvidia.com/cuda/pdf/PTX_Writers_Guide_To_Interoperability.pdf
CUDA Toolkit 10.1 Download: https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal
Cuda_Assembly Example exe + src + inc 8)
format PE64 GUI 5.0
entry start
include 'win64a.inc'
include 'cuda.inc'
section '.text' code readable executable
start:
push rbp
; init
lea rbx,[_cuInit]
invoke cuInit,0
test eax,eax
jnz Error
lea rbx,[_cuDeviceGet]
invoke cuDeviceGet,CudaDevice,0 ; get the first device
test eax,eax
jnz Error
lea rbx,[_cuCtxCreate]
invoke cuCtxCreate,CudaContext,CU_CTX_SCHED_SPIN+CU_CTX_MAP_HOST,[CudaDevice] ; this context associates the device with this cpu thread
test eax,eax
jnz Error
lea rbx,[_cuMemAlloc]
invoke cuMemAlloc,CudaNumberArray,256*4
test eax,eax
jnz Error
; load ptx source
lea rbx,[_cuModuleLoadData]
invoke cuModuleLoadData,CudaModule,PTXSourceData
test eax,eax
jnz Error
lea rbx,[_cuModuleGetFunction]
invoke cuModuleGetFunction,CudaFunction,[CudaModule],PTXFunction
test eax,eax
jnz Error
; fill in NumberArray and LogNumberArray with x87 calculations for comparison
fld1
fld1
mov ecx,256*4
lea rax,[NumberArray+rcx]
lea rbx,[LogNumberArray+rcx]
neg rcx
@@: fst dword[rax+rcx]
fld1
fld st1
fyl2x
fstp dword[rbx+rcx]
fadd st0,st1
add rcx,4
jnz @b
fcompp
; call the function
lea rbx,[_cuMemcpyHtoD]
invoke cuMemcpyHtoD,[CudaNumberArray],NumberArray,256*4
test eax,eax
jnz Error
lea rbx,[_cuParamSetSize]
invoke cuParamSetSize,[CudaFunction],12 ; 8 byte first argument + 4 byte second
test eax,eax
jnz Error
mov rax,[CudaNumberArray] ; fill in the arguments
mov qword[Message+0],rax ;
mov dword[Message+8],256 ;
lea rbx,[_cuParamSetv]
invoke cuParamSetv,[CudaFunction],0,Message,12 ; and pass in (cuParamSeti doesn't always work so well)
test eax,eax
jnz Error
lea rbx,[_cuFuncSetBlockShape]
invoke cuFuncSetBlockShape,[CudaFunction],256,1,1
test eax,eax
jnz Error
lea rbx,[_cuLaunchGrid]
invoke cuLaunchGrid,[CudaFunction],1,1
test eax,eax
jnz Error
lea rbx,[_cuCtxSynchronize]
invoke cuCtxSynchronize
test eax,eax
jnz Error
lea rbx,[_cuMemcpyDtoH]
invoke cuMemcpyDtoH,CudaLogNumberArray,[CudaNumberArray],256*4
test eax,eax
jnz Error
; clean up
lea rbx,[_cuMemFree]
invoke cuMemFree,[CudaNumberArray]
test eax,eax
jnz Error
lea rbx,[_cuCtxDestroy]
invoke cuCtxDestroy,[CudaContext]
test eax,eax
jnz Error
lea rdi,[Message]
cld
mov rax,'32bit fl'
stosq
mov rax,'oats: x'
stosq
mov rax,' lg2.a'
stosq
mov rax,'pprox.f3'
stosq
mov rax,'2(x) '
stosq
mov rax,'fyl2x(x,'
stosq
mov rax,'1.0) '
stosq
mov al,10
stosb
xor ebx,ebx
@@: fld dword[NumberArray+4*rbx]
fstp qword[rsp-8]
mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi],xmm0
fld dword[CudaLogNumberArray+4*rbx]
fstp qword[rsp-8]
mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi+16],xmm0
fld dword[LogNumberArray+4*rbx]
fstp qword[rsp-8]
mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi+32],xmm0
mov byte[rdi+48],10
add rdi,49
add rbx,1
cmp rbx,20
jb @b
mov byte[rdi],0
invoke MessageBox,NULL,Message,NULL,MB_OK
invoke ExitProcess,0
Error:
mov ecx,42
cmp eax,8
ja @f
mul ecx
lea rax,[err000+rax]
jmp .Print
@@:
cmp eax,100
jb @f
cmp eax,101
ja @f
sub eax,100
mul ecx
lea rax,[err100+rax]
jmp .Print
@@:
cmp eax,200
jb @f
cmp eax,216
ja @f
sub eax,200
mul ecx
lea rax,[err200+rax]
jmp .Print
@@:
cmp eax,300
jb @f
cmp eax,304
ja @f
sub eax,300
mul ecx
lea rax,[err300+rax]
jmp .Print
@@:
cmp eax,400
jne @f
sub eax,400
mul ecx
lea rax,[err400+rax]
jmp .Print
@@:
cmp eax,500
jne @f
sub eax,500
mul ecx
lea rax,[err500+rax]
jmp .Print
@@:
cmp eax,600
jne @f
sub eax,600
mul ecx
lea rax,[err600+rax]
jmp .Print
@@:
cmp eax,700
jb @f
cmp eax,709
ja @f
sub eax,700
mul ecx
lea rax,[err700+rax]
jmp .Print
@@:
cmp eax,999
jne @f
sub eax,999
mul ecx
lea rax,[err999+rax]
jmp .Print
@@:
lea rax,[errNoMatch]
.Print:
mov rdx,[rbx+0]
mov qword[Message+0],rdx
mov rdx,[rbx+8]
mov qword[Message+8],rdx
mov edx,[rbx+16]
mov dword[Message+16],edx
mov byte[Message+19],10
mov rdx,[rax+0]
mov qword[Message+20],rdx
mov rdx,[rax+8]
mov qword[Message+28],rdx
mov rdx,[rax+16]
mov qword[Message+36],rdx
mov rdx,[rax+24]
mov qword[Message+44],rdx
mov rdx,[rax+32]
mov qword[Message+52],rdx
mov edx,[rax+40]
mov dword[Message+60],edx
mov byte[Message+62],0
invoke MessageBox,NULL,Message,NULL,MB_OK
invoke ExitProcess,0
section '.data' data readable
PTXFunction: db 'log_2',0
; log_2(*inout,length):
; xind = %ctaid.x * %ntid.x + %tid.x
; if xind < length , [inout + 4 * xind] = log2([inout + 4 * xind])
; return
PTXSourceData:
db ' .version 1.4'
db ' .target sm_13'
db ' .entry log_2 ('
db ' .param .u64 _inout,'
db ' .param .s32 _length)'
db ' {'
db ' .reg .u16 w1,w2;' ; word-sized registers
db ' .reg .u32 e1,xind;' ; dwords
db ' .reg .u64 r1;' ; qwords
db ' .reg .f32 f1;' ; floats
db ' .reg .pred p1;' ; conditions
db ' mov.u16 w1,%ctaid.x;'
db ' mov.u16 w2,%ntid.x;'
db ' cvt.u32.u16 e1,%tid.x;'
db ' mad.wide.u16 xind,w1,w2,e1;'
db ' ld.param.s32 e1,[_length];'
db ' setp.le.s32 p1,e1,xind;'
db ' @p1 bra $Lt_Exit;' ; if p1, branch
db ' ld.param.u64 r1,[_inout];'
db ' mad.wide.u32 r1,xind,4,r1;'
db ' ld.global.f32 f1,[r1+0];'
db ' lg2.approx.f32 f1,f1;'
db ' st.global.f32 [r1+0],f1;'
db '$Lt_Exit:'
db ' exit;'
db ' }'
db 0
section '.data' data readable writeable
align 16
ErrorMessageFormat db 'error code:',10,'hex: 0x%.8x',10,'dec: %u',0
MessageFormat db '%16.7f',0
err000:
db 'CUDA_SUCCESS ';= 0
db 'CUDA_ERROR_INVALID_VALUE ';= 1
db 'CUDA_ERROR_OUT_OF_MEMORY ';= 2
db 'CUDA_ERROR_NOT_INITIALIZED ';= 3
db 'CUDA_ERROR_DEINITIALIZED ';= 4
db 'CUDA_ERROR_PROFILER_DISABLED ';= 5
db 'CUDA_ERROR_PROFILER_NOT_INITIALIZED ';= 6
db 'CUDA_ERROR_PROFILER_ALREADY_STARTED ';= 7
db 'CUDA_ERROR_PROFILER_ALREADY_STOPPED ';= 8
err100:
db 'CUDA_ERROR_NO_DEVICE ';= 100
db 'CUDA_ERROR_INVALID_DEVICE ';= 101
err200:
db 'CUDA_ERROR_INVALID_IMAGE ';= 200
db 'CUDA_ERROR_INVALID_CONTEXT ';= 201
db 'CUDA_ERROR_CONTEXT_ALREADY_CURRENT ';= 202
db '? ';=
db '? ';=
db 'CUDA_ERROR_MAP_FAILED ';= 205
db 'CUDA_ERROR_UNMAP_FAILED ';= 206
db 'CUDA_ERROR_ARRAY_IS_MAPPED ';= 207
db 'CUDA_ERROR_ALREADY_MAPPED ';= 208
db 'CUDA_ERROR_NO_BINARY_FOR_GPU ';= 209
db 'CUDA_ERROR_ALREADY_ACQUIRED ';= 210
db 'CUDA_ERROR_NOT_MAPPED ';= 211
db 'CUDA_ERROR_NOT_MAPPED_AS_ARRAY ';= 212
db 'CUDA_ERROR_NOT_MAPPED_AS_POINTER ';= 213
db 'CUDA_ERROR_ECC_UNCORRECTABLE ';= 214
db 'CUDA_ERROR_UNSUPPORTED_LIMIT ';= 215
db 'CUDA_ERROR_CONTEXT_ALREADY_IN_USE ';= 216
err300:
db 'CUDA_ERROR_INVALID_SOURCE ';= 300
db 'CUDA_ERROR_FILE_NOT_FOUND ';= 301
db 'CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND ';= 302
db 'CUDA_ERROR_SHARED_OBJECT_INIT_FAILED ';= 303
db 'CUDA_ERROR_OPERATING_SYSTEM ';= 304
err400:
db 'CUDA_ERROR_INVALID_HANDLE ';= 400
err500:
db 'CUDA_ERROR_NOT_FOUND ';= 500
err600:
db 'CUDA_ERROR_NOT_READY ';= 600
err700:
db 'CUDA_ERROR_LAUNCH_FAILED ';= 700
db 'CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES ';= 701
db 'CUDA_ERROR_LAUNCH_TIMEOUT ';= 702
db 'CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING ';= 703
db 'CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED ';= 704
db 'CUDA_ERROR_PEER_ACCESS_NOT_ENABLED ';= 705
db '? ';=
db '? ';=
db 'CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE ';= 708
db 'CUDA_ERROR_CONTEXT_IS_DESTROYED ';= 709
err999:
db 'CUDA_ERROR_UNKNOWN ';= 999
errNoMatch:
db '? '
_cuInit: db 'cuInit '
_cuDeviceGet: db 'cuDeviceGet '
_cuCtxCreate: db 'cuCtxCreate '
_cuMemAlloc: db 'cuMemAlloc '
_cuModuleLoadData: db 'cuModuleLoadData '
_cuModuleGetFunction: db 'cuModuleGetFunction '
_cuMemcpyHtoD: db 'cuMemcpyHtoD '
_cuParamSeti: db 'cuParamSeti '
_cuParamSetv: db 'cuParamSetv '
_cuFuncSetBlockShape: db 'cuFuncSetBlockShape '
_cuLaunchGrid: db 'cuLaunchGrid '
_cuParamSetSize: db 'cuParamSetSize '
_cuCtxSynchronize: db 'cuCtxSynchronize '
_cuMemcpyDtoH: db 'cuMemcpyDtoH '
_cuMemFree: db 'cuMemFree '
_cuCtxDestroy: db 'cuCtxDestroy '
align 16
CudaDevice dq ?
CudaContext dq ?
CudaModule dq ?
CudaFunction dq ?
CudaNumberArray dq ?
NumberArray rd 256
LogNumberArray rd 256
CudaLogNumberArray rd 256
Temp rb 32
Message rb 50*60
section '.idata' import data readable writeable
library kernel32,'KERNEL32.DLL',\
user32,'USER32.DLL',\
msvcrt,'MSVCRT.DLL',\
cuda,'NVCUDA.DLL'
include 'api\kernel32.inc'
include 'api\user32.inc'
include 'api_cuda.inc'
import msvcrt,\
sprintf,'sprintf'
With the toolkit from the posts above you can freely write modern code; it only remains to find an idea for a super-fast algorithm :P
Thanks LiaoMi
Very good and interesting work :t
Biterider
Hi Biterider,
the example is old, but perfect for demonstration! :icon14:
mrfearless has a different set of APIs (https://github.com/mrfearless/libraries/tree/master/Cuda), maybe because of the CUDA Toolkit version, although the compiled exe works fine for me. We can easily rewrite the example for UASM; the main thing is to understand how to use the new 10.1 API :badgrin:
Nsight Productivity Utilities PTX Debugger + Tools
NVIDIA® Nsight™ Systems
NVIDIA® Nsight Systems™ is a system-wide performance analysis tool designed to visualize an application's algorithms, identify the largest optimization opportunities, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large servers to our smallest SoC.
NVIDIA® Nsight™ Graphics
NVIDIA® Nsight Graphics™ is a standalone application for the debugging, profiling, and analysis of graphics applications. It allows you to optimize the performance of your Direct3D 11, Direct3D 12, DirectX Raytracing, OpenGL, Vulkan, and NVIDIA VKRay based applications.
NVIDIA® Nsight™ Eclipse Edition
NVIDIA® Nsight™ Eclipse Edition is a full-featured IDE powered by the Eclipse platform that provides an all-in-one integrated environment to edit, build, debug, and profile CUDA-C applications. Nsight Eclipse Edition supports a rich set of commercial and free plugins.
NVIDIA® Nsight™ Compute
NVIDIA® Nsight™ Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. Nsight Compute also provides customizable and data-driven user interface and metric collection that can be extended with analysis scripts for post-processing results.
NVIDIA® Nsight™ Visual Studio Edition
An application development environment for heterogeneous platforms, Nsight Visual Studio Edition brings GPU computing into Microsoft Visual Studio. Build, debug, profile and trace heterogeneous compute, graphics, virtual reality, RTX, .NET, and UWP applications built with CUDA C/C++, OpenCL, DirectCompute, Direct3D (11,12,DXR), Vulkan (1.1, Vulkan Ray Tracing Extension), OpenGL, OpenVR, and the Oculus SDK.
Download NVIDIA® Nsight™ Visual Studio Edition 2019.1: https://developer.nvidia.com/gameworksdownload#?dn=nsight-visual-studio-edition-2019-1 (free registration is required; then you can download the file NVIDIA_Nsight_Visual_Studio_Edition_Win64_2019.1.1.19056.msi, 492 MB) :eusa_boohoo:
(https://developer.nvidia.com/sites/default/files/akamai/tools/nsvse/nsvse60/60_Next-Gen_Debugger_PTX+SASS.png)
Probably a good few changes since I last looked at it. I think the version of the CUDA SDK I was working with was primarily v8, just to see if I could get a few things working, mainly with the NVML part. So any APIs and include file conversions are probably based on that v8. I think there was some reason for only using x64 asm; can't quite recall why that and not x86. Maybe there wasn't an x86 DLL for NVML or something (I think).
Anyhow, it looks like you have a lot of stuff figured out, so this looks promising.
Hi fearless,
I checked your SDK for the presence of the APIs and compared it with the original version 10.1. My conclusion: your SDK is not complete. All the functions are still present in 10.1; nothing has changed in the new CUDA versions.
I attach my version of the converted .h files: https://www.solidfiles.com/v/yR7MaKLr5Vkx6; the library files are, as usual, in the CUDA Toolkit 10.1.
I did not convert two folders, because they contain templates and classes that are not supported by the converter:
thrust/detail
thrust/system
I have not tried to convert the .h files using the H2INCX utility; I did it the fast way to cover more files. h2incX needs a graphical front-end, a separate program that generates flags, creates subdirectories during conversion, and manages the conversion list; UPX has a similar separate interface.
The Linux "Cuda_Assembly Example" version can be found below, FASM source code :icon14:
P.S. The converter did not give any errors, but you may have to correct something :bgrin:
First bugs after conversion :biggrin:
C++
/**
* External memory buffer descriptor
*/
typedef struct CUDA_EXTERNAL_MEMORY_BUFFER_DESC_st {
/**
* Offset into the memory object where the buffer's base is
*/
unsigned long long offset;
/**
* Size of the buffer
*/
unsigned long long size;
/**
* Flags reserved for future use. Must be zero.
*/
unsigned int flags;
unsigned int reserved[16];
} CUDA_EXTERNAL_MEMORY_BUFFER_DESC;
Conversion
CUDA_EXTERNAL_MEMORY_BUFFER_DESC STRUCT DEFALIGNMASM
long DWORD ?
long DWORD ?
flags DWORD ?
reserved DWORD 16 dup (?)
CUDA_EXTERNAL_MEMORY_BUFFER_DESC ENDS
:redface: How many bytes is unsigned long long?
https://stackoverflow.com/questions/5836329/how-many-bytes-is-unsigned-long-long
https://en.cppreference.com/w/cpp/language/types :t
Hi again :biggrin:,
I can compile successfully under the UASM assembler. There is one problem: in the header files you can control the API version. In my case version 2 is used, but it should be version 1... cuCtxCreate_v2 -> cuCtxCreate
JMP QWORD PTR DS:[<&cuInit>]
JMP QWORD PTR DS:[<&cuDeviceGet>]
JMP QWORD PTR DS:[<&cuCtxCreate_v2>]
JMP QWORD PTR DS:[<&cuCtxDestroy_v2>]
JMP QWORD PTR DS:[<&cuCtxSynchronize>]
JMP QWORD PTR DS:[<&cuModuleLoadData>]
JMP QWORD PTR DS:[<&cuModuleGetFunction>]
JMP QWORD PTR DS:[<&cuMemAlloc_v2>]
JMP QWORD PTR DS:[<&cuMemFree_v2>]
JMP QWORD PTR DS:[<&cuMemcpyHtoD_v2>]
JMP QWORD PTR DS:[<&cuMemcpyDtoH_v2>]
JMP QWORD PTR DS:[<&cuFuncSetBlockShape>]
JMP QWORD PTR DS:[<&cuParamSetSize>]
JMP QWORD PTR DS:[<&cuParamSetv>]
JMP QWORD PTR DS:[<&cuLaunchGrid>]
JMP QWORD PTR DS:[<&FatalExit>]
JMP QWORD PTR DS:[<&MessageBoxA>]
I can adjust the parameters in the macro, but in that case I get the error "A2101: Macro nesting level too deep" from the UASM assembler :icon_confused:
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4879) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4879): Included by
mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4909) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4909): Included by
mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5719) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5719): Included by
mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5763) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5763): Included by
mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5796) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5796): Included by
mywindow1.asm(32): Main line code
Where can I patch the macro nesting depth?! :icon_exclaim: I still use version 2.47.1.0, due to a bug in version 2.48 :idea:
In my opinion everything in the file cuda.inc is correct:
#if defined(CUDA_FORCE_API_VERSION)
#if (CUDA_FORCE_API_VERSION == 3010)
#define __CUDA_API_VERSION 3010
#else
#error "Unsupported value of CUDA_FORCE_API_VERSION"
#endif
#else
#define __CUDA_API_VERSION 10010
#endif /* CUDA_FORCE_API_VERSION */
Why are different functions indicated as unavailable? :icon_eek: After the experiments I realized that the problem is in the page alignment. I forgot to remove the /FORCE flag, and added /LARGEADDRESSAWARE:NO, but no change... the API functions have invalid addresses.
The latest version for testing reached:
lea rbx, [_cuCtxSynchronize]
call cuCtxSynchronize
---------------------------
Error
---------------------------
cuCtxSynchronize
CUDA_ERROR_LAUNCH_FAILED
---------------------------
OK
---------------------------
Converting the header files was not so easy; there are places where API variables end up typed as DWORD. Additional adjustment of the translator templates and API checking is strongly required :icon_exclaim:
#if defined(_WIN64) || defined(__LP64__)
typedef unsigned long long CUdeviceptr;
#else
typedef unsigned int CUdeviceptr;
#endif
versus the (incorrect) conversion:
IF DEFINED(_WIN64) OR DEFINED(__LP64__)
long TYPEDEF DWORD
ELSE
CUdeviceptr TYPEDEF DWORD
ENDIF
Hi LiaoMi
I found the same issue while translating the header files. In the next few days, I'll make some additions to h2incX to better convert these files.
Biterider
Quote from: LiaoMi on May 14, 2019, 12:45:26 AM
The latest version for testing, reached -
lea rbx, [_cuCtxSynchronize]
call cuCtxSynchronize
---------------------------
Error
---------------------------
cuCtxSynchronize
CUDA_ERROR_LAUNCH_FAILED
---------------------------
OK
---------------------------
I produced a very simple example and ran into that issue as well :( . It took me a few hours to realize what the problem might be. I got it fixed by calling the v2 versions. :t
Here is the example, which consists of calculating the double of each element of an array of 1000 elements. It does this by launching 1024 threads (24 will do nothing); each handles only one element.
It may need to be modified to run on other targets; it was done for sm_30.
OPTION casemap:none
N equ 1000
includelib kernel32.lib
ExitProcess proto :dword
includelib msvcrt.lib
printf proto :ptr, :vararg
_getch proto
includelib cuda.lib
cuInit proto :dword
cuDeviceGet proto :ptr, :dword
cuCtxCreate_v2 proto :ptr, :dword, :dword
cuModuleLoadData proto :ptr, :ptr
cuModuleGetFunction proto :ptr, :ptr, :ptr
cuMemAlloc_v2 proto :ptr, :qword
cuMemcpyHtoD_v2 proto :ptr, :ptr, :qword
cuMemcpyDtoH_v2 proto :ptr, :ptr, :qword
cuLaunchKernel proto :ptr, :dword, :dword, :dword, :dword, :dword, :dword, :ptr, :ptr, :ptr
cuCtxSynchronize proto
cuMemFree_v2 proto :ptr
.code
mulby2 proc
LOCAL cuDevice:dword
LOCAL cuContext:ptr
LOCAL cuModule:ptr
LOCAL cudaFunction:ptr
LOCAL _da: qword
LOCAL _db: qword
LOCAL ha[N]:dword
LOCAL hb[N]:dword
LOCAL memSize : qword
LOCAL args[2] : ptr
sub rsp, 58h
and rsp, -16 ; align
mov eax, sizeof DWORD * N
mov memSize, rax
mov rcx,0
call cuInit
cmp eax, 0
jnz @exit
lea rcx, cuDevice
mov edx,0
call cuDeviceGet
cmp eax, 0
jnz @exit
lea rcx, cuContext
mov edx, 0
mov r8d, cuDevice
call cuCtxCreate_v2
cmp eax, 0
jnz @exit
lea rcx, cuModule
mov rdx, offset PTXSourceData
call cuModuleLoadData
cmp eax, 0
jnz @exit
lea rcx, cudaFunction
mov rdx, cuModule
lea r8, PTXFunction
call cuModuleGetFunction
cmp eax, 0
jnz @exit
mov ecx, N
mov r10d,0
lea r11, ha
@@:
mov dword ptr [r11], r10d
add r11, 4
inc r10d
dec ecx
jnz @B
lea rcx, _da
mov rdx, memSize
call cuMemAlloc_v2
cmp eax, 0
jnz @exit
lea rcx, _db
mov rdx, memSize
call cuMemAlloc_v2
cmp eax, 0
jnz @exit
mov rcx, _da
lea rdx, ha
mov r8, memSize
call cuMemcpyHtoD_v2
cmp eax, 0
jnz @exit
lea rax, _da
mov qword ptr args, rax
lea rax, _db
mov qword ptr args+8, rax
mov rcx, cudaFunction
mov edx ,1
mov r8d, 1
mov r9d, 1
mov dword ptr [rsp+20h], 1024
mov dword ptr [rsp+28h], 1
mov dword ptr [rsp+30h], 1
mov dword ptr [rsp+38h], 0
mov qword ptr [rsp+40h], 0
lea rax, args
mov qword ptr [rsp+48h], rax
mov qword ptr [rsp+50h], 0
call cuLaunchKernel
cmp eax, 0
jnz @exit
call cuCtxSynchronize
cmp eax, 0
jnz @exit
lea rcx, hb
mov rdx, _db
mov r8, memSize
call cuMemcpyDtoH_v2
cmp eax, 0
jnz @exit
mov ebx, 0
lea rdi, hb
@@:
mov rcx, offset report
mov edx, dword ptr [rdi]
call printf
add rdi, 4
inc ebx
cmp ebx, N
jl @B
mov rcx, _da
call cuMemFree_v2
mov rcx, _db
call cuMemFree_v2
mov eax,0
@exit:
mov ecx,0
call _getch
call ExitProcess
mulby2 endp
.data
report db "%d",10,0
; Generated by NVIDIA NVVM Compiler
; Compiler Build ID: CL-25769353
; Cuda compilation tools, release 10.1, V10.1.105
; Based on LLVM 3.4svn
PTXFunction db '_Z3addPiS_',0
PTXSourceData \
db '.version 6.4 ',10
db '.target sm_30 ',10
db '.address_size 64 ',10
db ' ',10
db ' // .globl _Z3addPiS_ ',10
db ' ',10
db '.visible .entry _Z3addPiS_( ',10
db ' .param .u64 _Z3addPiS__param_0, ',10
db ' .param .u64 _Z3addPiS__param_1 ',10
db ') ',10
db '{ ',10
db ' .reg .pred %p<2>; ',10
db ' .reg .b32 %r<4>; ',10
db ' .reg .b64 %rd<8>; ',10
db ' ',10
db ' ',10
db ' ld.param.u64 %rd1, [_Z3addPiS__param_0]; ',10
db ' ld.param.u64 %rd2, [_Z3addPiS__param_1]; ',10
db ' mov.u32 %r1, %tid.x; ',10
db ' setp.gt.s32 %p1, %r1, 999; ',10
db ' @%p1 bra BB0_2; ',10
db ' ',10
db ' cvta.to.global.u64 %rd3, %rd1; ',10
db ' mul.wide.s32 %rd4, %r1, 4; ',10
db ' add.s64 %rd5, %rd3, %rd4; ',10
db ' ld.global.u32 %r2, [%rd5]; ',10
db ' shl.b32 %r3, %r2, 1; ',10
db ' cvta.to.global.u64 %rd6, %rd2; ',10
db ' add.s64 %rd7, %rd6, %rd4; ',10
db ' st.global.u32 [%rd7], %r3; ',10
db ' ',10
db 'BB0_2: ',10
db ' ret; ',10
db '} ',10
db 0
end
The PTX was obtained from this:
#define N 1000
__global__ void add(int* a, int* b) {
int i = threadIdx.x;
if (i < N) {
b[i] = 2 * a[i];
}
}
To build:
SET MASM64LIB=\masm32\lib64
\masm32\bin\ml64 -c -Zp8 c64.asm
\masm32\bin\link /ENTRY:mulby2 /SUBSYSTEM:console /LIBPATH:%MASM64LIB% /FIXED /MACHINE:X64 c64.obj
I need some kind of manual to deal with the architecture and the API. Here you can download the full documentation in PDF (on the API, on inline functions and more): https://docs.nvidia.com/cuda/
Quote from: Biterider on May 14, 2019, 12:51:12 AM
Hi LiaoMi
I found the same issue while translating the header files. In the next few days, I'll make some additions to h2incX to better convert these files.
Biterider
Hi Biterider,
I fixed the following steps in my own conversion:
1. Make a copy of all files for backup.
2. Replace all types described here (https://en.cppreference.com/w/cpp/language/types) with their fixed-size forms (unsigned long long = __int64, unsigned long = long), everything in the table; there are many of them.
3. Define the type in the converter: CUdeviceptr = CUdeviceptr
4. Delete all parameters in the original headers where CUdeviceptr = DWORD
5. Add the correct macro:
IF DEFINED(_WIN64) OR DEFINED(__LP64__)
CUdeviceptr TYPEDEF QWORD
ELSE
CUdeviceptr TYPEDEF DWORD
ENDIF
Perhaps there are other bottlenecks I did not see. Compiling with my converted file works fine. I have done only one file so far (cuda.inc); the file in the post above was already corrected a couple of days ago.
Quote from: AW on May 15, 2019, 09:19:02 PM
Quote from: LiaoMi on May 14, 2019, 12:45:26 AM
The latest version for testing, reached - [...] cuCtxSynchronize / CUDA_ERROR_LAUNCH_FAILED
I produced a very simple example and faced as well that issue :( . It took me a few hours to realize what might be the problem. I got it fixed by calling the v2 versions. :t
[... full example source and build steps quoted from the post above ...]
Hi AW,
great job! :eusa_clap: :t I was so tortured by this function that I started to suspect errors in my own code. I will change the function and try your example! Thank you!
What is the difference between these functions?
The new CUDA "v2" API has been supported since v4.0 of the CUDA toolkit (i.e. for quite a while; we are now on 7)..... The old API is deprecated by NVidia, and does not support certain useful functionality such as batched matrix-multiply.
From my understanding,
the new CUDA "v2" API is a functional extension of the whole API to a new level. The parameters may therefore differ, which I did not take into account. Some functions are kept to maintain compatibility, and some exist only in the old form. All these parameters should be checked against the toolkit version. ::)
All current documentation in one archive -
CUDA Toolkit v10.1.168_pdf.zip https://www.solidfiles.com/v/5aYWVB7nx3Qq5 (https://www.solidfiles.com/v/5aYWVB7nx3Qq5) 100Mb
I've never seriously considered using CUDA because I thought it was useful for one brand only, Nvidia. Now I am confused by two elements:
1. It seems Nvidia has over 80% of the market (https://wccftech.com/nvidia-amd-discrete-gpu-market-share-q4-2018-report/)
2. There are many manufacturers but they all produce Nvidia cards, apparently: https://graphicscardhub.com/graphics-card-manufacturers-brands/
So what is "Nvidia"? A manufacturer brand, a "graphics OS" of some sorts, a standard?
Quote from: jj2007 on May 15, 2019, 11:52:56 PM
I've never seriously considered using CUDA because I thought it's useful for one brand only, Nvidia. Now I am confused by two elements:
1. It seems Nvidia has over 80% of the market (https://wccftech.com/nvidia-amd-discrete-gpu-market-share-q4-2018-report/)
2. There are many manufacturers but they all produce Nvidia cards, apparently: https://graphicscardhub.com/graphics-card-manufacturers-brands/
So what is "Nvidia"? A manufacturer brand, a "graphics OS" of some sorts, a standard?
Hi jj2007,
for me it is the market leader in GPU (graphics processing unit) production:
Intel = CPU
NVIDIA = GPU. NVIDIA, inventor of the GPU, creates interactive graphics on laptops, workstations, mobile devices, notebooks, PCs, and more.
Cuda Initial release June 23, 2007 (11 years ago)
CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing — an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

NVIDIA major releases and acquisitions:
The release of the RIVA TNT in 1998 solidified Nvidia's reputation for developing capable graphics adapters. In late 1999, Nvidia released the GeForce 256 (NV10), most notably introducing on-board transformation and lighting (T&L) to consumer-level 3D hardware. Running at 120 MHz and featuring four pixel pipelines, it implemented advanced video acceleration, motion compensation and hardware sub-picture alpha blending.

Unification is not required for GPU computations; this is probably one of the reasons why there is no computation standard in graphical interfaces. GPU calculations seem to me more tied to the programming language than to the technical features of parallelization.
OpenCL (https://www.khronos.org/opencl/)
nvidia OpenCL (https://developer.nvidia.com/opencl)
GPU_Caps_Viewer (http://www.ozone3d.net/gpu_caps_viewer/)
Hi LiaoMi,
I noticed that in the #include file they have
#if defined(__CUDA_API_VERSION_INTERNAL) || __CUDA_API_VERSION >= 3020
...
#define cuCtxCreate cuCtxCreate_v2
So I ran cuCtxGetApiVersion and got that the version was 3020.
All explained, so easy when we know the answer. :biggrin:
Another attractive example! :P
Here is a demonstration of Neural Network created in a CUDA GPU using FASM.
This program requires Win64 and an NVIDIA CUDA-enabled graphics card. The code is written specifically for an NVIDIA 620 96-core card with a 1.4 GHz clock rate. As the neural network size was chosen as a multiple of 96, the code may have to be modified for cards with more or fewer cores.
The first part of the code constructs a neural network in PC memory and in GPU memory. You can run it in either to test speed. The second part of the code is the PTX (GPU pseudo-assembly) program thread that is loaded into the NVIDIA GPU as ASCII text and run on continuous 96-core 'blocks' (one thread per core) until the program has finished.
My time results show this 96-core card to run at equivalent speed to a dual-core CPU running at 3GHz and fully using 128-bit SIMD.
To put this another way, if this program was run on the latest 3000 core NVIDIA card it would run 5 times faster than on the latest 16-core AVX (256-bit SIMD) CPU available today.
I was a bit disappointed with CUDA performance; CPU core for GPU core, it should have run up to 5 times faster. However, this is just a first attempt.
It does not work for me ... :(
---------------------------
Error
---------------------------
error code:
hex: 0x000e28c0
dec: 927936
---------------------------
OK
---------------------------
Quote from: AW on May 16, 2019, 03:22:06 AM
Hi LiaoMi,
I noticed that in the #include file they have
#if defined(__CUDA_API_VERSION_INTERNAL) || __CUDA_API_VERSION >= 3020
...
#define cuCtxCreate cuCtxCreate_v2
So I ran cuCtxGetApiVersion and got that the version was 3020.
All explained, so easy when we know the answer. :biggrin:
I need to find the error; somewhere something is wrong. I also get version 3020 from that function. I compared the original binary's API imports: only the old APIs are used there, while in the example (Cuda_UASM_v4.zip) I am translating, all the APIs match your version.
---------------------------
Error
---------------------------
cuCtxSynchronize
CUDA_ERROR_LAUNCH_FAILED
---------------------------
OK
---------------------------
I will try to set the old api and once again check the parameters of the functions.
Quote from: TimoVJL on May 16, 2019, 03:17:55 AM
OpenCL (https://www.khronos.org/opencl/)
nvidia OpenCL (https://developer.nvidia.com/opencl)
GPU_Caps_Viewer (http://www.ozone3d.net/gpu_caps_viewer/)
Hi TimoVJL,
if I understand correctly, the main purpose of OpenCL is to provide a common API for all GPU variants; it is almost like Qt for the world of graphics accelerators :idea:
Quote from: LiaoMi on May 16, 2019, 10:06:57 PM
Quote from: TimoVJL on May 16, 2019, 03:17:55 AM
OpenCL (https://www.khronos.org/opencl/)
nvidia OpenCL (https://developer.nvidia.com/opencl)
GPU_Caps_Viewer (http://www.ozone3d.net/gpu_caps_viewer/)
Hi TimoVJL,
if I understand correctly, the main purpose of OpenCL is to provide a common API for all GPU variants; it is almost like Qt for the world of graphics accelerators :idea:
Even with Android smart phones ;)
CudaVsOpenCL (https://wiki.tiker.net/CudaVsOpenCL)
Quote from: TimoVJL on May 16, 2019, 03:17:55 AM
GPU_Caps_Viewer (http://www.ozone3d.net/gpu_caps_viewer/)
:t
Cuda is also on Android with the most recent (NVidia) Tegra models.
And you can start developing right now (https://docs.nvidia.com/gameworks/content/developertools/mobile/getting_started.htm), everything is organized to make the journey as pleasant as possible. :t
Is OpenCL disappearing from the map? (https://community.amd.com/thread/232600) :shock:
Quite a short list:
https://www.nvidia.in/object/tegra-phones-tablets-in.html
They are mostly into chips, not end user products. (https://en.wikipedia.org/wiki/Tegra#Tegra_3)
Like this one: https://www.nvidia.in/object/tegra-x1-processor-in.html
Everything works :biggrin:, here is the final example translated from FASM ... The mistake was that the new APIs were being used. After I set the correct parameters in the header file cuda.inc:
CUDA_FORCE_API_VERSION EQU 3010
CUdeviceptr typedef QWORD
_WIN64 equ 1
the example ran on the GPU and all functions worked correctly. There is some dependency between the PTX code and the API version. It is time to start on the neural network example :eusa_dance:.
Here is another interesting topic related to graphics - Single-threaded CUDA OpenGL Interop https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st (https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st)
@LiaoMi
Did you include the translated from FASM source code?
BTW, have you ever tried to use cubin or fatbin instead of PTX? It will save the compilation time, right?
Quote from: AW on May 17, 2019, 05:25:09 AM
@LiaoMi
Did you include the translated from FASM source code?
BTW, have you ever tried to use cubin or fatbin instead of PTX? It will save the compilation time, right?
@AW
Please accept my apologies, I did not attach the most important thing :icon_eek:. The source is now in place, please download again. You can change the paths of the system libraries and compile with UASM. Unfortunately Cuda.inc will only work with UASM; you can use your batch file, and the inc folder also contains the translate64.inc file.
Quotehave you ever tried to use cubin ot fatbin instead of PTX
I thought about it a little but did not study it more deeply, because I don't see any advantage in it; going by the documentation, working with PTX gives more possibilities and more flexibility.
This was written by other users https://devtalk.nvidia.com/default/topic/504259/cubin-vs-ptx/ (https://devtalk.nvidia.com/default/topic/504259/cubin-vs-ptx/)
.fatbin is the same story; I think it will only be useful when strict code optimization is needed for a specific type of video card. Maybe I'm wrong, or maybe not :eusa_boohoo:
QuoteIt will save the compilation time, right?
Anyway, in binary form, the execution time should be optimal :icon_exclaim:
Useful link
CUDA Binary Utilities cuobjdump
cuobjdump extracts information from CUDA binary files (both standalone and those embedded in host binaries) and presents them in human readable format. The output of cuobjdump includes CUDA assembly code for each kernel, CUDA ELF section headers, string tables, relocators and other CUDA specific sections. It also extracts embedded ptx text from host binaries. https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump)
$ cuobjdump a.out -ptx -sass
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
identifier = add.cu
code for sm_20
Function : _Z3addPiS_S_
.headerflags @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV R6, c[0x0][0x20]; /* 0x2800400080019de4 */
/*0010*/ MOV R7, c[0x0][0x24]; /* 0x280040009001dde4 */
/*0018*/ MOV R2, c[0x0][0x28]; /* 0x28004000a0009de4 */
/*0020*/ MOV R3, c[0x0][0x2c]; /* 0x28004000b000dde4 */
/*0028*/ LDU.E R0, [R6]; /* 0x8c00000000601c85 */
/*0030*/ MOV R4, c[0x0][0x30]; /* 0x28004000c0011de4 */
/*0038*/ LDU.E R2, [R2]; /* 0x8c00000000209c85 */
/*0040*/ MOV R5, c[0x0][0x34]; /* 0x28004000d0015de4 */
/*0048*/ IADD R0, R2, R0; /* 0x4800000000201c03 */
/*0050*/ ST.E [R4], R0; /* 0x9400000000401c85 */
/*0058*/ EXIT; /* 0x8000000000001de7 */
.............................
Fatbin ptx code:
================
arch = sm_20
code version = [4,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
identifier = add.cu
.version 4.0
.target sm_20
.address_size 64
.visible .entry _Z3addPiS_S_(
.param .u64 _Z3addPiS_S__param_0,
.param .u64 _Z3addPiS_S__param_1,
.param .u64 _Z3addPiS_S__param_2
)
{
.reg .s32 %r<4>;
.reg .s64 %rd<7>;
ld.param.u64 %rd1, [_Z3addPiS_S__param_0];
ld.param.u64 %rd2, [_Z3addPiS_S__param_1];
ld.param.u64 %rd3, [_Z3addPiS_S__param_2];
cvta.to.global.u64 %rd4, %rd3;
cvta.to.global.u64 %rd5, %rd2;
cvta.to.global.u64 %rd6, %rd1;
ldu.global.u32 %r1, [%rd6];
ldu.global.u32 %r2, [%rd5];
add.s32 %r3, %r2, %r1;
st.global.u32 [%rd4], %r3;
ret;
}
I get an error when running your program on an old computer with a GeForce GT 640
(https://www.dropbox.com/s/qg14qieqvglnu4i/cuda1.jpg?dl=1)
I can't run my queens program either, but it may be because I can't compile PTX to versions below sm_30 with my nvcc.exe, and possibly the GeForce GT 640 requires something below that. The error I get with my proggy is CUDA_ERROR_NO_BINARY_FOR_GPU (error 209). What tool do you use to compile to sm_13? This appears way too old.
sm_13 was already there in the example :biggrin:
CUDA Toolkit v10.1.168_pdf -> CUDA_C_Programming_Guide
3.1.2. Binary Compatibility
Binary code is architecture-specific. A cubin object is generated using the compiler
option -code that specifies the targeted architecture: For example, compiling with
-code=sm_35 produces binary code for devices of compute capability 3.5. Binary
compatibility is guaranteed from one minor revision to the next one, but not from one
minor revision to the previous one or across major revisions. In other words, a cubin
object generated for compute capability X.y will only execute on devices of compute
capability X.z where z≥y.
3.1.3. PTX Compatibility
Some PTX instructions are only supported on devices of higher compute capabilities.
For example, Warp Shuffle Functions are only supported on devices of compute
capability 3.0 and above. The -arch compiler option specifies the compute capability
that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for
example, must be compiled with -arch=compute_30 (or higher).
PTX code produced for some specific compute capability can always be compiled to
binary code of greater or equal compute capability. Note that a binary compiled from an
earlier PTX version may not make use of some hardware features. For example, a binary
targeting devices of compute capability 7.0 (Volta) compiled from PTX generated for
compute capability 6.0 (Pascal) will not make use of Tensor Core instructions, since these
were not available on Pascal. As a result, the final binary may perform worse than would
be possible if the binary were generated using the latest version of PTX.
3.1.4. Application Compatibility
To execute code on devices of specific compute capability, an application must load
binary or PTX code that is compatible with this compute capability as described in
Binary Compatibility and PTX Compatibility. In particular, to be able to execute code
on future architectures with higher compute capability (for which no binary code can be
generated yet), an application must load PTX code that will be just-in-time compiled for
these devices (see Just-in-Time Compilation).
Which PTX and binary code gets embedded in a CUDA C application is controlled by
the -arch and -code compiler options or the -gencode compiler option as detailed in
the nvcc user manual. For example,
nvcc x.cu
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_60,code=\'compute_60,sm_60\'
embeds binary code compatible with compute capability 3.5 and 5.0 (first and second
-gencode options) and PTX and binary code compatible with compute capability 6.0
(third -gencode option).
Host code is generated to automatically select at runtime the most appropriate code to
load and execute, which, in the above example, will be:
‣ 3.5 binary code for devices with compute capability 3.5 and 3.7,
‣ 5.0 binary code for devices with compute capability 5.0 and 5.2,
‣ 6.0 binary code for devices with compute capability 6.0 and 6.1,
‣ PTX code which is compiled to binary code at runtime for devices with compute
capability 7.0 and higher.
x.cu can have an optimized code path that uses warp shuffle operations, for example,
which are only supported in devices of compute capability 3.0 and higher. The
__CUDA_ARCH__ macro can be used to differentiate various code paths based on
compute capability. It is only defined for device code. When compiling with arch=compute_35 for example,
__CUDA_ARCH__ is equal to 350.
Applications using the driver API must compile code to separate files and explicitly load
and execute the most appropriate file at runtime.
The Volta architecture introduces Independent Thread Scheduling which changes the
way threads are scheduled on the GPU. For code relying on specific behavior of SIMT
scheduling in previous architectures, Independent Thread Scheduling may alter the set of
participating threads, leading to incorrect results. To aid migration while implementing
the corrective actions detailed in Independent Thread Scheduling, Volta developers
can opt-in to Pascal's thread scheduling with the compiler option combination arch=compute_60
-code=sm_70.
The nvcc user manual lists various shorthand for the -arch, -code, and -gencode
compiler options. For example, -arch=sm_35 is a shorthand for -arch=compute_35 code=compute_35,sm_35 (which is the same as
-gencode
arch=compute_35,code=\'compute_35,sm_35\').
Quote from: AW on May 17, 2019, 07:07:07 AM
I get an error when running your program on an old computer with a GeForce GT 640
(https://www.dropbox.com/s/qg14qieqvglnu4i/cuda1.jpg?dl=1)
I can't run my queens program either, but it may be because I can't compile PTX to versions below sm_30 with my nvcc.exe, and possibly the GeForce GT 640 requires something below that. The error I get with my proggy is CUDA_ERROR_NO_BINARY_FOR_GPU (error 209). What tool do you use to compile to sm_13? This appears way too old.
There are differences in working with memory, so this is a normal phenomenon, I can't answer the question at the moment, I can only guess :P
Matching SM architectures (CUDA arch and CUDA gencode) for various NVIDIA cards https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ (https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/)
I've seen some confusion regarding NVIDIA's nvcc sm flags and what they're used for:
When compiling with NVCC, the arch flag ('-arch') specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
Gencodes ('-gencode') allow for more PTX generations and can be repeated many times for different architectures.

Here is an example for cubin :t ...
/// Init cuda
SAFE_CALL(cuInit(0));
/// Get device
SAFE_CALL(cuDeviceGet(&dev, 0));
/// Create context
SAFE_CALL(cuCtxCreate(&ctx, 0, dev));
SAFE_CALL(cuModuleLoad(&m, "test.cubin"));
CUfunction hfunc;
SAFE_CALL(cuModuleGetFunction(&hfunc, m, "my_kernel"));
P.S>
SM stands for Streaming Multiprocessor and the number indicates the features supported by the architecture.
The real problem is the paranoia of forced obsolescence we are witnessing these days. From VS I can't even build for sm_30; the minimum is sm_35.
If I can't solve the issues, I will have to install an old NVidia SDK version.
Later:
I could run my little programs on the computer with the NVidia GE FORCE GT 640 after installing the Nvidia SDK 10.1 on that computer.
So it was a problem of outdated drivers.
I still can't run your mywindows1.exe program. I believe the problem might be related to your usage of the /LARGEADDRESSAWARE:NO switch, but I can't build your sample straight away to confirm.
Quote from: AW on May 17, 2019, 03:35:35 PM
The real problem is the paranoia of forced obsolescence we are witnessing these days. From VS I can't even build for sm_30; the minimum is sm_35.
If I can't solve the issues, I will have to install an old NVidia SDK version.
Later:
I could run my little programs on the computer with the NVidia GE FORCE GT 640 after installing the Nvidia SDK 10.1 on that computer.
So it was a problem of outdated drivers.
I still can't run your mywindows1.exe program. I believe the problem might be related to your usage of the /LARGEADDRESSAWARE:NO switch, but I can't build your sample straight away to confirm.
I have problems installing Cuda 10.1 in visual studio 2019
https://devtalk.nvidia.com/default/topic/1049613/nsight-visual-studio-edition/visual-studio-2019-and-cuda-10-1/1 (https://devtalk.nvidia.com/default/topic/1049613/nsight-visual-studio-edition/visual-studio-2019-and-cuda-10-1/1)
I don't even talk about old sdk in my case ...
My proposal is to update the CudaPad https://github.com/SunsetQuest/CudaPAD (https://github.com/SunsetQuest/CudaPAD) project for our needs. Add more options, add portable versions of Cuda SDK :idea: There is already SM version selection in the project.
Quote from: AW on May 17, 2019, 03:35:35 PM
I still can't run your mywindows1.exe program. I believe the problem might be related to your usage of the /LARGEADDRESSAWARE:NO switch, but I can't build your sample straight away to confirm.
Starting pass 2
mywindow1.obj
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'NumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'LogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'NumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'CudaLogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'LogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err000' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err100' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err200' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err300' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err400' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err500' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err600' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err700' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err999' invalid without /LARGEADDRESSAWARE:NO
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
cuda.lib(nvcuda.dll)
msvcrt.lib(msvcrt.dll)
msvcrt.lib(msvcrt.dll)
msvcrt.lib(msvcrt.dll)
Kernel32.lib(KERNEL32.dll)
Kernel32.lib(KERNEL32.dll)
Kernel32.lib(KERNEL32.dll)
User32.lib(USER32.dll)
User32.lib(USER32.dll)
User32.lib(USER32.dll)
LINK : fatal error LNK1165: link failed because of fixup errors
The wording of the linker error message is a bit misleading. In fact you are using an addressing mode
that uses a 32 bit offset, which results in a 32 bit address in the symbol table.
Here is a description of my problem:
MODULE1.ASM
...
VAR1 DB 256 dup (0)
...
MODULE2.ASM
...
extrn VAR1:BYTE
...
MOV al, byte ptr[VAR1 + rbx]
In 64 bit mode only very few addressing modes can handle absolute 64 bit offsets.
Most complex addressing modes are restricted to 32 bit offsets (like the [mem + reg]
used above).
So the assembler generates an entry in the symbol table that states VAR1 as a
32bit offset for MODULE2.
The symbol table of MODULE1 only has an entry that states VAR1 as a 64bit offset.
In this case the linker complains that symbol VAR1 has different address sizes.
To avoid this a different addressing mode must be used - something like this:
LEA rsi, VAR1
MOV al, byte ptr[rsi + rbx]
I have problems with VS 2019 as well in CUDA so I am using VS 2017.
CudaPad is fine but it works with nvcc.exe, so can't produce miracles. If I select sm_23 it reports: nvcc fatal : Value 'sm_23' is not defined for option 'gpu-architecture' :(
I never used /LARGEADDRESSAWARE:NO because it is a kludge for transitioning from win32 to win64.
If you can't fix it I will have a look at your program during the week end and with some luck I will try to build it for /LARGEADDRESSAWARE:YES.
nvopencc.exe works in Windows 7 if downloaded from CUDA Toolkit 3.2 (https://developer.nvidia.com/cuda-toolkit-32-downloads), but the one from CUDA Toolkit 4.2.9 just crashes.
That toolkit is a .msi file, so TLMsiListEx (http://masm32.com/board/index.php?topic=7435.msg81276#msg81276) can be useful.
nvcc error : 'cudafe++' died with status 0xC0000374 :( RIP :P
Quote from: AW on May 17, 2019, 06:54:44 PM
I have problems with VS 2019 as well in CUDA so I am using VS 2017.
CudaPad is fine but it works with nvcc.exe, so can't produce miracles. If I select sm_23 it reports: nvcc fatal : Value 'sm_23' is not defined for option 'gpu-architecture' :(
I never used /LARGEADDRESSAWARE:NO because it is a kludge for transitioning from win32 to win64.
If you can't fix it I will have a look at your program during the week end and with some luck I will try to build it for /LARGEADDRESSAWARE:YES.
Without using /LARGEADDRESSAWARE:NO :icon14:
That was the problem, it is fixed. :t
(https://www.dropbox.com/s/6dg0kcjjt3o9tq6/cuda2.jpg?dl=1)
@LiaoMi ... have you found any one of the linux flavours to be superior than others for this kind of thing .. was thinking of trying CentOS 7
regards mikeb
ps i worked on a stock control system many years ago where i used distributions to assess the new weighting rather than the linear moving weight scheme you have in your neural system . If you do move to a similar scheme then i advise you to use a skewed distribution such as Student's T, as very rarely in nature is any distribution symmetric, a failing which is sadly almost always overlooked .. i include in this the FFT methods now prevalent for ascertaining independence of variables as these are highly symmetricised methods [ see the many excellent examples and links provided by Siekmanski ]
A small example in C for testing an old card like the nVidia G210:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>   /* printf is used below */
//#include "cuda.h"
#pragma comment(lib, "cuda.lib")
#pragma comment(lib, "msvcrt.lib")
#define CUDAAPI __stdcall
typedef int CUdevice;
typedef struct CUctx_st *CUcontext;
typedef struct CUmod_st *CUmodule;
typedef enum cudaError_enum {
CUDA_SUCCESS = 0,
CUDA_ERROR_INVALID_VALUE = 1,
CUDA_ERROR_OUT_OF_MEMORY = 2,
CUDA_ERROR_NOT_INITIALIZED = 3,
CUDA_ERROR_DEINITIALIZED = 4,
CUDA_ERROR_NO_DEVICE = 100,
CUDA_ERROR_INVALID_DEVICE = 101,
CUDA_ERROR_INVALID_IMAGE = 200,
CUDA_ERROR_INVALID_CONTEXT = 201,
CUDA_ERROR_CONTEXT_ALREADY_CURRENT = 202,
CUDA_ERROR_MAP_FAILED = 205,
CUDA_ERROR_UNMAP_FAILED = 206,
CUDA_ERROR_ARRAY_IS_MAPPED = 207,
CUDA_ERROR_ALREADY_MAPPED = 208,
CUDA_ERROR_NO_BINARY_FOR_GPU = 209,
CUDA_ERROR_ALREADY_ACQUIRED = 210,
CUDA_ERROR_NOT_MAPPED = 211,
CUDA_ERROR_NOT_MAPPED_AS_ARRAY = 212,
CUDA_ERROR_NOT_MAPPED_AS_POINTER = 213,
CUDA_ERROR_ECC_UNCORRECTABLE = 214,
CUDA_ERROR_UNSUPPORTED_LIMIT = 215,
CUDA_ERROR_INVALID_SOURCE = 300,
CUDA_ERROR_FILE_NOT_FOUND = 301,
CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND = 302,
CUDA_ERROR_SHARED_OBJECT_INIT_FAILED = 303,
CUDA_ERROR_OPERATING_SYSTEM = 304,
CUDA_ERROR_INVALID_HANDLE = 400,
CUDA_ERROR_NOT_FOUND = 500,
CUDA_ERROR_NOT_READY = 600,
CUDA_ERROR_LAUNCH_FAILED = 700,
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES = 701,
CUDA_ERROR_LAUNCH_TIMEOUT = 702,
CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING = 703,
CUDA_ERROR_UNKNOWN = 999
} CUresult;
CUresult CUDAAPI cuDriverGetVersion(int *driverVersion);
CUresult CUDAAPI cuInit(unsigned int Flags);
CUresult CUDAAPI cuDeviceGetCount(int *count);
CUresult CUDAAPI cuDeviceGet(CUdevice *device, int ordinal);
CUresult CUDAAPI cuCtxCreate(CUcontext *pctx, unsigned int flags, CUdevice dev);
CUresult CUDAAPI cuCtxGetApiVersion(CUcontext ctx, unsigned int *version);
CUresult CUDAAPI cuDeviceComputeCapability(int *major, int *minor, CUdevice dev);
CUresult CUDAAPI cuModuleLoad(CUmodule *module, const char *fname);
CUresult CUDAAPI cuCtxDetach(CUcontext ctx);
CUresult CUDAAPI cuGetErrorString(CUresult error, const char **pStr);
void __cdecl mainCRTStartup(void)
{
int deviceCount = 0;
int iDrvVer, iApiVer;
CUdevice dev;
CUcontext ctx;
CUmodule mod;
CUresult err;
int iMajor, iMinor;
cuDriverGetVersion(&iDrvVer);
printf("Driver Version: %d.%d\n", iDrvVer / 1000, (iDrvVer % 1000) / 10); /* e.g. 10010 -> 10.1 */
if (!(err = cuInit(0))) {
err = cuDeviceGetCount(&deviceCount);
err = cuDeviceGet(&dev, 0);
err = cuCtxCreate(&ctx, 0, dev);
err = cuCtxGetApiVersion(ctx, &iApiVer);
printf("API Version: %d.%d\n", iApiVer / 1000, (iApiVer % 1000) / 10); /* e.g. 3020 -> 3.2 */
err = cuDeviceComputeCapability(&iMajor, &iMinor, dev);
printf("DeviceComputeCabability: %d.%d\n", iMajor, iMinor);
//err = cuModuleLoadData(&mod, PTXSourceData);
err = cuModuleLoad(&mod, "Test.ptx");
if (!err) {
printf("Test.ptx loaded\n");
} else {
const char *perr;  /* cuGetErrorString takes a const char ** */
cuGetErrorString(err, &perr);
printf("error loading ptx: %u %s\n", err, perr);
}
err = cuCtxDetach(ctx);
}
ExitProcess(0);
}
Test.ptx
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with open64/lib//be.exe
// nvopencc 3.2 built on 2010-11-04
.entry _Z3addPiS_ (
.param .u64 __cudaparm__Z3addPiS__a,
.param .u64 __cudaparm__Z3addPiS__b)
{
.reg .u32 %r<6>;
.reg .u64 %rd<8>;
.reg .pred %p<3>;
.loc 15 4 0
$LDWbegin__Z3addPiS_:
cvt.s32.u16 %r1, %tid.x;
mov.u32 %r2, 999;
setp.gt.s32 %p1, %r1, %r2;
@%p1 bra $Lt_0_1026;
.loc 15 7 0
cvt.s64.s32 %rd1, %r1;
mul.wide.s32 %rd2, %r1, 4;
ld.param.u64 %rd3, [__cudaparm__Z3addPiS__a];
add.u64 %rd4, %rd3, %rd2;
ld.global.s32 %r3, [%rd4+0];
mul.lo.s32 %r4, %r3, 2;
ld.param.u64 %rd5, [__cudaparm__Z3addPiS__b];
add.u64 %rd6, %rd5, %rd2;
st.global.s32 [%rd6+0], %r4;
$Lt_0_1026:
.loc 15 9 0
exit;
$LDWend__Z3addPiS_:
} // _Z3addPiS_
It just tells you that it loaded that shit ;)
PS: it took a while to get the G210 working with Windows 8.1, just a nightmare, as an old nVidia card and Windows 8.1 have some compatibility problems.
(MS: women regardless of marital status)
EDIT: how to create that kernel code: Add.i
typedef struct uint3
{
unsigned int x, y, z;
}uint3;
uint3 extern const threadIdx;
//__global__
__attribute__((global)) __attribute__((__used__))
void add(int* a, int* b) {
int i = threadIdx.x;
if (i < 1000) {
b[i] = 2 * a[i];
}
}
open64\bin\nvopencc -TARG:compute_10 -m64 -OPT:ftz=1 -CG:ftz=1 -CG:prec_div=0 -CG:prec_sqrt=0 "Add" "Add.i" -o "Add.ptx"
for CUDA Toolkit 7:
>compiler\nvvm\bin\cicc -arch compute_30 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 -nvvmir-library "compiler\bin/../nvvm/libdevice/libdevice.compute_30.10.bc" --orig_src_file_name "Add.cu" "Add.i" -o "Add.ptx"
EDIT: copy driver / global code part from .cu to .ci file and add a header-file for it
Minimal header for .ci, ci_include.h:
typedef struct _uint3
{
unsigned int x, y, z;
}uint3;
typedef struct _dim3
{
unsigned int x, y, z;
}dim3;
extern const uint3 threadIdx;
extern const uint3 blockIdx;
extern const dim3 blockDim;
extern const dim3 gridDim;
extern const int warpSize;
#define __device__ __attribute__((device))
#define __global__ __attribute__((global))
#define __shared__ __attribute__((shared))
#define bool _Bool
clang (https://www.llvm.org/docs/CompileCudaWithLLVM.html) backend support CUDA
EDIT: AMD clang (https://github.com/GPUOpen-Tools/RGA/tree/master/Core/ROCm/OpenCL/win64/bin)
clang.exe -c Add.cu -nocudainc -nocudalib
Quote from: mikeburr on May 17, 2019, 10:18:48 PM
@LiaoMi ... have you found any one of the linux flavours to be superior than others for this kind of thing .. was thinking of trying CentOS 7
regards mikeb
ps i worked on a stock control system many years ago where i used distributions to assess the new weighting rather than the linear moving weight scheme you have in your neural system . If you do move to a similar scheme then i advise you to use a skewed distribution such as Students T as very rarely in nature is any distribution symmetric a failing which is sadly almost always overlooked .. i include in this the FFT methods now prevalent for ascertaining independance of variables as these are highly symmetricised methods [ see the many excellent examples and links provided by Siekmanski ]
Hi mikeburr,
I cannot say exactly ... Was the question about GPUs or neural networks? The assembly language example is not mine; the author is no longer active and there are flaws in the example. I hope we'll be able to add improvements, so tips on the algorithm are welcome.
From my point of view, only two factors matter when choosing the system environment: having the right tools and the usability of the data. Therefore, for neural networks I would choose Java or dotnet; in both languages there are hard points in working with the GPU, but they can be solved. The more standard the OS, the better: Java + Linux = Dotnet + Windows, or as an option Dotnet + Linux = Java + Windows. Why not assembler? Because of the amount of data, the object-oriented processing methods and the ability to use ready-made components. Anyway, examples written in assembly language are attractive for stable processes and stable models that don't require further intervention.
Perhaps it would be more logical to find a suitable GitHub project where the most interesting neural network model for the GPU has already been chosen. After all, the main discussion is about the computational power of GPUs and algorithms for assembler. It makes no sense to copy all the complexities of an object-oriented language :P
I don't have Linux at the moment, but the previous example would need to be adapted for Linux ...
How to install NVIDIA CUDA Toolkit on CentOS 7 Linux - https://linuxconfig.org/how-to-install-nvidia-cuda-toolkit-on-centos-7-linux (https://linuxconfig.org/how-to-install-nvidia-cuda-toolkit-on-centos-7-linux)
This is a simplified version of the log 2 example produced by LiaoMi, using 3 APIs:
- Runtime API
- Driver API
- Driver API in MASM
Common C/C++
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#define N 32
extern "C" __global__ void log_2(float* inout, int length) {
int xind = blockIdx.x * blockDim.x + threadIdx.x;
if (xind < length) {
float tempFloat = log2(*(inout + xind));
*(inout + xind) = tempFloat;
}
}
//#define RUNTIMEAPI 1
//#define DRIVERAPI 1
#define MASMAPI 1
extern "C" int logasm2();
extern "C" char PTXFunction;
extern "C" char PTXSourceData;
#if defined(RUNTIMEAPI)
void testruntimeAPi()
{
float hinOut[N];
float* dinOut;
cudaMalloc((void**)& dinOut, N * sizeof(float));
for (int i = 1; i <= N; ++i) {
hinOut[i-1] = (float)i;
}
cudaMemcpy(dinOut, hinOut, N * sizeof(float), cudaMemcpyHostToDevice);
log_2 <<<N, 1 >>> (dinOut, N);
cudaMemcpy(hinOut, dinOut, N * sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < N; ++i) {
printf("%d %f\n", i+1, hinOut[i]);
}
cudaFree(dinOut);
getchar();
}
#endif
#if defined(DRIVERAPI)
int testDriverApi()
{
CUdevice cuDevice;
CUcontext cuContext;
CUmodule cuModule;
CUfunction cudaFunction;
CUresult result;
CUdeviceptr dinOut;
int count;
float hinOut[N];
unsigned int memSize = sizeof(float) * N;
if (cuInit(0))
return 1;
if (cuDeviceGet(&cuDevice, 0))
return 1;
if (cuCtxCreate(&cuContext, 0, cuDevice))
return 1;
if (cuModuleLoadData(&cuModule, &PTXSourceData))
return 1;
if (cuModuleGetFunction(&cudaFunction, cuModule, &PTXFunction))
return 1;
for (int i = 1; i <= N; ++i) {
hinOut[i - 1] = (float)i;
}
cuMemAlloc(&dinOut, memSize);
cuMemcpyHtoD(dinOut, hinOut, memSize);
count = N;
void *args[] = { &dinOut, &count };
result = cuLaunchKernel(cudaFunction, N, 1, 1, 1, 1, 1, 0, 0, &args[0], 0);
if (result!=CUDA_SUCCESS)
return 1;
cuCtxSynchronize();
cuMemcpyDtoH(hinOut, dinOut, memSize);
for (int i = 0; i < N; ++i) {
printf("%d %f\n", i + 1, hinOut[i]);
}
cuMemFree(dinOut);
getchar();
return 0;
}
#endif
#if defined(MASMAPI)
void testMasmAPI()
{
int err=logasm2();
printf("Error %d (0=NO ERROR)",err);
getchar();
return;
}
#endif
int main()
{
#if RUNTIMEAPI
testruntimeAPi();
#elif DRIVERAPI
testDriverApi();
#elif MASMAPI
testMasmAPI();
#endif
return 0;
}
MASM:
OPTION casemap:none
N equ 32
public PTXFunction
public PTXSourceData
printf proto :ptr, :vararg
cuInit proto :dword
cuDeviceGet proto :ptr, :dword
cuCtxCreate_v2 proto :ptr, :dword, :dword
cuModuleLoadData proto :ptr, :ptr
cuModuleGetFunction proto :ptr, :ptr, :ptr
cuMemAlloc_v2 proto :ptr, :qword
cuMemcpyHtoD_v2 proto :ptr, :ptr, :qword
cuMemcpyDtoH_v2 proto :ptr, :ptr, :qword
cuLaunchKernel proto :ptr, :dword, :dword, :dword, :dword, :dword, :dword, :ptr, :ptr, :ptr
cuCtxSynchronize proto
cuMemFree_v2 proto :ptr
cuCtxGetApiVersion proto :ptr, :ptr
.code
logasm2 proc
LOCAL cuDevice:dword
LOCAL cuContext:ptr
LOCAL cuModule:ptr
LOCAL cudaFunction:ptr
LOCAL dInOut: qword
LOCAL hInOut[N]:real4
LOCAL memSize : qword
LOCAL args[2] : ptr
LOCAL qt : dword
sub rsp, 58h
and rsp, -16 ; align
mov eax, sizeof REAL4 * N
mov memSize, rax
mov rcx,0
call cuInit
cmp eax, 0
jnz @exit
lea rcx, cuDevice
mov edx,0
call cuDeviceGet
cmp eax, 0
jnz @exit
lea rcx, cuContext
mov edx, 0
mov r8d, cuDevice
call cuCtxCreate_v2
cmp eax, 0
jnz @exit
lea rcx, cuModule
mov rdx, offset PTXSourceData
call cuModuleLoadData
cmp eax, 0
jnz @exit
lea rcx, cudaFunction
mov rdx, cuModule
lea r8, PTXFunction
call cuModuleGetFunction
cmp eax, 0
jnz @exit
mov ecx, N
lea r11, hInOut
fld1
fld1
@@:
fst dword ptr [r11]
fadd st,st(1)
add r11, 4
dec ecx
jnz @B
lea rcx, dInOut
mov rdx, memSize
call cuMemAlloc_v2
cmp eax, 0
jnz @exit
mov rcx, dInOut
lea rdx, hInOut
mov r8, memSize
call cuMemcpyHtoD_v2
cmp eax, 0
jnz @exit
lea rax, dInOut
mov qword ptr args, rax
mov qt, N
lea rax, qt
mov qword ptr args+8, rax
mov rcx, cudaFunction
mov edx ,1
mov r8d, 1
mov r9d, 1
mov dword ptr [rsp+20h], N
mov dword ptr [rsp+28h], 1
mov dword ptr [rsp+30h], 1
mov dword ptr [rsp+38h], 0
mov qword ptr [rsp+40h], 0
lea rax, args
mov qword ptr [rsp+48h], rax
mov qword ptr [rsp+50h], 0
call cuLaunchKernel
cmp eax, 0
jnz @exit
call cuCtxSynchronize
cmp eax, 0
jnz @exit
lea rcx, hInOut
mov rdx, dInOut
mov r8, memSize
call cuMemcpyDtoH_v2
cmp eax, 0
jnz @exit
mov ebx, 1
lea rdi, hInOut
@@:
mov rcx, offset report
mov edx, ebx
fld dword ptr [rdi]
fstp qword ptr [rsp+20h]
mov r8, qword ptr [rsp+20h]
call printf
add rdi, 4
inc ebx
cmp ebx, N
jle @B
mov rcx, dInOut
call cuMemFree_v2
mov eax,0
@exit:
ret
logasm2 endp
.data
report db "%d %f",10,0
; Generated by NVIDIA NVVM Compiler
; Compiler Build ID: CL-25769353
; Cuda compilation tools, release 10.1, V10.1.105
; Based on LLVM 3.4svn
PTXFunction db 'log_2',0
PTXSourceData \
db '.version 6.4 ',10
db '.target sm_30 ',10
db '.address_size 64 ',10
db ' ',10
db ' // .globl log_2 ',10
db ' ',10
db '.visible .entry log_2( ',10
db ' .param .u64 log_2_param_0, ',10
db ' .param .u32 log_2_param_1 ',10
db ') ',10
db '{ ',10
db ' .reg .pred %p<2>; ',10
db ' .reg .f32 %f<3>; ',10
db ' .reg .b32 %r<6>; ',10
db ' .reg .b64 %rd<5>; ',10
db ' ',10
db ' ',10
db ' ld.param.u64 %rd1, [log_2_param_0]; ',10
db ' ld.param.u32 %r2, [log_2_param_1]; ',10
db ' mov.u32 %r3, %ctaid.x; ',10
db ' mov.u32 %r4, %ntid.x; ',10
db ' mov.u32 %r5, %tid.x; ',10
db ' mad.lo.s32 %r1, %r4, %r3, %r5; ',10
db ' setp.ge.s32 %p1, %r1, %r2; ',10
db ' @%p1 bra BB0_2; ',10
db ' ',10
db ' cvta.to.global.u64 %rd2, %rd1; ',10
db ' mul.wide.s32 %rd3, %r1, 4; ',10
db ' add.s64 %rd4, %rd2, %rd3; ',10
db ' ld.global.f32 %f1, [%rd4]; ',10
db ' lg2.approx.ftz.f32 %f2, %f1; ',10
db ' st.global.f32 [%rd4], %f2; ',10
db ' ',10
db 'BB0_2: ',10
db ' ret; ',10
db '} ',10
db 0
end
Output:
1 0.000000
2 1.000000
3 1.584962
4 2.000000
5 2.321928
6 2.584962
7 2.807355
8 3.000000
9 3.169925
10 3.321928
11 3.459432
12 3.584962
13 3.700440
14 3.807355
15 3.906890
16 4.000000
17 4.087463
18 4.169925
19 4.247927
20 4.321928
21 4.392317
22 4.459432
23 4.523562
24 4.584962
25 4.643856
26 4.700439
27 4.754887
28 4.807355
29 4.857981
30 4.906890
31 4.954196
32 5.000000
Error 0 (0=NO ERROR)
clang (https://www.llvm.org/docs/CompileCudaWithLLVM.html) backend supports CUDA :thumbsup:
So for textual .ptx, only ptxas / fatbinary stubs are needed.
a minimal ci_include.h:
typedef struct _uint3
{
unsigned int x, y, z;
}uint3;
typedef struct _dim3
{
unsigned int x, y, z;
}dim3;
extern const uint3 threadIdx;
extern const uint3 blockIdx;
extern const dim3 blockDim;
extern const dim3 gridDim;
extern const int warpSize;
#define __device__ __attribute__((device))
#define __global__ __attribute__((global))
#define __shared__ __attribute__((shared))
#define bool _Bool
a fake ptxas that just copies the temporary .s file:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <string.h> /* strcpy */
int __cdecl main(int argc, char **argv)
{
char fname[260];
char *ptx = argv[argc-1];
while(*ptx) ptx++; // find end
char *pend = ptx;
while(*pend != '-') pend--;
ptx = pend;
while(*ptx != '\\') ptx--;
ptx++;
strcpy(fname, ptx);
char *pext = fname + (pend - ptx);
*(long*)pext = *(long*)".ptx"; // add ext
*(pext+4) = 0; // cut name
return !CopyFile(argv[argc-1], fname, FALSE);
}
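The stub's path handling can be checked without Windows; this is a portable sketch of the same pointer arithmetic (`make_ptx_name` is an illustrative name, and like the original it assumes the path contains both a '-' and a backslash):

```c
#include <string.h>

/* Extract "name.ptx" from a path like "C:\tmp\name-<suffix>.s",
   mirroring the pointer arithmetic in the fake ptxas above. */
void make_ptx_name(const char *path, char *fname)
{
    const char *pend = path + strlen(path);
    while (*pend != '-') pend--;          /* last '-' in the name */
    const char *ptx = pend;
    while (*ptx != '\\') ptx--;           /* last backslash       */
    ptx++;
    size_t base = (size_t)(pend - ptx);
    memcpy(fname, ptx, base);             /* base name            */
    memcpy(fname + base, ".ptx", 5);      /* extension + NUL      */
}
```

For example, make_ptx_name("C:\\tmp\\add-sm_30.s", buf) yields "add.ptx".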
a commandline used in tests:
clang.exe -c add.cu -nocudainc -nocudalib --cuda-gpu-arch=sm_30
Quote from: TimoVJL on May 24, 2019, 07:50:02 PM
clang (https://www.llvm.org/docs/CompileCudaWithLLVM.html) backend supports CUDA :thumbsup:
Hi TimoVJL,
I need to try it. There is a web compiler for CUDA and CUDA LLVM:
https://cuda.godbolt.org/ (https://cuda.godbolt.org/)
and the ordinary compiler:
https://godbolt.org/ (https://godbolt.org/)
CUDA Compression - A GPU-based LZSS compression algorithm, highly tuned for NVIDIA GPGPUs and for streaming data, leveraging the respective strengths of CPUs and GPUs together - https://github.com/adnanozsoy/CUDA_Compression (https://github.com/adnanozsoy/CUDA_Compression)
Cuda lzss compression https://github.com/abshkbh/cuda-lzss (https://github.com/abshkbh/cuda-lzss)
Algorithms for Compression on GPUs http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6642/pdf/imm6642.pdf (http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6642/pdf/imm6642.pdf)
Compression library using Nvidia's CUDA - https://stackoverflow.com/questions/456829/compression-library-using-nvidias-cuda (https://stackoverflow.com/questions/456829/compression-library-using-nvidias-cuda)
Parallel lossless compression using GPUs http://on-demand.gputechconf.com/gtc/2014/presentations/S4459-parallel-lossless-compression-using-gpus.pdf (http://on-demand.gputechconf.com/gtc/2014/presentations/S4459-parallel-lossless-compression-using-gpus.pdf)
Breakthrough in CUDA data compression https://www.wave-access.com/public_en/blog/2011/april/22/breakthrough-in-cuda-data-compression.aspx (https://www.wave-access.com/public_en/blog/2011/april/22/breakthrough-in-cuda-data-compression.aspx)
The results:
BMP, 540 Kb
Full bzip2 compression on CPU - 218 ms
BW Transform on CPU - 171 ms
Full bzip2 compression on GPU - 93 ms [ minus 53% ]
BW Transform on GPU - 46 ms [ minus 73% ]
BMP, 1112 Kb
Full bzip2 compression on CPU - 467 ms
BW Transform on CPU - 343 ms
Full bzip2 compression on GPU - 249 ms [ minus 46% ]
BW Transform on GPU - 140 ms [ minus 59% ]
PDF, 1919 Kb
Full bzip2 compression on CPU - 1513 ms
BW Transform on CPU - 731 ms
Full bzip2 compression on GPU - 1107 ms [ minus 26% ]
BW Transform on GPU - 311 ms [ minus 57% ]
PDF, 3425 Kb
Full bzip2 compression on CPU - 2168 ms
BW Transform on CPU - 793 ms
Full bzip2 compression on GPU - 1856 ms [ minus 14% ]
BW Transform on GPU - 481 ms [ minus 39% ]
(https://www.wave-access.com/media/90569/bzip2_gpu.png)
Learn How To Do Alphablending with CUDA - https://www.codeproject.com/Articles/41977/Learn-How-To-Do-Alphablending-with-CUDA (https://www.codeproject.com/Articles/41977/Learn-How-To-Do-Alphablending-with-CUDA)
A good link :thumbsup:
Less reasons to download a huge nVidia SDK.
Hi,
I contacted the author of the program CudaPAD, and as a result there is now an update for Visual Studio 2019. The program works on the latest versions of both Visual Studio and cuda_10.1.168_425.25_win10. The source code can be found here: https://github.com/SunsetQuest/CudaPAD
Everything works ...
(https://i.imgur.com/4YxnCIW.png)
Thank you, LiaoMi. Very helpful :thumbsup:
Is there a way to get the available VRAM I can use for allocating a big array, and compare it to system RAM? On an older computer that has been upgraded with a better Nvidia card it would probably run much faster.
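Regarding the VRAM question: the driver API has cuMemGetInfo for exactly that; it returns the free and total device memory in bytes. A sketch, untested here (it needs the CUDA SDK import library / nvcuda.dll and a CUDA-capable GPU to run):

```cuda
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    size_t freeMem = 0, totalMem = 0;

    if (cuInit(0) || cuDeviceGet(&dev, 0) || cuCtxCreate(&ctx, 0, dev))
        return 1;
    /* free vs total device memory, in bytes */
    if (cuMemGetInfo(&freeMem, &totalMem))
        return 1;
    printf("VRAM: %zu MB free of %zu MB\n",
           freeMem >> 20, totalMem >> 20);
    cuCtxDestroy(ctx);
    return 0;
}
```

The free figure is what you can realistically allocate with cuMemAlloc; comparing it against system RAM (GlobalMemoryStatusEx on Windows) would tell you which side can hold the big array.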
Hi
Now that I have some free time, I've been working on CUDA for a bit. Really amazing.
I was able to integrate the code from LiaoMi into one of my 64-bit demos and it works like a charm. :thumbsup:
I failed in 32-bit because I need to change the PTX code. :sad:
So far I found the newest ISA documentation I need to read https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf
For me, the question remains whether PTX was the right choice, or whether to use nvcc to create the code ... :icon_idea:
Biterider
PS: link corrected - thanks to TimoVJL -
A working link for PARALLEL THREAD EXECUTION ISA
(https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf)
Hi
I found a bunch of CUDA tutorials that go from 1 to 12. They're really good for getting a basic understanding of the GPU architecture, the features and the pitfalls. Each tutorial is accompanied by examples. :thup:
Here the link to the first one https://www.youtube.com/watch?v=m0nhePeHwFs (https://www.youtube.com/watch?v=m0nhePeHwFs)
The rest is easy to find.
Biterider
Quote from: Biterider on June 19, 2019, 10:43:39 PM
Hi
I found a bunch of CUDA tutorials that go from 1 to 12. They're really good for getting a basic understanding of the GPU architecture, the features and the pitfalls. Each tutorial is accompanied by examples. :thup:
Here the link to the first one https://www.youtube.com/watch?v=m0nhePeHwFs (https://www.youtube.com/watch?v=m0nhePeHwFs)
The rest is easy to find.
Biterider
Hi Biterider,
thanks for the interesting video!
Quote from: Biterider on June 18, 2019, 06:35:05 PM
For me, the question remains whether PTX was the right choice, or whether to use nvcc to create the code ... :icon_idea:
I looked at standalone compilers and emulators for CUDA; they were not perfect, and in addition writing a CUDA translator is a very difficult task. If you try to write your own macros for PTX, this will require a serious study of the internal GPU architecture. Nvidia tried to create a special emulator that could help developers, and later they closed the project, since it was too complicated. For me personally, the best solution is the CudaPAD application, with dynamic programming of the C code.
From the documentation...
1.1.3. Purpose of NVCC
The compilation trajectory involves several splitting, compilation, preprocessing, and merging steps for each CUDA source file. It is the purpose of nvcc, the CUDA compiler driver, to hide the intricate details of CUDA compilation from developers. It accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process. All non-CUDA compilation steps are forwarded to a C++ host compiler that is supported by nvcc, and nvcc translates its options to appropriate host compiler command line options.
1.2. Supported Host Compilers
A general purpose C++ host compiler is needed by nvcc in the following situations:
During non-CUDA phases (except the run phase), because these phases will be forwarded by nvcc to this compiler.
During CUDA phases, for several preprocessing stages and host code compilation (see also The CUDA Compilation Trajectory).
nvcc assumes that the host compiler is installed with the standard method designed by the compiler provider. If the host compiler installation is non-standard, the user must make sure that the environment is set appropriately and use relevant nvcc compile options.
Figure 1. CUDA Compilation Trajectory (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/graphics/cuda-compilation-from-cu-to-executable.png)
thanks Biterider
I downloaded NVASM many years ago; it was only for the oldest 128-bit hardware version of pixel shaders.
I tried out Nvidia's Cg toolkit years ago; it is similar to C, but often operates on 4 floats simultaneously, like SSE.
Also, in newer C++ you can include <vector>.
So should you code it in C++ with vectors and compile for different CPUs or for CUDA, and benchmark the same code on different hardware? Does CUDA support C++ vectors?
Biterider, LiaoMi or anyone else, have you tested CUDA with benchmark code for Fibonacci, primes or pi?