Using the GPU

Biterider · April 30, 2019, 07:26:14 PM

Hi
I would like to ask if anyone has succeeded in interfacing with the GPU from asm.
Using CUDA or DirectCompute seems the right way to go. :idea:

It would be nice if you could provide an example in asm. ;)

Regards, Biterider

fearless · April 30, 2019, 08:31:04 PM

I've done some conversion of CUDA and NVML for 64-bit asm:

https://github.com/mrfearless/libraries/tree/master/Cuda
I've updated my repo to include a CudaDetection RadASM project that I used while testing some things out, mainly using NVML to detect the GPU. I haven't got round to the CUDA part or to testing any of the functions.

You have to use the libraries and DLLs from the appropriate CUDA SDK version. I have folders for the v8.0 and v9.1 SDKs; just copy the files from the appropriate folder and overwrite those in the CudaDetection folder. Currently the CudaDetection folder contains the 9.1 version libs and DLL files.
I also played around with OpenCL a little (collected a few libs, converted an include file): https://github.com/mrfearless/libraries/tree/master/OpenCL - updated to include an OpenCLDetection project. I haven't actually tested anything yet. I obviously meant to, or was going to use the project to help convert the include file, I think.

I've done some stuff with Vulkan as well: some conversion of the includes and some minor testing, using the examples/samples/tutorials from the SDK as a basis to see if it would work. I only got so far before getting distracted, and I haven't got back to it. I got up to the 4th example, 04-init_command_buffer, but I can't recall if that is working. https://github.com/mrfearless/libraries/tree/master/Vulkan

Biterider · April 30, 2019, 09:57:23 PM

Hi fearless
Thank you very much. It is a beginning... :t

Anyone else?

Biterider

daydreamer · April 30, 2019, 11:27:22 PM

Quote from: Biterider on April 30, 2019, 09:57:23 PM
Hi fearless
Thank you very much. It is a beginning... :t

Anyone else?

Biterider

Take a look at Hitchhiker's 1k demos in the old forum. He just had a minimal D3D or OpenGL call to set up a quad, plus a pixel shader in clear text. Pixel shaders can be used together with texture maps as arrays.

LiaoMi · May 01, 2019, 09:47:01 AM

Quote from: Biterider on April 30, 2019, 07:26:14 PM
Hi
I would like to ask if anyone has succeeded in interfacing with the GPU from asm.
Using CUDA or DirectCompute seems the right way to go. :idea:

It would be nice if you could provide an example in asm. ;)

Regards, Biterider

Cuda.. PTX assembly... etc..
http://masm32.com/board/index.php?topic=4711.0

Translating GPU Binaries to Tiered SIMD Architectures
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.147.1333&rep=rep1&type=pdf

LiaoMi · May 01, 2019, 10:01:30 AM

CudaPAD is a PTX/SASS viewer for NVIDIA Cuda kernels and provides an on-the-fly view of the assembly
https://github.com/SunsetQuest/CudaPAD

CudaPAD
http://www.codeproject.com/Articles/999744/CudaPAD

LiaoMi · May 12, 2019, 12:17:48 AM

Hi,

PTX ISA (PDF) - v10.1.105 (older) - Last updated March 26, 2019 https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf

PTX Writer's Guide to Interoperability https://docs.nvidia.com/cuda/pdf/PTX_Writers_Guide_To_Interoperability.pdf

CUDA Toolkit 10.1 Download - https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal

Cuda_Assembly Example exe + src + inc 8)

Code:

format PE64 GUI 5.0
entry start

include 'win64a.inc'
include 'cuda.inc'

section '.text' code readable executable

  start:
	  push	rbp

     ; init
	   lea	rbx,[_cuInit]
	invoke	cuInit,0
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuDeviceGet]
	invoke	cuDeviceGet,CudaDevice,0   ; get the first device
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuCtxCreate]
	invoke	cuCtxCreate,CudaContext,CU_CTX_SCHED_SPIN+CU_CTX_MAP_HOST,[CudaDevice]	    ; this context associates the device with this cpu thread
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuMemAlloc]
	invoke	cuMemAlloc,CudaNumberArray,256*4
	  test	eax,eax
	   jnz	Error

    ; load ptx source
	   lea	rbx,[_cuModuleLoadData]
	invoke	cuModuleLoadData,CudaModule,PTXSourceData
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuModuleGetFunction]
	invoke	cuModuleGetFunction,CudaFunction,[CudaModule],PTXFunction
	  test	eax,eax
	   jnz	Error

    ; fill in NumberArray and LogNumberArray with x87 calculations for comparison
	  fld1
	  fld1
	   mov	ecx,256*4
	   lea	rax,[NumberArray+rcx]
	   lea	rbx,[LogNumberArray+rcx]
	   neg	rcx
   @@:	   fst	dword[rax+rcx]
	  fld1
	   fld	st1
	 fyl2x
	  fstp	dword[rbx+rcx]
	  fadd	st0,st1
	   add	rcx,4
	   jnz	@b
	fcompp

    ; call the function
	   lea	rbx,[_cuMemcpyHtoD]
	invoke	cuMemcpyHtoD,[CudaNumberArray],NumberArray,256*4
	  test	eax,eax
	   jnz	Error


	   lea	rbx,[_cuParamSetSize]
	invoke	cuParamSetSize,[CudaFunction],12    ; 8 byte first argument + 4 byte second
	  test	eax,eax
	   jnz	Error

	   mov	rax,[CudaNumberArray]		   ; fill in the arguments
	   mov	qword[Message+0],rax		   ;
	   mov	dword[Message+8],256		   ;
	   lea	rbx,[_cuParamSetv]
	invoke	cuParamSetv,[CudaFunction],0,Message,12        ; and pass in    (cuParamSeti doesn't always work so well)
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuFuncSetBlockShape]
	invoke	cuFuncSetBlockShape,[CudaFunction],256,1,1
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuLaunchGrid]
	invoke	cuLaunchGrid,[CudaFunction],1,1
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuCtxSynchronize]
	invoke	cuCtxSynchronize
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuMemcpyDtoH]
	invoke	cuMemcpyDtoH,CudaLogNumberArray,[CudaNumberArray],256*4
	  test	eax,eax
	   jnz	Error

   ; clean up
	   lea	rbx,[_cuMemFree]
	invoke	cuMemFree,[CudaNumberArray]
	  test	eax,eax
	   jnz	Error

	   lea	rbx,[_cuCtxDestroy]
	invoke	cuCtxDestroy,[CudaContext]
	  test	eax,eax
	   jnz	Error

	   lea	rdi,[Message]
	   cld
	   mov	rax,'32bit fl'
	 stosq
	   mov	rax,'oats:  x'
	 stosq
	   mov	rax,'   lg2.a'
	 stosq
	   mov	rax,'pprox.f3'
	 stosq
	   mov	rax,'2(x)    '
	 stosq
	   mov	rax,'fyl2x(x,'
	 stosq
	   mov	rax,'1.0)    '
	 stosq
	   mov	al,10
	 stosb
	   xor	ebx,ebx
     @@:   fld	dword[NumberArray+4*rbx]
	  fstp	qword[rsp-8]
	   mov	rax,qword[rsp-8]
	invoke	sprintf,Temp,MessageFormat,rax
	movdqu	xmm0,xword[Temp]
	movdqu	[rdi],xmm0
	   fld	dword[CudaLogNumberArray+4*rbx]
	  fstp	qword[rsp-8]
	   mov	rax,qword[rsp-8]
	invoke	sprintf,Temp,MessageFormat,rax
	movdqu	xmm0,xword[Temp]
	movdqu	[rdi+16],xmm0
	   fld	dword[LogNumberArray+4*rbx]
	  fstp	qword[rsp-8]
	   mov	rax,qword[rsp-8]
	invoke	sprintf,Temp,MessageFormat,rax
	movdqu	xmm0,xword[Temp]
	movdqu	[rdi+32],xmm0
	   mov	byte[rdi+48],10
	   add	rdi,49
	   add	rbx,1
	   cmp	rbx,20
	    jb	@b
	   mov	byte[rdi],0
	invoke	MessageBox,NULL,Message,NULL,MB_OK

	invoke	ExitProcess,0

Error:
	   mov	ecx,42
	   cmp	eax,8
	    ja	@f
	   mul	ecx
	   lea	rax,[err000+rax]
	   jmp	.Print
	@@:
	   cmp	eax,100
	    jb	@f
	   cmp	eax,101
	    ja	@f
	   sub	eax,100
	   mul	ecx
	   lea	rax,[err100+rax]
	   jmp	.Print
	@@:
	   cmp	eax,200
	    jb	@f
	   cmp	eax,216
	    ja	@f
	   sub	eax,200
	   mul	ecx
	   lea	rax,[err200+rax]
	   jmp	.Print
	@@:
	   cmp	eax,300
	    jb	@f
	   cmp	eax,304
	    ja	@f
	   sub	eax,300
	   mul	ecx
	   lea	rax,[err300+rax]
	   jmp	.Print
	@@:
	   cmp	eax,400
	   jne	@f
	   sub	eax,400
	   mul	ecx
	   lea	rax,[err400+rax]
	   jmp	.Print
	@@:
	   cmp	eax,500
	   jne	@f
	   sub	eax,500
	   mul	ecx
	   lea	rax,[err500+rax]
	   jmp	.Print
	@@:
	   cmp	eax,600
	   jne	@f
	   sub	eax,600
	   mul	ecx
	   lea	rax,[err600+rax]
	   jmp	.Print
	@@:
	   cmp	eax,700
	    jb	@f
	   cmp	eax,709
	    ja	@f
	   sub	eax,700
	   mul	ecx
	   lea	rax,[err700+rax]
	   jmp	.Print
	@@:
	   cmp	eax,999
	   jne	@f
	   sub	eax,999
	   mul	ecx
	   lea	rax,[err999+rax]
	   jmp	.Print
	@@:
	   lea	rax,[errNoMatch]

  .Print:
	   mov	rdx,[rbx+0]
	   mov	qword[Message+0],rdx
	   mov	rdx,[rbx+8]
	   mov	qword[Message+8],rdx
	   mov	edx,[rbx+16]
	   mov	dword[Message+16],edx
	   mov	byte[Message+19],10

	   mov	rdx,[rax+0]
	   mov	qword[Message+20],rdx
	   mov	rdx,[rax+8]
	   mov	qword[Message+28],rdx
	   mov	rdx,[rax+16]
	   mov	qword[Message+36],rdx
	   mov	rdx,[rax+24]
	   mov	qword[Message+44],rdx
	   mov	rdx,[rax+32]
	   mov	qword[Message+52],rdx
	   mov	edx,[rax+40]
	   mov	dword[Message+60],edx
	   mov	byte[Message+62],0

	invoke	MessageBox,NULL,Message,NULL,MB_OK
	invoke	ExitProcess,0

section '.data' data readable

  PTXFunction: db 'log_2',0

; log_2(*inout,length):
; xind =  %ctaid.x * %ntid.x + %tid.x
; if xind < length , [inout + 4 * xind] = log2([inout + 4 * xind])
; return

  PTXSourceData:
db '     .version 1.4'
db '     .target sm_13'
db '        .entry log_2 ('
db '                .param .u64 _inout,'
db '                .param .s32 _length)'
db '        {'
db '        .reg .u16 w1,w2;'		      ; word-sized registers
db '        .reg .u32 e1,xind;' 	      ; dwords
db '        .reg .u64 r1;'		      ; qwords
db '        .reg .f32 f1;'		      ; floats
db '        .reg .pred p1;'		      ; conditions
db '        mov.u16         w1,%ctaid.x;'
db '        mov.u16         w2,%ntid.x;'
db '        cvt.u32.u16     e1,%tid.x;'
db '        mad.wide.u16    xind,w1,w2,e1;'
db '        ld.param.s32    e1,[_length];'
db '        setp.le.s32     p1,e1,xind;'
db '   @p1  bra             $Lt_Exit;'	      ; if p1, branch
db '        ld.param.u64    r1,[_inout];'
db '        mad.wide.u32    r1,xind,4,r1;'
db '        ld.global.f32   f1,[r1+0];'
db '        lg2.approx.f32  f1,f1;'
db '        st.global.f32   [r1+0],f1;'
db '$Lt_Exit:'
db '        exit;'
db '        }'
db 0

section '.data' data readable writeable

align 16
  ErrorMessageFormat db 'error code:',10,'hex: 0x%.8x',10,'dec: %u',0
  MessageFormat db '%16.7f',0
 err000:
db    'CUDA_SUCCESS                              ';= 0
db    'CUDA_ERROR_INVALID_VALUE                  ';= 1
db    'CUDA_ERROR_OUT_OF_MEMORY                  ';= 2
db    'CUDA_ERROR_NOT_INITIALIZED                ';= 3
db    'CUDA_ERROR_DEINITIALIZED                  ';= 4
db    'CUDA_ERROR_PROFILER_DISABLED              ';= 5
db    'CUDA_ERROR_PROFILER_NOT_INITIALIZED       ';= 6
db    'CUDA_ERROR_PROFILER_ALREADY_STARTED       ';= 7
db    'CUDA_ERROR_PROFILER_ALREADY_STOPPED       ';= 8
 err100:
db    'CUDA_ERROR_NO_DEVICE                      ';= 100
db    'CUDA_ERROR_INVALID_DEVICE                 ';= 101
 err200:
db    'CUDA_ERROR_INVALID_IMAGE                  ';= 200
db    'CUDA_ERROR_INVALID_CONTEXT                ';= 201
db    'CUDA_ERROR_CONTEXT_ALREADY_CURRENT        ';= 202
db    '?                                         ';=
db    '?                                         ';=
db    'CUDA_ERROR_MAP_FAILED                     ';= 205
db    'CUDA_ERROR_UNMAP_FAILED                   ';= 206
db    'CUDA_ERROR_ARRAY_IS_MAPPED                ';= 207
db    'CUDA_ERROR_ALREADY_MAPPED                 ';= 208
db    'CUDA_ERROR_NO_BINARY_FOR_GPU              ';= 209
db    'CUDA_ERROR_ALREADY_ACQUIRED               ';= 210
db    'CUDA_ERROR_NOT_MAPPED                     ';= 211
db    'CUDA_ERROR_NOT_MAPPED_AS_ARRAY            ';= 212
db    'CUDA_ERROR_NOT_MAPPED_AS_POINTER          ';= 213
db    'CUDA_ERROR_ECC_UNCORRECTABLE              ';= 214
db    'CUDA_ERROR_UNSUPPORTED_LIMIT              ';= 215
db    'CUDA_ERROR_CONTEXT_ALREADY_IN_USE         ';= 216
 err300:
db    'CUDA_ERROR_INVALID_SOURCE                 ';= 300
db    'CUDA_ERROR_FILE_NOT_FOUND                 ';= 301
db    'CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND ';= 302
db    'CUDA_ERROR_SHARED_OBJECT_INIT_FAILED      ';= 303
db    'CUDA_ERROR_OPERATING_SYSTEM               ';= 304
 err400:
db    'CUDA_ERROR_INVALID_HANDLE                 ';= 400
 err500:
db    'CUDA_ERROR_NOT_FOUND                      ';= 500
 err600:
db    'CUDA_ERROR_NOT_READY                      ';= 600
 err700:
db    'CUDA_ERROR_LAUNCH_FAILED                  ';= 700
db    'CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES        ';= 701
db    'CUDA_ERROR_LAUNCH_TIMEOUT                 ';= 702
db    'CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING  ';= 703
db    'CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED    ';= 704
db    'CUDA_ERROR_PEER_ACCESS_NOT_ENABLED        ';= 705
db    '?                                         ';=
db    '?                                         ';=
db    'CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE         ';= 708
db    'CUDA_ERROR_CONTEXT_IS_DESTROYED           ';= 709
 err999:
db    'CUDA_ERROR_UNKNOWN                        ';= 999
 errNoMatch:
db    '?                                         '


_cuInit:		 db 'cuInit              '
_cuDeviceGet:		 db 'cuDeviceGet         '
_cuCtxCreate:		 db 'cuCtxCreate         '
_cuMemAlloc:		 db 'cuMemAlloc          '
_cuModuleLoadData:	 db 'cuModuleLoadData    '
_cuModuleGetFunction:	 db 'cuModuleGetFunction '
_cuMemcpyHtoD:		 db 'cuMemcpyHtoD        '
_cuParamSeti:		 db 'cuParamSeti         '
_cuParamSetv:		 db 'cuParamSetv         '
_cuFuncSetBlockShape:	 db 'cuFuncSetBlockShape '
_cuLaunchGrid:		 db 'cuLaunchGrid        '
_cuParamSetSize:	 db 'cuParamSetSize      '
_cuCtxSynchronize:	 db 'cuCtxSynchronize    '
_cuMemcpyDtoH:		 db 'cuMemcpyDtoH        '
_cuMemFree:		 db 'cuMemFree           '
_cuCtxDestroy:		 db 'cuCtxDestroy        '



align 16
  CudaDevice	   dq ?
  CudaContext	   dq ?
  CudaModule	   dq ?
  CudaFunction	   dq ?
  CudaNumberArray  dq ?
  NumberArray	   rd 256
  LogNumberArray   rd 256
  CudaLogNumberArray rd 256
  Temp	rb 32
  Message rb 50*60

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
	  user32,'USER32.DLL',\
	  msvcrt,'MSVCRT.DLL',\
	  cuda,'NVCUDA.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'
  include 'api_cuda.inc'

 import msvcrt,\
	sprintf,'sprintf'
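
The Error routine in the listing finds its message by treating each error range as an array of fixed-width 42-byte records: subtract the first code of the range, multiply by 42, add the table base. A minimal C sketch of the same lookup may make the arithmetic easier to follow. The tables are abbreviated, and `REC`, `ErrRange` and `cuda_error_name` are names made up for this illustration, not part of the CUDA API:

```c
#include <stddef.h>

#define REC 42  /* record width used by the FASM tables */

/* Abbreviated copies of three tables; C pads each row to REC bytes with NULs
   (the FASM version pads with spaces instead). */
static const char err000[][REC] = {
    "CUDA_SUCCESS", "CUDA_ERROR_INVALID_VALUE", "CUDA_ERROR_OUT_OF_MEMORY",
    "CUDA_ERROR_NOT_INITIALIZED", "CUDA_ERROR_DEINITIALIZED",
    "CUDA_ERROR_PROFILER_DISABLED", "CUDA_ERROR_PROFILER_NOT_INITIALIZED",
    "CUDA_ERROR_PROFILER_ALREADY_STARTED", "CUDA_ERROR_PROFILER_ALREADY_STOPPED"
};
static const char err100[][REC] = {
    "CUDA_ERROR_NO_DEVICE", "CUDA_ERROR_INVALID_DEVICE"
};
static const char err999[][REC] = { "CUDA_ERROR_UNKNOWN" };

typedef struct {
    unsigned first, last;        /* inclusive range of codes covered */
    const char (*table)[REC];    /* fixed-width records for that range */
} ErrRange;

static const ErrRange ranges[] = {
    {   0,   8, err000 },
    { 100, 101, err100 },
    { 999, 999, err999 },
};

/* Return the name for a CUresult value, or "?" if no range matches. */
const char *cuda_error_name(unsigned code)
{
    for (size_t i = 0; i < sizeof ranges / sizeof ranges[0]; ++i) {
        if (code >= ranges[i].first && code <= ranges[i].last) {
            /* table[n] is the address base + n * REC, as in the asm. */
            return ranges[i].table[code - ranges[i].first];
        }
    }
    return "?";
}
```

With these tables, `cuda_error_name(101)` returns "CUDA_ERROR_INVALID_DEVICE", and any code outside the listed ranges returns "?". As far as I know, newer drivers also export `cuGetErrorName`, which may make a hand-written table unnecessary.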

PTX Data

Code:

section '.data' data readable

  PTXFunction: db 'log_2',0

; log_2(*inout,length):
; xind =  %ctaid.x * %ntid.x + %tid.x
; if xind < length , [inout + 4 * xind] = log2([inout + 4 * xind])
; return

  PTXSourceData:
db '     .version 1.4'
db '     .target sm_13'
db '        .entry log_2 ('
db '                .param .u64 _inout,'
db '                .param .s32 _length)'
db '        {'
db '        .reg .u16 w1,w2;'		      ; word-sized registers
db '        .reg .u32 e1,xind;' 	      ; dwords
db '        .reg .u64 r1;'		      ; qwords
db '        .reg .f32 f1;'		      ; floats
db '        .reg .pred p1;'		      ; conditions
db '        mov.u16         w1,%ctaid.x;'
db '        mov.u16         w2,%ntid.x;'
db '        cvt.u32.u16     e1,%tid.x;'
db '        mad.wide.u16    xind,w1,w2,e1;'
db '        ld.param.s32    e1,[_length];'
db '        setp.le.s32     p1,e1,xind;'
db '   @p1  bra             $Lt_Exit;'	      ; if p1, branch
db '        ld.param.u64    r1,[_inout];'
db '        mad.wide.u32    r1,xind,4,r1;'
db '        ld.global.f32   f1,[r1+0];'
db '        lg2.approx.f32  f1,f1;'
db '        st.global.f32   [r1+0],f1;'
db '$Lt_Exit:'
db '        exit;'
db '        }'
db 0
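
To make the PTX above easier to check, here is a CPU-side sketch in C of what the listing does: it packs the two kernel arguments into the 12-byte buffer handed to cuParamSetv (8-byte pointer at offset 0, 4-byte length at offset 8) and then runs the kernel body once per thread index. `approx_log2` is a deliberately crude stand-in for `lg2.approx.f32` so the sketch needs no math library, and all function names are invented for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Crude stand-in for lg2.approx.f32, positive normal floats only:
   unbiased exponent plus a linear estimate of log2(mantissa). */
static float approx_log2(float x)
{
    uint32_t bits;
    float m;
    memcpy(&bits, &x, sizeof bits);
    int e = (int)((bits >> 23) & 0xFF) - 127;      /* unbiased exponent */
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;     /* mantissa as a float in [1,2) */
    memcpy(&m, &bits, sizeof m);
    return (float)e + (m - 1.0f);                  /* log2(m) is roughly m-1 on [1,2) */
}

/* Pack the arguments as the listing does before cuParamSetv. */
static void pack_params(unsigned char buf[12], float *inout, int32_t length)
{
    uint64_t addr = (uint64_t)(uintptr_t)inout;
    memcpy(buf + 0, &addr, 8);      /* .param .u64 _inout  */
    memcpy(buf + 8, &length, 4);    /* .param .s32 _length */
}

/* One thread of log_2: xind = %ctaid.x * %ntid.x + %tid.x */
static void log_2_thread(const unsigned char params[12],
                         unsigned ctaid, unsigned ntid, unsigned tid)
{
    uint64_t addr;
    int32_t length;
    memcpy(&addr, params + 0, 8);
    memcpy(&length, params + 8, 4);
    unsigned xind = ctaid * ntid + tid;
    if (length <= (int32_t)xind)    /* setp.le.s32 p1,length,xind ; @p1 bra exit */
        return;
    float *p = (float *)(uintptr_t)addr + xind;
    *p = approx_log2(*p);
}

/* Emulates cuFuncSetBlockShape(f, block_x, 1, 1) + cuLaunchGrid(f, grid_x, 1). */
void emulate_launch(float *inout, int32_t length, unsigned grid_x, unsigned block_x)
{
    unsigned char params[12];
    pack_params(params, inout, length);
    for (unsigned b = 0; b < grid_x; ++b)
        for (unsigned t = 0; t < block_x; ++t)
            log_2_thread(params, b, block_x, t);
}
```

Calling `emulate_launch(buf, 256, 1, 256)` mirrors the single 256-thread block in the listing. Elements at or beyond `length` are left untouched, which is what the `setp.le.s32` / `bra` pair guards against.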

LiaoMi · May 12, 2019, 12:24:34 AM

With the toolkit from the posts above, you can freely write modern code; all that remains is to find an idea for a super-fast algorithm :P

Biterider · May 12, 2019, 06:35:24 AM

Thanks LiaoMi
Very good and interesting work :t

Biterider

LiaoMi · May 12, 2019, 09:00:45 AM

Hi Biterider,

the example is old, but perfect for demonstration! :icon14:
mrfearless has a different set of APIs (https://github.com/mrfearless/libraries/tree/master/Cuda), maybe because of the CUDA Toolkit version, although the compiled exe works fine for me. We can easily rewrite the example for UASM; the main thing is to understand how to use the new 10.1 API.

Nsight Productivity Utilities PTX Debugger + Tools

NVIDIA® Nsight™ Systems
NVIDIA® Nsight Systems™ is a system-wide performance analysis tool designed to visualize an application's algorithms, identify the largest optimization opportunities, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large servers to our smallest SoC.

NVIDIA® Nsight™ Graphics
NVIDIA® Nsight Graphics™ is a standalone application for the debugging, profiling, and analysis of graphics applications. It allows you to optimize the performance of your Direct3D 11, Direct3D 12, DirectX Raytracing, OpenGL, Vulkan, and NVIDIA VKRay based applications.

NVIDIA® Nsight™ Eclipse Edition
NVIDIA® Nsight™ Eclipse Edition is a full-featured IDE powered by the Eclipse platform that provides an all-in-one integrated environment to edit, build, debug, and profile CUDA-C applications. Nsight Eclipse Edition supports a rich set of commercial and free plugins.

NVIDIA® Nsight™ Compute
NVIDIA® Nsight™ Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. Nsight Compute also provides customizable and data-driven user interface and metric collection that can be extended with analysis scripts for post-processing results.

NVIDIA® Nsight™ Visual Studio Edition
An application development environment for heterogeneous platforms, Nsight Visual Studio Edition brings GPU computing into Microsoft Visual Studio. Build, debug, profile and trace heterogeneous compute, graphics, virtual reality, RTX, .NET, and UWP applications built with CUDA C/C++, OpenCL, DirectCompute, Direct3D (11,12,DXR), Vulkan (1.1, Vulkan Ray Tracing Extension), OpenGL, OpenVR, and the Oculus SDK.
Download NVIDIA® Nsight™ Visual Studio Edition 2019.1 - https://developer.nvidia.com/gameworksdownload#?dn=nsight-visual-studio-edition-2019-1 (free registration is required; you can then download the file NVIDIA_Nsight_Visual_Studio_Edition_Win64_2019.1.1.19056.msi, 492 MB)

fearless · May 12, 2019, 01:01:12 PM

There have probably been a good few changes since I last looked at it. I think the version of the CUDA SDK I was working with was primarily v8, just to see if I could get a few things working, mainly with the NVML part. So any APIs and include file conversions are probably based on v8. I think there was some reason for only using x64 asm. I can't quite recall why x64 and not x86; maybe there wasn't an x86 DLL for NVML or something (I think).
Anyhow, it looks like you have a lot of stuff figured out, so it looks promising.

LiaoMi · May 12, 2019, 10:06:34 PM

Hi fearless,

I checked your SDK conversion for the presence of the API functions and compared it with the original version 10.1. My conclusion: your SDK is not complete, but all the functions it contains are present in 10.1, and nothing about them has changed in the new CUDA versions.

I attach my version of the converted .H files (https://www.solidfiles.com/v/yR7MaKLr5Vkx6). The library files are, as usual, in CUDA Toolkit 10.1.

I did not convert two folders, because they contain templates and classes that the converter does not support.

Code:

thrust/detail
thrust/system

I have not tried converting the .H files with the h2incX utility; I did the conversion the quick way to cover more files. h2incX needs a graphical front-end: a separate program that generates the flags, creates subdirectories during conversion, and manages the conversion list. UPX has a similar separate interface.

The Linux "Cuda_Assembly Example" version can be found below, FASM source code :icon14:

P.S. The converter did not give any errors, but you may have to correct something

LiaoMi · May 12, 2019, 11:13:17 PM

First bugs after conversion

C++

Code:

/**
 * External memory buffer descriptor
 */
typedef struct CUDA_EXTERNAL_MEMORY_BUFFER_DESC_st {
    /**
     * Offset into the memory object where the buffer's base is
     */
    unsigned long long offset;
    /**
     * Size of the buffer
     */
    unsigned long long size;
    /**
     * Flags reserved for future use. Must be zero.
     */
    unsigned int flags;
    unsigned int reserved[16];
} CUDA_EXTERNAL_MEMORY_BUFFER_DESC;

Conversion

Code:


CUDA_EXTERNAL_MEMORY_BUFFER_DESC	STRUCT DEFALIGNMASM
	long DWORD ?
	long DWORD ?
	flags DWORD ?
	reserved DWORD 16 dup (?)
CUDA_EXTERNAL_MEMORY_BUFFER_DESC		ENDS

:redface: How many bytes is unsigned long long?
https://stackoverflow.com/questions/5836329/how-many-bytes-is-unsigned-long-long
https://en.cppreference.com/w/cpp/language/types :t
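
For reference, on x86-64 (Windows and Linux alike) `unsigned long long` is 8 bytes, so both leading members should come out as QWORD fields. The converter seems to have read `unsigned long` as the type and the second `long` as the field name. A short C check of the layout that a corrected STRUCT has to match, assuming the usual x86-64 ABI:

```c
#include <stddef.h>

/* Same field order as the CUDA 10.1 header quoted above. */
typedef struct CUDA_EXTERNAL_MEMORY_BUFFER_DESC_st {
    unsigned long long offset;     /* 8 bytes, at offset 0  */
    unsigned long long size;       /* 8 bytes, at offset 8  */
    unsigned int flags;            /* 4 bytes, at offset 16 */
    unsigned int reserved[16];     /* 64 bytes, at offset 20 */
} CUDA_EXTERNAL_MEMORY_BUFFER_DESC;

/* Compile-time checks; these assume an x86-64 ABI with 8-byte alignment
   for unsigned long long. */
_Static_assert(sizeof(unsigned long long) == 8, "unsigned long long is 8 bytes");
_Static_assert(offsetof(CUDA_EXTERNAL_MEMORY_BUFFER_DESC, offset) == 0, "offset at 0");
_Static_assert(offsetof(CUDA_EXTERNAL_MEMORY_BUFFER_DESC, size) == 8, "size at 8");
_Static_assert(offsetof(CUDA_EXTERNAL_MEMORY_BUFFER_DESC, flags) == 16, "flags at 16");
_Static_assert(offsetof(CUDA_EXTERNAL_MEMORY_BUFFER_DESC, reserved) == 20, "reserved at 20");
_Static_assert(sizeof(CUDA_EXTERNAL_MEMORY_BUFFER_DESC) == 88, "84 bytes of fields + 4 padding");
```

The fields add up to 8 + 8 + 4 + 64 = 84 bytes, rounded up to 88 for 8-byte alignment. Note also that `offset` and `size` collide with MASM operators, so the converted field names will probably need changing.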

LiaoMi · May 13, 2019, 01:30:06 AM

Hi again,

I can successfully compile with the UASM assembler. There is one problem: the header files let you control which version of the API is used. In my case version 2 is used, but it should be version 1 ... cuCtxCreate_v2 -> cuCtxCreate

Code:

JMP     QWORD PTR DS:[<&cuInit>]
JMP     QWORD PTR DS:[<&cuDeviceGet>]
JMP     QWORD PTR DS:[<&cuCtxCreate_v2>]
JMP     QWORD PTR DS:[<&cuCtxDestroy_v2>]
JMP     QWORD PTR DS:[<&cuCtxSynchronize>]
JMP     QWORD PTR DS:[<&cuModuleLoadData>]
JMP     QWORD PTR DS:[<&cuModuleGetFunction>]
JMP     QWORD PTR DS:[<&cuMemAlloc_v2>]
JMP     QWORD PTR DS:[<&cuMemFree_v2>]
JMP     QWORD PTR DS:[<&cuMemcpyHtoD_v2>]
JMP     QWORD PTR DS:[<&cuMemcpyDtoH_v2>]
JMP     QWORD PTR DS:[<&cuFuncSetBlockShape>]
JMP     QWORD PTR DS:[<&cuParamSetSize>]
JMP     QWORD PTR DS:[<&cuParamSetv>]
JMP     QWORD PTR DS:[<&cuLaunchGrid>]
JMP     QWORD PTR DS:[<&FatalExit>]
JMP     QWORD PTR DS:[<&MessageBoxA>]
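
The `_v2` names in the import list come from the header, not from the assembler. As far as I can tell, cuda.h renames the affected entry points with plain `#define`s once the API version is high enough, so a call written as `cuCtxCreate` links against `cuCtxCreate_v2`. A small C sketch of that mechanism follows; `DEMO_API_VERSION` stands in for `__CUDA_API_VERSION`, the 3020 threshold is quoted from memory and should be checked against the header, and the helper names are invented:

```c
/* Two-level stringification so the macro is expanded before it is quoted. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* Stand-in for __CUDA_API_VERSION; 10010 is the CUDA 10.1 default. */
#ifndef DEMO_API_VERSION
#define DEMO_API_VERSION 10010
#endif

/* Assumed shape of the renaming block in cuda.h (threshold from memory). */
#if DEMO_API_VERSION >= 3020
#define cuCtxCreate cuCtxCreate_v2
#define cuMemAlloc  cuMemAlloc_v2
#endif

/* The symbol name a call written as cuCtxCreate(...) ends up referencing. */
const char *ctx_create_symbol(void) { return STR(cuCtxCreate); }

/* Likewise for cuMemAlloc(...). */
const char *mem_alloc_symbol(void)  { return STR(cuMemAlloc); }
```

Built as is, `ctx_create_symbol()` yields "cuCtxCreate_v2"; compiling with `-DDEMO_API_VERSION=3010` yields the unsuffixed "cuCtxCreate". A converted .inc would need the same kind of conditional aliasing to pick one set of names consistently.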

I can adjust the parameters in the macro, but then I get Error A2101: Macro nesting level too deep from the UASM assembler :icon_confused:

Code:


E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4879) : Error A2101: Macro nesting level too deep
 E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4879): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4909) : Error A2101: Macro nesting level too deep
 E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4909): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5719) : Error A2101: Macro nesting level too deep
 E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5719): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5763) : Error A2101: Macro nesting level too deep
 E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5763): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5796) : Error A2101: Macro nesting level too deep
 E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5796): Included by
  mywindow1.asm(32): Main line code

Where can I patch the macro nesting depth?! :icon_exclaim: I still use version 2.47.1.0 because of a bug in version 2.48 :idea:

LiaoMi · May 13, 2019, 04:52:37 AM

In my opinion, everything in the file cuda.inc is correct:

Code:

#if defined(CUDA_FORCE_API_VERSION)
    #if (CUDA_FORCE_API_VERSION == 3010)
        #define __CUDA_API_VERSION 3010
    #else
        #error "Unsupported value of CUDA_FORCE_API_VERSION"
    #endif
#else
    #define __CUDA_API_VERSION 10010
#endif /* CUDA_FORCE_API_VERSION */

Why are different functions reported as unavailable? :icon_eek: After some experiments, I realized that the problem is in the page alignment. I had forgotten to remove the /Force flag, and I added /LARGEADDRESSAWARE:NO, but nothing changed: the API functions still have invalid addresses.

The MASM Forum
