Using the GPU

Started by Biterider, April 30, 2019, 07:26:14 PM

Biterider

Hi
I would like to ask if anyone has succeeded in interfacing with the GPU from asm.
Using CUDA or DirectCompute seems the right way to go.  :idea:

It would be nice if you could provide an example in asm.  ;)

Regards, Biterider

fearless

I've done some conversion of CUDA and NVML for 64-bit asm:

https://github.com/mrfearless/libraries/tree/master/Cuda
I updated my repo to include a CudaDetection RadASM project that I used while testing some things out. It mainly uses NVML to detect the GPU; I haven't got round to the CUDA part or to testing any of the functions.

You have to use the libraries and DLLs from the appropriate CUDA SDK version. I have a folder each for the v8.0 and v9.1 SDKs; just copy from the appropriate folder and overwrite the files in the CudaDetection folder. Currently the CudaDetection folder contains the 9.1 libs and DLL files.
I also played around with OpenCL a little (collected a few libs, converted an include file): https://github.com/mrfearless/libraries/tree/master/OpenCL - updated to include an OpenCLDetection project. I haven't actually tested anything yet. I obviously meant to, or was going to use the project to help convert the include file, I think.

I've done some stuff with Vulkan as well: some conversion of the includes and some minor testing, using the examples/samples/tutorials from the SDK as a basis to see if it would work. I only got so far before getting distracted, and haven't got back to it. I got up to the 4th example, 04-init_command_buffer, but can't recall if that is working. https://github.com/mrfearless/libraries/tree/master/Vulkan

Biterider

Hi fearless
Thank you very much. It is a beginning...  :t

Anyone else?

Biterider

daydreamer

Quote from: Biterider on April 30, 2019, 09:57:23 PM
Hi fearless
Thank you very much. It is a beginning...  :t

Anyone else?

Biterider
Take a look at Hitchhiker's 1k demos in the old forum. He had just a minimal D3D or OpenGL call to set up a quad, with the pixel shader in clear text. Pixel shaders can be used together with texture maps as arrays.




LiaoMi

Quote from: Biterider on April 30, 2019, 07:26:14 PM
Hi
I would like to ask if anyone has succeeded in interfacing with the GPU from asm.
Using CUDA or DirectCompute seems the right way to go.  :idea:

It would be nice if you could provide an example in asm.  ;)

Regards, Biterider

Cuda.. PTX assembly... etc..
http://masm32.com/board/index.php?topic=4711.0

Translating GPU Binaries to Tiered SIMD Architectures
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.147.1333&rep=rep1&type=pdf


LiaoMi

CudaPAD is a PTX/SASS viewer for NVIDIA CUDA kernels that provides an on-the-fly view of the assembly.
https://github.com/SunsetQuest/CudaPAD

CudaPAD
http://www.codeproject.com/Articles/999744/CudaPAD


LiaoMi

Hi,

PTX ISA (PDF) - v10.1.105 (older) - Last updated March 26, 2019 https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf

PTX Writer's Guide to Interoperability https://docs.nvidia.com/cuda/pdf/PTX_Writers_Guide_To_Interoperability.pdf

CUDA Toolkit 10.1 Download - https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal

Cuda_Assembly Example exe + src + inc  8)

format PE64 GUI 5.0
entry start

include 'win64a.inc'
include 'cuda.inc'

section '.text' code readable executable

  start:
  push rbp

     ; init
   lea rbx,[_cuInit]
invoke cuInit,0
  test eax,eax
   jnz Error

   lea rbx,[_cuDeviceGet]
invoke cuDeviceGet,CudaDevice,0   ; get the first device
  test eax,eax
   jnz Error

   lea rbx,[_cuCtxCreate]
invoke cuCtxCreate,CudaContext,CU_CTX_SCHED_SPIN+CU_CTX_MAP_HOST,[CudaDevice]     ; this context associates the device with this cpu thread
  test eax,eax
   jnz Error

   lea rbx,[_cuMemAlloc]
invoke cuMemAlloc,CudaNumberArray,256*4
  test eax,eax
   jnz Error

    ; load ptx source
   lea rbx,[_cuModuleLoadData]
invoke cuModuleLoadData,CudaModule,PTXSourceData
  test eax,eax
   jnz Error

   lea rbx,[_cuModuleGetFunction]
invoke cuModuleGetFunction,CudaFunction,[CudaModule],PTXFunction
  test eax,eax
   jnz Error

    ; fill in NumberArray and LogNumberArray with x87 calculations for comparison
  fld1
  fld1
   mov ecx,256*4
   lea rax,[NumberArray+rcx]
   lea rbx,[LogNumberArray+rcx]
   neg rcx
   @@:    fst dword[rax+rcx]
  fld1
   fld st1
fyl2x
  fstp dword[rbx+rcx]
  fadd st0,st1
   add rcx,4
   jnz @b
fcompp

    ; call the function
   lea rbx,[_cuMemcpyHtoD]
invoke cuMemcpyHtoD,[CudaNumberArray],NumberArray,256*4
  test eax,eax
   jnz Error


   lea rbx,[_cuParamSetSize]
invoke cuParamSetSize,[CudaFunction],12    ; 8-byte first argument + 4-byte second
  test eax,eax
   jnz Error

   mov rax,[CudaNumberArray]    ; fill in the arguments
   mov qword[Message+0],rax    ;
   mov dword[Message+8],256    ;
   lea rbx,[_cuParamSetv]
invoke cuParamSetv,[CudaFunction],0,Message,12        ; and pass in    (cuParamSeti doesn't always work so well)
  test eax,eax
   jnz Error

   lea rbx,[_cuFuncSetBlockShape]
invoke cuFuncSetBlockShape,[CudaFunction],256,1,1
  test eax,eax
   jnz Error

   lea rbx,[_cuLaunchGrid]
invoke cuLaunchGrid,[CudaFunction],1,1
  test eax,eax
   jnz Error

   lea rbx,[_cuCtxSynchronize]
invoke cuCtxSynchronize
  test eax,eax
   jnz Error

   lea rbx,[_cuMemcpyDtoH]
invoke cuMemcpyDtoH,CudaLogNumberArray,[CudaNumberArray],256*4
  test eax,eax
   jnz Error

   ; clean up
   lea rbx,[_cuMemFree]
invoke cuMemFree,[CudaNumberArray]
  test eax,eax
   jnz Error

   lea rbx,[_cuCtxDestroy]
invoke cuCtxDestroy,[CudaContext]
  test eax,eax
   jnz Error

   lea rdi,[Message]
   cld
   mov rax,'32bit fl'
stosq
   mov rax,'oats:  x'
stosq
   mov rax,'   lg2.a'
stosq
   mov rax,'pprox.f3'
stosq
   mov rax,'2(x)    '
stosq
   mov rax,'fyl2x(x,'
stosq
   mov rax,'1.0)    '
stosq
   mov al,10
stosb
   xor ebx,ebx
     @@:   fld dword[NumberArray+4*rbx]
  fstp qword[rsp-8]
   mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi],xmm0
   fld dword[CudaLogNumberArray+4*rbx]
  fstp qword[rsp-8]
   mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi+16],xmm0
   fld dword[LogNumberArray+4*rbx]
  fstp qword[rsp-8]
   mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi+32],xmm0
   mov byte[rdi+48],10
   add rdi,49
   add rbx,1
   cmp rbx,20
    jb @b
   mov byte[rdi],0
invoke MessageBox,NULL,Message,NULL,MB_OK

invoke ExitProcess,0

Error:
   mov ecx,42
   cmp eax,8
    ja @f
   mul ecx
   lea rax,[err000+rax]
   jmp .Print
@@:
   cmp eax,100
    jb @f
   cmp eax,101
    ja @f
   sub eax,100
   mul ecx
   lea rax,[err100+rax]
   jmp .Print
@@:
   cmp eax,200
    jb @f
   cmp eax,216
    ja @f
   sub eax,200
   mul ecx
   lea rax,[err200+rax]
   jmp .Print
@@:
   cmp eax,300
    jb @f
   cmp eax,304
    ja @f
   sub eax,300
   mul ecx
   lea rax,[err300+rax]
   jmp .Print
@@:
   cmp eax,400
   jne @f
   sub eax,400
   mul ecx
   lea rax,[err400+rax]
   jmp .Print
@@:
   cmp eax,500
   jne @f
   sub eax,500
   mul ecx
   lea rax,[err500+rax]
   jmp .Print
@@:
   cmp eax,600
   jne @f
   sub eax,600
   mul ecx
   lea rax,[err600+rax]
   jmp .Print
@@:
   cmp eax,700
    jb @f
   cmp eax,709
    ja @f
   sub eax,700
   mul ecx
   lea rax,[err700+rax]
   jmp .Print
@@:
   cmp eax,999
   jne @f
   sub eax,999
   mul ecx
   lea rax,[err999+rax]
   jmp .Print
@@:
   lea rax,[errNoMatch]

  .Print:
   mov rdx,[rbx+0]
   mov qword[Message+0],rdx
   mov rdx,[rbx+8]
   mov qword[Message+8],rdx
   mov edx,[rbx+16]
   mov dword[Message+16],edx
   mov byte[Message+19],10

   mov rdx,[rax+0]
   mov qword[Message+20],rdx
   mov rdx,[rax+8]
   mov qword[Message+28],rdx
   mov rdx,[rax+16]
   mov qword[Message+36],rdx
   mov rdx,[rax+24]
   mov qword[Message+44],rdx
   mov rdx,[rax+32]
   mov qword[Message+52],rdx
   mov edx,[rax+40]
   mov dword[Message+60],edx
   mov byte[Message+62],0

invoke MessageBox,NULL,Message,NULL,MB_OK
invoke ExitProcess,0

section '.data' data readable

  PTXFunction: db 'log_2',0

; log_2(*inout,length):
; xind =  %ctaid.x * %ntid.x + %tid.x
; if xind < length , [inout + 4 * xind] = log2([inout + 4 * xind])
; return

  PTXSourceData:
db '     .version 1.4'
db '     .target sm_13'
db '        .entry log_2 ('
db '                .param .u64 _inout,'
db '                .param .s32 _length)'
db '        {'
db '        .reg .u16 w1,w2;'       ; word-sized registers
db '        .reg .u32 e1,xind;'       ; dwords
db '        .reg .u64 r1;'       ; qwords
db '        .reg .f32 f1;'       ; floats
db '        .reg .pred p1;'       ; conditions
db '        mov.u16         w1,%ctaid.x;'
db '        mov.u16         w2,%ntid.x;'
db '        cvt.u32.u16     e1,%tid.x;'
db '        mad.wide.u16    xind,w1,w2,e1;'
db '        ld.param.s32    e1,[_length];'
db '        setp.le.s32     p1,e1,xind;'
db '   @p1  bra             $Lt_Exit;'       ; if p1, branch
db '        ld.param.u64    r1,[_inout];'
db '        mad.wide.u32    r1,xind,4,r1;'
db '        ld.global.f32   f1,[r1+0];'
db '        lg2.approx.f32  f1,f1;'
db '        st.global.f32   [r1+0],f1;'
db '$Lt_Exit:'
db '        exit;'
db '        }'
db 0

section '.data' data readable writeable

align 16
  ErrorMessageFormat db 'error code:',10,'hex: 0x%.8x',10,'dec: %u',0
  MessageFormat db '%16.7f',0
err000:
db    'CUDA_SUCCESS                              ';= 0
db    'CUDA_ERROR_INVALID_VALUE                  ';= 1
db    'CUDA_ERROR_OUT_OF_MEMORY                  ';= 2
db    'CUDA_ERROR_NOT_INITIALIZED                ';= 3
db    'CUDA_ERROR_DEINITIALIZED                  ';= 4
db    'CUDA_ERROR_PROFILER_DISABLED              ';= 5
db    'CUDA_ERROR_PROFILER_NOT_INITIALIZED       ';= 6
db    'CUDA_ERROR_PROFILER_ALREADY_STARTED       ';= 7
db    'CUDA_ERROR_PROFILER_ALREADY_STOPPED       ';= 8
err100:
db    'CUDA_ERROR_NO_DEVICE                      ';= 100
db    'CUDA_ERROR_INVALID_DEVICE                 ';= 101
err200:
db    'CUDA_ERROR_INVALID_IMAGE                  ';= 200
db    'CUDA_ERROR_INVALID_CONTEXT                ';= 201
db    'CUDA_ERROR_CONTEXT_ALREADY_CURRENT        ';= 202
db    '?                                         ';=
db    '?                                         ';=
db    'CUDA_ERROR_MAP_FAILED                     ';= 205
db    'CUDA_ERROR_UNMAP_FAILED                   ';= 206
db    'CUDA_ERROR_ARRAY_IS_MAPPED                ';= 207
db    'CUDA_ERROR_ALREADY_MAPPED                 ';= 208
db    'CUDA_ERROR_NO_BINARY_FOR_GPU              ';= 209
db    'CUDA_ERROR_ALREADY_ACQUIRED               ';= 210
db    'CUDA_ERROR_NOT_MAPPED                     ';= 211
db    'CUDA_ERROR_NOT_MAPPED_AS_ARRAY            ';= 212
db    'CUDA_ERROR_NOT_MAPPED_AS_POINTER          ';= 213
db    'CUDA_ERROR_ECC_UNCORRECTABLE              ';= 214
db    'CUDA_ERROR_UNSUPPORTED_LIMIT              ';= 215
db    'CUDA_ERROR_CONTEXT_ALREADY_IN_USE         ';= 216
err300:
db    'CUDA_ERROR_INVALID_SOURCE                 ';= 300
db    'CUDA_ERROR_FILE_NOT_FOUND                 ';= 301
db    'CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND ';= 302
db    'CUDA_ERROR_SHARED_OBJECT_INIT_FAILED      ';= 303
db    'CUDA_ERROR_OPERATING_SYSTEM               ';= 304
err400:
db    'CUDA_ERROR_INVALID_HANDLE                 ';= 400
err500:
db    'CUDA_ERROR_NOT_FOUND                      ';= 500
err600:
db    'CUDA_ERROR_NOT_READY                      ';= 600
err700:
db    'CUDA_ERROR_LAUNCH_FAILED                  ';= 700
db    'CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES        ';= 701
db    'CUDA_ERROR_LAUNCH_TIMEOUT                 ';= 702
db    'CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING  ';= 703
db    'CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED    ';= 704
db    'CUDA_ERROR_PEER_ACCESS_NOT_ENABLED        ';= 705
db    '?                                         ';=
db    '?                                         ';=
db    'CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE         ';= 708
db    'CUDA_ERROR_CONTEXT_IS_DESTROYED           ';= 709
err999:
db    'CUDA_ERROR_UNKNOWN                        ';= 999
errNoMatch:
db    '?                                         '


_cuInit: db 'cuInit              '
_cuDeviceGet: db 'cuDeviceGet         '
_cuCtxCreate: db 'cuCtxCreate         '
_cuMemAlloc: db 'cuMemAlloc          '
_cuModuleLoadData: db 'cuModuleLoadData    '
_cuModuleGetFunction: db 'cuModuleGetFunction '
_cuMemcpyHtoD: db 'cuMemcpyHtoD        '
_cuParamSeti: db 'cuParamSeti         '
_cuParamSetv: db 'cuParamSetv         '
_cuFuncSetBlockShape: db 'cuFuncSetBlockShape '
_cuLaunchGrid: db 'cuLaunchGrid        '
_cuParamSetSize: db 'cuParamSetSize      '
_cuCtxSynchronize: db 'cuCtxSynchronize    '
_cuMemcpyDtoH: db 'cuMemcpyDtoH        '
_cuMemFree: db 'cuMemFree           '
_cuCtxDestroy: db 'cuCtxDestroy        '



align 16
  CudaDevice    dq ?
  CudaContext    dq ?
  CudaModule    dq ?
  CudaFunction    dq ?
  CudaNumberArray  dq ?
  NumberArray    rd 256
  LogNumberArray   rd 256
  CudaLogNumberArray rd 256
  Temp rb 32
  Message rb 50*60

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
  user32,'USER32.DLL',\
  msvcrt,'MSVCRT.DLL',\
  cuda,'NVCUDA.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'
  include 'api_cuda.inc'

import msvcrt,\
sprintf,'sprintf'
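
A minimal sketch, in C, of the argument layout the listing above hands to cuParamSetv: cuParamSetSize is given 12 bytes, with the 64-bit device pointer at offset 0 and the 32-bit element count at offset 8. The helper name pack_log2_params is hypothetical and is not part of the CUDA API; the code makes no CUDA calls and only illustrates the byte layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a CUDA function): packs the two log_2 kernel
   arguments the way the FASM listing fills its Message buffer.
   Offset 0: 64-bit device pointer, offset 8: 32-bit element count.
   Returns the number of bytes written (the value given to cuParamSetSize). */
static size_t pack_log2_params(unsigned char *buf, uint64_t device_ptr,
                               int32_t length)
{
    assert(buf != NULL);
    memcpy(buf + 0, &device_ptr, sizeof device_ptr); /* mov qword[Message+0],rax */
    memcpy(buf + 8, &length, sizeof length);         /* mov dword[Message+8],256 */
    return sizeof device_ptr + sizeof length;        /* 8 + 4 = 12 */
}
```

Using memcpy rather than pointer casts keeps the sketch free of alignment and aliasing assumptions; the caller is assumed to supply a buffer of at least 12 bytes.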


PTX Data
section '.data' data readable

  PTXFunction: db 'log_2',0

; log_2(*inout,length):
; xind =  %ctaid.x * %ntid.x + %tid.x
; if xind < length , [inout + 4 * xind] = log2([inout + 4 * xind])
; return

  PTXSourceData:
db '     .version 1.4'
db '     .target sm_13'
db '        .entry log_2 ('
db '                .param .u64 _inout,'
db '                .param .s32 _length)'
db '        {'
db '        .reg .u16 w1,w2;'       ; word-sized registers
db '        .reg .u32 e1,xind;'       ; dwords
db '        .reg .u64 r1;'       ; qwords
db '        .reg .f32 f1;'       ; floats
db '        .reg .pred p1;'       ; conditions
db '        mov.u16         w1,%ctaid.x;'
db '        mov.u16         w2,%ntid.x;'
db '        cvt.u32.u16     e1,%tid.x;'
db '        mad.wide.u16    xind,w1,w2,e1;'
db '        ld.param.s32    e1,[_length];'
db '        setp.le.s32     p1,e1,xind;'
db '   @p1  bra             $Lt_Exit;'       ; if p1, branch
db '        ld.param.u64    r1,[_inout];'
db '        mad.wide.u32    r1,xind,4,r1;'
db '        ld.global.f32   f1,[r1+0];'
db '        lg2.approx.f32  f1,f1;'
db '        st.global.f32   [r1+0],f1;'
db '$Lt_Exit:'
db '        exit;'
db '        }'
db 0
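
For comparison, here is a hedged CPU-side sketch in C of what the comment block of the log_2 kernel describes: each thread computes xind = ctaid.x * ntid.x + tid.x and touches inout[xind] only when xind < length. The names emulate_log2_launch, elem_op and twice are made up for illustration, and the per-element operation is passed in as a function pointer so that the sketch does not depend on a math library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-element operation; the real kernel uses lg2.approx.f32 here. */
typedef float (*elem_op)(float);

/* Hypothetical CPU emulation of a 1-D launch of the log_2 kernel:
   grid_x blocks of block_x threads, one array element per thread. */
static void emulate_log2_launch(float *inout, int32_t length,
                                uint32_t grid_x, uint32_t block_x, elem_op op)
{
    assert(inout != NULL && op != NULL);
    for (uint32_t ctaid = 0; ctaid < grid_x; ++ctaid) {
        for (uint32_t tid = 0; tid < block_x; ++tid) {
            uint32_t xind = ctaid * block_x + tid;  /* mad.wide.u16 xind,w1,w2,e1 */
            if (length <= (int32_t)xind)            /* setp.le.s32 p1,e1,xind */
                continue;                           /* @p1 bra $Lt_Exit */
            inout[xind] = op(inout[xind]);          /* ld / lg2.approx / st */
        }
    }
}

/* Stand-in operation used only to exercise the indexing and the guard. */
static float twice(float x) { return 2.0f * x; }
```

Passing log2f from <math.h> as op (and linking the math library) should give values comparable to the fyl2x column of the message box, within the accuracy of lg2.approx.f32.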


LiaoMi

With the toolkit from the posts above, you can freely write modern code; all that remains is to find an idea for a super fast algorithm  :P

Biterider

Thanks LiaoMi
Very good and interesting work  :t

Biterider

LiaoMi

Hi Biterider,

The example is old, but perfect for demonstration!  :icon14:
mrfearless has a different set of API functions https://github.com/mrfearless/libraries/tree/master/Cuda, maybe because of the CUDA Toolkit version, although the compiled exe works fine for me. We can easily rewrite the example for UASM; the main thing is to understand how to use the new 10.1 API  :badgrin:

Nsight Productivity Utilities PTX Debugger + Tools

NVIDIA® Nsight™ Systems
NVIDIA® Nsight Systems™ is a system-wide performance analysis tool designed to visualize an application's algorithms, identify the largest optimization opportunities, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large servers to our smallest SoC.

NVIDIA® Nsight™ Graphics
NVIDIA® Nsight Graphics™ is a standalone application for the debugging, profiling, and analysis of graphics applications. It allows you to optimize the performance of your Direct3D 11, Direct3D 12, DirectX Raytracing, OpenGL, Vulkan, and NVIDIA VKRay based applications.

NVIDIA® Nsight™ Eclipse Edition
NVIDIA® Nsight™ Eclipse Edition is a full-featured IDE powered by the Eclipse platform that provides an all-in-one integrated environment to edit, build, debug, and profile CUDA-C applications. Nsight Eclipse Edition supports a rich set of commercial and free plugins.

NVIDIA® Nsight™ Compute
NVIDIA® Nsight™ Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. Nsight Compute also provides customizable and data-driven user interface and metric collection that can be extended with analysis scripts for post-processing results.

NVIDIA® Nsight™ Visual Studio Edition
An application development environment for heterogeneous platforms, Nsight Visual Studio Edition brings GPU computing into Microsoft Visual Studio. Build, debug, profile and trace heterogeneous compute, graphics, virtual reality, RTX, .NET, and UWP applications built with CUDA C/C++, OpenCL, DirectCompute, Direct3D (11,12,DXR), Vulkan (1.1, Vulkan Ray Tracing Extension), OpenGL, OpenVR, and the Oculus SDK.
Download NVIDIA® Nsight™ Visual Studio Edition 2019.1 - https://developer.nvidia.com/gameworksdownload#?dn=nsight-visual-studio-edition-2019-1 (free registration is required, then you can download the file NVIDIA_Nsight_Visual_Studio_Edition_Win64_2019.1.1.19056.msi) 492 MB  :eusa_boohoo:


fearless

There have probably been a good few changes since I last looked at it. I think the version of the CUDA SDK I was working with was primarily v8, just to see if I could get a few things working, mainly with the NVML part. So any APIs and include file conversions are probably based on v8. I think there was some reason for only using x64 asm, though I can't quite recall why x64 and not x86; maybe there wasn't an x86 DLL for NVML or something (I think).
Anyhow, it looks like you have a lot of stuff figured out, so it looks promising.

LiaoMi

Hi fearless,

I checked your SDK for the presence of the API functions and compared it with the original version 10.1. My conclusion: your SDK is not complete, but all the functions it contains are present in 10.1; nothing has changed for them in the new CUDA versions.

I attach my version of the converted .H files https://www.solidfiles.com/v/yR7MaKLr5Vkx6; the library files are, as usual, in CUDA Toolkit 10.1.

I did not convert two folders, because they contain templates and classes that the converter does not support:
thrust/detail
thrust/system


I have not tried to convert the .H files using the h2incX utility; I did the conversion the quick way in order to cover more files. h2incX needs a graphical front-end: a separate program that generates the flags, creates subdirectories during conversion and manages the conversion list. UPX has a similar separate interface.

The Linux "Cuda_Assembly Example" version can be found below, FASM source code  :icon14:

P.S. The converter did not give any errors, but you may have to correct something  :bgrin:

LiaoMi

First bugs after conversion :biggrin:

C++
/**
* External memory buffer descriptor
*/
typedef struct CUDA_EXTERNAL_MEMORY_BUFFER_DESC_st {
    /**
     * Offset into the memory object where the buffer's base is
     */
    unsigned long long offset;
    /**
     * Size of the buffer
     */
    unsigned long long size;
    /**
     * Flags reserved for future use. Must be zero.
     */
    unsigned int flags;
    unsigned int reserved[16];
} CUDA_EXTERNAL_MEMORY_BUFFER_DESC;


Conversion

CUDA_EXTERNAL_MEMORY_BUFFER_DESC STRUCT DEFALIGNMASM
long DWORD ?
long DWORD ?
flags DWORD ?
reserved DWORD 16 dup (?)
CUDA_EXTERNAL_MEMORY_BUFFER_DESC ENDS


:redface: How many bytes is unsigned long long?
https://stackoverflow.com/questions/5836329/how-many-bytes-is-unsigned-long-long
https://en.cppreference.com/w/cpp/language/types  :t
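
To make the answer concrete, here is a small C sketch. The C standard only guarantees that unsigned long long is at least 64 bits wide; on the usual Windows x64 and Linux x86-64 compilers it is exactly 8 bytes, which corresponds to QWORD in MASM rather than the DWORD the converter emitted. The struct below is a local copy of the header's layout under a hypothetical name, and the offsets checked are an assumption about those common ABIs, not a guarantee for every platform.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Local copy of the descriptor from the header above (hypothetical name). */
typedef struct {
    unsigned long long offset;    /* MASM: QWORD */
    unsigned long long size;      /* MASM: QWORD */
    unsigned int flags;           /* MASM: DWORD */
    unsigned int reserved[16];    /* MASM: DWORD 16 dup (?) */
} ExternalMemoryBufferDesc;

/* Guaranteed by the C standard: at least 64 bits. */
static_assert(sizeof(unsigned long long) * CHAR_BIT >= 64,
              "unsigned long long must hold at least 64 bits");

/* Returns 1 when the layout matches a QWORD/QWORD/DWORD/DWORD[16]
   translation, as expected on common 64-bit ABIs. */
static int layout_matches_qword_translation(void)
{
    return sizeof(unsigned long long) == 8
        && offsetof(ExternalMemoryBufferDesc, offset) == 0
        && offsetof(ExternalMemoryBufferDesc, size) == 8
        && offsetof(ExternalMemoryBufferDesc, flags) == 16
        && offsetof(ExternalMemoryBufferDesc, reserved) == 20;
}
```

With two DWORD fields in place of the QWORDs, flags would land at offset 8 instead of 16, so every field after the first would be read from the wrong place. The converted struct also lost the field names offset and size, which became "long".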

LiaoMi

Hi again  :biggrin:,

I can successfully compile with the UASM assembler. There is one problem: the header files let you control the API version. In my case version 2 is used, but it should be version 1 ... cuCtxCreate_v2 -> cuCtxCreate

JMP     QWORD PTR DS:[<&cuInit>]
JMP     QWORD PTR DS:[<&cuDeviceGet>]
JMP     QWORD PTR DS:[<&cuCtxCreate_v2>]
JMP     QWORD PTR DS:[<&cuCtxDestroy_v2>]
JMP     QWORD PTR DS:[<&cuCtxSynchronize>]
JMP     QWORD PTR DS:[<&cuModuleLoadData>]
JMP     QWORD PTR DS:[<&cuModuleGetFunction>]
JMP     QWORD PTR DS:[<&cuMemAlloc_v2>]
JMP     QWORD PTR DS:[<&cuMemFree_v2>]
JMP     QWORD PTR DS:[<&cuMemcpyHtoD_v2>]
JMP     QWORD PTR DS:[<&cuMemcpyDtoH_v2>]
JMP     QWORD PTR DS:[<&cuFuncSetBlockShape>]
JMP     QWORD PTR DS:[<&cuParamSetSize>]
JMP     QWORD PTR DS:[<&cuParamSetv>]
JMP     QWORD PTR DS:[<&cuLaunchGrid>]
JMP     QWORD PTR DS:[<&FatalExit>]
JMP     QWORD PTR DS:[<&MessageBoxA>]
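
A hedged sketch in C of how this kind of header-controlled versioning can work: a version macro decides whether the plain name is redirected to the _v2 export with a #define. All names here (DEMO_API_VERSION, demo_ctx_create, demo_ctx_create_v2, create_context) are hypothetical stand-ins, and the 3020 threshold is an assumption for illustration; the exact conditions should be checked against the real cuda.h.

```c
#include <assert.h>

/* Hypothetical stand-ins for a legacy export and its versioned successor. */
static int demo_ctx_create(void)    { return 1; }
static int demo_ctx_create_v2(void) { return 2; }

/* The header picks a default API version unless the user forces one. */
#ifndef DEMO_API_VERSION
#define DEMO_API_VERSION 10010
#endif

/* From the assumed threshold on, the plain name silently maps to _v2. */
#if DEMO_API_VERSION >= 3020
#define demo_ctx_create demo_ctx_create_v2
#endif

/* Caller code keeps using the plain name; the preprocessor decides
   which function is actually referenced (and therefore imported). */
static int create_context(void)
{
    return demo_ctx_create();
}
```

With the default version the call resolves to the _v2 function, which matches the _v2 names in the import list above. A converted .inc file has to reproduce the same redirection, or the assembler will reference a different export than the C header would.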


I can adjust the parameters in the macro, but then UASM reports Error A2101: Macro nesting level too deep  :icon_confused:


E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4879) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4879): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4909) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4909): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5719) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5719): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5763) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5763): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5796) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5796): Included by
  mywindow1.asm(32): Main line code


Where can I patch the macro nesting depth?!  :icon_exclaim: I still use version 2.47.1.0 because of a bug in version 2.48  :idea:

LiaoMi

In my opinion, everything in the file cuda.inc is correct:

#if defined(CUDA_FORCE_API_VERSION)
    #if (CUDA_FORCE_API_VERSION == 3010)
        #define __CUDA_API_VERSION 3010
    #else
        #error "Unsupported value of CUDA_FORCE_API_VERSION"
    #endif
#else
    #define __CUDA_API_VERSION 10010
#endif /* CUDA_FORCE_API_VERSION */



Why are different functions reported as unavailable?  :icon_eek: After the experiments, I realized that the problem is in the page alignment. I forgot to remove the /Force flag and added /LARGEADDRESSAWARE:NO, but there was no change: the API functions have invalid addresses.