The MASM Forum

General => The Workshop => Topic started by: Biterider on April 30, 2019, 07:26:14 PM

Title: Using the GPU
Post by: Biterider on April 30, 2019, 07:26:14 PM
Hi
I would like to ask if anyone has succeeded in interfacing with the GPU from asm.
Using CUDA or DirectCompute seems the right way to go.  :idea:

It would be nice if you could provide an example in asm.  ;)

Regards, Biterider
Title: Re: Using the GPU
Post by: fearless on April 30, 2019, 08:31:04 PM
I've done some conversion of CUDA and NVML for 64-bit asm:

https://github.com/mrfearless/libraries/tree/master/Cuda
I've updated my repo to include a CudaDetection RadASM project I used whilst testing some stuff out, mainly using NVML to detect the GPU; I haven't got round to the CUDA part or to testing any of the functions.

You have to use the libraries and DLLs from the appropriate CUDA SDK version - I have a folder for the v8.0 SDK and one for v9.1; just copy from the appropriate folder and overwrite in the CudaDetection folder. Currently the CudaDetection folder contains the 9.1 libs and DLL files.
I also played around with OpenCL a little (collected a few libs, converted an include file): https://github.com/mrfearless/libraries/tree/master/OpenCL - updated to include an OpenCLDetection project. I haven't actually tested anything yet - I was going to use the project to help convert the include file, I think.

I've done some stuff with Vulkan as well: some conversion of the includes and some minor testing, using the examples/samples/tutorials from the SDK as a basis to see if it would work. I only got so far before getting distracted, so I haven't got back to it. I got up to the 4th example, 04-init_command_buffer, but can't recall if that is working. https://github.com/mrfearless/libraries/tree/master/Vulkan
Title: Re: Using the GPU
Post by: Biterider on April 30, 2019, 09:57:23 PM
Hi fearless
Thank you very much. It is a beginning...  :t

Anyone else?

Biterider
Title: Re: Using the GPU
Post by: daydreamer on April 30, 2019, 11:27:22 PM
Quote from: Biterider on April 30, 2019, 09:57:23 PM
Hi fearless
Thank you very much. It is a beginning...  :t

Anyone else?

Biterider
Take a look at Hitchhiker's 1k demos in the old forum: he had just a minimal D3D or OpenGL setup with a quad and a pixel shader in clear text. Pixel shaders can be used together with texture maps as arrays.



Title: Re: Using the GPU
Post by: LiaoMi on May 01, 2019, 09:47:01 AM
Quote from: Biterider on April 30, 2019, 07:26:14 PM
Hi
I would like to ask if anyone has succeeded in interfacing with the GPU from asm.
Using CUDA or DirectCompute seems the right way to go.  :idea:

It would be nice if you could provide an example in asm.  ;)

Regards, Biterider

Cuda.. PTX assembly... etc..
http://masm32.com/board/index.php?topic=4711.0 (http://masm32.com/board/index.php?topic=4711.0)

Translating GPU Binaries to Tiered SIMD Architectures
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.147.1333&rep=rep1&type=pdf (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.147.1333&rep=rep1&type=pdf)

(https://i.ibb.co/hRSG6XK/CudaPTX.png)
Title: Re: Using the GPU
Post by: LiaoMi on May 01, 2019, 10:01:30 AM
CudaPAD is a PTX/SASS viewer for NVIDIA CUDA kernels that provides an on-the-fly view of the assembly.
https://github.com/SunsetQuest/CudaPAD (https://github.com/SunsetQuest/CudaPAD)

CudaPAD
http://www.codeproject.com/Articles/999744/CudaPAD

(https://cloud.githubusercontent.com/assets/10804507/21625199/711f6e1e-d1bf-11e6-888f-d30ddf585231.png)
Title: Re: Using the GPU
Post by: LiaoMi on May 12, 2019, 12:17:48 AM
Hi,

PTX ISA (PDF) - v10.1.105 (older) - Last updated March 26, 2019 https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf (https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf)

PTX Writer's Guide to Interoperability https://docs.nvidia.com/cuda/pdf/PTX_Writers_Guide_To_Interoperability.pdf (https://docs.nvidia.com/cuda/pdf/PTX_Writers_Guide_To_Interoperability.pdf)

CUDA Toolkit 10.1 Download - https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal (https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal)

Cuda_Assembly Example exe + src + inc  8)

format PE64 GUI 5.0
entry start

include 'win64a.inc'
include 'cuda.inc'

section '.text' code readable executable

  start:
  push rbp

     ; init
   lea rbx,[_cuInit]
invoke cuInit,0
  test eax,eax
   jnz Error

   lea rbx,[_cuDeviceGet]
invoke cuDeviceGet,CudaDevice,0   ; get the first device
  test eax,eax
   jnz Error

   lea rbx,[_cuCtxCreate]
invoke cuCtxCreate,CudaContext,CU_CTX_SCHED_SPIN+CU_CTX_MAP_HOST,[CudaDevice]     ; this context associates the device with this cpu thread
  test eax,eax
   jnz Error

   lea rbx,[_cuMemAlloc]
invoke cuMemAlloc,CudaNumberArray,256*4
  test eax,eax
   jnz Error

    ; load ptx source
   lea rbx,[_cuModuleLoadData]
invoke cuModuleLoadData,CudaModule,PTXSourceData
  test eax,eax
   jnz Error

   lea rbx,[_cuModuleGetFunction]
invoke cuModuleGetFunction,CudaFunction,[CudaModule],PTXFunction
  test eax,eax
   jnz Error

    ; fill in NumberArray and LogNumberArray with x87 calculations for comparison
  fld1
  fld1
   mov ecx,256*4
   lea rax,[NumberArray+rcx]
   lea rbx,[LogNumberArray+rcx]
   neg rcx
   @@:    fst dword[rax+rcx]
  fld1
   fld st1
fyl2x
  fstp dword[rbx+rcx]
  fadd st0,st1
   add rcx,4
   jnz @b
fcompp

    ; call the function
   lea rbx,[_cuMemcpyHtoD]
invoke cuMemcpyHtoD,[CudaNumberArray],NumberArray,256*4
  test eax,eax
   jnz Error


   lea rbx,[_cuParamSetSize]
invoke cuParamSetSize,[CudaFunction],12    ; 8-byte first argument + 4-byte second
  test eax,eax
   jnz Error

   mov rax,[CudaNumberArray]    ; fill in the arguments
   mov qword[Message+0],rax    ;
   mov dword[Message+8],256    ;
   lea rbx,[_cuParamSetv]
invoke cuParamSetv,[CudaFunction],0,Message,12        ; and pass in    (cuParamSeti doesn't always work so well)
  test eax,eax
   jnz Error

   lea rbx,[_cuFuncSetBlockShape]
invoke cuFuncSetBlockShape,[CudaFunction],256,1,1
  test eax,eax
   jnz Error

   lea rbx,[_cuLaunchGrid]
invoke cuLaunchGrid,[CudaFunction],1,1
  test eax,eax
   jnz Error

   lea rbx,[_cuCtxSynchronize]
invoke cuCtxSynchronize
  test eax,eax
   jnz Error

   lea rbx,[_cuMemcpyDtoH]
invoke cuMemcpyDtoH,CudaLogNumberArray,[CudaNumberArray],256*4
  test eax,eax
   jnz Error

   ; clean up
   lea rbx,[_cuMemFree]
invoke cuMemFree,[CudaNumberArray]
  test eax,eax
   jnz Error

   lea rbx,[_cuCtxDestroy]
invoke cuCtxDestroy,[CudaContext]
  test eax,eax
   jnz Error

   lea rdi,[Message]
   cld
   mov rax,'32bit fl'
stosq
   mov rax,'oats:  x'
stosq
   mov rax,'   lg2.a'
stosq
   mov rax,'pprox.f3'
stosq
   mov rax,'2(x)    '
stosq
   mov rax,'fyl2x(x,'
stosq
   mov rax,'1.0)    '
stosq
   mov al,10
stosb
   xor ebx,ebx
     @@:   fld dword[NumberArray+4*rbx]
  fstp qword[rsp-8]
   mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi],xmm0
   fld dword[CudaLogNumberArray+4*rbx]
  fstp qword[rsp-8]
   mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi+16],xmm0
   fld dword[LogNumberArray+4*rbx]
  fstp qword[rsp-8]
   mov rax,qword[rsp-8]
invoke sprintf,Temp,MessageFormat,rax
movdqu xmm0,xword[Temp]
movdqu [rdi+32],xmm0
   mov byte[rdi+48],10
   add rdi,49
   add rbx,1
   cmp rbx,20
    jb @b
   mov byte[rdi],0
invoke MessageBox,NULL,Message,NULL,MB_OK

invoke ExitProcess,0

Error:
   mov ecx,42
   cmp eax,8
    ja @f
   mul ecx
   lea rax,[err000+rax]
   jmp .Print
@@:
   cmp eax,100
    jb @f
   cmp eax,101
    ja @f
   sub eax,100
   mul ecx
   lea rax,[err100+rax]
   jmp .Print
@@:
   cmp eax,200
    jb @f
   cmp eax,216
    ja @f
   sub eax,200
   mul ecx
   lea rax,[err200+rax]
   jmp .Print
@@:
   cmp eax,300
    jb @f
   cmp eax,304
    ja @f
   sub eax,300
   mul ecx
   lea rax,[err300+rax]
   jmp .Print
@@:
   cmp eax,400
   jne @f
   sub eax,400
   mul ecx
   lea rax,[err400+rax]
   jmp .Print
@@:
   cmp eax,500
   jne @f
   sub eax,500
   mul ecx
   lea rax,[err500+rax]
   jmp .Print
@@:
   cmp eax,600
   jne @f
   sub eax,600
   mul ecx
   lea rax,[err600+rax]
   jmp .Print
@@:
   cmp eax,700
    jb @f
   cmp eax,709
    ja @f
   sub eax,700
   mul ecx
   lea rax,[err700+rax]
   jmp .Print
@@:
   cmp eax,999
   jne @f
   sub eax,999
   mul ecx
   lea rax,[err999+rax]
   jmp .Print
@@:
   lea rax,[errNoMatch]

  .Print:
   mov rdx,[rbx+0]
   mov qword[Message+0],rdx
   mov rdx,[rbx+8]
   mov qword[Message+8],rdx
   mov edx,[rbx+16]
   mov dword[Message+16],edx
   mov byte[Message+19],10

   mov rdx,[rax+0]
   mov qword[Message+20],rdx
   mov rdx,[rax+8]
   mov qword[Message+28],rdx
   mov rdx,[rax+16]
   mov qword[Message+36],rdx
   mov rdx,[rax+24]
   mov qword[Message+44],rdx
   mov rdx,[rax+32]
   mov qword[Message+52],rdx
   mov edx,[rax+40]
   mov dword[Message+60],edx
   mov byte[Message+62],0

invoke MessageBox,NULL,Message,NULL,MB_OK
invoke ExitProcess,0

section '.data' data readable

  PTXFunction: db 'log_2',0

; log_2(*inout,length):
; xind =  %ctaid.x * %ntid.x + %tid.x
; if xind < length , [inout + 4 * xind] = log2([inout + 4 * xind])
; return

  PTXSourceData:
db '     .version 1.4'
db '     .target sm_13'
db '        .entry log_2 ('
db '                .param .u64 _inout,'
db '                .param .s32 _length)'
db '        {'
db '        .reg .u16 w1,w2;'       ; word-sized registers
db '        .reg .u32 e1,xind;'       ; dwords
db '        .reg .u64 r1;'       ; qwords
db '        .reg .f32 f1;'       ; floats
db '        .reg .pred p1;'       ; conditions
db '        mov.u16         w1,%ctaid.x;'
db '        mov.u16         w2,%ntid.x;'
db '        cvt.u32.u16     e1,%tid.x;'
db '        mad.wide.u16    xind,w1,w2,e1;'
db '        ld.param.s32    e1,[_length];'
db '        setp.le.s32     p1,e1,xind;'
db '   @p1  bra             $Lt_Exit;'       ; if p1, branch
db '        ld.param.u64    r1,[_inout];'
db '        mad.wide.u32    r1,xind,4,r1;'
db '        ld.global.f32   f1,[r1+0];'
db '        lg2.approx.f32  f1,f1;'
db '        st.global.f32   [r1+0],f1;'
db '$Lt_Exit:'
db '        exit;'
db '        }'
db 0

section '.data' data readable writeable

align 16
  ErrorMessageFormat db 'error code:',10,'hex: 0x%.8x',10,'dec: %u',0
  MessageFormat db '%16.7f',0
err000:
db    'CUDA_SUCCESS                              ';= 0
db    'CUDA_ERROR_INVALID_VALUE                  ';= 1
db    'CUDA_ERROR_OUT_OF_MEMORY                  ';= 2
db    'CUDA_ERROR_NOT_INITIALIZED                ';= 3
db    'CUDA_ERROR_DEINITIALIZED                  ';= 4
db    'CUDA_ERROR_PROFILER_DISABLED              ';= 5
db    'CUDA_ERROR_PROFILER_NOT_INITIALIZED       ';= 6
db    'CUDA_ERROR_PROFILER_ALREADY_STARTED       ';= 7
db    'CUDA_ERROR_PROFILER_ALREADY_STOPPED       ';= 8
err100:
db    'CUDA_ERROR_NO_DEVICE                      ';= 100
db    'CUDA_ERROR_INVALID_DEVICE                 ';= 101
err200:
db    'CUDA_ERROR_INVALID_IMAGE                  ';= 200
db    'CUDA_ERROR_INVALID_CONTEXT                ';= 201
db    'CUDA_ERROR_CONTEXT_ALREADY_CURRENT        ';= 202
db    '?                                         ';=
db    '?                                         ';=
db    'CUDA_ERROR_MAP_FAILED                     ';= 205
db    'CUDA_ERROR_UNMAP_FAILED                   ';= 206
db    'CUDA_ERROR_ARRAY_IS_MAPPED                ';= 207
db    'CUDA_ERROR_ALREADY_MAPPED                 ';= 208
db    'CUDA_ERROR_NO_BINARY_FOR_GPU              ';= 209
db    'CUDA_ERROR_ALREADY_ACQUIRED               ';= 210
db    'CUDA_ERROR_NOT_MAPPED                     ';= 211
db    'CUDA_ERROR_NOT_MAPPED_AS_ARRAY            ';= 212
db    'CUDA_ERROR_NOT_MAPPED_AS_POINTER          ';= 213
db    'CUDA_ERROR_ECC_UNCORRECTABLE              ';= 214
db    'CUDA_ERROR_UNSUPPORTED_LIMIT              ';= 215
db    'CUDA_ERROR_CONTEXT_ALREADY_IN_USE         ';= 216
err300:
db    'CUDA_ERROR_INVALID_SOURCE                 ';= 300
db    'CUDA_ERROR_FILE_NOT_FOUND                 ';= 301
db    'CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND ';= 302
db    'CUDA_ERROR_SHARED_OBJECT_INIT_FAILED      ';= 303
db    'CUDA_ERROR_OPERATING_SYSTEM               ';= 304
err400:
db    'CUDA_ERROR_INVALID_HANDLE                 ';= 400
err500:
db    'CUDA_ERROR_NOT_FOUND                      ';= 500
err600:
db    'CUDA_ERROR_NOT_READY                      ';= 600
err700:
db    'CUDA_ERROR_LAUNCH_FAILED                  ';= 700
db    'CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES        ';= 701
db    'CUDA_ERROR_LAUNCH_TIMEOUT                 ';= 702
db    'CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING  ';= 703
db    'CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED    ';= 704
db    'CUDA_ERROR_PEER_ACCESS_NOT_ENABLED        ';= 705
db    '?                                         ';=
db    '?                                         ';=
db    'CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE         ';= 708
db    'CUDA_ERROR_CONTEXT_IS_DESTROYED           ';= 709
err999:
db    'CUDA_ERROR_UNKNOWN                        ';= 999
errNoMatch:
db    '?                                         '


_cuInit: db 'cuInit              '
_cuDeviceGet: db 'cuDeviceGet         '
_cuCtxCreate: db 'cuCtxCreate         '
_cuMemAlloc: db 'cuMemAlloc          '
_cuModuleLoadData: db 'cuModuleLoadData    '
_cuModuleGetFunction: db 'cuModuleGetFunction '
_cuMemcpyHtoD: db 'cuMemcpyHtoD        '
_cuParamSeti: db 'cuParamSeti         '
_cuParamSetv: db 'cuParamSetv         '
_cuFuncSetBlockShape: db 'cuFuncSetBlockShape '
_cuLaunchGrid: db 'cuLaunchGrid        '
_cuParamSetSize: db 'cuParamSetSize      '
_cuCtxSynchronize: db 'cuCtxSynchronize    '
_cuMemcpyDtoH: db 'cuMemcpyDtoH        '
_cuMemFree: db 'cuMemFree           '
_cuCtxDestroy: db 'cuCtxDestroy        '



align 16
  CudaDevice    dq ?
  CudaContext    dq ?
  CudaModule    dq ?
  CudaFunction    dq ?
  CudaNumberArray  dq ?
  NumberArray    rd 256
  LogNumberArray   rd 256
  CudaLogNumberArray rd 256
  Temp rb 32
  Message rb 50*60

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
  user32,'USER32.DLL',\
  msvcrt,'MSVCRT.DLL',\
  cuda,'NVCUDA.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'
  include 'api_cuda.inc'

import msvcrt,\
sprintf,'sprintf'



Title: Re: Using the GPU
Post by: LiaoMi on May 12, 2019, 12:24:34 AM
With the toolkit from the posts above, you can freely write modern code; it only remains to find an idea for a super-fast algorithm  :P
Title: Re: Using the GPU
Post by: Biterider on May 12, 2019, 06:35:24 AM
Thanks LiaoMi
Very good and interesting work  :t

Biterider
Title: Re: Using the GPU
Post by: LiaoMi on May 12, 2019, 09:00:45 AM
Hi Biterider,

the example is old, but perfect for demonstration!  :icon14:
mrfearless has a different set of APIs https://github.com/mrfearless/libraries/tree/master/Cuda (https://github.com/mrfearless/libraries/tree/master/Cuda), maybe because of the CUDA Toolkit version, although the compiled exe works fine for me. We can easily rewrite the example for UASM; the main thing is to understand how to use the new 10.1 API  :badgrin:

Nsight Productivity Utilities PTX Debugger + Tools

NVIDIA® Nsight™ Systems
NVIDIA® Nsight Systems™ is a system-wide performance analysis tool designed to visualize an application's algorithms, identify the largest optimization opportunities, and tune to scale efficiently across any quantity or size of CPUs and GPUs, from large servers to the smallest SoC.

NVIDIA® Nsight™ Graphics
NVIDIA® Nsight Graphics™ is a standalone application for the debugging, profiling, and analysis of graphics applications. It allows you to optimize the performance of your Direct3D 11, Direct3D 12, DirectX Raytracing, OpenGL, Vulkan, and NVIDIA VKRay based applications.

NVIDIA® Nsight™ Eclipse Edition
NVIDIA® Nsight™ Eclipse Edition is a full-featured IDE powered by the Eclipse platform that provides an all-in-one integrated environment to edit, build, debug, and profile CUDA-C applications. Nsight Eclipse Edition supports a rich set of commercial and free plugins.

NVIDIA® Nsight™ Compute
NVIDIA® Nsight™ Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. Nsight Compute also provides customizable and data-driven user interface and metric collection that can be extended with analysis scripts for post-processing results.

NVIDIA® Nsight™ Visual Studio Edition
An application development environment for heterogeneous platforms, Nsight Visual Studio Edition brings GPU computing into Microsoft Visual Studio. Build, debug, profile and trace heterogeneous compute, graphics, virtual reality, RTX, .NET, and UWP applications built with CUDA C/C++, OpenCL, DirectCompute, Direct3D (11,12,DXR), Vulkan (1.1, Vulkan Ray Tracing Extension), OpenGL, OpenVR, and the Oculus SDK.
Download NVIDIA® Nsight™ Visual Studio Edition 2019.1 - https://developer.nvidia.com/gameworksdownload#?dn=nsight-visual-studio-edition-2019-1 (https://developer.nvidia.com/gameworksdownload#?dn=nsight-visual-studio-edition-2019-1) (free registration is required, then you can download files NVIDIA_Nsight_Visual_Studio_Edition_Win64_2019.1.1.19056.msi) 492Mb  :eusa_boohoo:

(https://developer.nvidia.com/sites/default/files/akamai/tools/nsvse/nsvse60/60_Next-Gen_Debugger_PTX+SASS.png)
Title: Re: Using the GPU
Post by: fearless on May 12, 2019, 01:01:12 PM
There have probably been a good few changes since I last looked at it. I think the version of the CUDA SDK I was working with was primarily v8, just to see if I could get a few things working - mainly with the NVML part. So any APIs and include-file conversions are probably based on that v8. I think there was some reason for only using x64 asm - I can't quite recall why that and not x86 - maybe there wasn't an x86 DLL for nvml or something (I think).
Anyhow, it looks like you have a lot of stuff figured out, so it looks promising.
Title: Re: Using the GPU
Post by: LiaoMi on May 12, 2019, 10:06:34 PM
Hi fearless,

I checked your SDK for the presence of the APIs and compared it with the original version 10.1. My conclusion: your SDK is not complete. All the original functions are still present - nothing has changed in the new CUDA versions.

I attach my version of the converted .H files https://www.solidfiles.com/v/yR7MaKLr5Vkx6 (https://www.solidfiles.com/v/yR7MaKLr5Vkx6); the library files are in the CUDA Toolkit 10.1, as usual.

I did not convert two folders, because they contain templates and classes that are not supported by the converter:
thrust/detail
thrust/system


I have not tried converting the .H files using the H2INCX utility; I did it a quick way to cover more files. h2incX needs a graphical front-end, in the form of a separate program that generates the flags, creates subdirectories during conversion, and manages the conversion list; UPX has a similar separate interface.

The Linux version of the "Cuda_Assembly Example" can be found below, FASM source code  :icon14:

P.S. The converter did not report any errors, but you may have to correct something  :bgrin:
Title: Re: Using the GPU
Post by: LiaoMi on May 12, 2019, 11:13:17 PM
First bugs after conversion :biggrin:

C++
/**
* External memory buffer descriptor
*/
typedef struct CUDA_EXTERNAL_MEMORY_BUFFER_DESC_st {
    /**
     * Offset into the memory object where the buffer's base is
     */
    unsigned long long offset;
    /**
     * Size of the buffer
     */
    unsigned long long size;
    /**
     * Flags reserved for future use. Must be zero.
     */
    unsigned int flags;
    unsigned int reserved[16];
} CUDA_EXTERNAL_MEMORY_BUFFER_DESC;


Conversion

CUDA_EXTERNAL_MEMORY_BUFFER_DESC STRUCT DEFALIGNMASM
long DWORD ?
long DWORD ?
flags DWORD ?
reserved DWORD 16 dup (?)
CUDA_EXTERNAL_MEMORY_BUFFER_DESC ENDS


:redface: How many bytes is unsigned long long?
https://stackoverflow.com/questions/5836329/how-many-bytes-is-unsigned-long-long (https://stackoverflow.com/questions/5836329/how-many-bytes-is-unsigned-long-long)
https://en.cppreference.com/w/cpp/language/types (https://en.cppreference.com/w/cpp/language/types)  :t
Title: Re: Using the GPU
Post by: LiaoMi on May 13, 2019, 01:30:06 AM
Hi again  :biggrin:,

I can now compile successfully under the UASM assembler. There is one problem: in the header files you can control the API version. In my case version 2 is used, but it should be 1 ... cuCtxCreate_v2 -> cuCtxCreate

JMP     QWORD PTR DS:[<&cuInit>]
JMP     QWORD PTR DS:[<&cuDeviceGet>]
JMP     QWORD PTR DS:[<&cuCtxCreate_v2>]
JMP     QWORD PTR DS:[<&cuCtxDestroy_v2>]
JMP     QWORD PTR DS:[<&cuCtxSynchronize>]
JMP     QWORD PTR DS:[<&cuModuleLoadData>]
JMP     QWORD PTR DS:[<&cuModuleGetFunction>]
JMP     QWORD PTR DS:[<&cuMemAlloc_v2>]
JMP     QWORD PTR DS:[<&cuMemFree_v2>]
JMP     QWORD PTR DS:[<&cuMemcpyHtoD_v2>]
JMP     QWORD PTR DS:[<&cuMemcpyDtoH_v2>]
JMP     QWORD PTR DS:[<&cuFuncSetBlockShape>]
JMP     QWORD PTR DS:[<&cuParamSetSize>]
JMP     QWORD PTR DS:[<&cuParamSetv>]
JMP     QWORD PTR DS:[<&cuLaunchGrid>]
JMP     QWORD PTR DS:[<&FatalExit>]
JMP     QWORD PTR DS:[<&MessageBoxA>]


I can adjust the parameters in the macro, but in that case I get the error "Error A2101: Macro nesting level too deep" from the UASM assembler  :icon_confused:


E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4879) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4879): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4909) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(4909): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5719) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5719): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5763) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5763): Included by
  mywindow1.asm(32): Main line code
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5796) : Error A2101: Macro nesting level too deep
E:\DATA\MASM64\HJWasm\Cuda\inc\cuda.inc(5796): Included by
  mywindow1.asm(32): Main line code


Where can I patch the macro nesting depth?!  :icon_exclaim: I still use version 2.47.1.0, due to a bug in version 2.48  :idea:
Title: Re: Using the GPU
Post by: LiaoMi on May 13, 2019, 04:52:37 AM
In my opinion, everything in the file cuda.inc is correct:

#if defined(CUDA_FORCE_API_VERSION)
    #if (CUDA_FORCE_API_VERSION == 3010)
        #define __CUDA_API_VERSION 3010
    #else
        #error "Unsupported value of CUDA_FORCE_API_VERSION"
    #endif
#else
    #define __CUDA_API_VERSION 10010
#endif /* CUDA_FORCE_API_VERSION */



Why are various functions indicated as unavailable?  :icon_eek: After the experiments, I realized that the problem is in the page alignment. I forgot to remove the /FORCE flag and added /LARGEADDRESSAWARE:NO, but no change .. the API functions have invalid addresses ..
Title: Re: Using the GPU
Post by: LiaoMi on May 14, 2019, 12:45:26 AM
The latest version for testing - it got as far as:
lea rbx, [_cuCtxSynchronize]
call cuCtxSynchronize


---------------------------
Error
---------------------------
cuCtxSynchronize   
CUDA_ERROR_LAUNCH_FAILED                 
---------------------------
OK   
---------------------------


The task of converting the header files was not so easy; there are places where API variables end up typed as DWORD. Additional adjustment of the translator templates, and checking against the API, is strongly required  :icon_exclaim:

#if defined(_WIN64) || defined(__LP64__)
typedef unsigned long long CUdeviceptr;
#else
typedef unsigned int CUdeviceptr;
#endif


vs

IF DEFINED(_WIN64) OR DEFINED(__LP64__)
long TYPEDEF DWORD
ELSE
CUdeviceptr TYPEDEF DWORD
ENDIF
Title: Re: Using the GPU
Post by: Biterider on May 14, 2019, 12:51:12 AM
Hi LiaoMi
I found the same issue while translating the header files. In the next few days, I'll make some additions to h2incX to better convert these files.
Biterider
Title: Re: Using the GPU
Post by: aw27 on May 15, 2019, 09:19:02 PM
Quote from: LiaoMi on May 14, 2019, 12:45:26 AM
The latest version for testing - it got as far as:
lea rbx, [_cuCtxSynchronize]
call cuCtxSynchronize


---------------------------
Error
---------------------------
cuCtxSynchronize   
CUDA_ERROR_LAUNCH_FAILED                 
---------------------------
OK   
---------------------------


I produced a very simple example and ran into that issue as well  :( . It took me a few hours to realize what the problem might be. I got it fixed by calling the _v2 versions.  :t

Here is the example, which consists of calculating the double of each element of an array of 1000 elements. It does this by launching 1024 threads (24 will do nothing), each of which handles only one element.

It may need to be modified to run on other targets; it was written for sm_30


OPTION casemap:none

N equ 1000

includelib kernel32.lib
ExitProcess proto :dword
includelib msvcrt.lib
printf proto :ptr, :vararg
_getch proto
includelib cuda.lib
cuInit proto :dword
cuDeviceGet proto :ptr, :dword
cuCtxCreate_v2 proto :ptr, :dword, :dword
cuModuleLoadData proto :ptr, :ptr
cuModuleGetFunction proto :ptr, :ptr, :ptr
cuMemAlloc_v2 proto :ptr, :qword
cuMemcpyHtoD_v2 proto :ptr, :ptr, :qword
cuMemcpyDtoH_v2 proto :ptr, :ptr, :qword
cuLaunchKernel proto :ptr, :dword, :dword, :dword, :dword, :dword, :dword, :ptr, :ptr, :ptr
cuCtxSynchronize proto
cuMemFree_v2 proto :ptr

.code

mulby2 proc
LOCAL cuDevice:dword
LOCAL cuContext:ptr
LOCAL cuModule:ptr
LOCAL cudaFunction:ptr
LOCAL _da: qword
LOCAL _db: qword
LOCAL ha[N]:dword
LOCAL hb[N]:dword
LOCAL memSize : qword
LOCAL args[2] : ptr

sub rsp, 58h
and rsp, -16 ; align
mov eax, sizeof DWORD * N
mov memSize, rax
mov rcx,0
call cuInit
cmp eax, 0
jnz @exit

lea rcx, cuDevice
mov edx,0
call cuDeviceGet
cmp eax, 0
jnz @exit

lea rcx, cuContext
mov edx, 0
mov r8d, cuDevice
call cuCtxCreate_v2
cmp eax, 0
jnz @exit

lea rcx, cuModule
mov rdx, offset PTXSourceData
call cuModuleLoadData
cmp eax, 0
jnz @exit

lea rcx, cudaFunction
mov rdx, cuModule
lea r8, PTXFunction
call cuModuleGetFunction
cmp eax, 0
jnz @exit

mov ecx, N
mov r10d,0
lea r11, ha
@@:
mov dword ptr [r11], r10d
add r11, 4
inc r10d
dec ecx
jnz @B

lea rcx, _da
mov rdx, memSize
call cuMemAlloc_v2
cmp eax, 0
jnz @exit
lea rcx, _db
mov rdx, memSize
call cuMemAlloc_v2
cmp eax, 0
jnz @exit

mov rcx, _da
lea rdx, ha
mov r8, memSize
call cuMemcpyHtoD_v2
cmp eax, 0
jnz @exit

lea rax, _da
mov qword ptr args, rax
lea rax, _db
mov qword ptr args+8, rax

mov rcx, cudaFunction
mov edx ,1
mov r8d, 1
mov r9d, 1
mov dword ptr [rsp+20h], 1024
mov dword ptr [rsp+28h], 1
mov dword ptr [rsp+30h], 1
mov dword ptr [rsp+38h], 0
mov qword ptr [rsp+40h], 0
lea rax, args
mov qword ptr [rsp+48h], rax
mov qword ptr [rsp+50h], 0
call cuLaunchKernel
cmp eax, 0
jnz @exit

call cuCtxSynchronize
cmp eax, 0
jnz @exit

lea rcx, hb
mov rdx, _db
mov r8, memSize
call cuMemcpyDtoH_v2
cmp eax, 0
jnz @exit

mov ebx, 0
lea rdi, hb
@@:
mov rcx, offset report
mov edx, dword ptr [rdi]
call printf
add rdi, 4
inc ebx
cmp ebx, N
jl @B

mov rcx, _da
call cuMemFree_v2
mov rcx, _db
call cuMemFree_v2
mov eax,0
@exit:
mov ecx,0
call _getch
call ExitProcess
mulby2 endp

.data
report db "%d",10,0

; Generated by NVIDIA NVVM Compiler
; Compiler Build ID: CL-25769353
; Cuda compilation tools, release 10.1, V10.1.105
; Based on LLVM 3.4svn

PTXFunction db '_Z3addPiS_',0
PTXSourceData \
db '.version 6.4 ',10
db '.target sm_30 ',10
db '.address_size 64 ',10
db ' ',10
db ' // .globl _Z3addPiS_ ',10
db ' ',10
db '.visible .entry _Z3addPiS_( ',10
db ' .param .u64 _Z3addPiS__param_0, ',10
db ' .param .u64 _Z3addPiS__param_1 ',10
db ') ',10
db '{ ',10
db ' .reg .pred %p<2>; ',10
db ' .reg .b32 %r<4>; ',10
db ' .reg .b64 %rd<8>; ',10
db ' ',10
db ' ',10
db ' ld.param.u64 %rd1, [_Z3addPiS__param_0]; ',10
db ' ld.param.u64 %rd2, [_Z3addPiS__param_1]; ',10
db ' mov.u32 %r1, %tid.x; ',10
db ' setp.gt.s32 %p1, %r1, 999; ',10
db ' @%p1 bra BB0_2; ',10
db ' ',10
db ' cvta.to.global.u64 %rd3, %rd1; ',10
db ' mul.wide.s32 %rd4, %r1, 4; ',10
db ' add.s64 %rd5, %rd3, %rd4; ',10
db ' ld.global.u32 %r2, [%rd5]; ',10
db ' shl.b32 %r3, %r2, 1; ',10
db ' cvta.to.global.u64 %rd6, %rd2; ',10
db ' add.s64 %rd7, %rd6, %rd4; ',10
db ' st.global.u32 [%rd7], %r3; ',10
db ' ',10
db 'BB0_2: ',10
db ' ret; ',10
db '} ',10
db  0

end


The PTX was obtained from this:


#define N 1000

__global__ void add(int* a, int* b) {
int i = threadIdx.x;
if (i < N) {
b[i] = 2 * a[i];
}
}


To build:
SET MASM64LIB=\masm32\lib64
\masm32\bin\ml64  -c -Zp8 c64.asm
\masm32\bin\link /ENTRY:mulby2 /SUBSYSTEM:console /LIBPATH:%MASM64LIB% /FIXED /MACHINE:X64 c64.obj
Title: Re: Using the GPU
Post by: LiaoMi on May 15, 2019, 10:41:01 PM
I need to have some kind of manual to deal with the architecture and the API. Here you can download the full documentation in PDF - the API, the inline functions, and more - https://docs.nvidia.com/cuda/ (https://docs.nvidia.com/cuda/)

Quote from: Biterider on May 14, 2019, 12:51:12 AM
Hi LiaoMi
I found the same issue while translating the header files. In the next few days, I'll make some additions to h2incX to better convert these files.
Biterider

Hi Biterider,

I fixed it in my own conversion with the following steps:
1. Make a copy of all files for backup.
2. Replace all the types described here https://en.cppreference.com/w/cpp/language/types (https://en.cppreference.com/w/cpp/language/types) in the form of __int64 (unsigned long long = __int64, unsigned long = long) - everything in the table; there are many of them.
3. Define the type in the converter: CUdeviceptr = CUdeviceptr.
4. Delete all CUdeviceptr = DWORD parameters in the original headers.
5. Add the correct macro:
IF DEFINED(_WIN64) OR DEFINED(__LP64__)
CUdeviceptr TYPEDEF QWORD
ELSE
CUdeviceptr TYPEDEF DWORD
ENDIF


Perhaps there are other bottlenecks that I did not see .. Compiling with my converted file works fine .. I have only done one file so far - cuda.inc; in the post above, the file was already corrected a couple of days ago ..

Quote from: AW on May 15, 2019, 09:19:02 PM
Quote from: LiaoMi on May 14, 2019, 12:45:26 AM
The latest version for testing - it got as far as:
lea rbx, [_cuCtxSynchronize]
call cuCtxSynchronize


---------------------------
Error
---------------------------
cuCtxSynchronize   
CUDA_ERROR_LAUNCH_FAILED                 
---------------------------
OK   
---------------------------


I produced a very simple example and ran into that issue as well  :( . It took me a few hours to realize what the problem might be. I got it fixed by calling the _v2 versions.  :t

Here is the example, which consists of calculating the double of each element of an array of 1000 elements. It does this by launching 1024 threads (24 will do nothing), each of which handles only one element.

It may need to be modified to run on other targets; it was written for sm_30


OPTION casemap:none

N equ 1000

includelib kernel32.lib
ExitProcess proto :dword
includelib msvcrt.lib
printf proto :ptr, :vararg
_getch proto
includelib cuda.lib
cuInit proto :dword
cuDeviceGet proto :ptr, :dword
cuCtxCreate_v2 proto :ptr, :dword, :dword
cuModuleLoadData proto :ptr, :ptr
cuModuleGetFunction proto :ptr, :ptr, :ptr
cuMemAlloc_v2 proto :ptr, :qword
cuMemcpyHtoD_v2 proto :ptr, :ptr, :qword
cuMemcpyDtoH_v2 proto :ptr, :ptr, :qword
cuLaunchKernel proto :ptr, :dword, :dword, :dword, :dword, :dword, :dword, :ptr, :ptr, :ptr
cuCtxSynchronize proto
cuMemFree_v2 proto :ptr

.code

mulby2 proc
LOCAL cuDevice:dword
LOCAL cuContext:ptr
LOCAL cuModule:ptr
LOCAL cudaFunction:ptr
LOCAL _da: qword
LOCAL _db: qword
LOCAL ha[N]:dword
LOCAL hb[N]:dword
LOCAL memSize : qword
LOCAL args[2] : ptr

sub rsp, 58h
and rsp, -16 ; align
mov eax, sizeof DWORD * N
mov memSize, rax
mov rcx,0
call cuInit
cmp eax, 0
jnz @exit

lea rcx, cuDevice
mov edx,0
call cuDeviceGet
cmp eax, 0
jnz @exit

lea rcx, cuContext
mov edx, 0
mov r8d, cuDevice
call cuCtxCreate_v2
cmp eax, 0
jnz @exit

lea rcx, cuModule
mov rdx, offset PTXSourceData
call cuModuleLoadData
cmp eax, 0
jnz @exit

lea rcx, cudaFunction
mov rdx, cuModule
lea r8, PTXFunction
call cuModuleGetFunction
cmp eax, 0
jnz @exit

mov ecx, N
mov r10d,0
lea r11, ha
@@:
mov dword ptr [r11], r10d
add r11, 4
inc r10d
dec ecx
jnz @B

lea rcx, _da
mov rdx, memSize
call cuMemAlloc_v2
cmp eax, 0
jnz @exit
lea rcx, _db
mov rdx, memSize
call cuMemAlloc_v2
cmp eax, 0
jnz @exit

mov rcx, _da
lea rdx, ha
mov r8, memSize
call cuMemcpyHtoD_v2
cmp eax, 0
jnz @exit

lea rax, _da
mov qword ptr args, rax
lea rax, _db
mov qword ptr args+8, rax

mov rcx, cudaFunction
mov edx ,1
mov r8d, 1
mov r9d, 1
mov dword ptr [rsp+20h], 1024
mov dword ptr [rsp+28h], 1
mov dword ptr [rsp+30h], 1
mov dword ptr [rsp+38h], 0
mov qword ptr [rsp+40h], 0
lea rax, args
mov qword ptr [rsp+48h], rax
mov qword ptr [rsp+50h], 0
call cuLaunchKernel
cmp eax, 0
jnz @exit

call cuCtxSynchronize
cmp eax, 0
jnz @exit

lea rcx, hb
mov rdx, _db
mov r8, memSize
call cuMemcpyDtoH_v2
cmp eax, 0
jnz @exit

mov ebx, 0
lea rdi, hb
@@:
mov rcx, offset report
mov edx, dword ptr [rdi]
call printf
add rdi, 4
inc ebx
cmp ebx, N
jl @B

mov rcx, _da
call cuMemFree_v2
mov rcx, _db
call cuMemFree_v2
mov eax,0
@exit:
mov ecx,0
call _getch
call ExitProcess
mulby2 endp

.data
report db "%d",10,0

; Generated by NVIDIA NVVM Compiler
; Compiler Build ID: CL-25769353
; Cuda compilation tools, release 10.1, V10.1.105
; Based on LLVM 3.4svn

PTXFunction db '_Z3addPiS_',0
PTXSourceData \
db '.version 6.4 ',10
db '.target sm_30 ',10
db '.address_size 64 ',10
db ' ',10
db ' // .globl _Z3addPiS_ ',10
db ' ',10
db '.visible .entry _Z3addPiS_( ',10
db ' .param .u64 _Z3addPiS__param_0, ',10
db ' .param .u64 _Z3addPiS__param_1 ',10
db ') ',10
db '{ ',10
db ' .reg .pred %p<2>; ',10
db ' .reg .b32 %r<4>; ',10
db ' .reg .b64 %rd<8>; ',10
db ' ',10
db ' ',10
db ' ld.param.u64 %rd1, [_Z3addPiS__param_0]; ',10
db ' ld.param.u64 %rd2, [_Z3addPiS__param_1]; ',10
db ' mov.u32 %r1, %tid.x; ',10
db ' setp.gt.s32 %p1, %r1, 999; ',10
db ' @%p1 bra BB0_2; ',10
db ' ',10
db ' cvta.to.global.u64 %rd3, %rd1; ',10
db ' mul.wide.s32 %rd4, %r1, 4; ',10
db ' add.s64 %rd5, %rd3, %rd4; ',10
db ' ld.global.u32 %r2, [%rd5]; ',10
db ' shl.b32 %r3, %r2, 1; ',10
db ' cvta.to.global.u64 %rd6, %rd2; ',10
db ' add.s64 %rd7, %rd6, %rd4; ',10
db ' st.global.u32 [%rd7], %r3; ',10
db ' ',10
db 'BB0_2: ',10
db ' ret; ',10
db '} ',10
db  0

end


The PTX was obtained from this:


#define N 1000

__global__ void add(int* a, int* b) {
int i = threadIdx.x;
if (i < N) {
b[i] = 2 * a[i];
}
}


To build:
SET MASM64LIB=\masm32\lib64
\masm32\bin\ml64  -c -Zp8 c64.asm
\masm32\bin\link /ENTRY:mulby2 /SUBSYSTEM:console /LIBPATH:%MASM64LIB% /FIXED /MACHINE:X64 c64.obj

Hi AW,

great job!  :eusa_clap: :t I was so tortured with this function that I started thinking about errors in the code. I will try to change the function and I will try your example! Thank you!

What is the difference between these functions?
The new CUDA "v2" API has been supported since v4.0 of the CUDA toolkit (i.e. for quite a while; we are now on 7). The old API is deprecated by NVidia, and does not support certain useful functionality such as batched matrix-multiply.
From my understanding, the new CUDA "v2" API is a functional extension of the whole API to a new level. Therefore the parameters may differ, and I did not take this into account. Some functions are preserved to maintain compatibility, and some exist only in the old form. All these parameters should be checked against the toolkit version.  ::)

All current documentation in one archive - CUDA Toolkit v10.1.168_pdf.zip https://www.solidfiles.com/v/5aYWVB7nx3Qq5 (https://www.solidfiles.com/v/5aYWVB7nx3Qq5) 100Mb
Title: Re: Using the GPU
Post by: jj2007 on May 15, 2019, 11:52:56 PM
I've never seriously considered using CUDA because I thought it's useful for one brand only, Nvidia. Now I am confused by two elements:

1. It seems Nvidia has over 80% of the market (https://wccftech.com/nvidia-amd-discrete-gpu-market-share-q4-2018-report/)
2. There are many manufacturers but they all produce Nvidia cards, apparently: https://graphicscardhub.com/graphics-card-manufacturers-brands/

So what is "Nvidia"? A manufacturer brand, a "graphics OS" of some sorts, a standard?
Title: Re: Using the GPU
Post by: LiaoMi on May 16, 2019, 02:38:53 AM
Quote from: jj2007 on May 15, 2019, 11:52:56 PM
I've never seriously considered using CUDA because I thought it's useful for one brand only, Nvidia. Now I am confused by two elements:

1. It seems Nvidia has over 80% of the market (https://wccftech.com/nvidia-amd-discrete-gpu-market-share-q4-2018-report/)
2. There are many manufacturers but they all produce Nvidia cards, apparently: https://graphicscardhub.com/graphics-card-manufacturers-brands/

So what is "Nvidia"? A manufacturer brand, a "graphics OS" of some sorts, a standard?

Hi jj2007,

for me it is the market leader in the production of GPUs (graphics processing units)

Intel = CPU
NVIDIA = GPU. NVIDIA, inventor of the GPU, creates interactive graphics on laptops, workstations, mobile devices, notebooks, PCs, and more.

CUDA (initial release: June 23, 2007) is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and engineers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels.

NVIDIA Major releases and acquisitions - The release of the RIVA TNT in 1998 solidified Nvidia's reputation for developing capable graphics adapters. In late 1999, Nvidia released the GeForce 256 (NV10), most notably introducing on-board transformation and lighting (T&L) to consumer-level 3D hardware. Running at 120 MHz and featuring four pixel pipelines, it implemented advanced video acceleration, motion compensation and hardware sub-picture alpha blending.

Unification is not required for GPU computation; this is probably one of the reasons why there is no computation standard across graphics interfaces. GPU computation seems to me more tied to the programming language than to the technical features of parallelization.
Title: Re: Using the GPU
Post by: TimoVJL on May 16, 2019, 03:17:55 AM
OpenCL (https://www.khronos.org/opencl/)
nvidia OpenCL (https://developer.nvidia.com/opencl)

GPU_Caps_Viewer (http://www.ozone3d.net/gpu_caps_viewer/)
Title: Re: Using the GPU
Post by: aw27 on May 16, 2019, 03:22:06 AM
Hi LiaoMi,

I noticed that in the #include file they have
#if defined(__CUDA_API_VERSION_INTERNAL) || __CUDA_API_VERSION >= 3020
...
#define cuCtxCreate                         cuCtxCreate_v2

So I ran cuCtxGetApiVersion and found that the version was 3020.

All explained; so easy when we know the answer.  :biggrin:
Title: Re: Using the GPU
Post by: LiaoMi on May 16, 2019, 09:28:13 PM
Another attractive example!  :P

Here is a demonstration of a neural network created on a CUDA GPU using FASM.

This program requires Win64 and an NVIDIA CUDA-enabled graphics card. The code is written specifically for an NVIDIA 620 96-core card with a 1.4 GHz clock rate. As the neural network size was chosen as a multiple of 96, the code may have to be modified for cards with more or fewer cores.

The first part of the code constructs a neural network in PC memory and in GPU memory. You can run it in either to test speed. The second part of the code is the PTX (GPU pseudo-assembly) program thread that is loaded into the NVIDIA GPU as ASCII text and run on successive 96-core 'blocks' (one thread per core) until the program has finished.

My timing results show this 96-core card running at a speed equivalent to a dual-core CPU at 3 GHz fully using 128-bit SIMD.

To put this another way, if this program were run on the latest 3000-core NVIDIA card, it would run 5 times faster than on the latest 16-core AVX (256-bit SIMD) CPU available today.

I was a bit disappointed with CUDA performance, CPU core for GPU core; it should have run up to 5 times faster. However, this is just a first attempt.


It does not work for me ...  :(
---------------------------
Error
---------------------------
error code:
hex: 0x000e28c0
dec: 927936
---------------------------
OK   
---------------------------
Title: Re: Using the GPU
Post by: LiaoMi on May 16, 2019, 10:06:57 PM
Quote from: AW on May 16, 2019, 03:22:06 AM
Hi LiaoMi,

I noticed that in the #include file they have
#if defined(__CUDA_API_VERSION_INTERNAL) || __CUDA_API_VERSION >= 3020
...
#define cuCtxCreate                         cuCtxCreate_v2

So I ran cuCtxGetApiVersion and found that the version was 3020.

All explained; so easy when we know the answer.  :biggrin:

I need to find the error; somewhere something is wrong. I also have version 3020 according to the results of that function. I compared the original binary file and the API imports: only the old APIs are used there. In the example (Cuda_UASM_v4.zip) I'm trying to translate, all the APIs are similar to your version.

---------------------------
Error
---------------------------
cuCtxSynchronize   
CUDA_ERROR_LAUNCH_FAILED                 
---------------------------
OK   
---------------------------


I will try to set the old api and once again check the parameters of the functions.


Quote from: TimoVJL on May 16, 2019, 03:17:55 AM
OpenCL (https://www.khronos.org/opencl/)
nvidia OpenCL (https://developer.nvidia.com/opencl)

GPU_Caps_Viewer (http://www.ozone3d.net/gpu_caps_viewer/)

Hi TimoVJL,

if I understand correctly, the main purpose of OpenCL is to provide a common set of APIs for all variations of the GPU; this is almost like Qt for the world of graphics accelerators  :idea:
Title: Re: Using the GPU
Post by: TimoVJL on May 16, 2019, 10:46:50 PM
Quote from: LiaoMi on May 16, 2019, 10:06:57 PM
Quote from: TimoVJL on May 16, 2019, 03:17:55 AM
OpenCL (https://www.khronos.org/opencl/)
nvidia OpenCL (https://developer.nvidia.com/opencl)

GPU_Caps_Viewer (http://www.ozone3d.net/gpu_caps_viewer/)

Hi TimoVJL,

if I understand correctly, the main purpose of OpenCL is to provide a common set of APIs for all variations of the GPU; this is almost like Qt for the world of graphics accelerators  :idea:
Even with Android smart phones ;)

CudaVsOpenCL (https://wiki.tiker.net/CudaVsOpenCL)
Title: Re: Using the GPU
Post by: HSE on May 16, 2019, 11:14:34 PM
Quote from: TimoVJL on May 16, 2019, 03:17:55 AM
GPU_Caps_Viewer (http://www.ozone3d.net/gpu_caps_viewer/)
:t
Title: Re: Using the GPU
Post by: aw27 on May 17, 2019, 01:18:27 AM
Cuda is also on Android with the most recent (NVidia) Tegra models.
And you can start developing right now  (https://docs.nvidia.com/gameworks/content/developertools/mobile/getting_started.htm), everything is organized to make the journey as pleasant as possible.  :t
Is OpenCL disappearing from the map? (https://community.amd.com/thread/232600)  :shock:


Title: Re: Using the GPU
Post by: TimoVJL on May 17, 2019, 01:46:58 AM
Quite a short list:

https://www.nvidia.in/object/tegra-phones-tablets-in.html
Title: Re: Using the GPU
Post by: aw27 on May 17, 2019, 02:03:36 AM
They are mostly into chips, not end user products. (https://en.wikipedia.org/wiki/Tegra#Tegra_3)
Like this one: https://www.nvidia.in/object/tegra-x1-processor-in.html
Title: Re: Using the GPU
Post by: LiaoMi on May 17, 2019, 03:15:49 AM
Everything works  :biggrin:, here is the final example translated from FASM... The mistake was that the new APIs were being used; it worked after I set the correct parameters in the header file cuda.inc:

CUDA_FORCE_API_VERSION EQU 3010
CUdeviceptr typedef QWORD

_WIN64 equ 1


and the example ran on the GPU; all functions worked correctly. There is some dependency between the PTX code and the API version. It is time to start on the neural network example  :eusa_dance:.

Here is another interesting topic related to graphics - Single-threaded CUDA OpenGL Interop https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st (https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st)
Title: Re: Using the GPU
Post by: aw27 on May 17, 2019, 05:25:09 AM
@LiaoMi
Did you include the source code translated from FASM?

BTW, have you ever tried to use cubin or fatbin instead of PTX? It would save compilation time, right?


Title: Re: Using the GPU
Post by: LiaoMi on May 17, 2019, 06:38:59 AM
Quote from: AW on May 17, 2019, 05:25:09 AM
@LiaoMi
Did you include the source code translated from FASM?

BTW, have you ever tried to use cubin or fatbin instead of PTX? It would save compilation time, right?

@AW
Please accept my apologies, I did not attach the most important thing  :icon_eek:; the source is now in place, please download again. You can change the paths of the system libraries and compile with UASM. Unfortunately, Cuda.inc will only work with UASM; you can use your batch file, and the inc folder also contains the translate64.inc file.

Quote: have you ever tried to use cubin or fatbin instead of PTX
I thought about it a little but did not study it more deeply, because I don't see any advantage in it; relying on the description in the documentation, working with PTX gives more possibilities and more flexibility.
This was written by other users https://devtalk.nvidia.com/default/topic/504259/cubin-vs-ptx/ (https://devtalk.nvidia.com/default/topic/504259/cubin-vs-ptx/)

fatbin is a similar story; I think it will be useful only when strict code optimization is needed for a certain type of video card. Maybe I'm wrong, or maybe not :eusa_boohoo:

Quote: It will save compilation time, right?
Anyway, in binary form the execution time should be optimal  :icon_exclaim:

Useful link CUDA Binary Utilities cuobjdump
cuobjdump extracts information from CUDA binary files (both standalone and those embedded in host binaries) and presents them in human readable format. The output of cuobjdump includes CUDA assembly code for each kernel, CUDA ELF section headers, string tables, relocators and other CUDA specific sections. It also extracts embedded ptx text from host binaries. https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump)
$ cuobjdump a.out -ptx -sass
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
identifier = add.cu

        code for sm_20
                Function : _Z3addPiS_S_
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];  /* 0x2800440400005de4 */
        /*0008*/         MOV R6, c[0x0][0x20];   /* 0x2800400080019de4 */
        /*0010*/         MOV R7, c[0x0][0x24];   /* 0x280040009001dde4 */
        /*0018*/         MOV R2, c[0x0][0x28];   /* 0x28004000a0009de4 */
        /*0020*/         MOV R3, c[0x0][0x2c];   /* 0x28004000b000dde4 */
        /*0028*/         LDU.E R0, [R6];         /* 0x8c00000000601c85 */
        /*0030*/         MOV R4, c[0x0][0x30];   /* 0x28004000c0011de4 */
        /*0038*/         LDU.E R2, [R2];         /* 0x8c00000000209c85 */
        /*0040*/         MOV R5, c[0x0][0x34];   /* 0x28004000d0015de4 */
        /*0048*/         IADD R0, R2, R0;        /* 0x4800000000201c03 */
        /*0050*/         ST.E [R4], R0;          /* 0x9400000000401c85 */
        /*0058*/         EXIT;                   /* 0x8000000000001de7 */
                .............................

Fatbin ptx code:
================
arch = sm_20
code version = [4,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
identifier = add.cu

.version 4.0
.target sm_20
.address_size 64

.visible .entry _Z3addPiS_S_(
.param .u64 _Z3addPiS_S__param_0,
.param .u64 _Z3addPiS_S__param_1,
.param .u64 _Z3addPiS_S__param_2
)
{
.reg .s32 %r<4>;
.reg .s64 %rd<7>;

ld.param.u64 %rd1, [_Z3addPiS_S__param_0];
ld.param.u64 %rd2, [_Z3addPiS_S__param_1];
ld.param.u64 %rd3, [_Z3addPiS_S__param_2];
cvta.to.global.u64 %rd4, %rd3;
cvta.to.global.u64 %rd5, %rd2;
cvta.to.global.u64 %rd6, %rd1;
ldu.global.u32 %r1, [%rd6];
ldu.global.u32 %r2, [%rd5];
add.s32 %r3, %r2, %r1;
st.global.u32 [%rd4], %r3;
ret;
}


Title: Re: Using the GPU
Post by: aw27 on May 17, 2019, 07:07:07 AM
I get an error when running your program on an old computer with a GeForce GT 640

(https://www.dropbox.com/s/qg14qieqvglnu4i/cuda1.jpg?dl=1)

I also can't run my queens program, but that may be because I can't compile a PTX for versions below sm_30 with my nvcc.exe, and possibly the GeForce GT 640 requires lower than that. The error I get with my proggy is CUDA_ERROR_NO_BINARY_FOR_GPU (error 209). What tool do you use to compile to sm_13? That appears way too old.
Title: Re: Using the GPU
Post by: LiaoMi on May 17, 2019, 07:36:40 AM
sm_13 was already ready in the example  :biggrin:

CUDA Toolkit v10.1.168_pdf -> CUDA_C_Programming_Guide

3.1.2. Binary Compatibility
Binary code is architecture-specific. A cubin object is generated using the compiler
option -code that specifies the targeted architecture: For example, compiling with
-code=sm_35 produces binary code for devices of compute capability 3.5. Binary
compatibility is guaranteed from one minor revision to the next one, but not from one
minor revision to the previous one or across major revisions. In other words, a cubin
object generated for compute capability X.y will only execute on devices of compute
capability X.z where z≥y.

3.1.3. PTX Compatibility
Some PTX instructions are only supported on devices of higher compute capabilities.
For example, Warp Shuffle Functions are only supported on devices of compute
capability 3.0 and above. The -arch compiler option specifies the compute capability
that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for
example, must be compiled with -arch=compute_30 (or higher).
PTX code produced for some specific compute capability can always be compiled to
binary code of greater or equal compute capability. Note that a binary compiled from an
earlier PTX version may not make use of some hardware features. For example, a binary
targeting devices of compute capability 7.0 (Volta) compiled from PTX generated for
compute capability 6.0 (Pascal) will not make use of Tensor Core instructions, since these
were not available on Pascal. As a result, the final binary may perform worse than would
be possible if the binary were generated using the latest version of PTX.
3.1.4. Application Compatibility
To execute code on devices of specific compute capability, an application must load
binary or PTX code that is compatible with this compute capability as described in
Binary Compatibility and PTX Compatibility. In particular, to be able to execute code
on future architectures with higher compute capability (for which no binary code can be
generated yet), an application must load PTX code that will be just-in-time compiled for
these devices (see Just-in-Time Compilation).
Which PTX and binary code gets embedded in a CUDA C application is controlled by
the -arch and -code compiler options or the -gencode compiler option as detailed in
the nvcc user manual. For example,
nvcc x.cu
        -gencode arch=compute_35,code=sm_35
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=\'compute_60,sm_60\'
embeds binary code compatible with compute capability 3.5 and 5.0 (first and second
-gencode options) and PTX and binary code compatible with compute capability 6.0
(third -gencode option).
Host code is generated to automatically select at runtime the most appropriate code to
load and execute, which, in the above example, will be:
‣ 3.5 binary code for devices with compute capability 3.5 and 3.7,
‣ 5.0 binary code for devices with compute capability 5.0 and 5.2,
‣ 6.0 binary code for devices with compute capability 6.0 and 6.1,
‣ PTX code which is compiled to binary code at runtime for devices with compute
capability 7.0 and higher.
x.cu can have an optimized code path that uses warp shuffle operations, for example,
which are only supported in devices of compute capability 3.0 and higher. The
__CUDA_ARCH__ macro can be used to differentiate various code paths based on
compute capability. It is only defined for device code. When compiling with arch=compute_35 for example,
__CUDA_ARCH__ is equal to 350.
Applications using the driver API must compile code to separate files and explicitly load
and execute the most appropriate file at runtime.
The Volta architecture introduces Independent Thread Scheduling which changes the
way threads are scheduled on the GPU. For code relying on specific behavior of SIMT
scheduling in previous architecures, Independent Thread Scheduling may alter the set of
participating threads, leading to incorrect results. To aid migration while implementing
the corrective actions detailed in Independent Thread Scheduling, Volta developers
can opt-in to Pascal's thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.
The nvcc user manual lists various shorthands for the -arch, -code, and -gencode compiler options. For example, -arch=sm_35 is a shorthand for -arch=compute_35 -code=compute_35,sm_35 (which is the same as
-gencode arch=compute_35,code=\'compute_35,sm_35\').
Title: Re: Using the GPU
Post by: LiaoMi on May 17, 2019, 07:42:11 AM
Quote from: AW on May 17, 2019, 07:07:07 AM
I get an error when running your program on an old computer with a GeForce GT 640

(https://www.dropbox.com/s/qg14qieqvglnu4i/cuda1.jpg?dl=1)

I also can't run my queens program, but that may be because I can't compile a PTX for versions below sm_30 with my nvcc.exe, and possibly the GeForce GT 640 requires lower than that. The error I get with my proggy is CUDA_ERROR_NO_BINARY_FOR_GPU (error 209). What tool do you use to compile to sm_13? That appears way too old.

There are differences in how memory is handled, so this is normal behaviour; I can't answer the question at the moment, I can only guess :P

Matching SM architectures (CUDA arch and CUDA gencode) for various NVIDIA cards https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ (https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/)
I've seen some confusion regarding NVIDIA's nvcc sm flags and what they're used for:
When compiling with NVCC, the arch flag ('-arch') specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
Gencodes ('-gencode') allows for more PTX generations, and can be repeated many times for different architectures.


Here is an example for Cubin  :t ...

    /// Init cuda
    SAFE_CALL(cuInit(0));
    /// Get device
    SAFE_CALL(cuDeviceGet(&dev, 0));
    /// Create context
    SAFE_CALL(cuCtxCreate(&ctx, 0, dev));
   
    SAFE_CALL(cuModuleLoad(&m, "test.cubin"));

    CUfunction hfunc;
    SAFE_CALL(cuModuleGetFunction(&hfunc, m, "my_kernel"));


P.S. SM stands for Streaming Multiprocessor, and the number indicates the features supported by the architecture.
Title: Re: Using the GPU
Post by: aw27 on May 17, 2019, 03:35:35 PM
The real problem is the paranoia of forced obsolescence we are witnessing these days. From VS I can't even build for sm_30; the minimum is sm_35.
If I can't solve the issues, I will have to install an old NVidia SDK version.

Later:
I could run my little programs on the computer with the NVidia GeForce GT 640 after installing the Nvidia SDK 10.1 on that computer.
So it was a problem of outdated drivers.

I still can't run your mywindows1.exe program. I believe the problem might be related to your usage of the /LARGEADDRESSAWARE:NO switch, but I can't build your sample straightaway to confirm.

Title: Re: Using the GPU
Post by: LiaoMi on May 17, 2019, 06:23:05 PM
Quote from: AW on May 17, 2019, 03:35:35 PM
The real problem is the paranoia of forced obsolescence we are witnessing these days. From VS I can't even build for sm_30; the minimum is sm_35.
If I can't solve the issues, I will have to install an old NVidia SDK version.

Later:
I could run my little programs on the computer with the NVidia GeForce GT 640 after installing the Nvidia SDK 10.1 on that computer.
So it was a problem of outdated drivers.

I still can't run your mywindows1.exe program. I believe the problem might be related to your usage of the /LARGEADDRESSAWARE:NO switch, but I can't build your sample straightaway to confirm.

I have problems installing Cuda 10.1 in visual studio 2019
https://devtalk.nvidia.com/default/topic/1049613/nsight-visual-studio-edition/visual-studio-2019-and-cuda-10-1/1 (https://devtalk.nvidia.com/default/topic/1049613/nsight-visual-studio-edition/visual-studio-2019-and-cuda-10-1/1)
I don't even talk about old sdk in my case ...

My proposal is to update the CudaPad https://github.com/SunsetQuest/CudaPAD (https://github.com/SunsetQuest/CudaPAD) project for our needs. Add more options, add portable versions of Cuda SDK  :idea: There is already SM version selection in the project.


Quote from: AW on May 17, 2019, 03:35:35 PM
I can't still run your mywindows1.exe program. I believe the problem might related to your usage of the /LARGEADRESSAWARE:NO switch but I can't build your sample straightaway to confirm.

Starting pass 2
     mywindow1.obj
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'NumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'LogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'NumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'CudaLogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'LogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err000' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err100' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err200' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err300' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err400' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err500' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err600' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err700' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err999' invalid without /LARGEADDRESSAWARE:NO
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     msvcrt.lib(msvcrt.dll)
     msvcrt.lib(msvcrt.dll)
     msvcrt.lib(msvcrt.dll)
     Kernel32.lib(KERNEL32.dll)
     Kernel32.lib(KERNEL32.dll)
     Kernel32.lib(KERNEL32.dll)
     User32.lib(USER32.dll)
     User32.lib(USER32.dll)
     User32.lib(USER32.dll)
LINK : fatal error LNK1165: link failed because of fixup errors


The wording of the linker error message is a bit misleading. In fact, you are using an addressing mode that uses a 32-bit offset, which results in a 32-bit address in the symbol table.

Here is a description of my problem:

MODULE1.ASM
...
VAR1    DB 256 dup (0)
...
MODULE2.ASM
...
extrn    VAR1:BYTE
...
MOV   al, byte ptr[VAR1 + rbx]

In 64 bit mode only very few addressing modes can handle absolute 64 bit offsets.
Most complex addressing modes are restricted to 32 bit offsets (like the [mem + reg]
used above).

So the assembler generates an entry in the symbol table that states VAR1 as a
32bit offset for MODULE2.

The symbol table of MODULE1 only has an entry that states VAR1 as a 64bit offset.
In this case the linker complains that symbol VAR1 has different address sizes.

To avoid this a different addressing mode must be used - something like this:
LEA   rsi, VAR1
MOV   al, byte ptr[rsi + rbx]
Title: Re: Using the GPU
Post by: aw27 on May 17, 2019, 06:54:44 PM
I have problems with VS 2019 as well in CUDA so I am using VS 2017.

CudaPad is fine, but it works with nvcc.exe, so it can't produce miracles. If I select sm_23 it reports: nvcc fatal   : Value 'sm_23' is not defined for option 'gpu-architecture'  :(

I never used /LARGEADDRESSAWARE:NO because it is a kludge for transitioning from Win32 to Win64.
If you can't fix it, I will have a look at your program during the weekend, and with some luck I will try to build it with /LARGEADDRESSAWARE:YES.

Title: Re: Using the GPU
Post by: TimoVJL on May 17, 2019, 07:29:10 PM
nvopencc.exe works in Windows 7 if downloaded from CUDA Toolkit 3.2 (https://developer.nvidia.com/cuda-toolkit-32-downloads), but the one from CUDA Toolkit 4.2.9 just crashes.

That toolkit is a .msi file, so TLMsiListEx (http://masm32.com/board/index.php?topic=7435.msg81276#msg81276) can be useful.

nvcc error   : 'cudafe++' died with status 0xC0000374  :( RIP  :P
Title: Re: Using the GPU
Post by: LiaoMi on May 17, 2019, 09:45:36 PM
Quote from: AW on May 17, 2019, 06:54:44 PM
I have problems with VS 2019 as well in CUDA so I am using VS 2017.

CudaPad is fine, but it works with nvcc.exe, so it can't produce miracles. If I select sm_23 it reports: nvcc fatal   : Value 'sm_23' is not defined for option 'gpu-architecture'  :(

I never used /LARGEADDRESSAWARE:NO because it is a kludge for transitioning from Win32 to Win64.
If you can't fix it, I will have a look at your program during the weekend, and with some luck I will try to build it with /LARGEADDRESSAWARE:YES.

Without using /LARGEADDRESSAWARE:NO  :icon14:
Title: Re: Using the GPU
Post by: aw27 on May 17, 2019, 09:59:24 PM
That was the problem, it is fixed.  :t

(https://www.dropbox.com/s/6dg0kcjjt3o9tq6/cuda2.jpg?dl=1)
Title: Re: Using the GPU
Post by: mikeburr on May 17, 2019, 10:18:48 PM
@LiaoMi ... have you found any one of the linux flavours to be superior to the others for this kind of thing .. was thinking of trying CentOS 7
regards mikeb
ps i worked on a stock control system many years ago where i used distributions to assess the new weighting rather than the linear moving weight scheme  you have in your neural system . If you do move to a similar scheme then i advise you to use a skewed distribution such as Students T as very rarely in nature is any distribution symmetric  a failing which is sadly almost always overlooked .. i include in this the FFT methods now prevalent for ascertaining independance of variables as these are highly symmetricised methods [ see the many excellent examples and links provided by Siekmanski ]   
Title: Re: Using the GPU
Post by: TimoVJL on May 18, 2019, 04:17:12 AM
A small example in C for testing an old card like nVidia G210:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
//#include "cuda.h"
#pragma comment(lib, "cuda.lib")
#pragma comment(lib, "msvcrt.lib")

#define CUDAAPI __stdcall
typedef int CUdevice;
typedef struct CUctx_st *CUcontext;
typedef struct CUmod_st *CUmodule;
typedef enum cudaError_enum {
    CUDA_SUCCESS                              = 0,
    CUDA_ERROR_INVALID_VALUE                  = 1,
    CUDA_ERROR_OUT_OF_MEMORY                  = 2,
    CUDA_ERROR_NOT_INITIALIZED                = 3,
    CUDA_ERROR_DEINITIALIZED                  = 4,
    CUDA_ERROR_NO_DEVICE                      = 100,
    CUDA_ERROR_INVALID_DEVICE                 = 101,
    CUDA_ERROR_INVALID_IMAGE                  = 200,
    CUDA_ERROR_INVALID_CONTEXT                = 201,
    CUDA_ERROR_CONTEXT_ALREADY_CURRENT        = 202,
    CUDA_ERROR_MAP_FAILED                     = 205,
    CUDA_ERROR_UNMAP_FAILED                   = 206,
    CUDA_ERROR_ARRAY_IS_MAPPED                = 207,
    CUDA_ERROR_ALREADY_MAPPED                 = 208,
    CUDA_ERROR_NO_BINARY_FOR_GPU              = 209,
    CUDA_ERROR_ALREADY_ACQUIRED               = 210,
    CUDA_ERROR_NOT_MAPPED                     = 211,
    CUDA_ERROR_NOT_MAPPED_AS_ARRAY            = 212,
    CUDA_ERROR_NOT_MAPPED_AS_POINTER          = 213,
    CUDA_ERROR_ECC_UNCORRECTABLE              = 214,
    CUDA_ERROR_UNSUPPORTED_LIMIT              = 215,
    CUDA_ERROR_INVALID_SOURCE                 = 300,
    CUDA_ERROR_FILE_NOT_FOUND                 = 301,
    CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND = 302,
    CUDA_ERROR_SHARED_OBJECT_INIT_FAILED      = 303,
    CUDA_ERROR_OPERATING_SYSTEM               = 304,
    CUDA_ERROR_INVALID_HANDLE                 = 400,
    CUDA_ERROR_NOT_FOUND                      = 500,
    CUDA_ERROR_NOT_READY                      = 600,
    CUDA_ERROR_LAUNCH_FAILED                  = 700,
    CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES        = 701,
    CUDA_ERROR_LAUNCH_TIMEOUT                 = 702,
    CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING  = 703,
    CUDA_ERROR_UNKNOWN                        = 999
} CUresult;
CUresult CUDAAPI cuDriverGetVersion(int *driverVersion);
CUresult CUDAAPI cuInit(unsigned int Flags);
CUresult CUDAAPI cuDeviceGetCount(int *count);
CUresult CUDAAPI cuDeviceGet(CUdevice *device, int ordinal);
CUresult CUDAAPI cuCtxCreate(CUcontext *pctx, unsigned int flags, CUdevice dev);
CUresult CUDAAPI cuCtxGetApiVersion(CUcontext ctx, unsigned int *version);
CUresult CUDAAPI cuDeviceComputeCapability(int *major, int *minor, CUdevice dev);
CUresult CUDAAPI cuModuleLoad(CUmodule *module, const char *fname);
CUresult CUDAAPI cuCtxDetach(CUcontext ctx);
CUresult CUDAAPI cuGetErrorString(CUresult error, const char **pStr);

void __cdecl mainCRTStartup(void)
{
int deviceCount = 0;
int iDrvVer, iApiVer;
CUdevice dev;
CUcontext ctx;
CUmodule mod;
CUresult err;
int iMajor, iMinor;
cuDriverGetVersion(&iDrvVer);
printf("Driver Version: %d.%d\n", iDrvVer / 1000, (iDrvVer % 1000) / 10);
if (!(err = cuInit(0))) {
err = cuDeviceGetCount(&deviceCount);
err = cuDeviceGet(&dev, 0);
err = cuCtxCreate(&ctx, 0, dev);
err = cuCtxGetApiVersion(ctx, &iApiVer);
printf("API Version: %d.%d\n", iApiVer / 1000, (iApiVer % 1000) / 10);
err = cuDeviceComputeCapability(&iMajor, &iMinor, dev);
printf("DeviceComputeCabability: %d.%d\n", iMajor, iMinor);
//err = cuModuleLoadData(&mod, PTXSourceData);
err = cuModuleLoad(&mod, "Test.ptx");
if (!err) {
printf("Test.ptx loaded\n");
} else {
const char *perr;
cuGetErrorString(err, &perr);
printf("error loading ptx: %u %s\n", err, perr);
}
err = cuCtxDetach(ctx);
}
ExitProcess(0);
}
Test.ptx:
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with open64/lib//be.exe
// nvopencc 3.2 built on 2010-11-04
.entry _Z3addPiS_ (
.param .u64 __cudaparm__Z3addPiS__a,
.param .u64 __cudaparm__Z3addPiS__b)
{
.reg .u32 %r<6>;
.reg .u64 %rd<8>;
.reg .pred %p<3>;
.loc 15 4 0
$LDWbegin__Z3addPiS_:
cvt.s32.u16 %r1, %tid.x;
mov.u32 %r2, 999;
setp.gt.s32 %p1, %r1, %r2;
@%p1 bra $Lt_0_1026;
.loc 15 7 0
cvt.s64.s32 %rd1, %r1;
mul.wide.s32 %rd2, %r1, 4;
ld.param.u64 %rd3, [__cudaparm__Z3addPiS__a];
add.u64 %rd4, %rd3, %rd2;
ld.global.s32 %r3, [%rd4+0];
mul.lo.s32 %r4, %r3, 2;
ld.param.u64 %rd5, [__cudaparm__Z3addPiS__b];
add.u64 %rd6, %rd5, %rd2;
st.global.s32 [%rd6+0], %r4;
$Lt_0_1026:
.loc 15 9 0
exit;
$LDWend__Z3addPiS_:
} // _Z3addPiS_

It just tells you that it loaded the thing ;)

PS: it took a while to get the G210 working with Windows 8.1, just a nightmare, as an old nVidia card and Windows 8.1 have some dating problems.
(MS: women regardless of marital status)

EDIT: how to create that kernel code: Add.i
typedef struct  uint3
{
    unsigned int x, y, z;
}uint3;

extern const uint3 threadIdx;
//__global__
__attribute__((global)) __attribute__((__used__))
void add(int* a, int* b) {
int i = threadIdx.x;
if (i < 1000) {
b[i] = 2 * a[i];
}
}
open64\bin\nvopencc  -TARG:compute_10 -m64 -OPT:ftz=1 -CG:ftz=1 -CG:prec_div=0 -CG:prec_sqrt=0 "Add" "Add.i"  -o "Add.ptx"
For CUDA Toolkit 7:
compiler\nvvm\bin\cicc -arch compute_30 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 -nvvmir-library "compiler\bin/../nvvm/libdevice/libdevice.compute_30.10.bc" --orig_src_file_name "Add.cu"  "Add.i" -o "Add.ptx"

EDIT: copy the driver / global code part from the .cu to a .ci file and add a header file for it.
Minimal header for .ci, ci_include.h:
typedef struct _uint3
{
    unsigned int x, y, z;
}uint3;

typedef struct _dim3
{
    unsigned int x, y, z;
}dim3;

extern const uint3 threadIdx;
extern const uint3 blockIdx;
extern const dim3 blockDim;
extern const dim3 gridDim;
extern const int warpSize;

#define __device__ __attribute__((device))
#define __global__ __attribute__((global))
#define __shared__ __attribute__((shared))

#define bool _Bool

The clang (https://www.llvm.org/docs/CompileCudaWithLLVM.html) backend supports CUDA

EDIT: AMD clang (https://github.com/GPUOpen-Tools/RGA/tree/master/Core/ROCm/OpenCL/win64/bin):
clang.exe -c Add.cu -nocudainc  -nocudalib
Title: Re: Using the GPU
Post by: LiaoMi on May 18, 2019, 04:37:35 AM
Quote from: mikeburr on May 17, 2019, 10:18:48 PM
@LiaoMi ... have you found any one of the Linux flavours to be superior to the others for this kind of thing? I was thinking of trying CentOS 7.
regards mikeb
ps: I worked on a stock control system many years ago where I used distributions to assess the new weighting rather than the linear moving-weight scheme you have in your neural system. If you do move to a similar scheme, I advise you to use a skewed distribution such as Student's t, as very rarely in nature is any distribution symmetric, a failing which is sadly almost always overlooked. I include in this the FFT methods now prevalent for ascertaining independence of variables, as these are highly symmetricised methods [see the many excellent examples and links provided by Siekmanski]

Hi mikeburr,

I can not say exactly ... Was the question about the GPU or about neural networks? The assembly language example is not mine; the author is no longer active and there are flaws in it. I hope we'll be able to add improvements, so tips on the algorithm are welcome.
From my point of view, only two factors matter when choosing the system environment: having the right tools and the usability of the data. For neural networks I would therefore choose Java or dotnet; both languages have hard points in working with the GPU, but they can be solved. The more standard the OS, the better: Java + Linux = Dotnet + Windows, or as an option Dotnet + Linux = Java + Windows. Why not assembler? Because of the amount of data, the object-oriented processing methods and the ability to use ready-made components. Still, examples written in assembly language are attractive for stable processes and stable models that don't require further intervention.

Perhaps it would be more logical to find a suitable GitHub project where the most interesting neural network model for the GPU has already been chosen. After all, the main discussion is about the computational power of the GPU and algorithms for assembler. It makes no sense to copy all the complexities of an object-oriented language  :P

I don't have Linux at the moment, but the previous example would need to be adapted for Linux ...

How to install NVIDIA CUDA Toolkit on CentOS 7 Linux - https://linuxconfig.org/how-to-install-nvidia-cuda-toolkit-on-centos-7-linux (https://linuxconfig.org/how-to-install-nvidia-cuda-toolkit-on-centos-7-linux)
Title: Re: Using the GPU
Post by: aw27 on May 18, 2019, 05:11:08 AM
This is a simplified version of the log2 example produced by LiaoMi, using 3 APIs:
- Runtime API
- Driver API
- Driver API in MASM

Common C/C++


#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>


#define N 32

extern "C" __global__ void log_2(float* inout, int length) {
int xind = blockIdx.x * blockDim.x + threadIdx.x;
if (xind < length) {
float tempFloat = log2(*(inout + xind));
*(inout +  xind) = tempFloat;
}
}

//#define RUNTIMEAPI 1
//#define DRIVERAPI 1
#define MASMAPI 1

extern "C" int logasm2();
extern "C" char PTXFunction;
extern "C" char PTXSourceData;

#if defined(RUNTIMEAPI)
void testruntimeAPi()
{
float hinOut[N];
float* dinOut;
cudaMalloc((void**)& dinOut, N * sizeof(float));

for (int i = 1; i <= N; ++i) {
hinOut[i-1] = (float)i;
}
cudaMemcpy(dinOut, hinOut, N * sizeof(float), cudaMemcpyHostToDevice);
log_2 <<<N, 1 >>> (dinOut, N);
cudaMemcpy(hinOut, dinOut, N * sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < N; ++i) {
printf("%d %f\n", i+1, hinOut[i]);
}
cudaFree(dinOut);
getchar();
}
#endif

#if defined(DRIVERAPI)
int testDriverApi()
{
CUdevice cuDevice;
CUcontext cuContext;
CUmodule cuModule;
CUfunction cudaFunction;
CUresult result;
CUdeviceptr dinOut;
int count;

float hinOut[N];
unsigned int memSize = sizeof(float) * N;

if (cuInit(0))
return 1;
if (cuDeviceGet(&cuDevice, 0))
return 1;
if (cuCtxCreate(&cuContext, 0, cuDevice))
return 1;
if (cuModuleLoadData(&cuModule, &PTXSourceData))
return 1;
if (cuModuleGetFunction(&cudaFunction, cuModule, &PTXFunction))
return 1;
for (int i = 1; i <= N; ++i) {
hinOut[i - 1] = (float)i;
}
cuMemAlloc(&dinOut, memSize);
cuMemcpyHtoD(dinOut, hinOut, memSize);
count = N;
void *args[] = { &dinOut, &count };

result = cuLaunchKernel(cudaFunction, N, 1, 1, 1, 1, 1, 0, 0, &args[0], 0);
if (result!=CUDA_SUCCESS)
return 1;
cuCtxSynchronize();
cuMemcpyDtoH(hinOut, dinOut, memSize);
for (int i = 0; i < N; ++i) {
printf("%d %f\n", i + 1, hinOut[i]);
}
cuMemFree(dinOut);

getchar();
return 0;
}
#endif

#if defined(MASMAPI)
void testMasmAPI()
{
int err=logasm2();
printf("Error %d (0=NO ERROR)",err);
getchar();
return;
}
#endif

int main()
{
#if RUNTIMEAPI
testruntimeAPi();
#elif DRIVERAPI
testDriverApi();
#elif MASMAPI
testMasmAPI();
#endif
return 0;
}




MASM:


OPTION casemap:none

N equ 32
public PTXFunction
public PTXSourceData

printf proto :ptr, :vararg

cuInit proto :dword
cuDeviceGet proto :ptr, :dword
cuCtxCreate_v2 proto :ptr, :dword, :dword
cuModuleLoadData proto :ptr, :ptr
cuModuleGetFunction proto :ptr, :ptr, :ptr
cuMemAlloc_v2 proto :ptr, :qword
cuMemcpyHtoD_v2 proto :ptr, :ptr, :qword
cuMemcpyDtoH_v2 proto :ptr, :ptr, :qword
cuLaunchKernel proto :ptr, :dword, :dword, :dword, :dword, :dword, :dword, :ptr, :ptr, :ptr
cuCtxSynchronize proto
cuMemFree_v2 proto :ptr
cuCtxGetApiVersion proto :ptr, :ptr

.code

logasm2 proc
LOCAL cuDevice:dword
LOCAL cuContext:ptr
LOCAL cuModule:ptr
LOCAL cudaFunction:ptr
LOCAL dInOut: qword
LOCAL hInOut[N]:real4
LOCAL memSize : qword
LOCAL args[2] : ptr
LOCAL qt : dword;

sub rsp, 58h
and rsp, -16 ; align
mov eax, sizeof REAL4 * N
mov memSize, rax
mov rcx,0
call cuInit
cmp eax, 0
jnz @exit

lea rcx, cuDevice
mov edx,0
call cuDeviceGet
cmp eax, 0
jnz @exit

lea rcx, cuContext
mov edx, 0
mov r8d, cuDevice
call cuCtxCreate_v2
cmp eax, 0
jnz @exit

lea rcx, cuModule
mov rdx, offset PTXSourceData
call cuModuleLoadData
cmp eax, 0
jnz @exit

lea rcx, cudaFunction
mov rdx, cuModule
lea r8, PTXFunction
call cuModuleGetFunction
cmp eax, 0
jnz @exit

mov ecx, N
lea r11, hInOut
fld1
fld1
@@:
fst dword ptr [r11]
fadd st,st(1)
add r11, 4
dec ecx
jnz @B

lea rcx, dInOut
mov rdx, memSize
call cuMemAlloc_v2
cmp eax, 0
jnz @exit

mov rcx, dInOut
lea rdx, hInOut
mov r8, memSize
call cuMemcpyHtoD_v2
cmp eax, 0
jnz @exit

lea rax, dInOut
mov qword ptr args, rax
mov qt, N
lea rax, qt
mov qword ptr args+8, rax

mov rcx, cudaFunction
mov edx ,1
mov r8d, 1
mov r9d, 1
mov dword ptr [rsp+20h], N
mov dword ptr [rsp+28h], 1
mov dword ptr [rsp+30h], 1
mov dword ptr [rsp+38h], 0
mov qword ptr [rsp+40h], 0
lea rax, args
mov qword ptr [rsp+48h], rax
mov qword ptr [rsp+50h], 0
call cuLaunchKernel
cmp eax, 0
jnz @exit

call cuCtxSynchronize
cmp eax, 0
jnz @exit

lea rcx, hInOut
mov rdx, dInOut
mov r8, memSize
call cuMemcpyDtoH_v2
cmp eax, 0
jnz @exit

mov ebx, 1
lea rdi, hInOut
@@:
mov rcx, offset report
mov edx, ebx
fld dword ptr [rdi]
fstp qword ptr [rsp+20h]
mov r8, qword ptr [rsp+20h]
call printf
add rdi, 4
inc ebx
cmp ebx, N
jle @B

mov rcx, dInOut
call cuMemFree_v2
mov eax,0
@exit:
ret
logasm2 endp

.data
report db "%d %f",10,0

; Generated by NVIDIA NVVM Compiler
; Compiler Build ID: CL-25769353
; Cuda compilation tools, release 10.1, V10.1.105
; Based on LLVM 3.4svn
PTXFunction db 'log_2',0
PTXSourceData \
db '.version 6.4 ',10
db '.target sm_30 ',10
db '.address_size 64 ',10
db ' ',10
db ' // .globl log_2 ',10
db ' ',10
db '.visible .entry log_2( ',10
db ' .param .u64 log_2_param_0, ',10
db ' .param .u32 log_2_param_1 ',10
db ') ',10
db '{ ',10
db ' .reg .pred %p<2>; ',10
db ' .reg .f32 %f<3>; ',10
db ' .reg .b32 %r<6>; ',10
db ' .reg .b64 %rd<5>; ',10
db ' ',10
db ' ',10
db ' ld.param.u64 %rd1, [log_2_param_0]; ',10
db ' ld.param.u32 %r2, [log_2_param_1]; ',10
db ' mov.u32 %r3, %ctaid.x; ',10
db ' mov.u32 %r4, %ntid.x; ',10
db ' mov.u32 %r5, %tid.x; ',10
db ' mad.lo.s32 %r1, %r4, %r3, %r5; ',10
db ' setp.ge.s32 %p1, %r1, %r2; ',10
db ' @%p1 bra BB0_2; ',10
db ' ',10
db ' cvta.to.global.u64 %rd2, %rd1; ',10
db ' mul.wide.s32 %rd3, %r1, 4; ',10
db ' add.s64 %rd4, %rd2, %rd3; ',10
db ' ld.global.f32 %f1, [%rd4]; ',10
db ' lg2.approx.ftz.f32 %f2, %f1; ',10
db ' st.global.f32 [%rd4], %f2; ',10
db ' ',10
db 'BB0_2: ',10
db ' ret; ',10
db '} ',10
db 0
end



Output:
1 0.000000
2 1.000000
3 1.584962
4 2.000000
5 2.321928
6 2.584962
7 2.807355
8 3.000000
9 3.169925
10 3.321928
11 3.459432
12 3.584962
13 3.700440
14 3.807355
15 3.906890
16 4.000000
17 4.087463
18 4.169925
19 4.247927
20 4.321928
21 4.392317
22 4.459432
23 4.523562
24 4.584962
25 4.643856
26 4.700439
27 4.754887
28 4.807355
29 4.857981
30 4.906890
31 4.954196
32 5.000000
Error 0 (0=NO ERROR)
Title: Re: Using the GPU
Post by: TimoVJL on May 24, 2019, 07:50:02 PM
clang (https://www.llvm.org/docs/CompileCudaWithLLVM.html) backend supports CUDA :thumbsup:
So for textual .ptx, only ptxas / fatbinary stubs are needed.
a minimal ci_include.h:
typedef struct _uint3
{
    unsigned int x, y, z;
}uint3;

typedef struct _dim3
{
    unsigned int x, y, z;
}dim3;

extern const uint3 threadIdx;
extern const uint3 blockIdx;
extern const dim3 blockDim;
extern const dim3 gridDim;
extern const int warpSize;

#define __device__ __attribute__((device))
#define __global__ __attribute__((global))
#define __shared__ __attribute__((shared))

#define bool _Bool
a fake ptxas that just copies that temporary .s file:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <string.h>

int __cdecl main(int argc, char **argv)
{
char fname[260];
char *ptx = argv[argc-1];
while(*ptx) ptx++; // find end
char *pend = ptx;
while(*pend != '-') pend--;
ptx = pend;
while(*ptx != '\\') ptx--;
ptx++;
strcpy(fname, ptx);
char *pext = fname + (pend - ptx);
*(long*)pext = *(long*)".ptx"; // add ext
*(pext+4) = 0; // cut name
return !CopyFile(argv[argc-1], fname, FALSE);
}
a command line used in the tests:
clang.exe -c add.cu -nocudainc -nocudalib --cuda-gpu-arch=sm_30
Title: Re: Using the GPU
Post by: LiaoMi on May 28, 2019, 10:46:49 PM
Quote from: TimoVJL on May 24, 2019, 07:50:02 PM
clang (https://www.llvm.org/docs/CompileCudaWithLLVM.html) backend supports CUDA :thumbsup:

Hi TimoVJL,

need to try; there is a web compiler for CUDA and CUDA LLVM:
https://cuda.godbolt.org/ (https://cuda.godbolt.org/)
and the ordinary compiler:
https://godbolt.org/ (https://godbolt.org/)

CUDA Compression
A GPU-based LZSS compression algorithm, highly tuned for NVIDIA GPGPUs and for streaming data, leveraging the respective strengths of CPUs and GPUs together - https://github.com/adnanozsoy/CUDA_Compression (https://github.com/adnanozsoy/CUDA_Compression)
Cuda lzss compression https://github.com/abshkbh/cuda-lzss (https://github.com/abshkbh/cuda-lzss)
Algorithms for Compression on GPUs http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6642/pdf/imm6642.pdf (http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6642/pdf/imm6642.pdf)
Compression library using Nvidia's CUDA - https://stackoverflow.com/questions/456829/compression-library-using-nvidias-cuda (https://stackoverflow.com/questions/456829/compression-library-using-nvidias-cuda)
Parallel lossless compression using GPUs http://on-demand.gputechconf.com/gtc/2014/presentations/S4459-parallel-lossless-compression-using-gpus.pdf (http://on-demand.gputechconf.com/gtc/2014/presentations/S4459-parallel-lossless-compression-using-gpus.pdf)
Breakthrough in CUDA data compression https://www.wave-access.com/public_en/blog/2011/april/22/breakthrough-in-cuda-data-compression.aspx (https://www.wave-access.com/public_en/blog/2011/april/22/breakthrough-in-cuda-data-compression.aspx)

The results:

BMP, 540 Kb
Full bzip2 compression on CPU - 218 ms
BW Transform on CPU - 171 ms
Full bzip2 compression on GPU - 93 ms [ minus 53% ]
BW Transform on GPU - 46 ms [ minus 73% ]

BMP, 1112 Kb
Full bzip2 compression on CPU - 467 ms
BW Transform on CPU - 343 ms
Full bzip2 compression on GPU - 249 ms [ minus 46% ]
BW Transform on GPU - 140 ms [ minus 59% ]

PDF, 1919 Kb
Full bzip2 compression on CPU - 1513 ms
BW Transform on CPU - 731 ms
Full bzip2 compression on GPU - 1107 ms [ minus 26% ]
BW Transform on GPU - 311 ms [ minus 57% ]

PDF, 3425 Kb
Full bzip2 compression on CPU - 2168 ms
BW Transform on CPU - 793 ms
Full bzip2 compression on GPU - 1856 ms [ minus 14% ]
BW Transform on GPU - 481 ms [ minus 39% ]


(https://www.wave-access.com/media/90569/bzip2_gpu.png)

Learn How To Do Alphablending with CUDA
https://www.codeproject.com/Articles/41977/Learn-How-To-Do-Alphablending-with-CUDA (https://www.codeproject.com/Articles/41977/Learn-How-To-Do-Alphablending-with-CUDA)
Title: Re: Using the GPU
Post by: TimoVJL on May 29, 2019, 12:49:08 AM
A good link :thumbsup:
Fewer reasons to download a huge nVidia SDK.
Title: Re: Using the GPU
Post by: LiaoMi on June 03, 2019, 09:41:45 PM
Hi,

I contacted the author of the CudaPAD program, and as a result there is now an update for Visual Studio 2019. The program works with the latest versions of both Visual Studio and cuda_10.1.168_425.25_win10. The source code can be found here: https://github.com/SunsetQuest/CudaPAD

Everything works ...
(https://i.imgur.com/4YxnCIW.png)
Title: Re: Using the GPU
Post by: aw27 on June 04, 2019, 04:01:03 AM
Thank you, LiaoMi. Very helpful  :thumbsup:
Title: Re: Using the GPU
Post by: daydreamer on June 13, 2019, 04:25:45 AM
Is there a way to get the available VRAM I can use for allocating a big array, and to compare it to system RAM? On an older computer that has been upgraded with a better nVidia card (or not), it would probably run much faster.
Title: Re: Using the GPU
Post by: Biterider on June 18, 2019, 06:35:05 PM
Hi
Now that I have some free time, I've been working on CUDA for a bit. Really amazing.
I was able to integrate the code from LiaoMi into one of my 64-bit demos and it works like a charm.  :thumbsup:
I failed in 32 bits because I would need to change the PTX code.  :sad:
So far I have found the newest ISA documentation, which I need to read: https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf

For me, the question remains whether PTX was the right choice, or whether to use nvcc to create the code ...  :icon_idea:

Biterider

PS: link corrected - thanks to TimoVJL -
Title: Re: Using the GPU
Post by: TimoVJL on June 18, 2019, 07:56:02 PM
A working link for PARALLEL THREAD EXECUTION ISA
(https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf)
Title: Re: Using the GPU
Post by: Biterider on June 19, 2019, 10:43:39 PM
Hi
I found a bunch of CUDA tutorials, numbered 1 to 12. They're really good for getting a basic understanding of the GPU architecture, the features and the pitfalls. Each tutorial is accompanied by examples.  :thup:

Here the link to the first one https://www.youtube.com/watch?v=m0nhePeHwFs (https://www.youtube.com/watch?v=m0nhePeHwFs)
The rest is easy to find.

Biterider
Title: Re: Using the GPU
Post by: LiaoMi on June 20, 2019, 07:32:38 PM
Quote from: Biterider on June 19, 2019, 10:43:39 PM
Hi
I found a bunch of CUDA tutorials, numbered 1 to 12. They're really good for getting a basic understanding of the GPU architecture, the features and the pitfalls. Each tutorial is accompanied by examples.  :thup:

Here the link to the first one https://www.youtube.com/watch?v=m0nhePeHwFs (https://www.youtube.com/watch?v=m0nhePeHwFs)
The rest is easy to find.

Biterider

Hi Biterider,

thanks for the interesting video!


Quote from: Biterider on June 18, 2019, 06:35:05 PM
For me, the question remains whether the PTX was the right choice or to use nvcc to create the code ...  :icon_idea:

I looked at standalone compilers and emulators for CUDA; they were not perfect, and in addition, writing a CUDA translator is a very difficult task. If you try to write your own macros for PTX, this will require a serious study of the internal architecture of the GPU. Nvidia tried to create a special emulator that could help developers, but later closed the project, since it was too complicated. For me personally, the best solution is the CudaPad application, with dynamic programming of the C code.

From the documentation...
1.1.3. Purpose of NVCC
The compilation trajectory involves several splitting, compilation, preprocessing, and merging steps for each CUDA source file. It is the purpose of nvcc, the CUDA compiler driver, to hide the intricate details of CUDA compilation from developers. It accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process. All non-CUDA compilation steps are forwarded to a C++ host compiler that is supported by nvcc, and nvcc translates its options to appropriate host compiler command line options.

1.2. Supported Host Compilers
A general purpose C++ host compiler is needed by nvcc in the following situations:
During non-CUDA phases (except the run phase), because these phases will be forwarded by nvcc to this compiler.
During CUDA phases, for several preprocessing stages and host code compilation (see also The CUDA Compilation Trajectory).
nvcc assumes that the host compiler is installed with the standard method designed by the compiler provider. If the host compiler installation is non-standard, the user must make sure that the environment is set appropriately and use relevant nvcc compile options.

Figure 1. CUDA Compilation Trajectory
(https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/graphics/cuda-compilation-from-cu-to-executable.png)
Title: Re: Using the GPU
Post by: daydreamer on June 20, 2019, 09:15:45 PM
thanks Biterider
I downloaded NVASM many years ago; it was only for the oldest 128-bit hardware version of pixel shaders.
I tried out nVidia's Cg toolkit years ago; it is similar to C, but often works on 4 floats simultaneously, like SSE.
Also, in newer C++ you can include <vector>.
So is the idea to code in C++ with vectors, compile for different CPUs or for CUDA, and benchmark the same code on different hardware? CUDA probably supports vectors in C++?

Biterider, LiaoMi or anyone else: have you tested CUDA with Fibonacci, primes or PI benchmark code?