Using the GPU

Started by Biterider, April 30, 2019, 07:26:14 PM


LiaoMi

#30
Everything works  :biggrin:, here is the final example, translated from FASM ... The mistake was that the newer API was being used. After I set the correct parameters in the header file cuda.inc:

CUDA_FORCE_API_VERSION EQU 3010   ; pin the driver API to the 3.1 interface
CUdeviceptr typedef QWORD         ; device pointers are 64-bit

_WIN64 equ 1                      ; build as 64-bit Windows


the example launched on the GPU and all functions worked correctly. There is some dependency between the PTX code and the API version. It is time to start on the neural network example  :eusa_dance:.
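For anyone chasing the same kind of mismatch, here is a minimal C sketch (my own; it assumes cuda.h and cuda.lib are on the include/lib paths) that prints the driver version and the API version a context reports — the numbers that have to agree with the PTX and with the CUDA_FORCE_API_VERSION setting above:

/* Sketch: query the two version numbers involved in a PTX/API mismatch. */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    int drv = 0;
    unsigned int apiVer = 0;
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);
    cuDriverGetVersion(&drv);          /* e.g. 10010 for a CUDA 10.1 driver */
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuCtxGetApiVersion(ctx, &apiVer);  /* e.g. 3010 = the 3.1 API */
    printf("driver %d, context API version %u\n", drv, apiVer);
    cuCtxDestroy(ctx);
    return 0;
}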

Here is another interesting topic related to graphics - Single-threaded CUDA OpenGL Interop https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st

aw27

@LiaoMi
Did you include the source code translated from FASM?

BTW, have you ever tried to use cubin or fatbin instead of PTX? It would save compilation time, right?



LiaoMi

Quote from: AW on May 17, 2019, 05:25:09 AM
@LiaoMi
Did you include the source code translated from FASM?

BTW, have you ever tried to use cubin or fatbin instead of PTX? It would save compilation time, right?

@AW
Please accept my apologies, I did not attach the most important thing  :icon_eek:. The source is now in place, please download again. You can change the paths of the system libraries and compile with UASM. Unfortunately, Cuda.inc will work only with UASM; you can use your batch file, and the inc folder also contains the translate64.inc file.

Quote: have you ever tried to use cubin or fatbin instead of PTX
I thought about it a little but did not study it more deeply, because I don't see any advantage in it: judging by the description in the documentation, working with PTX gives more possibilities and more flexibility.
Other users have written about this: https://devtalk.nvidia.com/default/topic/504259/cubin-vs-ptx/

.fatbin is the same story; I think it will be useful only when strict code optimization is needed for a specific type of video card. Maybe I'm wrong, or maybe not :eusa_boohoo:

Quote: It would save compilation time, right?
In any case, in binary form the loading time should be optimal, since the runtime compilation step is skipped  :icon_exclaim:
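That said, if one stays with PTX, the JIT step can at least be made observable. A hedged C sketch (function and buffer names are mine) using the driver API's cuModuleLoadDataEx with an error-log buffer:

/* Sketch: JIT-compile a PTX string at load time and capture the JIT log.
   ptxSource is assumed to be a NUL-terminated PTX module. */
#include <stdio.h>
#include <cuda.h>

CUresult loadPtx(CUmodule *mod, const char *ptxSource)
{
    char errLog[4096] = {0};
    CUjit_option opt[2] = { CU_JIT_ERROR_LOG_BUFFER,
                            CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES };
    void *val[2]        = { errLog, (void *)(size_t)sizeof(errLog) };

    CUresult err = cuModuleLoadDataEx(mod, ptxSource, 2, opt, val);
    if (err != CUDA_SUCCESS)
        printf("PTX JIT failed (%d): %s\n", err, errLog);
    return err;
}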

Useful link: CUDA Binary Utilities - cuobjdump
cuobjdump extracts information from CUDA binary files (both standalone and those embedded in host binaries) and presents them in human readable format. The output of cuobjdump includes CUDA assembly code for each kernel, CUDA ELF section headers, string tables, relocators and other CUDA specific sections. It also extracts embedded ptx text from host binaries. https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump
$ cuobjdump a.out -ptx -sass
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
identifier = add.cu

        code for sm_20
                Function : _Z3addPiS_S_
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];  /* 0x2800440400005de4 */
        /*0008*/         MOV R6, c[0x0][0x20];   /* 0x2800400080019de4 */
        /*0010*/         MOV R7, c[0x0][0x24];   /* 0x280040009001dde4 */
        /*0018*/         MOV R2, c[0x0][0x28];   /* 0x28004000a0009de4 */
        /*0020*/         MOV R3, c[0x0][0x2c];   /* 0x28004000b000dde4 */
        /*0028*/         LDU.E R0, [R6];         /* 0x8c00000000601c85 */
        /*0030*/         MOV R4, c[0x0][0x30];   /* 0x28004000c0011de4 */
        /*0038*/         LDU.E R2, [R2];         /* 0x8c00000000209c85 */
        /*0040*/         MOV R5, c[0x0][0x34];   /* 0x28004000d0015de4 */
        /*0048*/         IADD R0, R2, R0;        /* 0x4800000000201c03 */
        /*0050*/         ST.E [R4], R0;          /* 0x9400000000401c85 */
        /*0058*/         EXIT;                   /* 0x8000000000001de7 */
                .............................

Fatbin ptx code:
================
arch = sm_20
code version = [4,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
identifier = add.cu

.version 4.0
.target sm_20
.address_size 64

.visible .entry _Z3addPiS_S_(
.param .u64 _Z3addPiS_S__param_0,
.param .u64 _Z3addPiS_S__param_1,
.param .u64 _Z3addPiS_S__param_2
)
{
.reg .s32 %r<4>;
.reg .s64 %rd<7>;

ld.param.u64 %rd1, [_Z3addPiS_S__param_0];
ld.param.u64 %rd2, [_Z3addPiS_S__param_1];
ld.param.u64 %rd3, [_Z3addPiS_S__param_2];
cvta.to.global.u64 %rd4, %rd3;
cvta.to.global.u64 %rd5, %rd2;
cvta.to.global.u64 %rd6, %rd1;
ldu.global.u32 %r1, [%rd6];
ldu.global.u32 %r2, [%rd5];
add.s32 %r3, %r2, %r1;
st.global.u32 [%rd4], %r3;
ret;
}



aw27

I get an error when running your program on an old computer with a GeForce GT 640



I also can't run my queens program, but that may be because I can't compile PTX for versions below sm_30 with my nvcc.exe, and possibly the GeForce GT 640 requires something below that. The error I get with my proggy is CUDA_ERROR_NO_BINARY_FOR_GPU (error 209). What tool do you use to compile to sm_13? That appears way too old.

LiaoMi

sm_13 was already prepared in the example  :biggrin:

CUDA Toolkit v10.1.168 PDF -> CUDA C Programming Guide

3.1.2. Binary Compatibility
Binary code is architecture-specific. A cubin object is generated using the compiler
option -code that specifies the targeted architecture: For example, compiling with
-code=sm_35 produces binary code for devices of compute capability 3.5. Binary
compatibility is guaranteed from one minor revision to the next one, but not from one
minor revision to the previous one or across major revisions. In other words, a cubin
object generated for compute capability X.y will only execute on devices of compute
capability X.z where z≥y.

3.1.3. PTX Compatibility
Some PTX instructions are only supported on devices of higher compute capabilities.
For example, Warp Shuffle Functions are only supported on devices of compute
capability 3.0 and above. The -arch compiler option specifies the compute capability
that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for
example, must be compiled with -arch=compute_30 (or higher).
PTX code produced for some specific compute capability can always be compiled to
binary code of greater or equal compute capability. Note that a binary compiled from an
earlier PTX version may not make use of some hardware features. For example, a binary
targeting devices of compute capability 7.0 (Volta) compiled from PTX generated for
compute capability 6.0 (Pascal) will not make use of Tensor Core instructions, since these
were not available on Pascal. As a result, the final binary may perform worse than would
be possible if the binary were generated using the latest version of PTX.
3.1.4. Application Compatibility
To execute code on devices of specific compute capability, an application must load
binary or PTX code that is compatible with this compute capability as described in
Binary Compatibility and PTX Compatibility. In particular, to be able to execute code
on future architectures with higher compute capability (for which no binary code can be
generated yet), an application must load PTX code that will be just-in-time compiled for
these devices (see Just-in-Time Compilation).
Which PTX and binary code gets embedded in a CUDA C application is controlled by
the -arch and -code compiler options or the -gencode compiler option as detailed in
the nvcc user manual. For example,
nvcc x.cu
        -gencode arch=compute_35,code=sm_35
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=\'compute_60,sm_60\'
embeds binary code compatible with compute capability 3.5 and 5.0 (first and second
-gencode options) and PTX and binary code compatible with compute capability 6.0
(third -gencode option).
Host code is generated to automatically select at runtime the most appropriate code to
load and execute, which, in the above example, will be:
‣ 3.5 binary code for devices with compute capability 3.5 and 3.7,
‣ 5.0 binary code for devices with compute capability 5.0 and 5.2,
‣ 6.0 binary code for devices with compute capability 6.0 and 6.1,
‣ PTX code which is compiled to binary code at runtime for devices with compute
capability 7.0 and higher.
x.cu can have an optimized code path that uses warp shuffle operations, for example,
which are only supported in devices of compute capability 3.0 and higher. The
__CUDA_ARCH__ macro can be used to differentiate various code paths based on
compute capability. It is only defined for device code. When compiling with -arch=compute_35 for example,
__CUDA_ARCH__ is equal to 350.
Applications using the driver API must compile code to separate files and explicitly load
and execute the most appropriate file at runtime.
The Volta architecture introduces Independent Thread Scheduling which changes the
way threads are scheduled on the GPU. For code relying on specific behavior of SIMT
scheduling in previous architectures, Independent Thread Scheduling may alter the set of
participating threads, leading to incorrect results. To aid migration while implementing
the corrective actions detailed in Independent Thread Scheduling, Volta developers
can opt in to Pascal's thread scheduling with the compiler option combination -arch=compute_60
-code=sm_70.
The nvcc user manual lists various shorthands for the -arch, -code, and -gencode
compiler options. For example, -arch=sm_35 is a shorthand for
-arch=compute_35 -code=compute_35,sm_35 (which is the same as
-gencode arch=compute_35,code=\'compute_35,sm_35\').
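Since driver API applications, like the ones in this thread, must do that runtime selection by hand, here is a rough C sketch of the idea (my own; the module file names are hypothetical):

/* Sketch: pick the module file that matches the device's compute capability.
   kernel_sm35.cubin and kernel.ptx are made-up file names. */
#include <cuda.h>

CUresult loadBestModule(CUmodule *mod, CUdevice dev)
{
    int major = 0, minor = 0;
    cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);

    /* a cubin runs only on the same major version, with minor >= its target */
    if (major == 3 && minor >= 5)
        return cuModuleLoad(mod, "kernel_sm35.cubin");

    /* otherwise fall back to PTX, which the driver JIT-compiles for this GPU */
    return cuModuleLoad(mod, "kernel.ptx");
}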

LiaoMi

Quote from: AW on May 17, 2019, 07:07:07 AM
I get an error when running your program on an old computer with a GeForce GT 640



I also can't run my queens program, but that may be because I can't compile PTX for versions below sm_30 with my nvcc.exe, and possibly the GeForce GT 640 requires something below that. The error I get with my proggy is CUDA_ERROR_NO_BINARY_FOR_GPU (error 209). What tool do you use to compile to sm_13? That appears way too old.

There are differences in how memory is handled, so this is a normal phenomenon. I can't answer the question at the moment, I can only guess :P

Matching SM architectures (CUDA arch and CUDA gencode) for various NVIDIA cards https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
I've seen some confusion regarding NVIDIA's nvcc sm flags and what they're used for:
When compiling with NVCC, the arch flag ('-arch') specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
Gencodes ('-gencode') allows for more PTX generations, and can be repeated many times for different architectures.


Here is an example for Cubin  :t ...

    #include <stdio.h>
    #include <cuda.h>

    /* minimal SAFE_CALL definition so the snippet is self-contained */
    #define SAFE_CALL(x) { CUresult e_ = (x); if (e_) printf("CUDA error %d\n", e_); }

    CUdevice dev;
    CUcontext ctx;
    CUmodule m;

    /// Init cuda
    SAFE_CALL(cuInit(0));
    /// Get device
    SAFE_CALL(cuDeviceGet(&dev, 0));
    /// Create context
    SAFE_CALL(cuCtxCreate(&ctx, 0, dev));

    SAFE_CALL(cuModuleLoad(&m, "test.cubin"));

    CUfunction hfunc;
    SAFE_CALL(cuModuleGetFunction(&hfunc, m, "my_kernel"));

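A possible next step, not in the snippet above: launching the kernel handle with cuLaunchKernel. The grid/block sizes and the single pointer parameter are purely illustrative, since my_kernel's real signature isn't shown:

    /* Sketch: launch the kernel obtained above; assumes my_kernel takes one
       device pointer. Sizes are illustrative. */
    CUdeviceptr dptr;
    SAFE_CALL(cuMemAlloc(&dptr, 1000 * sizeof(int)));
    void *params[] = { &dptr };
    SAFE_CALL(cuLaunchKernel(hfunc,
                             4, 1, 1,      /* grid dimensions    */
                             256, 1, 1,    /* block dimensions   */
                             0, NULL,      /* shared mem, stream */
                             params, NULL));
    SAFE_CALL(cuCtxSynchronize());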

P.S. SM stands for Streaming Multiprocessor, and the number indicates the features supported by the architecture.

aw27

#36
The real problem is the paranoia of forced obsolescence we are witnessing these days. From VS I can't even build for sm_30; the minimum is sm_35.
If I can't solve the issues, I will have to install an old NVIDIA SDK version.

Later:
I could run my little programs on the computer with the NVIDIA GeForce GT 640 after installing the NVIDIA SDK 10.1 on that computer.
So it was a problem of outdated drivers.

I still can't run your mywindows1.exe program. I believe the problem might be related to your usage of the /LARGEADDRESSAWARE:NO switch, but I can't build your sample straight away to confirm.


LiaoMi

Quote from: AW on May 17, 2019, 03:35:35 PM
The real problem is the paranoia of forced obsolescence we are witnessing these days. From VS I can't even build for sm_30; the minimum is sm_35.
If I can't solve the issues, I will have to install an old NVIDIA SDK version.

Later:
I could run my little programs on the computer with the NVIDIA GeForce GT 640 after installing the NVIDIA SDK 10.1 on that computer.
So it was a problem of outdated drivers.

I still can't run your mywindows1.exe program. I believe the problem might be related to your usage of the /LARGEADDRESSAWARE:NO switch, but I can't build your sample straight away to confirm.

I have problems installing CUDA 10.1 in Visual Studio 2019:
https://devtalk.nvidia.com/default/topic/1049613/nsight-visual-studio-edition/visual-studio-2019-and-cuda-10-1/1
I won't even talk about an old SDK in my case ...

My proposal is to update the CudaPAD project (https://github.com/SunsetQuest/CudaPAD) for our needs: add more options, add portable versions of the CUDA SDK  :idea: The project already has SM version selection.


Quote from: AW on May 17, 2019, 03:35:35 PM
I still can't run your mywindows1.exe program. I believe the problem might be related to your usage of the /LARGEADDRESSAWARE:NO switch, but I can't build your sample straight away to confirm.

Starting pass 2
     mywindow1.obj
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'NumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'LogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'NumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'CudaLogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'LogNumberArray' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err000' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err100' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err200' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err300' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err400' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err500' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err600' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err700' invalid without /LARGEADDRESSAWARE:NO
mywindow1.obj : error LNK2017: 'ADDR32' relocation to 'err999' invalid without /LARGEADDRESSAWARE:NO
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     cuda.lib(nvcuda.dll)
     msvcrt.lib(msvcrt.dll)
     msvcrt.lib(msvcrt.dll)
     msvcrt.lib(msvcrt.dll)
     Kernel32.lib(KERNEL32.dll)
     Kernel32.lib(KERNEL32.dll)
     Kernel32.lib(KERNEL32.dll)
     User32.lib(USER32.dll)
     User32.lib(USER32.dll)
     User32.lib(USER32.dll)
LINK : fatal error LNK1165: link failed because of fixup errors


The wording of the linker error message is a bit misleading. In fact you are using an addressing mode that uses a 32-bit offset, which results in a 32-bit address in the symbol table.

Here is a description of my problem:

MODULE1.ASM
...
VAR1    DB 256 dup (0)
...
MODULE2.ASM
...
extrn    VAR1:BYTE
...
MOV   al, byte ptr[VAR1 + rbx]

In 64-bit mode only very few addressing modes can handle absolute 64-bit offsets.
Most complex addressing modes are restricted to 32-bit offsets (like the [mem + reg]
used above).

So the assembler generates an entry in the symbol table that states VAR1 as a
32-bit offset for MODULE2.

The symbol table of MODULE1 only has an entry that states VAR1 as a 64-bit offset.
In this case the linker complains that symbol VAR1 has different address sizes.

To avoid this a different addressing mode must be used - something like this:
LEA   rsi, VAR1
MOV   al, byte ptr[rsi + rbx]

aw27

I have problems with VS 2019 as well in CUDA, so I am using VS 2017.

CudaPad is fine, but it works with nvcc.exe, so it can't produce miracles. If I select sm_23 it reports: nvcc fatal   : Value 'sm_23' is not defined for option 'gpu-architecture'  :(

I never used /LARGEADDRESSAWARE:NO because it is a kludge for transitioning from Win32 to Win64.
If you can't fix it I will have a look at your program during the weekend, and with some luck I will try to build it with /LARGEADDRESSAWARE enabled.


TimoVJL

nvopencc.exe works in Windows 7 if downloaded from CUDA Toolkit 3.2, but the one from CUDA Toolkit 4.2.9 just crashes.

That toolkit is a .msi file, so TLMsiListEx can be useful.

nvcc error   : 'cudafe++' died with status 0xC0000374  :( RIP  :P
May the source be with you

LiaoMi

Quote from: AW on May 17, 2019, 06:54:44 PM
I have problems with VS 2019 as well in CUDA, so I am using VS 2017.

CudaPad is fine, but it works with nvcc.exe, so it can't produce miracles. If I select sm_23 it reports: nvcc fatal   : Value 'sm_23' is not defined for option 'gpu-architecture'  :(

I never used /LARGEADDRESSAWARE:NO because it is a kludge for transitioning from Win32 to Win64.
If you can't fix it I will have a look at your program during the weekend, and with some luck I will try to build it with /LARGEADDRESSAWARE enabled.

Without using /LARGEADDRESSAWARE:NO  :icon14:

aw27

That was the problem, it is fixed.  :t


mikeburr

@LiaoMi ... have you found any of the Linux flavours to be superior to the others for this kind of thing? I was thinking of trying CentOS 7.
regards, mikeb
PS: I worked on a stock control system many years ago where I used distributions to assess the new weighting, rather than the linear moving-weight scheme you have in your neural system. If you do move to a similar scheme then I advise you to use a skewed distribution such as Student's t, as very rarely in nature is any distribution symmetric, a failing which is sadly almost always overlooked. I include in this the FFT methods now prevalent for ascertaining independence of variables, as these are highly symmetricised methods [see the many excellent examples and links provided by Siekmanski].

TimoVJL

#43
A small example in C for testing an old card like the nVidia G210:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>   /* needed for printf below */
//#include "cuda.h"
#pragma comment(lib, "cuda.lib")
#pragma comment(lib, "msvcrt.lib")

#define CUDAAPI __stdcall
typedef int CUdevice;
typedef struct CUctx_st *CUcontext;
typedef struct CUmod_st *CUmodule;
typedef enum cudaError_enum {
    CUDA_SUCCESS                              = 0,
    CUDA_ERROR_INVALID_VALUE                  = 1,
    CUDA_ERROR_OUT_OF_MEMORY                  = 2,
    CUDA_ERROR_NOT_INITIALIZED                = 3,
    CUDA_ERROR_DEINITIALIZED                  = 4,
    CUDA_ERROR_NO_DEVICE                      = 100,
    CUDA_ERROR_INVALID_DEVICE                 = 101,
    CUDA_ERROR_INVALID_IMAGE                  = 200,
    CUDA_ERROR_INVALID_CONTEXT                = 201,
    CUDA_ERROR_CONTEXT_ALREADY_CURRENT        = 202,
    CUDA_ERROR_MAP_FAILED                     = 205,
    CUDA_ERROR_UNMAP_FAILED                   = 206,
    CUDA_ERROR_ARRAY_IS_MAPPED                = 207,
    CUDA_ERROR_ALREADY_MAPPED                 = 208,
    CUDA_ERROR_NO_BINARY_FOR_GPU              = 209,
    CUDA_ERROR_ALREADY_ACQUIRED               = 210,
    CUDA_ERROR_NOT_MAPPED                     = 211,
    CUDA_ERROR_NOT_MAPPED_AS_ARRAY            = 212,
    CUDA_ERROR_NOT_MAPPED_AS_POINTER          = 213,
    CUDA_ERROR_ECC_UNCORRECTABLE              = 214,
    CUDA_ERROR_UNSUPPORTED_LIMIT              = 215,
    CUDA_ERROR_INVALID_SOURCE                 = 300,
    CUDA_ERROR_FILE_NOT_FOUND                 = 301,
    CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND = 302,
    CUDA_ERROR_SHARED_OBJECT_INIT_FAILED      = 303,
    CUDA_ERROR_OPERATING_SYSTEM               = 304,
    CUDA_ERROR_INVALID_HANDLE                 = 400,
    CUDA_ERROR_NOT_FOUND                      = 500,
    CUDA_ERROR_NOT_READY                      = 600,
    CUDA_ERROR_LAUNCH_FAILED                  = 700,
    CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES        = 701,
    CUDA_ERROR_LAUNCH_TIMEOUT                 = 702,
    CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING  = 703,
    CUDA_ERROR_UNKNOWN                        = 999
} CUresult;
CUresult CUDAAPI cuDriverGetVersion(int *driverVersion);
CUresult CUDAAPI cuInit(unsigned int Flags);
CUresult CUDAAPI cuDeviceGetCount(int *count);
CUresult CUDAAPI cuDeviceGet(CUdevice *device, int ordinal);
CUresult CUDAAPI cuCtxCreate(CUcontext *pctx, unsigned int flags, CUdevice dev);
CUresult CUDAAPI cuCtxGetApiVersion(CUcontext ctx, unsigned int *version);
CUresult CUDAAPI cuDeviceComputeCapability(int *major, int *minor, CUdevice dev);
CUresult CUDAAPI cuModuleLoad(CUmodule *module, const char *fname);
CUresult CUDAAPI cuCtxDetach(CUcontext ctx);
CUresult CUDAAPI cuGetErrorString(CUresult error, const char **pStr);

void __cdecl mainCRTStartup(void)
{
    int deviceCount = 0;
    int iDrvVer, iApiVer;
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUresult err;
    int iMajor, iMinor;

    cuDriverGetVersion(&iDrvVer);
    /* version is encoded as major*1000 + minor*10, e.g. 10010 = 10.1 */
    printf("Driver Version: %d.%d\n", iDrvVer / 1000, (iDrvVer % 1000) / 10);
    if (!(err = cuInit(0))) {
        err = cuDeviceGetCount(&deviceCount);
        err = cuDeviceGet(&dev, 0);
        err = cuCtxCreate(&ctx, 0, dev);
        err = cuCtxGetApiVersion(ctx, &iApiVer);
        printf("API Version: %d.%d\n", iApiVer / 1000, (iApiVer % 1000) / 10);
        err = cuDeviceComputeCapability(&iMajor, &iMinor, dev);
        printf("DeviceComputeCapability: %d.%d\n", iMajor, iMinor);
        //err = cuModuleLoadData(&mod, PTXSourceData);
        err = cuModuleLoad(&mod, "Test.ptx");
        if (!err) {
            printf("Test.ptx loaded\n");
        } else {
            const char *perr;
            cuGetErrorString(err, &perr);
            printf("error loading ptx: %u %s\n", err, perr);
        }
        err = cuCtxDetach(ctx);
    }
    ExitProcess(0);
}
Test.ptx:

.version 1.4
.target sm_10, map_f64_to_f32
// compiled with open64/lib//be.exe
// nvopencc 3.2 built on 2010-11-04
.entry _Z3addPiS_ (
.param .u64 __cudaparm__Z3addPiS__a,
.param .u64 __cudaparm__Z3addPiS__b)
{
.reg .u32 %r<6>;
.reg .u64 %rd<8>;
.reg .pred %p<3>;
.loc 15 4 0
$LDWbegin__Z3addPiS_:
cvt.s32.u16 %r1, %tid.x;
mov.u32 %r2, 999;
setp.gt.s32 %p1, %r1, %r2;
@%p1 bra $Lt_0_1026;
.loc 15 7 0
cvt.s64.s32 %rd1, %r1;
mul.wide.s32 %rd2, %r1, 4;
ld.param.u64 %rd3, [__cudaparm__Z3addPiS__a];
add.u64 %rd4, %rd3, %rd2;
ld.global.s32 %r3, [%rd4+0];
mul.lo.s32 %r4, %r3, 2;
ld.param.u64 %rd5, [__cudaparm__Z3addPiS__b];
add.u64 %rd6, %rd5, %rd2;
st.global.s32 [%rd6+0], %r4;
$Lt_0_1026:
.loc 15 9 0
exit;
$LDWend__Z3addPiS_:
} // _Z3addPiS_

It just tells you that it loaded that thing ;)
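Going one step beyond loading: a hedged sketch of actually running _Z3addPiS_ through cuLaunchKernel (requires a CUDA 4.0+ driver; the buffers and sizes are mine, and a pre-Fermi card caps a block at 512 threads, so only the first 512 of the 1000 elements get doubled here):

/* Sketch only: continue after cuModuleLoad succeeds; mod and err come
   from the program above, everything else is illustrative. */
CUfunction fn;
CUdeviceptr da, db;
static int host[1000];
for (int i = 0; i < 1000; i++) host[i] = i;
err = cuModuleGetFunction(&fn, mod, "_Z3addPiS_");
err = cuMemAlloc(&da, sizeof(host));
err = cuMemAlloc(&db, sizeof(host));
err = cuMemcpyHtoD(da, host, sizeof(host));
void *args[] = { &da, &db };                  /* matches (int *a, int *b) */
err = cuLaunchKernel(fn, 1, 1, 1,             /* one block                */
                     512, 1, 1,               /* 512 threads (sm_1x max)  */
                     0, NULL, args, NULL);
err = cuCtxSynchronize();
err = cuMemcpyDtoH(host, db, sizeof(host));   /* host[i] == 2*i for i < 512 */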

PS: it took a while to get the G210 working with Windows 8.1, just a nightmare, as an old nVidia card and Windows 8.1 have some dating problems.
(MS: women regardless of marital status)

EDIT: how to create that kernel code: Add.i

typedef struct uint3
{
    unsigned int x, y, z;
}uint3;

uint3  extern const threadIdx;
//__global__
__attribute__((global)) __attribute__((__used__))
void add(int* a, int* b) {
int i = threadIdx.x;
if (i < 1000) {
b[i] = 2 * a[i];
}
}
open64\bin\nvopencc -TARG:compute_10 -m64 -OPT:ftz=1 -CG:ftz=1 -CG:prec_div=0 -CG:prec_sqrt=0 "Add" "Add.i" -o "Add.ptx"

For CUDA Toolkit 7:
compiler\nvvm\bin\cicc -arch compute_30 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 -nvvmir-library "compiler\bin/../nvvm/libdevice/libdevice.compute_30.10.bc" --orig_src_file_name "Add.cu" "Add.i" -o "Add.ptx"

EDIT: copy the driver/global code part from the .cu file to a .ci file and add a header file for it.
Minimal header for .ci, ci_include.h:

typedef struct _uint3
{
    unsigned int x, y, z;
}uint3;

typedef struct _dim3
{
    unsigned int x, y, z;
}dim3;

extern const uint3 threadIdx;
extern const uint3 blockIdx;
extern const dim3 blockDim;
extern const dim3 gridDim;
extern const int warpSize;

#define __device__ __attribute__((device))
#define __global__ __attribute__((global))
#define __shared__ __attribute__((shared))

#define bool _Bool

The clang backend supports CUDA.

EDIT: AMD clang:

clang.exe -c Add.cu -nocudainc -nocudalib
May the source be with you

LiaoMi

Quote from: mikeburr on May 17, 2019, 10:18:48 PM
@LiaoMi ... have you found any of the Linux flavours to be superior to the others for this kind of thing? I was thinking of trying CentOS 7.
regards, mikeb
PS: I worked on a stock control system many years ago where I used distributions to assess the new weighting, rather than the linear moving-weight scheme you have in your neural system. If you do move to a similar scheme then I advise you to use a skewed distribution such as Student's t, as very rarely in nature is any distribution symmetric, a failing which is sadly almost always overlooked. I include in this the FFT methods now prevalent for ascertaining independence of variables, as these are highly symmetricised methods [see the many excellent examples and links provided by Siekmanski].

Hi mikeburr,

I can't say exactly ... Was the question about GPUs or neural networks? The assembly language example is not mine; the author is no longer active and there are flaws in the example. I hope we'll be able to add improvements, so tips on the algorithm are welcome.
From my point of view, only two factors matter when choosing the system environment: having the right tools and the usability of the data. So for neural networks I would choose Java or .NET; both languages have rough spots when working with the GPU, but they can be solved. The more standard the OS, the better: Java + Linux = .NET + Windows, or alternatively .NET + Linux = Java + Windows. Why not assembler? Because of the amount of data, the object-oriented processing methods, and the ability to reuse ready-made components. Still, examples written in assembly language are attractive for stable processes, for stable models that don't require further intervention.

Perhaps it would be more logical to find a suitable GitHub project where an interesting neural network model for the GPU has already been chosen. After all, the main discussion is about the computational power of the GPU and algorithms for assembler. It makes no sense to copy all the complexities of an object-oriented language  :P

I don't have Linux at the moment, but the previous example needs to be adapted for Linux ...

How to install NVIDIA CUDA Toolkit on CentOS 7 Linux - https://linuxconfig.org/how-to-install-nvidia-cuda-toolkit-on-centos-7-linux