sm_13 was already ready in the example

CUDA Toolkit v10.1.168_pdf -> CUDA_C_Programming_Guide
3.1.2. Binary CompatibilityBinary code is architecture-specific. A cubin object is generated using the compiler
option -code that specifies the targeted architecture: For example, compiling with
-code=sm_35 produces binary code for devices of compute capability 3.5. Binary
compatibility is guaranteed from one minor revision to the next one, but not from one
minor revision to the previous one or across major revisions. In other words, a cubin
object generated for compute capability X.y will only execute on devices of compute
capability X.z where z≥y.
3.1.3. PTX CompatibilitySome PTX instructions are only supported on devices of higher compute capabilities.
For example, Warp Shuffle Functions are only supported on devices of compute
capability 3.0 and above. The -arch compiler option specifies the compute capability
that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for
example, must be compiled with -arch=compute_30 (or higher).
PTX code produced for some specific compute capability can always be compiled to
binary code of greater or equal compute capability. Note that a binary compiled from an
earlier PTX version may not make use of some hardware features. For example, a binary
targeting devices of compute capability 7.0 (Volta) compiled from PTX generated for
compute capability 6.0 (Pascal) will not make use of Tensor Core instructions, since these
were not available on Pascal. As a result, the final binary may perform worse than would
be possible if the binary were generated using the latest version of PTX.
3.1.4. Application CompatibilityTo execute code on devices of specific compute capability, an application must load
binary or PTX code that is compatible with this compute capability as described in
Binary Compatibility and PTX Compatibility. In particular, to be able to execute code
on future architectures with higher compute capability (for which no binary code can be
generated yet), an application must load PTX code that will be just-in-time compiled for
these devices (see Just-in-Time Compilation).
Which PTX and binary code gets embedded in a CUDA C application is controlled by
the -arch and -code compiler options or the -gencode compiler option as detailed in
the nvcc user manual. For example,
nvcc x.cu
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_60,code=\'compute_60,sm_60\'
embeds binary code compatible with compute capability 3.5 and 5.0 (first and second
-gencode options) and PTX and binary code compatible with compute capability 6.0
(third -gencode option).
Host code is generated to automatically select at runtime the most appropriate code to
load and execute, which, in the above example, will be:
‣ 3.5 binary code for devices with compute capability 3.5 and 3.7,
‣ 5.0 binary code for devices with compute capability 5.0 and 5.2,
‣ 6.0 binary code for devices with compute capability 6.0 and 6.1,
‣ PTX code which is compiled to binary code at runtime for devices with compute
capability 7.0 and higher.
x.cu can have an optimized code path that uses warp shuffle operations, for example,
which are only supported in devices of compute capability 3.0 and higher. The
__CUDA_ARCH__ macro can be used to differentiate various code paths based on
compute capability. It is only defined for device code. When compiling with arch=compute_35 for example,
__CUDA_ARCH__ is equal to 350.
Applications using the driver API must compile code to separate files and explicitly load
and execute the most appropriate file at runtime.
The Volta architecture introduces Independent Thread Scheduling which changes the
way threads are scheduled on the GPU. For code relying on specific behavior of SIMT
scheduling in previous architecures, Independent Thread Scheduling may alter the set of
participating threads, leading to incorrect results. To aid migration while implementing
the corrective actions detailed in Independent Thread Scheduling, Volta developers
can opt-in to Pascal's thread scheduling with the compiler option combination arch=compute_60
-code=sm_70.
The nvcc user manual lists various shorthand for the -arch, -code, and -gencode
compiler options. For example, -arch=sm_35 is a shorthand for -arch=compute_35 code=compute_35,sm_35 (which is the same as
-gencode
arch=compute_35,code=\'compute_35,sm_35\').