On the hardware side, there is the hierarchy (fine to coarse): thread, warp, thread block, and streaming multiprocessor. All OpenMP and OpenACC parallelism levels are used, i.e. they are mapped onto this hierarchy. The used sizes are: warp_size is always 32, and the CUDA kernel is launched with dim={#teams,1,1}, blocks={#threads,warp_size,1}.
Additional information can be obtained by setting the environment variable GOMP_DEBUG=1 (very verbose; grep for kernel.*launch for the launch parameters).
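For example (./a.out stands in for any program containing offloaded regions; this is a usage fragment, not something runnable on its own):

```shell
# Very verbose libgomp debug output; keep only the kernel launch parameters.
GOMP_DEBUG=1 ./a.out 2>&1 | grep 'kernel.*launch'
```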
GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA; CUDA caches the JIT result in the user's directory (see the CUDA documentation; this can be tuned via the environment variables CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).

Note: While PTX ISA is generic, the -mptx= and -march= command-line options still affect the generated PTX ISA code and, thus, the requirements on the CUDA version and hardware.
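A sketch of how those cache variables might be set (the values shown are assumptions for illustration, not defaults):

```shell
# Tune CUDA's JIT cache for the PTX code that libgomp loads.
export CUDA_CACHE_MAXSIZE=268435456               # cache size limit in bytes
export CUDA_CACHE_PATH="$HOME/.nv/ComputeCache"   # cache directory
# export CUDA_CACHE_DISABLE=1                     # or disable JIT caching
```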
Implementation remarks:

Output within OpenMP target regions and OpenACC compute regions is supported via the C library printf functions. Note that the Fortran print/write statements are not supported, yet.
Compiling code that contains requires reverse_offload requires at least -march=sm_35; compiling for -march=sm_30 is not supported.
If the code uses reverse offload (i.e. target regions with device(ancestor:1)), there is a slight performance penalty for all target regions, consisting mostly of shutdown delay.
Per device, reverse offload regions are processed serially: the next reverse offload region is only executed after the previous one has returned.
A requires directive with unified_shared_memory will remove any nvptx device from the list of available devices (“host fallback”).
See also the -msoft-stack option in the GCC manual.
The omp_target_memcpy_rect and omp_target_memcpy_rect_async routines and the target update directive for non-contiguous list items use the 2D and 3D memory-copy functions of the CUDA library. Higher dimensions call those functions in a loop and are therefore supported.
Low-latency memory (omp_low_lat_mem_space) is supported when the access trait is set to cgroup, the ISA is at least sm_53, and the PTX version is at least 4.1. The default pool size is 8 kiB per team, but it may be adjusted at run time by setting the environment variable GOMP_NVPTX_LOWLAT_POOL=bytes. The maximum value is limited by the available hardware, and care should be taken that the selected pool size does not unduly limit the number of teams that can run simultaneously.
omp_low_lat_mem_alloc cannot be used with true low-latency memory because its definition implies the omp_atv_all access trait; main graphics memory is used instead.
The omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and omp_thread_mem_alloc allocators all use low-latency memory as first preference and fall back to main graphics memory when the low-latency pool is exhausted.