On the hardware side, there is the hierarchy (fine to coarse):
- thread
- warp
- thread block
- streaming multiprocessor
All OpenMP and OpenACC levels are used, i.e.
- OpenMP's simd and OpenACC's vector map to threads
- OpenMP's threads ("parallel") and OpenACC's workers map to warps
- OpenMP's teams and OpenACC's gangs map to thread blocks
The used sizes are:
- warp_size is always 32
- CUDA kernel launch dimensions: dim={#teams,1,1}, blocks={#threads,warp_size,1}.
Additional information can be obtained by setting the environment variable GOMP_DEBUG=1 (very verbose; grep for kernel.*launch to see the launch parameters).
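
For illustration, here is a minimal sketch (the file name and compile command are assumptions; any offloading-enabled GCC with an attached NVIDIA GPU will do) that can be run under GOMP_DEBUG=1 to observe the launch parameters described above:

  /* debug-launch.c - minimal OpenMP offload test.  Assumed compile command
     (offloading-enabled GCC): gcc -O2 -fopenmp -foffload=nvptx-none debug-launch.c
     Run with GOMP_DEBUG=1 and grep the output for "launch" to see the
     grid/block dimensions chosen for the kernel.  */
  #include <stdio.h>

  int main (void)
  {
    int sum = 0;

    /* Offloaded loop: teams become thread blocks, OpenMP threads become warps.  */
    #pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom: sum)
    for (int i = 0; i < 1000000; i++)
      sum += 1;

    printf ("sum = %d\n", sum);
    return 0;
  }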
GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA; the JIT-compiled code is cached in the user's directory (see the CUDA documentation; this can be tuned via the environment variables CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).
Note: While PTX ISA is generic, the -mptx= and -march= command-line options still affect the generated PTX ISA code and, thus, the requirements on the CUDA version and hardware.
Implementation remarks:

I/O within OpenMP target regions and OpenACC compute regions is supported using the C library printf functions. Additionally, the Fortran print/write statements are supported within OpenMP target regions, but not yet within OpenACC compute regions.
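
As a small sketch of the device-side I/O support, the following OpenMP target region calls the C library printf from the device:

  /* Device-side printf from within an OpenMP target region.  */
  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
    #pragma omp target
    {
      /* Executes on the GPU; the output appears on the host's standard output.  */
      printf ("hello from device %d\n", omp_get_device_num ());
    }
    return 0;
  }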
Compiling OpenMP code that contains requires reverse_offload requires at least -march=sm_35; compiling for -march=sm_30 is not supported.
When reverse offload is used (i.e. target regions with device(ancestor:1)), there is a slight performance penalty for all target regions, consisting mostly of shutdown delay. Per device, reverse offload regions are processed serially such that the next reverse offload region is only executed after the previous one returned.
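
A hedged sketch of reverse offload: the requires reverse_offload directive plus a nested target with device(ancestor:1) runs the inner block back on the host; as noted above, such regions are serialized per device.

  /* Reverse offload: the inner target region runs on the host.
     On nvptx this needs -march=sm_35 or newer (see remark above).  */
  #include <stdio.h>

  #pragma omp requires reverse_offload

  int main (void)
  {
    #pragma omp target
    {
      int device_value = 42;  /* computed on the device */

      /* Executed back on the host; 'device_value' is copied to the host.  */
      #pragma omp target device(ancestor: 1) map(to: device_value)
      printf ("host received %d\n", device_value);
    }
    return 0;
  }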
OpenMP code that has a requires directive with unified_shared_memory runs on nvptx devices if and only if all of those support the pageableMemoryAccess property; otherwise, all nvptx devices are removed from the list of available devices (“host fallback”).
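
A sketch of the unified_shared_memory requirement: with the directive below, host memory obtained from malloc is accessed directly inside the target region; on nvptx devices without pageableMemoryAccess the region instead runs on the host.

  /* Host memory used directly in a target region under unified shared memory.  */
  #include <stdio.h>
  #include <stdlib.h>

  #pragma omp requires unified_shared_memory

  int main (void)
  {
    int n = 1000;
    int *data = malloc (n * sizeof *data);

    /* No map clause for 'data': the host allocation is accessed directly.  */
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
      data[i] = 2 * i;

    printf ("data[%d] = %d\n", n - 1, data[n - 1]);
    free (data);
    return 0;
  }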
For details on the stack handling on the device, see -msoft-stack in the GCC manual.
The OpenMP routines omp_target_memcpy_rect and omp_target_memcpy_rect_async and the target update directive for non-contiguous list items will use the 2D and 3D memory-copy functions of the CUDA library. Higher dimensions will call those functions in a loop and are therefore supported.
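
A sketch of a 2D rectangular copy (device selection kept minimal, error checking omitted): the call below copies a 4x4 sub-block of an 8x8 host matrix to a device buffer using omp_target_memcpy_rect.

  /* 2D strided host-to-device copy via omp_target_memcpy_rect.  */
  #include <stdio.h>
  #include <omp.h>

  #define N 8

  int main (void)
  {
    int host[N][N];
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        host[i][j] = i * N + j;

    int host_dev = omp_get_initial_device ();
    int gpu_dev  = omp_get_default_device ();
    int *dev_buf = omp_target_alloc (N * N * sizeof (int), gpu_dev);

    size_t volume[2]  = { 4, 4 };   /* rows, columns to copy */
    size_t offsets[2] = { 2, 2 };   /* start at element (2,2) */
    size_t dims[2]    = { N, N };   /* full dimensions of both arrays */

    int err = omp_target_memcpy_rect (dev_buf, host, sizeof (int), 2,
                                      volume, offsets, offsets, dims, dims,
                                      gpu_dev, host_dev);
    printf ("omp_target_memcpy_rect returned %d\n", err);

    omp_target_free (dev_buf, gpu_dev);
    return 0;
  }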
Low-latency memory (omp_low_lat_mem_space) is supported when the access trait is set to cgroup and libgomp has been built for PTX ISA version 4.1 or higher (such as in GCC's default configuration). The default pool size is 8 kiB per team, but may be adjusted at runtime by setting the environment variable GOMP_NVPTX_LOWLAT_POOL=bytes. The maximum value is limited by the available hardware, and care should be taken that the selected pool size does not unduly limit the number of teams that can run simultaneously.
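
A hedged sketch of requesting low-latency memory explicitly: an allocator created from omp_low_lat_mem_space with the access trait set to cgroup, used from within a target region (this assumes a GCC build where the device-side allocator routines are available).

  /* Team-local low-latency allocation via a cgroup-access allocator.  */
  #include <omp.h>

  int main (void)
  {
    #pragma omp target
    #pragma omp teams num_teams(1)
    {
      omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } };
      omp_allocator_handle_t lowlat
        = omp_init_allocator (omp_low_lat_mem_space, 1, traits);

      int *buf = (int *) omp_alloc (64 * sizeof (int), lowlat);

      #pragma omp parallel num_threads(8)
      buf[omp_get_thread_num ()] = omp_get_thread_num ();

      omp_free (buf, lowlat);
      omp_destroy_allocator (lowlat);
    }
    return 0;
  }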
omp_low_lat_mem_alloc cannot be used with true low-latency memory because the definition implies the omp_atv_all trait; main graphics memory is used instead.
omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and omp_thread_mem_alloc all use low-latency memory as first preference and fall back to main graphics memory when the low-latency pool is exhausted.
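
For the predefined allocators, a minimal sketch: scratch space allocated with omp_pteam_mem_alloc inside a target region is placed in the low-latency pool when it fits, and in main graphics memory otherwise.

  /* Scratch buffer from the predefined omp_pteam_mem_alloc allocator.  */
  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
    int last = -1;
    #pragma omp target map(tofrom: last)
    {
      int *scratch = (int *) omp_alloc (32 * sizeof (int), omp_pteam_mem_alloc);

      #pragma omp parallel for
      for (int i = 0; i < 32; i++)
        scratch[i] = i * i;

      last = scratch[31];
      omp_free (scratch, omp_pteam_mem_alloc);
    }
    printf ("last = %d\n", last);
    return 0;
  }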
The unique identifier (UID) reported for a device, e.g. GPU-a8081c9e-f03e-18eb-1827-bf5ba95afa5d, consists of the GPU- prefix followed by the device's UUID. The output matches the format used by nvidia-smi.
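
Assuming the OpenMP device-UID API routines are available (omp_get_uid_from_device and omp_get_device_from_uid from OpenMP 6.0; availability depends on the GCC version), the UIDs can be listed as in this sketch:

  /* List the UID of every available device (OpenMP 6.0 UID routines assumed).  */
  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
    for (int d = 0; d < omp_get_num_devices (); d++)
      {
        const char *uid = omp_get_uid_from_device (d);
        printf ("device %d: %s\n", d, uid ? uid : "(no UID)");
      }
    return 0;
  }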