On the hardware side, there is the hierarchy (fine to coarse): thread, warp, thread block, and streaming multiprocessor. All OpenMP and OpenACC parallelism levels are used, i.e. they are mapped onto this hierarchy. The used sizes are: warp_size is always 32, and the CUDA kernel is launched with dim={#teams,1,1}, blocks={#threads,warp_size,1}.
Additional information can be obtained by setting the environment variable GOMP_DEBUG=1 (very verbose; grep for kernel.*launch for the launch parameters).
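For example (./a.out stands in for any program containing offloaded regions; this is a usage fragment, not something runnable on its own):

```shell
# Very verbose libgomp debug output; keep only the kernel launch parameters.
GOMP_DEBUG=1 ./a.out 2>&1 | grep 'kernel.*launch'
```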
GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA; CUDA caches the JIT result in the user's directory (see the CUDA documentation; this can be tuned via the environment variables CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).

Note: While PTX ISA is generic, the -mptx= and -march= command-line options still affect the generated PTX ISA code and, thus, the requirements on the CUDA version and hardware.
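A sketch of how those cache variables might be set (the values shown are assumptions for illustration, not defaults):

```shell
# Tune CUDA's JIT cache for the PTX code that libgomp loads.
export CUDA_CACHE_MAXSIZE=268435456               # cache size limit in bytes
export CUDA_CACHE_PATH="$HOME/.nv/ComputeCache"   # cache directory
# export CUDA_CACHE_DISABLE=1                     # or disable JIT caching
```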
Implementation remarks:

Output within OpenMP target regions and OpenACC compute regions is supported via the C library printf functions. Note that the Fortran print/write statements are not supported, yet.
Compiling code that contains requires reverse_offload requires at least -march=sm_35; compiling for -march=sm_30 is not supported.
If the code uses reverse offload (i.e. target regions with device(ancestor:1)), there is a slight performance penalty for all target regions, consisting mostly of shutdown delay.
Per device, reverse offload regions are processed serially: the next reverse offload region is only executed after the previous one has returned.
A requires directive with unified_shared_memory will remove any nvptx device from the list of available devices (“host fallback”).
See also the -msoft-stack option in the GCC manual.
The omp_target_memcpy_rect and omp_target_memcpy_rect_async routines and the target update directive for non-contiguous list items use the 2D and 3D memory-copy functions of the CUDA library. Higher dimensions call those functions in a loop and are therefore supported.
Low-latency memory (omp_low_lat_mem_space) is supported when the access trait is set to cgroup, the ISA is at least sm_53, and the PTX version is at least 4.1. The default pool size is 8 kiB per team, but it may be adjusted at run time by setting the environment variable GOMP_NVPTX_LOWLAT_POOL=bytes. The maximum value is limited by the available hardware, and care should be taken that the selected pool size does not unduly limit the number of teams that can run simultaneously.
omp_low_lat_mem_alloc cannot be used with true low-latency memory because its definition implies the omp_atv_all access trait; main graphics memory is used instead.
The omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and omp_thread_mem_alloc allocators all use low-latency memory as first preference and fall back to main graphics memory when the low-latency pool is exhausted.