On the hardware side, there is the hierarchy (fine to coarse):

- thread
- warp
- thread block
- streaming multiprocessor
All OpenMP and OpenACC levels are used, i.e.

- OpenMP's simd and OpenACC's vector map to threads
- OpenMP's threads ("parallel") and OpenACC's workers map to warps
- OpenMP's teams and OpenACC's gangs map to thread blocks
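For illustration, the following C sketch (saxpy, n, a, x and y are placeholder names) exercises all three levels with one combined construct; the comments restate the nvptx mapping above:

    /* Sketch: all three OpenMP levels in one combined construct.
       On nvptx, teams become thread blocks, the parallel threads
       become warps, and simd lanes become CUDA threads.  */
    void saxpy (int n, float a, float *x, float *y)
    {
      #pragma omp target teams distribute parallel for simd \
                  map(to: x[0:n]) map(tofrom: y[0:n])
      for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
    }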
The used sizes are:

- warp_size is always 32
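The remaining sizes can be requested from user code; the sketch below uses the standard num_teams and thread_limit clauses (the function name zero and the values 4 and 64 are illustrative upper bounds, which the runtime may reduce to fit the device):

    /* Sketch: requesting the launch geometry explicitly: at most
       4 teams (thread blocks) with at most 64 OpenMP threads each;
       warp_size (32) is fixed by the hardware.  */
    void zero (int n, double *a)
    {
      #pragma omp target teams distribute parallel for \
                  num_teams(4) thread_limit(64) map(from: a[0:n])
      for (int i = 0; i < n; i++)
        a[i] = 0.0;
    }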
Additional information can be obtained by setting the environment variable GOMP_DEBUG=1 (very verbose; grep for 'kernel.*launch' for launch parameters).
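For example, running a program built with nvptx offloading as GOMP_DEBUG=1 ./a.out (a.out is a placeholder name) and piping the output through grep 'kernel.*launch' prints the launch parameters of each kernel.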
GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA,
which caches the JIT in the user’s directory (see CUDA documentation; can be
tuned by the environment variables CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).
Note: While PTX ISA is generic, the -mptx= and -march= command-line options still affect the used PTX ISA code and, thus, the requirements on
CUDA version and hardware.
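For example, adding -foffload-options=-march=sm_70 when compiling and linking (sm_70 is an illustrative choice; see the GCC manual for the supported values and flag spelling) raises the minimum GPU generation and CUDA version able to JIT-compile the resulting PTX code.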
Implementation remarks:

- I/O within OpenMP target regions and OpenACC compute regions is supported using the C library printf functions (a minimal sketch follows this list). Note that the Fortran print and write statements are not supported yet.
- Compiling OpenMP code that contains requires reverse_offload requires at least -march=sm_35; compiling for -march=sm_30 is not supported.
- For code containing reverse offload (i.e. target regions with device(ancestor:1)), there is a slight performance penalty for all target regions, consisting mostly of shutdown delay. Per device, reverse offload regions are processed serially, such that the next reverse offload region is only executed after the previous one has returned (see the reverse-offload sketch after this list).
- An OpenMP requires directive with unified_shared_memory will remove any nvptx device from the list of available devices (“host fallback”); a host-fallback sketch follows this list.
- The default per-warp stack size is 128 kiB; see also -msoft-stack in the GCC manual.
- The OpenMP routines omp_target_memcpy_rect and omp_target_memcpy_rect_async and the target update directive for non-contiguous list items will use the 2D and 3D memory-copy functions of the CUDA library. Higher dimensions will call those functions in a loop and are therefore supported (see the last sketch after this list).
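A minimal sketch of the printf support from the first remark (the message text is arbitrary):

    /* Sketch: C library printf works inside a target region on
       nvptx; Fortran print/write has no such support yet.  */
    #include <stdio.h>

    int main (void)
    {
      #pragma omp target
        printf ("hello from the device\n");
      return 0;
    }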
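A minimal sketch of reverse offload as constrained above; this assumes compilation with -march=sm_35 or newer, and the printed message is a placeholder:

    /* Sketch: the inner construct with device(ancestor: 1) runs
       back on the host while the enclosing region runs on the GPU;
       per the remark above, such regions execute serially.  */
    #include <stdio.h>

    #pragma omp requires reverse_offload

    int main (void)
    {
      #pragma omp target
      {
        #pragma omp target device(ancestor: 1)
          printf ("executed on the host\n");
      }
      return 0;
    }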
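A sketch of the host fallback caused by unified_shared_memory; omp_get_num_devices and omp_is_initial_device are standard OpenMP API routines:

    /* Sketch: with this requirement no nvptx device is offered,
       so the target region below runs on the host and
       omp_is_initial_device () returns 1 there.  */
    #include <stdio.h>
    #include <omp.h>

    #pragma omp requires unified_shared_memory

    int main (void)
    {
      printf ("available devices: %d\n", omp_get_num_devices ());
      #pragma omp target
        printf ("on initial device: %d\n", omp_is_initial_device ());
      return 0;
    }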
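Finally, a sketch of a non-contiguous target update (the array size N and the section bounds are placeholders); each row fragment is strided, so on nvptx this becomes a 2D CUDA memory copy:

    /* Sketch: copy back only an inner sub-rectangle of a mapped
       2D array; the strided rows make the list item
       non-contiguous.  */
    #define N 8

    void f (void)
    {
      double a[N][N] = { { 0.0 } };

      #pragma omp target enter data map(to: a)
      /* ... compute on the device ... */
      #pragma omp target update from(a[1:N-2][1:N-2])
      #pragma omp target exit data map(delete: a)
    }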