OpenACC

This page contains information on GCC's implementation of the OpenACC specification and related functionality. OpenACC is intended for programming accelerator devices such as GPUs, including the offloading of code to these devices. Currently, devices using Nvidia PTX (nvptx) are supported.

OpenACC support is an experimental feature, first included in GCC 5.1, and may not meet the needs of general application development. Compared to GCC 5, the GCC 6 and 7 release series include a much improved implementation of the OpenACC 2.0a specification. The gomp-4_0-branch supports most of the OpenACC 2.5 specification.

For discussing this project, please use the standard GCC resources (mailing lists, Bugzilla, and so on). It's helpful to put an [OpenACC] tag into your email's Subject line, and to set the openacc keyword in any Bugzilla issues filed.

For instructions on how to build an offloading-enabled GCC from source, see Offloading: How to try offloading enabled GCC.

Implementation Status

Listed first is the most current work in progress, followed by the GCC release series from newest to oldest.

Work in Progress (gomp-4_0-branch)

Current development continues on gomp-4_0-branch. Please add a [gomp4] tag to any patches posted for inclusion in that branch.

Work is ongoing to merge gomp-4_0-branch code into trunk, for the next GCC release series.

The implementation status on gomp-4_0-branch is based on the GCC 6 release series (see below), with additional support for the OpenACC 2.5 specification and the following exceptions:

gomp-4_0-branch contains the following enhancements:

Known Issues

The following are known issues in gomp-4_0-branch.

GCC 7 Release Series

Compared to GCC 6, the GCC 7 release series contains the following OpenACC changes:

GCC 6 Release Series (GCC 6.1 released on 2016-04-27)

Compared to GCC 5, the GCC 6 release series includes a much improved implementation of the OpenACC 2.0a specification.

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 and PowerPC 64-bit little-endian GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for an arbitrary number of gangs, up to 32 workers, and a vector length of 32.

Initial support for parallelized execution of OpenACC kernels constructs:

The device_type clause is not supported. The bind and nohost clauses are not supported. The host_data directive is not supported in Fortran; see <https://gcc.gnu.org/PR70598>.

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.

GCC 5 Release Series (GCC 5.1 released on 2015-04-22)

The GCC 5 release series includes a preliminary implementation of the OpenACC 2.0a specification. No further OpenACC development work is planned for this release series.

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for one gang, one worker, and a number of vectors. These vectors all execute in "vector-redundant" mode. This means that inside a parallel construct, offloaded code outside of any loop construct is executed by all vectors, not just a single vector. The reduction clause is not supported with the parallel construct.

The kernels construct is supported only in a simplistic way: the code is offloaded, but executes with just one gang, one worker, one vector. No directives are supported inside kernels constructs. Reductions are not supported inside kernels constructs.

The atomic, cache, declare, host_data, and routine directives are not supported. The default(none), device_type, firstprivate, and private clauses are not supported. A parallel construct's implicit data attributes for scalar data types will be treated as present_or_copy instead of firstprivate. Only the collapse clause is supported for loop constructs, and there is incomplete support for the reduction clause.

Combined directives (kernels loop, parallel loop) are not supported; use kernels alone, or parallel followed by loop, instead.

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.

Issue Tracking

/!\ Incomplete.

Open OpenACC bugs

Known issues with offloading.

Notes on exploiting parallelism

The "kernels" directive allows the optimization facilities of the underlying compiler to detect parallel loops. Because GCC supports a large number of different targets, its optimization passes are not as well tuned for GPU parallelism as those of compilers developed specifically for GPUs. As a result, users may find the performance of kernels regions slower than expected in comparison to other compilers. A later section provides more information. To obtain maximum performance, users are encouraged to convert kernels regions to parallel regions.

For the same reason, users may find that with "parallel" regions, GCC does not schedule "parallel" loops (loops that are declared as parallel, but that do not contain directives such as "gang", "worker", or "vector" to indicate the specific form of parallelism to use) as well as compilers tuned specifically for the GPU. Adding "gang", "worker", and "vector" directives in those cases to guide the compiler can result in significant performance gains.

GCC's strategy for utilizing parallelism focuses on selecting vector loops first, followed by worker and gang loops. This differs from most other compilers, which focus on finding gang loops first, then worker and vector. Each strategy will perform better on some codes and worse on others. When finding less effective parallelism than expected, guiding the strategy with explicit "gang", "worker", and "vector" directives often improves the performance.

Quick Reference Guide

This section presents an overview on how to use OpenACC with GCC. All of the examples assume that the compiler binaries are prefixed x86_64-none-linux-gnu-.

For general GCC/libgomp and OpenACC documentation, please refer to:

Typical Compiler Usage

In most cases, compiling an OpenACC program with optimization turned on looks like this:

 $ x86_64-none-linux-gnu-gcc -fopenacc -O2 -c foo.c
 $ x86_64-none-linux-gnu-gcc -fopenacc foo.o -o foo

If the program requires math library functions inside offloaded sections, link with:

 $ x86_64-none-linux-gnu-gcc -fopenacc -foffload=-lm foo.o -o foo

GCC also provides the -O3 and -Ofast optimization levels; -Ofast is -O3 with additional relaxations of standards compliance. For more details, please refer to the GCC optimization options documentation.

Running OpenACC Programs Built With the Toolchain

Programs built with the OpenACC toolchain require the included libgomp runtime to run. The included libgomp is placed in <toolchain-path>/<host>/lib64, where <host> is either x86_64-none-linux-gnu or powerpc64le-none-linux-gnu. To allow the built program to find these libraries, users can either:

  1. Use RPATH to embed the libgomp path into the ELF executable during linking, e.g. add -Wl,-rpath,<toolchain-path>/x86_64-none-linux-gnu/lib64 to the above link command.

  2. Set the LD_LIBRARY_PATH variable to include the libgomp path when running the built executable.
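Concretely, the two options look like this (a sketch; <toolchain-path> stands for the actual installation prefix):

```shell
# Option 1: embed the libgomp path via RPATH at link time.
$ x86_64-none-linux-gnu-gcc -fopenacc foo.o -o foo \
    -Wl,-rpath,<toolchain-path>/x86_64-none-linux-gnu/lib64

# Option 2: point the dynamic loader at libgomp when running.
$ export LD_LIBRARY_PATH=<toolchain-path>/x86_64-none-linux-gnu/lib64:$LD_LIBRARY_PATH
$ ./foo
```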

-fopenacc

Enables OpenACC in the compiler, activating directives in the code, and defines the _OPENACC preprocessor symbol. Also see the description in the GCC manual.

-fopenacc-dim=<gang:worker:vector>

This option specifies the default OpenACC compute dimensions for parallel constructs when there are no explicit user-defined clauses on the directive. Each of the gang, worker, and vector arguments must be a positive integer or a minus sign. A minus sign allows the runtime to determine a suitable default based on the accelerator's capabilities. The default for NVPTX accelerator targets is -:32:32. Also see the description in the GCC manual.

-foffload

Specifies offloading targets, and options for them. The most common use of this option is probably to link in the math library on the accelerator target, i.e. -foffload=-lm. For general documentation of this option, please refer to the GCC Offloading wiki.

-fopt-info-note-omp

This flag instructs the compiler to print out any parallelism that it detected.

Runtime Environment Variables for Built OpenACC Programs

These variables, read and interpreted by the libgomp OpenACC runtime included with the toolchain, can be set when executing built programs:

ACC_DEVICE_TYPE

Two possible values can be set: nvidia and host. The former (the default) selects offloading to NVPTX accelerator devices; the latter selects single-threaded host-fallback execution, in a shared-memory mode.

ACC_DEVICE_NUM

The value can be set to a non-negative integer, designating a specific accelerator device to use (if more than one is installed). A runtime error is triggered if the specified number is out of range.

GOMP_OPENACC_DIM

This variable has the same functionality as the -fopenacc-dim= compiler option, except that it is handled by the libgomp runtime instead of the compiler.

GOMP_DEBUG

A setting of GOMP_DEBUG=1 can be added to the environment to enable some debugging output during execution. Currently it logs data management and GPU kernel launches, and, if an NVPTX device type is active, also includes a dump of the offloaded PTX code.
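For example, built programs can be run with these variables set on the command line (a sketch; ./foo is a hypothetical OpenACC executable):

```shell
$ ACC_DEVICE_TYPE=host ./foo        # force single-threaded host fallback
$ ACC_DEVICE_NUM=0 ./foo            # select the first accelerator device
$ GOMP_OPENACC_DIM=-:32:32 ./foo    # override default compute dimensions
$ GOMP_DEBUG=1 ./foo                # log data management and kernel launches
```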

GFortran Interoperability with CUDA libraries

CUDA libraries may be called from Fortran through the use of function interfaces. For example, to call cublassaxpy from Fortran, use the following interface block.

interface
   subroutine cublassaxpy(N, alpha, x, incx, y, incy) bind(c, name="cublasSaxpy")
     use iso_c_binding
     integer(kind=c_int), value :: N
     real*4, value :: alpha
     type(*), dimension(*) :: x
     integer(kind=c_int), value :: incx
     type(*), dimension(*) :: y
     integer(kind=c_int), value :: incy
   end subroutine cublassaxpy
end interface

The arguments passed to cublassaxpy must be declared as host_data. A more complete example can be found here:

https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00976.html

Automatic Loop Partitioning

GCC uses the following strategy to assign unused partitioning to independent acc loops. If there is only one acc loop, e.g.

#pragma acc parallel loop
for (...)

GCC assigns it gang and vector level parallelism. I.e.

#pragma acc parallel loop gang vector
for (...)

If there are two loops, L1 and L2, with L2 nested inside L1, e.g.

#pragma acc parallel loop
for (i = ...)
{
  #pragma acc loop
  for (j = ...)

then the inner loop is assigned vector partitioning, and the outer loop is assigned gang and worker partitioning. I.e.

#pragma acc parallel loop gang worker
for (i = ...)
{
  #pragma acc loop vector
  for (j = ...)

Lastly, if there are three or more independent loops, the innermost loop is assigned vector partitioning, the outermost loop gang partitioning, and the second-outermost loop worker partitioning; any loops in between are executed sequentially (seq). E.g.

#pragma acc parallel loop
for (i = ...)
{
  #pragma acc loop
  for (j = ...)
    #pragma acc loop
    for (k = ...)
      #pragma acc loop
      for (l = ...)

becomes

#pragma acc parallel loop gang
for (i = ...)
{
  #pragma acc loop worker
  for (j = ...)
    #pragma acc loop seq
    for (k = ...)
      #pragma acc loop vector
      for (l = ...)

The -fopt-info-note-omp flag may be used to show how GCC automatically partitions acc loops.