OpenACC

This page contains information on GCC's implementation of the OpenACC specification and related functionality. OpenACC is intended for programming accelerator devices such as GPUs, including code offloading to these devices. Currently devices using Nvidia PTX (nvptx) are supported, and AMD GCN device support is in development.

The GCC 7, 8, and 9 release series are currently supported by the GCC community. GCC 7 and 8 implement most of the OpenACC 2.0a specification. GCC 9 implements most of the OpenACC 2.5 specification. The OpenACC development branch (see below) implements most of the OpenACC 2.6 specification, and contains further improvements.

For discussing this project, please use the standard GCC resources (mailing lists, Bugzilla, and so on). It is helpful to put an [OpenACC] tag into the subject line of your emails, and to set the openacc keyword in any Bugzilla issues filed.

For instructions on how to build offloading-enabled GCC from source, see Offloading: How to try offloading enabled GCC.

Quick Reference Guide

This section presents an overview of how to use OpenACC with GCC. All of the examples assume that the compiler binaries are prefixed x86_64-none-linux-gnu-.

For general GCC/libgomp and OpenACC documentation, please refer to the GCC manual and the libgomp (GNU Offloading and Multi Processing Runtime Library) manual.

Typical Compiler Usage

In most cases, an OpenACC program is compiled with optimization turned on as follows:

 $ x86_64-none-linux-gnu-gcc -fopenacc -O2 -c foo.c
 $ x86_64-none-linux-gnu-gcc -fopenacc foo.o -o foo

If the program requires math library functions inside offloaded sections, link with:

 $ x86_64-none-linux-gnu-gcc -fopenacc -foffload=-lm foo.o -o foo

GCC also provides the optimization levels -O3 and -Ofast; the latter is -O3 with additional relaxations of standards compliance. For more details, please refer to the GCC optimization options documentation.

Running OpenACC Programs Built With the Toolchain

Programs built with the OpenACC toolchain require the included libgomp runtime to run. This libgomp is placed in <toolchain-path>/<host>/lib64, where <host> is either x86_64-none-linux-gnu or powerpc64le-none-linux-gnu. To allow the built program to find these libraries, users can either:

  1. Use RPATH to put the libgomp path into the ELF executable at link time, e.g., add -Wl,-rpath,<toolchain-path>/x86_64-none-linux-gnu/lib64 to the above link command.

  2. Set the LD_LIBRARY_PATH variable to include the libgomp path when running the built executable.

-fopenacc

Enables OpenACC in the compiler: directives in the code are processed, and the _OPENACC preprocessor macro is defined. See also the description in the GCC manual.
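
As a quick illustration (a minimal sketch, not taken from the GCC manual), a program can test for OpenACC compilation like this:

#include <stdio.h>

int
main (void)
{
#ifdef _OPENACC
  /* Defined when compiling with -fopenacc; expands to the supported
     OpenACC version as an integer of the form yyyymm.  */
  printf ("compiled with -fopenacc; _OPENACC = %d\n", _OPENACC);
#else
  printf ("compiled without OpenACC support\n");
#endif
  return 0;
}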

-fopenacc-dim=<gang:worker:vector>

This option specifies the default OpenACC compute dimensions for parallel constructs (when there are no explicit user-defined clauses on the directive). Each of the gang, worker, and vector arguments must be a positive integer or a minus sign; a minus sign allows the runtime to determine a suitable default based on the accelerator's capabilities. The default for NVPTX accelerator targets is -:32:32. See also the description in the GCC manual.
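
For example (an illustrative invocation, reusing the foo.c example from above), to let the runtime choose the number of gangs while fixing the worker and vector dimensions to 32:

 $ x86_64-none-linux-gnu-gcc -fopenacc -fopenacc-dim=-:32:32 -O2 -c foo.c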

-foffload=[...]

Configure offloading compilation. See https://gcc.gnu.org/wiki/Offloading#Compilation_options for details.

One common use of this option is to link the math library for offload targets as well, that is, -lm -foffload=-lm, or -foffload=-lgfortran if you are getting link errors with Fortran code. (See PR90386 for context.)

-foffload-force

Forces offloading in cases where the compiler would otherwise decide to avoid it.

-fopt-info-optimized-omp

This flag instructs the compiler to print out any parallelism that it detected. (Previously known as -fopt-info-note-omp.)

-fopenacc-kernels

The option -fopenacc-kernels can be used to configure the handling of kernels constructs. The default behavior, -fopenacc-kernels=split, causes OpenACC kernels constructs to be split into a sequence of compute constructs, each of which is then handled individually. To disable the default behavior, use -fopenacc-kernels=parloops, which causes whole OpenACC kernels constructs to be handled by the parloops pass.
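
For example (an illustrative invocation, again using foo.c from above), to handle whole kernels constructs with the parloops pass:

 $ x86_64-none-linux-gnu-gcc -fopenacc -fopenacc-kernels=parloops -O2 -c foo.c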

Runtime Environment Variables for Built OpenACC Programs

These variables, read and interpreted by the libgomp OpenACC runtime included with the toolchain, can be set when executing built programs:

ACC_DEVICE_TYPE

Two possible values can be set: nvidia and host. The former is the default and means offloading to NVPTX accelerator devices; the latter means single-threaded host-fallback execution, in a shared-memory mode.
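
For example, to force single-threaded host-fallback execution of a built program (foo being a placeholder for the executable name):

 $ ACC_DEVICE_TYPE=host ./foo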

ACC_DEVICE_NUM

The value can be set to a non-negative integer, designating a specific accelerator device to use (if more than one is installed). A runtime error is triggered if the specified number is out of range.
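
For example, to select device number 0 (foo again being a placeholder for the executable name):

 $ ACC_DEVICE_NUM=0 ./foo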

GOMP_OPENACC_DIM

Override the default compute dimensions for parallel offload regions. The default setting is determined by the runtime after examining the available hardware resources. The optimal setting for a particular application may not be the default.

Syntax: GOMP_OPENACC_DIM='gang:worker:vector'
Example: GOMP_OPENACC_DIM='5120:1:32'

GOMP_DEBUG

A setting of GOMP_DEBUG=1 can be added to the environment to enable debugging output during execution. Currently it logs data management and GPU kernel launches; for active NVPTX device types, a dump of the offloaded PTX code is also included.
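
Example (foo again being a placeholder for the built executable):

 $ GOMP_DEBUG=1 ./foo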

GFortran Interoperability with CUDA libraries

Fortran interface files for the CUDA libraries, based on the corresponding header files provided by the CUDA Toolkit, are publicly available on GitHub: https://github.com/MentorEmbedded/fortran-cuda-interfaces. They contain the modules cublas_v2, cublas (legacy CUBLAS interface), cublasxt (multi-GPU host interface), openacc_cublas (interface for OpenACC device code regions), and cufft. These can be used by adding the following lines at the beginning of a Fortran program, function, or subroutine:

use cublas_v2
use cublas
use cublasxt
use openacc_cublas
use cufft

The following program demonstrates how to use cublasSaxpy with gfortran.

program test
  use cublas
  implicit none

  integer, parameter :: N = 10
  integer :: i
  real*4 :: x_ref(N), y_ref(N), x(N), y(N), a

  a = 2.0

  do i = 1, N
     x(i) = 4.0 * i
     y(i) = 3.0
     x_ref(i) = x(i)
     y_ref(i) = y(i)
  end do

  call saxpy (N, a, x_ref, y_ref)

  !$acc data copyin (x) copy (y)
  !$acc host_data use_device (x, y)
  call cublassaxpy(N, a, x, 1, y, 1)
  !$acc end host_data
  !$acc end data

  do i = 1, N
     if (y(i) .ne. y_ref(i)) call abort
  end do
end program test

subroutine saxpy (nn, aa, xx, yy)
  integer :: nn
  real*4 :: aa, xx(nn), yy(nn)
  integer i
  real*4 :: t
  !$acc routine

  do i = 1, nn
    yy(i) = yy(i) + aa * xx(i)
  end do
end subroutine saxpy

To build and run this example, issue the following commands:

 $ gfortran -fopenacc cublas-test.f90 -lcublas
 $ ./a.out

You may need to add -L/path/to/cuda/lib if CUDA is not installed in a standard location. At present, the CUDA interface files for Fortran include support for calling CUBLAS and CUFFT functions from the host. As the example demonstrates, the on-device data for those CUDA functions can be managed by OpenACC. However, the CUDA functions cannot be called from within OpenACC parallel or kernels regions.

The CUDA interface files support the functions defined in CUDA 9. If new functions are added to CUDA in a later release, they can still be called from GFortran by manually writing interface blocks for them. For example, to call cublassaxpy from Fortran, use the following interface block.

interface
   subroutine cublassaxpy(N, alpha, x, incx, y, incy) bind(c, name="cublasSaxpy")
     use iso_c_binding
     integer(kind=c_int), value :: N
     real*4, value :: alpha
     type(*), dimension(*) :: x
     integer(kind=c_int), value :: incx
     type(*), dimension(*) :: y
     integer(kind=c_int), value :: incy
   end subroutine cublassaxpy
end interface

The array arguments passed to cublassaxpy must be device addresses, for example made available via an OpenACC host_data use_device region as in the example above. A more complete example can be found here:

https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00976.html

Notes on GPU Parallel Code Generation

Inside an OpenACC kernels construct, the auto clause is the default on loop constructs not annotated with an explicit independent or seq clause. (This is in contrast to a parallel construct, where the independent clause is the default.) Such an explicit or implicit auto clause asks the compiler to analyze, and if possible parallelize, the code in the respective loop construct. The same applies to any loops inside OpenACC kernels regions that are not marked with a loop construct at all. GCC's support for such analysis is currently insufficient, especially for nested loops, and where GCC can parallelize a loop, it currently only uses gang parallelism, but not worker or vector parallelism. As a result, users may find the performance of such loops to be slower than expected, in comparison to other compilers.

It is therefore currently recommended to use the OpenACC parallel construct or, if using the kernels construct, to annotate all loops with explicit independent (or seq, if not parallelizable) clauses, rather than relying on explicit or implicit auto clauses. See also the OpenACC kernels Construct Optimization Tutorial below.
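
As a sketch of this recommendation (not taken from the GCC manual; the function and variable names are illustrative), a simple loop can be written either with a parallel construct or with a kernels construct whose loops are all explicitly annotated:

/* parallel construct: loop iterations are treated as independent by default.  */
void
scale (int n, float *x, float a)
{
  #pragma acc parallel loop copy(x[0:n])
  for (int i = 0; i < n; i++)
    x[i] = a * x[i];
}

/* kernels construct: annotate every loop explicitly instead of relying on auto.  */
void
scale_kernels (int n, float *x, float a)
{
  #pragma acc kernels copy(x[0:n])
  {
    #pragma acc loop independent
    for (int i = 0; i < n; i++)
      x[i] = a * x[i];
  }
}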

GCC's optimization passes have not yet been tuned very much for GPU parallel code generation, in comparison to compilers developed specifically for GPU targets. It is therefore possible that GCC generates GPU code that, while correct, does not yet make the best use of the GPU's instruction set or hardware capabilities, such as its memory hierarchy. We appreciate reports about such anomalies, and will use them to tune GPU parallel code generation.

Users may find that GCC does not always do a good job of scheduling parallel loops, especially in multi-level nested loop constructs. Adding explicit gang, worker, and vector clauses to the respective loop directives may help to guide the compiler, and can result in significant performance gains. We appreciate reports of cases where manual scheduling of loop nests improves performance, and can use them to improve the automatic scheduling algorithms.
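
As an illustration (a sketch only; the clauses shown are one possible manual schedule, not a general recommendation), explicit scheduling clauses on a two-level loop nest might look like this:

void
add_matrices (int n, float r[n][n], const float a[n][n], const float b[n][n])
{
  /* Outer loop over rows distributed across gangs; inner loop over
     columns distributed across workers and vector lanes.  */
  #pragma acc parallel loop gang copyin(a[0:n][0:n], b[0:n][0:n]) copyout(r[0:n][0:n])
  for (int i = 0; i < n; i++)
    {
      #pragma acc loop worker vector
      for (int j = 0; j < n; j++)
        r[i][j] = a[i][j] + b[i][j];
    }
}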

OpenACC kernels Construct Optimization Tutorial

This tutorial explains how to optimize computations using OpenACC kernels constructs for offloading parts of a program to an accelerator device like a GPU.

Example Application

In this tutorial we will optimize a simple matrix multiplication. The original function looks like this (N is expected to be defined as a macro):

void
matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  for (int j = 0; j < N; j++)
    {
      for (int i = 0; i < N; i++)
        {
          float sum = 0;
          for (int k = 0; k < N ; k++)
            sum += a[i][k] * b[k][j];
          r[i][j] = sum;
        }
    }
}

This is a normal C function that can be compiled with GCC and executed as usual. For a value of N = 2000, this runs in about 94.260000 seconds on a test machine, executed on the CPU in a single thread.

To offload and optimize this function using OpenACC, the code itself will not change. Only some #pragmas will be needed to tell the compiler what data and computations to move to the accelerator, and how to parallelize the loops.

Introducing the OpenACC kernels Construct

The OpenACC kernels construct indicates a program region to offload to the accelerator. Data clauses on the kernels region list the data to be copied to and from the accelerator device.

void
matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  #pragma acc kernels \
    copy(r[0:N][0:N], a[0:N][0:N], b[0:N][0:N])
  {
    for (int j = 0; j < N; j++)
      {
        for (int i = 0; i < N; i++)
          {
            float sum = 0;
            for (int k = 0; k < N ; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;
          }
      }
  }
}

As written, GCC (built from the OpenACC development branch) will warn that it is not able to parallelize the kernels region automatically:

matmul.c: In function 'matrix_multiply._omp_fn.0':
matmul.c:4:11: warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
   #pragma acc kernels \
           ^

This means that the region will be executed on the host CPU instead of the accelerator device. This choice of the compiler can be overridden using the -foffload-force flag during compilation or by defining the ACC_DEVICE_TYPE environment variable before execution. In this example, forcing offloading would result in a large slowdown: Sequential execution on GPUs is much slower than on modern CPUs. The advantage of offloading to the accelerator comes from parallelization, which will be addressed in the next step.

Parallelizing Loops

To fully benefit from GCC's OpenACC support, each for loop in the kernels region (or in an OpenACC parallel construct) needs a #pragma acc loop annotation. Clauses such as independent, seq, or auto on the pragma tell the compiler how to handle each loop.

In the matrix multiplication example, the two outer loops are independent. The innermost loop has data dependences on the sum variable, as the value from one iteration is needed in the next one. We will improve this in the next step, but for now the innermost loop can be marked seq:

void
matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  #pragma acc kernels \
    copy(r[0:N][0:N], a[0:N][0:N], b[0:N][0:N])
  {
    #pragma acc loop independent
    for (int j = 0; j < N; j++)
      {
        #pragma acc loop independent
        for (int i = 0; i < N; i++)
          {
            float sum = 0;
            #pragma acc loop seq
            for (int k = 0; k < N ; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;
          }
      }
  }
}

With these simple annotations, the execution time for N = 2000 on the test machine is 2.050000 seconds, a 46-fold speedup over CPU-only execution.

Reductions

The data dependence on the sum variable in the innermost loop is special: The values from different loop iterations are combined using an associative arithmetic operator. The loop computes the value of sum equal to the following expression:

(((0 + a[i][0] * b[0][j]) + a[i][1] * b[1][j]) + a[i][2] * b[2][j]) + ...

Mathematically, since addition is associative, parentheses may be moved around freely, for example:

(0 + a[i][0] * b[0][j]) + ((a[i][1] * b[1][j] + a[i][2] * b[2][j]) + ...)

This means that the values of a[i][k] * b[k][j] may be computed independently and then combined using the + operator in any grouping that is convenient. Such computations are called reductions, and there is special support for expressing them in OpenACC. The innermost loop can be changed to the following:

            float sum = 0;
            #pragma acc loop independent reduction(+: sum)
            for (int k = 0; k < N ; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;

The compiler and OpenACC runtime take care of organizing the independent execution of the loop iterations and summing up the intermediate results.

Using the independent and reduction clauses like this brings execution time to 1.750000 seconds, a further 15% improvement.

Note that while addition is associative on real numbers, it is not in fact associative on floating-point numbers. Similarly to the optimizations enabled by command-line flags like -ffast-math, using reductions can introduce numerical errors compared to sequential execution. It is the programmer's responsibility to ensure that the reduction is numerically correct for a given application.

Further Optimizations

For certain problems and problem sizes, some additional changes can improve execution time even further.

Automatic Loop Partitioning

GCC uses the following strategy to assign partitioning to independent acc loops that lack explicit partitioning clauses. If there is only one acc loop, e.g.

#pragma acc parallel loop
for (...)

GCC assigns it gang and vector level parallelism. I.e.

#pragma acc parallel loop gang vector
for (...)

If there are two loops, L1 and L2, with L2 nested inside L1, e.g.

#pragma acc parallel loop
for (i = ...)
{
  #pragma acc loop
  for (j = ...)

then the inner loop is assigned vector partitioning, and the outer loop is assigned gang and worker partitioning. I.e.

#pragma acc parallel loop gang worker
for (i = ...)
{
  #pragma acc loop vector
  for (j = ...)

Lastly, if there are three or more nested independent loops, the innermost loop will be assigned vector partitioning, the outermost loop gang partitioning, and the loop immediately inside the outermost loop worker partitioning; any remaining loops in between are executed sequentially (seq). E.g.

#pragma acc parallel loop
for (i = ...)
{
  #pragma acc loop
  for (j = ...)
    #pragma acc loop
    for (k = ...)
      #pragma acc loop
      for (l = ...)

becomes

#pragma acc parallel loop gang
for (i = ...)
{
  #pragma acc loop worker
  for (j = ...)
    #pragma acc loop seq
    for (k = ...)
      #pragma acc loop vector
      for (l = ...)

The -fopt-info-optimized-omp flag may be used to show how GCC automatically partitions acc loops.
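
For example (an illustrative invocation, matmul.c being the tutorial source file from above):

 $ x86_64-none-linux-gnu-gcc -fopenacc -fopt-info-optimized-omp -O2 -c matmul.c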

Implementation Status

trunk: GCC's general master development branch

GCC's general master development branch. Release branches are annually branched off of it. See https://gcc.gnu.org/develop.html for details.

Subject to the GCC community's review process, work is ongoing to contribute into trunk the OpenACC and related changes staged in the OpenACC development branch, for the next GCC release series. Compared to the OpenACC development branch (see below), GCC trunk lags behind in terms of OpenACC functionality and performance optimizations.

Compared to GCC 9, the upcoming GCC 10 release series contains the following OpenACC changes:

For Nvidia PTX (nvptx) offloading:

(Reviewed up to trunk r270579: gcc-9-branch branch point.)

openacc-gcc-8-branch: current OpenACC development branch

This Git-only branch is used for collaborative development. The branch is based on gcc-8-branch. Find it at git://gcc.gnu.org/git/gcc.git, https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/openacc-gcc-8-branch, or https://github.com/gcc-mirror/gcc/tree/openacc-gcc-8-branch. Please send email with a short-hand [og8] tag in the subject line, and use ChangeLog.openacc files.

Use this branch if you want to make use of the latest OpenACC functionality and performance optimizations that are not yet part of the official GCC releases.

The implementation status on openacc-gcc-8-branch is based on the GCC 8 release series (see below), with additional support for the OpenACC 2.6 specification and the following exceptions:

Compared to the GCC 8 release series (see below), openacc-gcc-8-branch contains the following enhancements:

Known Issues

The following are known issues in openacc-gcc-8-branch:

(Reviewed up to openacc-gcc-8-branch commit 2889b3618ae906bee9af58baf9e38e41c189e632.)

gcc-9-branch: GCC 9 Release Series (GCC 9.1 released on 2019-05-03)

Compared to GCC 8, the GCC 9 release series contains the following OpenACC changes:

For Nvidia PTX (nvptx) offloading:

(Reviewed up to trunk r270579: gcc-9-branch branch point.)

gcc-8-branch: GCC 8 Release Series (GCC 8.1 released on 2018-05-02)

Compared to GCC 7, the GCC 8 release series contains the following OpenACC changes:

gcc-7-branch: GCC 7 Release Series (GCC 7.1 released on 2017-05-02)

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 and PowerPC 64-bit little-endian GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for an arbitrary number of gangs, up to 32 workers, and 32 vectors.

Initial support for parallelized execution of OpenACC kernels constructs:

Compared to GCC 6, the GCC 7 release series contains the following OpenACC changes:

The device_type clause is not supported. The bind and nohost clauses are not supported.

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.
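
For illustration (a minimal sketch using the OpenACC runtime API; not taken from the GCC manual), the C usage in question looks like this:

#include <openacc.h>

#pragma acc routine seq
int
running_on_nvidia_device (void)
{
  /* With a compile-time constant argument such as acc_device_nvidia,
     GCC folds this call to a constant for C and C++, but not for
     Fortran.  */
  return acc_on_device (acc_device_nvidia);
}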

Issue Tracking

Open OpenACC bugs

Known issues with offloading.
