OpenACC

This page contains information on GCC's implementation of the OpenACC specification and related functionality. OpenACC is intended for programming accelerator devices such as GPUs, including code offloading to these devices. Currently, offloading is supported for devices using Nvidia PTX (nvptx) and for AMD GCN devices (Fiji and Vega GPUs).

The GCC 8, 9 and 10 release series are currently supported by the GCC community. GCC 7 and 8 implement most of the OpenACC 2.0a specification; GCC 9 implements most of the OpenACC 2.5 specification. The OG9 development branch (openacc-gcc-9-branch; in the new GCC git repository known as devel/omp/gcc-9) implements most of the OpenACC 2.6 specification and contains further improvements. GCC 10 also implements most of OpenACC 2.6; while it includes further GCC 10 improvements, it lacks some of the OG9 features. (See below for more details on OG9 and GCC 10.)

For discussing this project, please use the standard GCC resources (mailing lists, Bugzilla, and so on). It is helpful to put an [OpenACC] tag into your emails' subject lines, and to set the openacc keyword in any Bugzilla issues filed.

For instructions on how to build an offloading-enabled GCC from source, see Offloading: How to try offloading enabled GCC.

Quick Reference Guide

This section presents an overview of how to use OpenACC with GCC. All of the examples assume that the compiler binaries are prefixed x86_64-none-linux-gnu-.

For general GCC/libgomp and OpenACC documentation, please refer to the GCC manual and the GCC libgomp manual.

Typical Compiler Usage

In most cases, an OpenACC program is compiled with optimization turned on as follows:

 $ x86_64-none-linux-gnu-gcc -fopenacc -O2 -c foo.c
 $ x86_64-none-linux-gnu-gcc -fopenacc foo.o -o foo

If the program requires math library functions inside offloaded sections, link with:

 $ x86_64-none-linux-gnu-gcc -fopenacc -foffload=-lm foo.o -o foo

For Fortran, you may need to link with -foffload=-lgfortran as well. GCC also provides the optimization options -O3 and -Ofast; the latter is -O3 with additional relaxations of standards compliance. For more details, please refer to the GCC optimization options documentation.
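
For illustration, a Fortran link step could look like the following sketch; it assumes the Fortran driver carries the same x86_64-none-linux-gnu- prefix, and whether both offload libraries are actually needed depends on the program:

 $ x86_64-none-linux-gnu-gfortran -fopenacc -foffload=-lgfortran -foffload=-lm foo.o -o foo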

For AMD GCN devices, you additionally have to specify the GPU to be used with -march=<name>, where name is either carrizo or fiji (both third generation) – or one of the fifth-generation VEGA GPUs (gfx900 or gfx906). In order to apply this setting only to the AMD GCN offloading target, and not to the host (-march=…) or to all offloading targets (as with -foffload=-march=…), use -foffload=amdgcn-amdhsa=<options>. For instance: -foffload=amdgcn-amdhsa="-march=gfx906". [NOTE: the target triplet is set when building the compiler and might differ between vendors; it can also be, e.g., 'amdgcn-unknown-amdhsa'.]
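
For illustration, a complete compile step for a VEGA GPU might look like the following sketch (the file name and the target triplet are examples and may differ on your system):

 $ x86_64-none-linux-gnu-gcc -fopenacc -O2 -foffload=amdgcn-amdhsa="-march=gfx906" -c foo.c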

NOTE that many Linux distributions support offloading (at least for some accelerators), but the accelerator support is not installed by default – additional per-accelerator-type packages need to be installed. Additionally, the GCC compilers of Linux distributions are usually configured such that offloading is disabled by default (i.e. offloading code is executed on the host). Hence, -foffload=<target> has to be specified explicitly; the naming of the target triplet depends on the vendor, but nvptx-none and amdgcn-amdhsa are common choices.
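
For example, with a distribution-provided GCC whose nvptx offload package has been installed, offloading could be requested explicitly as follows (a sketch; package names and the exact triplet vary between distributions):

 $ gcc -fopenacc -foffload=nvptx-none -O2 foo.c -o foo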

Running OpenACC Programs Built With the Toolchain

Programs built with the OpenACC toolchain require the included libgomp runtime to run. The included libgomp is placed in <toolchain-path>/<host>/lib*, where <host> could be, e.g., x86_64-none-linux-gnu or powerpc64le-none-linux-gnu. To allow the built program to find these libraries, users can either:

  1. Install the library in a path which is searched by default. – This is always the case if you use the GCC of your distribution (if available; in most distributions, the nvptx or gcn support is in an additional package and not installed by default).
  2. Use RPATH to put the libgomp path into the ELF executable at link time, e.g. add -Wl,-rpath,<toolchain-path>/x86_64-none-linux-gnu/lib64 to the above link command (see the example after this list).

  3. Set the LD_LIBRARY_PATH variable to include the libgomp path when running the built executable.
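
For illustration, options 2 and 3 could look like the following sketch, where <toolchain-path> is a placeholder for your installation prefix:

 $ x86_64-none-linux-gnu-gcc -fopenacc foo.o -o foo -Wl,-rpath,<toolchain-path>/x86_64-none-linux-gnu/lib64
 $ LD_LIBRARY_PATH=<toolchain-path>/x86_64-none-linux-gnu/lib64 ./foo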

-fopenacc

Enables OpenACC in the compiler, activating directives in the code, and provides the _OPENACC preprocessor symbol. Also see the description in the GCC manual.

-fopenacc-dim=<gang:worker:vector>

This option specifies the default OpenACC compute dimensions for parallel constructs (when there are no explicit user-defined clauses on the directive). Each of the gang, worker and vector arguments must be a positive integer or a minus sign. The minus sign allows the runtime to determine a suitable default based on the accelerator's capabilities. The default for NVPTX accelerator targets is -:32:32. Also see the description in the GCC manual.
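
For example, to keep the runtime-determined number of gangs but request 16 workers and a vector length of 32 (illustrative values, not a recommendation):

 $ x86_64-none-linux-gnu-gcc -fopenacc -fopenacc-dim=-:16:32 -O2 -c foo.c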

-foffload=[...]

Configure offloading compilation. See https://gcc.gnu.org/wiki/Offloading#Compilation_options for details.

One common usage of this option is to also link in the math library for offload targets, that is, -lm -foffload=-lm. Or use -foffload=-lgfortran if you are getting link errors with Fortran code. (See PR90386 for context.)

-foffload-force

Forces offloading even if the compiler would otherwise avoid it.

-fopt-info-optimized-omp

This flag instructs the compiler to print out any parallelism that it detected. (Previously known as -fopt-info-note-omp.)
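
For example, the flag can simply be added to the compile step to see which loops were parallelized and how (a sketch using the file name from above):

 $ x86_64-none-linux-gnu-gcc -fopenacc -O2 -fopt-info-optimized-omp -c foo.c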

-fopenacc-kernels

The option -fopenacc-kernels can be used to configure the handling of kernels constructs. The default behavior, -fopenacc-kernels=split, causes OpenACC kernels constructs to be split into a sequence of compute constructs, each of which is then handled individually. To disable the default behavior, use -fopenacc-kernels=parloops, which causes whole OpenACC kernels constructs to be handled by the parloops pass.

Run-time Environment Variables Used by OpenACC Programs

The following variables are processed by GCC's OpenACC runtime library libgomp, included with the toolchain, and can be used when executing the built programs. They are also explained in the GCC libgomp documentation.

ACC_DEVICE_TYPE

Possible values are nvidia, gcn and host. The first selects offloading to NVPTX accelerator devices (the default), the second selects AMD GCN devices, and the third means single-threaded host-fallback execution, in a shared-memory mode.

ACC_DEVICE_NUM

The value can be set to a non-negative integer, designating a specific accelerator device to use (if more than one is installed). A runtime error is triggered if the specified number is out of range.
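
For example (assuming an executable named foo and, for the second line, at least two Nvidia devices), host-fallback execution or the second device could be selected like this:

 $ ACC_DEVICE_TYPE=host ./foo
 $ ACC_DEVICE_TYPE=nvidia ACC_DEVICE_NUM=1 ./foo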

GOMP_OPENACC_DIM

Override the default compute dimensions for parallel offload regions. The default setting is determined by the runtime after examining the available hardware resources. The optimal setting for a particular application may not be the default.

Syntax: GOMP_OPENACC_DIM='gang:worker:vector'
Example: GOMP_OPENACC_DIM='5120:1:32'

GOMP_DEBUG

A setting of GOMP_DEBUG=1 can be added to the environment to enable debugging output during execution. Currently, it logs data management and GPU kernel launches; for active NVPTX device types, a dump of the offloaded PTX code is also included.
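
For example, to enable the debugging output for a single run (assuming an executable named foo):

 $ GOMP_DEBUG=1 ./foo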

GFortran Interoperability with CUDA libraries

Fortran interface files for the CUDA libraries, based on the corresponding header files provided by the CUDA Toolkit, are publicly available on GitHub: https://github.com/MentorEmbedded/fortran-cuda-interfaces. They contain the modules cublas_v2, cublas (legacy CUBLAS interface), cublasxt (multi-GPU host interface), openacc_cublas (interface for OpenACC device code regions), and cufft, which can be used by adding the following lines at the beginning of a Fortran program, function or subroutine:

use cublas_v2
use cublas
use cublasxt
use openacc_cublas
use cufft

The following program demonstrates how to use cublasSaxpy with gfortran.

program test
  use cublas
  implicit none

  integer, parameter :: N = 10
  integer :: i
  real*4 :: x_ref(N), y_ref(N), x(N), y(N), a

  a = 2.0

  do i = 1, N
     x(i) = 4.0 * i
     y(i) = 3.0
     x_ref(i) = x(i)
     y_ref(i) = y(i)
  end do

  call saxpy (N, a, x_ref, y_ref)

  !$acc data copyin (x) copy (y)
  !$acc host_data use_device (x, y)
  call cublassaxpy(N, a, x, 1, y, 1)
  !$acc end host_data
  !$acc end data

  do i = 1, N
     if (y(i) .ne. y_ref(i)) call abort
  end do
end program test

subroutine saxpy (nn, aa, xx, yy)
  integer :: nn
  real*4 :: aa, xx(nn), yy(nn)
  integer i
  real*4 :: t
  !$acc routine

  do i = 1, nn
    yy(i) = yy(i) + aa * xx(i)
  end do
end subroutine saxpy

To build and run this example, issue the following commands:

$ gfortran -fopenacc cublas-test.f90 -lcublas
$ ./a.out

You may need to add -L/path/to/cuda/lib if CUDA is not installed in a standard location. At present, the CUDA interface files for Fortran include support for calling CUBLAS and CUFFT functions from the host. As the example demonstrates, the on-device data for those CUDA functions can be managed by OpenACC. However, the CUDA functions cannot be called from within OpenACC parallel or kernels regions.

The CUDA interface files support the functions defined in CUDA 9. If new functions are added to CUDA in a later release, they can still be called from gfortran by manually writing new function interfaces for them. For example, to call cublassaxpy from Fortran, use the following interface block.

interface
   subroutine cublassaxpy(N, alpha, x, incx, y, incy) bind(c, name="cublasSaxpy")
     use iso_c_binding
     integer(kind=c_int), value :: N
     real*4, value :: alpha
     type(*), dimension(*) :: x
     integer(kind=c_int), value :: incx
     type(*), dimension(*) :: y
     integer(kind=c_int), value :: incy
   end subroutine cublassaxpy
end interface

The array arguments passed to cublassaxpy must be device addresses, for example made available via a host_data use_device region as in the example above. A more complete example can be found here:

https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00976.html

Notes on GPU Parallel Code Generation

Inside an OpenACC kernels construct, the auto clause is the default on loop constructs not annotated with an explicit independent or seq clause. (This is in contrast to a parallel construct, where the independent clause is the default.) Such an explicit or implicit auto clause asks the compiler to analyze and, if possible, parallelize the code in the respective loop construct. The same applies to any loops inside OpenACC kernels regions that are not marked with a loop construct at all. GCC's support for such analysis is currently insufficient, especially for nested loops. If it can parallelize a loop, GCC will currently only use gang parallelism, but not worker or vector parallelism. As a result, users may find the performance of such loops to be slower than expected, in comparison to other compilers. It is therefore currently recommended to use the OpenACC parallel construct, or, if using the kernels construct, to annotate all loops with explicit independent (or seq, if not parallelizable) clauses, and not to rely on explicit or implicit auto clauses. See also the following OpenACC kernels Construct Optimization Tutorial.

GCC's optimization passes have not yet been tuned very much for GPU parallel code generation, in comparison to compilers developed specifically for GPU targets. It is therefore possible that GCC generates GPU code that, while correct, does not yet make the best use of the GPU's instruction set or hardware capabilities, such as the GPU's memory hierarchy, for example. We appreciate reports about such anomalies, and will use these to tune the GPU parallel code generation.

Users may find that GCC does not always do as good a job of scheduling parallel loops, especially in multi-level nested loop constructs. Adding explicit gang, worker, vector clauses to the respective loop directives may help to guide the compiler, and can result in significant performance gains. We appreciate reports when manual scheduling of loop nests improves performance, and can use these to improve the automatic scheduling algorithms.
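
As an illustration only (a sketch, not tuned for any particular device; the function and array names are made up), explicit scheduling clauses on a loop nest might look like this:

void
scale_3d (int n, float c[n][n][n], const float a[n][n][n], float alpha)
{
  /* Explicitly map the three loop levels to gang, worker and vector
     parallelism instead of relying on GCC's automatic assignment.  */
  #pragma acc parallel loop gang copyin(a[0:n][0:n][0:n]) copyout(c[0:n][0:n][0:n])
  for (int i = 0; i < n; i++)
    {
      #pragma acc loop worker
      for (int j = 0; j < n; j++)
        {
          #pragma acc loop vector
          for (int k = 0; k < n; k++)
            c[i][j][k] = alpha * a[i][j][k];
        }
    }
}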

OpenACC kernels Construct Optimization Tutorial

This tutorial explains how to optimize computations using OpenACC kernels constructs for offloading parts of a program to an accelerator device like a GPU.

Example Application

In this tutorial we will optimize a simple matrix multiplication. The original function looks like this (N is expected to be defined as a macro):

void
matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  for (int j = 0; j < N; j++)
    {
      for (int i = 0; i < N; i++)
        {
          float sum = 0;
          for (int k = 0; k < N ; k++)
            sum += a[i][k] * b[k][j];
          r[i][j] = sum;
        }
    }
}

This is a normal C function that can be compiled with GCC and executed as usual. For a value of N = 2000, this runs in about 94.260000 seconds on a test machine, executed on the CPU in a single thread.

To offload and optimize this function using OpenACC, the code itself will not change. Only some #pragmas will be needed to tell the compiler what data and computations to move to the accelerator, and how to parallelize the loops.

Introducing the OpenACC kernels Construct

The OpenACC kernels construct indicates a program region to offload to the accelerator. Data clauses on the kernels region list the data to be copied to and from the accelerator device.

void
matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  #pragma acc kernels \
    copy(r[0:N][0:N], a[0:N][0:N], b[0:N][0:N])
  {
    for (int j = 0; j < N; j++)
      {
        for (int i = 0; i < N; i++)
          {
            float sum = 0;
            for (int k = 0; k < N ; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;
          }
      }
  }
}

As written, GCC (built from the OpenACC development branch) will warn that it is not able to parallelize the kernels region automatically:

matmul.c: In function 'matrix_multiply._omp_fn.0':
matmul.c:4:11: warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
   #pragma acc kernels \
           ^

This means that the region will be executed on the host CPU instead of the accelerator device. This choice by the compiler can be overridden using the -foffload-force flag during compilation or by defining the ACC_DEVICE_TYPE environment variable before execution. In this example, forcing offloading would result in a large slowdown: sequential execution on GPUs is much slower than on modern CPUs. The advantage of offloading to the accelerator comes from parallelization, which will be addressed in the next step.

Parallelizing Loops

To fully benefit from GCC's OpenACC support, each for loop in the kernels region (or an OpenACC parallel construct) needs a #pragma acc loop annotation. Parallelism annotations on the pragma tell the compiler how to parallelize the loop:

  • independent means that all iterations of the loop are independent of each other, i.e., there are no dependences between iterations. Independent iterations may be executed in any order and freely parallelized. The compiler trusts the programmer's independent annotations; it is therefore your responsibility to ensure that you only add them when they are indeed correct.

  • seq means that the iterations of the loop are to be executed sequentially, in order.

  • auto (which is also implicit in kernels regions if no other parallelism clause is specified) means that the compiler is supposed to determine automatically whether the loop's iterations are independent. In practice, GCC's auto-parallelizer very often does not understand data dependences in multi-level loop nests, and it will almost always choose sequential execution.

In the matrix multiplication example, the two outer loops are independent. The innermost loop has data dependences on the sum variable, as the value from one iteration is needed in the next one. We will improve this in the next step, but for now the innermost loop can be marked seq:

void
matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  #pragma acc kernels \
    copy(r[0:N][0:N], a[0:N][0:N], b[0:N][0:N])
  {
    #pragma acc loop independent
    for (int j = 0; j < N; j++)
      {
        #pragma acc loop independent
        for (int i = 0; i < N; i++)
          {
            float sum = 0;
            #pragma acc loop seq
            for (int k = 0; k < N ; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;
          }
      }
  }
}

With these simple annotations, the execution time for N = 2000 on the test machine is 2.050000 seconds, a 46-fold speedup over CPU-only execution.

Reductions

The data dependence on the sum variable in the innermost loop is special: The values from different loop iterations are combined using an associative arithmetic operator. The loop computes the value of sum equal to the following expression:

(((0 + a[i][0] * b[0][j]) + a[i][1] * b[1][j]) + a[i][2] * b[2][j]) + ...

Mathematically, since addition is associative, parentheses may be moved around freely, for example:

(0 + a[i][0] * b[0][j] + a[i][1] * b[1][j]) + (a[i][2] * b[2][j] + ...)

This means that the values of a[i][k] * b[k][j] may be computed independently and finally combined using the + operator in any grouping that may be convenient. Such calculations are called reductions, and there is special support for expressing them in OpenACC. The innermost loop can be changed to the following:

            float sum = 0;
            #pragma acc loop independent reduction(+: sum)
            for (int k = 0; k < N ; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;

The compiler and OpenACC runtime take care of organizing the independent execution of the loop iterations and summing up the intermediate results.

Using the independent and reduction clauses like this brings execution time to 1.750000 seconds, a further 15% improvement.

Note that while addition is associative on real numbers, it is not in fact associative on floating-point numbers. Similarly to the optimizations enabled by command-line flags like -ffast-math, using reductions can introduce numerical errors compared to sequential execution. It is the programmer's responsibility to ensure that the reduction is numerically correct for a given application.

Further Optimizations

For certain problems and problem sizes, some additional changes can improve execution time even further:

  • The a and b arrays are only read inside the kernels region, while r is only written. It is not necessary to copy all of these arrays to and from the accelerator. Instead it suffices to copy a and b to the accelerator, while space for r can be created on the accelerator, written to by the kernel, and copied out to the host. This can be achieved by changing the copy clause to two clauses: copyout(r[0:N][0:N]) copyin(a[0:N][0:N], b[0:N][0:N]); see the sketch after this list. (In the matrix multiplication example, data movement overhead is dominated by computation time, and this does not result in a measurable improvement on large matrices.)

  • Some of the loops can be annotated with gang, worker, or vector clauses to fine-tune the allocation of the loops to the different kinds of parallelism. Otherwise, GCC computes an automatic allocation.

  • Adding num_gangs, num_workers, or vector_length clauses can further fine-tune the loop schedule.
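
Putting the pieces together, a version with the split data clauses from the first item above (and with the reduction clause from the previous section) might look like this sketch; num_gangs, num_workers or vector_length clauses could be added similarly:

void
matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  /* Only copy the inputs in and the result out, instead of copying all
     three arrays in both directions.  */
  #pragma acc kernels \
    copyout(r[0:N][0:N]) copyin(a[0:N][0:N], b[0:N][0:N])
  {
    #pragma acc loop independent
    for (int j = 0; j < N; j++)
      {
        #pragma acc loop independent
        for (int i = 0; i < N; i++)
          {
            float sum = 0;
            #pragma acc loop independent reduction(+: sum)
            for (int k = 0; k < N ; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;
          }
      }
  }
}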

Automatic Loop Partitioning

GCC uses the following strategy to assign the as yet unused levels of parallelism (partitioning) to independent acc loops. If there is only one acc loop, e.g.

#pragma acc parallel loop
for (...)

GCC assigns it gang and vector level parallelism. I.e.

#pragma acc parallel loop gang vector
for (...)

If there are two loops, L1 and L2, where L2 is nested inside L1, e.g.

#pragma acc parallel loop
for (i = ...)
{
  #pragma acc loop
  for (j = ...)

then the inner loop is assigned vector partitioning, and the outer loop is assigned gang and worker partitioning. I.e.

#pragma acc parallel loop gang worker
for (i = ...)
{
  #pragma acc loop vector
  for (j = ...)

Lastly, if there are three or more independent loops, the innermost loop is assigned vector partitioning, the loop immediately inside the outermost loop worker partitioning, and the outermost loop gang partitioning; any remaining loops in between are executed sequentially. E.g.

#pragma acc parallel loop
for (i = ...)
{
  #pragma acc loop
  for (j = ...)
    #pragma acc loop
    for (k = ...)
      #pragma acc loop
      for (l = ...)

becomes

#pragma acc parallel loop gang
for (i = ...)
{
  #pragma acc loop worker
  for (j = ...)
    #pragma acc loop seq
    for (k = ...)
      #pragma acc loop vector
      for (l = ...)

The -fopt-info-optimized-omp flag may be used to show how GCC automatically partitions acc loops.

Implementation Status

trunk/GCC-10: GCC's general master development branch

GCC 10 is built from GCC's general master development branch; release branches are branched off of it annually. See https://gcc.gnu.org/develop.html for details.

Most development is done directly on the GCC master development branch (also called trunk). However, some development is first done on a branch. In the case of OpenACC, the OG9 branch (devel/omp/gcc-9) contains some features which are not yet on the master development branch (and are unlikely to be in the GCC 10 release branch).

Compared to GCC 9, the upcoming GCC 10 release series contains the following OpenACC changes:

  • OpenACC 2.6 support
  • Re-implementation of OpenACC asynchronous queues
  • Support for AMD Radeon (GCN) GPUs (Fiji, VEGA)
  • Most changes from the OG9 branch
  • Several smaller fixes, some of which are only in GCC 10 and not in the OG9 branch.

For Nvidia PTX (nvptx) offloading:

  • nvptx support for OpenACC asynchronous queues

Main changes which are only on the OG9 branch:

  • Improvements to the parallelization of kernels

  • Multi-dimensional dynamic array support for OpenACC data clauses (non-contiguous arrays)
  • Some of the reduction enhancements

Known issues: See OG9 Known Issues.

OG9 branch: current OpenACC development branch

This branch is available as the devel/omp/gcc-9 branch in the GCC git repository. (In the former GCC git mirror, now gcc-old.git, it was known as openacc-gcc-9-branch.)

This Git-only branch is used for collaborative development. The branch is based on gcc-9-branch. Find it at git://gcc.gnu.org/git/gcc.git, https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/openacc-gcc-9-branch, or https://github.com/gcc-mirror/gcc/tree/openacc-gcc-9-branch. Please send email with a short-hand [og9] tag in the subject line, and use ChangeLog.openacc files.

Use this branch if you want to make use of the latest OpenACC functionality and performance optimizations that are not yet part of the official GCC releases.

The implementation status on the devel/omp/gcc-9 (alias OG9) branch is based on the GCC 9 release series (see below), with additional support for the OpenACC 2.6 specification and the following exceptions:

  • GCC does not support the init, shutdown and set directives. However it does support their corresponding OpenACC runtime API routines.

  • GCC does not support the acc_memcpy_device runtime API routine.

  • private and firstprivate variables cannot contain subarray arguments.

  • gfortran only supports individual derived type members in acc update directives.

  • g++ does not support struct or class members inside data clauses.
  • GCC does not support the device_type clause and cache directive. While the C, C++ and Fortran compilers may build programs with those clauses, no special code is generated for them.

  • GCC has preliminary support for the OpenACC Profiling Interface. For more details see the libgomp documentation.
  • The nohost clause is supported, but support for the bind clause is incomplete: it works only in C, and only for non-inlined functions; see http://mid.mail-archive.com/87twns3ebs.fsf@hertz.schwinge.homeip.net.

Compared to the GCC 9 release series (see below), openacc-gcc-9-branch contains the following enhancements:

  • The compiler assigns shared memory storage for local variables declared inside parallel and kernels regions that are not associated with acc loops, or acc loops with an explicit gang clause. Note that variables which are used in acc loops that have been automatically assigned gang partitioning by the compiler continue to utilize thread-private storage, not shared memory.

  • A spurious error on combining the OpenACC auto clause with gang/worker/vector clauses on loop constructs has been removed.

  • A new -fopt-info-optimized-omp flag has been added to report how the compiler automatically assigns gang, worker and vector level parallelism to independent acc loops.

  • The option -fopenacc-kernels can be used to configure the handling of kernels constructs. The default behavior, -fopenacc-kernels=split, causes OpenACC kernels constructs to be split into a sequence of compute constructs, each of which is then handled individually. To disable the default behavior, use -fopenacc-kernels=parloops, which causes whole OpenACC kernels constructs to be handled by the parloops pass.

    • Inside OpenACC kernels constructs, loops annotated with OpenACC loop directives with independent clauses will be parallelized as if inside a parallel construct.

Known Issues

The following are known issues in openacc-gcc-9-branch:

  • Certain profiling functionality is not implemented. See the libgomp manual for the implementation status of the OpenACC Profiling Interface.
  • Support for dynamic arrays inside offloaded regions has been added with the following limitations:
    • The pointer-to-arrays case is not supported yet, e.g. int (*a)[100].
    • Host fallback doesn't work yet, i.e. ACC_DEVICE_TYPE=host will segfault.

  • Nested parallelism is not supported. I.e. parallel constructs cannot nest inside other parallel constructs.
  • Variables may only be used in one data clause per construct. E.g., this generates an error: #pragma acc parallel copyin (foo) copyout (foo)

  • A Fortran program built with -ffpe-trap=invalid,zero,overflow may trigger floating point exceptions in the CUDA driver runtime.

  • Building with -ffast-math may occasionally cause linker errors. A workaround is to link against libm using -foffload=-lm.

  • GCC automatically partitions independent acc loops across gang, worker and vector partitioning, when available. At times this optimization can be too aggressive. If the program fails with "libgomp: cuLaunchKernel error: too many resources requested for launch", try adjusting num_workers on the offending loop. You may also compile the program with -fopenacc-dim=[num_gangs]:[num_workers], where num_gangs and num_workers specify the default number of gangs and workers, respectively.

  • All acc loop private clauses allocate storage for variables in local (i.e. thread-private) storage. They will utilize shared memory storage in a future release.

  • Local variables in acc routines are currently not allocated in CUDA shared memory, regardless of whether the routine uses gang or worker level parallelism.

  • GCC defaults to setting num_workers=32 when it detects a worker loop without an associated num_workers clause. Sometimes this can result in a runtime error. To correct this problem, rebuild your program with the -fopenacc-dim flag suggested by the compiler, or add a num_workers clause to the offending OpenACC construct.

  • In OpenACC kernels constructs, the following applies to loops not annotated with OpenACC loop directives, or annotated with OpenACC loop directives but with an explicit or implicit auto clause:

    • The loop directive is supported, but most loop directive clauses are ignored.

    • No directives other than the loop directive are supported inside a kernels region.

    • reduction clauses are ignored, but loops with reductions might be parallelized.

    • If the loop can be parallelized, it will use multiple gangs, but just one worker, and a vector_length of 1.

    • Nested loops are supported, but only the outermost loop of such a loop nest can be parallelized. This means that inner loops are executed sequentially on each gang.
    • No true vectorization. A dependent but vectorizable loop could be vectorized (mapped on the vector dimension), but that is currently not supported.

(Reviewed up to openacc-gcc-9-branch commit 891935361130d470567210e4a2dfefde2f634030).

gcc-9-branch: GCC 9 Release Series (GCC 9.1 released on 2019-05-03)

Compared to GCC 8, the GCC 9 release series contains the following OpenACC changes:

  • General bug fixing, performance improvements, compiler diagnostics improvements.
  • Support C++ reference data types.
  • Support C++ this pointer usage in OpenACC directives.

  • The independent and seq clauses are no longer refused as conflicting in Fortran.

  • The routine directive now correctly handles clauses specifying the level of parallelism in Fortran.

  • The private clause no longer is rejected for Fortran predetermined private loop iteration variables.

  • OpenACC async bug fixes: support multiple OpenACC wait clauses, handle (as a no-op) unseen async-arguments in acc_async_test, acc_wait, acc_wait_async, handle (as a no-op) cases of the same async-argument being used in wait and async clauses, fix handling of OpenACC wait directive without wait argument but with async clause, support acc_async_sync, acc_async_noval in acc_get_cuda_stream/acc_set_cuda_stream, handle wait clauses without async-argument.

  • Add user-friendly diagnostics for OpenACC loop parallelism assigned, enabled with -fopt-info-optimized-omp.

  • Adjust to OpenACC 2.5 data clause semantics, including support for the OpenACC finalize and if_present clauses.

  • OpenACC 2.5 acc_*_async versions of the relevant OpenACC runtime library routines.

For Nvidia PTX (nvptx) offloading:

  • General bug fixing, performance improvements.
  • Compatibility with new CUDA versions and new PTX versions/GPU hardware, workarounds against PTX JIT bugs.
  • Determine default num_workers at runtime, and other improvements in runtime launch geometry computation.

  • Support vector_length bigger than 32.

  • Improve support for multiple GPUs.
  • Ignore C++ exceptions in OpenACC regions.
  • New flag -misa=sm_35 (use as: -foffload=nvptx-none=-misa=sm_35, or similar) to enable additional PTX atomic instructions added in target architecture sm_35.

(Reviewed up to trunk r270579: gcc-9-branch branch point.)

gcc-8-branch: GCC 8 Release Series (GCC 8.1 released on 2018-05-02)

Compared to GCC 7, the GCC 8 release series contains the following OpenACC changes:

  • Assorted bug fixing.
  • Standard conformance: C/C++ acc_pcopyin, acc_pcreate, Fortran openacc_lib.h: acc_pcopyin, acc_pcreate.

  • OpenACC 1.0 compatibility: acc_async_wait, acc_async_wait_all.

  • Fortran: Don't restrict wait directive arguments to constant integers.

  • OpenACC 2.5 default (present) clause.

  • Nvidia PTX (nvptx) back end: general bug fixing, support for CUDA 9, workarounds against PTX JIT bugs, initial SLP vectorization support using PTX ISA vector modes/instructions.

gcc-7-branch: GCC 7 Release Series (GCC 7.1 released on 2017-05-02)

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 and PowerPC 64-bit little-endian GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for an arbitrary number of gangs, up to 32 workers, and 32 vectors.

Initial support for parallelized execution of OpenACC kernels constructs:

  • Parallelization of a kernels region is switched on by '-fopenacc' combined with '-O2' or higher.
  • Code is offloaded onto multiple gangs, but executes with just one worker, and a vector length of 1.
  • Directives inside a kernels region are not supported.
  • Loops with reductions can be parallelized.
  • Only kernels regions with one loop nest are parallelized.
  • Only the outer-most loop of a loop nest can be parallelized.
  • Loop nests containing sibling loops are not parallelized.

Compared to GCC 6, the GCC 7 release series contains the following OpenACC changes:

  • Assorted bug fixing.
  • Standard conformance: pointer variables inside offloaded regions in C and C++, http://mid.mail-archive.com/573244BE.5010708@codesourcery.com.

  • Standard conformance: reduction variables imply data mapping clauses on outer compute constructs, https://gcc.gnu.org/PR70895.

  • Bug fixes related to memory mappings and reductions when used with compute constructs with an async clause.
  • Improvements in assigning gang, worker, and vector parallelism to un-annotated loop constructs.
  • The default number of gangs launched is now determined at run time, and no longer hard-coded to 32.
  • The information provided by tile clauses is now used for loop scheduling.
  • The host_data construct and use_device clause are now supported in Fortran, too.
  • Improved syntax checking for the routine construct in C and C++.
  • The cache directive no longer rejects valid syntax in C and C++.
  • The vector and vector_length clauses are now parsed correctly in Fortran.

The device_type clause is not supported. The bind and nohost clauses are not supported.

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.

Issue Tracking

Open OpenACC bugs

Known issues with offloading.
