OpenACC

This page contains information on GCC's implementation of the OpenACC specification and related functionality. OpenACC is intended for programming accelerator devices such as GPUs, including code offloading to these devices. Currently, devices using Nvidia PTX (nvptx) are supported.

OpenACC is an experimental feature of GCC 5.1 and may not meet the needs of general application development. Compared to GCC 5, the GCC 6 and 7 release series include a much improved implementation of the OpenACC 2.0a specification. The gomp-4_0-branch supports most of the OpenACC 2.5 specification.

For discussing this project, please use the standard GCC resources (mailing lists, Bugzilla, and so on). It's helpful to put a [OpenACC] tag into your email's Subject line, and set the openacc keyword in any Bugzilla issues filed.

For instructions on how to build offloading-enabled GCC from source, see Offloading: How to try offloading enabled GCC.

Implementation Status

Listed first is the most current work in progress, followed by the GCC release series from most recent to oldest.

Work in Progress (gomp-4_0-branch)

Current development continues on gomp-4_0-branch. Please add a [gomp4] tag to any patches posted for inclusion in that branch.

Work is ongoing to merge gomp-4_0-branch code into trunk, for the next GCC release series.

The implementation status on gomp-4_0-branch is based on the GCC 6 release series (see below), with additional support for the OpenACC 2.5 specification and the following exceptions:

  • GCC does not support the init, shutdown and set directives, nor the acc_memcpy_device runtime API routine.

  • private and firstprivate variables cannot contain subarray arguments.

  • gfortran only supports individual derived type members in acc update directives.

  • g++ does not support struct or class members inside data clauses.
  • GCC does not support the device_type and cache clauses. While the C, C++ and Fortran compilers accept programs containing those clauses, no special code is generated for them.

  • GCC now has preliminary support for OpenACC profiling. For more details, see the libgomp documentation.

  • The nohost clause is supported, but support for the bind clause is incomplete: it works only in C, and only for non-inlined functions; see http://mid.mail-archive.com/87twns3ebs.fsf@hertz.schwinge.homeip.net.

gomp-4_0-branch contains the following enhancements:

  • The runtime has a new gang allocation strategy to more fully utilize the accelerator's capabilities.
  • The runtime will now inform the user when an Nvidia accelerator has insufficient registers to execute a kernels or parallel construct.

  • The compiler will now assign shared memory storage for local variables declared inside parallel and kernels regions that are not associated with acc loops, or acc loops with an explicit gang clause. Note that variables which are used in acc loops that have been automatically assigned gang partitioning by the compiler will utilize thread-private storage, not shared memory.

  • A new -fopt-info-note-omp flag has been added to report how the compiler automatically assigns gang, worker and vector level parallelism to independent acc loops.

Known Issues

The following are known issues in gomp-4_0-branch.

  • Certain profiling functions are not implemented. See the libgomp manual for the implementation status of the Profiling Interface.
  • Support for dynamic arrays inside offloaded regions has been added with the following limitations:
    • The pointer-to-arrays case is not supported yet, e.g. int (*a)[100].
    • Host fallback doesn't work yet, i.e. ACC_DEVICE_TYPE=host will segfault.

  • Nested parallelism is not supported, i.e. parallel constructs cannot be nested inside other parallel constructs.
  • Variables may only be used in one data clause per construct. E.g. the following will generate an error: #pragma acc parallel copyin (foo) copyout (foo)

  • Fortran programs built with -ffpe-trap=invalid,zero,overflow may trigger floating-point exceptions in the CUDA driver runtime.

  • Building with -ffast-math may occasionally cause linker errors. A workaround is to link against libm using -foffload=-lm.

  • GCC now automatically partitions independent acc loops across gang, worker and vector levels, when available. At times this optimization can be too aggressive. If the program fails with "libgomp: cuLaunchKernel error: too many resources requested for launch", try adjusting num_workers on the offending loop. You may also compile the program with -fopenacc-dim=[num_gangs]:[num_workers], where num_gangs and num_workers specify the default number of gangs and workers, respectively.

  • All acc loop private clauses place variables in local (i.e. thread-private) storage. They will utilize shared memory storage in a future release.

  • Local variables in acc routines are currently not allocated in CUDA shared memory, regardless of whether the routine uses gang or worker level parallelism.

  • GCC defaults to setting num_workers=32 when it detects a worker loop without an associated num_workers clause. Sometimes this can result in a runtime error. To correct this problem, rebuild your program with the -fopenacc-dim flag suggested by the compiler, or add a num_workers clause to the offending OpenACC construct.

  • OpenACC Kernels
    • The loop directive is supported, but most loop directive clauses are ignored.
    • No directives other than the loop directive are supported inside a kernels region.
    • Reduction clauses are ignored, but loops with reductions can be parallelized.
    • Code will be offloaded onto multiple gangs, but executes with just one worker, and a vector_length of 1.
    • A single, bounded loop in a kernels region can be parallelized onto gangs by using: -fopenacc -O2 -ftree-parallelize-loops=<number of gangs> (see the sketch after this list).
    • Nested loops are supported, but only the outermost loop in the loop nest can be parallelized; the inner loops are executed sequentially on each gang.
    • Only one loop nest per kernels region is handled. It is possible to have two or more subsequent loop nests, as well as sequential code in between loop nests, in a kernels region, but parallelizing such a kernels region is not yet supported.
    • No true vectorization: a dependent but vectorizable loop could be mapped onto the vector dimension, but that is currently not supported.
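
For illustration, here is a minimal sketch (function and variable names are hypothetical) of the kind of single, bounded loop in a kernels region that the parallelizer can map onto gangs:

/* Build with, e.g.: -fopenacc -O2 -ftree-parallelize-loops=32
   so the single, bounded loop below can be parallelized onto gangs.  */
void
scale (int n, const float *a, float *b)
{
#pragma acc kernels copyin(a[0:n]) copyout(b[0:n])
  {
    for (int i = 0; i < n; i++)
      b[i] = 2.0f * a[i];
  }
}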

GCC 7 Release Series

Compared to GCC 6, the GCC 7 release series contains the following OpenACC changes:

  • Assorted bug fixing.
  • Standard conformance: pointer variables inside offloaded regions in C and C++, http://mid.mail-archive.com/573244BE.5010708@codesourcery.com.

  • Standard conformance: reduction variables imply data mapping clauses on outer compute constructs, https://gcc.gnu.org/PR70895.

  • Bug fixes related to memory mappings and reductions when used with compute constructs with an async clause.
  • Improvements in assigning gang, worker, and vector parallelism to un-annotated loop constructs.
  • The default number of gangs launched is now determined at run time, and no longer hard-coded to 32.
  • The information provided by tile clauses is now used for loop scheduling.
  • The host_data construct and use_device clause are now supported in Fortran, too.
  • Improved syntax checking for the routine construct in C and C++.
  • The cache directive no longer rejects valid syntax in C and C++.
  • The vector and vector_length clauses are now parsed correctly in Fortran.

GCC 6 Release Series (GCC 6.1 released on 2016-04-27)

Compared to GCC 5, the GCC 6 release series includes a much improved implementation of the OpenACC 2.0a specification.

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 and PowerPC 64-bit little-endian GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for an arbitrary number of gangs, up to 32 workers, and 32 vectors.

Initial support for parallelized execution of OpenACC kernels constructs:

  • Parallelization of a kernels region is switched on by '-fopenacc' combined with '-O2' or higher.
  • Code is offloaded onto multiple gangs, but executes with just one worker, and a vector length of 1.
  • Directives inside a kernels region are not supported.
  • Loops with reductions can be parallelized.
  • Only kernels regions with one loop nest are parallelized.
  • Only the outer-most loop of a loop nest can be parallelized.
  • Loop nests containing sibling loops are not parallelized.

The device_type clause is not supported. The bind and nohost clauses are not supported. The host_data directive is not supported in Fortran, <https://gcc.gnu.org/PR70598>.

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.
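
A minimal sketch of this behavior (the function name is hypothetical):

#include <openacc.h>

/* With a compile-time constant argument, GCC evaluates the call below
   at compile time for C and C++, but not for Fortran.  */
int
on_nvidia_device (void)
{
  return acc_on_device (acc_device_nvidia);
}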

GCC 5 Release Series (GCC 5.1 released on 2015-04-22)

The GCC 5 release series includes a preliminary implementation of the OpenACC 2.0a specification. No further OpenACC development work is planned for this release series.

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for one gang, one worker, and a number of vectors. These vectors all execute in "vector-redundant" mode. This means that inside a parallel construct, offloaded code outside of any loop construct is executed by all vectors, not just a single vector. The reduction clause is not supported with the parallel construct.

The kernels construct is supported only in a simplistic way: the code is offloaded, but executes with just one gang, one worker, one vector. No directives are supported inside kernels constructs. Reductions are not supported inside kernels constructs.

The atomic, cache, declare, host_data, and routine directives are not supported. The default(none), device_type, firstprivate, and private clauses are not supported. A parallel construct's implicit data attributes for scalar data types will be treated as present_or_copy instead of firstprivate. Only the collapse clause is supported for loop constructs, and there is incomplete support for the reduction clause.

Combined directives (kernels loop, parallel loop) are not supported; use kernels alone, or parallel followed by loop, instead.
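
For example, a minimal sketch (names and data clauses are illustrative) of rewriting the unsupported combined form:

/* Instead of the unsupported combined directive "#pragma acc parallel loop",
   use a parallel construct containing a separate loop construct.  */
void
vec_add (int n, const float *b, const float *c, float *a)
{
#pragma acc parallel copyin(b[0:n], c[0:n]) copyout(a[0:n])
  {
#pragma acc loop
    for (int i = 0; i < n; i++)
      a[i] = b[i] + c[i];
  }
}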

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.

Issue Tracking

/!\ Incomplete.

Open OpenACC bugs

Known issues with offloading.

Notes on exploiting parallelism

The "kernels" directive allows the optimization facilities of the underlying compiler to detect parallel loops. Because gcc supports a large number of varying targets, its optimization passes are not as well tuned for GPU parallelism as are compilers that were developed specifically for the GPU. As a result, users may find the performance of kernel regions to be slower than expectations in comparison to other compilers. A later section provides more information. To obtain maximum performance, users are encouraged to convert kernel regions to parallel regions.

Again, because GCC supports a large number of varying targets, its code generation is not yet as highly tuned for the GPU as that of compilers developed specifically with that GPU as the target. Because of that, users may find that GCC does not do as good a job of scheduling "parallel" loops (loops that are declared as parallel, but that do not contain directives such as "gang", "worker", or "vector" indicating the specific form of parallelism to use). Adding "gang", "worker", and "vector" directives in those cases to guide the compiler can result in significant performance gains.

GCC's strategy for utilizing parallelism focuses on selecting vector loops first, followed by worker and gang loops. This differs from most other compilers, which focus on finding gang loops first, then worker and vector. Each strategy will perform better on some codes and worse on others. When the parallelism found is less effective than expected, guiding the strategy with explicit "gang", "worker", and "vector" directives often improves performance.
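
For instance, a minimal sketch (names and data clauses are illustrative) of explicitly requesting the desired mapping instead of relying on the compiler's automatic assignment:

void
mat_add (int n, int m, const float *b, const float *c, float *a)
{
  /* Explicit gang/worker partitioning on the outer loop and vector
     partitioning on the inner loop.  */
#pragma acc parallel loop gang worker copyin(b[0:n*m], c[0:n*m]) copyout(a[0:n*m])
  for (int i = 0; i < n; i++)
#pragma acc loop vector
    for (int j = 0; j < m; j++)
      a[i * m + j] = b[i * m + j] + c[i * m + j];
}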

Quick Reference Guide

This section presents an overview of how to use OpenACC with GCC. All of the examples assume that the compiler binaries are prefixed x86_64-none-linux-gnu-.

For general GCC/libgomp and OpenACC documentation, please refer to the GCC and libgomp manuals.

Typical Compiler Usage

In most cases, compilation of an OpenACC program with optimization turned on is:

 $ x86_64-none-linux-gnu-gcc -fopenacc -O2 -c foo.c
 $ x86_64-none-linux-gnu-gcc -fopenacc foo.o -o foo

If the program requires math library functions inside offloaded sections, link with:

 $ x86_64-none-linux-gnu-gcc -fopenacc -foffload=-lm foo.o -o foo

GCC's optimization options also include -O3 and -Ofast, which is -O3 with additional relaxation of standards compliance. For more details, please refer to the GCC optimization options documentation.

Running OpenACC Programs Built With the Toolchain

Programs built with the OpenACC toolchain require the included libgomp runtime to run. The included libgomp is placed in <toolchain-path>/<host>/lib64, where <host> is either x86_64-none-linux-gnu or powerpc64le-none-linux-gnu. To allow the built program to find these libraries, users can either:

  1. Use RPATH to put the libgomp path into the ELF executable during linking, e.g. add -Wl,-rpath,<toolchain-path>/x86_64-none-linux-gnu/lib64 to the above link command.

  2. Set the LD_LIBRARY_PATH variable to include the libgomp path when running the built executable.
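
For example, for an x86_64 host (the path is a placeholder):

 $ x86_64-none-linux-gnu-gcc -fopenacc foo.o -o foo -Wl,-rpath,<toolchain-path>/x86_64-none-linux-gnu/lib64

or, at run time:

 $ export LD_LIBRARY_PATH=<toolchain-path>/x86_64-none-linux-gnu/lib64
 $ ./foo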

-fopenacc

Enables OpenACC in the compiler, activating directives in the code, and provides the _OPENACC preprocessor symbol. Also see the description in the GCC manual.

-fopenacc-dim=<gang:worker:vector>

This option specifies the default OpenACC compute dimensions for parallel constructs (when there are no explicit user-defined clauses on the directive). Each of the gang, worker and vector arguments must be a positive integer or a minus sign. The minus sign allows the runtime to determine a suitable default based on the accelerator's capabilities. The default for NVPTX accelerator targets is -:32:32. Also see the description in the GCC manual.
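
For example (the dimension values here are illustrative):

 $ x86_64-none-linux-gnu-gcc -fopenacc -fopenacc-dim=-:16:32 -O2 -c foo.c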

-foffload

Specifies offloading targets and options for them. The most common usage of this option is probably to link in the math library on the accelerator target, i.e. -foffload=-lm. For general documentation of this option, please refer to the GCC Offloading Wiki.

-fopt-info-note-omp

This flag instructs the compiler to report any parallelism that it detects.
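
For example:

 $ x86_64-none-linux-gnu-gcc -fopenacc -O2 -fopt-info-note-omp -c foo.c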

Runtime Environment Variables for Built OpenACC Programs

These variables, read and interpreted by the libgomp OpenACC runtime included with the toolchain, can be set when executing built programs:

ACC_DEVICE_TYPE

Two possible values can be set: nvidia and host. The former is the default and offloads to NVPTX accelerator devices; the latter means single-threaded host-fallback execution, in a shared-memory mode.
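
For example, to force single-threaded host-fallback execution:

 $ ACC_DEVICE_TYPE=host ./foo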

ACC_DEVICE_NUM

The value can be set to a non-negative integer, designating which accelerator device to use (if more than one is installed). A runtime error is triggered if the specified number is out of range.

GOMP_OPENACC_DIM

This variable has the same functionality as the -fopenacc-dim= compiler option, except that it is handled by the libgomp runtime instead of the compiler.

GOMP_DEBUG

A setting of GOMP_DEBUG=1 can be added to the environment to enable debugging output during execution. Currently it logs data management and GPU kernel launches; if an NVPTX device is active, it also includes a dump of the offloaded PTX code.
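
For example:

 $ GOMP_DEBUG=1 ./foo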

GFortran Interoperability with CUDA libraries

CUDA libraries may be called from Fortran through the use of function interfaces. For example, to call cublassaxpy from Fortran, use the following interface block.

interface
   subroutine cublassaxpy(N, alpha, x, incx, y, incy) bind(c, name="cublasSaxpy")
     use iso_c_binding
     integer(kind=c_int), value :: N
     real*4, value :: alpha
     type(*), dimension(*) :: x
     integer(kind=c_int), value :: incx
     type(*), dimension(*) :: y
     integer(kind=c_int), value :: incy
   end subroutine cublassaxpy
end interface

The arguments passed to cublassaxpy must be declared as host_data. A more complete example can be found here:

https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00976.html
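
As a minimal sketch (variable names are illustrative, and the interface block above is assumed to be visible), the call site might look like:

! x and y are host arrays; the data region creates device copies, and
! host_data use_device passes the device pointers to cublasSaxpy.
!$acc data copyin(x) copy(y)
!$acc host_data use_device(x, y)
call cublassaxpy(N, alpha, x, 1, y, 1)
!$acc end host_data
!$acc end data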

Automatic Loop Partitioning

GCC uses the following strategy to assign the available levels of partitioning to independent acc loops that lack explicit partitioning clauses. If there is only one acc loop, e.g.

#pragma acc parallel loop
for (...)

GCC assigns it gang and vector level parallelism. I.e.

#pragma acc parallel loop gang vector
for (...)

If there are two loops, L1 and L2, with L2 nested inside L1, e.g.

#pragma acc parallel loop
for (i = ...)
{
  #pragma acc loop
  for (j = ...)

then the inner loop is assigned vector partitioning, and the outer loop is assigned gang and worker partitioning. I.e.

#pragma acc parallel loop gang worker
for (i = ...)
{
  #pragma acc loop vector
  for (j = ...)

Lastly, if there are three or more nested independent loops, the innermost loop is assigned vector partitioning, the second-outermost loop worker partitioning, and the outermost loop gang partitioning; any loops in between are executed sequentially (seq). E.g.

#pragma acc parallel loop
for (i = ...)
{
  #pragma acc loop
  for (j = ...)
    #pragma acc loop
    for (k = ...)
      #pragma acc loop
      for (l = ...)

becomes

#pragma acc parallel loop gang
for (i = ...)
{
  #pragma acc loop worker
  for (j = ...)
    #pragma acc loop seq
    for (k = ...)
      #pragma acc loop vector
      for (l = ...)

The -fopt-info-note-omp flag may be used to show how GCC automatically partitions acc loops.
