Differences between revisions 20 and 21
Revision 20 as of 2016-01-20 16:39:18
Size: 7314
Editor: TomDeVries
Comment:
Revision 21 as of 2016-04-22 15:46:02
Size: 6838
Editor: tschwinge
Comment: Update

OpenACC

This page contains information on GCC's implementation of the OpenACC specification and related functionality. OpenACC is intended for programming accelerator devices such as GPUs, including code offloading to these devices.

OpenACC is an experimental feature of GCC 5.1 and may not meet the needs of general application development. Compared to GCC 5, the GCC 6 release series includes a much improved implementation of the OpenACC 2.0a specification.

To discuss this project, please use the standard GCC resources (mailing lists, Bugzilla, and so on). It helps to put an [OpenACC] tag in your email's Subject line, and to set the openacc keyword on any Bugzilla issues you file.

Implementation Status

The most current work in progress is listed first, followed by the GCC release series from newest to oldest.

Work in Progress (gomp-4_0-branch)

Current development continues on gomp-4_0-branch. Please add a [gomp4] tag to any patches posted for inclusion in that branch.

Work is ongoing to merge gomp-4_0-branch code into trunk, for the next GCC release series.

The implementation status on gomp-4_0-branch is basically the same as with the GCC 6 release series (see below), with the following changes:

  • Assorted bug fixing.
  • Incomplete support for the device_type clause.
  • The nohost clause is supported, but support for the bind clause is incomplete: it works only in C, and only for non-inlined functions, <http://news.gmane.org/find-root.php?message_id=%3C87twns3ebs.fsf%40hertz.schwinge.homeip.net%3E>.

  • OpenACC kernels:
    • The loop directive is supported, but most loop directive clauses are ignored.
    • No directives other than the loop directive are supported inside a kernels region.
    • Reduction clauses are ignored, but loops with reductions can be parallelized.

GCC 6 Release Series (not yet released)

Compared to GCC 5, the GCC 6 release series includes a much improved implementation of the OpenACC 2.0a specification.

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 and PowerPC 64-bit little-endian GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for an arbitrary number of gangs, up to 32 workers, and 32 vectors.
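
As an illustration of the execution model described above, here is a minimal sketch of a parallel construct that requests gangs, workers, and vectors explicitly. The function name scale and the gang count of 64 are assumptions made for this example; only the limits of 32 workers and 32 vectors come from the text. Without -fopenacc the pragmas are ignored and the loop runs sequentially on the host.

```c
#include <stddef.h>

/* Scale n elements in place: x[i] = a * x[i].
   Hypothetical example; the name and the gang count are illustrative. */
void scale(float *x, int n, float a)
{
    /* num_gangs may be arbitrary; num_workers and vector_length are kept
       within the limits of 32 workers and 32 vectors quoted above. */
#pragma acc parallel num_gangs(64) num_workers(32) vector_length(32) copy(x[0:n])
    {
#pragma acc loop gang worker vector
        for (int i = 0; i < n; i++)
            x[i] = a * x[i];
    }
}
```

Calling scale(v, 4, 2.0f) on an array of four floats doubles each element, whether the region is offloaded or runs in host-fallback mode.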

Initial support for parallelized execution of OpenACC kernels constructs:

  • Parallelization of a kernels region is switched on by '-fopenacc' combined with '-O2' or higher.
  • Code is offloaded onto multiple gangs, but executes with just one worker, and a vector length of 1.
  • Directives inside a kernels region are not supported.
  • Loops with reductions can be parallelized.
  • Only kernels regions with one loop nest are parallelized.
  • Only the outer-most loop of a loop nest can be parallelized.
  • Loop nests containing sibling loops are not parallelized.
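
The restrictions above suggest the shape a kernels region needs in order to be a candidate for parallelization: a single loop nest, no sibling loops, and no directives inside the region. The following is a sketch under those assumptions; the function name mat_add and the use of restrict to rule out aliasing are choices made for this example, and whether GCC actually parallelizes the outer loop still depends on its own analysis.

```c
/* Element-wise sum of two n-by-m row-major matrices: c = a + b.
   Hypothetical example. Build with the flags named above, for instance:
       gcc -fopenacc -O2 mat_add.c                                      */
void mat_add(const float *restrict a, const float *restrict b,
             float *restrict c, int n, int m)
{
#pragma acc kernels copyin(a[0:n*m], b[0:n*m]) copyout(c[0:n*m])
    {
        /* Exactly one loop nest; only the outer loop over i is a
           candidate for parallelization across gangs. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                c[i * m + j] = a[i * m + j] + b[i * m + j];
    }
}
```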

The device_type clause is not supported. The bind and nohost clauses are not supported. The host_data directive is not supported in Fortran, <https://gcc.gnu.org/PR70598>.

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.
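
For reference, a call to acc_on_device with a compile-time constant argument might look like the sketch below. The wrapper name running_on_host is hypothetical. The _OPENACC guard is an addition for this example so that the file also builds without -fopenacc, in which case the code can only be running on the host.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Returns 1 when the calling code executes on the host, 0 otherwise.
   Hypothetical helper. */
int running_on_host(void)
{
#ifdef _OPENACC
    /* acc_device_host is a compile-time constant argument, so in C and
       C++ this call may be folded to a constant at compile time. */
    return acc_on_device(acc_device_host) != 0;
#else
    /* Built without OpenACC support: there is no device to run on. */
    return 1;
#endif
}
```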

GCC 5 Release Series (GCC 5.1 released on 2015-04-22)

The GCC 5 release series includes a preliminary implementation of the OpenACC 2.0a specification. No further OpenACC development work is planned for this release series.

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for one gang, one worker, and a number of vectors. These vectors all execute in "vector-redundant" mode. This means that inside a parallel construct, offloaded code outside of any loop construct is executed by all vectors, not just a single vector. The reduction clause is not supported with the parallel construct.

The kernels construct is supported only in a simplistic way: the code is offloaded, but executes with just one gang, one worker, one vector. No directives are supported inside kernels constructs. Reductions are not supported inside kernels constructs.

The atomic, cache, declare, host_data, and routine directives are not supported. The default(none), device_type, firstprivate, and private clauses are not supported. A parallel construct's implicit data attributes for scalar data types are treated as present_or_copy instead of firstprivate. Only the collapse clause is supported for loop constructs, and there is incomplete support for the reduction clause.

Combined directives (kernels loop, parallel loop) are not supported; use kernels alone, or parallel followed by loop, instead.
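
As a sketch of the workaround just described, the combined parallel loop directive can be split into a parallel construct that contains a separate loop construct. The function name saxpy and its signature are assumptions made for this example.

```c
/* y[i] = y[i] + a * x[i] for n elements. Hypothetical example. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* The combined form "#pragma acc parallel loop" is not supported in
       GCC 5, so the two directives are written separately. */
#pragma acc parallel copyin(x[0:n]) copy(y[0:n])
    {
#pragma acc loop
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }
}
```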

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.

Issue Tracking

/!\ Incomplete.

Open OpenACC bugs

Known issues with offloading.

Documentation

/!\ Incomplete.

ACC_DEVICE_TYPE

ACC_DEVICE_TYPE accepts two values: nvidia and host. The latter, host, selects single-threaded host-fallback execution, in a shared-memory mode.
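
A program that wants to log which setting is in effect could inspect the variable itself. The sketch below is an assumption-laden illustration and not part of libgomp: the helper name describe_device_type and its return strings are hypothetical, and the exact-match, case-sensitive comparison is a simplification that may not mirror how the runtime parses the value.

```c
#include <stdlib.h>
#include <string.h>

/* Map a value of ACC_DEVICE_TYPE to a short description.
   Hypothetical helper; pass getenv("ACC_DEVICE_TYPE") as the argument. */
const char *describe_device_type(const char *value)
{
    if (value == NULL)
        return "unset";
    if (strcmp(value, "nvidia") == 0)
        return "nvptx offloading";
    if (strcmp(value, "host") == 0)
        return "single-threaded host fallback";
    return "unrecognized";
}
```

A typical call would be describe_device_type(getenv("ACC_DEVICE_TYPE")), made early in main before any OpenACC construct runs.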

GOMP_DEBUG

Setting GOMP_DEBUG=1 in the environment enables some debugging output during execution. It currently logs data management and kernel launches and, if an nvptx device type is active, also dumps the offloaded PTX code. There are plans to make this output easier for users to consume.

None: OpenACC (last edited 2017-07-27 18:47:45 by cesar)