Differences between revisions 14 and 15
Revision 14 as of 2015-08-12 12:57:31
Size: 6090
Editor: TomDeVries
Comment:
Revision 15 as of 2015-11-16 04:22:11
Size: 6102
Editor: tschwinge
Comment: Move "Work in Progress" before "GCC 5"
Line 16: Line 16:

== Work in Progress ==

Current development continues on [[http://news.gmane.org/find-root.php?message_id=%3C87a9elqolz.fsf%40schwinge.name%3E|gomp-4_0-branch]]. Please add a `[gomp4]` tag to any patches posted for inclusion in that branch.

The implementation status is the same as with GCC 5 (see below), with the following changes:

=== OpenACC Kernels ===

Initial support for OpenACC kernels, but still in early stages of development:

  * Code will be offloaded onto multiple gangs, but executes with just one worker, and a vector_length of 1.
    (Each loop nest in a kernels region can be mapped onto a different parallelism dimension (gang, worker, vector) with a certain (optimal) parallelization factor.
    At the moment, we only support mapping a non-nested loop onto gang parallelism, with the parallelization factor given by a command-line switch.
    Eventually, a heuristic based on loop analysis should choose an appropriate mapping.)
  * The loop directive is supported, but most loop directive clauses are not. The independent clause is supported though.
  * No other directives are currently supported inside kernels constructs.
  * Reductions are supported inside kernels constructs.
    (Note: using the reduction clause in a kernels region is not supported yet.)
  * A single, bounded loop in a kernels region can be parallelized onto gangs by using: `-fopenacc -O2 -ftree-parallelize-loops=[number of gangs]`.
      * Nested loops are supported.
        (We can only parallelize one loop in the loop nest though.)
      * Only one loop per kernels region is handled.
        (It's possible to have two or more subsequent loops, as well as sequential code in between the loops, in a kernels region, but that is not yet supported.)
      * No true vectorization.
        (A dependent but vectorizable loop can be vectorized (mapped on the vector dimension), but that is currently not supported.
        At the moment, we just use pass_parallelize_loops, which classifies loops as either dependent or independent.)

OpenACC

This page contains information on GCC's implementation of the OpenACC specification and related functionality. OpenACC is intended for programming accelerator devices such as GPUs, including code offloading to these devices.

OpenACC is an experimental feature of GCC 5.1 and may not meet the needs of general application development. Support for OpenACC 2.0a in GCC will be available in upcoming releases.

For discussing this project, please use the standard GCC resources (mailing lists, Bugzilla, and so on). It's helpful to put a [OpenACC] tag into your email's Subject line, and set the openacc keyword in any Bugzilla issues filed.

Implementation Status

Work in Progress

Current development continues on gomp-4_0-branch. Please add a [gomp4] tag to any patches posted for inclusion in that branch.

The implementation status is the same as with GCC 5 (see below), with the following changes:

OpenACC Kernels

Initial support for OpenACC kernels, but still in early stages of development:

  • Code will be offloaded onto multiple gangs, but executes with just one worker, and a vector_length of 1.
    • (Each loop nest in a kernels region can be mapped onto a different parallelism dimension (gang, worker, vector) with a certain (optimal) parallelization factor. At the moment, we only support mapping a non-nested loop onto gang parallelism, with the parallelization factor given by a command-line switch. Eventually, a heuristic based on loop analysis should choose an appropriate mapping.)
  • The loop directive is supported, but most loop directive clauses are not. The independent clause is supported though.
  • No other directives are currently supported inside kernels constructs.
  • Reductions are supported inside kernels constructs.
    • (Note: using the reduction clause in a kernels region is not supported yet.)
  • A single, bounded loop in a kernels region can be parallelized onto gangs by using: -fopenacc -O2 -ftree-parallelize-loops=[number of gangs].

    • Nested loops are supported.
      • (We can only parallelize one loop in the loop nest though.)
    • Only one loop per kernels region is handled.
      • (It's possible to have two or more subsequent loops, as well as sequential code in between the loops, in a kernels region, but that is not yet supported.)
    • No true vectorization.
      • (A dependent but vectorizable loop can be vectorized (mapped on the vector dimension), but that is currently not supported. At the moment, we just use pass_parallelize_loops, which classifies loops as either dependent or independent.)
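As a rough illustration of the restrictions listed above, the following C sketch puts one bounded, non-nested loop into each kernels region. The function names vec_add and vec_sum are made up for this example. The second function writes the reduction as a plain loop, without a reduction clause, on the assumption that the compiler detects it on its own. Whether a given build of the branch actually parallelizes either loop has not been verified here.

```c
/* Hypothetical example: one bounded loop in its own kernels region,
   marked with the (supported) independent clause.  */
void vec_add(const float *a, const float *b, float *c, int n)
{
#pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
  {
#pragma acc loop independent
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}

/* Hypothetical example: a sum written as an ordinary loop.  No reduction
   clause is used, since that clause is listed as unsupported in kernels
   regions.  */
float vec_sum(const float *a, int n)
{
  float sum = 0.0f;
#pragma acc kernels copyin(a[0:n]) copy(sum)
  {
    for (int i = 0; i < n; i++)
      sum += a[i];
  }
  return sum;
}
```

Using the switches quoted above, such a file might be compiled with, for instance, -fopenacc -O2 -ftree-parallelize-loops=32, where 32 is an arbitrary number of gangs chosen for illustration. Without -fopenacc the pragmas are ignored and the code runs sequentially on the host.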

GCC 5

GCC 5 includes a preliminary implementation of the OpenACC 2.0a specification.

The execution model currently only allows for one gang, one worker, and a number of vectors. These vectors will all execute in "vector-redundant" mode. This means that inside a parallel construct, offloaded code outside of any loop construct will be executed by all vectors, not just a single vector. The reduction clause is not yet supported with the parallel construct.
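To make "vector-redundant" mode more concrete, here is a minimal C sketch. The function name scale_twice is hypothetical. The statement ahead of the loop construct sits inside the parallel construct but outside any loop construct, so under the GCC 5 model described above it would be executed by every vector. In this sketch it only computes a local value, so the redundant execution does no harm.

```c
/* Hypothetical example: scales x in place by 2 * factor.  */
void scale_twice(float *x, int n, float factor)
{
#pragma acc parallel copy(x[0:n])
  {
    /* Outside any loop construct: in vector-redundant mode, every vector
       executes this statement.  It only sets a local variable, so the
       repeated execution does not change the result.  */
    float f = 2.0f * factor;

#pragma acc loop
    for (int i = 0; i < n; i++)
      x[i] *= f;
  }
}
```

By contrast, a statement with a visible side effect at that position, such as an update of an element of x, would be carried out by all vectors rather than once, so such code is best avoided outside loop constructs.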

The kernels construct so far is supported only in a simplistic way: the code will be offloaded, but execute with just one gang, one worker, one vector. No directives are currently supported inside kernels constructs. Reductions are not yet supported inside kernels constructs.

The atomic, cache, declare, host_data, and routine directives are not yet supported. The default(none), device_type, firstprivate, and private clauses are not yet supported. A parallel construct's implicit data attributes for scalar data types will be treated as present_or_copy instead of firstprivate. Only the collapse clause is currently supported for loop constructs, and there is incomplete support for the reduction clause.

Combined directives (kernels loop, parallel loop) are not yet supported; use kernels alone, or parallel followed by loop, instead.
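The workaround can be sketched as follows in C. Instead of the combined parallel loop directive, a parallel construct encloses a separate loop construct. The function name mat_fill is invented for this example. The sketch uses collapse, the one loop clause described above as supported.

```c
/* Hypothetical example: fills a rows x cols matrix stored in row-major
   order.  "parallel" followed by "loop" replaces the unsupported
   combined "parallel loop" directive.  */
void mat_fill(float *m, int rows, int cols, float value)
{
#pragma acc parallel copyout(m[0:rows * cols])
  {
#pragma acc loop collapse(2)
    for (int i = 0; i < rows; i++)
      for (int j = 0; j < cols; j++)
        m[i * cols + j] = value * (float)(i + j);
  }
}
```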

Nested parallelism (cf. CUDA dynamic parallelism) is not yet supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not yet supported.

Issue Tracking

Open OpenACC bugs

Known issues with offloading.

acc_on_device

OpenACC 2.0a, 3.2.14 acc_on_device:

  • If the acc_on_device routine has a compile-time constant argument, it evaluates at compile time to a constant.

As discussed, this currently works for C, but not for C++ or Fortran.
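For reference, a call with a compile-time constant argument might look like the C sketch below. The wrapper name running_on_host is hypothetical. The _OPENACC guard is there only so that the file still builds when OpenACC support is not enabled. acc_on_device and acc_device_host come from openacc.h.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: returns nonzero when executing on the host.  */
int running_on_host(void)
{
#ifdef _OPENACC
  /* The argument is a compile-time constant, so for C this call is
     expected to evaluate to a constant at compile time.  */
  return acc_on_device(acc_device_host);
#else
  /* Without OpenACC support there is only the host.  */
  return 1;
#endif
}
```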

Documentation

ACC_DEVICE_TYPE

For ACC_DEVICE_TYPE, there are three options: nvidia, host_nonshm, and host. The last one, host, means single-threaded host-fallback execution in shared-memory mode. In contrast, host_nonshm means execution on the host, still single-threaded, but with emulated non-shared memory. The idea is that even if no accelerator is currently available, you can still use host_nonshm to test your data directives.
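A small test case for data directives could look like the following C sketch. The names square and count_mismatches are made up for illustration. The assumption is that with host, a forgotten copyout might go unnoticed because memory is shared, whereas with host_nonshm the emulated separate memory should expose it. That behavior is inferred from the description above and has not been verified here.

```c
/* Hypothetical example: b[i] = a[i] * a[i], with explicit data clauses.  */
void square(const float *a, float *b, int n)
{
#pragma acc parallel copyin(a[0:n]) copyout(b[0:n])
  {
#pragma acc loop
    for (int i = 0; i < n; i++)
      b[i] = a[i] * a[i];
  }
}

/* Hypothetical host-side check: counts the elements of b that do not
   hold the expected square of the matching element of a.  */
int count_mismatches(const float *a, const float *b, int n)
{
  int bad = 0;
  for (int i = 0; i < n; i++)
    if (b[i] != a[i] * a[i])
      bad++;
  return bad;
}
```

Running a program built from this with ACC_DEVICE_TYPE=host_nonshm set in the environment, and checking that count_mismatches returns 0, would be one way to exercise the data clauses without an accelerator.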

GOMP_DEBUG

GOMP_DEBUG=1 can be set in the environment to enable some debugging output during execution. There are plans to improve this output so that it is more useful to users. Currently it logs data management and kernel launches and, if an nvptx device type is active, also includes a dump of the offloaded PTX code.

None: OpenACC (last edited 2018-05-23 09:04:07 by tschwinge)