Differences between revisions 12 and 13
Revision 12 as of 2020-07-01 15:57:32
Size: 8591
Editor: tschwinge
Comment: GCC 10 etc. have support for C/C++ struct and class types with OpenACC, OpenACC 'attach'/'detach' actions
Revision 13 as of 2020-07-12 17:49:13
Size: 8194
Comment:
Line 5: Line 5:
<<Anchor(status-trunk)>>
Line 7: Line 6:
== trunk/GCC-11: GCC's general master development branch ==

Release branches are annually branched off of it.
See [[https://gcc.gnu.org/develop.html]] for details.

Most development is done directly on the GCC `master` development branch (also called `trunk`).
However, some development is first done on a branch. In case of OpenACC, the
== GCC 10 Release Series ==
Line 24: Line 17:
-
Line 38: Line 31:
This branch is available as `devel/omp/gcc-10` branch in the GCC git repository. This branch is available as `devel/omp/gcc-10` branch in the GCC git repository. Use this branch if you want to make use of the latest OpenACC functionality and performance optimizations that are not yet part of the official GCC releases. Find it at `git://gcc.gnu.org/git/gcc.git`,
Line 40: Line 33:
The branch is based on gcc-10-branch.
Find it at `git://gcc.gnu.org/git/gcc.git`,

Please send email with a short-hand `[og10]` tag in the subject line, and use `ChangeLog.openacc` files.

Use this branch if you want to make use of the latest OpenACC functionality and performance optimizations that are not yet part of the official GCC releases.
To contribute to the OpenACC effort, please send email to gcc-patches with a short-hand `[og10]` tag in the subject line, and use `ChangeLog.openacc` files.
Line 113: Line 101:
(Reviewed up to trunk r270579: gcc-9-branch branch point.)

OpenACC Implementation Status

GCC 10 Release Series

The GCC 10 release series contains the following OpenACC changes:

  • OpenACC 2.6 support
  • Fortran derived types and C/C++ struct and class types are now supported for OpenACC. Members that have pointer type will trigger OpenACC 'attach'/'detach' actions as detailed in OpenACC 2.6 and later. Generally, such a variable needs to be mapped on the device before any of its members in order for 'attach'/'detach' to work properly.
  • Re-implementation of OpenACC asynchronous queues
  • Support for AMD Radeon (GCN) GPUs (Fiji, VEGA)
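
The mapping order described above for struct types can be sketched as follows. This is a minimal illustration, not taken from the GCC test suite: the type `struct vec` and the function `scale_on_device` are hypothetical names. The struct is mapped first, then its pointer member (which is expected to trigger the 'attach' action), and the unmapping happens in the reverse order. Built without -fopenacc, the pragmas are ignored and the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example type: a struct with a pointer member. */
struct vec {
    int n;
    double *data;
};

/* Multiply every element of v.data by factor.
   v is passed by value, but v.data still points at the caller's array. */
void scale_on_device(struct vec v, double factor)
{
    assert(v.data != NULL && v.n >= 0);
    int n = v.n;

    /* Map the struct itself first ... */
#pragma acc enter data copyin(v)
    /* ... then the data behind its pointer member (attach). */
#pragma acc enter data copyin(v.data[0:n])

#pragma acc parallel loop present(v)
    for (int i = 0; i < n; i++)
        v.data[i] *= factor;

    /* Unmap in reverse order: member data first (detach), then the struct. */
#pragma acc exit data copyout(v.data[0:n])
#pragma acc exit data delete(v)
}
```

A caller would fill a `struct vec` with a length and a host array and pass it to `scale_on_device`; the array holds the scaled values when the call returns.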

For Nvidia PTX (nvptx) offloading:

  • nvptx support for OpenACC asynchronous queues

Main changes which are only on the OG10 branch:

  • Improvements to the parallelization of kernels

  • Multi-dimensional dynamic array support for OpenACC data clauses (non-contiguous arrays)
  • Some of the reduction enhancements

Known issues: See OG10 Known Issues.

OG10 branch: current OpenACC development branch

This branch is available as the devel/omp/gcc-10 branch in the GCC git repository. Use this branch if you want to make use of the latest OpenACC functionality and performance optimizations that are not yet part of the official GCC releases. Find it at git://gcc.gnu.org/git/gcc.git.

To contribute to the OpenACC effort, please send email to gcc-patches with a short-hand [og10] tag in the subject line, and use ChangeLog.openacc files.

The implementation status on devel/omp/gcc-10 (alias OG10) branch is based on the GCC 10 release series, with additional support for the OpenACC 2.6 specification and the following exceptions:

  • GCC does not support the init, shutdown and set directives. However, it does support their corresponding OpenACC runtime API routines.

  • GCC does not support the acc_memcpy_device runtime API routine.

  • private and firstprivate variables cannot contain subarray arguments.

  • GCC does not support the device_type clause
  • GCC parses cache directives, but no special code is generated for them
  • GCC has preliminary support for the OpenACC Profiling Interface. For more details see the libgomp documentation.
  • The nohost clause is supported, but support for the bind clause is incomplete: works only in C, and only for non-inlined functions, http://mid.mail-archive.com/87twns3ebs.fsf@hertz.schwinge.homeip.net.
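
Since the init, shutdown and set directives are not available, the corresponding runtime API routines can be called instead. The following is a minimal sketch under stated assumptions: the helper names `select_device` and `release_device` are hypothetical, and the `_OPENACC` guard is only there so that the file still builds when compiled without -fopenacc. The routines `acc_init`, `acc_get_device_type`, `acc_get_num_devices`, `acc_set_device_num` and `acc_shutdown` are part of the OpenACC runtime API.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper replacing '#pragma acc init' and '#pragma acc set'.
   Returns 1 if the device was selected, 0 if built without OpenACC,
   and -1 if device_num is out of range. */
int select_device(int device_num)
{
    assert(device_num >= 0);
#ifdef _OPENACC
    acc_init(acc_device_default);
    /* Ask the runtime which device type the default resolved to. */
    acc_device_t type = acc_get_device_type();
    if (device_num >= acc_get_num_devices(type))
        return -1;
    acc_set_device_num(device_num, type);
    return 1;
#else
    (void) device_num;
    return 0;
#endif
}

/* Hypothetical helper replacing '#pragma acc shutdown'. */
void release_device(void)
{
#ifdef _OPENACC
    acc_shutdown(acc_device_default);
#endif
}
```

A program would typically call `select_device(0)` once at startup and `release_device()` before exiting.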

Compared to the GCC 10 release series (see above), devel/omp/gcc-10 contains the following enhancements:

  • The compiler assigns shared memory storage for local variables declared inside parallel and kernels regions that are not associated with acc loops, or acc loops with an explicit gang clause. Note that variables which are used in acc loops that have been automatically assigned gang partitioning by the compiler continue to utilize thread-private storage, not shared memory.

  • Remove spurious error on combining the OpenACC auto clause with gang/worker/vector clauses on loop constructs.

  • The option -fopenacc-kernels can be used to configure how kernels constructs are handled. The default behavior, -fopenacc-kernels=split, causes OpenACC kernels constructs to be split into a sequence of compute constructs, each then handled individually. To disable the default behavior, use -fopenacc-kernels=parloops, which causes whole OpenACC kernels constructs to be handled by the parloops pass.

    • Inside OpenACC kernels constructs, loops annotated with OpenACC loop directives with independent clauses will be parallelized as if inside a parallel construct.
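
As an illustration of the kernels handling described above, here is a small sketch of a kernels construct whose loop carries an independent clause, which on this branch is expected to be parallelized as if it were inside a parallel construct. The function name `saxpy_kernels` is hypothetical. The build lines in the comment assume a compiler built from the devel/omp/gcc-10 branch, since -fopenacc-kernels is only available there.

```c
#include <assert.h>
#include <stddef.h>

/* Possible build lines (assuming a devel/omp/gcc-10 compiler):
     gcc -fopenacc -fopenacc-kernels=split    saxpy.c
     gcc -fopenacc -fopenacc-kernels=parloops saxpy.c  */

/* Compute y = a * x + y inside an OpenACC kernels construct. */
void saxpy_kernels(int n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        /* 'independent' asserts that the iterations have no dependences. */
#pragma acc loop independent
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}
```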

Known Issues

The following are known issues in devel/omp/gcc-10:

  • Certain profiling functionality is not implemented. See the libgomp manual for the implementation status of the OpenACC Profiling Interface.
  • Nested parallelism is not supported. I.e. parallel constructs cannot nest inside other parallel constructs.
  • Variables may only be used in one data clause per construct. For example, this generates an error: #pragma acc parallel copyin (foo) copyout (foo)

  • Fortran programs built with -ffpe-trap=invalid,zero,overflow may trigger floating point exceptions in the CUDA driver runtime.

  • Building with -ffast-math may occasionally cause linker errors. A workaround for that is to link against libm using -foffload=-lm.

  • All acc loop private clauses allocate storage for variables in local (i.e. thread-private) storage. They will utilize shared memory storage in a future release.

  • Local variables in acc routines are currently not allocated in CUDA shared memory, regardless of whether the routine uses gang or worker level parallelism.

  • In OpenACC kernels constructs, the following limitations apply to loops not annotated with OpenACC loop directives, or annotated with OpenACC loop directives but with an explicit or implicit auto clause:

    • The loop directive is supported, but most loop directive clauses are ignored.

    • No directives other than the loop directive are supported inside a kernels region.

    • reduction clauses are ignored, but loops with reductions might be parallelized.

    • If the loop can be parallelized, it will use multiple gangs, but just one worker, and a vector_length of 1.

    • Nested loops are supported, but only the outer loop of the loop nest can be parallelized. This means that inner loops are executed sequentially on each gang.
    • No true vectorization. A dependent but vectorizable loop could be vectorized (mapped on the vector dimension), but that is currently not supported.
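
For the restriction listed above that a variable may appear in only one data clause per construct, one way to avoid the error is to use a single copy clause, which copies the data in at the start of the region and out at its end. The sketch below is a minimal illustration; the function name `increment_all` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Add 1 to every element of foo. */
void increment_all(int *foo, int n)
{
    assert(foo != NULL);
    /* Naming foo in both copyin(...) and copyout(...) on this construct
       would be rejected; a single copy clause covers both directions. */
#pragma acc parallel loop copy(foo[0:n])
    for (int i = 0; i < n; i++)
        foo[i] += 1;
}
```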

gcc-9-branch: GCC 9 Release Series (GCC 9.1 released on 2019-05-03)

Compared to GCC 8, the GCC 9 release series contains the following OpenACC changes:

  • General bug fixing, performance improvements, compiler diagnostics improvements.
  • Support C++ reference data types.
  • Support C++ this pointer usage in OpenACC directives.

  • The independent and seq clauses are no longer refused as conflicting in Fortran.

  • The routine directive now correctly handles clauses specifying the level of parallelism in Fortran.

  • The private clause no longer is rejected for Fortran predetermined private loop iteration variables.

  • OpenACC async bug fixes: support multiple OpenACC wait clauses, handle (as a no-op) unseen async-arguments in acc_async_test, acc_wait, acc_wait_async, handle (as a no-op) cases of the same async-argument being used in wait and async clauses, fix handling of OpenACC wait directive without wait argument but with async clause, support acc_async_sync, acc_async_noval in acc_get_cuda_stream/acc_set_cuda_stream, handle wait clauses without async-argument.

  • Add user-friendly diagnostics for OpenACC loop parallelism assigned, enabled with -fopt-info-optimized-omp.

  • Adjust to OpenACC 2.5 data clause semantics, including support for the OpenACC finalize and if_present clauses.

  • OpenACC 2.5 acc_*_async versions of the relevant OpenACC runtime library routines.
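
The async, wait, if_present and finalize features listed above can be combined as in the following sketch. It is an illustrative example only; the function name `double_async` and the choice of queue numbers 1 and 2 are assumptions. Two loops are launched on separate asynchronous queues, a wait directive synchronizes both, and the data is then updated on the host and unmapped.

```c
#include <assert.h>
#include <stddef.h>

/* Double every element of a and b, using one async queue per array. */
void double_async(float *a, float *b, int n)
{
    assert(a != NULL && b != NULL);
#pragma acc enter data copyin(a[0:n], b[0:n])

#pragma acc parallel loop present(a[0:n]) async(1)
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;

#pragma acc parallel loop present(b[0:n]) async(2)
    for (int i = 0; i < n; i++)
        b[i] *= 2.0f;

    /* Block until both queues have finished. */
#pragma acc wait(1, 2)

    /* if_present: skip the update without error if the data is not mapped. */
#pragma acc update self(a[0:n], b[0:n]) if_present
    /* finalize: unmap regardless of the dynamic reference count. */
#pragma acc exit data delete(a[0:n], b[0:n]) finalize
}
```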

For Nvidia PTX (nvptx) offloading:

  • General bug fixing, performance improvements.
  • Compatibility with new CUDA versions and new PTX versions/GPU hardware, workarounds against PTX JIT bugs.
  • Determine default num_workers at runtime, and other improvements in runtime launch geometry computation.

  • Support vector_length bigger than 32.

  • Improve support for multiple GPUs.
  • Ignore C++ exceptions in OpenACC regions.
  • New flag -misa=sm_35 (use as: -foffload=nvptx-none=-misa=sm_35, or similar) to enable additional PTX atomic instructions added in target architecture sm_35.
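
As a small illustration of the nvptx items above, the sketch below requests a vector_length larger than 32 on a reduction loop. The function name `sum_array` and the value 128 are arbitrary choices for the example; the build line in the comment reuses the -foffload=nvptx-none=-misa=sm_35 form quoted above and assumes a GCC configured for nvptx offloading.

```c
#include <assert.h>
#include <stddef.h>

/* Possible build line (assuming nvptx offloading is configured):
     gcc -fopenacc -foffload=nvptx-none=-misa=sm_35 sum.c  */

/* Sum the n elements of x. */
float sum_array(const float *x, int n)
{
    assert(x != NULL);
    float sum = 0.0f;
    /* vector_length(128) asks for more than the 32 lanes of a single warp. */
#pragma acc parallel loop vector_length(128) copyin(x[0:n]) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```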

For everything older, see the OpenACC/Branch Archive.

None: OpenACC/Implementation Status (last edited 2020-07-12 17:49:13 by CatherineMoore)