
OpenACC Branch Archive

Known issues: See OG9 Known Issues.

OG9 Branch

OG9 branch: current OpenACC development branch. This branch is available as the devel/omp/gcc-9 branch in the GCC git repository. (In the former GCC git mirror, now gcc-old.git, it was the openacc-gcc-9-branch.)

This Git-only branch is used for collaborative development. The branch is based on gcc-9-branch. Find it at git://gcc.gnu.org/git/gcc.git, https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/openacc-gcc-9-branch, or https://github.com/gcc-mirror/gcc/tree/openacc-gcc-9-branch. Please send email with a short-hand [og9] tag in the subject line, and use ChangeLog.openacc files.

Use this branch if you want to make use of the latest OpenACC functionality and performance optimizations that are not yet part of the official GCC releases.

The implementation status on the devel/omp/gcc-9 (alias OG9) branch is based on the GCC 9 release series (see below), with additional support for the OpenACC 2.6 specification and the following exceptions:

GCC does not support the init, shutdown and set directives. However it does support their corresponding OpenACC runtime API routines.

GCC does not support the acc_memcpy_device runtime API routine.

private and firstprivate variables cannot contain subarray arguments.

gfortran only supports individual derived type members in acc update directives.

g++ does not support struct or class members inside data clauses.

GCC does not support the device_type clause and cache directive. While the C, C++ and Fortran compilers may build programs with those clauses, no special code is generated for them.

GCC has preliminary support for the OpenACC Profiling Interface. For more details see the libgomp documentation.

The nohost clause is supported, but support for the bind clause is incomplete: it works only in C, and only for non-inlined functions; see http://mid.mail-archive.com/87twns3ebs.fsf@hertz.schwinge.homeip.net.

Compared to the GCC 9 release series (see below), openacc-gcc-9-branch contains the following enhancements:

The compiler assigns shared memory storage for local variables declared inside parallel and kernels regions that are not associated with acc loops, or acc loops with an explicit gang clause. Note that variables which are used in acc loops that have been automatically assigned gang partitioning by the compiler continue to utilize thread-private storage, not shared memory.
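
As a rough illustration of the storage rule above, the following C sketch marks which local variables would be shared-memory candidates according to this description. The function name halve_all and the variable names are made up for the example, and the comments restate the text rather than describe verified compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: halves every element of v in place. */
void halve_all(float *v, int n)
{
    assert(v != NULL && n >= 0);
#pragma acc parallel copy(v[0:n])
    {
        /* Declared in the parallel region, outside any acc loop:
           per the description above, a candidate for shared memory. */
        float factor = 0.5f;

#pragma acc loop gang
        for (int i = 0; i < n; i++) {
            /* Declared in a loop with an explicit gang clause:
               also a shared-memory candidate per the description. */
            float t = v[i] * factor;
            v[i] = t;
        }
    }
}
```

Without -fopenacc the pragmas are ignored and the function runs sequentially on the host, which is a convenient way to check the results before offloading.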

Remove spurious error on combining the OpenACC auto clause with gang/worker/vector clauses on loop constructs.

A new -fopt-info-optimized-omp flag has been added to report how the compiler automatically assigns gang, worker and vector level parallelism to independent acc loops.

The option -fopenacc-kernels can be used to configure how kernels constructs are handled. The default behavior, -fopenacc-kernels=split, causes each OpenACC kernels construct to be split into a sequence of compute constructs, each of which is then handled individually. To disable the default behavior, use -fopenacc-kernels=parloops, which causes the whole OpenACC kernels construct to be handled by the parloops pass.

Inside OpenACC kernels constructs, loops annotated with OpenACC loop directives with independent clauses will be parallelized as if inside a parallel construct.
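
A minimal C sketch of a loop inside a kernels construct that carries an independent clause, and so would be treated as if it were inside a parallel construct according to the item above. The function name saxpy_kernels and the file name in the comment are hypothetical; the compile line only combines -fopenacc with the -fopenacc-kernels option described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] += alpha * x[i] for 0 <= i < n.
   Possible compile line on this branch (file name is illustrative):
     gcc -fopenacc -O2 -fopenacc-kernels=split saxpy.c */
void saxpy_kernels(float *y, const float *x, float alpha, int n)
{
    assert(y != NULL && x != NULL && n >= 0);
#pragma acc kernels copy(y[0:n]) copyin(x[0:n])
    {
        /* The independent clause asserts that iterations do not
           depend on each other. */
#pragma acc loop independent
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }
}
```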

Known Issues

The following are known issues in openacc-gcc-9-branch:

Certain profiling functionality is not implemented. See the libgomp manual for the implementation status of the OpenACC Profiling Interface.

Support for dynamic arrays inside offloaded regions has been added with the following limitations:

The pointer-to-arrays case is not supported yet, e.g. int (*a)[100].

Host fallback doesn't work yet, i.e. ACC_DEVICE_TYPE=host will segfault.

Nested parallelism is not supported, i.e. parallel constructs cannot nest inside other parallel constructs.

Variables may only be used in one data clause per construct. For example, this generates an error: #pragma acc parallel copyin (foo) copyout (foo)
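
One way to stay within the one-data-clause-per-variable rule above is to list a variable that is both read and written in a single copy clause. The sketch below assumes a hypothetical function scale_in_place; only the clause usage is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: doubles every element of foo in place.
   foo is both read and written, so it appears once, in a copy clause,
   instead of in separate copyin and copyout clauses. */
void scale_in_place(float *foo, int n)
{
    assert(foo != NULL && n >= 0);
#pragma acc parallel loop copy(foo[0:n])
    for (int i = 0; i < n; i++)
        foo[i] *= 2.0f;
}
```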

Fortran programs built with -ffpe-trap=invalid,zero,overflow may trigger floating point exceptions in the CUDA driver runtime.

Building with -ffast-math may occasionally cause linker errors. A workaround for that is to link against libm using -foffload=-lm.

GCC automatically partitions independent acc loops across gang, worker and vector parallelism, when available. At times this optimization can be too aggressive. If the program fails with "libgomp: cuLaunchKernel error: too many resources requested for launch", try adjusting num_workers on the offending loop. You may also compile the program with -fopenacc-dim=[num_gangs]:[num_workers], where num_gangs and num_workers specify the default number of gangs and workers, respectively.
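
The num_workers adjustment mentioned above might look like the following C sketch. The function row_sums, the worker count of 8, and the values in the compile line are illustrative assumptions, not recommended settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sums each row of an n-by-m row-major matrix a
   into row_sum.
   Alternative to the clause, setting defaults at compile time
   (values and file name are illustrative):
     gcc -fopenacc -fopenacc-dim=64:8 rows.c */
void row_sums(const float *a, float *row_sum, int n, int m)
{
    assert(a != NULL && row_sum != NULL && n >= 0 && m >= 0);
    /* num_workers(8) caps the workers per gang for this construct. */
#pragma acc parallel loop gang worker num_workers(8) copyin(a[0:n*m]) copyout(row_sum[0:n])
    for (int i = 0; i < n; i++) {
        float s = 0.0f;
#pragma acc loop vector reduction(+:s)
        for (int j = 0; j < m; j++)
            s += a[i * m + j];
        row_sum[i] = s;
    }
}
```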

All acc loop private clauses allocate storage for variables in local (i.e. thread-private) storage. They will utilize shared memory storage in a future release.

Local variables in acc routines are currently not allocated in CUDA shared memory, regardless of whether the routine uses gang or worker level parallelism.

GCC defaults to setting num_workers=32 when it detects a worker loop without an associated num_workers clause. Sometimes this can result in a runtime error. To correct this problem, rebuild your program with the -fopenacc-dim flag suggested by the compiler, or add a num_workers clause to the offending OpenACC construct.

In OpenACC kernels constructs, the following applies to loops that are not annotated with OpenACC loop directives, or that are annotated with OpenACC loop directives but have an explicit or implicit auto clause:

The loop directive is supported, but most loop directive clauses are ignored.

No directives other than the loop directive are supported inside a kernels region.

reduction clauses are ignored, but loops with reductions might be parallelized.

If the loop can be parallelized, it will use multiple gangs, but just one worker, and a vector_length of 1.

Nested loops are supported, but only the outer loop of the loop nest can be parallelized. This means that inner loops are executed sequentially on each gang.

No true vectorization. A dependent but vectorizable loop could be vectorized (mapped on the vector dimension), but that is currently not supported.

(Reviewed up to openacc-gcc-9-branch commit 891935361130d470567210e4a2dfefde2f634030.)

gcc-9-branch: GCC 9 Release Series (GCC 9.1 released on 2019-05-03)

Compared to GCC 8, the GCC 9 release series contains the following OpenACC changes:

General bug fixing, performance improvements, compiler diagnostics improvements.

Support C++ reference data types.

Support C++ this pointer usage in OpenACC directives.

The independent and seq clauses are no longer refused as conflicting in Fortran.

The routine directive now correctly handles clauses specifying the level of parallelism in Fortran.

The private clause no longer is rejected for Fortran predetermined private loop iteration variables.

OpenACC async bug fixes: support multiple OpenACC wait clauses, handle (as a no-op) unseen async-arguments in acc_async_test, acc_wait, acc_wait_async, handle (as a no-op) cases of the same async-argument being used in wait and async clauses, fix handling of OpenACC wait directive without wait argument but with async clause, support acc_async_sync, acc_async_noval in acc_get_cuda_stream/acc_set_cuda_stream, handle wait clauses without async-argument.

Add user-friendly diagnostics for OpenACC loop parallelism assigned, enabled with -fopt-info-optimized-omp.

Adjust to OpenACC 2.5 data clause semantics, including support for the OpenACC finalize and if_present clauses.

OpenACC 2.5 acc_*_async versions of the relevant OpenACC runtime library routines.

For Nvidia PTX (nvptx) offloading:

General bug fixing, performance improvements. Compatibility with new CUDA versions and new PTX versions/GPU hardware, workarounds against PTX JIT bugs. Determine default num_workers at runtime, and other improvements in runtime launch geometry computation.

Support vector_length bigger than 32.

Improve support for multiple GPUs.

Ignore C++ exceptions in OpenACC regions.

New flag -misa=sm_35 (use as: -foffload=nvptx-none=-misa=sm_35, or similar) to enable additional PTX atomic instructions added in target architecture sm_35.

(Reviewed up to trunk r270579: gcc-9-branch branch point.)

gcc-8-branch: GCC 8 Release Series (GCC 8.1 released on 2018-05-02)

Compared to GCC 7, the GCC 8 release series contains the following OpenACC changes:

  • Assorted bug fixing.
  • Standard conformance: C/C++ acc_pcopyin, acc_pcreate, Fortran openacc_lib.h: acc_pcopyin, acc_pcreate.

  • OpenACC 1.0 compatibility: acc_async_wait, acc_async_wait_all.

  • Fortran: Don't restrict wait directive arguments to constant integers.

  • OpenACC 2.5 default (present) clause.

  • Nvidia PTX (nvptx) back end: general bug fixing, support for CUDA 9, workarounds against PTX JIT bugs, initial SLP vectorization support using PTX ISA vector modes/instructions.

gcc-7-branch: GCC 7 Release Series (GCC 7.1 released on 2017-05-02)

In addition to single-threaded host-fallback execution, offloading is supported for nvptx (Nvidia GPUs) on x86_64 and PowerPC 64-bit little-endian GNU/Linux host systems. For nvptx offloading, with the OpenACC parallel construct, the execution model allows for an arbitrary number of gangs, up to 32 workers, and 32 vectors.

Initial support for parallelized execution of OpenACC kernels constructs:

  • Parallelization of a kernels region is switched on by '-fopenacc' combined with '-O2' or higher.
  • Code is offloaded onto multiple gangs, but executes with just one worker, and a vector length of 1.
  • Directives inside a kernels region are not supported.
  • Loops with reductions can be parallelized.
  • Only kernels regions with one loop nest are parallelized.
  • Only the outer-most loop of a loop nest can be parallelized.
  • Loop nests containing sibling loops are not parallelized.

Compared to GCC 6, the GCC 7 release series contains the following OpenACC changes:

  • Assorted bug fixing.
  • Standard conformance: pointer variables inside offloaded regions in C and C++, http://mid.mail-archive.com/573244BE.5010708@codesourcery.com.

  • Standard conformance: reduction variables imply data mapping clauses on outer compute constructs, https://gcc.gnu.org/PR70895.

  • Bug fixes related to memory mappings and reductions when used with compute constructs with an async clause.
  • Improvements in assigning gang, worker, and vector parallelism to un-annotated loop constructs.
  • The default number of gangs launched is now determined at run time, and no longer hard-coded to 32.
  • The information provided by tile clauses is now used for loop scheduling.
  • The host_data construct and use_device clause are now supported in Fortran, too.
  • Improved syntax checking for the routine construct in C and C++.
  • The cache directive no longer rejects valid syntax in C and C++.
  • The vector and vector_length clauses get parsed correctly in Fortran.
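
The reduction item in the list above can be illustrated with a small C sketch. The function name sum_all is hypothetical. The loop-level reduction on sum is what, per that item, implies a data mapping for sum on the enclosing parallel construct, so no explicit data clause for sum is written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: returns the sum of the n elements of x. */
float sum_all(const float *x, int n)
{
    assert(x != NULL && n >= 0);
    float sum = 0.0f;
#pragma acc parallel copyin(x[0:n])
    {
        /* sum has no data clause on the parallel construct; the
           reduction clause below is relied on to imply the mapping. */
#pragma acc loop reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i];
    }
    return sum;
}
```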

The device_type clause is not supported. The bind and nohost clauses are not supported.

Nested parallelism (cf. CUDA dynamic parallelism) is not supported.

Usage of OpenACC constructs inside multithreaded contexts (such as created by OpenMP, or pthread programming) is not supported.

If a call to the acc_on_device function has a compile-time constant argument, the function call evaluates to a compile-time constant value only for C and C++ but not for Fortran.

None: OpenACC/Branch Archive (last edited 2020-07-01 13:09:19 by tschwinge)