[PATCH,WIP] Use functional parameters for data mappings in OpenACC child functions

Mon Dec 18 22:58:00 GMT 2017

Jakub,

I'd like your thoughts on the following problem.

One of the offloading bottlenecks with GPU acceleration in OpenACC is
the nontrivial offloaded function invocation overhead. At present, GCC
generates code to pass a struct containing one field for each of the
data mappings used in the OMP child function. I'm guessing a struct is
used because pthread_create only accepts a single argument for new
threads. What I'd like to do is to create the child function with one
argument per data mapping. This has a number of advantages:

  1. No device memory needs to be managed for the child function data
     mapping struct.

  2. On PTX targets, the .param address space is cached. Using
     individual parameters for function arguments will allow the nvptx
     back end to generate a more relaxed "execution model" because the
     thread initialization code will be accessing cache memory instead
     of global memory.

  3. It was my hope that this would set a path to eliminate the
     GOMP_MAP_FIRSTPRIVATE_INT optimization, by replacing those mappings
     with the actual value directly.
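
To make the two calling conventions concrete, here is a minimal
host-side C sketch. All of the names (saxpy_map, saxpy_child_struct,
saxpy_child_params) are hypothetical and only illustrate the shape of
the child function; this is not the code GCC actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data-mapping struct: one field per data mapping, which
   is roughly the layout described for the current scheme.  */
struct saxpy_map
{
  float *x;
  float *y;
  float a;
  int n;
};

/* Current scheme (sketch): the child function receives a single
   pointer and loads every mapping through it.  */
void
saxpy_child_struct (void *data)
{
  struct saxpy_map *m = (struct saxpy_map *) data;
  assert (m != NULL);
  for (int i = 0; i < m->n; i++)
    m->y[i] = m->a * m->x[i] + m->y[i];
}

/* Proposed scheme (sketch): one argument per data mapping, so no
   struct has to be allocated or read back.  */
void
saxpy_child_params (float *x, float *y, float a, int n)
{
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
```

In the second form the scalars a and n arrive by value, which is the
kind of direct passing that 3. is hoping for.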

1) is huge for programs, such as cloverleaf, which launch a lot of small
parallel regions a lot of times.

For the execution model in 2), OpenACC begins each parallel region in a
gang-redundant, worker-single and vector-single state. To transition
from a single-threaded (or single vector lane) state to a multi-threaded
partitioned state, GCC needs to emit code to propagate live variables,
both on the stack and in registers, to the spawned threads. A lot of loops,
including DGEMV from BLAS, can be executed in a fully-redundant state.
Executing code redundantly has the advantage of not requiring any state
transition code. The problem here is that a) the struct is in global
memory, and b) not all of the GPU threads are executing the same
instruction at the same time. Consequently, initializing each thread in
a fully redundant manner actually hurts performance. When I rewrote the
same test case passing the data mappings via individual parameters, that
optimization improved performance compared to GCC trunk's baseline.

Lastly, 3) is more of a simplification than anything else. I'm not too
concerned about this because those variables only get initialized once.
So long as they don't require a separate COPYIN data mapping, the
performance hit should be negligible.

In this first attempt at using parameters I taught lower_omp_target how
to create child functions for OpenACC parallel regions with individual
parameters for the data mappings instead of using a large struct. This
works for the most part, but I realized too late that pthread_create
only passes one argument to each thread it creates. It should be noted
that I left the kernels implementation as-is, using the global struct
argument because kernels in GCC is largely ineffective and it usually
falls back to executing code on the host CPU. Eventually, we want to
redo kernels, but not until we get the parallel code running efficiently.

For fallback host targets, libgomp is using libffi to pass arguments to
the offloaded functions. This works OK at the moment because the host
code is always single-threaded. Ideally, that would change in the
future, but I'm not aware of any immediate plans to do so.

Question: is this approach acceptable for Stage 1 in May, or should I
make the offloaded function parameter expansion target-specific? I can
think of a couple of ways to make this target-specific:

  a. Create two child functions during lowering, one with individual
     parameters for the data mappings, and another which takes in a
     single struct. The latter then calls the former immediately on
     entry.

  b. Teach oaccdevlow to expand the incoming struct into individual
     parameters.
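
As a rough illustration of a., the struct-taking entry point can be a
thin thunk around the parameterized function. This is a hand-written
host-side sketch, not lowering output; scale_map, scale_child_params
and scale_child_struct are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data-mapping struct for a small region.  */
struct scale_map
{
  int *v;
  int factor;
  int n;
};

/* Child function with one parameter per data mapping.  */
void
scale_child_params (int *v, int factor, int n)
{
  for (int i = 0; i < n; i++)
    v[i] *= factor;
}

/* Struct-taking variant: unpacks the struct and calls the
   parameterized function immediately on entry.  */
void
scale_child_struct (void *data)
{
  struct scale_map *m = (struct scale_map *) data;
  assert (m != NULL);
  scale_child_params (m->v, m->factor, m->n);
}
```

A target that can only pass a single pointer would call
scale_child_struct, while a target that can pass individual arguments
would call scale_child_params directly.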

I'm concerned that b) is going to be a large pass. The SRA pass is
somewhat large at 5k lines. While this should be simpler, I'm not sure by
how much (probably a lot because it won't need to perform as much analysis).

While this patch is functional, it's not complete. I still need to tweak
a couple of things in the runtime. But I don't want to spend too much
time on it if we decide to go with a different approach.

Any thoughts are welcome.

By the way, next we'll be working on increasing vector_length on nvptx
targets. In conjunction with that, we'll be simplifying the OpenACC
execution model in the nvptx BE, along with adding a new reduction
finalizer.

Cesar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: og7-ptx-param.diff
Type: text/x-patch
Size: 53155 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20171218/b96e86f8/attachment.bin>