This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: oacc kernels directive -- reductions
- From: Richard Biener <rguenther at suse dot de>
- To: Tom de Vries <Tom_deVries at mentor dot com>
- Cc: Jakub Jelinek <jakub at redhat dot com>, gcc at gcc dot gnu dot org, Thomas Schwinge <Thomas_Schwinge at mentor dot com>, Bernd Schmidt <bernds at codesourcery dot com>
- Date: Thu, 16 Oct 2014 12:56:48 +0200 (CEST)
- Subject: Re: oacc kernels directive -- reductions
- Authentication-results: sourceware.org; auth=none
- References: <543CEEB7 dot 9080806 at mentor dot com>
On Tue, 14 Oct 2014, Tom de Vries wrote:
> in this email I'm trying to explain in detail what problem I'm running into
> with reductions in an oacc kernels region, and how I think it could be solved.
> Any advice is welcome.
> OVERALL PROBLEM
> The overall problem I'm trying to solve is to implement the oacc kernels
> directive in gcc, reusing pass_parallelize_loops.
> OACC KERNELS
> The oacc kernels region is a region containing a series of loop nests, which
> are intended to run on the accelerator. The compiler needs to offload each
> loop nest to the accelerator in the way that is most efficient for that
> accelerator.
> PARLOOPS
> The parloops pass analyzes loops. If the loop iterations are independent, and
> it looks beneficial to parallelize the loop, the loop is transformed.
> A copy of the loop is made, that deals with:
> - small loop iteration counts, for which the overhead of starting several
> threads would be too big, or
> - fixup loop iterations that are left over in case the number of iterations
> is not divisible by the parallelization factor.
> The original loop is transformed:
> - References to local variables are replaced with dereferences of new pointer
> variables, which are initialized at loop entry with the addresses of the
> original variables (eliminate_local_variables)
> - Loop-non-local variables are copied to a structure, and references are
> replaced with loads from a pointer to another (similar) structure
> - The loop is replaced with a GIMPLE_OMP_FOR (with an empty body) and
> GIMPLE_OMP_CONTINUE
> - The loop region is enveloped with GIMPLE_OMP_PARALLEL and GIMPLE_OMP_RETURN
> - The loop region is omp-expanded using omp_expand_local
> I've created an initial implementation in the vries/oacc-kernels branch, on
> top of the gomp-4_0-branch.
> In the gomp-4_0-branch, the kernels directive is translated as a copy of the
> oacc parallel directive. So, the following stages are done:
> - pass_lower_omp/scan_omp:
> - scan directive body for variables.
> - build up omp_context datastructures.
> - declare struct with fields corresponding to scanned variables.
> - declare function with pointer to struct
> - pass_lower_omp/lower_omp:
> - declare struct
> - assign values to struct fields
> - declare pointer to struct
> - rewrite body in terms of struct fields using pointer to struct.
> - omp_expand:
> - build up omp_region data-structures
> - split off the region into a separate function
> - replace the region with a call to an oacc runtime function, passing a
> pointer to the split-off function
> The current mechanism of offloading (compiling a function for a different
> architecture) uses LTO streaming. The parloops pass is located after the LTO
> streaming point, which is too late. OTOH, the parloops pass needs alias info,
> which is only available after pass_build_ealias. So a copy of the parloops
> pass, specialized for oacc kernels, has been added after pass_build_ealias
> (plus a couple of passes to compensate for moving the pass up in the pass
> list).
> The new pass does not use the lowering (first 2 steps of loop transform) of
> parloops. The lowering is already done by pass_omp_lower.
> The omp-expansion of the oacc kernels region (as done in gomp-4_0-branch) is
> skipped, to first allow alias analysis to work on the intact function, and
> then let the new pass do the omp-expansion.
> So, the new pass:
> - analyzes the loop for dependences
> - if the iterations are independent, transforms the loop:
> - The loop is replaced with a GIMPLE_OMP_FOR (kind_oacc_loop, with an empty
> body) and GIMPLE_OMP_CONTINUE
> - The GIMPLE_OACC_KERNELS is replaced with GIMPLE_OACC_PARALLEL
> - the loop region is omp-expanded using omp_expand_local
> The gotchas of the implementation are:
> - no support for reductions, nested loops, or more than one loop nest in the
> kernels region
> - the fixup/low-iteration-count loop copy is still generated _inside_ the
> split-off function
> PROBLEM WITH REDUCTIONS
> In the vries/oacc-kernels implementation, the lowering of oacc kernels (in
> pass_lower_omp) is done before any loop analysis. For reductions, that's not
> possible anymore, since that would mean that detection of reductions comes
> after handling of reductions.
> The problem we're running into here is that:
> - on the one hand, the oacc lowering is done on high gimple (scopes still
> intact because GIMPLE_BINDs are still present, no bbs and cfgs, eh not
> expanded, no ssa)
> - otoh, loop analysis is done on low ssa gimple (bbs, cfgs, ssa, no scopes,
> eh expanded)
> The parloops pass is confronted with a similar problem.
> AFAIU, ideal pass reuse for parloops would go something like this: on ssa,
> you do loop analysis. You then insert omp pragmas that indicate what
> transformations you want. Then you go back from ssa gimple to a high gimple
> representation, and you run omp-lower and omp-expand to do the actual
> transformation.
> Things have been solved like this in parloops: the lowering of omp-lower is
> not reused in parloops; instead a different (but similar) lowering has been
> added. What is reused is the omp-expand. We don't go back to pre-ssa, but the
> omp-expand code has been adapted to handle ssa code. And the parloops pass
> removes the loop cfg part and substitutes a GIMPLE_OMP_FOR for it, as would
> be the case for omp for directives present in a source with omp pragmas.
> An advantage that parloops has, is that its scope is limited to a single loop.
> OTOH, the implementation for oacc kernels has to cooperate with other oacc
> constructs, f.i. if an array is already present on the device due to an
> earlier construct.
> Furthermore, the lowering styles are different.
> In the omp-lower pass, oacc lowering is done like this: we load from a struct
> field a pointer (D.2158), from which we load the thread-local pointer
> instantiation (c.3D.2134):
> D.2158 = .omp_data_iD.2148->cD.2150;
> c.3D.2134 = *D.2158;
> D.2137 = c.3D.2134 + D.2136;
> *D.2137 = D.2144;
> In the parloops pass, we simply load the thread-local pointer instantiation
> from a struct field:
> c.4_53 = .paral_data_load.9_56->c;
> _7 = c.4_53 + _5;
> *_7 = _14;
> POSSIBLE SOLUTIONS
> I can think of these possible solutions:
> 1. rewrite the reduction analysis from parloops to work on high gimple.
> 2. rewrite omp-lowering to work on ssa gimple.
> 3. try to reuse the parloops lowering, and patch it up such that it
> cooperates with code generated for other oacc constructs.
> At the moment, I'm looking at the last option.
> By lowering the kernels directive and its clauses at omp-low, but the
> associated body at parloops, I get the following code (after parloops
> lowering) for a vector add loop:
> <bb 8>:
> .omp_data_arr.10.c = &c;
> .omp_data_arr.10.b = &b;
> .omp_data_arr.10.a = &a;
> #pragma acc kernels map(from:c [len: 8]) map(to:b [len: 8]) map(to:a [len:
> 8]) [child fn: main._omp_fn.0 (.omp_data_arr.10, .omp_data_sizes.11,
> <bb 9>:
> .omp_data_i_38 = &.omp_data_arr.10;
> c.5_40 = c;
> a.6_44 = a;
> b.7_49 = b;
> I think that if I manage to replace the last 3 loads with the following, I
> could have the vector add loop working:
> c_p = .omp_data_i_38->c;
> a_p = .omp_data_i_38->a;
> b_p = .omp_data_i_38->b;
> c.5_40 = *c_p;
> a.6_44 = *a_p;
> b.7_49 = *b_p;
> From there on, I'll try to get reductions working in a similar fashion.
Not really understanding how 3) can replace 1) or 2) ... but what
I understand is that OACC lowering happens at pass_lower_omp
(no CFG or loops or SSA). If we want to keep that, the "proper"
choice of dealing with this high-level OACC "kernel" directive
is to pass it down somehow and deal with it later when loops + SSA
are available. After all you _are_ re-using parloops for the transform.
So - can't OACC kernel lowering at lower_omp time simply annotate loops?
Like in the way #pragma ivdep is handled? Maybe this should even happen
from inside the parser (who knows where loops are)? Then at CFG / loop
build time this information is transferred to loop meta-data (same
as with IVDEP), and a parloops pass somewhere in early opts can do
the right thing(TM) on the marked loops?
> - Tom
Richard Biener <email@example.com>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer