This is the mail archive of the
mailing list for the GCC project.
Re: [gomp4] openacc kernels directive support
- From: Richard Biener <rguenther at suse dot de>
- To: Tom de Vries <Tom_deVries at mentor dot com>
- Cc: Jakub Jelinek <jakub at redhat dot com>, gcc at gcc dot gnu dot org, Thomas Schwinge <Thomas_Schwinge at mentor dot com>, Bernd Schmidt <bernds at codesourcery dot com>
- Date: Mon, 22 Sep 2014 10:28:44 +0200 (CEST)
- Subject: Re: [gomp4] openacc kernels directive support
- Authentication-results: sourceware.org; auth=none
- References: <53E24570 dot 1010200 at mentor dot com> <53F1EEB7 dot 1090509 at mentor dot com> <540ED665 dot 3010003 at mentor dot com> <alpine dot LSU dot 2 dot 11 dot 1409091243220 dot 20733 at zhemvz dot fhfr dot qr> <54185877 dot 6020902 at mentor dot com>
On Tue, 16 Sep 2014, Tom de Vries wrote:
> On 09-09-14 12:56, Richard Biener wrote:
> > On Tue, 9 Sep 2014, Tom de Vries wrote:
> > > On 18-08-14 14:16, Tom de Vries wrote:
> > > > On 06-08-14 17:10, Tom de Vries wrote:
> > > > > We could insert a pass-group here that only deals with functions
> > > > > that have the kernels directive, and do the auto-par thing in a
> > > > > pass_oacc_kernels (which should share the majority of the
> > > > > infrastructure with the parloops pass):
> > > > > ...
> > > > > NEXT_PASS (pass_build_ealias);
> > > > > INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
> > > > > NEXT_PASS (pass_ch);
> > > > > NEXT_PASS (pass_ccp);
> > > > > NEXT_PASS (pass_lim_aux);
> > > > > NEXT_PASS (pass_oacc_par);
> > > > > POP_INSERT_PASSES ()
> > > > > ...
> > > > >
> > > > > Any comments, ideas or suggestions ?
> > > >
> > > > I've experimented with implementing this on top of gomp-4_0-branch,
> > > > and I ran into PR46032.
> > > >
> > > > PR46032 is about vectorization failure on a function split off by omp
> > > > parallelization. The vectorization fails due to aliasing constraints
> > > > in the split-off function, which are not present in the original code.
> > Heh. At least the omp-low.c parts from comment #1 should be pushed
> > to trunk...
> Hi Richard,
> Right, but the intra_create_variable_infos part does not apply cleanly, and I
> don't know yet how to resolve that.
> > > > In the gomp-4_0-branch, the code marked by the openacc kernels
> > > > directive is split off during omp_expand. The generated code has the
> > > > same additional aliasing constraints, and in pass_oacc_par the
> > > > parallelization fails.
> > > >
> > > > PR46032 contains a tentative patch by Richard Biener, which applies
> > > > cleanly on top of 4.6 (I haven't yet reached a level of understanding
> > > > of tree-ssa-structalias.c to be able to resolve the conflict in
> > > > intra_create_variable_infos when applying on 4.7). The tentative patch
> > > > involves running ipa-pta, which is also a pass run after the point
> > > > where we write out the lto stream. I'm not sure whether it makes sense
> > > > to run the ipa-pta pass as part of the pass_oacc_kernels pass list.
> > No, that's not even possible I think.
> OK, thanks for confirming that.
> > > > I see three ways of continuing from here:
> > > > - take the tentative patch and make it work, including running
> > > >   ipa-pta during passes_oacc_kernels
> > > > - same, but try somehow to manage without running ipa-pta.
> > > > - try to postpone splitting of the function until the end of
> > > >   pass_oacc_par.
> > I don't understand the last option? What is the actual issue you run
> > into? You split oacc kernels off and _then_ run "autopar" on the
> > split-off function (and get additional kernels)?
> Let me try to reiterate the problem in more detail.
> We're trying to implement the auto-parallelization part of the oacc kernels
> directive using the existing parloops pass. The source starting point is the
> gomp-4_0-branch. The gomp-4_0-branch has a dummy implementation of the oacc
> kernels directive, analogous to the oacc parallel directive.
> So the current gomp-4_0-branch does the following steps for oacc
> parallel/kernels directives:
> 1. pass_lower_omp/scan_omp:
> - create record type with rewrite vars (.omp_data_t).
> - declare function with arg with type pointer to .omp_data_t.
> 2. pass_lower_omp/lower_omp:
> - rewrite region in terms of rewrite vars
> - add omp_return at end
> 3. pass_expand_omp:
> - split off the region into a separate function
> - replace region with call to GOACC_parallel/GOACC_kernels, with function
> pointer as argument
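The three steps above can be sketched by hand in plain C. This is only an illustration of the shape of the transformation, not what the branch generates: the record's field names, `kernels_region_fn`, and `launch_region` are hypothetical stand-ins (the real GOACC_parallel/GOACC_kernels entry points take arguments not shown in this thread), and the "launch" here simply runs the function on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the compiler-generated record (.omp_data_t in the thread).
   The field names are made up for this sketch. */
struct omp_data {
  const int *a;
  const int *b;
  int *c;
  size_t n;
};

/* Stand-in for the split-off function: the region body, rewritten so that
   every variable is reached through the record. */
static void kernels_region_fn(void *arg)
{
  struct omp_data *d = arg;
  for (size_t i = 0; i < d->n; i++)
    d->c[i] = d->a[i] + d->b[i];
}

/* Hypothetical stand-in for the runtime entry point; it only calls the
   function pointer on the host. */
static void launch_region(void (*fn)(void *), void *data)
{
  fn(data);
}

/* What is left in the original function: fill in the record and replace
   the region with a call that passes the function pointer. */
void vector_add(const int *a, const int *b, int *c, size_t n)
{
  assert(a != NULL && b != NULL && c != NULL);
  struct omp_data d = { a, b, c, n };
  launch_region(kernels_region_fn, &d);
}
```

Calling `vector_add(a, b, c, n)` then behaves like the original loop, but the loop body only sees pointers loaded from the record.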
> I wrote an example with a single oacc kernels region containing a simple
> vector addition loop, and tried to make auto-parallelization work.
Ah, so the "target" OACC directive tells it to vectorize only, not to
parallelize? And we split off the kernel only because we have to
ship it to the accelerator.
> The first problem I ran into was that the parloops pass failed to analyze the
> dependencies in a vector addition example, due to the fact that the region
> was already split off into a separate function, similar to PR46032.
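A source-level sketch of why the split-off form is harder to analyze, assuming a hand-written record similar to the one the lowering creates (`region_data` and the function names are hypothetical). Inside `region_fn` nothing says that the three pointers refer to distinct arrays, and if the output does overlap an input, the loop carries a dependence from one iteration to the next, so running the iterations in parallel would change the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record; the pointers may or may not overlap. */
struct region_data {
  const int *a;
  const int *b;
  int *c;
  size_t n;
};

/* Split-off loop body: only the record is visible here. */
static void region_fn(const struct region_data *d)
{
  assert(d != NULL);
  for (size_t i = 0; i < d->n; i++)
    d->c[i] = d->a[i] + d->b[i];
}

/* Disjoint arrays: every iteration is independent. */
int last_element_disjoint(void)
{
  int a[4] = {1, 1, 1, 1};
  int b[4] = {1, 1, 1, 1};
  int c[4] = {0, 0, 0, 0};
  struct region_data d = { a, b, c, 4 };
  region_fn(&d);
  return c[3];
}

/* Output shifted by one element over the first input: iteration i reads
   the value that iteration i-1 wrote, so the sums accumulate. */
int last_element_overlapping(void)
{
  int buf[5] = {1, 1, 1, 1, 1};
  int b[4] = {1, 1, 1, 1};
  struct region_data d = { buf, b, buf + 1, 4 };
  region_fn(&d);
  return buf[4];
}
```

In the original function the arrays are visible as separate objects; after the split-off the analysis has to rule out the second case on its own.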
> I looked briefly into the patch set in PR46032, but I realized that even if
> I fix it, the next problem I run into will be that the parloops pass is run
> after the lto stream read/write point. So any changes the parloops pass makes
> at that point are in the accelerator compile flow, in other words we're
> talking about launching an accelerator kernel from the accelerator. While that
> is possible with recent cuda accelerators, I guess in general we should not
> expect that to be possible.
HSA also supports that btw.
> [ I also thought of a fancy scheme where we don't split off a new function,
> but manipulate the body of the already split off function, and emit a c file
> from the accelerator compiler containing the parameters that the host compiler
> should use to launch the accelerator kernel... but I guess that would be a
> last resort. ]
> So in order to solve the lto stream read/write point problem, I moved the
> parloops pass (well, a copy called pass_oacc_par or similar) up in the pass
> list, to before lto stream read/write point. That precludes solving the alias
> problem with the PR46032 patch set, since we need ipa for that.
Generally I would expect that autopar would do analysis _before_ any
offloading (like in its regular place it is applied before vectorization).
> I solved (well, rather prevented) the alias problem by disabling
> pass_expand_omp for GIMPLE_OACC_KERNELS, in other words disabling the
> function-split-off in pass_expand_omp and letting pass_oacc_par take care of
> that (This is what I meant with: 'postpone splitting of the function until the
> end of pass_oacc_par').
> Doing so required me to write a patch to handle omp-lowered code
> conservatively in ccp and forwprop, otherwise the 'rewrite region in terms of
> rewrite vars' would be undone by the time we arrive at pass_oacc_par.
Ah. Well, yes. I would say you might be able to make autopar not
do the split-off but leave it to a further omp-expand pass (it
uses the OMP machinery anyway). Both OACC and autopar can share
the actual function split-off.
I would be happily accepting splitting the current autopar pass
that way, that is, do
and make the analysis code handle lowered OMP form.
Btw, did you see my recent patch proposals on persistent dependence
information? (the "Restrict, take 42" ones?) It would be nice
if the OMP lowering code would annotate memory references with
(non-)dependence information it has so that more easily survives
the function split-off.
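As a loose source-level analogy only (the patches mentioned above concern the compiler's internal representation, which is not shown here), C's `restrict` qualifier is one way non-dependence information can be attached to memory references so that it travels with them into a separate function. The function name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* The restrict qualifiers are a promise by the caller that, during this
   call, the elements written through c are not accessed through a or b.
   With that promise the loop iterations are known to be independent. */
void vector_add_restrict(const int *restrict a, const int *restrict b,
                         int *restrict c, size_t n)
{
  assert(a != NULL && b != NULL && c != NULL);
  for (size_t i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```

The caller must only pass non-overlapping arrays; passing overlapping ones would break the promise and make the behavior undefined.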
> > > > Some advice on how to continue from here would be *highly*
> > > > appreciated. My hunch atm is to investigate the last option.
> > > >
> > >
> > > Jakub,
> > > Richard,
> > >
> > > I've investigated the last option, and published the current state in
> > > the git-only branch vries/oacc-kernels (
> > > https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/vries/oacc-kernels
> > > ).
> > >
> > > The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35 is
> > > that:
> > > - a simple loop marked with the oacc kernels directive is analyzed for
> > > parallelization,
> > > - the loop is then rewritten using oacc parallel and oacc loop directives
> > > - these oacc directives are expanded using omp_expand_local
> > > - this results in the loop being split off into a separate function, while
> > > the loop is replaced with a GOACC_parallel call
> > > - all this is done before writing out the lto stream
> > > - no support yet for reductions, nested loops, more than one loop nest in
> > > kernels region
> > >
> > > At toplevel, the added pass list looks like this:
> > > ...
> > > NEXT_PASS (pass_build_ealias);
> > > /* Pass group that runs when there are oacc kernels in the
> > > function. */
> > Not sure why pass_oacc_kernels runs before all the other local
> > cleanups? I would have put it after pass_cd_dce at least.
> My focus was on running pass_oacc_kernels ASAP, in order not to have to adapt
> more passes to leave the omp-lowered code alone. I'll give your suggestion a
> > > NEXT_PASS (pass_oacc_kernels);
> > > PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > > NEXT_PASS (pass_ch_oacc_kernels);
> > > NEXT_PASS (pass_tree_loop_init);
> > > NEXT_PASS (pass_lim);
> > > NEXT_PASS (pass_ccp);
> > > NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> > > NEXT_PASS (pass_tree_loop_done);
> > > POP_INSERT_PASSES ()
> > > ...
> > >
> > > The main question I'm currently facing is the following: when to do
> > > lowering (in other words, rewriting of variable access in terms of
> > > .omp_data) of the kernels region. There are basically 2 passes that
> > > contain code to do this:
> > > - pass_lower_omp (on pre-ssa code)
> > > - pass_parallelize_loops (on ssa code)
> > Both use the same utilities.
> I think you mean that both passes use the same utilities to do omp-expand (in
> other words, pass_parallelize_loops uses omp_expand_local).
> But AFAIU, the omp-lowering in pass_parallelize_loops (in particular, the
> rewrite of the region in terms of rewrite vars) shares no code with the omp
> > > Atm I'm using pass_lower_omp, and I've added a patch that handles
> > > omp-lowered code conservatively in ccp and forwprop in order for the
> > > lowering to remain until arriving at
> > > pass_parallelize_loops_oacc_kernels.
> > You mean omp-_un_-lowered code?
> No, I mean pass_lower_omp lowers the code into omp-lowered code, and the patch
> in question prevents ccp and forwprop from undoing the lowering before
> arriving at the point where we split off the function.
> > > But it might turn out to be easier/necessary to handle this in
> > > pass_parallelize_loops_oacc_kernels instead.
> > I'd do it similar to how autopar does it
> OK, I'll try then to do the lowering for the kernels region in
> pass_parallelize_loops_oacc_kernels, not in pass_lower_omp.
> FWIW, I'm looking now into reductions, and started thinking in the same
> > (not that autopar is a great
> > example for a GCC pass these days...).
> For my understanding, could you briefly elaborate on that (or give a reference
> to an earlier discussion)?
> - Tom
> > Richard.
> > > Any advice on this issue, and on the current implementation is welcome.
> > >
> > > Thanks,
> > > - Tom
Richard Biener <rguenther at suse dot de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer