This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: [gomp4] openacc kernels directive support
- From: Richard Biener <rguenther at suse dot de>
- To: Tom de Vries <Tom_deVries at mentor dot com>
- Cc: Jakub Jelinek <jakub at redhat dot com>, gcc at gcc dot gnu dot org, Thomas Schwinge <Thomas_Schwinge at mentor dot com>, Bernd Schmidt <bernds at codesourcery dot com>
- Date: Tue, 9 Sep 2014 12:56:33 +0200 (CEST)
- Subject: Re: [gomp4] openacc kernels directive support
- References: <53E24570 dot 1010200 at mentor dot com> <53F1EEB7 dot 1090509 at mentor dot com> <540ED665 dot 3010003 at mentor dot com>
On Tue, 9 Sep 2014, Tom de Vries wrote:
> On 18-08-14 14:16, Tom de Vries wrote:
> > On 06-08-14 17:10, Tom de Vries wrote:
> > > We could insert a pass-group here that only deals with functions that
> > > have the kernels directive, and do the auto-par thing in a
> > > pass_oacc_kernels (which should share the majority of the
> > > infrastructure with the parloops pass):
> > > ...
> > > NEXT_PASS (pass_build_ealias);
> > > INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
> > >   NEXT_PASS (pass_ch);
> > >   NEXT_PASS (pass_ccp);
> > >   NEXT_PASS (pass_lim_aux);
> > >   NEXT_PASS (pass_oacc_par);
> > > POP_INSERT_PASSES ()
> > > ...
> > >
> > > Any comments, ideas or suggestions ?
> >
> > I've experimented with implementing this on top of gomp-4_0-branch, and
> > I ran into PR46032.
> >
> > PR46032 is about vectorization failure on a function split off by omp
> > parallelization. The vectorization fails due to aliasing constraints in the
> > split off function, which are not present in the original code.
Heh. At least the omp-low.c parts from comment #1 should be pushed
to trunk...
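
To make the aliasing problem concrete, here is a small hand-written
sketch (not compiler output). The record name omp_data_s, the function
names and the array sizes are assumptions chosen for illustration. Once
the loop lives in its own function and receives its pointers through a
record, a call with overlapping pointers is legal, so without extra
points-to information the iterations have to be treated as possibly
dependent:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the record the split-off function receives. */
struct omp_data_s {
  int *a;
  int *b;
  size_t n;
};

/* Hand-written analogue of a split-off loop body. Nothing in this
   function says that d->a and d->b point to different objects. */
void outlined_body(void *data)
{
  struct omp_data_s *d = data;
  assert(d != NULL);
  for (size_t i = 0; i < d->n; i++)
    d->a[i] = d->b[i] + 1;
}

/* Caller with two distinct arrays: iterations are independent. */
int run_disjoint(void)
{
  int a[8] = {0};
  int b[8] = {0};
  struct omp_data_s d = { a, b, 8 };
  outlined_body(&d);
  return a[7];
}

/* Equally legal caller with overlapping pointers: iteration i reads
   the value that iteration i - 1 wrote. */
int run_overlapping(void)
{
  int buf[9] = {0};
  struct omp_data_s d = { buf + 1, buf, 8 };
  outlined_body(&d);
  return buf[8];
}
```

With all-zero input, run_disjoint yields 1 while run_overlapping yields
8, so the same body gives different results depending on overlap. In the
original, unsplit function the two arrays are visibly distinct objects,
which is the information that gets lost.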
> > In the gomp-4_0-branch, the code marked by the openacc kernels directive
> > is split off during omp_expand. The generated code has the same
> > additional aliasing constraints, and in pass_oacc_par the
> > parallelization fails.
> >
> > PR46032 contains a tentative patch by Richard Biener, which applies
> > cleanly on top of 4.6 (I haven't yet reached a level of understanding of
> > tree-ssa-structalias.c to be able to resolve the conflict in
> > intra_create_variable_infos when applying on 4.7). The tentative patch
> > involves running ipa-pta, which is a pass that runs after the point
> > where we write out the lto stream. I'm not sure whether it makes sense
> > to run the ipa-pta pass as part of the pass_oacc_kernels pass list.
No, that's not even possible I think.
> > I see three ways of continuing from here:
> > - take the tentative patch and make it work, including running ipa-pta
> >   during passes_oacc_kernels
> > - same, but try somehow to manage without running ipa-pta.
> > - try to postpone splitting of the function until the end of pass_oacc_par.
I don't understand the last option? What is the actual issue you run
into? You split oacc kernels off and _then_ run "autopar" on the
split-off function (and get additional kernels)?
> > Some advice on how to continue from here would be *highly* appreciated.
> > My hunch atm is to investigate the last option.
> >
>
> Jakub,
> Richard,
>
> I've investigated the last option, and published the current state in git-only
> branch vries/oacc-kernels (
> https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/vries/oacc-kernels
> ).
>
> The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35 is that:
> - a simple loop marked with the oacc kernels directive is analyzed for
>   parallelization,
> - the loop is then rewritten using oacc parallel and oacc loop directives
> - these oacc directives are expanded using omp_expand_local
> - this results in the loop being split off into a separate function, while
>   the loop is replaced with a GOACC_parallel call
> - all this is done before writing out the lto stream
> - no support yet for reductions, nested loops, more than one loop nest in
>   kernels region
>
> At toplevel, the added pass list looks like this:
> ...
> NEXT_PASS (pass_build_ealias);
> /* Pass group that runs when there are oacc kernels in the
> function. */
Not sure why pass_oacc_kernels runs before all the other local
cleanups? I would have put it after pass_cd_dce at least.
> NEXT_PASS (pass_oacc_kernels);
> PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>   NEXT_PASS (pass_ch_oacc_kernels);
>   NEXT_PASS (pass_tree_loop_init);
>   NEXT_PASS (pass_lim);
>   NEXT_PASS (pass_ccp);
>   NEXT_PASS (pass_parallelize_loops_oacc_kernels);
>   NEXT_PASS (pass_tree_loop_done);
> POP_INSERT_PASSES ()
> ...
>
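
As a rough illustration of what a gated pass group does, the following
is a toy model in plain C. It is not GCC's pass manager API: the types
toy_pass and toy_function, the has_oacc_kernels flag and the trace
buffer are all assumptions made for this sketch. The point it shows is
that when the gate of the group pass fails, the nested passes are
skipped as well:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy function descriptor (hypothetical). */
struct toy_function {
  bool has_oacc_kernels;
  char trace[256];            /* names of the passes that ran */
};

/* Toy pass descriptor (hypothetical). */
struct toy_pass {
  const char *name;
  bool (*gate)(const struct toy_function *);  /* NULL: always run */
  const struct toy_pass *sub;                 /* nested pass list */
  size_t n_sub;
};

static bool gate_oacc_kernels(const struct toy_function *fn)
{
  return fn->has_oacc_kernels;
}

/* Run a pass list; a failed gate skips the pass and its sub-passes. */
static void run_passes(const struct toy_pass *passes, size_t n,
                       struct toy_function *fn)
{
  for (size_t i = 0; i < n; i++) {
    const struct toy_pass *p = &passes[i];
    if (p->gate && !p->gate(fn))
      continue;
    assert(strlen(fn->trace) + strlen(p->name) + 2 <= sizeof fn->trace);
    strcat(fn->trace, p->name);
    strcat(fn->trace, " ");
    run_passes(p->sub, p->n_sub, fn);
  }
}

/* Pass names mirror the list quoted in this thread. */
static const struct toy_pass oacc_sub[] = {
  { "ch_oacc_kernels", NULL, NULL, 0 },
  { "tree_loop_init", NULL, NULL, 0 },
  { "lim", NULL, NULL, 0 },
  { "ccp", NULL, NULL, 0 },
  { "parallelize_loops_oacc_kernels", NULL, NULL, 0 },
  { "tree_loop_done", NULL, NULL, 0 },
};

static const struct toy_pass toplevel[] = {
  { "build_ealias", NULL, NULL, 0 },
  { "oacc_kernels", gate_oacc_kernels, oacc_sub,
    sizeof oacc_sub / sizeof oacc_sub[0] },
};

const char *run_toplevel(struct toy_function *fn)
{
  fn->trace[0] = '\0';
  run_passes(toplevel, sizeof toplevel / sizeof toplevel[0], fn);
  return fn->trace;
}
```

For a function without a kernels region only build_ealias shows up in
the trace; with one, the whole nested group runs in order.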
> The main question I'm currently facing is the following: when to do lowering
> (in other words, rewriting of variable access in terms of .omp_data) of the
> kernels region. There are basically 2 passes that contain code to do this:
> - pass_lower_omp (on pre-ssa code)
> - pass_parallelize_loops (on ssa code)
Both use the same utilities.
> Atm I'm using pass_lower_omp, and I've added a patch that handles omp-lowered
> code conservatively in ccp and forwprop in order for the lowering to remain
> until arriving at pass_parallelize_loops_oacc_kernels.
You mean omp-_un_-lowered code?
> But it might turn out to be easier/necessary to handle this in
> pass_parallelize_loops_oacc_kernels instead.
I'd do it similar to how autopar does it (not that autopar is a great
example for a GCC pass these days...).
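
For readers unfamiliar with what "rewriting of variable access in terms
of .omp_data" means, here is a hand-written C analogue, offered as a
sketch only. The compiler-internal names begin with a dot and cannot be
spelled in C, so the identifiers omp_data_s, omp_data_o and omp_data_i
below are stand-ins, and toy_launch is a hypothetical host-only
placeholder for the runtime call, not the real GOACC_parallel interface:

```c
#include <assert.h>
#include <stddef.h>

/* Before lowering: the loop body names the variables directly. */
void scale_unlowered(float *x, size_t n, float factor)
{
  for (size_t i = 0; i < n; i++)
    x[i] = x[i] * factor;
}

/* Stand-in for the data record shared by parent and child. */
struct omp_data_s {
  float *x;
  size_t n;
  float factor;
};

/* After lowering and splitting off: every variable access goes
   through the receiver pointer. */
static void scale_child_fn(void *arg)
{
  struct omp_data_s *omp_data_i = arg;
  for (size_t i = 0; i < omp_data_i->n; i++)
    omp_data_i->x[i] = omp_data_i->x[i] * omp_data_i->factor;
}

/* Hypothetical placeholder for the runtime entry point; it simply
   runs the child function on the host. */
static void toy_launch(void (*fn)(void *), void *data)
{
  assert(fn != NULL);
  fn(data);
}

/* Parent after lowering: fill the sender record, then launch. */
void scale_lowered(float *x, size_t n, float factor)
{
  struct omp_data_s omp_data_o = { x, n, factor };
  toy_launch(scale_child_fn, &omp_data_o);
}
```

Both variants compute the same result. The difference is only where the
variables live, which is why passes that run between lowering and the
parallelization pass must not optimize the record accesses away.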
Richard.
> Any advice on this issue, and on the current implementation is welcome.
>
> Thanks,
> - Tom