This is the mail archive of the
mailing list for the GCC project.
Re: [gomp4] openacc kernels directive support
- From: Richard Biener <rguenther at suse dot de>
- To: Tom de Vries <Tom_deVries at mentor dot com>
- Cc: Jakub Jelinek <jakub at redhat dot com>,gcc at gcc dot gnu dot org,Thomas Schwinge <Thomas_Schwinge at mentor dot com>,Bernd Schmidt <bernds at codesourcery dot com>
- Date: Tue, 16 Sep 2014 21:01:14 +0200
- Subject: Re: [gomp4] openacc kernels directive support
- Authentication-results: sourceware.org; auth=none
- References: <53E24570 dot 1010200 at mentor dot com> <53F1EEB7 dot 1090509 at mentor dot com> <540ED665 dot 3010003 at mentor dot com> <alpine dot LSU dot 2 dot 11 dot 1409091243220 dot 20733 at zhemvz dot fhfr dot qr> <54185877 dot 6020902 at mentor dot com>
On September 16, 2014 5:34:15 PM CEST, Tom de Vries <Tom_deVries@mentor.com> wrote:
>On 09-09-14 12:56, Richard Biener wrote:
>> On Tue, 9 Sep 2014, Tom de Vries wrote:
>>> On 18-08-14 14:16, Tom de Vries wrote:
>>>> On 06-08-14 17:10, Tom de Vries wrote:
>>>>> We could insert a pass-group here that only deals with functions
>>>>> kernels directive, and do the auto-par thing in a
>>>>> should share the majority of the infrastructure with the parloops
>>>>> NEXT_PASS (pass_build_ealias);
>>>>> INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
>>>>> NEXT_PASS (pass_ch);
>>>>> NEXT_PASS (pass_ccp);
>>>>> NEXT_PASS (pass_lim_aux);
>>>>> NEXT_PASS (pass_oacc_par);
>>>>> POP_INSERT_PASSES ()
>>>>> Any comments, ideas or suggestions ?
>>>> I've experimented with implementing this on top of gomp-4_0-branch,
>>>> into PR46032.
>>>> PR46032 is about vectorization failure on a function split off by
>>>> parallelization. The vectorization fails due to aliasing constraints in the
>>>> split off function, which are not present in the original code.
>> Heh. At least the omp-low.c parts from comment #1 should be pushed
>> to trunk...
>Right, but the intra_create_variable_infos part does not apply cleanly,
>don't know yet how to resolve that.
That part is no longer necessary.
I'll follow up with the rest of the mail after I return from vacation.
>>>> In the gomp-4_0-branch, the code marked by the openacc kernels
>>>> split off during omp_expand. The generated code has the same
>>>> constraints, and in pass_oacc_par the parallelization fails.
>>>> The PR46032 contains a tentative patch by Richard Biener, which
>>>> on top of 4.6 (I haven't yet reached a level of understanding of
>>>> tree-ssa-structalias.c to be able to resolve the conflict in
>>>> intra_create_variable_infos when applying on 4.7). The tentative
>>>> running ipa-pta, which is also a pass run after the point where we
>>>> lto stream. I'm not sure whether it makes sense to run the ipa-pta
>>>> of the pass_oacc_kernels pass list.
>> No, that's not even possible I think.
>OK, thanks for confirming that.
>>>> I see three ways of continuing from here:
>>>> - take the tentative patch and make it work, including running
>>>> - same, but try somehow to manage without running ipa-pta.
>>>> - try to postpone splitting of the function until the end of
>> I don't understand the last option? What is the actual issue you run
>> into? You split oacc kernels off and _then_ run "autopar" on the
>> split-off function (and get additional kernels)?
>Let me try to reiterate the problem in more detail.
>We're trying to implement the auto-parallelization part of the oacc
>directive using the existing parloops pass. The source starting point
>gomp-4_0-branch. The gomp-4_0-branch has a dummy implementation of the
>kernels directive, analogous to the oacc parallel directive.
>So the current gomp-4_0-branch does the following steps for oacc
> - create record type with rewrite vars (.omp_data_t).
> - declare function with arg with type pointer to .omp_data_t.
> - rewrite region in terms of rewrite vars
> - add omp_return at end
> - split off the region into a separate function
> - replace region with call to GOACC_parallel/GOACC_kernels, with
> pointer as argument
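As an illustration of the outlining steps listed above, here is a hand-written C sketch of what the transformation amounts to for a vector addition. It is not taken from the branch: `struct omp_data_t` stands in for the generated `.omp_data_t` record, and `kernels_region_fn` and `launch_stub` are hypothetical names. In particular, `launch_stub` only models "call the split-off function with the record pointer"; the real GOACC_parallel/GOACC_kernels entry points have different signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the compiler-generated .omp_data_t record:
   one field per variable used inside the region.  */
struct omp_data_t {
    float *a;
    float *b;
    float *c;
    size_t n;
};

/* The split-off region: its only argument is a pointer to the record,
   and the body is rewritten in terms of the record's fields.  */
static void kernels_region_fn(struct omp_data_t *d)
{
    assert(d != NULL);
    for (size_t i = 0; i < d->n; i++)
        d->c[i] = d->a[i] + d->b[i];
}

/* Hypothetical placeholder for the runtime entry point.  It just calls
   the function on the host; it does not model the real libgomp API.  */
static void launch_stub(void (*fn)(struct omp_data_t *), struct omp_data_t *d)
{
    fn(d);
}

/* The original function after the transformation: the region has been
   replaced by filling in the record and calling the launcher.  */
void vec_add(float *a, float *b, float *c, size_t n)
{
    struct omp_data_t data = { a, b, c, n };
    launch_stub(kernels_region_fn, &data);
}
```

Note that inside `kernels_region_fn` the three arrays are only reachable through pointers loaded from the record, which is the situation the rest of the mail is concerned with.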
>I wrote an example with a single oacc kernels region containing a
>addition loop, and tried to make auto-parallelization work.
>The first problem I ran into was that the parloops pass failed to
>dependencies in a vector addition example, due to the fact that the
>already split off into a separate function, similar to PR46032.
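A small C sketch, under stated assumptions, of why a dependence analysis has to be conservative once the loop only sees pointers loaded from a record. `struct region_args` and both function names are hypothetical. The two functions run the same vector addition in ascending and in descending iteration order; if the pointers overlap, the results differ, so the iterations may only be reordered (or run in parallel) when the pointers are known not to overlap. In the original function that knowledge can come from the array declarations; in the split-off function it is no longer visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the record passed to a split-off region.  */
struct region_args {
    int *a;
    int *b;
    int *c;
    size_t n;
};

/* Iterations in ascending order, as the sequential loop would run.  */
void add_ascending(struct region_args *d)
{
    assert(d != NULL);
    for (size_t i = 0; i < d->n; i++)
        d->c[i] = d->a[i] + d->b[i];
}

/* The same iterations in descending order: one of the reorderings that
   a parallel execution would have to be allowed to perform.  */
void add_descending(struct region_args *d)
{
    assert(d != NULL);
    for (size_t i = d->n; i-- > 0; )
        d->c[i] = d->a[i] + d->b[i];
}
```

With `c` pointing one element past `a` in the same buffer, each ascending iteration reads the value written by the previous one, while the descending order reads only original values. With fully distinct arrays both orders give the same result.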
>I looked briefly into the patch set in PR46032, but I realized that even if I
>fix it, the next problem I run into will be that the parloops pass is
>the lto stream read/write point. So any changes the parloops pass makes
>point are in the accelerator compile flow, in other words we're talking
>launching an accelerator kernel from the accelerator. While that is
>with recent cuda accelerators, I guess in general we should not expect
>[ I also thought of a fancy scheme where we don't split off a new
>manipulate the body of the already split off function, and emit a c
>the accelerator compiler containing the parameters that the host
>use to launch the accelerator kernel... but I guess that would be a
>last resort. ]
>So in order to solve the lto stream read/write point problem, I moved
>parloops pass (well, a copy called pass_oacc_par or similar) up in the
>list, to before lto stream read/write point. That precludes solving the
>problem with the PR46032 patch set, since we need ipa for that.
>I solved (well, rather prevented) the alias problem by disabling
>for GIMPLE_OACC_KERNELS, in other words disabling the
>pass_expand_omp and letting pass_oacc_par take care of that (This is
>meant with: 'postpone splitting of the function until the end of
>Doing so required me to write a patch to handle omp-lowered code
>in ccp and forwprop, otherwise the 'rewrite region in terms of rewrite
>would be undone by the time we arrive at pass_oacc_par.
>>>> Some advice on how to continue from here would be *highly*
>>>> atm is to investigate the last option.
>>> I've investigated the last option, and published the current state
>>> branch vries/oacc-kernels (
>>> The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35
>>> - a simple loop marked with the oacc kernels directive is analyzed
>>> - the loop is then rewritten using oacc parallel and oacc loop
>>> - these oacc directives are expanded using omp_expand_local
>>> - this results in the loop being split off into a separate function,
>>> the loop is replaced with a GOACC_parallel call
>>> - all this is done before writing out the lto stream
>>> - no support yet for reductions, nested loops, more than one loop
>>> kernels region
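A source-level analogy of the rewrite described in the list above, as a sketch only: the branch performs this on the compiler's intermediate representation, not on C source, and the function names and data clauses here are assumptions chosen for illustration. The first function is the kind of input a user writes with the kernels directive; the second spells out the parallelism with oacc parallel and oacc loop, which is the form the analyzed loop is rewritten into before expansion.

```c
#include <assert.h>
#include <stddef.h>

/* Input form: the compiler is asked to find the parallelism itself.  */
void vec_add_kernels(const float *a, const float *b, float *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL);
#pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }
}

/* Rewritten form: the loop is explicitly marked as parallel.  */
void vec_add_parallel(const float *a, const float *b, float *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL);
#pragma acc parallel copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
#pragma acc loop
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }
}
```

Compiled without OpenACC support the pragmas are ignored and both functions are plain sequential loops, so they can be compared directly on the host.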
>>> At toplevel, the added pass list looks like this:
>>> NEXT_PASS (pass_build_ealias);
>>> /* Pass group that runs when there are oacc kernels in
>>> function. */
>> Not sure why pass_oacc_kernels runs before all the other local
>> cleanups? I would have put it after pass_cd_dce at least.
>My focus was on running pass_oacc_kernels ASAP, in order not to have to
>more passes to leave the omp-lowered code alone. I'll give your
>suggestion a try.
>>> NEXT_PASS (pass_oacc_kernels);
>>> PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>>> NEXT_PASS (pass_ch_oacc_kernels);
>>> NEXT_PASS (pass_tree_loop_init);
>>> NEXT_PASS (pass_lim);
>>> NEXT_PASS (pass_ccp);
>>> NEXT_PASS (pass_parallelize_loops_oacc_kernels);
>>> NEXT_PASS (pass_tree_loop_done);
>>> POP_INSERT_PASSES ()
>>> The main question I'm currently facing is the following: when to do
>>> (in other words, rewriting of variable access in terms of .omp_data)
>>> kernels region. There are basically 2 passes that contain code to do
>>> - pass_lower_omp (on pre-ssa code)
>>> - pass_parallelize_loops (on ssa code)
>> Both use the same utilities.
>I think you mean that both passes use the same utilities to do
>other words, pass_parallelize_loops uses omp_expand_local).
>But AFAIU, the omp-lowering in pass_parallelize_loops (in particular,
>rewrite of the region in terms of rewrite vars) shares no code with the
>>> Atm I'm using pass_lower_omp, and I've added a patch that handles
>>> code conservatively in ccp and forwprop in order for the lowering to
>>> until arriving at pass_parallelize_loops_oacc_kernels.
>> You mean omp-_un_-lowered code?
>No, I mean pass_lower_omp lowers the code into omp-lowered code, and
>in question prevents ccp and forwprop from undoing the lowering before
>at the point where we split off the function.
>>> But it might turn out to be easier/necessary to handle this in
>>> pass_parallelize_loops_oacc_kernels instead.
>> I'd do it similar to how autopar does it
>OK, I'll try then to do the lowering for the kernels region in
>pass_parallelize_loops_oacc_kernels, not in pass_lower_omp.
>FWIW, I'm looking now into reductions, and started thinking in the same
>> (not that autopar is a great
>> example for a GCC pass these days...).
>For my understanding, could you briefly elaborate on that (or give a
>to an earlier discussion)?
>>> Any advice on this issue, and on the current implementation is
>>> - Tom