This is the mail archive of the mailing list for the GCC project.


Re: [gomp4] openacc kernels directive support

On 09-09-14 12:56, Richard Biener wrote:
On Tue, 9 Sep 2014, Tom de Vries wrote:

On 18-08-14 14:16, Tom de Vries wrote:
On 06-08-14 17:10, Tom de Vries wrote:
We could insert a pass-group here that only deals with functions that have a
kernels directive, and do the auto-par thing in a pass_oacc_kernels (which
should share the majority of the infrastructure with the parloops pass):
            NEXT_PASS (pass_build_ealias);
            INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
               NEXT_PASS (pass_ch);
               NEXT_PASS (pass_ccp);
               NEXT_PASS (pass_lim_aux);
               NEXT_PASS (pass_oacc_par);
            POP_INSERT_PASSES ()

Any comments, ideas or suggestions ?

I've experimented with implementing this on top of gomp-4_0-branch, and I ran
into PR46032.

PR46032 is about vectorization failure on a function split off by omp
parallelization. The vectorization fails due to aliasing constraints in the
split off function, which are not present in the original code.

Heh.  At least the omp-low.c parts from comment #1 should be pushed
to trunk...

Hi Richard,

Right, but the intra_create_variable_infos part does not apply cleanly, and I don't know yet how to resolve that.

In the gomp-4_0-branch, the code marked by the openacc kernels directive is
split off during omp_expand. The generated code has the same additional
constraints, and in pass_oacc_par the parallelization fails.

PR46032 contains a tentative patch by Richard Biener, which applies
on top of 4.6 (I haven't yet reached a level of understanding of
tree-ssa-structalias.c to be able to resolve the conflict in
intra_create_variable_infos when applying on 4.7). The tentative patch involves
running ipa-pta, which is also a pass run after the point where we write out
the lto stream. I'm not sure whether it makes sense to run the ipa-pta pass as
part of the pass_oacc_kernels pass list.

No, that's not even possible I think.

OK, thanks for confirming that.

I see three ways of continuing from here:
- take the tentative patch and make it work, including running ipa-pta
- same, but try somehow to manage without running ipa-pta.
- try to postpone splitting of the function until the end of pass_oacc_par.

I don't understand the last option?  What is the actual issue you run
into?  You split oacc kernels off and _then_ run "autopar" on the
split-off function (and get additional kernels)?

Let me try to reiterate the problem in more detail.

We're trying to implement the auto-parallelization part of the oacc kernels directive using the existing parloops pass. The source starting point is the gomp-4_0-branch. The gomp-4_0-branch has a dummy implementation of the oacc kernels directive, analogous to the oacc parallel directive.

So the current gomp-4_0-branch does the following steps for oacc parallel/kernels directives:
1. pass_lower_omp/scan_omp:
   - create record type with rewrite vars (.omp_data_t).
   - declare function with arg with type pointer to .omp_data_t.
2. pass_lower_omp/lower_omp:
   - rewrite region in terms of rewrite vars
   - add omp_return at end
3. pass_expand_omp:
   - split off the region into a separate function
   - replace region with call to GOACC_parallel/GOACC_kernels, with function
     pointer as argument
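
The three steps above can be sketched at the source level as follows. This is only an illustration for a vector addition region, not what the compiler literally emits: the record layout, the field names, kernels_region_fn and fake_goacc_kernels are all made up here, and the real GOACC_kernels entry point in libgomp takes more arguments than a function pointer and a data pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the record type created in step 1
   (.omp_data_t); the field names are made up for illustration.  */
struct omp_data_t
{
  const int *a;
  const int *b;
  int *c;
  size_t n;
};

/* Stand-in for the split-off function of step 3.  The body is the
   original region, rewritten in terms of the record fields (step 2).  */
static void
kernels_region_fn (void *arg)
{
  struct omp_data_t *data = arg;
  for (size_t i = 0; i < data->n; i++)
    data->c[i] = data->a[i] + data->b[i];
}

/* Hypothetical launcher standing in for GOACC_kernels.  Here it simply
   calls the function on the host.  */
static void
fake_goacc_kernels (void (*fn) (void *), void *data)
{
  assert (fn != NULL);
  fn (data);
}

/* What is left of the original function after expansion: fill in the
   record and replace the region with a call to the launcher.  */
void
vector_add (const int *a, const int *b, int *c, size_t n)
{
  struct omp_data_t data = { a, b, c, n };
  fake_goacc_kernels (kernels_region_fn, &data);
}
```

Note that inside kernels_region_fn the arrays are only reachable through pointers loaded from the record, which is where the alias analysis trouble discussed in this thread comes from.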

I wrote an example with a single oacc kernels region containing a simple vector addition loop, and tried to make auto-parallelization work.

The first problem I ran into was that the parloops pass failed to analyze the dependencies in a vector addition example, due to the fact that the region was already split off into a separate function, similar to PR46032.
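
To make the dependence problem concrete, here is a small self-contained sketch (all names are hypothetical, and this is not code from the branch). Once the loop only sees three pointers loaded from a record, nothing tells the compiler that the output does not overlap an input. If it does overlap, the result depends on the iteration order, so the loop cannot be parallelized safely; the compiler has to assume that case unless alias analysis proves otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record, as seen by the split-off function: all the
   compiler knows is that it holds three pointers.  */
struct region_args
{
  int *a;
  int *b;
  int *c;
};

/* c[i] = a[i] + b[i], iterating forwards.  */
void
add_forward (struct region_args *p, size_t n)
{
  for (size_t i = 0; i < n; i++)
    p->c[i] = p->a[i] + p->b[i];
}

/* The same loop, iterating backwards.  A parallelized loop gives no
   ordering guarantee, so it is only valid if both orders agree.  */
void
add_backward (struct region_args *p, size_t n)
{
  for (size_t i = n; i-- > 0; )
    p->c[i] = p->a[i] + p->b[i];
}

/* Return 1 if both iteration orders give the same result when a is
   placed at buf and c at buf + c_offset.  */
int
orders_agree (size_t c_offset)
{
  int b[4] = { 1, 1, 1, 1 };
  int buf1[8] = { 1, 5, 2, 7, 0, 0, 0, 0 };
  int buf2[8] = { 1, 5, 2, 7, 0, 0, 0, 0 };
  struct region_args fwd = { buf1, b, buf1 + c_offset };
  struct region_args bwd = { buf2, b, buf2 + c_offset };

  assert (c_offset <= 4);
  add_forward (&fwd, 4);
  add_backward (&bwd, 4);
  return memcmp (buf1, buf2, sizeof buf1) == 0;
}
```

With c_offset 4 the input and output do not overlap and both orders agree; with c_offset 1 the output overlaps the input shifted by one element and the two orders give different results.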

I looked briefly into the patch set in PR46032, but I realized that even if I fix it, the next problem I run into will be that the parloops pass is run after the lto stream read/write point. So any changes the parloops pass makes at that point are in the accelerator compile flow, in other words we're talking about launching an accelerator kernel from the accelerator. While that is possible with recent cuda accelerators, I guess in general we should not expect that to be possible. [ I also thought of a fancy scheme where we don't split off a new function, but manipulate the body of the already split off function, and emit a c file from the accelerator compiler containing the parameters that the host compiler should use to launch the accelerator kernel... but I guess that would be a last resort. ]

So in order to solve the lto stream read/write point problem, I moved the parloops pass (well, a copy called pass_oacc_par or similar) up in the pass list, to before lto stream read/write point. That precludes solving the alias problem with the PR46032 patch set, since we need ipa for that.

I solved (well, rather prevented) the alias problem by disabling pass_expand_omp for GIMPLE_OACC_KERNELS, in other words disabling the function-split-off in pass_expand_omp and letting pass_oacc_par take care of that (this is what I meant with: 'postpone splitting of the function until the end of pass_oacc_par'). Doing so required me to write a patch to handle omp-lowered code conservatively in ccp and forwprop, otherwise the 'rewrite region in terms of rewrite vars' would be undone by the time we arrive at pass_oacc_par.

Some advice on how to continue from here would be *highly* appreciated. My
plan atm is to investigate the last option.


I've investigated the last option, and published the current state in the git-only
branch vries/oacc-kernels (;a=shortlog;h=refs/heads/vries/oacc-kernels).

The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35 is that:
- a simple loop marked with the oacc kernels directive is analyzed for
  parallelization
- the loop is then rewritten using oacc parallel and oacc loop directives
- these oacc directives are expanded using omp_expand_local
- this results in the loop being split off into a separate function, while
  the loop is replaced with a GOACC_parallel call
- all this is done before writing out the lto stream
- no support yet for reductions, nested loops, more than one loop nest in
  kernels region
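
As a source-level illustration of the rewrite described in the list above (a sketch only: in the branch the rewrite happens on GIMPLE, not on source, and the function names and data clauses below are assumptions added to make the example self-contained), a kernels region containing a simple loop is treated roughly as if the user had written a parallel region with an explicit loop directive. The pragmas are guarded by _OPENACC so the file also builds as plain C.

```c
#include <assert.h>
#include <stddef.h>

/* As written by the user: the compiler is asked to find the
   parallelism in the region by itself.  */
void
vector_add_kernels (const int *a, const int *b, int *c, size_t n)
{
  assert (a != NULL && b != NULL && c != NULL);
#ifdef _OPENACC
#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
#endif
  {
    for (size_t i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}

/* Roughly what the loop looks like once the analysis has decided it is
   parallelizable: an explicit parallel region with a loop directive.  */
void
vector_add_parallel (const int *a, const int *b, int *c, size_t n)
{
  assert (a != NULL && b != NULL && c != NULL);
#ifdef _OPENACC
#pragma acc parallel copyin (a[0:n], b[0:n]) copyout (c[0:n])
#endif
  {
#ifdef _OPENACC
#pragma acc loop
#endif
    for (size_t i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}
```

Both functions compute the same result; the difference is only in who is responsible for establishing that the iterations are independent.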

At toplevel, the added pass list looks like this:
           NEXT_PASS (pass_build_ealias);
           /* Pass group that runs when there are oacc kernels in the
              function.  */

Not sure why pass_oacc_kernels runs before all the other local
cleanups?  I would have put it after pass_cd_dce at least.

My focus was on running pass_oacc_kernels ASAP, in order not to have to adapt more passes to leave the omp-lowered code alone. I'll give your suggestion a try.

           NEXT_PASS (pass_oacc_kernels);
           PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
               NEXT_PASS (pass_ch_oacc_kernels);
               NEXT_PASS (pass_tree_loop_init);
               NEXT_PASS (pass_lim);
               NEXT_PASS (pass_ccp);
               NEXT_PASS (pass_parallelize_loops_oacc_kernels);
               NEXT_PASS (pass_tree_loop_done);
           POP_INSERT_PASSES ()
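
For readers not familiar with nested pass lists, the following toy model sketches the intended behaviour of such a pass group: the sub-passes only run for functions for which the gate of the enclosing pass holds. This is not the GCC pass manager; the types, the gate and the shortened pass names are hypothetical, and the trailing "later_passes" entry just stands for whatever follows the group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified model of a function being compiled.  */
struct toy_function
{
  bool has_oacc_kernels;
  char log[128];    /* Names of the passes that ran, in order.  */
};

struct toy_pass
{
  const char *name;
  bool (*gate) (const struct toy_function *);   /* NULL: always run.  */
  const struct toy_pass *sub;                   /* Sub-passes, or NULL.  */
  size_t n_sub;
};

static bool
gate_oacc_kernels (const struct toy_function *fn)
{
  return fn->has_oacc_kernels;
}

/* Run PASSES in order.  A pass whose gate fails is skipped together
   with all of its sub-passes.  */
void
run_passes (const struct toy_pass *passes, size_t n, struct toy_function *fn)
{
  for (size_t i = 0; i < n; i++)
    {
      const struct toy_pass *p = &passes[i];
      if (p->gate != NULL && !p->gate (fn))
        continue;
      assert (strlen (fn->log) + strlen (p->name) + 2 <= sizeof fn->log);
      strcat (fn->log, p->name);
      strcat (fn->log, " ");
      run_passes (p->sub, p->n_sub, fn);
    }
}

static const struct toy_pass oacc_sub[] = {
  { "ch", NULL, NULL, 0 },
  { "lim", NULL, NULL, 0 },
  { "ccp", NULL, NULL, 0 },
  { "parloops", NULL, NULL, 0 },
};

const struct toy_pass toy_pipeline[] = {
  { "ealias", NULL, NULL, 0 },
  { "oacc_kernels", gate_oacc_kernels, oacc_sub, 4 },
  { "later_passes", NULL, NULL, 0 },
};
```

Running toy_pipeline on a function without a kernels region skips the whole group, so functions that do not use the directive are not affected by the extra passes.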

The main question I'm currently facing is the following: when to do lowering
(in other words, rewriting of variable access in terms of .omp_data) of the
kernels region. There are basically 2 passes that contain code to do this:
- pass_lower_omp (on pre-ssa code)
- pass_parallelize_loops (on ssa code)

Both use the same utilities.

I think you mean that both passes use the same utilities to do omp-expand (in other words, pass_parallelize_loops uses omp_expand_local). But AFAIU, the omp-lowering in pass_parallelize_loops (in particular, the rewrite of the region in terms of rewrite vars) shares no code with the omp pass.

Atm I'm using pass_lower_omp, and I've added a patch that handles omp-lowered
code conservatively in ccp and forwprop in order for the lowering to remain
until arriving at pass_parallelize_loops_oacc_kernels.

You mean omp-_un_-lowered code?

No, I mean pass_lower_omp lowers the code into omp-lowered code, and the patch in question prevents ccp and forwprop from undoing the lowering before arriving at the point where we split off the function.

But it might turn out to be easier/necessary to handle this in
pass_parallelize_loops_oacc_kernels instead.

I'd do it similar to how autopar does it

OK, I'll then try to do the lowering for the kernels region in pass_parallelize_loops_oacc_kernels, not in pass_lower_omp.

FWIW, I'm looking now into reductions, and started thinking in the same direction.

(not that autopar is a great
example for a GCC pass these days...).

For my understanding, could you briefly elaborate on that (or give a reference to an earlier discussion)?

- Tom


Any advice on this issue, and on the current implementation, is welcome.

- Tom
