This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: [gomp4] openacc kernels directive support


On 18-08-14 14:16, Tom de Vries wrote:
On 06-08-14 17:10, Tom de Vries wrote:
We could insert a pass-group here that only deals with functions that have the
kernels directive, and do the auto-par thing in a pass_oacc_kernels (which
should share the majority of the infrastructure with the parloops pass):
...
           NEXT_PASS (pass_build_ealias);
           INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
              NEXT_PASS (pass_ch);
              NEXT_PASS (pass_ccp);
              NEXT_PASS (pass_lim_aux);
              NEXT_PASS (pass_oacc_par);
           POP_INSERT_PASSES ()
...

Any comments, ideas or suggestions ?

I've experimented with implementing this on top of gomp-4_0-branch, and I ran
into PR46032.

PR46032 is about vectorization failure on a function split off by omp
parallelization. The vectorization fails due to aliasing constraints in the
split off function, which are not present in the original code.

In the gomp-4_0-branch, the code marked by the openacc kernels directive is
split off during omp_expand. The generated code has the same additional aliasing
constraints, and in pass_oacc_par the parallelization fails.
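
To make the aliasing problem concrete, here is a minimal C sketch of the effect described above. It is hand-written source, not actual compiler output: struct omp_data_sketch and child_fn_sketch are hypothetical stand-ins for the compiler-generated data record and the split-off function. In the original form the two arrays are distinct local objects, so they cannot overlap; in the outlined form the loop only sees two pointers loaded from a record, and without interprocedural points-to information nothing tells the optimizers that they refer to different objects.

```c
#include <assert.h>

#define N 256

/* Hypothetical stand-in for the compiler-generated data record that is
   passed to the split-off function.  */
struct omp_data_sketch
{
  unsigned int *results;
  const unsigned int *pData;
  unsigned int coeff;
};

/* Original form: results and pData are distinct local arrays, so the
   compiler knows that the store cannot overlap the load.  */
unsigned int
sum_original (void)
{
  unsigned int results[N], pData[N];
  unsigned int coeff = 2, sum = 0;

  for (int i = 0; i < N; i++)
    pData[i] = (unsigned int) i;
  for (int idx = 0; idx < N; idx++)
    results[idx] = coeff * pData[idx];
  for (int i = 0; i < N; i++)
    sum += results[i];
  return sum;
}

/* Outlined form: the loop body only sees pointers read from *d.  Looking at
   this function alone, d->results and d->pData might point to the same
   memory, which is the extra aliasing constraint.  */
static void
child_fn_sketch (struct omp_data_sketch *d)
{
  assert (d && d->results && d->pData);
  for (int idx = 0; idx < N; idx++)
    d->results[idx] = d->coeff * d->pData[idx];
}

/* Caller side of the outlined form: pack the variables into the record and
   call the split-off function.  */
unsigned int
sum_outlined (void)
{
  unsigned int results[N], pData[N];
  unsigned int sum = 0;
  struct omp_data_sketch d = { results, pData, 2 };

  for (int i = 0; i < N; i++)
    pData[i] = (unsigned int) i;
  child_fn_sketch (&d);
  for (int i = 0; i < N; i++)
    sum += results[i];
  return sum;
}
```

Both variants compute the same result; the difference is only in what can be proven about the memory accesses inside the loop.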

PR46032 contains a tentative patch by Richard Biener, which applies cleanly
on top of 4.6 (I haven't yet reached a level of understanding of
tree-ssa-structalias.c to be able to resolve the conflict in
intra_create_variable_infos when applying on 4.7). The tentative patch involves
running ipa-pta, which is also a pass run after the point where we write out the
lto stream. I'm not sure whether it makes sense to run the ipa-pta pass as part
of the pass_oacc_kernels pass list.

I see three ways of continuing from here:
- take the tentative patch and make it work, including running ipa-pta during
   passes_oacc_kernels
- same, but try somehow to manage without running ipa-pta.
- try to postpone splitting of the function until the end of pass_oacc_par.

Some advice on how to continue from here would be *highly* appreciated. My hunch
atm is to investigate the last option.


Jakub,
Richard,

I've investigated the last option, and published the current state in the git-only branch vries/oacc-kernels ( https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/vries/oacc-kernels ).

The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35 is that:
- a simple loop marked with the oacc kernels directive is analyzed for
   parallelization,
- the loop is then rewritten using oacc parallel and oacc loop directives
- these oacc directives are expanded using omp_expand_local
- this results in the loop being split off into a separate function, while
   the loop is replaced with a GOACC_parallel call
- all this is done before writing out the lto stream
- there is no support yet for reductions, nested loops, or more than one loop
  nest in a kernels region
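
As a source-level illustration of the first two steps above (a sketch only: the branch does this rewrite on the intermediate representation, not on source code, and the function names and data clauses here are made up for the example), a simple loop in a kernels region and the parallel/loop form it corresponds to once the loop has been found to be parallelizable could look like this:

```c
#include <assert.h>

/* As written by the user: one simple loop inside a kernels region.  The
   compiler has to work out by itself whether the iterations are
   independent.  */
void
vec_add_kernels (const int *a, const int *b, int *c, int n)
{
  assert (n >= 0);
#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}

/* Roughly the equivalent form after a successful analysis: the region is
   expressed with the parallel and loop directives, which the existing omp
   expansion code already knows how to split off into a separate
   function.  */
void
vec_add_parallel (const int *a, const int *b, int *c, int n)
{
  assert (n >= 0);
#pragma acc parallel copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
#pragma acc loop
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}
```

Built without OpenACC support the pragmas are ignored and both functions are plain sequential loops, so the two forms can be compared directly on the host.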

At toplevel, the added pass list looks like this:
...
          NEXT_PASS (pass_build_ealias);
          /* Pass group that runs when there are oacc kernels in the
             function.  */
          NEXT_PASS (pass_oacc_kernels);
          PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
              NEXT_PASS (pass_ch_oacc_kernels);
              NEXT_PASS (pass_tree_loop_init);
              NEXT_PASS (pass_lim);
              NEXT_PASS (pass_ccp);
              NEXT_PASS (pass_parallelize_loops_oacc_kernels);
              NEXT_PASS (pass_tree_loop_done);
          POP_INSERT_PASSES ()
 ...

The main question I'm currently facing is when to do the lowering of the kernels region (in other words, the rewriting of variable accesses in terms of .omp_data). There are basically two passes that contain code to do this:
- pass_lower_omp (on pre-ssa code)
- pass_parallelize_loops (on ssa code)

Atm I'm using pass_lower_omp, and I've added a patch that handles omp-lowered code conservatively in ccp and forwprop, so that the lowering stays intact until the code reaches pass_parallelize_loops_oacc_kernels.

But it might turn out to be easier/necessary to handle this in pass_parallelize_loops_oacc_kernels instead.

Any advice on this issue, and on the current implementation is welcome.

Thanks,
- Tom

