This is the mail archive of the mailing list for the GCC project.


Re: [gomp4] openacc kernels directive support

On 22-09-14 10:28, Richard Biener wrote:
On Tue, 16 Sep 2014, Tom de Vries wrote:

On 09-09-14 12:56, Richard Biener wrote:
On Tue, 9 Sep 2014, Tom de Vries wrote:

On 18-08-14 14:16, Tom de Vries wrote:
On 06-08-14 17:10, Tom de Vries wrote:
We could insert a pass-group here that only deals with functions that contain the
kernels directive, and do the auto-par thing in a pass_oacc_kernels pass-group, which
should share the majority of the infrastructure with the parloops pass:
             NEXT_PASS (pass_build_ealias);
             INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
                NEXT_PASS (pass_ch);
                NEXT_PASS (pass_ccp);
                NEXT_PASS (pass_lim_aux);
                NEXT_PASS (pass_oacc_par);
             POP_INSERT_PASSES ()

Any comments, ideas or suggestions ?

I've experimented with implementing this on top of the gomp-4_0-branch, and ran
into PR46032.

PR46032 is about a vectorization failure on a function split off by omp
parallelization. The vectorization fails due to aliasing constraints in the
split-off function, which are not present in the original code.
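To illustrate the kind of constraint loss (a hypothetical sketch in plain C, not the PR's actual testcase; `omp_data_s` here just models the record omp-expand creates):

```c
#include <stddef.h>

/* In the original function the compiler sees that a, b and c are
   restrict-qualified, so the loop has no dependences and vectorizes.  */
void
vecadd (float *restrict a, float *restrict b, float *restrict c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}

/* After the region is split off, the pointers arrive through a single
   record.  The restrict qualification is lost, so the split-off
   function must assume the arrays may alias, and dependence analysis
   fails.  */
struct omp_data_s { float *a; float *b; float *c; size_t n; };

void
vecadd_split (struct omp_data_s *d)
{
  for (size_t i = 0; i < d->n; i++)
    d->a[i] = d->b[i] + d->c[i];
}
```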

Heh.  At least the omp-low.c parts from comment #1 should be pushed
to trunk...

Hi Richard,

Right, but the intra_create_variable_infos part does not apply cleanly, and I
don't know yet how to resolve that.

In the gomp-4_0-branch, the code marked by the openacc kernels directive is
split off during omp_expand. The generated code has the same additional aliasing
constraints, and in pass_oacc_par the parallelization fails.

PR46032 contains a tentative patch by Richard Biener, which applies
on top of 4.6 (I haven't yet reached a level of understanding of
tree-ssa-structalias.c to be able to resolve the conflict in
intra_create_variable_infos when applying it on 4.7). The tentative patch requires
running ipa-pta, which is a pass run after the point where we write the
lto stream. I'm not sure whether it makes sense to run the ipa-pta pass as part
of the pass_oacc_kernels pass list.

No, that's not even possible I think.

OK, thanks for confirming that.

I see three ways of continuing from here:
- take the tentative patch and make it work, including running ipa-pta
- same, but try somehow to manage without running ipa-pta
- try to postpone splitting off the function until the end of pass_oacc_par

I don't understand the last option?  What is the actual issue you run
into?  You split oacc kernels off and _then_ run "autopar" on the
split-off function (and get additional kernels)?

Let me try to reiterate the problem in more detail.

We're trying to implement the auto-parallelization part of the oacc kernels
directive using the existing parloops pass. The source starting point is the
gomp-4_0-branch.  The gomp-4_0-branch has a dummy implementation of the oacc
kernels directive, analogous to the oacc parallel directive.

So the current gomp-4_0-branch does the following steps for oacc
parallel/kernels directives:
1. pass_lower_omp/scan_omp:
    - create record type with rewrite vars (.omp_data_t).
    - declare function with arg with type pointer to .omp_data_t.
2. pass_lower_omp/lower_omp:
    - rewrite region in terms of rewrite vars
    - add omp_return at end
3. pass_expand_omp:
    - split off the region into a separate function
    - replace region with call to GOACC_parallel/GOACC_kernels, with function
      pointer as argument

I wrote an example with a single oacc kernels region containing a simple
vector addition loop, and tried to make auto-parallelization work.
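The example had roughly this shape (my reconstruction in plain C; the actual testcase may differ). The single loop nest in the kernels region is what pass_oacc_par should turn into one accelerator kernel:

```c
/* A minimal oacc kernels testcase: one region, one loop nest.
   Without -fopenacc the pragma is simply ignored and the loop runs
   sequentially, which is handy for checking the results.  */
void
vecadd_kernels (float *a, float *b, float *c, int n)
{
#pragma acc kernels
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```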

Ah, so the "target" OACC directive tells it to vectorize only, not to

Hi Richard,

I'm trying to make auto-parallelization work, not vectorization.

And we split off the kernel only because we have to
ship it to the accelerator.

The first problem I ran into was that the parloops pass failed to analyze the
dependencies in a vector addition example, due to the fact that the region
was already split off into a separate function, similar to PR46032.

I looked briefly into the patch set in PR46032, but I realized that even if
I fix it, the next problem I run into will be that the parloops pass is run
after the lto stream read/write point. So any changes the parloops pass makes
at that point are in the accelerator compile flow; in other words, we're
talking about launching an accelerator kernel from the accelerator. While that
is possible with recent cuda accelerators, I guess in general we should not
expect that to be possible.

HSA also supports that btw.

OK, good to know.

[ I also thought of a fancy scheme where we don't split off a new function,
but manipulate the body of the already split-off function, and emit a C file
from the accelerator compiler containing the parameters that the host compiler
should use to launch the accelerator kernel... but I guess that would be a
last resort. ]

So in order to solve the lto stream read/write point problem, I moved the
parloops pass (well, a copy called pass_oacc_par or similar) up in the pass
list, to before the lto stream read/write point. That precludes solving the alias
problem with the PR46032 patch set, since we need ipa for that.

Generally I would expect that autopar would do analysis _before_ any
offloading (like in its regular place it is applied before vectorization).

Right. The pass_parallelize_loops works on all loops in a function. If a loop qualifies for parallelization, the loop is transformed and split off in order to call it as a thread function. If not, there's no need to split off or transform the loop, and it is left alone.

For a kernels region the following holds:
This construct defines a region of the program that is to be compiled into a sequence of kernels for execution on the accelerator device.

The compiler will split the code in the kernels region into a sequence of accelerator kernels. Typically, each loop nest will be a distinct kernel. When the program encounters a kernels construct, it will launch the sequence of kernels in order on the device. The number and configuration of gangs of workers and vector length may be different for each kernel.
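The "sequence of kernels" wording can be illustrated with a hypothetical region containing two loop nests; following the quoted definition, each nest would typically become its own kernel, the second launched after the first completes:

```c
/* Two loop nests in one kernels region: conceptually kernel 1 and
   kernel 2, run in order on the device.  Without -fopenacc the pragma
   is ignored and the loops run sequentially on the host.  */
void
two_kernels (float *a, float *b, int n)
{
#pragma acc kernels
  {
    for (int i = 0; i < n; i++)   /* kernel 1 */
      a[i] = 2.0f * a[i];
    for (int i = 0; i < n; i++)   /* kernel 2, depends on kernel 1 */
      b[i] = a[i] + 1.0f;
  }
}
```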

AFAIU the definition, all the loops in a kernels region will be offloaded.

So, since you know all loops will be (part of) kernels, in principle you _could_ first split off the loop, and then do the analysis (though it would be a bit more complicated, since modifying the loop in the split-off function will mean modifying the calling parameters in another function).

I solved (well, rather prevented) the alias problem by disabling
pass_omp_expand for GIMPLE_OACC_KERNELS, in other words disabling the
function-split-off in pass_omp_expand and letting pass_oacc_par take care of
that (This is what I meant with: 'postpone splitting of the function until the
end of pass_oacc_par').
Doing so required me to write a patch to handle omp-lowered code
conservatively in ccp and forwprop, otherwise the 'rewrite region in terms of
rewrite vars' would be undone by the time we arrive at pass_oacc_par.

Ah.  Well, yes.  I would say you might be able to make autopar not
do the split-off but leave it to a further omp-expand pass (it
uses the OMP machinery anyway).  Both OACC and autopar can share
the actual function split-off.

I would be happily accepting splitting the current autopar pass
that way, that is, do

NEXT_PASS (pass_parallelize_loops)
PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
NEXT_PASS (pass_expand_omp)

and make the analysis code handle lowered OMP form.

To explore that, I created a tentative patch on top of the gomp-4_0-branch, which allows a non-bootstrap build and a gcc dg.exp run, so at least a couple of parloops test-cases. I can put this through bootstrap and reg-test if you confirm this patch is what you want.

I'm not sure though OACC and autopar can share the actual function split-off. autopar is run rather late, later than the lto-stream point, while we need the split-off done before that for oacc. I'm also not sure what the point would be to have lowered OMP form in all those passes in between, I'd think you want to omp-expand it asap.

Btw, did you see my recent patch proposals on persistent dependence
information?  (the "Restrict, take 42" ones?)

Didn't see it before, but read the discussion part now.

It would be nice
if the OMP lowering code would annotate memory references with
(non-)dependence information it has, so that it more easily survives
the function split-off.

Right, that will probably be helpful at some point.

For the moment, I'm dealing with the following problem.

I'm trying to implement reductions in an oacc kernels region, starting from the vries/oacc-kernels branch.

The omp-lowering has code to deal with reductions for loops which are marked as reductions with oacc directives.

OTOH, the parloops pass has code to detect reductions.

I can't just use the detection code from parloops in/before omp-lowering because omp-lowering works on pre-ssa, while parloops works on ssa. Also, we need alias analysis to do the analysis, and we don't have that available at that point.

Parloops also has lowering code, but AFAIU the lowering done in parloops is different from the one done in omp-lowering. AFAIU there's one extra level of indirection in oacc-lowered code, so that code doesn't play well with the other oacc code.

So I don't see a quick fix here. I'm pondering atm the following:
- rewrite the reduction detection code from parloops to work on non-ssa,
  ignoring the alias analysis aspect, and using that in/before omp-lowering
  to mark a loop as reduction, which will trigger the omp-lowering.
- in parloops, do the alias analysis part on the loop. If that checks out ok,
  split off the loop as parallel reduction loop. If not, split off the loop
  (due to data locality issues) but run it sequential.
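For reference, the kind of loop at issue is something like this (a minimal sketch): a sum reduction inside a kernels region, which the detection code has to recognize as a reduction before the loop can be split off as a parallel reduction loop:

```c
/* A sum reduction in a kernels region.  Parallelizing it means each
   gang/thread accumulates a partial sum into a private copy of s,
   and the partial sums are combined after the loop.  Without
   -fopenacc the pragma is ignored and the loop runs sequentially.  */
float
sum_kernels (float *a, int n)
{
  float s = 0.0f;
#pragma acc kernels
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}
```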

Any comments, or ideas how to do this better?

- Tom


Some advice on how to continue from here would be *highly* appreciated. My plan
atm is to investigate the last option.


I've investigated the last option, and published the current state in the
vries/oacc-kernels branch.

The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35 is:
- a simple loop marked with the oacc kernels directive is analyzed for
  parallelization
- the loop is then rewritten using oacc parallel and oacc loop directives
- these oacc directives are expanded using omp_expand_local
- this results in the loop being split off into a separate function, while
  the loop is replaced with a GOACC_parallel call
- all this is done before writing out the lto stream
- no support yet for reductions, nested loops, or more than one loop nest in
  a kernels region

At toplevel, the added pass list looks like this:
            NEXT_PASS (pass_build_ealias);
            /* Pass group that runs when there are oacc kernels in the
               function.  */

Not sure why pass_oacc_kernels runs before all the other local
cleanups?  I would have put it after pass_cd_dce at least.

My focus was on running pass_oacc_kernels ASAP, in order not to have to adapt
more passes to leave the omp-lowered code alone. I'll give your suggestion a
try.

            NEXT_PASS (pass_oacc_kernels);
            PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
                NEXT_PASS (pass_ch_oacc_kernels);
                NEXT_PASS (pass_tree_loop_init);
                NEXT_PASS (pass_lim);
                NEXT_PASS (pass_ccp);
                NEXT_PASS (pass_parallelize_loops_oacc_kernels);
                NEXT_PASS (pass_tree_loop_done);
            POP_INSERT_PASSES ()

The main question I'm currently facing is the following: when to do the omp-lowering
(in other words, the rewriting of variable accesses in terms of .omp_data) of the
kernels region. There are basically 2 passes that contain code to do this:
- pass_lower_omp (on pre-ssa code)
- pass_parallelize_loops (on ssa code)

Both use the same utilities.

I think you mean that both passes use the same utilities to do omp-expand (in
other words, pass_parallelize_loops uses omp_expand_local).
But AFAIU, the omp-lowering in pass_parallelize_loops (in particular, the
rewrite of the region in terms of rewrite vars) shares no code with the
omp-lowering done in pass_lower_omp.

Atm I'm using pass_lower_omp, and I've added a patch that handles omp-lowered
code conservatively in ccp and forwprop in order for the lowering to survive
until arriving at pass_parallelize_loops_oacc_kernels.

You mean omp-_un_-lowered code?

No, I mean pass_omp_lower lowers the code into omp-lowered code, and the patch
in question prevents ccp and forwprop from undoing the lowering before
arriving at the point where we split off the function.

But it might turn out to be easier/necessary to handle this in
pass_parallelize_loops_oacc_kernels instead.

I'd do it similar to how autopar does it

OK, I'll try then to do the lowering for the kernels region in
pass_parallelize_loops_oacc_kernels, not in pass_omp_lower.

FWIW, I'm looking now into reductions, and started thinking in the same
direction.
(not that autopar is a great
example for a GCC pass these days...).

For my understanding, could you briefly elaborate on that (or give a reference
to an earlier discussion)?

- Tom


Any advice on this issue, and on the current implementation is welcome.

- Tom

2014-09-24  Tom de Vries  <>

	* function.h (struct function): Add omp_expand_needed field.
	* omp-low.c (pass_data pass_data_expand_omp_ssa): New pass_data.
	(class pass_expand_omp_ssa): New pass.
	(make_pass_expand_omp_ssa): New function.
	* passes.def (pass_parallelize_loops): Use PUSH_INSERT_PASSES_WITHIN
	instead of NEXT_PASS.
	(pass_expand_omp_ssa): Add after pass_parallelize_loops.
	* tree-parloops.c (gen_parallel_loop): Remove call to omp_expand_local.
	(pass_parallelize_loops::execute): Don't do cleanups TODO_cleanup_cfg
	and TODO_rebuild_alias yet.  Add TODO_update_ssa.  Set
	omp_expand_needed.
	* tree-pass.h (make_pass_expand_omp_ssa): Declare.
 gcc/function.h      |  3 +++
 gcc/omp-low.c       | 42 ++++++++++++++++++++++++++++++++++++++++++
 gcc/passes.def      |  3 +++
 gcc/tree-parloops.c | 13 +++++--------
 gcc/tree-pass.h     |  1 +
 5 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/gcc/function.h b/gcc/function.h
index a8294b2..82f9230 100644
--- a/gcc/function.h
+++ b/gcc/function.h
@@ -670,6 +670,9 @@ struct GTY(()) function {
   /* Set when the tail call has been identified.  */
   unsigned int tail_call_marked : 1;
+  /* Set when an omp_expand is needed.  */
+  unsigned int omp_expand_needed : 1;
 /* Add the decl D to the local_decls list of FUN.  */
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index ce97a0e..eeeb867 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -9356,6 +9356,48 @@ make_pass_expand_omp (gcc::context *ctxt)
   return new pass_expand_omp (ctxt);
 }
+namespace {
+
+const pass_data pass_data_expand_omp_ssa =
+{
+  GIMPLE_PASS, /* type */
+  "ompexpssa", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg | PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_cleanup_cfg | TODO_rebuild_alias, /* todo_flags_finish */
+};
+
+class pass_expand_omp_ssa : public gimple_opt_pass
+{
+public:
+  pass_expand_omp_ssa (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_expand_omp_ssa, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return (bool)cfun->omp_expand_needed; }
+  virtual unsigned int execute (function *)
+  {
+    unsigned res = execute_expand_omp ();
+    cfun->omp_expand_needed = 0;
+    return res;
+  }
+}; // class pass_expand_omp_ssa
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_expand_omp_ssa (gcc::context *ctxt)
+{
+  return new pass_expand_omp_ssa (ctxt);
+}
 /* Helper function to preform, potentially COMPLEX_TYPE, operation and
    convert it to gimple.  */
diff --git a/gcc/passes.def b/gcc/passes.def
index f13df6c..1ef9b26 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -222,6 +222,9 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_iv_canon);
 	  NEXT_PASS (pass_parallelize_loops);
+	  PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_if_conversion);
 	  /* pass_vectorize must immediately follow pass_if_conversion.
 	     Please do not add any other passes in between.  */
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 112c295..085bef2 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -1895,13 +1895,6 @@ gen_parallel_loop (struct loop *loop,
      removed statements.  */
   FOR_EACH_LOOP (loop, 0)
     free_numbers_of_iterations_estimates_loop (loop);
-  /* Expand the parallel constructs.  We do it directly here instead of running
-     a separate expand_omp pass, since it is more efficient, and less likely to
-     cause troubles with further analyses not being able to deal with the
-     OMP trees.  */
-  omp_expand_local (parallel_head);
 /* Returns true when LOOP contains vector phi nodes.  */
@@ -2273,7 +2266,11 @@ pass_parallelize_loops::execute (function *fun)
     return 0;
   if (parallelize_loops ())
-    return TODO_cleanup_cfg | TODO_rebuild_alias;
+    {
+      fun->omp_expand_needed = 1;
+      return TODO_update_ssa;
+    }
   return 0;
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 52f2e13..1919c79 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -394,6 +394,7 @@ extern gimple_opt_pass *make_pass_lower_vector_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lower_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_strlen (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_fold_builtins (gcc::context *ctxt);
