This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[gomp4] OpenACC vector and worker reductions


This patch adds support for OpenACC vector and worker reductions in a
target-independent fashion. It adds quite a bit of machinery to
accomplish that goal. For starters, three internal functions,
GOACC_REDUCTION_INIT, GOACC_REDUCTION and GOACC_REDUCTION_WRITEBACK,
have been introduced. It's probably easiest to explain all of the
changes with an example. Given an acc loop reduction as follows

  red = ...

  #pragma acc loop reduction (+:red) vector
  for (...)
    red++;

the OpenMP way to lower this reduction would be to introduce a new
private variable for 'red', which I'll call red.private. That private
reduction variable gets initialized with some value depending on the
reduction operation. All of the references to the original reduction
variable inside the loop get replaced with the private copy. Immediately
after the loop exits, the original reduction variable is atomically
updated with the private copy.

The code ends up looking something as follows:

  red = ...
  red.private = 0;   // initialize red.internal
  #pragma omp for (...)
    red.internal++;
  #pragma omp continue
    red += red.private // this is an atomic operation
  #pragma omp end

Conceptually, this loop may be decomposed into three sections. The first
section is the reduction initializer, the second is the loop, and the
third is the reduction finalizer.

This get a little more complicated in OpenACC. For starters, there are
three levels of parallelism that may be associated with a single acc
loop. When transferring from one level of parallelism to another, some
targets (e.g. nvptx) may require variable state propagation and
predication due to the constraints of static thread scheduling. Nathan
solved that problem, at least from a high-level, by surrounding acc
loops with GOACC_FORK and GOACC_JOIN function markers.

Furthermore, certain targets have hardware limitations preventing
general atomic operations from being utilized. Specifically, spinlocks
may not be used by threads inside the same warp for nvptx targets. In
gcc 6.0, warps corresponds to vectors, which currently contain 32
threads. That said, spinlocks are usable on nvptx targets if only one
thread within a warp is using it. This patch solves this problem by
breaking up the reduction finalizer into two steps -- a parallel
reduction (a call to GOACC_REDUCTION) and a write-back to the original
variable. In OpenACC, the original loop gets lowered into the following
form:

  red = ...
  red.private = GOACC_REDUCTION_INIT (0)
  GOACC_FORK ()
  #pragma omp for (...)
    red.internal++;
  #pragma omp continue
    red.private = GOACC_REDUCTION (gwv_mask, op, red.private)
    GOACC_WRITEBACK ()
    red += red.private // this is an atomic operation
  #pragma omp end
  GOACC_JOIN ()

First of all, the call to GOACC_REDUCTION_INIT is necessary to ensure
that red.private has a value to propagate to all of the threads
associated with that loop. Without it, in situations where there are
more threads than loop iterations, the threads that didn't enter the
body of the loop would not contain a proper initial value, so the
reduction finalizer would be generating bogus results.

Both GOACC_REDUCTION and GOACC_WRITEBACK get evaluated inside the target
compiler by a new fold_oacc_reductions pass. That pass uses
targetm.goacc.fold_reduction to fold GOACC_REDUCTION in a
target-specific way. That pass also removes the GOACC_WRITEBACK marker
and moves the nearest GOACC_JOIN call at it's place if necessary
(worker-only loops are special). This is guaranteed to work because
OpenACC loops are single-entry, single-exit and there is only one
GOACC_WRITEBACK marker per acc loop (there is one GOACC_REDUCTION per
reduction though). Moving the GOACC_JOIN up allows the reduction
write-back to operate in a corresponds 'single' mode. E.g. since this
example executes the body in vector-partitioned mode, the original
reduction variable must be updated in vector-single mode.

There's one more quirk that I encountered while working on this patch.
All dummy args to fortran subroutine are passed by reference. That
causes problems for loop state propagation, because only the pointer
gets propagated, and not the value being pointed to. To get around this,
I taught the gimplifier to introduce a new local copy of the reduction
variable. Now the reduction clause has five operands associated with it,
with the fifth one being new private reduction variable.

In addition to the above machinery, this patch also implements the
fold_reduction hook on nvptx targets to use a tree-reduction for vector
loops. All other reductions on nvptx targets use atomics.

I hopefully ironed out all of the bugs in this patch, but I am rerunning
the entire regression testsuite again. Any comments are welcome. Is this
reduction scheme too nvptx-specific?

I'll post the test cases in a follow up patch because the patch would be
too big for the mailing list otherwise.

Thanks,
Cesar
2015-07-17  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_generate_vector_shuffle): New function.
	(nvptx_goacc_fold_reduction): New function.
	(TARGET_GOACC_FOLD_REDUCTION): Define hook.
	* doc/tm.texi: Document TARGET_GOACC_FOLD_REDUCTION.
	* doc/tm.texi.in: Add hook for TARGET_GOACC_FOLD_REDUCTION.
	* gimplify.c (struct privatize_reduction): Declare.
	(localize_reductions_r): New function.
	(localize_reductions): New function.
	(gimplify_omp_for): Call localize_reductions for acc loops.
	* internal-fn.c (expand_GOACC_REDUCTION): New function.
	(expand_GOACC_REDUCTION_INIT): New function.
	(expand_GOACC_REDUCTION_WRITEBACK): New function.
	* internal-fn.def (GOACC_REDUCTION): New internal function.
	(GOACC_REDUCTION_INIT): New internal function.
	(GOACC_REDUCTION_WRITEBACK): New internal function.
	* omp-low.c (lower_rec_input_clauses): Use GOACC_REDUCTION_INIT for
	OpenACC reductions.
	(lower_oacc_reductions): New function.
	(lower_reduction_clauses): Use lower_oacc_reductions for OpenACC
	reductions.
	(find_goacc_join): New function.
	(find_enclosing_join): New function.
	(execute_fold_oacc_reductions): New function.
	(class pass_fold_oacc_reductions): New pass.
	(make_pass_fold_oacc_reductions): New function.
	(default_goacc_fold_reduction): New function.
	* optabs.def (oacc_thread_broadcast_optab): Remove.
	* passes.def (pass_fold_oacc_reductions): Use it.
	* target.def (fold_reduction): New target hook.
	* targhooks.h (default_goacc_fold_reduction): Declare.
	* tree-core.h (enum omp_clause_code): Document argument 4 of
	OMP_CLAUSE_REDUCTION.
	* tree-pass.h (make_pass_oacc_fold_reductions): Declare.
	* tree.c (omp_clause_num_ops): Increase the number of reduction clause
	operands by one.
	* tree.h (OMP_CLAUSE_REDUCTION_PRIVATE_DECL): New macro.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b471890..cdfdf00 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -59,6 +59,15 @@
 #include "dominance.h"
 #include "cfg.h"
 #include "omp-low.h"
+#include "fold-const.h"
+#include "stringpool.h"
+#include "internal-fn.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "gimple-ssa.h"
+#include "tree-phinodes.h"
+#include "ssa-iterators.h"
+#include "tree-ssanames.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -3190,6 +3199,124 @@ nvptx_expand_builtin (tree exp, rtx target ATTRIBUTE_UNUSED,
 
   gcc_unreachable ();
 }
+
+static void
+nvptx_generate_vector_shuffle (tree dest_var, tree var, int shfl,
+			       gimple_stmt_iterator *where)
+{
+  tree vartype = TREE_TYPE (var);
+  tree call_arg_type = unsigned_type_node;
+  tree_code ccode = SCALAR_FLOAT_TYPE_P (vartype)
+    ? VIEW_CONVERT_EXPR : NOP_EXPR;
+  enum nvptx_builtins fn = NVPTX_BUILTIN_SHUFFLE_DOWN;
+
+  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
+    {
+      fn = NVPTX_BUILTIN_SHUFFLE_DOWNLL;
+      call_arg_type = long_long_unsigned_type_node;
+    }
+
+  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
+  tree casted_var = var;
+
+  if (need_conversion)
+    {
+      casted_var = make_ssa_name (call_arg_type);
+      tree t1 = fold_build1 (ccode, call_arg_type, var);
+      gassign *conv1 = gimple_build_assign (casted_var, t1);
+      update_stmt (conv1);
+      gsi_insert_before (where, conv1, GSI_SAME_STMT);
+    }
+
+  tree fndecl = nvptx_builtin_decl (fn, true);
+  tree shift =  build_int_cst (integer_type_node, shfl);
+  gimple call = gimple_build_call (fndecl, 2, casted_var, shift);
+
+  gsi_insert_before (where, call, GSI_SAME_STMT);
+  tree casted_dest = dest_var;
+
+  if (need_conversion)
+    {
+      casted_dest = make_ssa_name (call_arg_type);
+      tree t2 = fold_build1 (ccode, vartype, casted_dest);
+      gassign *conv2 = gimple_build_assign (dest_var, t2);
+      gsi_insert_before (where, conv2, GSI_SAME_STMT);
+      update_stmt (conv2);
+    }
+
+  update_stmt (call);
+  gimple_call_set_lhs (call, casted_dest);
+}
+
+/* Fold an OpenACC vector reduction shuffle down instructions.  */
+
+static void
+nvptx_goacc_fold_reduction (gimple_stmt_iterator *gsi)
+{
+  /* Generate a sequence of instructions to preform a tree reduction using
+     shfl.down as an intermediate step.  */
+
+  gimple call = gsi_stmt (*gsi);
+  tree arg0 = gimple_call_arg (call, 0); // loop mask
+  tree arg1 = gimple_call_arg (call, 1); // reduction op
+  tree arg2 = gimple_call_arg (call, 2); // reduction decl
+  tree type = TREE_TYPE (arg2);
+  unsigned level = TREE_INT_CST_LOW (arg0);
+  enum tree_code code = (enum tree_code) TREE_INT_CST_LOW (arg1);
+  tree lhs = gimple_call_lhs (call);
+
+  /* Nothing to do here is this isn't a vector loop.  */
+  if ((level & OACC_LOOP_MASK (OACC_vector)) == 0)
+    {
+      gassign *g = gimple_build_assign (lhs, arg2);
+      gsi_replace (gsi, g, true);
+      return;
+    }
+
+  tree new_var = arg2;
+  tree t, t2;
+  gassign *g;
+
+  if (code == TRUTH_ANDIF_EXPR)
+    code = BIT_AND_EXPR;
+  else if (code == TRUTH_ORIF_EXPR)
+    code = BIT_IOR_EXPR;
+
+  if (!is_gimple_val (arg0))
+    {
+      new_var = make_ssa_name (type);
+      gassign *g = gimple_build_assign (new_var, arg2);
+      gsi_insert_before (gsi, g, GSI_SAME_STMT);
+    }
+
+  for (int shfl = 16; shfl > 0; shfl = shfl >> 1)
+    {
+      t = make_ssa_name (type);
+      nvptx_generate_vector_shuffle (t, new_var, shfl, gsi);
+      t2 = make_ssa_name (create_tmp_var (type));
+
+      g = gimple_build_assign (t2, fold_build2 (code, type, new_var, t));
+      update_stmt (g);
+      gsi_insert_before (gsi, g, GSI_SAME_STMT);
+
+      new_var = t2;
+    }
+
+  /* Restore the type of the comparison operand.  */
+  if (code == EQ_EXPR || code == NE_EXPR)
+    {
+      type = TREE_TYPE (lhs);
+      t = make_ssa_name (type);
+      t2 = fold_build1 (NOP_EXPR, type, new_var);
+      gassign *g = gimple_build_assign (t, t2);
+      gsi_insert_before (gsi, g, GSI_SAME_STMT);
+      new_var = t;
+    }
+
+  g = gimple_build_assign (lhs, new_var);
+  gsi_replace (gsi, g, false);
+  update_stmt (g);
+}
 
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
@@ -3285,6 +3412,9 @@ nvptx_expand_builtin (tree exp, rtx target ATTRIBUTE_UNUSED,
 #undef  TARGET_BUILTIN_DECL
 #define TARGET_BUILTIN_DECL nvptx_builtin_decl
 
+#undef TARGET_GOACC_FOLD_REDUCTION
+#define TARGET_GOACC_FOLD_REDUCTION nvptx_goacc_fold_reduction
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 3dc51c0..cc42998 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5740,6 +5740,14 @@ usable.  In that case, the smaller the number is, the more desirable it is
 to use it.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_GOACC_FOLD_REDUCTION (gimple_stmt_iterator *@var{gsi})
+This hook is used to expand OpenACC reductions represented by calls to
+the internal function @var{GOACC_REDUCTION} into a sequence of gimple
+instructions.  @var{gsi} points to the gimple statement holding the
+ function call.  By default, targets are assumed to be single-threaded
+although that is not a requirement.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 93fb41c..0936516 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4245,6 +4245,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_SIMD_CLONE_USABLE
 
+@hook TARGET_GOACC_FOLD_REDUCTION
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index f5ec04a..833e469 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -181,6 +181,11 @@ struct gimplify_omp_ctx
   bool distribute;
 };
 
+struct privatize_reduction
+{
+  tree ref_var, local_var;
+};
+
 static struct gimplify_ctx *gimplify_ctxp;
 static struct gimplify_omp_ctx *gimplify_omp_ctxp;
 
@@ -7292,6 +7297,97 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *)
   return NULL_TREE;
 }
 
+/* Helper function for localize_reductions.  Replace all uses of REF_VAR with
+   LOCAL_VAR.  */
+
+static tree
+localize_reductions_r (tree *tp, int *walk_subtrees, void *data)
+{
+  enum tree_code tc = TREE_CODE (*tp);
+  struct privatize_reduction *pr = (struct privatize_reduction *) data;
+
+  if (TYPE_P (*tp))
+    *walk_subtrees = 0;
+
+  switch (tc)
+    {
+    case INDIRECT_REF:
+    case MEM_REF:
+      if (TREE_OPERAND (*tp, 0) == pr->ref_var)
+	*tp = pr->local_var;
+
+      *walk_subtrees = 0;
+      break;
+
+    case VAR_DECL:
+    case PARM_DECL:
+    case RESULT_DECL:
+      if (*tp == pr->ref_var)
+	*tp = pr->local_var;
+
+      *walk_subtrees = 0;
+      break;
+
+    default:
+      break;
+    }
+
+  return NULL_TREE;
+}
+
+/* OpenACC worker and vector loop state propagation requires reductions
+   to be inside local variables.  This function replaces all reference-type
+   reductions variables associated with the loop with a local copy.  */
+
+static void
+localize_reductions (tree *expr_p)
+{
+  tree clauses = OMP_FOR_CLAUSES (*expr_p);
+  tree c, var, type, new_var;
+  struct privatize_reduction pr;
+  int gwv_cur = 0;
+  int mask_wv = OACC_LOOP_MASK (OACC_worker) | OACC_LOOP_MASK (OACC_vector);
+
+  /* Non-vector and worker reduction do not need to be localized.  */
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      enum omp_clause_code cc = OMP_CLAUSE_CODE (c);
+
+      if (cc == OMP_CLAUSE_GANG)
+	gwv_cur |= OACC_LOOP_MASK (OACC_gang);
+      else if (cc == OMP_CLAUSE_WORKER)
+	gwv_cur |= OACC_LOOP_MASK (OACC_worker);
+      else if (cc == OMP_CLAUSE_VECTOR)
+	gwv_cur |= OACC_LOOP_MASK (OACC_vector);
+    }
+
+  if (!(gwv_cur & mask_wv))
+    return;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
+      {
+	var = OMP_CLAUSE_DECL (c);
+
+	if (!lang_hooks.decls.omp_privatize_by_reference (var))
+	  {
+	    OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c) = NULL;
+	    continue;
+	  }
+
+	type = TREE_TYPE (TREE_TYPE (var));
+	new_var = create_tmp_var (type);
+
+	pr.ref_var = var;
+	pr.local_var = new_var;
+
+	walk_tree (expr_p, localize_reductions_r, &pr, NULL);
+
+	OMP_CLAUSE_DECL (c) = var;
+	OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c) = new_var;
+      }
+}
+
 /* Gimplify the gross structure of an OMP_FOR statement.  */
 
 static enum gimplify_status
@@ -7330,6 +7426,9 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
       gcc_unreachable ();
     }
 
+  if (ork == ORK_OACC)
+    localize_reductions (expr_p);
+
   /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
      clause for the IV.  */
   if (simd && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) == 1)
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 0a2c9a1..bc6f23e 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -46,6 +46,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stringpool.h"
 #include "tree-ssanames.h"
 #include "diagnostic-core.h"
+#include "recog.h"
 
 /* The names of each internal function, indexed by function number.  */
 const char *const internal_fn_name_array[] = {
@@ -1984,6 +1985,42 @@ expand_GOACC_JOIN (gcall *stmt ATTRIBUTE_UNUSED)
 #endif
 }
 
+/* GOACC_REDUCTION is supposed to be expanded at pass_fold_reductions.
+   So this dummy function should never be called.  */
+
+static void
+expand_GOACC_REDUCTION (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+/* This is an optimization barrier.  Propagate call arg0 to the LHS.  */
+
+static void
+expand_GOACC_REDUCTION_INIT (gcall *stmt)
+{
+  tree lhs, arg0;
+  rtx target, val;
+
+  lhs = gimple_call_lhs (stmt);
+  arg0 = gimple_call_arg (stmt, 0);
+  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  val = expand_expr (arg0, target, VOIDmode, EXPAND_NORMAL);
+  emit_move_insn (target, val);
+}
+
+/* GOACC_REDUCTION_WRITEBACK is used as a marker to denote the transition of
+   the execution engine entering into a single-threaded mode from a thread-
+   partitioned mode.  The code following this marker is responsible for
+   updating the original reduction variable.  This function is expanded during
+   fold_oacc_reductions.  */
+
+static void
+expand_GOACC_REDUCTION_WRITEBACK (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index c3374d6..ddd63c9 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -66,3 +66,6 @@ DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r")
 DEF_INTERNAL_FN (GOACC_FORK, ECF_NOTHROW | ECF_LEAF, ".")
 DEF_INTERNAL_FN (GOACC_JOIN, ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_REDUCTION, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_INIT, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_WRITEBACK, ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 37b853f..e58394c 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -82,6 +82,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "gomp-constants.h"
 #include "gimple-pretty-print.h"
 #include "set"
+#include "tree-ssa-propagate.h"
+#include "omp-low.h"
 
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
@@ -4394,7 +4396,39 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
 			handle_simd_reference (clause_loc, new_vard, ilist);
 		      else if (is_oacc_parallel (ctx) && is_reference (var))
 			new_var = build_simple_mem_ref (new_var);
-		      gimplify_assign (new_var, x, ilist);
+
+		      /* OpenACC loops may require loop state propagation.
+			 Using an function call for the reduction initializer
+			 ensures that the initial value for the private
+			 reduction variable is propagated to all of the
+			 threads inside a loop.  */
+		      if (is_gimple_omp_oacc (ctx->stmt)
+			  && (ctx->gwv_this &
+			      (OACC_LOOP_MASK (OACC_worker)
+			       | OACC_LOOP_MASK (OACC_vector))))
+			{
+			  tree t = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
+			  if (t == NULL)
+			    t = new_var;
+			  gcall *call = gimple_build_call_internal
+			    (IFN_GOACC_REDUCTION_INIT, 1, x);
+			  if (TREE_CODE (t) != INDIRECT_REF)
+			    {
+			      gimple_call_set_lhs (call, t);
+			      gimple_seq_add_stmt (ilist, call);
+			    }
+			  else
+			    {
+			      tree x = create_tmp_var (TREE_TYPE (t));
+			      gimplify_assign (x, t, ilist);
+			      gimple_call_set_lhs (call, x);
+			      gimple_seq_add_stmt (ilist, call);
+			      gimplify_assign (t, x, ilist);
+			    }
+			}
+		      else
+			  gimplify_assign (new_var, x, ilist);
+
 		      if (is_simd)
 			{
 			  tree ref = build_outer_var_ref (var, ctx);
@@ -4746,6 +4780,158 @@ expand_oacc_get_thread_num (gimple_seq *seq, int gwv_bits)
   return res;
 }
 
+static void
+lower_oacc_reductions (tree clauses, gimple_seq *stmt_seqp, omp_context *ctx)
+{
+  int loop_flags = 0;
+  tree tlevel, c, x, atom;
+  gcall *call;
+  gimple stmt;
+  int reductions = 0;
+  bool use_atomics = false;
+  bool atomic_compatible = true;
+  hash_map<tree, tree> ired_map;
+
+  /* GWV_THIS contains the current level of parallelism the loop nest.
+     Extract the level of parallelism only associated with the current
+     loop.  */
+  if (is_gimple_omp_oacc (ctx->stmt))
+    {
+      omp_context *outer = ctx->outer;
+
+      if (outer && gimple_code (outer->stmt) != GIMPLE_OMP_FOR)
+	outer = NULL;
+
+      loop_flags = outer ? ctx->gwv_this & (~outer->gwv_this)
+	: ctx->gwv_this;
+    }
+
+  /* OpenACC reduction finalizers operate in two stages.  The first
+     stage combines all of the partial reductions values together in
+     a 'partitioned' execution mode.  The second stage updates the
+     original or intermediate reduction variable in a 'single' execution
+     mode.
+
+     The internal function GOACC_REDUCTION handles the first stage, and
+     GOACC_REDUCTION_WRITEBACK acts as a marker for the second stage.
+     Later on, fold_oacc_reductions will move all of the code following
+     GOACC_REDUCTION_WRITEBACK immediately after the nearest GOACC_JOIN.  */
+
+  /* Phase 1: vectorize the reductions.  */
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      tree ired;  // intermediate reduction variable
+      tree var;   // reduction clause decl
+      tree tcode;
+
+      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_REDUCTION)
+	continue;
+
+      tcode
+	= build_int_cst (integer_type_node, OMP_CLAUSE_REDUCTION_CODE (c));
+
+      var = OMP_CLAUSE_DECL (c);
+      ired = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
+      if (ired == NULL_TREE)
+	ired = lookup_decl (var, ctx);
+
+      if (is_reference (ired))
+	ired = build_simple_mem_ref (ired);
+
+      if (!is_gimple_reg (ired))
+	{
+	  tree t = create_tmp_var (TREE_TYPE (ired));
+	  gimplify_assign (t, ired, stmt_seqp);
+	  ired = t;
+	}
+
+      ired_map.put (var, ired);
+
+      tlevel = build_int_cst (unsigned_type_node, loop_flags);
+
+      call = gimple_build_call_internal (IFN_GOACC_REDUCTION, 3, tlevel,
+					 tcode, ired);
+      gimple_call_set_lhs (call, ired);
+      gimple_seq_add_stmt (stmt_seqp, call);
+
+      if (!is_atomic_compatible_reduction (var, ctx))
+	atomic_compatible = false;
+
+      reductions++;
+    }
+
+  if (reductions == 0)
+    return;
+
+  /* Phase 2: Update the original reduction variable.  */
+
+  /* Insert the marks for the reduction writeback here.  */
+  call = gimple_build_call_internal (IFN_GOACC_REDUCTION_WRITEBACK, 1,
+				     tlevel);
+  gimple_seq_add_stmt (stmt_seqp, call);
+
+  use_atomics = atomic_compatible
+    && (loop_flags & ~OACC_LOOP_MASK (OACC_vector)) != 0;
+
+  /* Use a spin-lock if multiple reductions are involved.  */
+  if (!atomic_compatible || (reductions > 1 && use_atomics))
+    {
+      atom = builtin_decl_explicit (BUILT_IN_GOMP_ATOMIC_START);
+      stmt = gimple_build_call (atom, 0);
+      gimple_seq_add_stmt (stmt_seqp, stmt);
+    }
+
+  /* Lower individual reduction writebacks.  */
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      tree ired;             // intermediate reduction variable
+      tree ored, ored_addr;  // original reduction variable
+      tree var;              // reduction clause decl
+      enum tree_code tcode;
+
+      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_REDUCTION)
+	continue;
+
+      tcode = OMP_CLAUSE_REDUCTION_CODE (c);
+      if (tcode == MINUS_EXPR)
+        tcode = PLUS_EXPR;
+
+      var = OMP_CLAUSE_DECL (c);
+      ired = *ired_map.get (var);
+      ored = build_outer_var_ref (var, ctx);
+
+      if (use_atomics && reductions == 1)
+	{
+	  ored_addr = build_fold_addr_expr (ored);
+	  ored_addr = save_expr (ored_addr);
+
+	  ored = build1 (INDIRECT_REF, TREE_TYPE (TREE_TYPE (ored_addr)),
+			 ored_addr);
+	  x = fold_build2 (tcode, TREE_TYPE (ored), ored, ired);
+	  x = build2 (OMP_ATOMIC, void_type_node, ored_addr, x);
+	  gimplify_and_add (x, stmt_seqp);
+	}
+      else
+	{
+	  tree t = ored;
+	  if (is_reference (ored) && !is_reference (ired))
+	    {
+	      t = create_tmp_var (TREE_TYPE (ired));
+	      gimplify_assign (t, ired, stmt_seqp);
+	    }
+	  x = build2 (tcode, TREE_TYPE (ired), t, ired);
+	  gimplify_assign (ored, x, stmt_seqp);
+	}
+    }
+
+  if (!atomic_compatible || (reductions > 1 && use_atomics))
+    {
+      atom = builtin_decl_explicit (BUILT_IN_GOMP_ATOMIC_END);
+      stmt = gimple_build_call (atom, 0);
+      gimple_seq_add_stmt (stmt_seqp, stmt);
+    }
+}
+
 /* Generate code to implement the REDUCTION clauses.  OpenACC reductions
    are usually executed in parallel, but they fallback to sequential code for
    known single-threaded regions.  */
@@ -4758,6 +4944,13 @@ lower_reduction_clauses (tree clauses, gimple_seq *stmt_seqp, omp_context *ctx)
   tree x, c;
   int count = 0;
 
+  /* OpenACC loop reductions are handled elsewhere.  */
+  if (!is_oacc_parallel (ctx) && is_gimple_omp_oacc (ctx->stmt))
+    {
+      lower_oacc_reductions (clauses, stmt_seqp, ctx);
+      return;
+    }
+
   /* SIMD reductions are handled in lower_rec_input_clauses.  */
   if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
       && gimple_omp_for_kind (ctx->stmt) & GF_OMP_FOR_SIMD)
@@ -14394,4 +14587,188 @@ make_pass_late_lower_omp (gcc::context *ctxt)
   return new pass_late_lower_omp (ctxt);
 }
 
+static void
+find_goacc_join (gimple_stmt_iterator *gsi)
+{
+  gimple stmt;
+
+  while (!gsi_end_p (*gsi))
+    {
+      stmt = gsi_stmt (*gsi);
+
+      if (is_gimple_call (stmt) && gimple_call_internal_p (stmt)
+	  && gimple_call_internal_fn (stmt) == IFN_GOACC_JOIN)
+	return;
+
+      gsi_next (gsi);
+    }
+}
+
+static gimple_stmt_iterator
+find_enclosing_join (basic_block bb)
+{
+  basic_block son;
+  gimple_stmt_iterator gsi = gsi_start_bb (bb);
+
+  find_goacc_join (&gsi);
+  if (!gsi_end_p (gsi))
+    return gsi;
+
+  for (son = first_dom_son (CDI_DOMINATORS, bb);
+       son;
+       son = next_dom_son (CDI_DOMINATORS, son))
+    {
+      gsi = find_enclosing_join (son);
+      if (!gsi_end_p (gsi))
+	return gsi;
+    }
+
+  return gsi;
+}
+
+/* Main entry point for folding function calls for oacc reductions.  See
+   lower_oacc_reductions for a description on how the internal functions
+   are used.  */
+
+static unsigned int
+execute_fold_oacc_reductions ()
+{
+  basic_block bb;
+  gimple_stmt_iterator gsi, gsi_wb, gsi_join;
+  gimple stmt;
+  tree arg;
+  int reductions = 0;
+  int loop_mask = 0;
+
+  if (!lookup_attribute ("oacc function",
+			 DECL_ATTRIBUTES (current_function_decl)))
+    return 0;
+
+  free_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_DOMINATORS);
+
+  FOR_ALL_BB_FN (bb, cfun)
+    {
+      /* Pass 1: Fold GOACC_REDUCTION.  These calls are to be evaluated
+         by targetm.goacc.fold_reduction.  */
+      gsi = gsi_start_bb (bb);
+      reductions = 0;
+
+      while (!gsi_end_p (gsi))
+	{
+	  bool removed = false;
+	  stmt = gsi_stmt (gsi);
+
+	  if (is_gimple_call (stmt) && gimple_call_internal_p (stmt))
+	    {
+	      if (gimple_call_internal_fn (stmt) == IFN_GOACC_REDUCTION)
+		{
+		  targetm.goacc.fold_reduction (&gsi);
+		  stmt = gsi_stmt (gsi);
+		  reductions++;
+		  removed = true;
+		}
+	      else if (gimple_call_internal_fn (stmt)
+		       == IFN_GOACC_REDUCTION_WRITEBACK)
+		break;
+	    }
+
+	  if (!removed)
+	    gsi_next (&gsi);
+	}
+
+      /* Pass 2: Update the placement of the GOACC_JOINs using the
+         GOACC_REDUCTION_WRITEBACK markers for vector reductions.  */
+
+      if (reductions == 0)
+	continue;
+
+      arg = gimple_call_arg (stmt, 0);
+      loop_mask = TREE_INT_CST_LOW (arg);
+
+      /* Only vector reduction writebacks need to placed after the call
+	 to GOACC_JOIN.  */
+      if ((loop_mask & OACC_LOOP_MASK (OACC_vector)) == 0)
+	{
+	  gsi_remove (&gsi, true);
+	  continue;
+	}
+
+      gsi_wb = gsi;
+      gsi_join = find_enclosing_join (bb);
+
+      gcc_assert (!gsi_end_p (gsi_join));
+      stmt = gsi_stmt (gsi_join);
+
+      /* Replace the call go GOACC_REDUCTION_WRITEBACK with a call to
+	 GOACC_JOIN marker.  */
+      tree arg0 = gimple_call_arg (stmt, 0);
+      gcall *call = gimple_build_call_internal (IFN_GOACC_JOIN, 1, arg0);
+      gsi_replace (&gsi_wb, call, true);
+
+      /* Remove the original call to GOACC_JOIN.  */
+      gsi_remove(&gsi_join, true);
+    }
+
+  cleanup_tree_cfg ();
+  mark_virtual_operands_for_renaming (cfun);
+  update_ssa (TODO_update_ssa);
+  verify_ssa (true, true);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_oacc_reductions =
+{
+  GIMPLE_PASS, /* type */
+  "fold_oacc_reductions", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_fold_oacc_reductions : public gimple_opt_pass
+{
+public:
+  pass_fold_oacc_reductions (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_reductions, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual unsigned int execute (function *)
+    {
+      bool gate = (flag_openacc != 0 && !seen_error ());
+
+      if (!gate)
+	return 0;
+
+      return execute_fold_oacc_reductions ();
+    }
+
+}; // class pass_fold_oacc_reductions
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_fold_oacc_reductions (gcc::context *ctxt)
+{
+  return new pass_fold_oacc_reductions (ctxt);
+}
+
+void
+default_goacc_fold_reduction (gimple_stmt_iterator *gsi)
+{
+  gimple call = gsi_stmt (*gsi);
+  tree lhs = gimple_call_lhs (call);
+  tree rhs = gimple_call_arg (call, 2);
+  gassign *g = gimple_build_assign (lhs, rhs);
+
+  gsi_replace (gsi, g, true);
+}
+
 #include "gt-omp-low.h"
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6018971..888b21c 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -332,5 +332,3 @@ OPTAB_D (atomic_xor_optab, "atomic_xor$I$a")
 
 OPTAB_D (get_thread_pointer_optab, "get_thread_pointer$I$a")
 OPTAB_D (set_thread_pointer_optab, "set_thread_pointer$I$a")
-
-OPTAB_D (oacc_thread_broadcast_optab, "oacc_thread_broadcast$I$a")
diff --git a/gcc/passes.def b/gcc/passes.def
index 43e67df..abb598f 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -165,6 +165,7 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASSES_AFTER (all_passes)
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
+  NEXT_PASS (pass_fold_oacc_reductions);
   NEXT_PASS (pass_all_optimizations);
   PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations)
       NEXT_PASS (pass_remove_cgraph_callee_edges);
diff --git a/gcc/target.def b/gcc/target.def
index 4edc209..ecdeb74 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1639,6 +1639,22 @@ int, (struct cgraph_node *), NULL)
 
 HOOK_VECTOR_END (simd_clone)
 
+/* Functions relating to openacc.  */
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_GOACC_"
+HOOK_VECTOR (TARGET_GOACC, goacc)
+
+DEFHOOK
+(fold_reduction,
+"This hook is used to expand OpenACC reductions represented by calls to\n\
+the internal function @var{GOACC_REDUCTION} into a sequence of gimple\n\
+instructions.  @var{gsi} points to the gimple statement holding the\n\ function call.  By default, targets are assumed to be single-threaded\n\
+although that is not a requirement.",
+void, (gimple_stmt_iterator *gsi),
+default_goacc_fold_reduction)
+
+HOOK_VECTOR_END (goacc)
+
 /* Functions relating to vectorization.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_VECTORIZE_"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 5ae991d..8e3112d 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -107,6 +107,8 @@ extern unsigned default_add_stmt_cost (void *, int, enum vect_cost_for_stmt,
 extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
+extern void default_goacc_fold_reduction (gimple_stmt_iterator *);
+
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
 
diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index 3be9093..cd2a618 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -232,7 +232,9 @@ enum omp_clause_code {
      Operand 2: OMP_CLAUSE_REDUCTION_MERGE: Stmt-list to merge private var
                 into the shared one.
      Operand 3: OMP_CLAUSE_REDUCTION_PLACEHOLDER: A dummy VAR_DECL
-                placeholder used in OMP_CLAUSE_REDUCTION_{INIT,MERGE}.  */
+                placeholder used in OMP_CLAUSE_REDUCTION_{INIT,MERGE}.
+     Operand 4: OMP_CLAUSE_REDUCTION_PRIVATE_DECL: A private VAR_DECL of
+                the original DECL associated with the reduction clause.  */
   OMP_CLAUSE_REDUCTION,
 
   /* OpenMP clause: copyin (variable_list).  */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 13f20ea..ce31593 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -410,6 +410,7 @@ extern gimple_opt_pass *make_pass_late_lower_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_fold_oacc_reductions (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_strlen (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_fold_builtins (gcc::context *ctxt);
diff --git a/gcc/tree.c b/gcc/tree.c
index ff533a3..395f236 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -281,7 +281,7 @@ unsigned const char omp_clause_num_ops[] =
   1, /* OMP_CLAUSE_SHARED  */
   1, /* OMP_CLAUSE_FIRSTPRIVATE  */
   2, /* OMP_CLAUSE_LASTPRIVATE  */
-  4, /* OMP_CLAUSE_REDUCTION  */
+  5, /* OMP_CLAUSE_REDUCTION  */
   1, /* OMP_CLAUSE_COPYIN  */
   1, /* OMP_CLAUSE_COPYPRIVATE  */
   3, /* OMP_CLAUSE_LINEAR  */
diff --git a/gcc/tree.h b/gcc/tree.h
index 189c298..4479790 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1428,6 +1428,8 @@ extern void protected_set_expr_location (tree, location_t);
   (OMP_CLAUSE_CHECK (NODE))->omp_clause.gimple_reduction_merge
 #define OMP_CLAUSE_REDUCTION_PLACEHOLDER(NODE) \
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_REDUCTION), 3)
+#define OMP_CLAUSE_REDUCTION_PRIVATE_DECL(NODE) \
+  OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_REDUCTION), 4)
 
 /* True if a REDUCTION clause may reference the original list item (omp_orig)
    in its OMP_CLAUSE_REDUCTION_{,GIMPLE_}INIT.  */

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]