This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels


On 09/11/15 20:52, Tom de Vries wrote:
On 09/11/15 16:35, Tom de Vries wrote:
Hi,

this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.

The patch series contains these patches:

      1    Insert new exit block only when needed in
         transform_to_exit_first_loop_alt
      2    Make create_parallel_loop return void
      3    Ignore reduction clause on kernels directive
      4    Implement -foffload-alias
      5    Add in_oacc_kernels_region in struct loop
      6    Add pass_oacc_kernels
      7    Add pass_dominator_oacc_kernels
      8    Add pass_ch_oacc_kernels
      9    Add pass_parallelize_loops_oacc_kernels
     10    Add pass_oacc_kernels pass group in passes.def
     11    Update testcases after adding kernels pass group
     12    Handle acc loop directive
     13    Add c-c++-common/goacc/kernels-*.c
     14    Add gfortran.dg/goacc/kernels-*.f95
     15    Add libgomp.oacc-c-c++-common/kernels-*.c
     16    Add libgomp.oacc-fortran/kernels-*.f95

The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.

Bootstrapped and reg-tested on x86_64.

Build and reg-tested with nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).

I'll post the individual patches in reply to this message.

This patch adds pass_parallelize_loops_oacc_kernels.

There's a number of things we do differently in parloops for oacc kernels:
- in normal parloops, we generate code to choose between a parallel
   version of the loop, and a sequential (low iteration count) version.
   Since the code in oacc kernels region is supposed to run on the
   accelerator anyway, we skip this check, and don't add a low iteration
   count loop.
- in normal parloops, we generate an #pragma omp parallel /
   GIMPLE_OMP_RETURN pair to delimit the region which will we split off
   into a thread function. Since the oacc kernels region is already
   split off, we don't add this pair.
- we indicate the parallelization factor by setting the oacc function
   attributes
- we generate an #pragma oacc loop instead of an #pragma omp for, and
   we add the gang clause
- in normal parloops, we rewrite the variable accesses in the loop in
   terms into accesses relative to a thread function parameter. For the
   oacc kernels region, that rewrite has already been done at omp-lower,
   so we skip this.
- we need to ensure that the entire kernels region can be run in
   parallel. The loop independence check is already present, so for oacc
   kernels we add a check between blocks outside the loop and the entire
   region.
- we guard stores in the blocks outside the loop with gang_pos == 0.
   There's no need for each gang to write to a single location, we can
   do this in just one gang. (Typically this is the write of the final
   value of the iteration variable if that one is copied back to the
   host).


Reposting with loop optimizer init added in pass_parallelize_loops_oacc_kernels::execute.

Thanks,
- Tom
Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	* omp-low.c (expand_omp_atomic_fetch_op):  Release defs of update stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
        (create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New function.
	(parallelize_loops): Add and handle function parameter oacc_kernels_p.
	Calculate dominance info.  Skip loops that are not in a kernels region
	in oacc_kernels_p mode.  Skip inner loops of parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New function.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.

---
 gcc/omp-low.c       |   8 +-
 gcc/omp-low.h       |   1 +
 gcc/tree-parloops.c | 693 +++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-pass.h     |   2 +
 4 files changed, 640 insertions(+), 64 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index fba7bbd..9eae09a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11944,10 +11944,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
   gsi_remove (&gsi, true);
   gsi = gsi_last_bb (store_bb);
+  stmt = gsi_stmt (gsi);
   gsi_remove (&gsi, true);
 
   if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_no_phi);
+    {
+      release_defs (stmt);
+      update_ssa (TODO_update_ssa_no_phi);
+    }
 
   return true;
 }
@@ -12321,7 +12325,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS
    list, along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.  */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index 194b3d1..1790f40 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -33,6 +33,7 @@ extern tree omp_member_access_dummy_var (tree);
 extern void replace_oacc_fn_attrib (tree, tree);
 extern tree build_oacc_routine_dims (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..96b8415 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop. */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose result
 				   will be passed to the atomic operation.  Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
   tree tmp_load, name;
   gimple *load;
 
-  load_struct = build_simple_mem_ref (clsn_data->load);
-  t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+  if (reduc->reduc_addr == NULL_TREE)
+    {
+      load_struct = build_simple_mem_ref (clsn_data->load);
+      t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+
+      addr = build_addr (t);
+    }
+  else
+    {
+      /* Set the address for the atomic store.  */
+      addr = reduc->reduc_addr;
 
-  addr = build_addr (t);
+      /* Remove the non-atomic store '*addr = sum'.  */
+      tree res = PHI_RESULT (reduc->keep_res);
+      use_operand_p use_p;
+      gimple *stmt;
+      bool single_use_p = single_imm_use (res, &use_p, &stmt);
+      gcc_assert (single_use_p);
+      replace_uses_by (gimple_vdef (stmt),
+		       gimple_vuse (stmt));
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
 
   /* Create phi node.  */
   bb = clsn_data->load_bb;
@@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop,
 
 static void
 create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
-		      tree new_data, unsigned n_threads, location_t loc)
+		      tree new_data, unsigned n_threads, location_t loc,
+		      bool oacc_kernels_p)
 {
   gimple_stmt_iterator gsi;
   basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
@@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
   gomp_continue *omp_cont_stmt;
   tree cvar, cvar_init, initvar, cvar_next, cvar_base, type;
   edge exit, nexit, guard, end, e;
+  tree for_clauses = NULL_TREE;
 
   /* Prepare the GIMPLE_OMP_PARALLEL statement.  */
   bb = loop_preheader_edge (loop)->src;
-  paral_bb = single_pred (bb);
-  gsi = gsi_last_bb (paral_bb);
+  if (!oacc_kernels_p)
+    {
+      paral_bb = single_pred (bb);
+      gsi = gsi_last_bb (paral_bb);
+    }
 
-  t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
-  OMP_CLAUSE_NUM_THREADS_EXPR (t)
-    = build_int_cst (integer_type_node, n_threads);
-  omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
-  gimple_set_location (omp_par_stmt, loc);
+  if (!oacc_kernels_p)
+    {
+      t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+      OMP_CLAUSE_NUM_THREADS_EXPR (t)
+	= build_int_cst (integer_type_node, n_threads);
+      omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+      gimple_set_location (omp_par_stmt, loc);
 
-  gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+    }
+  else
+    {
+      tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+      OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+	= build_int_cst (integer_type_node, n_threads);
+      set_oacc_fn_attrib (cfun->decl, clause, NULL);
+    }
 
   /* Initialize NEW_DATA.  */
   if (data)
@@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
       gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
     }
 
-  /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
-  bb = split_loop_exit_edge (single_dom_exit (loop));
-  gsi = gsi_last_bb (bb);
-  omp_return_stmt1 = gimple_build_omp_return (false);
-  gimple_set_location (omp_return_stmt1, loc);
-  gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+  /* Skip insertion of OMP_RETURN for oacc_kernels_p.  We've already generated
+     one when lowering the oacc kernels directive in
+     pass_lower_omp/lower_omp (). */
+  if (!oacc_kernels_p)
+    {
+      /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL.  */
+      bb = split_loop_exit_edge (single_dom_exit (loop));
+      gsi = gsi_last_bb (bb);
+      omp_return_stmt1 = gimple_build_omp_return (false);
+      gimple_set_location (omp_return_stmt1, loc);
+      gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+    }
 
   /* Extract data for GIMPLE_OMP_FOR.  */
   gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
     OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
       = build_int_cst (integer_type_node, chunk_size);
 
-  for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+  if (1)
+    {
+      /* In combination with the NUM_GANGS on the parallel.  */
+      for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG);
+    }
+
+  for_stmt = gimple_build_omp_for (NULL,
+				   (oacc_kernels_p
+				    ? GF_OMP_FOR_KIND_OACC_LOOP
+				    : GF_OMP_FOR_KIND_FOR),
+				   for_clauses, 1, NULL);
   gimple_set_location (for_stmt, loc);
   gimple_omp_for_set_index (for_stmt, 0, initvar);
   gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
 static void
 gen_parallel_loop (struct loop *loop,
 		   reduction_info_table_type *reduction_list,
-		   unsigned n_threads, struct tree_niter_desc *niter)
+		   unsigned n_threads, struct tree_niter_desc *niter,
+		   bool oacc_kernels_p)
 {
   tree many_iterations_cond, type, nit;
   tree arg_struct, new_arg_struct;
@@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  if (loop->inner)
-    m_p_thread=2;
-  else
-    m_p_thread=MIN_PER_THREAD;
-
-   many_iterations_cond =
-     fold_build2 (GE_EXPR, boolean_type_node,
-                nit, build_int_cst (type, m_p_thread * n_threads));
-
-  many_iterations_cond
-    = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
-		   invert_truthvalue (unshare_expr (niter->may_be_zero)),
-		   many_iterations_cond);
-  many_iterations_cond
-    = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
-  if (stmts)
-    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-  if (!is_gimple_condexpr (many_iterations_cond))
+  if (!oacc_kernels_p)
     {
+      if (loop->inner)
+	m_p_thread=2;
+      else
+	m_p_thread=MIN_PER_THREAD;
+
+      many_iterations_cond =
+	fold_build2 (GE_EXPR, boolean_type_node,
+		     nit, build_int_cst (type, m_p_thread * n_threads));
+
+      many_iterations_cond
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       invert_truthvalue (unshare_expr (niter->may_be_zero)),
+		       many_iterations_cond);
       many_iterations_cond
-	= force_gimple_operand (many_iterations_cond, &stmts,
-				true, NULL_TREE);
+	= force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
       if (stmts)
 	gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
-    }
+      if (!is_gimple_condexpr (many_iterations_cond))
+	{
+	  many_iterations_cond
+	    = force_gimple_operand (many_iterations_cond, &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	}
 
-  initialize_original_copy_tables ();
+      initialize_original_copy_tables ();
 
-  /* We assume that the loop usually iterates a lot.  */
-  prob = 4 * REG_BR_PROB_BASE / 5;
-  loop_version (loop, many_iterations_cond, NULL,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
-  update_ssa (TODO_update_ssa);
-  free_original_copy_tables ();
+      /* We assume that the loop usually iterates a lot.  */
+      prob = 4 * REG_BR_PROB_BASE / 5;
+      loop_version (loop, many_iterations_cond, NULL,
+		    prob, prob, REG_BR_PROB_BASE - prob, true);
+      update_ssa (TODO_update_ssa);
+      free_original_copy_tables ();
+    }
 
   /* Base all the induction variables in LOOP on a single control one.  */
   canonicalize_loop_ivs (loop, &nit, true);
@@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
     }
   else
     {
+      if (oacc_kernels_p)
+	n_threads = 1;
+
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
 	 header, and duplicate the part of the last iteration that gets disabled
@@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
   entry = loop_preheader_edge (loop);
   exit = single_dom_exit (loop);
 
-  eliminate_local_variables (entry, exit);
-  /* In the old loop, move all variables non-local to the loop to a structure
-     and back, and create separate decls for the variables used in loop.  */
-  separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
-			    &new_arg_struct, &clsn_data);
+  /* This rewrites the body in terms of new variables.  This has already
+     been done for oacc_kernels_p in pass_lower_omp/lower_omp ().  */
+  if (!oacc_kernels_p)
+    {
+      eliminate_local_variables (entry, exit);
+      /* In the old loop, move all variables non-local to the loop to a
+	 structure and back, and create separate decls for the variables used in
+	 loop.  */
+      separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+				&new_arg_struct, &clsn_data);
+    }
+  else
+    {
+      arg_struct = NULL_TREE;
+      new_arg_struct = NULL_TREE;
+      clsn_data.load = NULL_TREE;
+      clsn_data.load_bb = exit->dest;
+      clsn_data.store = NULL_TREE;
+      clsn_data.store_bb = NULL;
+    }
 
   /* Create the parallel constructs.  */
   loc = UNKNOWN_LOCATION;
   cond_stmt = last_stmt (loop->header);
   if (cond_stmt)
     loc = gimple_location (cond_stmt);
-  create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
-			new_arg_struct, n_threads, loc);
+  create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+			n_threads, loc, oacc_kernels_p);
   if (reduction_list->elements () > 0)
     create_call_for_reduction (loop, reduction_list, &clsn_data);
 
@@ -2527,12 +2606,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
   return true;
 }
 
+static tree
+get_omp_data_i_param (void)
+{
+  tree decl = DECL_ARGUMENTS (cfun->decl);
+  gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+  return ssa_default_def (cfun, decl);
+}
+
 /* Try to initialize REDUCTION_LIST for code generation part.
    REDUCTION_LIST describes the reductions.  */
 
 static bool
 try_create_reduction_list (loop_p loop,
-			   reduction_info_table_type *reduction_list)
+			   reduction_info_table_type *reduction_list,
+			   bool oacc_kernels_p)
 {
   edge exit = single_dom_exit (loop);
   gphi_iterator gsi;
@@ -2588,6 +2676,7 @@ try_create_reduction_list (loop_p loop,
 			 "  FAILED: it is not a part of reduction.\n");
 	      return false;
 	    }
+	  red->keep_res = phi;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
 	      fprintf (dump_file, "reduction phi is  ");
@@ -2622,15 +2711,402 @@ try_create_reduction_list (loop_p loop,
     }
 
 
+  if (oacc_kernels_p)
+    {
+      edge e = loop_preheader_edge (loop);
+
+      for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree def = PHI_RESULT (phi);
+	  affine_iv iv;
+
+	  if (!virtual_operand_p (def)
+	      && !simple_iv (loop, loop, def, &iv, true))
+	    {
+	      struct reduction_info *red;
+	      red = reduction_phi (reduction_list, phi);
+
+	      /* Look for pattern:
+
+		 <bb preheader>
+		   .omp_data_i = &.omp_data_arr;
+		   addr = .omp_data_i->sum;
+		   sum_a = *addr;
+
+		 <bb header>:
+		   sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+		 and assign addr to reduc->reduc_addr.  */
+
+	      tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	      gimple *stmt = SSA_NAME_DEF_STMT (arg);
+	      if (!gimple_assign_single_p (stmt))
+		return false;
+	      tree memref = gimple_assign_rhs1 (stmt);
+	      if (TREE_CODE (memref) != MEM_REF)
+		return false;
+	      tree addr = TREE_OPERAND (memref, 0);
+
+	      gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+	      if (!gimple_assign_single_p (stmt2))
+		return false;
+	      tree compref = gimple_assign_rhs1 (stmt2);
+	      if (TREE_CODE (compref) != COMPONENT_REF)
+		return false;
+	      tree addr2 = TREE_OPERAND (compref, 0);
+	      if (TREE_CODE (addr2) != MEM_REF)
+		return false;
+	      addr2 = TREE_OPERAND (addr2, 0);
+	      if (TREE_CODE (addr2) != SSA_NAME
+		  || addr2 != get_omp_data_i_param ())
+		return false;
+	      red->reduc_addr = addr;
+	    }
+	}
+    }
+
   return true;
 }
 
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+			   bool ref_is_store, vec<basic_block> region_bbs,
+			   unsigned int i, gimple *skip_stmt)
+{
+  basic_block bb = region_bbs[i];
+  gsi_next (&gsi);
+
+  while (true)
+    {
+      for (; !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (stmt == skip_stmt)
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "skipping reduction store: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      continue;
+	    }
+
+	  if (!gimple_vdef (stmt)
+	      && !gimple_vuse (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+
+	  if (ref_is_store)
+	    {
+	      if (ref_maybe_used_by_stmt_p (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	  else
+	    {
+	      if (stmt_may_clobber_ref_p_1 (stmt, ref))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file, "Stmt ");
+		      print_gimple_stmt (dump_file, stmt, 0, 0);
+		    }
+		  return true;
+		}
+	    }
+	}
+      i++;
+      if (i == region_bbs.length ())
+	break;
+      bb = region_bbs[i];
+      gsi = gsi_start_bb (bb);
+    }
+
+  return false;
+}
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+		      tree omp_data_i,
+		      reduction_info_table_type *reduction_list,
+		      bitmap reduction_stores)
+{
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  gimple *skip_stmt = NULL;
+
+	  if (is_gimple_debug (stmt)
+	      || gimple_code (stmt) == GIMPLE_COND)
+	    continue;
+
+	  ao_ref ref;
+	  bool ref_is_store = false;
+	  if (gimple_assign_load_p (stmt))
+	    {
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      tree base = get_base_address (rhs);
+	      if (TREE_CODE (base) == MEM_REF
+		  && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+		continue;
+
+	      tree lhs = gimple_assign_lhs (stmt);
+	      if (TREE_CODE (lhs) == SSA_NAME
+		  && has_single_use (lhs))
+		{
+		  use_operand_p use_p;
+		  gimple *use_stmt;
+		  single_imm_use (lhs, &use_p, &use_stmt);
+		  if (gimple_code (use_stmt) == GIMPLE_PHI)
+		    {
+		      struct reduction_info *red;
+		      red = reduction_phi (reduction_list, use_stmt);
+		      tree val = PHI_RESULT (red->keep_res);
+		      if (has_single_use (val))
+			{
+			  single_imm_use (val, &use_p, &use_stmt);
+			  if (gimple_store_p (use_stmt))
+			    {
+			      unsigned int id
+				= SSA_NAME_VERSION (gimple_vdef (use_stmt));
+			      bitmap_set_bit (reduction_stores, id);
+			      skip_stmt = use_stmt;
+			      if (dump_file)
+				{
+				  fprintf (dump_file, "found reduction load: ");
+				  print_gimple_stmt (dump_file, stmt, 0, 0);
+				}
+			    }
+			}
+		    }
+		}
+
+	      ao_ref_init (&ref, rhs);
+	    }
+	  else if (gimple_store_p (stmt))
+	    {
+	      ao_ref_init (&ref, gimple_assign_lhs (stmt));
+	      ref_is_store = true;
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+	    continue;
+	  else if (!gimple_has_side_effects (stmt)
+		   && !gimple_could_trap_p (stmt)
+		   && !stmt_could_throw_p (stmt)
+		   && !gimple_vdef (stmt)
+		   && !gimple_vuse (stmt))
+	    continue;
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+	    continue;
+	  else if (gimple_code (stmt) == GIMPLE_RETURN)
+	    continue;
+	  else
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+
+	  if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+					 i, skip_stmt))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "conflicts with entry/exit stmt: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+	      return false;
+	    }
+	}
+    }
+
+  return true;
+}
+
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+   gang_pos == 0, except when the stores are REDUCTION_STORES.  Return true
+   if any changes were made.  */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+			     bitmap reduction_stores)
+{
+  tree gang_pos = NULL_TREE;
+  bool changed = false;
+
+  unsigned i;
+  basic_block bb;
+  FOR_EACH_VEC_ELT (region_bbs, i, bb)
+    {
+      if (bitmap_bit_p (in_loop_bbs, bb->index))
+	continue;
+
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!gimple_store_p (stmt))
+	    {
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (bitmap_bit_p (reduction_stores,
+			    SSA_NAME_VERSION (gimple_vdef (stmt))))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file,
+			   "skipped reduction store for single-gang"
+			   " neutering: ");
+		  print_gimple_stmt (dump_file, stmt, 0, 0);
+		}
+
+	      /* Update gsi to point to next stmt.  */
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  changed = true;
+
+	  if (gang_pos == NULL_TREE)
+	    {
+	      tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+	      gcall *gang_single
+		= gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+	      gang_pos = make_ssa_name (integer_type_node);
+	      gimple_call_set_lhs (gang_single, gang_pos);
+	      gimple_stmt_iterator start
+		= gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+	      tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+	      gimple_set_vuse (gang_single, vuse);
+	      gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+	    }
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "found store that needs single-gang neutering: ");
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+
+	  {
+	    /* Split block before store.  */
+	    gimple_stmt_iterator gsi2 = gsi;
+	    gsi_prev (&gsi2);
+	    edge e;
+	    if (gsi_end_p (gsi2))
+	      {
+		e = split_block_after_labels (bb);
+		gsi2 = gsi_last_bb (bb);
+	      }
+	    else
+	      e = split_block (bb, gsi_stmt (gsi2));
+	    basic_block bb2 = e->dest;
+
+	    /* Split block after store.  */
+	    gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+	    edge e2 = split_block (bb2, gsi_stmt (gsi3));
+	    basic_block bb3 = e2->dest;
+
+	    gimple *cond
+	      = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+				   NULL_TREE, NULL_TREE);
+	    gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+	    edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+	    e->flags = EDGE_TRUE_VALUE;
+
+	    tree vdef = gimple_vdef (stmt);
+	    tree vuse = gimple_vuse (stmt);
+
+	    tree phi_res = copy_ssa_name (vdef);
+	    gphi *new_phi = create_phi_node (phi_res, bb3);
+	    replace_uses_by (vdef, phi_res);
+	    add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+	    add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+	    /* Update gsi to point to next stmt.  */
+	    bb = bb3;
+	    gsi = gsi_start_bb (bb);
+	  }
+	}
+    }
+
+  return changed;
+}
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+		    reduction_info_table_type *reduction_list)
+{
+  basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+  tree omp_data_i = get_omp_data_i_param ();
+  gcc_assert (omp_data_i != NULL_TREE);
+  vec<basic_block> region_bbs
+    = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+  bitmap_clear (in_loop_bbs);
+  for (unsigned int i = 0; i < loop->num_nodes; i++)
+    bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+  bitmap reduction_stores = BITMAP_ALLOC (NULL);
+  bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i,
+				   reduction_list, reduction_stores);
+
+  if (res)
+    {
+      bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+						  reduction_stores);
+      if (changed)
+	{
+	  free_dominance_info (CDI_DOMINATORS);
+	  calculate_dominance_info (CDI_DOMINATORS);
+	}
+    }
+
+  free (loop_bbs);
+
+  BITMAP_FREE (in_loop_bbs);
+  BITMAP_FREE (reduction_stores);
+
+  return res;
+}
+
 /* Detect parallel loops and generate parallel code using libgomp
    primitives.  Returns true if some loop was parallelized, false
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2642,19 +3118,29 @@ parallelize_loops (void)
   source_location loop_loc;
 
   /* Do not parallelize loops in the functions created by parallelization.  */
-  if (parallelized_function_p (cfun->decl))
+  if (!oacc_kernels_p
+      && parallelized_function_p (cfun->decl))
     return false;
+
+  /* Do not parallelize loops in offloaded functions.  */
+  if (!oacc_kernels_p
+      && get_oacc_fn_attrib (cfun->decl) != NULL)
+     return false;
+
   if (cfun->has_nonlocal_label)
     return false;
 
   gcc_obstack_init (&parloop_obstack);
   reduction_info_table_type reduction_list (10);
 
+  calculate_dominance_info (CDI_DOMINATORS);
+
   FOR_EACH_LOOP (loop, 0)
     {
       if (loop == skip_loop)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
+	  if (!loop->in_oacc_kernels_region
+	      && dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "Skipping loop %d as inner loop of parallelized loop\n",
 		     loop->num);
@@ -2666,6 +3152,22 @@ parallelize_loops (void)
 	skip_loop = NULL;
 
       reduction_list.empty ();
+
+      if (oacc_kernels_p)
+	{
+	  if (!loop->in_oacc_kernels_region)
+	    continue;
+
+	  /* Don't try to parallelize inner loops in an oacc kernels region.  */
+	  if (loop->inner)
+	    skip_loop = loop->inner;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Trying loop %d with header bb %d in oacc kernels"
+		     " region\n", loop->num, loop->header->index);
+	}
+
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2707,6 +3209,7 @@ parallelize_loops (void)
       /* FIXME: Bypass this check as graphite doesn't update the
 	 count and frequency correctly now.  */
       if (!flag_loop_parallelize_all
+	  && !oacc_kernels_p
 	  && ((estimated != -1
 	       && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
 	      /* Do not bother with loops in cold areas.  */
@@ -2716,14 +3219,23 @@ parallelize_loops (void)
       if (!try_get_loop_niter (loop, &niter_desc))
 	continue;
 
-      if (!try_create_reduction_list (loop, &reduction_list))
+      if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
 	continue;
 
       if (!flag_loop_parallelize_all
 	  && !loop_parallel_p (loop, &parloop_obstack))
 	continue;
 
+      if (oacc_kernels_p
+	&& !oacc_entry_exit_ok (loop, &reduction_list))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "entry/exit not ok: FAILED\n");
+	  continue;
+	}
+
       changed = true;
+      /* Skip inner loop(s) of parallelized loop.  */
       skip_loop = loop->inner;
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
@@ -2736,8 +3248,9 @@ parallelize_loops (void)
 	  fprintf (dump_file, "\nloop at %s:%d: ",
 		   LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
       }
+
       gen_parallel_loop (loop, &reduction_list,
-			 n_threads, &niter_desc);
+			 n_threads, &niter_desc, oacc_kernels_p);
     }
 
   obstack_free (&parloop_obstack, NULL);
@@ -2787,7 +3300,7 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  if (parallelize_loops ())
+  if (parallelize_loops (false))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
 
@@ -2806,3 +3319,59 @@ make_pass_parallelize_loops (gcc::context *ctxt)
 {
   return new pass_parallelize_loops (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_parallelize_loops_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "parloops_oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_parallelize_loops_oacc_kernels
+
+unsigned
+pass_parallelize_loops_oacc_kernels::execute (function *fun)
+{
+  loop_optimizer_init (LOOPS_NORMAL
+		       | LOOPS_HAVE_RECORDED_EXITS);
+  rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  if (parallelize_loops (true))
+    {
+      fun->curr_properties &= ~(PROP_gimple_eomp);
+
+      return TODO_update_ssa;
+    }
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 395a93a..f5803d0 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -384,6 +384,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+  make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]