This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
- From: Tom de Vries <Tom_deVries at mentor dot com>
- To: "gcc-patches at gnu dot org" <gcc-patches at gnu dot org>
- Cc: Jakub Jelinek <jakub at redhat dot com>, Richard Biener <rguenther at suse dot de>
- Date: Tue, 24 Nov 2015 13:24:47 +0100
- Subject: Re: [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
- Authentication-results: sourceware.org; auth=none
- References: <5640BD31 dot 2060602 at mentor dot com> <5640F98B dot 5050601 at mentor dot com> <5649C508 dot 80803 at mentor dot com>
On 16/11/15 12:59, Tom de Vries wrote:
On 09/11/15 20:52, Tom de Vries wrote:
On 09/11/15 16:35, Tom de Vries wrote:
Hi,
this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.
The patch series contains these patches:
1 Insert new exit block only when needed in
transform_to_exit_first_loop_alt
2 Make create_parallel_loop return void
3 Ignore reduction clause on kernels directive
4 Implement -foffload-alias
5 Add in_oacc_kernels_region in struct loop
6 Add pass_oacc_kernels
7 Add pass_dominator_oacc_kernels
8 Add pass_ch_oacc_kernels
9 Add pass_parallelize_loops_oacc_kernels
10 Add pass_oacc_kernels pass group in passes.def
11 Update testcases after adding kernels pass group
12 Handle acc loop directive
13 Add c-c++-common/goacc/kernels-*.c
14 Add gfortran.dg/goacc/kernels-*.f95
15 Add libgomp.oacc-c-c++-common/kernels-*.c
16 Add libgomp.oacc-fortran/kernels-*.f95
The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.
Bootstrapped and reg-tested on x86_64.
Build and reg-tested with nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
I'll post the individual patches in reply to this message.
This patch adds pass_parallelize_loops_oacc_kernels.
There's a number of things we do differently in parloops for oacc
kernels:
- in normal parloops, we generate code to choose between a parallel
version of the loop, and a sequential (low iteration count) version.
Since the code in oacc kernels region is supposed to run on the
accelerator anyway, we skip this check, and don't add a low iteration
count loop.
- in normal parloops, we generate an #pragma omp parallel /
GIMPLE_OMP_RETURN pair to delimit the region which will we split off
into a thread function. Since the oacc kernels region is already
split off, we don't add this pair.
- we indicate the parallelization factor by setting the oacc function
attributes
- we generate an #pragma oacc loop instead of an #pragma omp for, and
we add the gang clause
- in normal parloops, we rewrite the variable accesses in the loop in
terms into accesses relative to a thread function parameter. For the
oacc kernels region, that rewrite has already been done at omp-lower,
so we skip this.
- we need to ensure that the entire kernels region can be run in
parallel. The loop independence check is already present, so for oacc
kernels we add a check between blocks outside the loop and the entire
region.
- we guard stores in the blocks outside the loop with gang_pos == 0.
There's no need for each gang to write to a single location, we can
do this in just one gang. (Typically this is the write of the final
value of the iteration variable if that one is copied back to the
host).
Reposting with loop optimizer init added in
pass_parallelize_loops_oacc_kernels::execute.
Reposting with loop_optimizer_finalize,scev_initialize and scev_finalize
added in pass_parallelize_loops_oacc_kernels::execute.
Thanks,
- Tom
Add pass_parallelize_loops_oacc_kernels
2015-11-09 Tom de Vries <tom@codesourcery.com>
* omp-low.c (set_oacc_fn_attrib): Make extern.
* omp-low.c (expand_omp_atomic_fetch_op): Release defs of update stmt.
* omp-low.h (set_oacc_fn_attrib): Declare.
* tree-parloops.c (struct reduction_info): Add reduc_addr field.
(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
Add and handle function parameter oacc_kernels_p.
(get_omp_data_i_param): New function.
(ref_conflicts_with_region, oacc_entry_exit_ok_1)
(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New function.
(parallelize_loops): Add and handle function parameter oacc_kernels_p.
Calculate dominance info. Skip loops that are not in a kernels region
in oacc_kernels_p mode. Skip inner loops of parallelized loops.
(pass_parallelize_loops::execute): Call parallelize_loops with false
argument.
(pass_data_parallelize_loops_oacc_kernels): New pass_data.
(class pass_parallelize_loops_oacc_kernels): New pass.
(pass_parallelize_loops_oacc_kernels::execute)
(make_pass_parallelize_loops_oacc_kernels): New function.
* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.
---
gcc/omp-low.c | 8 +-
gcc/omp-low.h | 1 +
gcc/tree-parloops.c | 700 +++++++++++++++++++++++++++++++++++++++++++++++-----
gcc/tree-pass.h | 2 +
4 files changed, 647 insertions(+), 64 deletions(-)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 0d4c6e5..efe5d3a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11925,10 +11925,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
gsi_remove (&gsi, true);
gsi = gsi_last_bb (store_bb);
+ stmt = gsi_stmt (gsi);
gsi_remove (&gsi, true);
if (gimple_in_ssa_p (cfun))
- update_ssa (TODO_update_ssa_no_phi);
+ {
+ release_defs (stmt);
+ update_ssa (TODO_update_ssa_no_phi);
+ }
return true;
}
@@ -12302,7 +12306,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
function attribute. Push any that are non-constant onto the ARGS
list, along with an appropriate GOMP_LAUNCH_DIM tag. */
-static void
+void
set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
{
/* Must match GOMP_DIM ordering. */
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index 194b3d1..1790f40 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -33,6 +33,7 @@ extern tree omp_member_access_dummy_var (tree);
extern void replace_oacc_fn_attrib (tree, tree);
extern tree build_oacc_routine_dims (tree);
extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
extern int get_oacc_ifn_dim_arg (const gimple *);
extern int get_oacc_fn_dim_size (tree, int);
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 9b564ca..0403d3b 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3. If not see
#include "tree-ssa.h"
#include "params.h"
#include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
/* This pass tries to distribute iterations of loops into several threads.
The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
of the reduction variable when existing the loop. */
tree initial_value; /* The initial value of the reduction var before entering the loop. */
tree field; /* the name of the field in the parloop data structure intended for reduction. */
+ tree reduc_addr; /* The address of the reduction variable for
+ openacc reductions. */
tree init; /* reduction initialization value. */
gphi *new_phi; /* (helper field) Newly created phi node whose result
will be passed to the atomic operation. Represents
@@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data)
tree tmp_load, name;
gimple *load;
- load_struct = build_simple_mem_ref (clsn_data->load);
- t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+ if (reduc->reduc_addr == NULL_TREE)
+ {
+ load_struct = build_simple_mem_ref (clsn_data->load);
+ t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE);
+
+ addr = build_addr (t);
+ }
+ else
+ {
+ /* Set the address for the atomic store. */
+ addr = reduc->reduc_addr;
- addr = build_addr (t);
+ /* Remove the non-atomic store '*addr = sum'. */
+ tree res = PHI_RESULT (reduc->keep_res);
+ use_operand_p use_p;
+ gimple *stmt;
+ bool single_use_p = single_imm_use (res, &use_p, &stmt);
+ gcc_assert (single_use_p);
+ replace_uses_by (gimple_vdef (stmt),
+ gimple_vuse (stmt));
+ gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+ gsi_remove (&gsi, true);
+ }
/* Create phi node. */
bb = clsn_data->load_bb;
@@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop,
static void
create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
- tree new_data, unsigned n_threads, location_t loc)
+ tree new_data, unsigned n_threads, location_t loc,
+ bool oacc_kernels_p)
{
gimple_stmt_iterator gsi;
basic_block bb, paral_bb, for_bb, ex_bb, continue_bb;
@@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
gomp_continue *omp_cont_stmt;
tree cvar, cvar_init, initvar, cvar_next, cvar_base, type;
edge exit, nexit, guard, end, e;
+ tree for_clauses = NULL_TREE;
/* Prepare the GIMPLE_OMP_PARALLEL statement. */
bb = loop_preheader_edge (loop)->src;
- paral_bb = single_pred (bb);
- gsi = gsi_last_bb (paral_bb);
+ if (!oacc_kernels_p)
+ {
+ paral_bb = single_pred (bb);
+ gsi = gsi_last_bb (paral_bb);
+ }
- t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
- OMP_CLAUSE_NUM_THREADS_EXPR (t)
- = build_int_cst (integer_type_node, n_threads);
- omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
- gimple_set_location (omp_par_stmt, loc);
+ if (!oacc_kernels_p)
+ {
+ t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS);
+ OMP_CLAUSE_NUM_THREADS_EXPR (t)
+ = build_int_cst (integer_type_node, n_threads);
+ omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data);
+ gimple_set_location (omp_par_stmt, loc);
- gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+ gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT);
+ }
+ else
+ {
+ tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS);
+ OMP_CLAUSE_NUM_GANGS_EXPR (clause)
+ = build_int_cst (integer_type_node, n_threads);
+ set_oacc_fn_attrib (cfun->decl, clause, NULL);
+ }
/* Initialize NEW_DATA. */
if (data)
@@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
}
- /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL. */
- bb = split_loop_exit_edge (single_dom_exit (loop));
- gsi = gsi_last_bb (bb);
- omp_return_stmt1 = gimple_build_omp_return (false);
- gimple_set_location (omp_return_stmt1, loc);
- gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+ /* Skip insertion of OMP_RETURN for oacc_kernels_p. We've already generated
+ one when lowering the oacc kernels directive in
+ pass_lower_omp/lower_omp (). */
+ if (!oacc_kernels_p)
+ {
+ /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL. */
+ bb = split_loop_exit_edge (single_dom_exit (loop));
+ gsi = gsi_last_bb (bb);
+ omp_return_stmt1 = gimple_build_omp_return (false);
+ gimple_set_location (omp_return_stmt1, loc);
+ gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT);
+ }
/* Extract data for GIMPLE_OMP_FOR. */
gcc_assert (loop->header == single_dom_exit (loop)->src);
@@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t)
= build_int_cst (integer_type_node, chunk_size);
- for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL);
+ if (1)
+ {
+ /* In combination with the NUM_GANGS on the parallel. */
+ for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG);
+ }
+
+ for_stmt = gimple_build_omp_for (NULL,
+ (oacc_kernels_p
+ ? GF_OMP_FOR_KIND_OACC_LOOP
+ : GF_OMP_FOR_KIND_FOR),
+ for_clauses, 1, NULL);
gimple_set_location (for_stmt, loc);
gimple_omp_for_set_index (for_stmt, 0, initvar);
gimple_omp_for_set_initial (for_stmt, 0, cvar_init);
@@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data,
static void
gen_parallel_loop (struct loop *loop,
reduction_info_table_type *reduction_list,
- unsigned n_threads, struct tree_niter_desc *niter)
+ unsigned n_threads, struct tree_niter_desc *niter,
+ bool oacc_kernels_p)
{
tree many_iterations_cond, type, nit;
tree arg_struct, new_arg_struct;
@@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop,
if (stmts)
gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
- if (loop->inner)
- m_p_thread=2;
- else
- m_p_thread=MIN_PER_THREAD;
-
- many_iterations_cond =
- fold_build2 (GE_EXPR, boolean_type_node,
- nit, build_int_cst (type, m_p_thread * n_threads));
-
- many_iterations_cond
- = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
- invert_truthvalue (unshare_expr (niter->may_be_zero)),
- many_iterations_cond);
- many_iterations_cond
- = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
- if (stmts)
- gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
- if (!is_gimple_condexpr (many_iterations_cond))
+ if (!oacc_kernels_p)
{
+ if (loop->inner)
+ m_p_thread=2;
+ else
+ m_p_thread=MIN_PER_THREAD;
+
+ many_iterations_cond =
+ fold_build2 (GE_EXPR, boolean_type_node,
+ nit, build_int_cst (type, m_p_thread * n_threads));
+
+ many_iterations_cond
+ = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+ invert_truthvalue (unshare_expr (niter->may_be_zero)),
+ many_iterations_cond);
many_iterations_cond
- = force_gimple_operand (many_iterations_cond, &stmts,
- true, NULL_TREE);
+ = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE);
if (stmts)
gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
- }
+ if (!is_gimple_condexpr (many_iterations_cond))
+ {
+ many_iterations_cond
+ = force_gimple_operand (many_iterations_cond, &stmts,
+ true, NULL_TREE);
+ if (stmts)
+ gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+ stmts);
+ }
- initialize_original_copy_tables ();
+ initialize_original_copy_tables ();
- /* We assume that the loop usually iterates a lot. */
- prob = 4 * REG_BR_PROB_BASE / 5;
- loop_version (loop, many_iterations_cond, NULL,
- prob, prob, REG_BR_PROB_BASE - prob, true);
- update_ssa (TODO_update_ssa);
- free_original_copy_tables ();
+ /* We assume that the loop usually iterates a lot. */
+ prob = 4 * REG_BR_PROB_BASE / 5;
+ loop_version (loop, many_iterations_cond, NULL,
+ prob, prob, REG_BR_PROB_BASE - prob, true);
+ update_ssa (TODO_update_ssa);
+ free_original_copy_tables ();
+ }
/* Base all the induction variables in LOOP on a single control one. */
canonicalize_loop_ivs (loop, &nit, true);
@@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,
}
else
{
+ if (oacc_kernels_p)
+ n_threads = 1;
+
/* Fall back on the method that handles more cases, but duplicates the
loop body: move the exit condition of LOOP to the beginning of its
header, and duplicate the part of the last iteration that gets disabled
@@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop,
entry = loop_preheader_edge (loop);
exit = single_dom_exit (loop);
- eliminate_local_variables (entry, exit);
- /* In the old loop, move all variables non-local to the loop to a structure
- and back, and create separate decls for the variables used in loop. */
- separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
- &new_arg_struct, &clsn_data);
+ /* This rewrites the body in terms of new variables. This has already
+ been done for oacc_kernels_p in pass_lower_omp/lower_omp (). */
+ if (!oacc_kernels_p)
+ {
+ eliminate_local_variables (entry, exit);
+ /* In the old loop, move all variables non-local to the loop to a
+ structure and back, and create separate decls for the variables used in
+ loop. */
+ separate_decls_in_region (entry, exit, reduction_list, &arg_struct,
+ &new_arg_struct, &clsn_data);
+ }
+ else
+ {
+ arg_struct = NULL_TREE;
+ new_arg_struct = NULL_TREE;
+ clsn_data.load = NULL_TREE;
+ clsn_data.load_bb = exit->dest;
+ clsn_data.store = NULL_TREE;
+ clsn_data.store_bb = NULL;
+ }
/* Create the parallel constructs. */
loc = UNKNOWN_LOCATION;
cond_stmt = last_stmt (loop->header);
if (cond_stmt)
loc = gimple_location (cond_stmt);
- create_parallel_loop (loop, create_loop_fn (loc), arg_struct,
- new_arg_struct, n_threads, loc);
+ create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct,
+ n_threads, loc, oacc_kernels_p);
if (reduction_list->elements () > 0)
create_call_for_reduction (loop, reduction_list, &clsn_data);
@@ -2531,12 +2610,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter)
return true;
}
+static tree
+get_omp_data_i_param (void)
+{
+ tree decl = DECL_ARGUMENTS (cfun->decl);
+ gcc_assert (DECL_CHAIN (decl) == NULL_TREE);
+ return ssa_default_def (cfun, decl);
+}
+
/* Try to initialize REDUCTION_LIST for code generation part.
REDUCTION_LIST describes the reductions. */
static bool
try_create_reduction_list (loop_p loop,
- reduction_info_table_type *reduction_list)
+ reduction_info_table_type *reduction_list,
+ bool oacc_kernels_p)
{
edge exit = single_dom_exit (loop);
gphi_iterator gsi;
@@ -2595,6 +2683,7 @@ try_create_reduction_list (loop_p loop,
" FAILED: it is not a part of reduction.\n");
return false;
}
+ red->keep_res = phi;
if (dump_file && (dump_flags & TDF_DETAILS))
{
fprintf (dump_file, "reduction phi is ");
@@ -2629,15 +2718,402 @@ try_create_reduction_list (loop_p loop,
}
+ if (oacc_kernels_p)
+ {
+ edge e = loop_preheader_edge (loop);
+
+ for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+ gsi_next (&gsi))
+ {
+ gphi *phi = gsi.phi ();
+ tree def = PHI_RESULT (phi);
+ affine_iv iv;
+
+ if (!virtual_operand_p (def)
+ && !simple_iv (loop, loop, def, &iv, true))
+ {
+ struct reduction_info *red;
+ red = reduction_phi (reduction_list, phi);
+
+ /* Look for pattern:
+
+ <bb preheader>
+ .omp_data_i = &.omp_data_arr;
+ addr = .omp_data_i->sum;
+ sum_a = *addr;
+
+ <bb header>:
+ sum_b = PHI <sum_a (preheader), sum_c (latch)>
+
+ and assign addr to reduc->reduc_addr. */
+
+ tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+ gimple *stmt = SSA_NAME_DEF_STMT (arg);
+ if (!gimple_assign_single_p (stmt))
+ return false;
+ tree memref = gimple_assign_rhs1 (stmt);
+ if (TREE_CODE (memref) != MEM_REF)
+ return false;
+ tree addr = TREE_OPERAND (memref, 0);
+
+ gimple *stmt2 = SSA_NAME_DEF_STMT (addr);
+ if (!gimple_assign_single_p (stmt2))
+ return false;
+ tree compref = gimple_assign_rhs1 (stmt2);
+ if (TREE_CODE (compref) != COMPONENT_REF)
+ return false;
+ tree addr2 = TREE_OPERAND (compref, 0);
+ if (TREE_CODE (addr2) != MEM_REF)
+ return false;
+ addr2 = TREE_OPERAND (addr2, 0);
+ if (TREE_CODE (addr2) != SSA_NAME
+ || addr2 != get_omp_data_i_param ())
+ return false;
+ red->reduc_addr = addr;
+ }
+ }
+ }
+
+ return true;
+}
+
+static bool
+ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref,
+ bool ref_is_store, vec<basic_block> region_bbs,
+ unsigned int i, gimple *skip_stmt)
+{
+ basic_block bb = region_bbs[i];
+ gsi_next (&gsi);
+
+ while (true)
+ {
+ for (; !gsi_end_p (gsi);
+ gsi_next (&gsi))
+ {
+ gimple *stmt = gsi_stmt (gsi);
+ if (stmt == skip_stmt)
+ {
+ if (dump_file)
+ {
+ fprintf (dump_file, "skipping reduction store: ");
+ print_gimple_stmt (dump_file, stmt, 0, 0);
+ }
+ continue;
+ }
+
+ if (!gimple_vdef (stmt)
+ && !gimple_vuse (stmt))
+ continue;
+
+ if (gimple_code (stmt) == GIMPLE_RETURN)
+ continue;
+
+ if (ref_is_store)
+ {
+ if (ref_maybe_used_by_stmt_p (stmt, ref))
+ {
+ if (dump_file)
+ {
+ fprintf (dump_file, "Stmt ");
+ print_gimple_stmt (dump_file, stmt, 0, 0);
+ }
+ return true;
+ }
+ }
+ else
+ {
+ if (stmt_may_clobber_ref_p_1 (stmt, ref))
+ {
+ if (dump_file)
+ {
+ fprintf (dump_file, "Stmt ");
+ print_gimple_stmt (dump_file, stmt, 0, 0);
+ }
+ return true;
+ }
+ }
+ }
+ i++;
+ if (i == region_bbs.length ())
+ break;
+ bb = region_bbs[i];
+ gsi = gsi_start_bb (bb);
+ }
+
+ return false;
+}
+
+static bool
+oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+ tree omp_data_i,
+ reduction_info_table_type *reduction_list,
+ bitmap reduction_stores)
+{
+ unsigned i;
+ basic_block bb;
+ FOR_EACH_VEC_ELT (region_bbs, i, bb)
+ {
+ if (bitmap_bit_p (in_loop_bbs, bb->index))
+ continue;
+
+ gimple_stmt_iterator gsi;
+ for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+ gsi_next (&gsi))
+ {
+ gimple *stmt = gsi_stmt (gsi);
+ gimple *skip_stmt = NULL;
+
+ if (is_gimple_debug (stmt)
+ || gimple_code (stmt) == GIMPLE_COND)
+ continue;
+
+ ao_ref ref;
+ bool ref_is_store = false;
+ if (gimple_assign_load_p (stmt))
+ {
+ tree rhs = gimple_assign_rhs1 (stmt);
+ tree base = get_base_address (rhs);
+ if (TREE_CODE (base) == MEM_REF
+ && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0))
+ continue;
+
+ tree lhs = gimple_assign_lhs (stmt);
+ if (TREE_CODE (lhs) == SSA_NAME
+ && has_single_use (lhs))
+ {
+ use_operand_p use_p;
+ gimple *use_stmt;
+ single_imm_use (lhs, &use_p, &use_stmt);
+ if (gimple_code (use_stmt) == GIMPLE_PHI)
+ {
+ struct reduction_info *red;
+ red = reduction_phi (reduction_list, use_stmt);
+ tree val = PHI_RESULT (red->keep_res);
+ if (has_single_use (val))
+ {
+ single_imm_use (val, &use_p, &use_stmt);
+ if (gimple_store_p (use_stmt))
+ {
+ unsigned int id
+ = SSA_NAME_VERSION (gimple_vdef (use_stmt));
+ bitmap_set_bit (reduction_stores, id);
+ skip_stmt = use_stmt;
+ if (dump_file)
+ {
+ fprintf (dump_file, "found reduction load: ");
+ print_gimple_stmt (dump_file, stmt, 0, 0);
+ }
+ }
+ }
+ }
+ }
+
+ ao_ref_init (&ref, rhs);
+ }
+ else if (gimple_store_p (stmt))
+ {
+ ao_ref_init (&ref, gimple_assign_lhs (stmt));
+ ref_is_store = true;
+ }
+ else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
+ continue;
+ else if (!gimple_has_side_effects (stmt)
+ && !gimple_could_trap_p (stmt)
+ && !stmt_could_throw_p (stmt)
+ && !gimple_vdef (stmt)
+ && !gimple_vuse (stmt))
+ continue;
+ else if (is_gimple_call (stmt)
+ && gimple_call_internal_p (stmt)
+ && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS)
+ continue;
+ else if (gimple_code (stmt) == GIMPLE_RETURN)
+ continue;
+ else
+ {
+ if (dump_file)
+ {
+ fprintf (dump_file, "Unhandled stmt in entry/exit: ");
+ print_gimple_stmt (dump_file, stmt, 0, 0);
+ }
+ return false;
+ }
+
+ if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs,
+ i, skip_stmt))
+ {
+ if (dump_file)
+ {
+ fprintf (dump_file, "conflicts with entry/exit stmt: ");
+ print_gimple_stmt (dump_file, stmt, 0, 0);
+ }
+ return false;
+ }
+ }
+ }
+
return true;
}
+/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with
+ gang_pos == 0, except when the stores are REDUCTION_STORES. Return true
+ if any changes were made. */
+
+static bool
+oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec<basic_block> region_bbs,
+ bitmap reduction_stores)
+{
+ tree gang_pos = NULL_TREE;
+ bool changed = false;
+
+ unsigned i;
+ basic_block bb;
+ FOR_EACH_VEC_ELT (region_bbs, i, bb)
+ {
+ if (bitmap_bit_p (in_loop_bbs, bb->index))
+ continue;
+
+ gimple_stmt_iterator gsi;
+ for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+ {
+ gimple *stmt = gsi_stmt (gsi);
+
+ if (!gimple_store_p (stmt))
+ {
+ /* Update gsi to point to next stmt. */
+ gsi_next (&gsi);
+ continue;
+ }
+
+ if (bitmap_bit_p (reduction_stores,
+ SSA_NAME_VERSION (gimple_vdef (stmt))))
+ {
+ if (dump_file)
+ {
+ fprintf (dump_file,
+ "skipped reduction store for single-gang"
+ " neutering: ");
+ print_gimple_stmt (dump_file, stmt, 0, 0);
+ }
+
+ /* Update gsi to point to next stmt. */
+ gsi_next (&gsi);
+ continue;
+ }
+
+ changed = true;
+
+ if (gang_pos == NULL_TREE)
+ {
+ tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+ gcall *gang_single
+ = gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg);
+ gang_pos = make_ssa_name (integer_type_node);
+ gimple_call_set_lhs (gang_single, gang_pos);
+ gimple_stmt_iterator start
+ = gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+ tree vuse = ssa_default_def (cfun, gimple_vop (cfun));
+ gimple_set_vuse (gang_single, vuse);
+ gsi_insert_before (&start, gang_single, GSI_SAME_STMT);
+ }
+
+ if (dump_file)
+ {
+ fprintf (dump_file,
+ "found store that needs single-gang neutering: ");
+ print_gimple_stmt (dump_file, stmt, 0, 0);
+ }
+
+ {
+ /* Split block before store. */
+ gimple_stmt_iterator gsi2 = gsi;
+ gsi_prev (&gsi2);
+ edge e;
+ if (gsi_end_p (gsi2))
+ {
+ e = split_block_after_labels (bb);
+ gsi2 = gsi_last_bb (bb);
+ }
+ else
+ e = split_block (bb, gsi_stmt (gsi2));
+ basic_block bb2 = e->dest;
+
+ /* Split block after store. */
+ gimple_stmt_iterator gsi3 = gsi_start_bb (bb2);
+ edge e2 = split_block (bb2, gsi_stmt (gsi3));
+ basic_block bb3 = e2->dest;
+
+ gimple *cond
+ = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node,
+ NULL_TREE, NULL_TREE);
+ gsi_insert_after (&gsi2, cond, GSI_NEW_STMT);
+
+ edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE);
+ e->flags = EDGE_TRUE_VALUE;
+
+ tree vdef = gimple_vdef (stmt);
+ tree vuse = gimple_vuse (stmt);
+
+ tree phi_res = copy_ssa_name (vdef);
+ gphi *new_phi = create_phi_node (phi_res, bb3);
+ replace_uses_by (vdef, phi_res);
+ add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION);
+ add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION);
+
+ /* Update gsi to point to next stmt. */
+ bb = bb3;
+ gsi = gsi_start_bb (bb);
+ }
+ }
+ }
+
+ return changed;
+}
+
+static bool
+oacc_entry_exit_ok (struct loop *loop,
+ reduction_info_table_type *reduction_list)
+{
+ basic_block *loop_bbs = get_loop_body_in_dom_order (loop);
+ tree omp_data_i = get_omp_data_i_param ();
+ gcc_assert (omp_data_i != NULL_TREE);
+ vec<basic_block> region_bbs
+ = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+ bitmap in_loop_bbs = BITMAP_ALLOC (NULL);
+ bitmap_clear (in_loop_bbs);
+ for (unsigned int i = 0; i < loop->num_nodes; i++)
+ bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index);
+
+ bitmap reduction_stores = BITMAP_ALLOC (NULL);
+ bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i,
+ reduction_list, reduction_stores);
+
+ if (res)
+ {
+ bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs,
+ reduction_stores);
+ if (changed)
+ {
+ free_dominance_info (CDI_DOMINATORS);
+ calculate_dominance_info (CDI_DOMINATORS);
+ }
+ }
+
+ free (loop_bbs);
+
+ BITMAP_FREE (in_loop_bbs);
+ BITMAP_FREE (reduction_stores);
+
+ return res;
+}
+
/* Detect parallel loops and generate parallel code using libgomp
primitives. Returns true if some loop was parallelized, false
otherwise. */
static bool
-parallelize_loops (void)
+parallelize_loops (bool oacc_kernels_p)
{
unsigned n_threads = flag_tree_parallelize_loops;
bool changed = false;
@@ -2649,19 +3125,29 @@ parallelize_loops (void)
source_location loop_loc;
/* Do not parallelize loops in the functions created by parallelization. */
- if (parallelized_function_p (cfun->decl))
+ if (!oacc_kernels_p
+ && parallelized_function_p (cfun->decl))
return false;
+
+ /* Do not parallelize loops in offloaded functions. */
+ if (!oacc_kernels_p
+ && get_oacc_fn_attrib (cfun->decl) != NULL)
+ return false;
+
if (cfun->has_nonlocal_label)
return false;
gcc_obstack_init (&parloop_obstack);
reduction_info_table_type reduction_list (10);
+ calculate_dominance_info (CDI_DOMINATORS);
+
FOR_EACH_LOOP (loop, 0)
{
if (loop == skip_loop)
{
- if (dump_file && (dump_flags & TDF_DETAILS))
+ if (!loop->in_oacc_kernels_region
+ && dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file,
"Skipping loop %d as inner loop of parallelized loop\n",
loop->num);
@@ -2673,6 +3159,22 @@ parallelize_loops (void)
skip_loop = NULL;
reduction_list.empty ();
+
+ if (oacc_kernels_p)
+ {
+ if (!loop->in_oacc_kernels_region)
+ continue;
+
+ /* Don't try to parallelize inner loops in an oacc kernels region. */
+ if (loop->inner)
+ skip_loop = loop->inner;
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Trying loop %d with header bb %d in oacc kernels"
+ " region\n", loop->num, loop->header->index);
+ }
+
if (dump_file && (dump_flags & TDF_DETAILS))
{
fprintf (dump_file, "Trying loop %d as candidate\n",loop->num);
@@ -2714,6 +3216,7 @@ parallelize_loops (void)
/* FIXME: Bypass this check as graphite doesn't update the
count and frequency correctly now. */
if (!flag_loop_parallelize_all
+ && !oacc_kernels_p
&& ((estimated != -1
&& estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD)
/* Do not bother with loops in cold areas. */
@@ -2723,14 +3226,23 @@ parallelize_loops (void)
if (!try_get_loop_niter (loop, &niter_desc))
continue;
- if (!try_create_reduction_list (loop, &reduction_list))
+ if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p))
continue;
if (!flag_loop_parallelize_all
&& !loop_parallel_p (loop, &parloop_obstack))
continue;
+ if (oacc_kernels_p
+ && !oacc_entry_exit_ok (loop, &reduction_list))
+ {
+ if (dump_file)
+ fprintf (dump_file, "entry/exit not ok: FAILED\n");
+ continue;
+ }
+
changed = true;
+ /* Skip inner loop(s) of parallelized loop. */
skip_loop = loop->inner;
if (dump_file && (dump_flags & TDF_DETAILS))
{
@@ -2743,8 +3255,9 @@ parallelize_loops (void)
fprintf (dump_file, "\nloop at %s:%d: ",
LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc));
}
+
gen_parallel_loop (loop, &reduction_list,
- n_threads, &niter_desc);
+ n_threads, &niter_desc, oacc_kernels_p);
}
obstack_free (&parloop_obstack, NULL);
@@ -2794,7 +3307,7 @@ pass_parallelize_loops::execute (function *fun)
if (number_of_loops (fun) <= 1)
return 0;
- if (parallelize_loops ())
+ if (parallelize_loops (false))
{
fun->curr_properties &= ~(PROP_gimple_eomp);
@@ -2813,3 +3326,66 @@ make_pass_parallelize_loops (gcc::context *ctxt)
{
return new pass_parallelize_loops (ctxt);
}
+
+namespace {
+
+const pass_data pass_data_parallelize_loops_oacc_kernels =
+{
+ GIMPLE_PASS, /* type */
+ "parloops_oacc_kernels", /* name */
+ OPTGROUP_LOOP, /* optinfo_flags */
+ TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+ ( PROP_cfg | PROP_ssa ), /* properties_required */
+ 0, /* properties_provided */
+ 0, /* properties_destroyed */
+ 0, /* todo_flags_start */
+ 0, /* todo_flags_finish */
+};
+
+class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass
+{
+public:
+ pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+ : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt)
+ {}
+
+ /* opt_pass methods: */
+ virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; }
+ virtual unsigned int execute (function *);
+
+}; // class pass_parallelize_loops_oacc_kernels
+
+unsigned
+pass_parallelize_loops_oacc_kernels::execute (function *fun)
+{
+ unsigned int todo = 0;
+
+ loop_optimizer_init (LOOPS_NORMAL
+ | LOOPS_HAVE_RECORDED_EXITS);
+
+ if (number_of_loops (fun) <= 1)
+ return 0;
+
+ rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+ scev_initialize ();
+
+ if (parallelize_loops (true))
+ {
+ fun->curr_properties &= ~(PROP_gimple_eomp);
+ todo |= TODO_update_ssa;
+ }
+
+ scev_finalize ();
+ loop_optimizer_finalize ();
+
+ return todo;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+ return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 9704918..004db77 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -385,6 +385,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+ make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);