[gomp] Initial parallel gimple support


This patch introduces initial support for an intermediate representation to 
express concurrency.  It mostly re-uses the OMP_* codes we have for 
OpenMP.  I have given no thought to whether we would need other codes to 
express other types of parallelism, but given that the runtime library is 
heavily influenced by OpenMP, adding other codes would not be useful at 
the moment (and I suspect that we will not be too eager to support 
anything other than OpenMP for now).

The idea is essentially the same as the one underlying the GIMPLE to 
low-GIMPLE transition.  When we parse OpenMP directives, we create OMP_* 
trees, which are then taken to high-GIMPLE form and, until now, were 
immediately expanded into the corresponding runtime calls into libgomp.

This patch further separates the OpenMP front-end from the back-end by 
taking the OpenMP IL into low-GIMPLE and having it survive until we create 
the CFG.

Currently, we cannot take an OpenMP program lower than the CFG because 
neither the SSA form nor the optimizers understand concurrency.  It is 
significantly easier to make the CFG aware of parallelism by using 
abnormal edges when transitioning to blocks inside a parallel region.

By separating lowering from expansion we get the usual benefits of an 
optimizable IL.  In this case the benefits are more noticeable because, 
in OpenMP, expansion forces us to outline the parallel regions into 
separate functions.  This means that local variables that need to be sent 
in/out of 
the parallel regions must be mapped into a special structure whose address 
is passed to the outlined function.
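
For concreteness, here is a minimal hand-written C sketch of that 
mapping.  The names omp_data_s, omp_data_o and foo_omp_fn are stand-ins 
for the compiler-generated .omp_data_s.1, .omp_data_o.3 and foo.omp_fn.0 
that appear in the dumps below; the real child function is, of course, 
started through the libgomp runtime rather than called directly:

struct omp_data_s
{
  int *x;                       /* address of a shared local */
};

static void
foo_omp_fn (struct omp_data_s *omp_data_i)
{
  *omp_data_i->x += 1;          /* parallel region body; every thread
                                   sees the same 'x' through the map */
}

void
foo (void)
{
  int x = 0;
  struct omp_data_s omp_data_o;

  omp_data_o.x = &x;            /* send the shared local in */
  foo_omp_fn (&omp_data_o);     /* really started via libgomp */
}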

All this mapping is done in two phases: scan_omp and lower_omp.  During 
scan_omp we create all the mappings and decide how local variables will be 
sent in and out of the parallel regions.  During lower_omp, we do the 
actual re-mapping and convert the bodies of the OpenMP directives into 
low-GIMPLE form.

Before this patch, lowering an OpenMP directive meant expanding it into the 
corresponding library calls.  With this patch, we merely lower the 
structure of the directives and create the variable mappings so that a CFG 
can be created over the program.  So, the program:

{
  int a[10], i;

  #pragma omp parallel
  #pragma omp for schedule (dynamic)
  for (i = 0; i < 10; i++)
    a[i] = i;
}

Becomes (.t18.omplower):

{
  struct .omp_data_s.1 .omp_data_o.3;
  int[10] * D.1573;
  int i.2;

  {
    .omp_data_o.3.a = &a;
    #pragma omp parallel shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
      {
        .omp_data_i = &.omp_data_o.3;
        {
          #pragma omp for schedule(dynamic) private(i)
          for (i = 0; i <= 9; i = i + 1)
            {
              i.0 = i;
              D.1573 = .omp_data_i->a;
              i.2 = i.0;
              (*D.1573)[i.2] = i;
              OMP_RETURN
            }
          __builtin_GOMP_barrier ();
        }
        OMP_RETURN
      }
  }
}

Notice that at this point we have exposed the data mappings via .omp_data_i 
and .omp_data_o.3.  We have also determined where we are going to outline 
the parallel region (foo.omp_fn.0).  The data sending is expressed with a 
simple assignment (.omp_data_i = &.omp_data_o.3).  After this, the program 
is converted into low-GIMPLE and then the CFG is created:

{
  int i;
  int a[10];
  struct .omp_data_s.1 .omp_data_o.3;
  int[10] * D.1573;
  int i.2;

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  .omp_data_o.3.a = &a;
  #pragma omp parallel shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
  # SUCC: 3 (ab)

  # BLOCK 3
  # PRED: 2 (ab)
  .omp_data_i = &.omp_data_o.3;
  #pragma omp for schedule(dynamic) private(i)
  for (i = 0; i <= 9; i = i + 1)
  # SUCC: 4 (ab)

  # BLOCK 4
  # PRED: 3 (ab)
  i.0 = i;
  D.1573 = .omp_data_i->a;
  i.2 = i.0;
  (*D.1573)[i.2] = i;
  OMP_RETURN
  # SUCC: 5 (fallthru)

  # BLOCK 5
  # PRED: 4 (fallthru)
  __builtin_GOMP_barrier ();
  OMP_RETURN
  # SUCC: 6 (fallthru)

  # BLOCK 6
  # PRED: 5 (fallthru)
  return;
  # SUCC: EXIT

}

All the CFG edges involving parallel or workshare regions are abnormal 
edges.  In the future we may want to change that, but for now this is 
sufficient to keep structural optimizations from messing things up.  At 
this point, we can do some optimizations.  The first obvious one is to 
remove the barrier in block 5 (we do it during the expansion of the 
OMP_PARALLEL region, where it is redundant because the end of the 
parallel region, with its implicit barrier, immediately follows).

We also make sure that the parallel region to be outlined is an SESE region 
delimited by the OMP_* directive and OMP_RETURN.  Since the data sharing 
has already been expressed and exposed with the .omp_data_s mappings, all 
we need to do now is outline the SESE region in the CFG between blocks 3 
and 5.  This is done by pass_expand_omp, which first discovers the 
parallel and workshare regions to expand (building the region tree 
sketched below) and then does all the outlining.

This is the part that gave me the most grief.  I had originally thought 
that this was a 2 day patch (almost 6 weeks ago).  We did not really know 
how to create functions out of a subgraph, so I spent quite a bit of time 
tweaking the new helper move_sese_region_to_fn() and the associated 
callgraph changes.
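
Region discovery builds a small tree of nested parallel and workshare 
regions, which the .t26.ompexp dump prints as the "OMP region tree" 
below.  As a hedged sketch only (the field names here are illustrative; 
the patch's struct omp_region in omp-low.c is the authoritative 
definition), the structure looks roughly like this:

struct omp_region
{
  struct omp_region *outer;     /* enclosing region, NULL at the root */
  struct omp_region *inner;     /* first region nested inside this one */
  struct omp_region *next;      /* next sibling at the same depth */
  tree entry;                   /* the OMP_* directive opening the region */
  tree exit;                    /* the matching OMP_RETURN */
};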

So, now in .t26.ompexp, we get the fully expanded program with all the 
calls into the library.

OMP region tree

#pragma omp parallel shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
    #pragma omp for schedule(dynamic) private(i)
    OMP_RETURN
OMP_RETURN

Merging blocks 2 and 10
Merging blocks 2 and 6
foo ()
{
  int i;
  int a[10];
  struct .omp_data_s.1 .omp_data_o.3;

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  .omp_data_o.3.a = &a;
  __builtin_GOMP_parallel_loop_dynamic_start (foo.omp_fn.0, &.omp_data_o.3,
                                              0, 0, 10, 1, 1);
  foo.omp_fn.0 (&.omp_data_o.3);
  __builtin_GOMP_parallel_end ();
  return;
  # SUCC: EXIT
}

foo.omp_fn.0 (.omp_data_i)
{
  _Bool D.1593;
  int i.2;
  int[10] * D.1574;
  int D.1592;
  long int .iend0.5;
  long int .istart0.4;
  _Bool D.1594;
  int i.0;
  int a[10] [value-expr: *.omp_data_i->a];
  int i;

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  # SUCC: 3 (fallthru)

  # BLOCK 3
  # PRED: 6 [100.0%]  (false) 2 (fallthru)
<L2>:;
  D.1594 = __builtin_GOMP_loop_dynamic_next (&.istart0.4, &.iend0.5);
  if (D.1594) goto <L0>; else goto <L3>;
  # SUCC: 4 (false) 5 (true)

  # BLOCK 4
  # PRED: 3 (false)
<L3>:;
  return;
  # SUCC: EXIT

  # BLOCK 5
  # PRED: 3 (true)
<L0>:;
  i = .istart0.4;
  D.1592 = .iend0.5;
  # SUCC: 6 (fallthru)

  # BLOCK 6
  # PRED: 5 (fallthru) 6 (true)
<L1>:;
  i.0 = i;
  D.1574 = .omp_data_i->a;
  i.2 = i.0;
  (*D.1574)[i.2] = i;
  i = i + 1;
  D.1593 = i < D.1592;
  if (D.1593) goto <L1>; else goto <L2>;
  # SUCC: 3 [100.0%]  (false) 6 (true)

}

Here, we have removed the .omp_data_i = &.omp_data_o.3 assignment, created 
the outlined function, and done another minor optimization: we call the 
combined parallel loop startup function instead of the two separate 
parallel start and loop start routines.  This saves us a barrier at 
runtime.
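
To make the saving concrete, here is a hedged C sketch of the two call 
sequences as seen from the parent function.  The prototypes are 
abbreviated from libgomp's entry points, and foo_omp_fn/omp_data_o again 
stand in for the generated foo.omp_fn.0/.omp_data_o.3:

extern void GOMP_parallel_start (void (*) (void *), void *, unsigned);
extern void GOMP_parallel_end (void);
extern void GOMP_parallel_loop_dynamic_start (void (*) (void *), void *,
                                              unsigned, long, long, long,
                                              long);

extern void foo_omp_fn (void *);
extern struct omp_data_s { int (*a)[10]; } omp_data_o;

void
separate_startup (void)
{
  /* Generic startup: each thread must additionally enter the loop
     through its own GOMP_loop_dynamic_start call inside foo_omp_fn.  */
  GOMP_parallel_start (foo_omp_fn, &omp_data_o, 0);
  foo_omp_fn (&omp_data_o);
  GOMP_parallel_end ();
}

void
combined_startup (void)
{
  /* Combined startup: the loop bounds (start 0, end 10, step 1,
     chunk 1) travel with the parallel startup, matching the call in
     the .t26.ompexp dump above.  */
  GOMP_parallel_loop_dynamic_start (foo_omp_fn, &omp_data_o, 0,
                                    0, 10, 1, 1);
  foo_omp_fn (&omp_data_o);
  GOMP_parallel_end ();
}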

The code to detect when to do this optimization is now easy to implement.  
Essentially, if the basic block holding OMP_PARALLEL immediately precedes 
the basic block holding OMP_FOR, the directives can be merged (a rough 
sketch of the check follows the dump):

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  .omp_data_o.3.a = &a;
  #pragma omp parallel shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
  # SUCC: 3 (ab)

  # BLOCK 3
  # PRED: 2 (ab)
  .omp_data_i = &.omp_data_o.3;
  #pragma omp for schedule(dynamic) private(i)
  for (i = 0; i <= 9; i = i + 1)
  # SUCC: 4 (ab)
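
Illustrative only, here is a rough version of that check written against 
GCC's CFG accessors.  The real logic lives in determine_parallel_type 
and workshare_safe_to_combine_p in omp-low.c, and it verifies quite a 
bit more (in particular, that the work-sharing setup code is safe to 
evaluate before the parallel startup):

static bool
combinable_parallel_for_p (basic_block par_bb)
{
  basic_block ws_bb;
  tree ws_stmt;

  /* The OMP_PARALLEL block must have exactly one successor...  */
  if (!single_succ_p (par_bb))
    return false;

  /* ... and that successor's last statement must be the OMP_FOR.  */
  ws_bb = single_succ (par_bb);
  ws_stmt = last_stmt (ws_bb);
  return ws_stmt != NULL && TREE_CODE (ws_stmt) == OMP_FOR;
}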

There is a limitation, however.  If the header expressions of the inner 
OMP_FOR use variables instead of constants, those variables will have 
been mapped and lowered by lower_omp.  For instance:

{
  int a[10], i, z;

  z = 5;
  #pragma omp parallel
  #pragma omp for schedule (dynamic)
  for (i = 0; i < z * 2; i++)
    a[i] = i;
}

becomes

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  z = 5;
  .omp_data_o.3.a = &a;
  .omp_data_o.3.z = z;
  #pragma omp parallel shared(z) shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
  # SUCC: 3 (ab)

  # BLOCK 3
  # PRED: 2 (ab)
  .omp_data_i = &.omp_data_o.3;
  D.1578 = .omp_data_i->z;
  D.1566 = D.1578 * 2;
  D.1581 = D.1566;
  #pragma omp for schedule(dynamic) private(i)
  for (i = 0; i < D.1581; i = i + 1)
  # SUCC: 4 (ab)

When we emit the combined parallel+for library call, we need to send 'z' 
among the other arguments.  The problem is that 'z' is not only mapped 
into the .omp_data_i structure; we also need to emit the call *before* 
the point where that computation is done.  So we would need to hoist the 
code in block 3 into block 2, which means crossing the assignment to 
'.omp_data_i' in block 3.

Doable, but we need dataflow information, and at the moment we do not 
build it.  We could hack something up for this specific case, but I'd 
rather have a clean IL first.  This micro-optimization can wait.

Bootstrapped and tested on x86, x86-64, ia64 and ppc64.

This is a fairly big patch.  Richard, could you go over it before I apply 
it?  I'm fairly confident about it, but I did re-arrange omp-low.c 
inside-out again.  Thanks.


        * tree-pretty-print.c (dump_generic_node): Handle
        OMP_PARALLEL_FN, OMP_PARALLEL_DATA_ARG and OMP_RETURN_EXPR.
        * cgraph.c (cgraph_expand_queue): Rename from
        cgraph_analyze_queue.
        Update all users.
        * cgraphunit.c (cgraph_assemble_pending_functions): Process
        cgraph_expand_queue.
        (cgraph_expand_all_functions): Likewise.
        (cgraph_finalize_pending_functions): Remove.  Update callers.

        * tree.h (OMP_DIRECTIVE_P): Define.
        (OMP_PARALLEL_FN): Define.
        (OMP_PARALLEL_DATA_ARG): Define.
        (OMP_SECTIONS_SECTIONS): Define.
        * tree-pass.h (pass_expand_omp): Declare.
        * omp-low.c (struct omp_region): Declare.
        (struct omp_context): Remove fields 'parallel_type',
        'parallel_start_ix' and 'parallel_start_additional_args'.
        Update all users.
        (struct omp_for_data): Rename from struct expand_omp_for_data.
        (omp_regions): New static variable.
        (root_omp_region): New static variable.
        (find_omp_clause): Make static.
        (is_in_combined_parallel_ctx): Remove.
        (is_combined_parallel): New.
        (extract_omp_for_data): Move earlier in the file.
        (workshare_safe_to_combine_p): New.
        (get_ws_args_for): New.
        (determine_parallel_type): Move earlier in the file.
        (omp_copy_decl_2): Do not set DECL_CONTEXT of new local to the
        child function.
        (omp_copy_decl): Likewise.
        (create_omp_child_function): Likewise.
        (lookup_omp_region): New.
        (dump_omp_region): New.
        (debug_omp_region): New.
        (debug_all_omp_regions): New.
        (new_omp_region): New.
        (scan_omp_parallel): If parallel_nesting_level > 1, the
        directive is nested within another parallel directive.
        Set OMP_PARALLEL_FN.
        (scan_omp_for): Do not try to handle combined parallel+for
        cases.
        Remove FIXME comment.
        (scan_omp_nested): Remove.
        (scan_omp_1): Do not call scan_omp_nested when
        parallel_nesting_level is > 1.
        Do not change the DECL_CONTEXT of local variables found.
        (lookup_decl_in_outer_ctx): New.
        (lower_rec_input_clauses): Rename from expand_rec_input_clauses.
        (lower_lastprivate_clauses): Rename from
        expand_lastprivate_clauses.
        (lower_reduction_clauses): Rename from expand_reduction_clauses.
        (lower_copyprivate_clauses): Rename from
        expand_copyprivate_clauses.
        If CTX is nested, lookup VAR in the outer context when
        building copy assignment.
        (lower_send_clauses): Rename from expand_send_clauses.
        If CTX is nested, lookup VAR in the outer context when
        building copy assignments.
        (lower_send_shared_vars): Rename from expand_send_shared_vars.
        If CTX is nested, lookup VAR in the outer context when
        building copy assignments.
        (expand_parallel_call): Rename from build_parallel_call.
        Handle combined parallel+workshare cases.
        Re-implement to emit code into the CFG.
        (list2chain): New.
        (expand_omp_parallel): Re-implement to emit code into the CFG.
        Call move_sese_region_to_fn to outline the sub-graph
        containing the parallel region.
        (expand_omp_for_1): Remove.
        (expand_omp_for_generic): Re-implement to emit code into the
        CFG.
        (expand_omp_for_static_nochunk): Likewise.
        (expand_omp_for_static_chunk): Likewise.
        (expand_omp_for): Likewise.
        (expand_omp_sections): Likewise.
        (remove_exit_barriers): New.
        (expand_omp_synch): New.
        (expand_omp): New.
        (build_omp_regions_1): New.
        (build_omp_regions): New.
        (execute_expand_omp): New.
        (gate_expand_omp): New.
        (pass_expand_omp): Define.
        (lower_omp_sections): Rename from expand_omp_sections.
        Set OMP_SECTIONS_SECTIONS.
        (lower_omp_single_simple): Rename from expand_omp_single_simple.
        (lower_omp_single_copy): Rename from expand_omp_single_copy.
        (lower_omp_single): Rename from expand_omp_simple.
        (lower_omp_master): Rename from expand_omp_master.
        (lower_omp_ordered): Rename from expand_omp_ordered.
        (lower_omp_critical): Rename from expand_omp_critical.
        (lower_omp_for_lastprivate): Rename from
        expand_omp_for_lastprivate.
        (lower_omp_for): Re-implement.
        (lower_omp_parallel): Re-implement.
        (lower_regimplify): Rename from expand_regimplify.
        (lower_omp_1): Rename from expand_omp_1.
        If there are syntax errors in the program, replace every
        OpenMP directive with NOP.
        Call lower_omp_* instead of expand_omp_*.
        (lower_omp): Rename from expand_omp.

        * tree-gimple.c (is_gimple_stmt): Handle OMP_RETURN_EXPR.
        * tree-gimple.h (enum omp_parallel_type): Remove.
        (gimple_boolify): Declare extern.
        (find_omp_clause, determine_parallel_type): Remove.

        * gimple-low.c (lower_omp_directive): New.
        (lower_stmt): Call it.
        (record_vars_into): Move from ...
        (record_vars): ... here.
        Call record_vars_into with current_function_decl.

        * gimplify.c (struct gimplify_ctx): Remove fields
        combined_pre_p and combined_ctxp.  Update users.
        (get_formal_tmp_var): Add documentation.
        (gimple_boolify): Make extern.
        (gimplify_expr_in_ctx): Remove.  Update callers.
        (gimplify_omp_parallel): Do not assume that OMP_PARALLEL_BODY
        will always be a BIND_EXPR.
        (gimplify_expr): Handle OMP_RETURN_EXPR.
        * tree.def (BLOCK): Remove documentation about BLOCK_TYPE_TAGS.
        (OMP_PARALLEL): Add 3 operands.
        (OMP_SECTIONS): Add 1 operand.
        (OMP_RETURN_EXPR): Define.

        * tree-inline.c (estimate_num_insns_1): Handle OpenMP directives.
        (copy_tree_r): Restore TREE_CHAIN in OMP_CLAUSE_*.
        * tree-iterator.c (alloc_stmt_list): Assert that we are not
        creating a circular free list.
        (free_stmt_list): Assert that we are not freeing stmt_list_cache.

        * tree-flow.h (move_sese_region_to_fn): Declare.
        (record_vars_into): Declare.
        * tree-cfg.c (make_omp_sections_edges): New.
        (make_exit_edges): Handle OMP_PARALLEL, OMP_FOR, OMP_SINGLE,
        OMP_MASTER, OMP_ORDERED, OMP_CRITICAL, OMP_RETURN_EXPR,
        OMP_SECTIONS and OMP_SECTION.
        (is_ctrl_altering_stmt): Return true for OMP_DIRECTIVE_P.
        (set_bb_for_stmt): Undo change to check currently_expanding_to_rtl.
        (verify_stmt): Do not handle OMP_DIRECTIVE_P.
        (gather_blocks_in_sese_region): New.
        (struct move_stmt_d): Declare.
        (move_stmt_r): New.
        (move_block_to_fn): New.
        (move_sese_region_to_fn): New.

        * passes.c (init_optimization_passes): Schedule
        pass_expand_omp after pass_init_datastructures.

        * tree-ssa-operands.c (get_expr_operands): Handle
        OMP_PARALLEL, OMP_SECTIONS, OMP_FOR, OMP_RETURN_EXPR,
        OMP_SINGLE, OMP_MASTER, OMP_ORDERED, OMP_CRITICAL.


fortran/

        * trans.h (build4_v): Define.
        * trans-openmp.c: Call build4_v to create OMP_PARALLEL nodes.
        Call build3_v to create OMP_SECTIONS nodes.


testsuite/

        * testsuite/gcc.dg/gomp/for-13.c: Use -fdump-tree-ompexp.
        * testsuite/gcc.dg/gomp/critical-1.c: Likewise.
        * testsuite/gcc.dg/gomp/critical-3.c: Likewise.
        * testsuite/gcc.dg/gomp/empty.c: Likewise.
        * testsuite/gcc.dg/gomp/ordered-1.c: Likewise.
        * testsuite/gcc.dg/gomp/for-4.c: Likewise.
        * testsuite/gcc.dg/gomp/for-6.c: Likewise.
        * testsuite/gcc.dg/gomp/master-3.c: Likewise.
        * testsuite/gcc.dg/gomp/for-8.c: Likewise.
        * testsuite/gcc.dg/gomp/for-10.c: Likewise.
        * testsuite/gcc.dg/gomp/for-18.c: Likewise.
        * testsuite/gcc.dg/gomp/for-5.c: Likewise.
        * testsuite/gcc.dg/gomp/for-7.c: Likewise.
        * testsuite/gcc.dg/gomp/for-9.c: Likewise.
        * testsuite/g++.dg/gomp/for-4.C: Likewise.
        * testsuite/g++.dg/gomp/for-6.C: Likewise.
        * testsuite/g++.dg/gomp/for-8.C: Likewise.
        * testsuite/g++.dg/gomp/for-10.C: Likewise.
        * testsuite/g++.dg/gomp/for-5.C: Likewise.
        * testsuite/g++.dg/gomp/for-7.C: Likewise.
        * testsuite/g++.dg/gomp/for-9.C: Likewise.

Attachment: 20060112-pargimple.diff.gz

