This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
This patch introduces initial support for an intermediate representation to express concurrency. It mostly re-uses the OMP_* codes we have for OpenMP. I have given no thought to whether we would need other codes to express other types of parallelism, but given that the runtime library is heavily influenced by OpenMP, adding other codes would not be useful at the moment (and I suspect that we will not be too eager to support anything other than OpenMP for now).

The idea is essentially the same one underlying the GIMPLE to low-GIMPLE transition. When we parse OpenMP directives, we create OMP_* trees, which are then taken to high-GIMPLE form and, up until now, immediately expanded into the corresponding runtime library calls into libgomp. This patch further separates the OpenMP front end from the back end by taking the OpenMP IL into low-GIMPLE and having it survive until we create the CFG. Currently, we cannot take an OpenMP program lower than the CFG because neither the SSA form nor the optimizers understand concurrency. It is significantly easier to make the CFG aware of parallelism by using abnormal edges when transitioning to blocks inside a parallel region.

By separating lowering from expansion we get the usual benefits of an optimizable IL. In this case the benefits are more noticeable because in OpenMP, expansion forces us to outline the parallel regions into separate functions. This means that local variables that need to be sent in/out of the parallel regions must be mapped into a special structure whose address is passed to the outlined function.

All this mapping is done in two phases: scan_omp and lower_omp. During scan_omp we create all the mappings and decide how local variables will be sent in and out of the parallel regions. During lower_omp, we do the actual re-mapping and convert the bodies of the OpenMP directives into low-GIMPLE form. Before this patch, lowering an OpenMP directive meant expanding it into the corresponding library calls.
With this patch, we merely lower the structure of the directives and create the variable mappings so that a CFG can be created over the program. So, the program:

{
  int a[10], i;

#pragma omp parallel
#pragma omp for schedule (dynamic)
  for (i = 0; i < 10; i++)
    a[i] = i;
}

becomes (.t18.omplower):

{
  struct .omp_data_s.1 .omp_data_o.3;
  int[10] * D.1573;
  int i.2;

  {
    .omp_data_o.3.a = &a;
    #pragma omp parallel shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
      {
        .omp_data_i = &.omp_data_o.3;
        {
          #pragma omp for schedule(dynamic) private(i)
          for (i = 0; i <= 9; i = i + 1)
            {
              i.0 = i;
              D.1573 = .omp_data_i->a;
              i.2 = i.0;
              (*D.1573)[i.2] = i;
              OMP_RETURN
            }
          __builtin_GOMP_barrier ();
        }
        OMP_RETURN
      }
  }
}

Notice that at this point we have exposed the data mappings via .omp_data_i and .omp_data_o.3. We have also determined where we are going to outline the parallel region (foo.omp_fn.0). The data sending is expressed with a simple assignment (.omp_data_i = &.omp_data_o.3).

After this, the program is converted into low GIMPLE and then the CFG is created:

{
  int i;
  int a[10];
  struct .omp_data_s.1 .omp_data_o.3;
  int[10] * D.1573;
  int i.2;

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  .omp_data_o.3.a = &a;
  #pragma omp parallel shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
  # SUCC: 3 (ab)

  # BLOCK 3
  # PRED: 2 (ab)
  .omp_data_i = &.omp_data_o.3;
  #pragma omp for schedule(dynamic) private(i)
  for (i = 0; i <= 9; i = i + 1)
  # SUCC: 4 (ab)

  # BLOCK 4
  # PRED: 3 (ab)
  i.0 = i;
  D.1573 = .omp_data_i->a;
  i.2 = i.0;
  (*D.1573)[i.2] = i;
  OMP_RETURN
  # SUCC: 5 (fallthru)

  # BLOCK 5
  # PRED: 4 (fallthru)
  __builtin_GOMP_barrier ();
  OMP_RETURN
  # SUCC: 6 (fallthru)

  # BLOCK 6
  # PRED: 5 (fallthru)
  return;
  # SUCC: EXIT
}

All the CFG edges involving parallel or workshare regions are abnormal edges. In the future we may want to change that, but for now this is sufficient to keep structural optimizations from messing things up. At this point, we can do some optimizations.
The first obvious one is to remove that barrier in block 5 (we do it during the expansion of the OMP_PARALLEL region). We also make sure that the parallel region to be outlined is an SESE region delimited by the OMP_* directive and OMP_RETURN. Since the data sharing has already been expressed and exposed with the .omp_data_s mappings, all we need to do now is outline the SESE region in the CFG between blocks 3 and 5. This is done by pass_expand_omp, which first discovers the parallel and workshare regions to expand and then does all the outlining.

This is the part that gave me the most grief. I had originally thought that this was a 2-day patch (almost 6 weeks ago). We did not really know how to create functions out of a subgraph, so I spent quite a bit of time tweaking the new helper move_sese_region_to_fn() and the associated callgraph changes.

So, now in .t26.ompexp, we get the fully expanded program with all the calls into the library:

OMP region tree

#pragma omp parallel shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
    #pragma omp for schedule(dynamic) private(i)
    OMP_RETURN
OMP_RETURN

Merging blocks 2 and 10
Merging blocks 2 and 6

foo ()
{
  int i;
  int a[10];
  struct .omp_data_s.1 .omp_data_o.3;

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  .omp_data_o.3.a = &a;
  __builtin_GOMP_parallel_loop_dynamic_start (foo.omp_fn.0, &.omp_data_o.3, 0, 0, 10, 1, 1);
  foo.omp_fn.0 (&.omp_data_o.3);
  __builtin_GOMP_parallel_end ();
  return;
  # SUCC: EXIT
}

foo.omp_fn.0 (.omp_data_i)
{
  _Bool D.1593;
  int i.2;
  int[10] * D.1574;
  int D.1592;
  long int .iend0.5;
  long int .istart0.4;
  _Bool D.1594;
  int i.0;
  int a[10] [value-expr: *.omp_data_i->a];
  int i;

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  # SUCC: 3 (fallthru)

  # BLOCK 3
  # PRED: 6 [100.0%] (false) 2 (fallthru)
  <L2>:;
  D.1594 = __builtin_GOMP_loop_dynamic_next (&.istart0.4, &.iend0.5);
  if (D.1594) goto <L0>; else goto <L3>;
  # SUCC: 4 (false) 5 (true)

  # BLOCK 4
  # PRED: 3 (false)
  <L3>:;
  return;
  # SUCC: EXIT

  # BLOCK 5
  # PRED: 3 (true)
  <L0>:;
  i = .istart0.4;
  D.1592 = .iend0.5;
  # SUCC: 6 (fallthru)

  # BLOCK 6
  # PRED: 5 (fallthru) 6 (true)
  <L1>:;
  i.0 = i;
  D.1574 = .omp_data_i->a;
  i.2 = i.0;
  (*D.1574)[i.2] = i;
  i = i + 1;
  D.1593 = i < D.1592;
  if (D.1593) goto <L1>; else goto <L2>;
  # SUCC: 3 [100.0%] (false) 6 (true)
}

Here, we have removed the .omp_data_i = &.omp_data_o assignment, created the outlined function, and done another minor optimization: we are calling the combined parallel loop startup function instead of the two separate parallel start and loop start routines. This saves us from doing another barrier at runtime. The code to detect when to do this optimization is now easy to implement. Essentially, if the basic block for OMP_PARALLEL immediately precedes the basic block for OMP_FOR, the directives can be merged:

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  .omp_data_o.3.a = &a;
  #pragma omp parallel shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
  # SUCC: 3 (ab)

  # BLOCK 3
  # PRED: 2 (ab)
  .omp_data_i = &.omp_data_o.3;
  #pragma omp for schedule(dynamic) private(i)
  for (i = 0; i <= 9; i = i + 1)
  # SUCC: 4 (ab)

There is a limitation, however. If the header expressions of the inner OMP_FOR use variables instead of constants, those variables will have been mapped and lowered by lower_omp. For instance:

{
  int a[10], i, z;

  z = 5;

#pragma omp parallel
#pragma omp for schedule (dynamic)
  for (i = 0; i < z * 2; i++)
    a[i] = i;
}

becomes

  # BLOCK 2
  # PRED: ENTRY (fallthru)
  z = 5;
  .omp_data_o.3.a = &a;
  .omp_data_o.3.z = z;
  #pragma omp parallel shared(z) shared(a) [child fn: foo.omp_fn.0 (.omp_data_o.3)]
  # SUCC: 3 (ab)

  # BLOCK 3
  # PRED: 2 (ab)
  .omp_data_i = &.omp_data_o.3;
  D.1578 = .omp_data_i->z;
  D.1566 = D.1578 * 2;
  D.1581 = D.1566;
  #pragma omp for schedule(dynamic) private(i)
  for (i = 0; i < D.1581; i = i + 1)
  # SUCC: 4 (ab)

When we emit the combined parallel+for library call, we need to send 'z' among the other arguments.
The problem, however, is that we not only have 'z' mapped into the .omp_data_i structure, we also need to emit the library call *before* the point where that computation is now done. So, we would need to hoist the code in block 3 into block 2, which means crossing the assignment to '.omp_data_i' in block 3. Doable, but we need dataflow information, and at the moment we don't build it. We could hack something up for this specific case, but I'd rather have a clean IL first. This micro-optimization can wait.

Bootstrapped and tested on x86, x86-64, ia64 and ppc64.

This is a fairly big patch. Richard, could you go over it before I apply it? I'm fairly confident about it, but I did re-arrange omp-low.c inside out again. Thanks.

        * tree-pretty-print.c (dump_generic_node): Handle OMP_PARALLEL_FN,
        OMP_PARALLEL_DATA_ARG and OMP_RETURN_EXPR.
        * cgraph.c (cgraph_expand_queue): Rename from cgraph_analyze_queue.
        Update all users.
        * cgraphunit.c (cgraph_assemble_pending_functions): Process
        cgraph_expand_queue.
        (cgraph_expand_all_functions): Likewise.
        (cgraph_finalize_pending_functions): Remove.  Update callers.
        * tree.h (OMP_DIRECTIVE_P): Define.
        (OMP_PARALLEL_FN): Define.
        (OMP_PARALLEL_DATA_ARG): Define.
        (OMP_SECTIONS_SECTIONS): Define.
        * tree-pass.h (pass_expand_omp): Declare.
        * omp-low.c (struct omp_region): Declare.
        (struct omp_context): Remove fields 'parallel_type',
        'parallel_start_ix' and 'parallel_start_additional_args'.
        Update all users.
        (struct omp_for_data): Rename from struct expand_omp_for_data.
        (omp_regions): New static variable.
        (root_omp_region): New static variable.
        (find_omp_clause): Make static.
        (is_in_combined_parallel_ctx): Remove.
        (is_combined_parallel): New.
        (extract_omp_for_data): Move earlier in the file.
        (workshare_safe_to_combine_p): New.
        (get_ws_args_for): New.
        (determine_parallel_type): Move earlier in the file.
        (omp_copy_decl_2): Do not set DECL_CONTEXT of new local to the
        child function.
        (omp_copy_decl): Likewise.
        (create_omp_child_function): Likewise.
        (lookup_omp_region): New.
        (dump_omp_region): New.
        (debug_omp_region): New.
        (debug_all_omp_regions): New.
        (new_omp_region): New.
        (scan_omp_parallel): If parallel_nesting_level > 1, the
        directive is nested within another parallel directive.
        Set OMP_PARALLEL_FN.
        (scan_omp_for): Do not try to handle combined parallel+for
        cases.  Remove FIXME comment.
        (scan_omp_nested): Remove.
        (scan_omp_1): Do not call scan_omp_nested when
        parallel_nesting_level is > 1.  Do not change the DECL_CONTEXT
        of local variables found.
        (lookup_decl_in_outer_ctx): New.
        (lower_rec_input_clauses): Rename from expand_rec_input_clauses.
        (lower_lastprivate_clauses): Rename from expand_lastprivate_clauses.
        (lower_reduction_clauses): Rename from expand_reduction_clauses.
        (lower_copyprivate_clauses): Rename from expand_copyprivate_clauses.
        If CTX is nested, lookup VAR in the outer context when building
        copy assignment.
        (lower_send_clauses): Rename from expand_send_clauses.
        If CTX is nested, lookup VAR in the outer context when building
        copy assignments.
        (lower_send_shared_vars): Rename from expand_send_shared_vars.
        If CTX is nested, lookup VAR in the outer context when building
        copy assignments.
        (expand_parallel_call): Rename from build_parallel_call.
        Handle combined parallel+workshare cases.
        Re-implement to emit code into the CFG.
        (list2chain): New.
        (expand_omp_parallel): Re-implement to emit code into the CFG.
        Call move_sese_region_to_fn to outline the sub-graph
        containing the parallel region.
        (expand_omp_for_1): Remove.
        (expand_omp_for_generic): Re-implement to emit code into the CFG.
        (expand_omp_for_static_nochunk): Likewise.
        (expand_omp_for_static_chunk): Likewise.
        (expand_omp_for): Likewise.
        (expand_omp_sections): Likewise.
        (remove_exit_barriers): New.
        (expand_omp_synch): New.
        (expand_omp): New.
        (build_omp_regions_1): New.
        (build_omp_regions): New.
        (execute_expand_omp): New.
        (gate_expand_omp): New.
        (pass_expand_omp): Define.
        (lower_omp_sections): Rename from expand_omp_sections.
        Set OMP_SECTIONS_SECTIONS.
        (lower_omp_single_simple): Rename from expand_omp_single_simple.
        (lower_omp_single_copy): Rename from expand_omp_single_copy.
        (lower_omp_single): Rename from expand_omp_simple.
        (lower_omp_master): Rename from expand_omp_master.
        (lower_omp_ordered): Rename from expand_omp_ordered.
        (lower_omp_critical): Rename from expand_omp_critical.
        (lower_omp_for_lastprivate): Rename from expand_omp_for_lastprivate.
        (lower_omp_for): Re-implement.
        (lower_omp_parallel): Re-implement.
        (lower_regimplify): Rename from expand_regimplify.
        (lower_omp_1): Rename from expand_omp_1.  If there are syntax
        errors in the program, replace every OpenMP directive with NOP.
        Call lower_omp_* instead of expand_omp_*.
        (lower_omp): Rename from expand_omp.
        * tree-gimple.c (is_gimple_stmt): Handle OMP_RETURN_EXPR.
        * tree-gimple.h (enum omp_parallel_type): Remove.
        (gimple_boolify): Declare extern.
        (find_omp_clause, determine_parallel_type): Remove.
        * gimple-low.c (lower_omp_directive): New.
        (lower_stmt): Call it.
        (record_vars_into): Move from ...
        (record_vars): ... here.  Call record_vars_into with
        current_function_decl.
        * gimplify.c (struct gimplify_ctx): Remove fields combined_pre_p
        and combined_ctxp.  Update users.
        (get_formal_tmp_var): Add documentation.
        (gimple_boolify): Make extern.
        (gimplify_expr_in_ctx): Remove.  Update callers.
        (gimplify_omp_parallel): Do not assume that OMP_PARALLEL_BODY
        will always be a BIND_EXPR.
        (gimplify_expr): Handle OMP_RETURN_EXPR.
        * tree.def (BLOCK): Remove documentation about BLOCK_TYPE_TAGS.
        (OMP_PARALLEL): Add 3 operands.
        (OMP_SECTIONS): Add 1 operand.
        (OMP_RETURN_EXPR): Define.
        * tree-inline.c (estimate_num_insns_1): Handle OpenMP directives.
        (copy_tree_r): Restore TREE_CHAIN in OMP_CLAUSE_*.
        * tree-iterator.c (alloc_stmt_list): Assert that we are not
        creating a circular free list.
        (free_stmt_list): Assert that we are not freeing stmt_list_cache.
        * tree-flow.h (move_sese_region_to_fn): Declare.
        (record_vars_into): Declare.
        * tree-cfg.c (make_omp_sections_edges): New.
        (make_exit_edges): Handle OMP_PARALLEL, OMP_FOR, OMP_SINGLE,
        OMP_MASTER, OMP_ORDERED, OMP_CRITICAL, OMP_RETURN_EXPR,
        OMP_SECTIONS and OMP_SECTION.
        (is_ctrl_altering_stmt): Return true for OMP_DIRECTIVE_P.
        (set_bb_for_stmt): Undo change to check
        currently_expanding_to_rtl.
        (verify_stmt): Do not handle OMP_DIRECTIVE_P.
        (gather_blocks_in_sese_region): New.
        (struct move_stmt_d): Declare.
        (move_stmt_r): New.
        (move_block_to_fn): New.
        (move_sese_region_to_fn): New.
        * passes.c (init_optimization_passes): Schedule pass_expand_omp
        after pass_init_datastructures.
        * tree-ssa-operands.c (get_expr_operands): Handle OMP_PARALLEL,
        OMP_SECTIONS, OMP_FOR, OMP_RETURN_EXPR, OMP_SINGLE, OMP_MASTER,
        OMP_ORDERED, OMP_CRITICAL.

fortran/

        * trans.h (build4_v): Define.
        * trans-openmp.c: Call build4_v to create OMP_PARALLEL nodes.
        Call build3_v to create OMP_SECTIONS nodes.

testsuite/

        * testsuite/gcc.dg/gomp/for-13.c: Use -fdump-tree-ompexp.
        * testsuite/gcc.dg/gomp/critical-1.c: Likewise.
        * testsuite/gcc.dg/gomp/critical-3.c: Likewise.
        * testsuite/gcc.dg/gomp/empty.c: Likewise.
        * testsuite/gcc.dg/gomp/ordered-1.c: Likewise.
        * testsuite/gcc.dg/gomp/for-4.c: Likewise.
        * testsuite/gcc.dg/gomp/for-6.c: Likewise.
        * testsuite/gcc.dg/gomp/master-3.c: Likewise.
        * testsuite/gcc.dg/gomp/for-8.c: Likewise.
        * testsuite/gcc.dg/gomp/for-10.c: Likewise.
        * testsuite/gcc.dg/gomp/for-18.c: Likewise.
        * testsuite/gcc.dg/gomp/for-5.c: Likewise.
        * testsuite/gcc.dg/gomp/for-7.c: Likewise.
        * testsuite/gcc.dg/gomp/for-9.c: Likewise.
        * testsuite/g++.dg/gomp/for-4.C: Likewise.
        * testsuite/g++.dg/gomp/for-6.C: Likewise.
        * testsuite/g++.dg/gomp/for-8.C: Likewise.
        * testsuite/g++.dg/gomp/for-10.C: Likewise.
        * testsuite/g++.dg/gomp/for-5.C: Likewise.
        * testsuite/g++.dg/gomp/for-7.C: Likewise.
        * testsuite/g++.dg/gomp/for-9.C: Likewise.
Attachment: 20060112-pargimple.diff.gz (GNU Zip compressed data)