[gomp4] Preserve NVPTX "reconvergence" points

Richard Biener richard.guenther@gmail.com
Thu May 28 15:02:00 GMT 2015


On Thu, May 28, 2015 at 4:06 PM, Julian Brown <julian@codesourcery.com> wrote:
> For NVPTX, it is vitally important that the divergence of threads
> within a warp can be controlled: in particular we must be able to
> generate code that we know "reconverges" at a particular point.
> Unfortunately GCC's middle-end optimisers can cause this property to
> be violated, which causes problems for the OpenACC execution model
> we're planning to use for NVPTX.
>
> As a brief example: code running in vector-single mode runs on a
> single thread of a warp, and must broadcast condition results to other
> threads of the warp so that they can "follow along" and be ready for
> vector-partitioned execution when necessary.
>
> #pragma acc parallel
> {
>   #pragma acc loop gang
>   for (i = 0; i < N; i++)
>   {
>     /* This is vector-single mode.  */
>     n = ...;
>     switch (n)
>     {
>     case 1:
>       #pragma acc loop vector
>       for (...)
>       {
>         /* This is vector-partitioned mode.  */
>       }
>       ...
>     }
>   }
> }
>
> Here, the calculation "n = ..." takes place on a single thread (of
> each partitioned gang of the outer loop), but the switch statement
> (terminating the BB) must be executed by all threads in the warp. The
> vector-single statements will be translated using a branch around for
> the "idle" threads:
>
> if (threadIdx.x == 0)
> {
>   n_0 = ...;
> }
> n_x = broadcast (n_0)
> switch (n_x)
> ...
>
> Where "broadcast" is an operation that transfers values from some
> other thread of a warp (i.e., the zeroth) to the current thread
> (implemented as a "shfl" instruction for NVPTX).
>
> I observed a similar example to this cloning the broadcast and switch
> instructions (in the .dom1 dump), along the lines of:
>
> if (threadIdx.x == 0)
> {
>   n_0 = ...;
>   n_x = broadcast (n_0)
>   switch (n_x)
>   ...
> }
> else
> {
>   n_x = broadcast (n_0)
>   switch (n_x)
>   ...
> }
>
> This doesn't work because the "broadcast" operation has to be run with
> non-diverged warps for correct operation, and here there is divergence
> due to the "if (threadIdx.x == 0)" condition.
>
> So, the way I have tried to handle this is by attempting to inhibit
> optimisation along edges which have a reconvergence point as their
> destination. The essential idea is to make such edges "abnormal",
> although the existing EDGE_ABNORMAL flag is not used because that has
> implicit meaning built into it already, and the new edge type may need
> to be handled differently in some areas. One example is that at
> present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL
> set on their EDGE_TRUE or EDGE_FALSE outgoing edges.
>
> The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE),
> for the GIMPLE CFG only. In principle there's nothing to stop the flag
> being propagated to the RTL CFG also, in which case it'd probably be
> set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g.
> EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which
> inspect the RTL CFG can continue to only check the ABNORMAL flag. But
> so far (in rather limited testing!), that has not been observed to be
> necessary. (We can control RTL CFG manipulation indirectly by using the
> CANNOT_COPY_INSN_P target hook, sensitive e.g. to the "broadcast"
> instruction.)
>
> For the GIMPLE CFG (i.e. in passes operating on GIMPLE form),
> EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e.,
> inhibiting certain optimisations), and so has been added to relevant
> conditionals largely mechanically. Places where it is treated specially
> are:
>
> * tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on
>   outgoing edges of a block concluding with a GIMPLE_COND statement.
>   But, we allow EDGE_TO_RECONVERGENCE there.
>
> * tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND
>   edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass
>   tries to split the edge later).
>
> There are probably other optimisations that will be tripped up by the
> new flag along the same lines as the VRP tweak above, which we will no
> doubt discover in due course.
>
> Together with the patch,
>
>   https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html
>
> This shows no regressions for the libgomp tests.
>
> OK for gomp4 branch?

Hmm, I don't think adding a new edge flag is good nor necessary.  It seems to
me that instead the broadcast operation should have abnormal control flow
and thus basic-blocks should be split either before or after it (so either
incoming or outgoing edge(s) should be abnormal).  I suppose splitting
before the broadcast would be best (thus handle it similar to setjmp ()).

Richard.

> Thanks,
>
> Julian
>
>     ChangeLog
>
>     gcc/
>     * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag.
>     (bb_hash_abnorm_or_reconv_pred): New function.
>     (hash_abnormal_or_eh_outgoing_edge_p): Consider
>     EDGE_TO_RECONVERGENCE also.
>     * cfg-flags.def (TO_RECONVERGENCE): Add flag.
>     * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges
>     leading to a reconvergence point.
>     * cfgbuild.c (purge_dead_tablejump_edges): Consider
>     EDGE_TO_RECONVERGENCE.
>     * cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise.
>     * cfgexpand.c (expand_gimple_tailcall, construct_exit_block)
>     (pass_expand::execute): Likewise.
>     * cfghooks.c (can_copy_bbs_p): Likewise.
>     * cfgloop.c (bb_loop_header_p): Likewise.
>     * cfgloopmanip.c (scale_loop_profile): Likewise.
>     * gimple-iterator.c (gimple_find_edge_insert_loc): Likewise.
>     * graph.c (draw_cfg_node_succ_edges): Likewise.
>     * graphite-scope-detection.c (canonicalize_loop_closed_ssa):
>     Likewise.
>     * predict.c (tree_bb_level_predictions): Likewise.
>     * profile.c (instrument_edges, branch_prop, find_spanning_tree):
>     Likewise.
>     * tree-cfg.c (replace_uses_by, gimple_split_edge)
>     (gimple_redirect_edge_and_branch, split_critical_edges): Likewise.
>     * tree-cfgcleanup.c (tree_forwarder_block_p, remove_forwarder_block)
>     (pass_merge_phi::execute): Likewise.
>     * tree-chkp.c (chkp_fix_cfg): Likewise.
>     * tree-if-conv.c (if_convertible_bb_p): Likewise.
>     * tree-inline.c (update_ssa_across_abnormal_edges): Likewise.
>     * tree-into-ssa.c (rewrite_update_phi_arguments)
>     (rewrite_update_dom_walker::before_dom_children)
>     (create_new_def_for): Likewise.
>     * tree-outof-ssa.c (eliminate_phi): Likewise.
>     * tree-phinodes.c (add_phi_arg): Likewise.
>     * tree-ssa-coalesce (coalesce_cost_edge, create_outofssa_var_map)
>     (coalesce_partitions): Likewise.
>     * tree-ssa-dom.c (cprop_into_successor_phis)
>     (dom_opt_dom_walker::after_dom_children, propagate_rhs_into_lhs):
>     Likewise.
>     * tree-ssa-loop-im.c (loop_suitable_for_sm): Likewise.
>     * tree-ssa-loop-prefetch.c (emit_mfence_after_loop)
>     (may_use_storent_in_loop_p): Likewise.
>     * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Likewise.
>     * tree-ssa-pre.c (compute_antic, insert_into_preds_of_block):
>     Likewise.
>     * tree-ssa-propagate.c (simulate_block, replace_phi_args_in):
>     Likewise.
>     * tree-ssa-sink.c (sink_code_in_bb): Likewise.
>     * tree-ssa-threadedge.c (thread_across_edge): Likewise.
>     * tree-ssa-threadupdate.c (thread_single_edge): Likewise.
>     * tree-ssa-uninit.c (compute_control_dep_chain): Likewise.
>     * tree-ssa.c (verify_phi_args): Likewise.
>     * tree-vect-loop.c (vect_analyze_loop_form): Likewise.
>     * value-prof.c (gimple_ic): Likewise.
>     * tree-vrp.c (infer_value_range, process_assert_insertions_for):
>     Likewise.
>     (find_conditional_asserts): Skip over EDGE_TO_RECONVERGENCE edges.



More information about the Gcc-patches mailing list