[gomp4] Preserve NVPTX "reconvergence" points

Julian Brown julian@codesourcery.com
Thu May 28 14:20:00 GMT 2015


For NVPTX, it is vitally important that the divergence of threads
within a warp can be controlled: in particular we must be able to
generate code that we know "reconverges" at a particular point.
Unfortunately GCC's middle-end optimisers can cause this property to
be violated, which causes problems for the OpenACC execution model
we're planning to use for NVPTX.

As a brief example: code running in vector-single mode runs on a
single thread of a warp, and must broadcast condition results to other
threads of the warp so that they can "follow along" and be ready for
vector-partitioned execution when necessary.

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
  {
    /* This is vector-single mode.  */
    n = ...;
    switch (n)
    {
    case 1:
      #pragma acc loop vector
      for (...)
      {
        /* This is vector-partitioned mode.  */
      }
      ...
    }
  }
}

Here, the calculation "n = ..." takes place on a single thread (of
each partitioned gang of the outer loop), but the switch statement
(terminating the BB) must be executed by all threads in the warp. The
vector-single statements will be translated using a branch around for
the "idle" threads:

if (threadIdx.x == 0)
{
  n_0 = ...;
}
n_x = broadcast (n_0)
switch (n_x)
...

Where "broadcast" is an operation that transfers values from some
other thread of a warp (i.e., the zeroth) to the current thread
(implemented as a "shfl" instruction for NVPTX).

I observed a similar example to this cloning the broadcast and switch
instructions (in the .dom1 dump), along the lines of:

if (threadIdx.x == 0)
{
  n_0 = ...;
  n_x = broadcast (n_0)
  switch (n_x)
  ...
}
else
{
  n_x = broadcast (n_0)
  switch (n_x)
  ...
}

This doesn't work because the "broadcast" operation has to be run with
non-diverged warps for correct operation, and here there is divergence
due to the "if (threadIdx.x == 0)" condition.

So, the way I have tried to handle this is by attempting to inhibit
optimisation along edges which have a reconvergence point as their
destination. The essential idea is to make such edges "abnormal",
although the existing EDGE_ABNORMAL flag is not used because that has
implicit meaning built into it already, and the new edge type may need
to be handled differently in some areas. One example is that at
present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL
set on their EDGE_TRUE or EDGE_FALSE outgoing edges.

The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE),
for the GIMPLE CFG only. In principle there's nothing to stop the flag
being propagated to the RTL CFG also, in which case it'd probably be
set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g.
EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which
inspect the RTL CFG can continue to only check the ABNORMAL flag. But
so far (in rather limited testing!), that has not been observed to be
necessary. (We can control RTL CFG manipulation indirectly by using the
CANNOT_COPY_INSN_P target hook, sensitive e.g. to the "broadcast"
instruction.)

For the GIMPLE CFG (i.e. in passes operating on GIMPLE form),
EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e.,
inhibiting certain optimisations), and so has been added to relevant
conditionals largely mechanically. Places where it is treated specially
are:

* tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on
  outgoing edges of a block concluding with a GIMPLE_COND statement.
  But, we allow EDGE_TO_RECONVERGENCE there.

* tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND
  edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass
  tries to split the edge later).

There are probably other optimisations that will be tripped up by the
new flag along the same lines as the VRP tweak above, which we will no
doubt discover in due course.

Together with the patch,

  https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html

This shows no regressions for the libgomp tests.

OK for gomp4 branch?

Thanks,

Julian

    ChangeLog

    gcc/
    * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag.
    (bb_hash_abnorm_or_reconv_pred): New function.
    (hash_abnormal_or_eh_outgoing_edge_p): Consider
    EDGE_TO_RECONVERGENCE also.
    * cfg-flags.def (TO_RECONVERGENCE): Add flag.
    * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges
    leading to a reconvergence point.
    * cfgbuild.c (purge_dead_tablejump_edges): Consider
    EDGE_TO_RECONVERGENCE.
    * cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise.
    * cfgexpand.c (expand_gimple_tailcall, construct_exit_block)
    (pass_expand::execute): Likewise.
    * cfghooks.c (can_copy_bbs_p): Likewise.
    * cfgloop.c (bb_loop_header_p): Likewise.
    * cfgloopmanip.c (scale_loop_profile): Likewise.
    * gimple-iterator.c (gimple_find_edge_insert_loc): Likewise.
    * graph.c (draw_cfg_node_succ_edges): Likewise.
    * graphite-scope-detection.c (canonicalize_loop_closed_ssa):
    Likewise.
    * predict.c (tree_bb_level_predictions): Likewise.
    * profile.c (instrument_edges, branch_prop, find_spanning_tree):
    Likewise.
    * tree-cfg.c (replace_uses_by, gimple_split_edge)
    (gimple_redirect_edge_and_branch, split_critical_edges): Likewise.
    * tree-cfgcleanup.c (tree_forwarder_block_p, remove_forwarder_block)
    (pass_merge_phi::execute): Likewise.
    * tree-chkp.c (chkp_fix_cfg): Likewise.
    * tree-if-conv.c (if_convertible_bb_p): Likewise.
    * tree-inline.c (update_ssa_across_abnormal_edges): Likewise.
    * tree-into-ssa.c (rewrite_update_phi_arguments)
    (rewrite_update_dom_walker::before_dom_children)
    (create_new_def_for): Likewise.
    * tree-outof-ssa.c (eliminate_phi): Likewise.
    * tree-phinodes.c (add_phi_arg): Likewise.
    * tree-ssa-coalesce (coalesce_cost_edge, create_outofssa_var_map)
    (coalesce_partitions): Likewise.
    * tree-ssa-dom.c (cprop_into_successor_phis)
    (dom_opt_dom_walker::after_dom_children, propagate_rhs_into_lhs):
    Likewise.
    * tree-ssa-loop-im.c (loop_suitable_for_sm): Likewise.
    * tree-ssa-loop-prefetch.c (emit_mfence_after_loop)
    (may_use_storent_in_loop_p): Likewise.
    * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Likewise.
    * tree-ssa-pre.c (compute_antic, insert_into_preds_of_block):
    Likewise.
    * tree-ssa-propagate.c (simulate_block, replace_phi_args_in):
    Likewise.
    * tree-ssa-sink.c (sink_code_in_bb): Likewise.
    * tree-ssa-threadedge.c (thread_across_edge): Likewise.
    * tree-ssa-threadupdate.c (thread_single_edge): Likewise.
    * tree-ssa-uninit.c (compute_control_dep_chain): Likewise.
    * tree-ssa.c (verify_phi_args): Likewise.
    * tree-vect-loop.c (vect_analyze_loop_form): Likewise.
    * value-prof.c (gimple_ic): Likewise.
    * tree-vrp.c (infer_value_range, process_assert_insertions_for):
    Likewise.
    (find_conditional_asserts): Skip over EDGE_TO_RECONVERGENCE edges.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: to-reconvergence-4.diff
Type: text/x-patch
Size: 31591 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20150528/0bb8201e/attachment.bin>


More information about the Gcc-patches mailing list