[RFC] Don't move cold code out of loop by checking bb count

Richard Biener richard.guenther@gmail.com
Fri Oct 15 08:11:52 GMT 2021


On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
> Hi,
>
> On 2021/9/28 20:09, Richard Biener wrote:
> > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>
> >> Update the patch to v3, not sure whether you prefer the paste style
> >> and continue to link the previous thread as Segher dislikes this...
> >>
> >>
> >> [PATCH v3] Don't move cold code out of loop by checking bb count
> >>
> >>
> >> Changes:
> >> 1. Handle max_loop in determine_max_movement instead of
> >> outermost_invariant_loop.
> >> 2. Remove unnecessary changes.
> >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> >> infinite loop when implementing v1 and the iteration is missed to be
> >> updated actually.
> >>
> >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> >>
> >> There was a patch trying to avoid move cold block out of loop:
> >>
> >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>
> >> Richard suggested to "never hoist anything from a bb with lower execution
> >> frequency to a bb with higher one in LIM invariantness_dom_walker
> >> before_dom_children".
> >>
> >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
> >> expected target loop, if profile count of the loop bb is colder
> >> than target loop preheader, it won't be hoisted out of loop.
> >> Likely for store motion, if all locations of the REF in loop is cold,
> >> don't do store motion of it.
> >>
> >> SPEC2017 performance evaluation shows 1% performance improvement for
> >> intrate GEOMEAN and no obvious regression for others.  Especially,
> >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >> on P8LE.
> >>
> >> gcc/ChangeLog:
> >>
> >>         * loop-invariant.c (find_invariants_bb): Check profile count
> >>         before motion.
> >>         (find_invariants_body): Add argument.
> >>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
> >>         (determine_max_movement): Use find_coldest_out_loop.
> >>         (move_computations_worker): Adjust and fix iteration udpate.
> >>         (execute_sm_exit): Check pointer validness.
> >>         (class ref_in_loop_hot_body): New functor.
> >>         (ref_in_loop_hot_body::operator): New.
> >>         (can_sm_ref_p): Use for_all_locs_in_loop.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> >>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >> ---
> >>  gcc/loop-invariant.c                       | 10 ++--
> >>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
> >>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
> >>  7 files changed, 165 insertions(+), 8 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >>
> >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >> index fca0c2b24be..5c3be7bf0eb 100644
> >> --- a/gcc/loop-invariant.c
> >> +++ b/gcc/loop-invariant.c
> >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
> >>     call.  */
> >>
> >>  static void
> >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
> >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> >> +                   bool always_executed)
> >>  {
> >>    rtx_insn *insn;
> >> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >> +
> >> +  if (preheader->count > bb->count)
> >> +    return;
> >>
> >>    FOR_BB_INSNS (bb, insn)
> >>      {
> >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
> >>    unsigned i;
> >>
> >>    for (i = 0; i < loop->num_nodes; i++)
> >> -    find_invariants_bb (body[i],
> >> -                       bitmap_bit_p (always_reached, i),
> >> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
> >>                         bitmap_bit_p (always_executed, i));
> >>  }
> >>
> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >> index 4b187c2cdaf..655fab03442 100644
> >> --- a/gcc/tree-ssa-loop-im.c
> >> +++ b/gcc/tree-ssa-loop-im.c
> >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
> >>    return ret;
> >>  }
> >>
> >> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
> >> +
> >> +static class loop *
> >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >> +                      basic_block curr_bb)
> >> +{
> >> +  class loop *cold_loop, *min_loop;
> >> +  cold_loop = min_loop = outmost_loop;
> >> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> >> +
> >> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
> >
> > Honza - can you comment on whether we should compare BB counts this way?
> >
> > I would suspect that for, say,
> >
> >   for (...)
> >      if (a)
> >        X;
> >      else
> >        Y;
> >
> > that the counts for X and Y will be less than that of the preheader of the loop
> > only when the loop is estimated to run once.  That is, should we really compare
> > the to the preheader count or maybe better to the _header_ count which
> > would keep the number of iterations out of the equation?
>
> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> loop_preheader_edge (loop)->dest, it will cause many failures in
> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
> reasonable to compare the bb count with preheader count as both gimple lim
> and RTL loop-invariant move instructions to *preheader* instead of *header*
> after analysis?

Hmm, yeah - guess I was confused here.

> >
> > If we look at maybe_hot_count_p that's a quite sophisticated thing to
> > compare a count to the "IPA hot", here we're comparing two counts
> > within a function where it actually matters whether we use a<b or
> > !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> > function).
> >
> > Xionghu, you error on the side of not hoisting for unordered counts here
> >
> >> +    return NULL;
> >> +
> >> +  while (min_loop != loop)
> >> +    {
> >> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> >
> > but in the other direction here and on the side of not hoisting
> > in ref_in_loop_hot_body.
> >
> > The three-state relational operator overloads are probably not the
> > very best idea...
> > (see profile-count.h for them)
> >
> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
> if FALSE is returned due to three-state inequality,  find_coldest_out_loop
> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
> will return true to continue perform store motion, both preserve the previous
> behavior.

Thanks.  But I don't think the abstraction as written is useful:

+/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
+   as stated in profile-count.h, FALSE is returned if inequality cannot be
+   decided.  */
+bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
+{
+  if (count1 < count2)
+    return true;
+  else
+    return false;
+}

given the following seems to pass the preheader count in place of the BB count.

+      if (bb_colder_than_loop_preheader (
+           loop_preheader_edge (min_loop)->src->count, min_count))
+       cold_loop = min_loop;

find_coldest_out_loop is also a bit weird, I think we want to find
the outermost loop between outmost_loop and loop that has a
lower count than the curr_bb count but

+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (bb_colder_than_loop_preheader (
+           loop_preheader_edge (min_loop)->src->count, min_count))
+       cold_loop = min_loop;

compares the outermost loops count (min_count) against the preheader
count?  So we're searching for a cold loop with respect to its enclosing loop
here?

Why is this function not simply

+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+                      basic_block curr_bb)
+{
     while (bb_colder_than_loop_preheader (curr_bb->count,
               loop_preheader_edge (outermost_loop)->src->count))
        {
            if (outermost_loop == loop)
              return NULL;
            outermost_loop = superloop_at_depth (loop, loop_depth
(outermost_loop) + 1);
        }
     return outermost_loop;
}

?

Likewise I wonder why ref_in_loop_hot_body::operator () needs to call
find_coldest_out_loop and why it not simply does

+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
    if (bb_colder_than_loop_preheader (curr_bb->count,
loop_preheader_edge (l)->src->count))
      return false;
   return true;
   }

?
>
> >> +       cold_loop = min_loop;
> >> +    }
> >> +  return cold_loop;
> >> +}
> >> +
> >>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
> >>     loop to that we could move the expression using DEF if it did not have
> >>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> >> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
> >>      level = ALWAYS_EXECUTED_IN (bb);
> >>    else
> >>      level = superloop_at_depth (loop, 1);
> >> -  lim_data->max_loop = level;
> >> +  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
> >> +  if (!lim_data->max_loop)
> >> +    return false;
> >>
> >>    if (gphi *phi = dyn_cast <gphi *> (stmt))
> >>      {
> >> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb)
> >>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
> >>      {
> >>        edge e;
> >> -
> >>        gimple *stmt = gsi_stmt (bsi);
> >>
> >>        lim_data = get_lim_data (stmt);
> >> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb)
> >>        /* We do not really want to move conditionals out of the loop; we just
> >>          placed it here to force its operands to be moved if necessary.  */
> >>        if (gimple_code (stmt) == GIMPLE_COND)
> >> -       continue;
> >> +       {
> >> +         gsi_next (&bsi);
> >> +         continue;
> >> +       }
> >>
> >>        if (dump_file && (dump_flags & TDF_DETAILS))
> >>         {
> >> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
> >>         }
> >>        else
> >>         {
> >> -         sm_aux *aux = *aux_map.get (ref);
> >> +         sm_aux **paux = aux_map.get (ref);
> >> +         sm_aux *aux;
> >> +         if (paux)
> >> +           aux = *paux;
> >> +         else
> >> +           continue;
> >
> > do you really need this?  I doubt so.
>
> Removed.
>
> >
> >>           if (!aux->store_flag || kind == sm_ord)
> >>             {
> >>               gassign *store;
> >> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
> >>    return indep_p;
> >>  }
> >>
> >> +class ref_in_loop_hot_body
> >> +{
> >> +public:
> >> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> >> +  bool operator () (mem_ref_loc *loc);
> >> +  class loop *l;
> >> +};
> >> +
> >> +bool
> >> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> >> +{
> >> +  basic_block curr_bb = gimple_bb (loc->stmt);
> >> +  edge e = loop_preheader_edge (l);
> >> +  if (e->src->count > curr_bb->count)
> >> +    return false;
> >> +  else
> >> +    return true;
> >> +}
> >> +
> >>
> >>  /* Returns true if we can perform store motion of REF from LOOP.  */
> >>
> >> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
> >>    if (!ref_indep_loop_p (loop, ref, sm_war))
> >>      return false;
> >>
> >
> > Add a comment here what this is about.
>
> Done.
>
> >
> > Otherwise the GIMPLE invariant motion parts look sensible, but I'd
> > really like to have
> > the issue on the profile_count API sorted out.
> >
> > Can you split out the RTL invariant motion part to a separate patch please?
>
> Done.  Attached the two patches, thanks.
>
>
> BR,
> Xionghu


More information about the Gcc-patches mailing list