[RFC] Don't move cold code out of loop by checking bb count

Richard Biener richard.guenther@gmail.com
Thu Nov 4 13:00:33 GMT 2021


On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2021/10/29 19:48, Richard Biener wrote:
> > I'm talking about the can_sm_ref_p call, in that context 'loop' will
> > be the outermost loop of
> > interest, and we are calling this for all stores in a loop.  We're doing
> >
> > +bool
> > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> > +{
> > +  basic_block curr_bb = gimple_bb (loc->stmt);
> > +  class loop *inner_loop = curr_bb->loop_father;
> > +  return find_coldest_out_loop (l, inner_loop, curr_bb);
> >
> > for each location the ref is accessed and the intent was to see
> > whether there's at least one
> > that we would like to move to 'loop'.  Indeed since we only know the
> > common outer loop
> > but not the inner we are hosting from there's not a single "coldest"
> > loop to cache and so
> > any caching we might want to perform could be applied to the other case as well.
> >
> > I suppose the most natural thing to cache is for each loop the outer loop where
> > its outer loop preheader would be hotter than the outer loops preheader so that
> >
> > +  while (outmost_loop != loop)
> > +    {
> > +      if (bb_colder_than_loop_preheader (loop_preheader_edge
> > (outmost_loop)->src,
> > +                                        loop_preheader_edge (cold_loop)->src))
> > +       cold_loop = outmost_loop;
> > +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> > +    }
> >
> > could be instead written as
> >
> >   coldest_loop = coldest_outermost_loop[loop->num];
> >   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> >     return outermost_loop;
> >   return coldest_loop;
> >
> > ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
> > It should be possible to compute such cache in a DFS walk of the loop tree
> > (the loop iterator by default visits in such order).
>
>
>
> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>
>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
>
> then function find_coldest_out_loop will return a loop NOT accord with our
> expectation, that should return second_coldest_loop instead of outermost_loop?

Hmm, interesting - yes.  I guess the common case will be that the pre-computed
outermost loop will be the loop at depth 1 since outer loops tend to
be colder than
inner loops?  That would then defeat the whole exercise.

To optimize the common case but not avoiding iteration in the cases we care
about we could instead cache the next outermost loop that is _not_ colder
than loop.  So for your [ ... ] example above we'd have
hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
candidate would then be 'second_coldest_loop' and we'd then iterate
to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
cold candidate we can compare against?  For the common case we'd
have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
simply pick 'outermost_loop'.

One comment on the patch itself below.

>
>
> Changes:
> 1. Add function fill_coldest_out_loop to pre compute the coldest
> outermost loop for each loop.
> 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
>
> v5 changes:
> 1. Refine comments for new functions.
> 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> to align with function name.
> 3. Refine with simpler implementation for get_coldest_out_loop and
> ref_in_loop_hot_body::operator for better understanding.
>
> v4 changes:
> 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> 3. Split RTL invariant motion part out.
> 4. Remove aux changes.
>
> v3 changes:
> 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
>
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likely for store motion, if all locations of the REF in loop is cold,
> don't do store motion of it.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
>
> gcc/ChangeLog:
>
>         * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
>         function.
>         (get_coldest_out_loop): New function.
>         (determine_max_movement): Use get_coldest_out_loop.
>         (move_computations_worker): Adjust and fix iteration udpate.
>         (class ref_in_loop_hot_body): New functor.
>         (ref_in_loop_hot_body::operator): New.
>         (can_sm_ref_p): Use for_all_locs_in_loop.
>         (fill_coldest_out_loop): New.
>         (loop_invariant_motion_in_fun): Call fill_coldest_out_loop.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-21.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> ---
>  gcc/tree-ssa-loop-im.c                     | 111 ++++++++++++++++++++-
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c |  20 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 +++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 ++++++
>  7 files changed, 251 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
>
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 4b187c2cdaf..d3390385fd9 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -146,6 +146,9 @@ public:
>  enum dep_kind { lim_raw, sm_war, sm_waw };
>  enum dep_state { dep_unknown, dep_independent, dep_dependent };
>
> +/* coldest outermost loop for given loop.  */
> +class loop **coldest_outermost_loop;
> +
>  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
>
>  static void
> @@ -417,6 +420,43 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
>
> +/* Compare the profile count inequality of bb and preheader, it is three-state
> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> +   decided.  */
> +bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
> +{
> +  gcc_assert (bb && preheader);
> +  return bb->count < preheader->count;
> +}
> +
> +/* Check coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
> +   It does two steps check:
> +   1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
> +   return NULL which means it should not be moved out at all;
> +   2)  CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> +   OUTMOST_LOOP.  */
> +
> +static class loop *
> +get_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                     basic_block curr_bb)
> +{
> +  gcc_assert (outmost_loop == loop || flow_loop_nested_p (outmost_loop, loop));
> +  class loop *cold_loop;
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
> +  if (curr_bb
> +      && bb_colder_than_loop_preheader (curr_bb,
> +                                       loop_preheader_edge (loop)->src))
> +    return NULL;
> +
> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> +  if (loop_depth (coldest_loop) < loop_depth (outmost_loop))
> +    return outmost_loop;
> +  return coldest_loop;
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -685,7 +725,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
>      level = superloop_at_depth (loop, 1);
> -  lim_data->max_loop = level;
> +  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
> +  if (!lim_data->max_loop)
> +    return false;
>
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>      {
> @@ -1221,7 +1263,10 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>          placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -       continue;
> +       {
> +         gsi_next (&bsi);
> +         continue;
> +       }
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         {
> @@ -2887,6 +2932,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
>    return indep_p;
>  }
>
> +class ref_in_loop_hot_body
> +{
> +public:
> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> +  bool operator () (mem_ref_loc *loc);
> +  class loop *l;
> +};
> +
> +/* Check the coldest loop between loop L and innermost loop.  If there is one
> +   cold loop between L and INNER_LOOP, store motion can be performed, otherwise
> +   no cold loop means no store motion.  get_coldest_out_loop also handles cases
> +   when l is inner_loop.  */
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  return get_coldest_out_loop (l, inner_loop, curr_bb);
> +}
> +
>
>  /* Returns true if we can perform store motion of REF from LOOP.  */
>
> @@ -2941,6 +3006,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
>    if (!ref_indep_loop_p (loop, ref, sm_war))
>      return false;
>
> +  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
> +    candidate's profile count is hot.  Statement in cold BB shouldn't be moved
> +    out of it's loop_father.  */
> +  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> +    return false;
> +
>    return true;
>  }
>
> @@ -3153,6 +3224,34 @@ fill_always_executed_in (void)
>      fill_always_executed_in_1 (loop, contains_call);
>  }
>
> +/* Find the coldest loop preheader from loop tree root to LOOP.  Set LOOP to
> +   cold_loop, then iteratively search loops from {L[outmost_loop],
> +   L[outmost_loop+1], ... L[loop]}, if L[i] is colder than L[cold_loop], reset
> +   cold_loop to L[i] until get the loop that has smallest profile_count.
> +   Then recursively set each inner loop.  */
> +
> +void
> +fill_coldest_out_loop (class loop *loop)
> +{
> +  class loop *outmost_loop = current_loops->tree_root->inner;

that should be superloop_at_depth (loop, 1), otherwise it's wrong
when the function has more than one loop at the outermost level.
I was also hoping to avoid this loop by passing down the current
coldest loop as the single one to compare against.

But with the above discussion things will look different anyway I guess.

> +  class loop *cold_loop = loop;
> +  while (outmost_loop != loop)
> +    {
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (outmost_loop)->src,
> +           loop_preheader_edge (cold_loop)->src))
> +       cold_loop = outmost_loop;
> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> +    }
> +
> +  coldest_outermost_loop[loop->num] = cold_loop;
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "loop %d's coldest outermost loop is %d\n",
> +                loop->num, cold_loop->num);
> +
> +  for (loop = loop->inner; loop; loop = loop->next)
> +    fill_coldest_out_loop (loop);
> +}
>
>  /* Compute the global information needed by the loop invariant motion pass.  */
>
> @@ -3237,6 +3336,8 @@ tree_ssa_lim_finalize (void)
>      free_affine_expand_cache (&memory_accesses.ttae_cache);
>
>    free (bb_loop_postorder);
> +
> +  free (coldest_outermost_loop);
>  }
>
>  /* Moves invariants from loops.  Only "expensive" invariants are moved out --
> @@ -3256,6 +3357,12 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
>    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
>    fill_always_executed_in ();
>
> +  /* Pre-compute coldest outermost loop of each loop.  */
> +  class loop *loop;
> +  coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun));
> +  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> +    fill_coldest_out_loop (loop);
> +
>    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>         F[0] += E / d;
>  }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> new file mode 100644
> index 00000000000..7326a230b3f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int k)
> +{
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       bar (k / 5, "one", "two");
> +      a[i] = k;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..51c1913d003
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int s, int t)
> +{
> +  int i;
> +  int j;
> +  int k;
> +
> +  for (i = 0; i < m; i++) // Loop 1
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++) // Loop 2
> +         for (k = 0; k < n; k++) // Loop 3
> +           {
> +             bar (s / 5, "one", "two");
> +             a[t] = s;
> +           }
> +      a[t] = t;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> new file mode 100644
> index 00000000000..bc60a040a70
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `count' is not hoisted out of loop when bb is cold.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  struct obj *next;
> +
> +} *q;
> +
> +void
> +func (int m)
> +{
> +  struct obj *p;
> +  for (int i = 0; i < m; i++)
> +    if (__builtin_expect (x, 0))
> +      count++;
> +
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> new file mode 100644
> index 00000000000..ffe6f8f699d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> @@ -0,0 +1,35 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop
> +   when it is in cold loop.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  int data1;
> +  struct obj *next;
> +};
> +
> +void
> +func (int m, int n, int k, struct obj *a)
> +{
> +  struct obj *q = a;
> +  for (int j = 0; j < m; j++)
> +    if (__builtin_expect (m, 0))
> +      for (int i = 0; i < m; i++)
> +       {
> +         if (__builtin_expect (x, 0))
> +           {
> +             count++;
> +             q->data += 3; /* Not hoisted out to inner loop. */
> +           }
> +         count += n;
> +         q->data1 += k; /* Not hoisted out to outer loop. */
> +       }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> new file mode 100644
> index 00000000000..16ba4ceb8ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> @@ -0,0 +1,32 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +volatile int y;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int s, int t)
> +{
> +  int i;
> +  int j;
> +  int k;
> +
> +  for (i = 0; i < m; i++) // Loop 1
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++) // Loop 2
> +         if (__builtin_expect (y, 0))
> +           for (k = 0; k < n; k++) // Loop 3
> +             {
> +               bar (s / 5, "one", "two");
> +               a[t] = s;
> +             }
> +      a[t] = t;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> +
> --
> 2.27.0.90.geebb51ba8c
>
>
>


More information about the Gcc-patches mailing list