[PATCH v2] Fix incomplete computation in fill_always_executed_in_1

Thu Aug 19 12:11:47 GMT 2021

On Tue, 17 Aug 2021, Xionghu Luo wrote:

> 
> 
> On 2021/8/17 15:12, Richard Biener wrote:
> > On Tue, 17 Aug 2021, Xionghu Luo wrote:
> > 
> >> Hi,
> >>
> >> On 2021/8/16 19:46, Richard Biener wrote:
> >>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote:
> >>>
> >>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
> >>>> nested loops.  inn_loop is updated to inner loop, so it need be restored
> >>>> when exiting from innermost loop. With this patch, the store instruction
> >>>> in outer loop could also be moved out of outer loop by store motion.
> >>>> Any comments?  Thanks.
> >>>
> >>>> gcc/ChangeLog:
> >>>>
> >>>>   * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
> >>>>   inn_loop when exiting from innermost loop.
> >>>>
> >>>> gcc/testsuite/ChangeLog:
> >>>>
> >>>> 	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >>>> ---
> >>>>    gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++
> >>>>    gcc/tree-ssa-loop-im.c                     |  6 +++++-
> >>>>    2 files changed, 29 insertions(+), 1 deletion(-)
> >>>>    create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>>>
> >>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>>> new file mode 100644
> >>>> index 00000000000..097a5ee4a4b
> >>>> --- /dev/null
> >>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>>> @@ -0,0 +1,24 @@
> >>>> +/* PR/101293 */
> >>>> +/* { dg-do compile } */
> >>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> >>>> +
> >>>> +struct X { int i; int j; int k;};
> >>>> +
> >>>> +void foo(struct X *x, int n, int l)
> >>>> +{
> >>>> +  for (int j = 0; j < l; j++)
> >>>> +    {
> >>>> +      for (int i = 0; i < n; ++i)
> >>>> +	{
> >>>> +	  int *p = &x->j;
> >>>> +	  int tem = *p;
> >>>> +	  x->j += tem * i;
> >>>> +	}
> >>>> +      int *r = &x->k;
> >>>> +      int tem2 = *r;
> >>>> +      x->k += tem2 * j;
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } }
> >>>> */
> >>>> +
> >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >>>> index b24bc64f2a7..5ca4738b20e 100644
> >>>> --- a/gcc/tree-ssa-loop-im.c
> >>>> +++ b/gcc/tree-ssa-loop-im.c
> >>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap
> >>>> @@ contains_call)
> >>>>       if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>>         last = bb;
> >>>>    +	  if (inn_loop != loop
> >>>> +	      && flow_loop_nested_p (bb->loop_father, inn_loop))
> >>>> +	    inn_loop = bb->loop_father;
> >>>> +
> >>>
> >>> The comment says
> >>>
> >>>                 /* In a loop that is always entered we may proceed anyway.
> >>>                    But record that we entered it and stop once we leave it.
> >>> */
> >>>                 inn_loop = bb->loop_father;
> >>>
> >>> and your change would defeat that early return, no?
> >>
> >> The issue is the search method exits too early when iterating the outer
> >> loop.  For example of a nested loop, loop 1 includes 5,8,3,10,4,9
> >> and loop2 includes 3,10.  Currently, it breaks when bb is 3 as bb 3
> >> doesn't dominate bb 9 of loop 1.  But actually, both bb 5 and bb 4 are
> >> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4
> >> they won't be processed by store motion again.
> >>
> >>
> >>      5<----
> >>      |\   |
> >>      8 \  9
> >>      |  \ |
> >> --->3--->4
> >> |    |
> >> 10---|
> >>
> >>
> >> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this
> >> patch, it will continue search when meet bb 3 until bb 4, then last is updated
> >> to bb 4, it will break until exit edge is found at bb 4 by
> >> "if (!flow_bb_inside_loop_p (loop, e->dest))".  Then the followed loop code
> >> will
> >> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5.
> >>
> >>
> >>       while (1)
> >> 	{
> >> 	  SET_ALWAYS_EXECUTED_IN (last, loop);
> >> 	  if (last == loop->header)
> >> 	    break;
> >> 	  last = get_immediate_dominator (CDI_DOMINATORS, last);
> >> 	}
> >>
> >> After further discussion with Kewen, we found that the inn_loop variable is
> >> totally useless and could be removed.
> >>
> >>
> >>>
> >>>>       if (bitmap_bit_p (contains_call, bb->index))
> >>>>         break;
> >>>>    
> >>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap
> >>>> @@ contains_call)
> >>>>    
> >>>>       if (bb->loop_father->header == bb)
> >>>>    	    {
> >>>> -	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>> +	      if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch,
> >>>> bb))
> >>>>      break;
> >>>
> >>> That's now a always false condition - a loops latch is always dominated
> >>> by its header.  The condition as written tries to verify whether the
> >>> loop is always entered - mind we visit all blocks, not only those
> >>> always executed.
> >>
> >> Thanks for the catch!  I am afraid the piece of code should be removed since
> >> it stops
> >> search of potential ALWAYS EXECUTED bb after inner loop...
> > 
> > But the code says:
> > 
> >              /* In a loop that is always entered we may proceed anyway.
> >                 But record that we entered it and stop once we leave it.
> >               */
> > 
> > and you do not remove this comment still it doesn't hold anymore
> > after your patch.  I don't say the current code is optimal - I just
> > say it does exactly what is documented.
> 
> Removed.
> 
> > 
> > 
> >>>
> >>> In fact for your testcase the x->j ref is _not_ always executed
> >>> since the inner loop is conditional on n > 0.
> >>
> >> Yes.  But I want to move x->k (not x->j) out of loop 1 when l > 0 in
> >> store-motion.
> > 
> > Sorry, that wasn't clear.  Yes, I agree the code fails to see always
> > executed blocks after inner loops.  But since the code simply walks
> > all blocks in a loop instead of greedily following edges it has
> > to do that since the inner loop might exit the outer loop as well,
> > sth your change wouldn't honor.  Consider
> 
> This is already handled now, if there is an edge exiting out of outer
> loop directly, it will also early break, I've verified it with your case.
> 
> 
> 	  FOR_EACH_EDGE (e, ei, bb->succs)
> 	    {
> 	      /* If there is an exit from this BB.  */
> 	      if (!flow_bb_inside_loop_p (loop, e->dest))
> 		break;
> > 
> >   while (--n)
> >    {
> >      do
> >        {
> >          if (do_exit)
> >            goto out;
> >        }
> >      while (1);
> >      p->x += 1;
> >    }
> > out:;
> > 
> > you'll see a CFG where the inner loop exits the outer loop as well.
> > 
> > So I'm saying to improve the code you'll likely have to do more.
> > And add a testcase like the following
> > 
> > void __attribute__((noipa))
> > foo (int n, int m, int f, int *p, int *q)
> > {
> >   while (--n)
> >    {
> >      do
> >        {
> >          *q += 1;
> >          if (f)
> >            goto out;
> >        }
> >      while (--m);
> >      *p += 1;
> >    }
> > out:;
> > }
> > 
> > int main()
> > {
> >    int i = 0;
> >    foo (10, 10, 1, (void *)0, &i);
> >    if (i != 1)
> >      __builtin_abort ();
> >    return 0;
> > }
> > 
> > Richard.
> > 
> 
> Updated the patch:
> 
> 
> [PATCH v2] Fix incomplete computation in fill_always_executed_in_1
> 
> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
> nested loops.  Current design will exit if an inner loop doesn't
> dominate outer loop's latch or exit after exiting from inner loop, which
> caused early return from outer loop, then ALWAYS EXECUTED blocks after
> inner loops is skipped.
> This patch removes the redundant inner loop check, but keep the check of
> exiting to out of outer loop from inner loop, then the store instruction
> in outer loop could also be moved out of outer loop by store motion.
> 
> gcc/ChangeLog:
> 
> 	* tree-ssa-loop-im.c (fill_always_executed_in_1): Remove
> 	inn_loop check.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++
>  gcc/tree-ssa-loop-im.c                     | 14 ----------
>  3 files changed, 54 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> 
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..097a5ee4a4b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,24 @@
> +/* PR/101293 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +struct X { int i; int j; int k;};
> +
> +void foo(struct X *x, int n, int l)
> +{
> +  for (int j = 0; j < l; j++)
> +    {
> +      for (int i = 0; i < n; ++i)
> +	{
> +	  int *p = &x->j;
> +	  int tem = *p;
> +	  x->j += tem * i;
> +	}
> +      int *r = &x->k;
> +      int tem2 = *r;
> +      x->k += tem2 * j;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> new file mode 100644
> index 00000000000..6e180de528e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> @@ -0,0 +1,30 @@
> +/* PR/101293 */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q)
> +{
> +  while (--n)
> +    {
> +      do
> +	{
> +	  *q += 1;
> +	  if (f)
> +	    goto out;
> +	}
> +      while (--m);
> +      *p += 1;
> +    }
> +out:;
> +}
> +
> +int main()
> +{
> +    int i = 0;
> +    foo (10, 10, 1, (void *) 0, &i);
> +    if (i != 1)
> +      __builtin_abort ();
> +    return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index b24bc64f2a7..c44f4390ce2 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>    basic_block bb = NULL, *bbs, last = NULL;
>    unsigned i;
>    edge e;
> -  class loop *inn_loop = loop;
>  
>    if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
>      {
> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>  	     to disprove this if possible).  */
>  	  if (bb->flags & BB_IRREDUCIBLE_LOOP)
>  	    break;
> -
> -	  if (!flow_bb_inside_loop_p (inn_loop, bb))
> -	    break;
> -
> -	  if (bb->loop_father->header == bb)
> -	    {
> -	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> -		break;
> -
> -	      /* In a loop that is always entered we may proceed anyway.
> -		 But record that we entered it and stop once we leave it.  */
> -	      inn_loop = bb->loop_father;
> -	    }
>  	}
>  
>        while (1)

I'm not sure this will work correct (I'm not sure how the existing
code makes it so either...).  That said, I can't poke any hole
into the change.  What I see is that definitely

          if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
            last = bb;

          if (bitmap_bit_p (contains_call, bb->index))
            break;

doesn't work reliably since the DOM ordering will process blocks
A B and C in random order for

  for (;;)
   {
      if (cond)
        {
          A: foo ();
        }
      else B:;
      C:;
   }

and thus we can end up setting 'last' to C _before_ processing
'A' and thus arriving at the call foo () ...

get_loop_body_in_dom_order does some "special sauce" but not
to address the above problem - but it might be that a subtle
issue like the above is the reason for the inner loop handling.
The inner loop block order does _not_ adhere to this "special sauce",
that is - the "Additionally, if a basic block s dominates
the latch, then only blocks dominated by s are be after it."
guarantee holds for the outer loop latch, not for the inner.

Digging into the history of fill_always_executed_in_1 doesn't
reveal anything - the inner loop handling has been present
since introduction by Zdenek - but usually Zdenek has a reason
for doing things as he does ;)

Note it might be simply a measure against quadratic complexity,
esp. since with your patch we also dive into not always executed
subloops as you remove the

              if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
                break;

check.  I suggest to evaluate behavior of the patch on a testcase
like

void foo (int n, int **k)
{
  for (int i = 0; i < n; ++i)
    if (k[0][i])
      for (int j = 0; j < n; ++j)
        if (k[1][j])
          for (int l = 0; l < n; ++l)
            if (k[2][l])
              ...
}

I suspect you'll see quadratic behavior with your patch.  You
should be at least able to preserve a check like

          /* Do not process not always executed subloops to avoid
             quadratic behavior.  */
          if (bb->loop_father->header == bb
              && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
            break;

which is of course not optimistic for cases like

  for (..)
   {
     if (cond)
       for (..)
     x = 1; // this is always executed if the inner loop is finite
   }

but we need to have an eye on the complexity of this function.  I would
have suggested to do greedy visiting of the loop header successors,
processing further blocks if all entries (ignoring backedges) are
processed, setting SET_ALWAYS_EXECUTED_IN.  When the worklist
is empty proceed to inner loops as the current code does.  For
bookkeeping simply keep a to-visit-incoming-edges counter per BB.
Pseudo-code:

  bitmap_set_bit (worklist, loop-header-bb);
  while (!bitmap_empty_p (worklist))
    {
      bb = pop (worklist);
      SET_ALWAYS_EXECUTED_IN (bb, loop);
      if (bitmap_bit_p (contains_call, bb->index))
        continue;
      FOR_EACH_EDGE (e, ei, bb->succs)
        {
          if (!flow_bb_inside_loop_p (loop, e->dest))
            continue;
          if (incoming_count[e->dest->index]-- == 0)
            push (worklist, e->dest);
        }
    }

  iterate over inner loops (incoming_count can be retained,
  we just force the inner loop header onto the worklist).

that would avoid the quadraticness with respect to
get_loop_body_in_dom_order and also fix the dominator issue
mentioned above.

Richard.