Bug 23855

Summary: loop header should also be pulled out of the inner loop too
Product: gcc Reporter: Richard Biener <rguenth>
Component: tree-optimizationAssignee: Richard Biener <rguenth>
Status: RESOLVED FIXED    
Severity: enhancement CC: dberlin, gcc-bugs, paulo, rakdver, schwab, xinliangli
Priority: P2 Keywords: missed-optimization, patch
Version: 4.1.0   
Target Milestone: 7.0   
URL: http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01424.html
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2009-04-22 22:50:24
Bug Depends on:    
Bug Blocks: 23970, 60042    
Attachments: Updated version of the patch, also handling invariant memory references
updated patch
updated patch

Description Richard Biener 2005-09-13 09:46:45 UTC
void bar(void);
void foo(int ie, int je)
{
  int i, j;
  for (j=0; j<je; ++j)
    for (i=0; i<ie; ++i)
      bar();
}

should _not_ be transformed to

foo (ie, je)
{ 
  int j;
  int i;
  
<bb 0>:
  if (je > 0) goto <L23>; else goto <L5>;
  
<L23>:;
  j = 0;
  goto <bb 3> (<L2>);

<L22>:;
  i = 0;

<L1>:;
  bar ();
  i = i + 1;
  if (ie != i) goto <L1>; else goto <L3>;

<L3>:;
  j = j + 1;
  if (je != j) goto <L2>; else goto <L5>;

<L2>:;
  if (ie > 0) goto <L22>; else goto <L3>;

<L5>:;
  return;

}

i.e. containing an loop-invariant check if (ie > 0).

Both DOM and copy-header do this transformation.  Disabling both
we get

;; Function foo (foo)
  
foo (ie, je)
{
  int j;
  int i;

<bb 0>:
  j = 0;
  goto <bb 4> (<L4>);

<L1>:;
  bar ();
  i = i + 1;

<L2>:;
  if (i < ie) goto <L1>; else goto <L3>;

<L3>:;
  j = j + 1;

<L4>:;
  if (j < je) goto <L8>; else goto <L5>;

<L8>:;
  i = 0;
  goto <bb 2> (<L2>);

<L5>:;
  return;

}

which is a _lot_ faster for small ie.  Optimally we would hoist the
loop invariant check out of the j loop.
Comment 1 Richard Biener 2005-09-13 13:12:16 UTC
Unswitching should clean this up, but unfortunately(?) only looks at innermost
loops.  For a reason it seems, just removing this checks results in an ICE.
Testcase for unswitching:

void bar(void);
void foo(int ie, int je)
{
  int j;
  int i;

  if (je <= 0)
    goto L5;

  j = 0;
L2:
  if (ie <= 0)
    goto L3;

  i = 0;
L1:
  bar ();
  i = i + 1;
  if (ie != i) goto L1;

L3:
  j = j + 1;
  if (je != j) goto L2;

L5:
  return;
}
Comment 2 Zdenek Dvorak 2005-09-13 13:29:51 UTC
It is not clear to me why you find the code without header copying better -- 
number of checks of each condition is exactly the same in both cases, and with 
right ordering of basic blocks, there should be one less jump executed. 
Depending on branch prediction on the target, either the original code or the 
code with copied headers may be faster, but it is hard to determine which one. 
 
Unswitching of non-innermost loops should not be hard to implement (most of 
code is in place, it only was quite some time that it was tested for anything 
but innermost loops, so there probably is some bitrot; and a few pieces on tree 
level may be missing). 
Comment 3 Richard Biener 2005-09-13 13:47:03 UTC
Well, it is the case that I have some numerical application that has
such loops and the case of small ie (1 or 2) does happen during boundary
updates, so instead of

  if (ie <= 0)
    return;
  for (j=0; j<je; ++j)
    i=0;
    do {
      foo();
      i=i+1;
    } while (i<ie);

and executing the if (ie <= 0) exactly one time, we execute it
je times, which is as often as half times the number of total iterations
of the inner loop body.  You are probably right that the execution number
without loop header copying is exactly the same, but unswitching this
condition certainly helps for small ie (where loop versioning with constant
ie and completely unrolled inner loop would help even more, of course).
Comment 4 Andrew Pinski 2005-09-13 14:01:50 UTC
On PPC, we get optimal and almost unswitched loop:
L4:
        ble- cr4,L7
        li r31,0
L6:
        addi r31,r31,1
        bl _bar
        cmpw cr7,r30,r31
        bne+ cr7,L6
L7:
        addi r29,r29,1
        cmpw cr7,r28,r29
        bne+ cr7,L4


cr4 is a callee save register.
Comment 5 Richard Biener 2005-09-13 14:09:24 UTC
Heh - you unswitched the comparison but not the jump!
Comment 6 Andrew Pinski 2006-01-15 20:08:39 UTC
What really should happen is that we should change the loop to be something like which helps a different bug too (I cannot find it right now but it has to do with pulling stuff out of the loops):
void bar(void);
void foo(int ie, int je)
{
  int i, j;
  if (0<je && 0 <ie)
  {
  j = 0;
  do {
  do {
   i  =0;
      bar(); 
    i++;
  } while (i < ie);
  j++;
  } while (j < ie);
}
Comment 7 Andrew Pinski 2006-01-15 20:12:40 UTC
Oh, that was PR 23970.

And I had meant:
void bar(void);
void foo(int ie, int je)
{
  int i, j;
  if (0<je && 0 <ie)
  {
  j = 0;
  do {
   i  =0;
  do {
      bar();
    i++;
  } while (i < ie);
  j++;
  } while (j < ie);
  }
}
Comment 8 Zdenek Dvorak 2006-01-15 20:15:37 UTC
Unswitching the condition (i < ie) (plus empty loop removal) would produce this code.  Nevertheless, this is a fairly common case, so perhaps it might make sense to special-case it in loop header copying (for performance reasons).
Comment 9 Zdenek Dvorak 2006-01-20 23:56:59 UTC
Patch:

http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01424.html
Comment 10 patchapp@dberlin.org 2006-02-28 18:15:51 UTC
Subject: Bug number PR23855

A patch for this bug has been added to the patch tracker.
The mailing list url for the patch is http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01424.html
Comment 11 Richard Biener 2006-04-07 11:34:56 UTC
I updated the patch for current mainline, but it still has issues for some common uses of loops:

void foo(int *ie, int *je, double *x)
{
  int i, j;
  for (j=0; j<*je; ++j)
    for (i=0; i<*ie; ++i)
      x[i+j] = 0.0;
}

After loop header copying we have

  if (*je > 0)
    for (j=0; j<*je; ++j)
      if (*ie > 0)
        for (i=0; i<*ie; ++i)
          x[i+j ] = 0.0;

note how in this form we see the condition *ie > 0 not invariant wrt the
outer loop (because it does a memory load), but if we would run load-PRE
between loop header copying and guard hoisting we would get

  pretmp.1 = *je;
  if (pretmp.1 > 0)
    pretmp.2 = *ie;
    for (j=0; j<pretmp.1; ++j)
      if (pretmp.2 > 0)
        for (i=0; i<pretmp.2; ++i)
          x[i+j] = 0.0;

which would enable us to hoist the guard out of the outer loop.
Comment 12 Richard Biener 2006-04-07 12:01:16 UTC
So what I now have is (ahh, a PROP_loops would be so nice...)

  pass_ch
  pass_split_crit_edges
  pass_pre
  [cfg_cleanup to re-merge critical edges]
  pass_hoist_guards

which works nice for this scenario, still without an extra load-PRE pass we have
memory loads that were only PREd out of the inner loop (due to the not hoisted
guard) inside the outer loop, even if they're now redundant.

Which makes me wonder, if PRE could do loop-header-FRE here?  Danny?  Has such
been done before?  Basically you have

do {
  if (*mem > 0)
    {
      i = 0;
      do {
      } while (*mem > i);
    }
} while (...);

where the (*mem > 0) is (full or partially?) redundant.
Comment 13 Zdenek Dvorak 2006-04-07 12:57:38 UTC
Subject: Re:  loop header should also be pulled out of the inner loop too

> I updated the patch for current mainline, but it still has issues for some
> common uses of loops:
> 
> void foo(int *ie, int *je, double *x)
> {
>   int i, j;
>   for (j=0; j<*je; ++j)
>     for (i=0; i<*ie; ++i)
>       x[i+j] = 0.0;
> }
> 
> After loop header copying we have
> 
>   if (*je > 0)
>     for (j=0; j<*je; ++j)
>       if (*ie > 0)
>         for (i=0; i<*ie; ++i)
>           x[i+j ] = 0.0;
> 
> note how in this form we see the condition *ie > 0 not invariant wrt the
> outer loop (because it does a memory load), but if we would run load-PRE
> between loop header copying and guard hoisting we would get
> 
>   pretmp.1 = *je;
>   if (pretmp.1 > 0)
>     pretmp.2 = *ie;
>     for (j=0; j<pretmp.1; ++j)
>       if (pretmp.2 > 0)
>         for (i=0; i<pretmp.2; ++i)
>           x[i+j] = 0.0;
> 
> which would enable us to hoist the guard out of the outer loop.

this is somewhat problematic; it would work for 2 nested loops, but not
for 3 or more (one would need to iterate guard hoisting & invariant
motion or PRE for that).  It would be possible to integrate guard hoisting
into the loop invariant motion pass, I may give it a try.
Comment 14 Zdenek Dvorak 2006-04-07 13:20:19 UTC
(In reply to comment #11)
> I updated the patch for current mainline, but it still has issues for some
> common uses of loops:
> 
> void foo(int *ie, int *je, double *x)
> {
>   int i, j;
>   for (j=0; j<*je; ++j)
>     for (i=0; i<*ie; ++i)
>       x[i+j] = 0.0;
> }
> 
> After loop header copying we have
> 
>   if (*je > 0)
>     for (j=0; j<*je; ++j)
>       if (*ie > 0)
>         for (i=0; i<*ie; ++i)
>           x[i+j ] = 0.0;
> 
> note how in this form we see the condition *ie > 0 not invariant wrt the
> outer loop (because it does a memory load), but if we would run load-PRE
> between loop header copying and guard hoisting we would get

actually, thinking about it again, it should suffice to teach invariant_without_guard_p about invariant memory loads, and this should just work.
Comment 15 Zdenek Dvorak 2006-04-09 23:51:37 UTC
(In reply to comment #14)
> (In reply to comment #11)
> > I updated the patch for current mainline, but it still has issues for some
> > common uses of loops:
> > 
> > void foo(int *ie, int *je, double *x)
> > {
> >   int i, j;
> >   for (j=0; j<*je; ++j)
> >     for (i=0; i<*ie; ++i)
> >       x[i+j] = 0.0;
> > }
> > 
> > After loop header copying we have
> > 
> >   if (*je > 0)
> >     for (j=0; j<*je; ++j)
> >       if (*ie > 0)
> >         for (i=0; i<*ie; ++i)
> >           x[i+j ] = 0.0;
> > 
> > note how in this form we see the condition *ie > 0 not invariant wrt the
> > outer loop (because it does a memory load), but if we would run load-PRE
> > between loop header copying and guard hoisting we would get
> 
> actually, thinking about it again, it should suffice to teach
> invariant_without_guard_p about invariant memory loads, and this should just
> work.

It basically does, the only other problem is that we are not able to determine that the outer loop is not infinite.
Comment 16 Zdenek Dvorak 2006-04-10 00:03:57 UTC
Created attachment 11231 [details]
Updated version of the patch, also handling invariant memory references
Comment 17 rguenther@suse.de 2006-04-10 08:10:59 UTC
Subject: Re:  loop header should also be pulled
 out of the inner loop too

On Mon, 9 Apr 2006, rakdver at gcc dot gnu dot org wrote:

> (In reply to comment #14)
> > (In reply to comment #11)
> > > I updated the patch for current mainline, but it still has issues for some
> > > common uses of loops:
> > > 
> > > void foo(int *ie, int *je, double *x)
> > > {
> > >   int i, j;
> > >   for (j=0; j<*je; ++j)
> > >     for (i=0; i<*ie; ++i)
> > >       x[i+j] = 0.0;
> > > }
> > > 
> > > After loop header copying we have
> > > 
> > >   if (*je > 0)
> > >     for (j=0; j<*je; ++j)
> > >       if (*ie > 0)
> > >         for (i=0; i<*ie; ++i)
> > >           x[i+j ] = 0.0;
> > > 
> > > note how in this form we see the condition *ie > 0 not invariant wrt the
> > > outer loop (because it does a memory load), but if we would run load-PRE
> > > between loop header copying and guard hoisting we would get
> > 
> > actually, thinking about it again, it should suffice to teach
> > invariant_without_guard_p about invariant memory loads, and this should just
> > work.
> 
> It basically does, the only other problem is that we are not able to determine
> that the outer loop is not infinite.

It that because we don't recognize the condition of the loop header?  
Otherwise we know that *je > 0 and the loop runs from 0 to *je (in fact,
I have loop determine the number of iterations of the loop to be *je).  Of
course this is with load-PRE moving the load before the loop-header and 
thus SCEV being presented with

   D.xxx = *je;
   if (D.xxx > 0)
     for (j=0; j<D.xxx; ++j)
      ...

which it just handles fine.  It looks like a chicken-and-egg problem ;)

Richard.
Comment 18 Zdenek Dvorak 2006-04-10 10:24:41 UTC
Subject: Re:  loop header should also be pulled out of the inner loop too

> > > actually, thinking about it again, it should suffice to teach
> > > invariant_without_guard_p about invariant memory loads, and this should just
> > > work.
> > 
> > It basically does, the only other problem is that we are not able to determine
> > that the outer loop is not infinite.
> 
> It that because we don't recognize the condition of the loop header?  

the problem is that we use # of iterations analysis, and it tries to
fully instantiate the scalar evolutions, which is impossible for *je.
Comment 19 Richard Biener 2006-04-10 11:56:01 UTC
So it is indeed chicken and egg ;)  load-PRE does not PRE the loads if the loop is not in do-while form, and we won't hoist the loop header copies until the loads are PREd.  As to comment #13, if we are able to "clean" the two innermost loops
of a nest that is already a good improvement.  Though from all the experiments
it looks like it is advisable to write do-while loops with hoisted loop header copies manually rather than for loops.
Comment 20 Richard Biener 2006-10-24 13:56:01 UTC
So, where do we want to go?
Comment 21 Zdenek Dvorak 2006-10-24 13:58:57 UTC
(In reply to comment #20)
> So, where do we want to go?
> 

Unless the basic patch is approved (which did not happen so far, despite of several pings), I do not know.  I will try to resend the patch, perhaps that will help.
Comment 22 Daniel Berlin 2006-10-24 14:23:34 UTC
(In reply to comment #19)
> So it is indeed chicken and egg ;)  load-PRE does not PRE the loads if the loop
> is not in do-while form, and we won't hoist the loop header copies until the
> loads are PREd.  As to comment #13, if we are able to "clean" the two innermost
> loops
> of a nest that is already a good improvement.  Though from all the experiments
> it looks like it is advisable to write do-while loops with hoisted loop header
> copies manually rather than for loops.
> 

So the PRE fixes i've got should help here, i think.
Maybe not.
:)

Comment 23 Andrew Pinski 2011-10-19 19:10:38 UTC
The other way to handle this is to allow unswitch loops to handle non inner loops which from what I can tell is a two line patch.
Comment 24 rguenther@suse.de 2011-10-20 07:43:44 UTC
On Wed, 19 Oct 2011, pinskia at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23855
> 
> --- Comment #23 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-19 19:10:38 UTC ---
> The other way to handle this is to allow unswitch loops to handle non inner
> loops which from what I can tell is a two line patch.

Though unswitching is too late for loop invariant motion to work here.
Comment 25 Richard Biener 2012-12-07 10:11:58 UTC
*** Bug 55187 has been marked as a duplicate of this bug. ***
Comment 26 Richard Biener 2014-02-05 14:13:02 UTC
Created attachment 32049 [details]
updated patch

Updated the patch to trunk.
Comment 27 Richard Biener 2014-02-05 14:14:35 UTC
Some more TLC to be applied before it's ready for prime time (and obviously testing).  Eventually integrating with LIM sounds better?
Comment 28 Richard Biener 2014-02-05 14:26:56 UTC
Created attachment 32051 [details]
updated patch

Slight update to apply all possible hoists to PR60042.
Comment 29 Andrew Pinski 2016-08-14 03:59:52 UTC
We unswitch on the outer loops now in GCC 6 and above.
So can we consider this as fixed?

The code for aarch64-linux-gnu on the trunk looks like:
.L3:
        mov     w19, 0
        .p2align 3
.L4:
        bl      bar
        add     w19, w19, 1
        cmp     w20, w19
        bne     .L4
        add     w21, w21, 1
        cmp     w22, w21
        bne     .L3
Comment 30 Richard Biener 2016-08-15 08:24:44 UTC
Well, unfortunately it doesn't fully work for

void bar(void);
void foo(int ie, int je, int ke)
{
  int i, j, k;
  for (k=0; k<ke; ++k)
    for (j=0; j<je; ++j)
      for (i=0; i<ie; ++i)
        bar();
}

it hoists the header check one level but it doesn't manage to hoist the
ie > 0 check all the way up out of the k loop.  I suppose iterating
find_loop_guard in some way would work.  I have a patch in testing.
Comment 31 Richard Biener 2016-08-17 08:12:04 UTC
Author: rguenth
Date: Wed Aug 17 08:11:32 2016
New Revision: 239523

URL: https://gcc.gnu.org/viewcvs?rev=239523&root=gcc&view=rev
Log:
2016-08-17  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/23855
	* tree-ssa-loop-unswitch.c: Include tree-ssa-loop-manip.h.
	(tree_unswitch_outer_loop): Iterate find_loop_guard as long as we
	find guards to hoist.  Do not update SSA form but rewrite virtuals
	into loop closed SSA.
	(find_loop_guard): Adjust to skip already hoisted guards.  Do
	not mark virtuals for renaming or update SSA form.

	* gcc.dg/loop-unswitch-2.c: Adjust.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/loop-unswitch-2.c
    trunk/gcc/tree-ssa-loop-unswitch.c
Comment 32 Richard Biener 2016-08-17 08:19:40 UTC
Fixed fully for GCC 7.
Comment 33 Andrew Pinski 2021-07-26 03:36:03 UTC
*** Bug 35344 has been marked as a duplicate of this bug. ***