19038 – [4.0 Regression] tree-ssa causing loops to have more than one BB

Bug 19038 - [4.0 Regression] tree-ssa causing loops to have more than one BB

Summary: [4.0 Regression] tree-ssa causing loops to have more than one BB

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.0.0

Importance:	P2 normal
Target Milestone:	4.0.0
Assignee:	Daniel Berlin

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	18048 21488
	Show dependency tree / graph

Reported:	2004-12-16 15:51 UTC by David Edelsohn
Modified:	2005-05-10 13:35 UTC (History)
CC List:	6 users (show)

See Also:
Host:	--*
Target:	--*
Build:	--*
Known to work:	3.3 3.4.0
Known to fail:	4.0.0
Last reconfirmed:	2005-01-22 01:50:52

Attachments
quick hack for tree-outof-ssa (929 bytes, patch) 2004-12-23 20:06 UTC, Andrew Pinski	Details \| Diff
Fix for dom propagating loop variant copies over invariant ones (364 bytes, patch) 2005-01-21 19:30 UTC, Daniel Berlin	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description David Edelsohn 2004-12-16 15:51:41 UTC

Loop header copying would allow some loops to be transformed from two or more 
basic blocks into a single basic block, allowing better scheduling of the 
instructions in the loop.  This would help SPEC CPU2000 sixtrack benchmark, 
for example.

Comment 1 David Edelsohn 2004-12-16 21:39:15 UTC

The focus of the problem is the inner loop of functio thin6d at line 572.  The 
function consumes 97.5% of the cycles and the inner loop consumes 48.5%.

Comment 2 Steven Bosscher 2004-12-16 22:55:20 UTC

The interesting thing is that the front end already does the header 
"copying" when it generates code: 
 
                              { 
                                int4 D.1164; 
 
                                D.1164 = nmz; 
                                k = 3; 
                                k.12 = k; 
 
                                if (k.12 <= D.1164) 
                                  { 
                                    <D1500>:; 
                                    { 
                                      ...loop body, straight-line code... 
                                      __label_000420:; 
                                      k.12 = k; 
                                      D.1168 = k.12 == D.1164; 
                                      k.12 = k; 
                                      D.1522 = k.12 + 1; 
                                      k = D.1522; 
                                      if (D.1168) 
                                        { 
                                          goto L.51; 
                                        } 
                                      else 
                                        { 
 
                                        } 
                                    } 
                                    goto <D1500>; 
                                  } 
                                else 
                                  { 
 
                                  } 
                                L.51:; 
                              }

Comment 3 Andrew Pinski 2004-12-17 04:04:46 UTC

If anyone wants the full benchmark I can attach it or sent it to them.

Comment 4 Andrew Pinski 2004-12-17 04:05:19 UTC

(In reply to comment #3)
> If anyone wants the full benchmark I can attach it or sent it to them.
That was not for this bug, woops.

Comment 5 Andrew Pinski 2004-12-17 04:16:44 UTC

I think the problem is the BBs are not being merged if the secondary BB has a user label.  If that is so,
what we should do is merge the BB and just move the labels.  I will try to this and attach the patch so 
someone can test it as I don't really have access to SPEC.

Comment 6 Andrew Pinski 2004-12-17 04:28:41 UTC

Could someone see if the patch attached to PR 19049 fixes the bug, I am going to submit it anyways 
since it helps so much fortran code in general and we can vectorize some more loops.

Comment 7 Andrew Pinski 2004-12-20 18:25:23 UTC

After getting access to sixtrack sources, steven and I found this is a problem in out of ssa committing 
on a new BB, I almost have a patch but it has one issue in that we could be the new statement's results 
in the control flow statement as we insert before it.

Comment 8 Andrew Pinski 2004-12-20 19:04:35 UTC

The reduced testcase (well heavely modified) looks like:
int
f (int temp2, int temp3, int xlvj_, int zlvj_, int yv1j_, int yv2j_)
{
  int temp1, temp5;

  for(;;) {
    temp1 = temp2*xlvj_ - temp3*zlvj_;
    temp3 = temp2*zlvj_ + temp3*xlvj_;
    temp5 = temp2;
    temp2 = temp1;
    if (temp3)
      break;
  }
  return yv1j_ * yv2j_;
}

Before the tree-ssa we would produce one BB when creating RTL but after out of ssa produces an extra 
BB to hold "  temp2 = temp1;"

Comment 9 Andrew Pinski 2004-12-20 19:23:03 UTC

This also causes a compile time issue as the RTL optimizers and the register allocator have to do more 
work.

Comment 10 Andrew Pinski 2004-12-20 19:47:08 UTC

(In reply to comment #8)
> The reduced testcase (well heavely modified) looks like:
Here is the resulting loop asm (on x86 where shows up worse than ppc which is the same because
the RA does its work better there):
.L7:
        movl    %edx, %ecx
.L3:
        movl    %ecx, %edx
        movl    %ebx, %eax
        imull   %edi, %edx
        imull   %esi, %eax
        imull   %esi, %ecx
        subl    %eax, %edx
        movl    %ebx, %eax
        movl    %ecx, %ebx
        imull   %edi, %eax
        addl    %eax, %ebx
        je      .L7

Compared to 3.4.0:
.L2:
        movl    16(%ebp), %ebx
        movl    %eax, %ecx
        movl    %esi, %edx
        imull   16(%ebp), %eax
        imull   %edi, %ecx
        imull   %esi, %ebx
        imull   %edi, %edx
        movl    %ebx, %esi
        subl    %ecx, %esi
        addl    %edx, %eax
        je      .L2

Comment 11 Steven Bosscher 2004-12-22 19:03:12 UTC

This is not a compile time hog, it's a performance regression. 
 
*Please*, I beg you, 
 
DON'T USE KEYWORDS LIKE REGRESSION AND HOG SO G** D*** OFTEN FOR 
TESTS THAT ARE NOT REGRESSIONS OR HOGS

Comment 12 Andrew Pinski 2004-12-23 02:41:18 UTC

Note I also see it for some simple fortran code see PR 14741.

Comment 13 David Edelsohn 2004-12-23 15:19:14 UTC

Fixing the problem at the RTL evel with loop header copying improves sixtrack 
performance 12.5%.

Comment 14 Steven Bosscher 2004-12-23 16:00:53 UTC

s/fixing/papering over/ in the previous comment ;-)

This appears to be a problem with loop-closed ssa form, but I don't
quite understand it yet.  But we should fix this on trees no matter
what, we don't need another RTL pass.

But 12.5%, wow!  Making this a "normal" bug (not `enhancement') and
upgrading to P1.  We really shoud fix this for GCC 4.0.  If not for
performance then for the just completely broken idea that we turn a
simple 1 BB loop into a complex multiple-BBs loop.  I'm quite sure
that this hurts a lot more than just SMS.  (Actually, that is what
that 12.5% shows, of course.)

Comment 15 Andrew Pinski 2004-12-23 20:06:32 UTC

Created attachment 7813 [details]
quick hack for tree-outof-ssa

This is a quick hack which I don't feel like a real solution but it does
improve all the cases which I can think of but still has an issue when we use
the copy in the conditional.  Also it might be correct as if we have a switch
which turns into a loop it could cause a problem (yes that does not happen
hardly at all) but it is still not the correct solution.

Comment 16 Andrew Pinski 2004-12-24 02:18:53 UTC

(In reply to comment #15)
> Created an attachment (id=7813)
> quick hack for tree-outof-ssa
Note this patch does not even allow to bootstrap.

Comment 17 mustafa 2004-12-24 12:13:44 UTC

Doing loop header copying in RTL works around this problem, doing so
gain us improvements of 3.9% for SPECINT on a G5 machine and a 66%
for gcc benchmark.

Comment 18 Steven Bosscher 2004-12-24 12:42:14 UTC

A smaller test case:

int
f (int temp2, int xlvj_)
{
  int temp1;
 
  for(;;) {
    temp1 = temp2*xlvj_;
    temp2 = temp1;
    if (temp1)
      break;
  }
  return xlvj_;
}
                                                                               
               

After out-of-ssa this gives the following tree dump:

;; Function f (f)
                                                                               
               
Analyzing Edge Insertions.
f (temp2, xlvj_)
{
  int temp1;
  int D.1471;
                                                                               
               
<L4>:;
                                                                               
               
<L0>:;
  temp1 = temp2 * xlvj_;
  if (temp1 != 0) goto <L2>; else goto <L7>;
                                                                               
               
<L7>:;
  temp2 = temp1;
  goto <bb 1> (<L0>);
                                                                               
               
<L2>:;
  return xlvj_;
                                                                               
               
}



As far as I can tell the following test case is semantically
equivalent to the one above:

int
f (int temp2, int xlvj_)
{
  for(;;) {
    temp2 = temp2*xlvj_;
    if (temp2)
      break;
  }
  return xlvj_;
}

But the tree dumps are definitely not the same, in the smaller
test case we don't have the extra copy:

;; Function f (f)
                                                                               
               
Analyzing Edge Insertions.
f (temp2, xlvj_)
{
  int D.1470;
                                                                               
               
<L4>:;
                                                                               
               
<L0>:;
  temp2 = temp2 * xlvj_;
  if (temp2 != 0) goto <L2>; else goto <L0>;
                                                                               
               
<L2>:;
  return xlvj_;
                                                                               
               
}

So apparently something is preventing temp1 and temp2 from being
coalesced when going out of ssa.

Comment 19 Steven Bosscher 2004-12-24 12:46:38 UTC

Side-by-side just before out-of-ssa:

f (temp2D.1464, xlvj_D.1465)                                    f (temp2D.1464,
xlvj_D.1465)
{                                                               {
  intD.0 D.1470;                                              |   intD.0
temp1D.1468;
                                                              >   intD.0 D.1471;
 
  # BLOCK 0                                                       # BLOCK 0
  # PRED: ENTRY [100.0%]  (fallthru,exec)                         # PRED: ENTRY
[100.0%]  (fallthru,exec)
<L4>:;                                                          <L4>:;
  # SUCC: 1 [100.0%]  (fallthru,exec)                             # SUCC: 1
[100.0%]  (fallthru,exec)
 
  # BLOCK 1                                                       # BLOCK 1
  # PRED: 0 [100.0%]  (fallthru,exec) 1 [89.0%]  (dfs_back,fa     # PRED: 0
[100.0%]  (fallthru,exec) 1 [89.0%]  (dfs_back,fa
  # temp2D.1464_1 = PHI <temp2D.1464_2(0), temp2D.1464_4(1)>; |   #
temp2D.1464_1 = PHI <temp2D.1464_2(0), temp1D.1468_4(1)>;
<L0>:;                                                          <L0>:;
  temp2D.1464_4 = temp2D.1464_1 * xlvj_D.1465_3;              |   temp1D.1468_4
= temp2D.1464_1 * xlvj_D.1465_3;
  if (temp2D.1464_4 != 0) goto <L2>; else goto <L0>;          |   if
(temp1D.1468_4 != 0) goto <L2>; else goto <L0>;
  # SUCC: 2 [11.0%]  (loop_exit,true,exec) 1 [89.0%]  (dfs_ba     # SUCC: 2
[11.0%]  (loop_exit,true,exec) 1 [89.0%]  (dfs_ba
 
  # BLOCK 2                                                       # BLOCK 2
  # PRED: 1 [11.0%]  (loop_exit,true,exec)                        # PRED: 1
[11.0%]  (loop_exit,true,exec)
<L2>:;                                                          <L2>:;
  return xlvj_D.1465_3;                                           return
xlvj_D.1465_3;
  # SUCC: EXIT [100.0%]                                           # SUCC: EXIT
[100.0%]
 
}                                                               }

Comment 20 Steven Bosscher 2004-12-24 12:52:32 UTC

Silly bugzilla...  Sorry about that.

A piece of unidiff, then.  (-) is the smaller test that works, (+) is the
test where we get the extra copy.  What I wanted to show is that only the
root variable is different in the PHI argument:

@@ -28,10 +29,10 @@
  
   # BLOCK 1
   # PRED: 0 [100.0%]  (fallthru,exec) 1 [89.0%]  (dfs_back,false,exec)
-  # temp2D.1464_1 = PHI <temp2D.1464_2(0), temp2D.1464_4(1)>;
+  # temp2D.1464_1 = PHI <temp2D.1464_2(0), temp1D.1468_4(1)>;
 <L0>:;
-  temp2D.1464_4 = temp2D.1464_1 * xlvj_D.1465_3;
-  if (temp2D.1464_4 != 0) goto <L2>; else goto <L0>;
+  temp1D.1468_4 = temp2D.1464_1 * xlvj_D.1465_3;
+  if (temp1D.1468_4 != 0) goto <L2>; else goto <L0>;
   # SUCC: 2 [11.0%]  (loop_exit,true,exec) 1 [89.0%]  (dfs_back,false,exec)
  
   # BLOCK 2
@@ -48,45 +49,45 @@


This comes from DOM copy-proping around the loop.  Before DOM1 we have
a function that looks like this (.dce1):
;; Function f (f)
                                                                               
                                                             
f (temp2, xlvj_)
{
  int temp1;
  int D.1471;
                                                                               
                                                             
  # temp2_1 = PHI <temp2_2(-1), temp2_5(0)>;
<L0>:;
  temp1_4 = temp2_1 * xlvj__3;
  temp2_5 = temp1_4;
  if (temp1_4 != 0) goto <L2>; else goto <L0>;
                                                                               
                                                             
<L2>:;
  xlvj__6 = xlvj__3;
  return xlvj__6;
                                                                               
                                                             
}
                                                                               
                                                             

After DOM1 (the .dom1 dump) we have this:
;; Function f (f)
                                                                               
                                                             
f (temp2, xlvj_)
{
  int temp1;
  int D.1471;
                                                                               
                                                             
  # temp2_1 = PHI <temp2_2(-1), temp1_4(0)>;
<L0>:;
  temp1_4 = temp2_1 * xlvj__3;
  temp2_5 = temp1_4;
  if (temp1_4 != 0) goto <L2>; else goto <L0>;
                                                                               
                                                             
<L2>:;
  xlvj__6 = xlvj__3;
  return xlvj__3;
                                                                               
                                                             
}
                                                                               
                                                             
We never recover from this, we don't coalesce, and there's your copy.

Comment 21 Steven Bosscher 2004-12-24 12:55:14 UTC

Possible work-arounds/hacks:
- Don't copy-prop into a PHI argument when the root variable of the
  PHI result is the same as the root variable of the copy that is
  being replaced.  Seems silly that this would be necessary.
- Try to insert copies before conditional jumps if the jump condition
  does not depend on anything in the copy.
- Figure out why we don't coalesce temp1 and temp2.  Can we coalesce
  them?  If so, why are we failing to do it?

Comment 22 Daniel Berlin 2004-12-26 01:09:45 UTC

Just to followup on irc discussions.
Being able to coalesce this may require moving to a different out-of-ssa
algorithm that is better at coalescing.
All of the below take different approaches to the coalescing problem, however,
they may provide some insight as to whether/how these two variables can be
coalesced.

The briggs one (which is what morgan is based on, which is what the comments
claim is used) is known not to coalesce a significant number of variables.

A comparison of the two major algorithms for out-of-ssa (briggs and sreedhar's)
can be found at
http://www.is.titech.ac.jp/~sassa/papers-written/sassa-kohama-ito-ipsi04.pdf

Sreedhar's removes roughly 80% more copies than brigg's algorithm.
Note that the authors propose their own algorithm, even though it does
significantly worse than sreedhars, so i don't consider it a serious contender here.

Sreedhar's can be found at http://tinyurl.com/5axtl
i have a copy of the paper if you have no access.

In addition, yet another different approach to out-of-ssa coalescing was
presented at cgo 2004.
http://www.cgo.org/cgo2004/papers/21_14_rastello_f_revised.pdf

I mention it because it supposedly can coalesce even more than sreedhar's
algorithm is able to.

I point these out only as references to hopefully be able coalesce these variables.
I don't believe totally rewriting out-of-ssa would be apropos for 4.0 unless the
new algorithm is significantly simpler, etc (maybe one of the above would be, i
don't know).

Comment 23 Andrew Pinski 2004-12-26 15:06:04 UTC

This will happen with any "DO LABEL VAR=START,END" loop in fortran.

Comment 24 Jeffrey A. Law 2004-12-27 21:08:12 UTC

Trying to solve this problem by avoiding copy-propagation in special cases
is definitely not the right approach.

Any time we do a copy propagation or redundancy elimination we run the risk
of possibly creating overlapping lifetimes for objects which did not previously
overlap.

Our focus needs to be on figuring out why we did not coalesce everything to
a single variable.

Comment 25 Jeffrey A. Law 2004-12-28 08:22:17 UTC

A few more notes.

I would suggest starting the testcase in comment #8.  While the simpler
testcase shows the underlying issue, it's simplified so far that the "problem"
could be fixed in various ways which totally avoid the copy, but which do not
help the larger testcases.

If we go back to the testcase in comment #8 and analyze the code closely we
will see that a copy is inevitable.  If we copy propagate into the PHI, then
the copy gets inserted into its own block.  If we do not copy propagate
into the PHI, then the copy sits just before the loop exit test.  Having the
copy before the loop exit test rather than splitting the loop backedge is
better for a number of reasons.

So the issue at hand is not an extraneous copy, but instead the *placement*
of that copy.

We queue the copy to be inserted on the backedge of the loop, which is clearly
a correct insertion point.  However, if we use life analysis we can determine
that we can place the copy before the loop exit test.  The out-of-ssa code
computes object lifetimes, so, in theory we have the information we need.

What we'd need to do is enhance the copy placement code to be a little smarter
when we want to insert a copy on a critical edge.  Given a copy dest = src 
that we want to place on a critical edge E, we'd want to check if dest is
live on any of the other edges in e->src->succs -- if dest is not live on
any of those edges, then we can place the copy at the end of e->src.  Or
something along those lines.

I have no idea how hard that would be to implement within the current
out-of-ssa framework we have in place.

It's worth nothing that copy placement on loop backedges for out-of-ssa has
been an area that has been problematical before.  I ran into a number of
regressions in the RTL SSA code with poor placements (mostly in ways that
confused the old old rtl loop optimizers).  So it's not a _huge_ surprise
that we're seeing some issues in the same areas with the tree out-of-ssa code.

Comment 26 Jeffrey A. Law 2004-12-28 09:17:55 UTC

One final set of notes before I sign off for the night.

Daniel -- thanks for the pointers to the new papers.  I've studied Sreedhar's
algorith in the past and thought it looked promising -- it's good to see 
some hard #s comparing it to Briggs.  I'm not convinced Sreedhar's is
simpler to implement, but it does look like it's worth some effort.

I'm not yet sure if Sreedhar's algorithm will result in a better placement
for the copy in this testcase or not.  I'll have to re-review the algorithm.

Comment 27 Jeffrey A. Law 2004-12-28 23:31:57 UTC

Yea, the code I've got lying around here to undo unprofitable const/copy
propagations definitely helps this code.   And better yet, we don't need
the full-blown version (which is queued for 4.1).  We can get by with the
very simplistic version.   For the testcase in comment #8 we generate the
following .optimized dump:

  # BLOCK 0
  # PRED: ENTRY [100.0%]  (fallthru,exec)
<L4>:;
  # SUCC: 1 [100.0%]  (fallthru,exec)

  # BLOCK 1
  # PRED: 0 [100.0%]  (fallthru,exec) 1 [89.0%]  (dfs_back,false,exec)
<L0>:;
  temp1 = temp2 * xlvj_ - temp3 * zlvj_;
  temp3 = temp2 * zlvj_ + temp3 * xlvj_;
  temp2 = temp1;
  if (temp3 != 0) goto <L2>; else goto <L0>;
  # SUCC: 2 [11.0%]  (loop_exit,true,exec) 1 [89.0%]  (dfs_back,false,exec)

  # BLOCK 2
  # PRED: 1 [11.0%]  (loop_exit,true,exec)
<L2>:;
  return yv1j_ * yv2j_;
  # SUCC: EXIT [100.0%]

}

Comment 28 Steven Bosscher 2004-12-29 19:33:28 UTC

http://gcc.gnu.org/ml/gcc-patches/2004-12/msg02063.html

Comment 29 Andrew Pinski 2004-12-29 20:50:10 UTC

The reference patch does not fix the problem all the time, see PR 14741 for a testcase where the patch 
fails, and it fails still for sixtrack but I should note that this would be now caused by IV-OPTs.

Comment 30 Andrew Pinski 2004-12-29 20:52:04 UTC

(In reply to comment #29)
> The reference patch does not fix the problem all the time, see PR 14741 for a testcase where the 
patch 
> fails, and it fails still for sixtrack but I should note that this would be now caused by IV-OPTs.

For that testcase:
without IV-OPTS:
L33:
        lfd f13,0(r9)
        add r9,r9,r8
        lfd f0,0(r11)
        addi r11,r11,8
        fmadd f0,f13,f0,f12
        fmr f12,f0
        bdnz L33

With:
L43:
        fmr f12,f0
L33:
        lfd f13,0(r9)
        add r9,r9,r0
        lfd f0,0(r11)
        addi r11,r11,8
        fmadd f0,f13,f0,f12
        bdnz L43
Notice how there are two BB for this loop.

Comment 31 Jeffrey A. Law 2004-12-29 21:06:00 UTC

Subject: Re:  [4.0 Regression] out-of ssa
	causing loops to have more than one BB

On Wed, 2004-12-29 at 20:50 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-12-29 20:50 -------
> The reference patch does not fix the problem all the time, see PR 14741 for a testcase where the patch 
> fails, and it fails still for sixtrack but I should note that this would be now caused by IV-OPTs.
I don't expect to fix it all the time.  There's a fundamental problem in that
copy propagation can lengthen lifetimes and thus require a copy at
out-of-ssa time.  The patch is meant to fix those cases which can be
trivially detected and for which we can generate better code.

Changing the out-of-ssa code to use Sreedhar's algorithm (which appears
to do a lot better at avoiding copies) isn't something I'd advocate
for 4.1.  But even with a better out-of-ssa algorithm there are still
going to be cases where out-of-ssa is going to have to generate a
copy.  No way around it.

Now if you think that PR is a trivial case that should be caught, then,
show me why and I'll take a closer look.

jeff

Comment 32 Andrew Pinski 2004-12-30 17:27:36 UTC

(In reply to comment #31)
> Subject: Re:  [4.0 Regression] out-of ssa
>         causing loops to have more than one BB

> Now if you think that PR is a trivial case that should be caught, then,
> show me why and I'll take a closer look.

The reason why it is not caught is because we don't cleanup the cfg while doing the loop optimizations, 
this has been fixed already on the tcb.

Oh, by the way I see that sixtrack has regressed on x86 now with your patch applied, I think this is 
because we still have the same problem as before as ivopts puts the new instruction in an empty BB
which becomes from not cleaning up the cfg.

Comment 33 Jeffrey A. Law 2004-12-30 18:39:43 UTC

Subject: Re:  [4.0 Regression] out-of ssa
	causing loops to have more than one BB

On Thu, 2004-12-30 at 17:27 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-12-30 17:27 -------
> (In reply to comment #31)
> > Subject: Re:  [4.0 Regression] out-of ssa
> >         causing loops to have more than one BB
> 
> > Now if you think that PR is a trivial case that should be caught, then,
> > show me why and I'll take a closer look.
> 
> The reason why it is not caught is because we don't cleanup the cfg while doing the
> loop optimizations, this has been fixed already on the tcb.
Can you be more precise how cleaning up the CFG during the loop
optimizer affects the code that we see during out-of-ssa.  Specifically
how does it affect PHI arguments on backedges and the proper marking
of backedges in the CFG? 

> 
> Oh, by the way I see that sixtrack has regressed on x86 now with your patch applied, I think this is 
> because we still have the same problem as before as ivopts puts the new instruction in an empty BB
> which becomes from not cleaning up the cfg.
Again, more information please on how this affects us during out-of-ssa?

I'm happy to look into these problems, but you've apparently got a lot
more state on them than I do.  I'd like to learn what you already know
to speed up that process.

jeff

Comment 34 Andrew Pinski 2004-12-30 18:49:16 UTC

(In reply to comment #33)
> Subject: Re:  [4.0 Regression] out-of ssa
>         causing loops to have more than one BB
> > > > Now if you think that PR is a trivial case that should be caught, then,
> > > show me why and I'll take a closer look.
> > 
> > The reason why it is not caught is because we don't cleanup the cfg while doing the
> > loop optimizations, this has been fixed already on the tcb.
> Can you be more precise how cleaning up the CFG during the loop
> optimizer affects the code that we see during out-of-ssa.  Specifically
> how does it affect PHI arguments on backedges and the proper marking
> of backedges in the CFG? 

It has nothing to do with out-of-ssa any more, sorry for not being clear but here are the chain
of events for the current problem.

We split the critial edges so when we try to create the IV (in iv-opts), we insert an instruction on the bb 
which is the empty and otherwise useless but if we had cleaned up the CFG before running IV-OPTS, we
would insert it right before the condition just like your patch does for out of ssa.

And now we have this extra bb that is not useless, out of ssa is going to use instead of using your new 
scheme.

Comment 35 Andrew Pinski 2004-12-30 21:50:59 UTC

(In reply to comment #34)
Actually it is because we are placing statements in the loop's latch.
It is done in create_new_iv.

Comment 36 Jeffrey A. Law 2005-01-04 04:04:36 UTC

Subject: Re:  [4.0 Regression] out-of ssa
	causing loops to have more than one BB

On Thu, 2004-12-30 at 21:51 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-12-30 21:50 -------
> (In reply to comment #34)
> Actually it is because we are placing statements in the loop's latch.
> It is done in create_new_iv.
So just to be clear, it's not my change that is causing a regression.
The IV code creates situations which prevent my change from having
any kind of impact because the loop backedge is already split by
the IV code (and thus the loop backedge is no longer critical and
my out-of-ssa code does nothing).

Right?

Jeff

ps.  It seems to me that the IV code could use a trick similar to what I
did to the out-of-ssa code.  The only significant complication would be
that the IV code would have to verify that the code it wants to insert
is safe on both path (loop backedge and loop exit).

Comment 37 Andrew Pinski 2005-01-04 04:10:20 UTC

(In reply to comment #36)
> Subject: Re:  [4.0 Regression] out-of ssa
>         causing loops to have more than one BB> So just to be clear, it's not my change that is causing 
a regression.
> The IV code creates situations which prevent my change from having
> any kind of impact because the loop backedge is already split by
> the IV code (and thus the loop backedge is no longer critical and
> my out-of-ssa code does nothing).
> 
> Right?

Yes.

Comment 38 Andrew Macleod 2005-01-12 13:34:23 UTC

So let me see, If I read this properly, this is no longer an out of ssa problem?

notes:
- changing where we place copies based on jeffs observations in comment 25 ought
not be difficult.
- If we wish to change the coalesce algorithm, we should be able to simply plug
a  new one in. The existing one was just a quick straightforward implementation
to get it working with the full intention of improving it later. Its also
reasonably flexible and cost driven.  tree-outof-ssa.c::coalesce_ssa_name() does
all the setup work. A list of possible coalesces is created, given costs and the
sorted. Coalesces are then performed by calling
tree-ssa-live::coalesce_tpa_members() (tpa's are generic lists which associate
trees, ie VAR trees to a partition index).  The routine
tree-ssa-live.c::pop_best_coalesce() deterimes the order to attempt coalesces. 
currently this simply retreives them in sorted ordered by "cost". cost is
calculated during the building of the coalesce list via calls to
tree-ssa-live.c::add_coalesce(), and is a cost value passed to the routine.

The current implementation simply calls add_coalesce with a value of 1 all the
time, so it ends up being simply a copy count.. ie, we try to coalesce variables
which are copied between each other 4 times before we try those copied once. The
original intention was to multiple each copy by 10^nest_level_of_copy, but that
info wasnt easily available when I implemented it. I beleive it is now, and just
haven't gotten around to doing it. It would be trivial to add that to the calls
to add_coalesce.  

We could increase the cost again if its going to insert a copy on a latch edge
quite easily. (maybe 20^nest_level instead or some such thing) Making those even
more important.

add_coalesce should probably be changed to avoid wraparound as well I suppose.

Comment 39 Andrew Pinski 2005-01-18 00:50:11 UTC

Changing the summary to reflect the current problem.

Comment 40 Jeffrey A. Law 2005-01-18 05:02:42 UTC

Subject: Re:  [4.0 Regression] out-of ssa
	causing loops to have more than one BB

On Wed, 2005-01-12 at 13:34 +0000, amacleod at redhat dot com wrote:
> ------- Additional Comments From amacleod at redhat dot com  2005-01-12 13:34 -------
> So let me see, If I read this properly, this is no longer an out of ssa problem?
> 
> notes:
> - changing where we place copies based on jeffs observations in comment 25 ought
> not be difficult.
Good to know.  Though I doubt it's really necessary.

> - If we wish to change the coalesce algorithm, we should be able to simply plug
> a  new one in.
Maybe, maybe not.

The quick and dirty synopsis of Sreedhar's approach is to rewrite PHIs
so that none of the args conflict with each other or the PHI result.
Once that's done you coalesce the PHI args & result together, which you
know is always safe.  Coupled with that idea is some improvements in
when/where copies are inserted (which have certain parallels to
the ideas in comment #25.

Contrast that to what we do, which is more like coalescing in a register
allocator.  If after coalescing we find that the PHI result and one or
more of its arguments are in different partitions, then we insert a
copy on the edge associated with the PHI argument.

So the algorithms differ first in when they insert copies, where the
copies are inserted and to some extent how they determine when copies
are needed.

[ There's another approach from the Rice team which is similar to
  Sreedhar, but from what I could tell does not perform particularly
  well. ]

If I were looking for incremental improvements to our exiting
out-of-ssa, I would look at bumping the cost of copies associated
with backedges.  Sreedhar's approach isn't really an incremental
improvement -- it's more like a pre-pass for out-of-ssa which
obliviates the need to do copy coalescing during out-of-ssa.

jeff

Comment 41 GCC Commits 2005-01-19 22:50:14 UTC

Subject: Bug 19038

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	rakdver@gcc.gnu.org	2005-01-19 22:50:06

Modified files:
	gcc            : ChangeLog tree-ssa-loop-ivopts.c 

Log message:
	PR tree-optimization/19038
	* tree-ssa-loop-ivopts.c (allow_ip_end_pos_p): New function.
	(add_candidate): Add ivs with increment in latch only if
	allow_ip_end_pos_p is true.
	(determine_iv_cost): Use empty_block_p.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7191&r2=2.7192
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-loop-ivopts.c.diff?cvsroot=gcc&r1=2.41&r2=2.42

Comment 42 Andrew Pinski 2005-01-19 23:25:13 UTC

Now it is back to out of ssa causing one extra BB.

Comment 43 Andrew Pinski 2005-01-19 23:33:59 UTC

(In reply to comment #42)
> Now it is back to out of ssa causing one extra BB.
I will try get a smaller testcase for this problem now.  Thanks Zdenek for fixing the IV-OPTs related 
problem.

Comment 44 Andrew Pinski 2005-01-20 00:31:44 UTC

(In reply to comment #43)
> (In reply to comment #42)
> > Now it is back to out of ssa causing one extra BB.
> I will try get a smaller testcase for this problem now.  Thanks Zdenek for fixing the IV-OPTs related 
> problem.

Here is the reduced fortran testcase (I tried to convert it to C but I must not be matching all the syntax 
correctly) and change the names so I would not get into trouble with SPEC:
      subroutine function11(max)
      parameter(maxindex=64)
      common array1(maxindex),array2(maxindex)
      save
      do 20 j=1,max
	array1(j)=array2(j)
   20   continue
      return
      end

Comment 45 Andrew Pinski 2005-01-20 00:46:51 UTC

Here is another reduced testcase:
      subroutine thin6d(max)
      parameter(maxindex=64)
      common array1(maxindex)
      save
      do 20 j=1,max
	array1(j)=1.0
   20   continue
      return
      end


but note disabling ivopts "fixes" the problem but it is not caused by ivopts per say this time:
Before out of SSA:
  # ivtmp.6_21 = PHI <ivtmp.6_24(1), ivtmp.6_22(3)>;
  # ivtmp.3_12 = PHI <0(1), ivtmp.3_20(3)>;
<L7>:;
  [thin6d.f : 6] *ivtmp.6_21 = 1.0e+0;

__label_000020:;
  ivtmp.3_20 = ivtmp.3_12 + 1;
  ivtmp.6_22 = ivtmp.6_21 + 4B;
  [thin6d.f : 1] D.493_30 = D.451_4 - 1;
  [thin6d.f : 1] D.494_31 = (<unnamed type>) D.493_30;
  [thin6d.f : 1] D.495_32 = D.494_31 + 1;
  [thin6d.f : 5] if (ivtmp.3_20 == D.495_32) [thin6d.f : 5] goto <L11>; else goto <L7>;

Note this is looks very normal and like it should work correctly (except for the ((cast)(a-1))+1 but I 
know Zdenek has a patch for that).

But for some reason (have not looked actually) we add an extra BB in out of SSA:
<L7>:;
  [thin6d.f : 6] *ivtmp.6 = 1.0e+0;

__label_000020:;
  ivtmp.10 = ivtmp.3 + 1;
  ivtmp.6 = ivtmp.6 + 4B;
  [thin6d.f : 5] if (ivtmp.10 == (<unnamed type>) (D.451 - 1) + 1) [thin6d.f : 5] goto <L11>; else goto 
<L13>;

<L13>:;
  ivtmp.3 = ivtmp.10;
  goto <bb 2> (<L7>);

Looks like we are not coalescing the following PHI:
  # ivtmp.3_12 = PHI <0(1), ivtmp.3_20(3)>;

But why not.

Comment 46 Andrew Pinski 2005-01-20 01:15:47 UTC

(In reply to comment #45)
From -details .optimized file:
> Coalesce list: (0)ivtmp.3_12 & (1)ivtmp.3_20 [map: 0, 1] : Fail due to conflict
Huh, how do they conflict, I don't see how.

Hmm, I assume that tree-ssa-live is wrong here.

Comment 47 Steven Bosscher 2005-01-21 13:18:30 UTC

No, tree-ssa-live is quite right.  ivtmp.3_12 and ivtmp.3_20 are defined 
and live at some common statements, so they conflict: 
 
  # BLOCK 1 
  # PRED: 0 [88.4%]  (true,exec) 1 [89.0%]  (dfs_back,false,exec) 
  # ivtmp.3_12 = PHI <0(0), ivtmp.3_20(1)>; 
<L8>:; 
  D.501_28 = &__BLNK__.array1[ivtmp.3_12]; 
  #   __BLNK___11 = V_MAY_DEF <__BLNK___2>; 
  *D.501_28 = 1.0e+0; 
  ivtmp.3_20 = ivtmp.3_12 + 1; 
  D.503_30 = D.457_4 - 1; 
  D.504_31 = (<unnamed type>) D.503_30; 
  D.505_32 = D.504_31 + 1; 
  if (ivtmp.3_20 == D.505_32) goto <L12>; else goto <L8>; 
  # SUCC: 2 [11.0%]  (loop_exit,true,exec) 1 [89.0%]  (dfs_back,false,exec) 
 
  # BLOCK 2 
  # PRED: 1 [11.0%]  (loop_exit,true,exec) 
<L12>:; 
  D.497_22 = ivtmp.3_12 + 2; 
  lsm_tmp.2_23 = (int4) D.497_22; 
  #   j_7 = V_MUST_DEF <j_6>; 
  j = lsm_tmp.2_23; 
  # SUCC: 3 [100.0%]  (fallthru) 
 
Between "ivtmp.3_20 = ivtmp.3_12 + 1;" and "D.497_22 = ivtmp.3_12 + 2;" 
both versions of ivtmp.3 are live.  I have not looked at what causes this, 
but IMOH all passes should avoid letting induction variables escape a loop. 
 
Interesting note: I think "D.497_22 = ivtmp.3_12 + 2;" could be replaced 
with "D.497_22 = ivtmp.3_20 + 1;" which would fix this problem.

Comment 48 Zdenek Dvorak 2005-01-21 13:22:28 UTC

Subject: Re:  [4.0 Regression] out of ssa causing loops to have more than one BB

> No, tree-ssa-live is quite right.  ivtmp.3_12 and ivtmp.3_20 are defined 
> and live at some common statements, so they conflict: 
>  
>   # BLOCK 1 
>   # PRED: 0 [88.4%]  (true,exec) 1 [89.0%]  (dfs_back,false,exec) 
>   # ivtmp.3_12 = PHI <0(0), ivtmp.3_20(1)>; 
> <L8>:; 
>   D.501_28 = &__BLNK__.array1[ivtmp.3_12]; 
>   #   __BLNK___11 = V_MAY_DEF <__BLNK___2>; 
>   *D.501_28 = 1.0e+0; 
>   ivtmp.3_20 = ivtmp.3_12 + 1; 
>   D.503_30 = D.457_4 - 1; 
>   D.504_31 = (<unnamed type>) D.503_30; 
>   D.505_32 = D.504_31 + 1; 
>   if (ivtmp.3_20 == D.505_32) goto <L12>; else goto <L8>; 
>   # SUCC: 2 [11.0%]  (loop_exit,true,exec) 1 [89.0%]  (dfs_back,false,exec) 
>  
>   # BLOCK 2 
>   # PRED: 1 [11.0%]  (loop_exit,true,exec) 
> <L12>:; 
>   D.497_22 = ivtmp.3_12 + 2; 
>   lsm_tmp.2_23 = (int4) D.497_22; 
>   #   j_7 = V_MUST_DEF <j_6>; 
>   j = lsm_tmp.2_23; 
>   # SUCC: 3 [100.0%]  (fallthru) 
>  
> Between "ivtmp.3_20 = ivtmp.3_12 + 1;" and "D.497_22 = ivtmp.3_12 + 2;" 
> both versions of ivtmp.3 are live.  I have not looked at what causes this, 
> but IMOH all passes should avoid letting induction variables escape a loop. 

There's not much to avoid -- the iv is simply used outside of the loop,
and we cannot do anything about it.

> Interesting note: I think "D.497_22 = ivtmp.3_12 + 2;" could be replaced 
> with "D.497_22 = ivtmp.3_20 + 1;" which would fix this problem. 

I think this is how it is immediately after ivopts (if not, it would be
a bug); but dom performs this replacement.

Comment 49 Steven Bosscher 2005-01-21 13:27:06 UTC

Correctemundo Zdenek! 
 
Before dom3: 
  # BLOCK 2 
  # PRED: 1 [100.0%]  (fallthru) 2 [89.0%]  (dfs_back,false,exec) 
  # ivtmp.3_12 = PHI <0(1), ivtmp.3_20(2)>; 
  # __BLNK___2 = PHI <__BLNK___8(1), __BLNK___11(2)>; 
<L8>:; 
  D.498_25 = (<unnamed type>) lsm_tmp.2_14; 
  D.499_26 = ivtmp.3_12 + D.498_25; 
  D.500_27 = D.499_26 + ffffffffffffffff; 
  D.501_28 = &__BLNK__.array1[D.500_27]; 
  ruatmp.9_29 = D.501_28; 
  #   __BLNK___11 = V_MAY_DEF <__BLNK___2>; 
  *ruatmp.9_29 = 1.0e+0; 
  ivtmp.3_20 = ivtmp.3_12 + 1; 
  D.503_30 = D.457_4 - lsm_tmp.2_14; 
  D.504_31 = (<unnamed type>) D.503_30; 
  D.505_32 = D.504_31 + 1; 
  if (ivtmp.3_20 == D.505_32) goto <L12>; else goto <L8>; 
  # SUCC: 3 [11.0%]  (loop_exit,true,exec) 2 [89.0%]  (dfs_back,false,exec) 
 
  # BLOCK 3 
  # PRED: 2 [11.0%]  (loop_exit,true,exec) 
  # ivtmp.3_24 = PHI <ivtmp.3_20(2)>; 
<L12>:; 
  D.496_21 = (<unnamed type>) lsm_tmp.2_14; 
  D.497_22 = ivtmp.3_24 + D.496_21; 
  lsm_tmp.2_23 = (int4) D.497_22; 
  lsm_tmp.2_19 = lsm_tmp.2_23; 
  #   j_7 = V_MUST_DEF <j_6>; 
  j = lsm_tmp.2_19; 
  # SUCC: 4 [100.0%]  (fallthru) 
 
After dom3: 
  # BLOCK 2 
  # PRED: 1 [100.0%]  (fallthru) 2 [89.0%]  (dfs_back,false,exec) 
  # ivtmp.3_12 = PHI <0(1), ivtmp.3_20(2)>; 
  # __BLNK___2 = PHI <__BLNK___8(1), __BLNK___11(2)>; 
<L8>:; 
  D.498_25 = 1; 
  D.499_26 = ivtmp.3_12 + 1; 
  D.500_27 = ivtmp.3_12; 
  D.501_28 = &__BLNK__.array1[ivtmp.3_12]; 
  ruatmp.9_29 = D.501_28; 
  #   __BLNK___11 = V_MAY_DEF <__BLNK___2>; 
  *D.501_28 = 1.0e+0; 
  ivtmp.3_20 = ivtmp.3_12 + 1; 
  D.503_30 = D.457_4 - 1; 
  D.504_31 = (<unnamed type>) D.503_30; 
  D.505_32 = D.504_31 + 1; 
  if (ivtmp.3_20 == D.505_32) goto <L12>; else goto <L8>; 
  # SUCC: 3 [11.0%]  (loop_exit,true,exec) 2 [89.0%]  (dfs_back,false,exec) 
 
  # BLOCK 3 
  # PRED: 2 [11.0%]  (loop_exit,true,exec) 
  # ivtmp.3_24 = PHI <ivtmp.3_20(2)>; 
<L12>:; 
  D.496_21 = 1; 
  D.497_22 = ivtmp.3_12 + 2; 
  lsm_tmp.2_23 = (int4) D.497_22; 
  lsm_tmp.2_19 = lsm_tmp.2_23; 
  #   j_7 = V_MUST_DEF <j_6>; 
  j = lsm_tmp.2_23; 
  # SUCC: 4 [100.0%]  (fallthru)

Comment 50 Steven Bosscher 2005-01-21 13:28:25 UTC

Not an out-of-ssa problem.  This is a DOM problem.

Comment 51 Daniel Berlin 2005-01-21 19:19:21 UTC

I just tired this on struct-aliasing, to come up iwth a dom fix, and it
load-store moved everything, so there was nothing left for ivopts to create
ivtmps of :)

I think that's good.
I'll have to go try my patch on the mainline.

Comment 52 Daniel Berlin 2005-01-21 19:30:31 UTC

Created attachment 8034 [details]
Fix for dom propagating loop variant copies over invariant ones

Try the following patch, it should do what you want (allow the ivtmps to
coalesce).
If it works, i will submit to mainline with changelog.

Comment 53 Andrew Pinski 2005-01-21 20:47:01 UTC

(In reply to comment #52)
> Created an attachment (id=8034)
> Fix for dom propagating loop variant copies over invariant ones
> 
> Try the following patch, it should do what you want (allow the ivtmps to
> coalesce).
> If it works, i will submit to mainline with changelog.

Yes this works (I am putting it here just for future reference).

For all the loops in thin6d.f we no longer have multiple BBs for the loops which don't need an extra BB.

Comment 54 GCC Commits 2005-01-22 16:48:42 UTC

Subject: Bug 19038

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	dberlin@gcc.gnu.org	2005-01-22 16:48:24

Modified files:
	gcc            : ChangeLog tree-ssa-dom.c 

Log message:
	2005-01-20  Daniel Berlin  <dberlin@dberlin.org>
	
	Fix PR tree-optimization/19038
	* tree-ssa-dom.c (cprop_operand): Don't replace loop invaeriant
	copies with loop variant ones.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7228&r2=2.7229
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-dom.c.diff?cvsroot=gcc&r1=2.86&r2=2.87

Comment 55 Daniel Berlin 2005-01-22 16:49:42 UTC

Fixed