GCC Bugzilla – Bug 19038
[4.0 Regression] tree-ssa causing loops to have more than one BB
Last modified: 2005-05-10 13:35:00 UTC
Loop header copying would allow some loops to be transformed from two or more basic blocks into a single basic block, allowing better scheduling of the instructions in the loop. This would help SPEC CPU2000 sixtrack benchmark, for example.
The focus of the problem is the inner loop of functio thin6d at line 572. The function consumes 97.5% of the cycles and the inner loop consumes 48.5%.
The interesting thing is that the front end already does the header "copying" when it generates code: { int4 D.1164; D.1164 = nmz; k = 3; k.12 = k; if (k.12 <= D.1164) { <D1500>:; { ...loop body, straight-line code... __label_000420:; k.12 = k; D.1168 = k.12 == D.1164; k.12 = k; D.1522 = k.12 + 1; k = D.1522; if (D.1168) { goto L.51; } else { } } goto <D1500>; } else { } L.51:; }
If anyone wants the full benchmark I can attach it or sent it to them.
(In reply to comment #3) > If anyone wants the full benchmark I can attach it or sent it to them. That was not for this bug, woops.
I think the problem is the BBs are not being merged if the secondary BB has a user label. If that is so, what we should do is merge the BB and just move the labels. I will try to this and attach the patch so someone can test it as I don't really have access to SPEC.
Could someone see if the patch attached to PR 19049 fixes the bug, I am going to submit it anyways since it helps so much fortran code in general and we can vectorize some more loops.
After getting access to sixtrack sources, steven and I found this is a problem in out of ssa committing on a new BB, I almost have a patch but it has one issue in that we could be the new statement's results in the control flow statement as we insert before it.
The reduced testcase (well heavely modified) looks like: int f (int temp2, int temp3, int xlvj_, int zlvj_, int yv1j_, int yv2j_) { int temp1, temp5; for(;;) { temp1 = temp2*xlvj_ - temp3*zlvj_; temp3 = temp2*zlvj_ + temp3*xlvj_; temp5 = temp2; temp2 = temp1; if (temp3) break; } return yv1j_ * yv2j_; } Before the tree-ssa we would produce one BB when creating RTL but after out of ssa produces an extra BB to hold " temp2 = temp1;"
This also causes a compile time issue as the RTL optimizers and the register allocator have to do more work.
(In reply to comment #8) > The reduced testcase (well heavely modified) looks like: Here is the resulting loop asm (on x86 where shows up worse than ppc which is the same because the RA does its work better there): .L7: movl %edx, %ecx .L3: movl %ecx, %edx movl %ebx, %eax imull %edi, %edx imull %esi, %eax imull %esi, %ecx subl %eax, %edx movl %ebx, %eax movl %ecx, %ebx imull %edi, %eax addl %eax, %ebx je .L7 Compared to 3.4.0: .L2: movl 16(%ebp), %ebx movl %eax, %ecx movl %esi, %edx imull 16(%ebp), %eax imull %edi, %ecx imull %esi, %ebx imull %edi, %edx movl %ebx, %esi subl %ecx, %esi addl %edx, %eax je .L2
This is not a compile time hog, it's a performance regression. *Please*, I beg you, DON'T USE KEYWORDS LIKE REGRESSION AND HOG SO G** D*** OFTEN FOR TESTS THAT ARE NOT REGRESSIONS OR HOGS
Note I also see it for some simple fortran code see PR 14741.
Fixing the problem at the RTL evel with loop header copying improves sixtrack performance 12.5%.
s/fixing/papering over/ in the previous comment ;-) This appears to be a problem with loop-closed ssa form, but I don't quite understand it yet. But we should fix this on trees no matter what, we don't need another RTL pass. But 12.5%, wow! Making this a "normal" bug (not `enhancement') and upgrading to P1. We really shoud fix this for GCC 4.0. If not for performance then for the just completely broken idea that we turn a simple 1 BB loop into a complex multiple-BBs loop. I'm quite sure that this hurts a lot more than just SMS. (Actually, that is what that 12.5% shows, of course.)
Created attachment 7813 [details] quick hack for tree-outof-ssa This is a quick hack which I don't feel like a real solution but it does improve all the cases which I can think of but still has an issue when we use the copy in the conditional. Also it might be correct as if we have a switch which turns into a loop it could cause a problem (yes that does not happen hardly at all) but it is still not the correct solution.
(In reply to comment #15) > Created an attachment (id=7813) > quick hack for tree-outof-ssa Note this patch does not even allow to bootstrap.
Doing loop header copying in RTL works around this problem, doing so gain us improvements of 3.9% for SPECINT on a G5 machine and a 66% for gcc benchmark.
A smaller test case: int f (int temp2, int xlvj_) { int temp1; for(;;) { temp1 = temp2*xlvj_; temp2 = temp1; if (temp1) break; } return xlvj_; } After out-of-ssa this gives the following tree dump: ;; Function f (f) Analyzing Edge Insertions. f (temp2, xlvj_) { int temp1; int D.1471; <L4>:; <L0>:; temp1 = temp2 * xlvj_; if (temp1 != 0) goto <L2>; else goto <L7>; <L7>:; temp2 = temp1; goto <bb 1> (<L0>); <L2>:; return xlvj_; } As far as I can tell the following test case is semantically equivalent to the one above: int f (int temp2, int xlvj_) { for(;;) { temp2 = temp2*xlvj_; if (temp2) break; } return xlvj_; } But the tree dumps are definitely not the same, in the smaller test case we don't have the extra copy: ;; Function f (f) Analyzing Edge Insertions. f (temp2, xlvj_) { int D.1470; <L4>:; <L0>:; temp2 = temp2 * xlvj_; if (temp2 != 0) goto <L2>; else goto <L0>; <L2>:; return xlvj_; } So apparently something is preventing temp1 and temp2 from being coalesced when going out of ssa.
Side-by-side just before out-of-ssa: f (temp2D.1464, xlvj_D.1465) f (temp2D.1464, xlvj_D.1465) { { intD.0 D.1470; | intD.0 temp1D.1468; > intD.0 D.1471; # BLOCK 0 # BLOCK 0 # PRED: ENTRY [100.0%] (fallthru,exec) # PRED: ENTRY [100.0%] (fallthru,exec) <L4>:; <L4>:; # SUCC: 1 [100.0%] (fallthru,exec) # SUCC: 1 [100.0%] (fallthru,exec) # BLOCK 1 # BLOCK 1 # PRED: 0 [100.0%] (fallthru,exec) 1 [89.0%] (dfs_back,fa # PRED: 0 [100.0%] (fallthru,exec) 1 [89.0%] (dfs_back,fa # temp2D.1464_1 = PHI <temp2D.1464_2(0), temp2D.1464_4(1)>; | # temp2D.1464_1 = PHI <temp2D.1464_2(0), temp1D.1468_4(1)>; <L0>:; <L0>:; temp2D.1464_4 = temp2D.1464_1 * xlvj_D.1465_3; | temp1D.1468_4 = temp2D.1464_1 * xlvj_D.1465_3; if (temp2D.1464_4 != 0) goto <L2>; else goto <L0>; | if (temp1D.1468_4 != 0) goto <L2>; else goto <L0>; # SUCC: 2 [11.0%] (loop_exit,true,exec) 1 [89.0%] (dfs_ba # SUCC: 2 [11.0%] (loop_exit,true,exec) 1 [89.0%] (dfs_ba # BLOCK 2 # BLOCK 2 # PRED: 1 [11.0%] (loop_exit,true,exec) # PRED: 1 [11.0%] (loop_exit,true,exec) <L2>:; <L2>:; return xlvj_D.1465_3; return xlvj_D.1465_3; # SUCC: EXIT [100.0%] # SUCC: EXIT [100.0%] } }
Silly bugzilla... Sorry about that. A piece of unidiff, then. (-) is the smaller test that works, (+) is the test where we get the extra copy. What I wanted to show is that only the root variable is different in the PHI argument: @@ -28,10 +29,10 @@ # BLOCK 1 # PRED: 0 [100.0%] (fallthru,exec) 1 [89.0%] (dfs_back,false,exec) - # temp2D.1464_1 = PHI <temp2D.1464_2(0), temp2D.1464_4(1)>; + # temp2D.1464_1 = PHI <temp2D.1464_2(0), temp1D.1468_4(1)>; <L0>:; - temp2D.1464_4 = temp2D.1464_1 * xlvj_D.1465_3; - if (temp2D.1464_4 != 0) goto <L2>; else goto <L0>; + temp1D.1468_4 = temp2D.1464_1 * xlvj_D.1465_3; + if (temp1D.1468_4 != 0) goto <L2>; else goto <L0>; # SUCC: 2 [11.0%] (loop_exit,true,exec) 1 [89.0%] (dfs_back,false,exec) # BLOCK 2 @@ -48,45 +49,45 @@ This comes from DOM copy-proping around the loop. Before DOM1 we have a function that looks like this (.dce1): ;; Function f (f) f (temp2, xlvj_) { int temp1; int D.1471; # temp2_1 = PHI <temp2_2(-1), temp2_5(0)>; <L0>:; temp1_4 = temp2_1 * xlvj__3; temp2_5 = temp1_4; if (temp1_4 != 0) goto <L2>; else goto <L0>; <L2>:; xlvj__6 = xlvj__3; return xlvj__6; } After DOM1 (the .dom1 dump) we have this: ;; Function f (f) f (temp2, xlvj_) { int temp1; int D.1471; # temp2_1 = PHI <temp2_2(-1), temp1_4(0)>; <L0>:; temp1_4 = temp2_1 * xlvj__3; temp2_5 = temp1_4; if (temp1_4 != 0) goto <L2>; else goto <L0>; <L2>:; xlvj__6 = xlvj__3; return xlvj__3; } We never recover from this, we don't coalesce, and there's your copy.
Possible work-arounds/hacks: - Don't copy-prop into a PHI argument when the root variable of the PHI result is the same as the root variable of the copy that is being replaced. Seems silly that this would be necessary. - Try to insert copies before conditional jumps if the jump condition does not depend on anything in the copy. - Figure out why we don't coalesce temp1 and temp2. Can we coalesce them? If so, why are we failing to do it?
Just to followup on irc discussions. Being able to coalesce this may require moving to a different out-of-ssa algorithm that is better at coalescing. All of the below take different approaches to the coalescing problem, however, they may provide some insight as to whether/how these two variables can be coalesced. The briggs one (which is what morgan is based on, which is what the comments claim is used) is known not to coalesce a significant number of variables. A comparison of the two major algorithms for out-of-ssa (briggs and sreedhar's) can be found at http://www.is.titech.ac.jp/~sassa/papers-written/sassa-kohama-ito-ipsi04.pdf Sreedhar's removes roughly 80% more copies than brigg's algorithm. Note that the authors propose their own algorithm, even though it does significantly worse than sreedhars, so i don't consider it a serious contender here. Sreedhar's can be found at http://tinyurl.com/5axtl i have a copy of the paper if you have no access. In addition, yet another different approach to out-of-ssa coalescing was presented at cgo 2004. http://www.cgo.org/cgo2004/papers/21_14_rastello_f_revised.pdf I mention it because it supposedly can coalesce even more than sreedhar's algorithm is able to. I point these out only as references to hopefully be able coalesce these variables. I don't believe totally rewriting out-of-ssa would be apropos for 4.0 unless the new algorithm is significantly simpler, etc (maybe one of the above would be, i don't know).
This will happen with any "DO LABEL VAR=START,END" loop in fortran.
Trying to solve this problem by avoiding copy-propagation in special cases is definitely not the right approach. Any time we do a copy propagation or redundancy elimination we run the risk of possibly creating overlapping lifetimes for objects which did not previously overlap. Our focus needs to be on figuring out why we did not coalesce everything to a single variable.
A few more notes. I would suggest starting the testcase in comment #8. While the simpler testcase shows the underlying issue, it's simplified so far that the "problem" could be fixed in various ways which totally avoid the copy, but which do not help the larger testcases. If we go back to the testcase in comment #8 and analyze the code closely we will see that a copy is inevitable. If we copy propagate into the PHI, then the copy gets inserted into its own block. If we do not copy propagate into the PHI, then the copy sits just before the loop exit test. Having the copy before the loop exit test rather than splitting the loop backedge is better for a number of reasons. So the issue at hand is not an extraneous copy, but instead the *placement* of that copy. We queue the copy to be inserted on the backedge of the loop, which is clearly a correct insertion point. However, if we use life analysis we can determine that we can place the copy before the loop exit test. The out-of-ssa code computes object lifetimes, so, in theory we have the information we need. What we'd need to do is enhance the copy placement code to be a little smarter when we want to insert a copy on a critical edge. Given a copy dest = src that we want to place on a critical edge E, we'd want to check if dest is live on any of the other edges in e->src->succs -- if dest is not live on any of those edges, then we can place the copy at the end of e->src. Or something along those lines. I have no idea how hard that would be to implement within the current out-of-ssa framework we have in place. It's worth nothing that copy placement on loop backedges for out-of-ssa has been an area that has been problematical before. I ran into a number of regressions in the RTL SSA code with poor placements (mostly in ways that confused the old old rtl loop optimizers). So it's not a _huge_ surprise that we're seeing some issues in the same areas with the tree out-of-ssa code.
One final set of notes before I sign off for the night. Daniel -- thanks for the pointers to the new papers. I've studied Sreedhar's algorith in the past and thought it looked promising -- it's good to see some hard #s comparing it to Briggs. I'm not convinced Sreedhar's is simpler to implement, but it does look like it's worth some effort. I'm not yet sure if Sreedhar's algorithm will result in a better placement for the copy in this testcase or not. I'll have to re-review the algorithm.
Yea, the code I've got lying around here to undo unprofitable const/copy propagations definitely helps this code. And better yet, we don't need the full-blown version (which is queued for 4.1). We can get by with the very simplistic version. For the testcase in comment #8 we generate the following .optimized dump: # BLOCK 0 # PRED: ENTRY [100.0%] (fallthru,exec) <L4>:; # SUCC: 1 [100.0%] (fallthru,exec) # BLOCK 1 # PRED: 0 [100.0%] (fallthru,exec) 1 [89.0%] (dfs_back,false,exec) <L0>:; temp1 = temp2 * xlvj_ - temp3 * zlvj_; temp3 = temp2 * zlvj_ + temp3 * xlvj_; temp2 = temp1; if (temp3 != 0) goto <L2>; else goto <L0>; # SUCC: 2 [11.0%] (loop_exit,true,exec) 1 [89.0%] (dfs_back,false,exec) # BLOCK 2 # PRED: 1 [11.0%] (loop_exit,true,exec) <L2>:; return yv1j_ * yv2j_; # SUCC: EXIT [100.0%] }
http://gcc.gnu.org/ml/gcc-patches/2004-12/msg02063.html
The reference patch does not fix the problem all the time, see PR 14741 for a testcase where the patch fails, and it fails still for sixtrack but I should note that this would be now caused by IV-OPTs.
(In reply to comment #29) > The reference patch does not fix the problem all the time, see PR 14741 for a testcase where the patch > fails, and it fails still for sixtrack but I should note that this would be now caused by IV-OPTs. For that testcase: without IV-OPTS: L33: lfd f13,0(r9) add r9,r9,r8 lfd f0,0(r11) addi r11,r11,8 fmadd f0,f13,f0,f12 fmr f12,f0 bdnz L33 With: L43: fmr f12,f0 L33: lfd f13,0(r9) add r9,r9,r0 lfd f0,0(r11) addi r11,r11,8 fmadd f0,f13,f0,f12 bdnz L43 Notice how there are two BB for this loop.
Subject: Re: [4.0 Regression] out-of ssa causing loops to have more than one BB On Wed, 2004-12-29 at 20:50 +0000, pinskia at gcc dot gnu dot org wrote: > ------- Additional Comments From pinskia at gcc dot gnu dot org 2004-12-29 20:50 ------- > The reference patch does not fix the problem all the time, see PR 14741 for a testcase where the patch > fails, and it fails still for sixtrack but I should note that this would be now caused by IV-OPTs. I don't expect to fix it all the time. There's a fundamental problem in that copy propagation can lengthen lifetimes and thus require a copy at out-of-ssa time. The patch is meant to fix those cases which can be trivially detected and for which we can generate better code. Changing the out-of-ssa code to use Sreedhar's algorithm (which appears to do a lot better at avoiding copies) isn't something I'd advocate for 4.1. But even with a better out-of-ssa algorithm there are still going to be cases where out-of-ssa is going to have to generate a copy. No way around it. Now if you think that PR is a trivial case that should be caught, then, show me why and I'll take a closer look. jeff
(In reply to comment #31) > Subject: Re: [4.0 Regression] out-of ssa > causing loops to have more than one BB > Now if you think that PR is a trivial case that should be caught, then, > show me why and I'll take a closer look. The reason why it is not caught is because we don't cleanup the cfg while doing the loop optimizations, this has been fixed already on the tcb. Oh, by the way I see that sixtrack has regressed on x86 now with your patch applied, I think this is because we still have the same problem as before as ivopts puts the new instruction in an empty BB which becomes from not cleaning up the cfg.
Subject: Re: [4.0 Regression] out-of ssa causing loops to have more than one BB On Thu, 2004-12-30 at 17:27 +0000, pinskia at gcc dot gnu dot org wrote: > ------- Additional Comments From pinskia at gcc dot gnu dot org 2004-12-30 17:27 ------- > (In reply to comment #31) > > Subject: Re: [4.0 Regression] out-of ssa > > causing loops to have more than one BB > > > Now if you think that PR is a trivial case that should be caught, then, > > show me why and I'll take a closer look. > > The reason why it is not caught is because we don't cleanup the cfg while doing the > loop optimizations, this has been fixed already on the tcb. Can you be more precise how cleaning up the CFG during the loop optimizer affects the code that we see during out-of-ssa. Specifically how does it affect PHI arguments on backedges and the proper marking of backedges in the CFG? > > Oh, by the way I see that sixtrack has regressed on x86 now with your patch applied, I think this is > because we still have the same problem as before as ivopts puts the new instruction in an empty BB > which becomes from not cleaning up the cfg. Again, more information please on how this affects us during out-of-ssa? I'm happy to look into these problems, but you've apparently got a lot more state on them than I do. I'd like to learn what you already know to speed up that process. jeff
(In reply to comment #33) > Subject: Re: [4.0 Regression] out-of ssa > causing loops to have more than one BB > > > > Now if you think that PR is a trivial case that should be caught, then, > > > show me why and I'll take a closer look. > > > > The reason why it is not caught is because we don't cleanup the cfg while doing the > > loop optimizations, this has been fixed already on the tcb. > Can you be more precise how cleaning up the CFG during the loop > optimizer affects the code that we see during out-of-ssa. Specifically > how does it affect PHI arguments on backedges and the proper marking > of backedges in the CFG? It has nothing to do with out-of-ssa any more, sorry for not being clear but here are the chain of events for the current problem. We split the critial edges so when we try to create the IV (in iv-opts), we insert an instruction on the bb which is the empty and otherwise useless but if we had cleaned up the CFG before running IV-OPTS, we would insert it right before the condition just like your patch does for out of ssa. And now we have this extra bb that is not useless, out of ssa is going to use instead of using your new scheme.
(In reply to comment #34) Actually it is because we are placing statements in the loop's latch. It is done in create_new_iv.
Subject: Re: [4.0 Regression] out-of ssa causing loops to have more than one BB On Thu, 2004-12-30 at 21:51 +0000, pinskia at gcc dot gnu dot org wrote: > ------- Additional Comments From pinskia at gcc dot gnu dot org 2004-12-30 21:50 ------- > (In reply to comment #34) > Actually it is because we are placing statements in the loop's latch. > It is done in create_new_iv. So just to be clear, it's not my change that is causing a regression. The IV code creates situations which prevent my change from having any kind of impact because the loop backedge is already split by the IV code (and thus the loop backedge is no longer critical and my out-of-ssa code does nothing). Right? Jeff ps. It seems to me that the IV code could use a trick similar to what I did to the out-of-ssa code. The only significant complication would be that the IV code would have to verify that the code it wants to insert is safe on both path (loop backedge and loop exit).
(In reply to comment #36) > Subject: Re: [4.0 Regression] out-of ssa > causing loops to have more than one BB> So just to be clear, it's not my change that is causing a regression. > The IV code creates situations which prevent my change from having > any kind of impact because the loop backedge is already split by > the IV code (and thus the loop backedge is no longer critical and > my out-of-ssa code does nothing). > > Right? Yes.
So let me see, If I read this properly, this is no longer an out of ssa problem? notes: - changing where we place copies based on jeffs observations in comment 25 ought not be difficult. - If we wish to change the coalesce algorithm, we should be able to simply plug a new one in. The existing one was just a quick straightforward implementation to get it working with the full intention of improving it later. Its also reasonably flexible and cost driven. tree-outof-ssa.c::coalesce_ssa_name() does all the setup work. A list of possible coalesces is created, given costs and the sorted. Coalesces are then performed by calling tree-ssa-live::coalesce_tpa_members() (tpa's are generic lists which associate trees, ie VAR trees to a partition index). The routine tree-ssa-live.c::pop_best_coalesce() deterimes the order to attempt coalesces. currently this simply retreives them in sorted ordered by "cost". cost is calculated during the building of the coalesce list via calls to tree-ssa-live.c::add_coalesce(), and is a cost value passed to the routine. The current implementation simply calls add_coalesce with a value of 1 all the time, so it ends up being simply a copy count.. ie, we try to coalesce variables which are copied between each other 4 times before we try those copied once. The original intention was to multiple each copy by 10^nest_level_of_copy, but that info wasnt easily available when I implemented it. I beleive it is now, and just haven't gotten around to doing it. It would be trivial to add that to the calls to add_coalesce. We could increase the cost again if its going to insert a copy on a latch edge quite easily. (maybe 20^nest_level instead or some such thing) Making those even more important. add_coalesce should probably be changed to avoid wraparound as well I suppose.
Changing the summary to reflect the current problem.
Subject: Re: [4.0 Regression] out-of ssa causing loops to have more than one BB On Wed, 2005-01-12 at 13:34 +0000, amacleod at redhat dot com wrote: > ------- Additional Comments From amacleod at redhat dot com 2005-01-12 13:34 ------- > So let me see, If I read this properly, this is no longer an out of ssa problem? > > notes: > - changing where we place copies based on jeffs observations in comment 25 ought > not be difficult. Good to know. Though I doubt it's really necessary. > - If we wish to change the coalesce algorithm, we should be able to simply plug > a new one in. Maybe, maybe not. The quick and dirty synopsis of Sreedhar's approach is to rewrite PHIs so that none of the args conflict with each other or the PHI result. Once that's done you coalesce the PHI args & result together, which you know is always safe. Coupled with that idea is some improvements in when/where copies are inserted (which have certain parallels to the ideas in comment #25. Contrast that to what we do, which is more like coalescing in a register allocator. If after coalescing we find that the PHI result and one or more of its arguments are in different partitions, then we insert a copy on the edge associated with the PHI argument. So the algorithms differ first in when they insert copies, where the copies are inserted and to some extent how they determine when copies are needed. [ There's another approach from the Rice team which is similar to Sreedhar, but from what I could tell does not perform particularly well. ] If I were looking for incremental improvements to our exiting out-of-ssa, I would look at bumping the cost of copies associated with backedges. Sreedhar's approach isn't really an incremental improvement -- it's more like a pre-pass for out-of-ssa which obliviates the need to do copy coalescing during out-of-ssa. jeff
Subject: Bug 19038 CVSROOT: /cvs/gcc Module name: gcc Changes by: rakdver@gcc.gnu.org 2005-01-19 22:50:06 Modified files: gcc : ChangeLog tree-ssa-loop-ivopts.c Log message: PR tree-optimization/19038 * tree-ssa-loop-ivopts.c (allow_ip_end_pos_p): New function. (add_candidate): Add ivs with increment in latch only if allow_ip_end_pos_p is true. (determine_iv_cost): Use empty_block_p. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7191&r2=2.7192 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-loop-ivopts.c.diff?cvsroot=gcc&r1=2.41&r2=2.42
Now it is back to out of ssa causing one extra BB.
(In reply to comment #42) > Now it is back to out of ssa causing one extra BB. I will try get a smaller testcase for this problem now. Thanks Zdenek for fixing the IV-OPTs related problem.
(In reply to comment #43) > (In reply to comment #42) > > Now it is back to out of ssa causing one extra BB. > I will try get a smaller testcase for this problem now. Thanks Zdenek for fixing the IV-OPTs related > problem. Here is the reduced fortran testcase (I tried to convert it to C but I must not be matching all the syntax correctly) and change the names so I would not get into trouble with SPEC: subroutine function11(max) parameter(maxindex=64) common array1(maxindex),array2(maxindex) save do 20 j=1,max array1(j)=array2(j) 20 continue return end
Here is another reduced testcase: subroutine thin6d(max) parameter(maxindex=64) common array1(maxindex) save do 20 j=1,max array1(j)=1.0 20 continue return end but note disabling ivopts "fixes" the problem but it is not caused by ivopts per say this time: Before out of SSA: # ivtmp.6_21 = PHI <ivtmp.6_24(1), ivtmp.6_22(3)>; # ivtmp.3_12 = PHI <0(1), ivtmp.3_20(3)>; <L7>:; [thin6d.f : 6] *ivtmp.6_21 = 1.0e+0; __label_000020:; ivtmp.3_20 = ivtmp.3_12 + 1; ivtmp.6_22 = ivtmp.6_21 + 4B; [thin6d.f : 1] D.493_30 = D.451_4 - 1; [thin6d.f : 1] D.494_31 = (<unnamed type>) D.493_30; [thin6d.f : 1] D.495_32 = D.494_31 + 1; [thin6d.f : 5] if (ivtmp.3_20 == D.495_32) [thin6d.f : 5] goto <L11>; else goto <L7>; Note this is looks very normal and like it should work correctly (except for the ((cast)(a-1))+1 but I know Zdenek has a patch for that). But for some reason (have not looked actually) we add an extra BB in out of SSA: <L7>:; [thin6d.f : 6] *ivtmp.6 = 1.0e+0; __label_000020:; ivtmp.10 = ivtmp.3 + 1; ivtmp.6 = ivtmp.6 + 4B; [thin6d.f : 5] if (ivtmp.10 == (<unnamed type>) (D.451 - 1) + 1) [thin6d.f : 5] goto <L11>; else goto <L13>; <L13>:; ivtmp.3 = ivtmp.10; goto <bb 2> (<L7>); Looks like we are not coalescing the following PHI: # ivtmp.3_12 = PHI <0(1), ivtmp.3_20(3)>; But why not.
(In reply to comment #45) From -details .optimized file: > Coalesce list: (0)ivtmp.3_12 & (1)ivtmp.3_20 [map: 0, 1] : Fail due to conflict Huh, how do they conflict, I don't see how. Hmm, I assume that tree-ssa-live is wrong here.
No, tree-ssa-live is quite right. ivtmp.3_12 and ivtmp.3_20 are defined and live at some common statements, so they conflict: # BLOCK 1 # PRED: 0 [88.4%] (true,exec) 1 [89.0%] (dfs_back,false,exec) # ivtmp.3_12 = PHI <0(0), ivtmp.3_20(1)>; <L8>:; D.501_28 = &__BLNK__.array1[ivtmp.3_12]; # __BLNK___11 = V_MAY_DEF <__BLNK___2>; *D.501_28 = 1.0e+0; ivtmp.3_20 = ivtmp.3_12 + 1; D.503_30 = D.457_4 - 1; D.504_31 = (<unnamed type>) D.503_30; D.505_32 = D.504_31 + 1; if (ivtmp.3_20 == D.505_32) goto <L12>; else goto <L8>; # SUCC: 2 [11.0%] (loop_exit,true,exec) 1 [89.0%] (dfs_back,false,exec) # BLOCK 2 # PRED: 1 [11.0%] (loop_exit,true,exec) <L12>:; D.497_22 = ivtmp.3_12 + 2; lsm_tmp.2_23 = (int4) D.497_22; # j_7 = V_MUST_DEF <j_6>; j = lsm_tmp.2_23; # SUCC: 3 [100.0%] (fallthru) Between "ivtmp.3_20 = ivtmp.3_12 + 1;" and "D.497_22 = ivtmp.3_12 + 2;" both versions of ivtmp.3 are live. I have not looked at what causes this, but IMOH all passes should avoid letting induction variables escape a loop. Interesting note: I think "D.497_22 = ivtmp.3_12 + 2;" could be replaced with "D.497_22 = ivtmp.3_20 + 1;" which would fix this problem.
Subject: Re: [4.0 Regression] out of ssa causing loops to have more than one BB > No, tree-ssa-live is quite right. ivtmp.3_12 and ivtmp.3_20 are defined > and live at some common statements, so they conflict: > > # BLOCK 1 > # PRED: 0 [88.4%] (true,exec) 1 [89.0%] (dfs_back,false,exec) > # ivtmp.3_12 = PHI <0(0), ivtmp.3_20(1)>; > <L8>:; > D.501_28 = &__BLNK__.array1[ivtmp.3_12]; > # __BLNK___11 = V_MAY_DEF <__BLNK___2>; > *D.501_28 = 1.0e+0; > ivtmp.3_20 = ivtmp.3_12 + 1; > D.503_30 = D.457_4 - 1; > D.504_31 = (<unnamed type>) D.503_30; > D.505_32 = D.504_31 + 1; > if (ivtmp.3_20 == D.505_32) goto <L12>; else goto <L8>; > # SUCC: 2 [11.0%] (loop_exit,true,exec) 1 [89.0%] (dfs_back,false,exec) > > # BLOCK 2 > # PRED: 1 [11.0%] (loop_exit,true,exec) > <L12>:; > D.497_22 = ivtmp.3_12 + 2; > lsm_tmp.2_23 = (int4) D.497_22; > # j_7 = V_MUST_DEF <j_6>; > j = lsm_tmp.2_23; > # SUCC: 3 [100.0%] (fallthru) > > Between "ivtmp.3_20 = ivtmp.3_12 + 1;" and "D.497_22 = ivtmp.3_12 + 2;" > both versions of ivtmp.3 are live. I have not looked at what causes this, > but IMOH all passes should avoid letting induction variables escape a loop. There's not much to avoid -- the iv is simply used outside of the loop, and we cannot do anything about it. > Interesting note: I think "D.497_22 = ivtmp.3_12 + 2;" could be replaced > with "D.497_22 = ivtmp.3_20 + 1;" which would fix this problem. I think this is how it is immediately after ivopts (if not, it would be a bug); but dom performs this replacement.
Correctemundo Zdenek! Before dom3: # BLOCK 2 # PRED: 1 [100.0%] (fallthru) 2 [89.0%] (dfs_back,false,exec) # ivtmp.3_12 = PHI <0(1), ivtmp.3_20(2)>; # __BLNK___2 = PHI <__BLNK___8(1), __BLNK___11(2)>; <L8>:; D.498_25 = (<unnamed type>) lsm_tmp.2_14; D.499_26 = ivtmp.3_12 + D.498_25; D.500_27 = D.499_26 + ffffffffffffffff; D.501_28 = &__BLNK__.array1[D.500_27]; ruatmp.9_29 = D.501_28; # __BLNK___11 = V_MAY_DEF <__BLNK___2>; *ruatmp.9_29 = 1.0e+0; ivtmp.3_20 = ivtmp.3_12 + 1; D.503_30 = D.457_4 - lsm_tmp.2_14; D.504_31 = (<unnamed type>) D.503_30; D.505_32 = D.504_31 + 1; if (ivtmp.3_20 == D.505_32) goto <L12>; else goto <L8>; # SUCC: 3 [11.0%] (loop_exit,true,exec) 2 [89.0%] (dfs_back,false,exec) # BLOCK 3 # PRED: 2 [11.0%] (loop_exit,true,exec) # ivtmp.3_24 = PHI <ivtmp.3_20(2)>; <L12>:; D.496_21 = (<unnamed type>) lsm_tmp.2_14; D.497_22 = ivtmp.3_24 + D.496_21; lsm_tmp.2_23 = (int4) D.497_22; lsm_tmp.2_19 = lsm_tmp.2_23; # j_7 = V_MUST_DEF <j_6>; j = lsm_tmp.2_19; # SUCC: 4 [100.0%] (fallthru) After dom3: # BLOCK 2 # PRED: 1 [100.0%] (fallthru) 2 [89.0%] (dfs_back,false,exec) # ivtmp.3_12 = PHI <0(1), ivtmp.3_20(2)>; # __BLNK___2 = PHI <__BLNK___8(1), __BLNK___11(2)>; <L8>:; D.498_25 = 1; D.499_26 = ivtmp.3_12 + 1; D.500_27 = ivtmp.3_12; D.501_28 = &__BLNK__.array1[ivtmp.3_12]; ruatmp.9_29 = D.501_28; # __BLNK___11 = V_MAY_DEF <__BLNK___2>; *D.501_28 = 1.0e+0; ivtmp.3_20 = ivtmp.3_12 + 1; D.503_30 = D.457_4 - 1; D.504_31 = (<unnamed type>) D.503_30; D.505_32 = D.504_31 + 1; if (ivtmp.3_20 == D.505_32) goto <L12>; else goto <L8>; # SUCC: 3 [11.0%] (loop_exit,true,exec) 2 [89.0%] (dfs_back,false,exec) # BLOCK 3 # PRED: 2 [11.0%] (loop_exit,true,exec) # ivtmp.3_24 = PHI <ivtmp.3_20(2)>; <L12>:; D.496_21 = 1; D.497_22 = ivtmp.3_12 + 2; lsm_tmp.2_23 = (int4) D.497_22; lsm_tmp.2_19 = lsm_tmp.2_23; # j_7 = V_MUST_DEF <j_6>; j = lsm_tmp.2_23; # SUCC: 4 [100.0%] (fallthru)
Not an out-of-ssa problem. This is a DOM problem.
I just tired this on struct-aliasing, to come up iwth a dom fix, and it load-store moved everything, so there was nothing left for ivopts to create ivtmps of :) I think that's good. I'll have to go try my patch on the mainline.
Created attachment 8034 [details] Fix for dom propagating loop variant copies over invariant ones Try the following patch, it should do what you want (allow the ivtmps to coalesce). If it works, i will submit to mainline with changelog.
(In reply to comment #52) > Created an attachment (id=8034) > Fix for dom propagating loop variant copies over invariant ones > > Try the following patch, it should do what you want (allow the ivtmps to > coalesce). > If it works, i will submit to mainline with changelog. Yes this works (I am putting it here just for future reference). For all the loops in thin6d.f we no longer have multiple BBs for the loops which don't need an extra BB.
Subject: Bug 19038 CVSROOT: /cvs/gcc Module name: gcc Changes by: dberlin@gcc.gnu.org 2005-01-22 16:48:24 Modified files: gcc : ChangeLog tree-ssa-dom.c Log message: 2005-01-20 Daniel Berlin <dberlin@dberlin.org> Fix PR tree-optimization/19038 * tree-ssa-dom.c (cprop_operand): Don't replace loop invaeriant copies with loop variant ones. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7228&r2=2.7229 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-ssa-dom.c.diff?cvsroot=gcc&r1=2.86&r2=2.87
Fixed