The combiner jumped up from 3.3.2's 8% to 3.4/3.5's 56% in the attached function. No. of lines in block Run time 600 0.36 1200 2.33 1800 5.77 2400 10.56 There seems to be a quadratic bottleneck in combine. distribute_notes has a loop that goes over all previous instructions. distribute_notes calls reg_set_p, which calls set_of, which calls note_stores, which calls set_of_1, which calls reg_overlap_mentioned_p, which calls refers_to_regno_p... and all these show up quite at the top of the profile. # of instructions function 224,203,200 rtlanal.c:refers_to_regno_p 150,433,200 rtlanal.c:reg_overlap_mentioned_p 75,561,636 rtlanal.c:note_stores 56,985,600 rtlanal.c:reg_referenced_p 49,576,316 regrename.c:validate_value_data 40,026,779 rtl.c:rtx_equal_p 28,740,658 flow.c:mark_set_1 27,562,800 combine.c:distribute_notes 25,928,400 rtlanal.c:set_of_1 15,840,000 rtlanal.c:reg_set_p This might be a duplicate of PR13775.
Created attachment 5617 [details] testcase
Confirmed, on i686, combine takes most of the time, while on powerpc: scheduling : 22.06 (21%) usr 1.10 (34%) sys 49.98 (19%) wall rename registers : 46.87 (44%) usr 0.38 (12%) sys 122.39 (47%) wall scheduling 2 : 20.88 (20%) usr 0.85 (27%) sys 55.33 (21%) wall Takes most of the time.
It is probably because on powerpc the combination attempts fail, while on i686 they succeed (there is an increment-a-memory-location instruction). Gee, the timings for PPC are even worse. :-( As far as the combiner is concerned, it seems to be worst case quadratic in the number of succeeded combinations in a basic block. Maybe it is possible to just propagate notes just 50 or 100 instruction behind in the basic block after a succeeded combination? Or maybe it screws up the liveness information? It may also be ok to simply rerun data-flow analysis after combine (instead of doing distribute_notes) after 50 or 100 succeeded combinations. (I don't have a tree here, otherwise I'd try out these two possibilities).
Ah, is it a regression on PPC?
Actually the regression for PPC-darwin is only for darwin and not PPC linux and is filed in a different PR.
This needs to be fixed, but it would be too risky to try before 3.4.0. Postponed until 3.4.1.
I will take a look at this next week, it might be already helped by the merge of the tree-ssa but I doubt that though.
Actually look into this there is a compile time regression in 3.3.3 also from 3.2.3: tin:~/src/gnu/gcctest>time gcc -S -O1 pr13931.c 2.050u 0.040s 0:02.15 97.2% 0+0k 0+0io 942pf+0w tin:~/src/gnu/gcctest>time ~/ia32_linux_gcc3_2/bin/gcc -S -O1 pr13931.c 0.350u 0.040s 0:00.45 86.6% 0+0k 0+0io 687pf+0w tin:~/src/gnu/gcctest>time ~/ia32_linux_gcc3_3/bin/gcc -S -O1 pr13931.c 1.700u 0.030s 0:01.77 97.7% 0+0k 0+0io 708pf+0w tin:~/src/gnu/gcctest>time ~/ia32_linux_gcc3_4/bin/gcc -S -O1 pr13931.c 1.790u 0.030s 0:01.86 97.8% 0+0k 0+0io 734pf+0w
For powerpc-apple-darwin, most of the time is being taking in get_cse_reg_info. But I will note that powerpc-apple-darwin is now faster than even the 3.3 release or even Apple's 3.3 Hmm I noticed something, for 3.3.1 20030707 on i686-pc-linux-gnu, the speed of the compile is fast but on a released version of 3.3.3 it is just as slow as 3.4.0, so there was a patch between them which made this slower.
If someone could find the patch which make this slower I could do something about it. There has only been one patch to combine after 3.3 which might have caused this: 2003-10-06 Eric Botcazou <ebotcazou@libertysurf.fr> PR optimization/11637 * combine.c (adjust_for_new_dest): New function to adjust the notes and LOG_LINKS when the dest of an insn has changed. (try_combine): Use it when deleting the first insn of a two-insn parallel or splitting a two-load parallel.
Does reverting it fix the problem? If so, please assign the PR to me.
Postponed until GCC 3.4.2.
Postponed until GCC 3.4.3.
Postponed until GCC 3.4.4.
Investigating.
> There has only been one patch to combine after 3.3 which might have caused > this: > > 2003-10-06 Eric Botcazou <ebotcazou@libertysurf.fr> > > PR optimization/11637 > * combine.c (adjust_for_new_dest): New function to adjust the > notes and LOG_LINKS when the dest of an insn has changed. > (try_combine): Use it when deleting the first insn of a two-insn > parallel or splitting a two-load parallel. No, this patch is not responsible for the quadratic behaviour, which is still present as of today on AMD64: reverting it doesn't change anything on mainline. Digging a little further would have showed that the patch only adds a call to distribute_links, not distribute_notes, which contains this comment: Note that this correctly handles the link that used to point from I3 to I2. Also note that not much searching is typically done here since most links don't point very far away. */
And 5 minutes of additional work would have been sufficient to spot the obvious culprit: 2003-05-14 Eric Christopher <echristo@redhat.com> * combine.c: Fix header comments. (distribute_notes): Remove usage of elim_i1, elim_i2. Propagate to all calls and prototype. containing most notably: @@ -12550,19 +12530,14 @@ reg_bitfield_target_p (x, body) as appropriate. I3 and I2 are the insns resulting from the combination insns including FROM (I2 may be zero). - ELIM_I2 and ELIM_I1 are either zero or registers that we know will - not need REG_DEAD notes because they are being substituted for. This - saves searching in the most common cases. - Each note in the list is either ignored or placed on some insns, depending on the type of note. */ Reverting it makes the quadratic behaviour vanish on mainline. It was backported between 3.3.2 and 3.3.3 by: 2003-12-16 Zack Weinberg <zack@codesourcery.com> Backport the following patches from mainline. [...] 2003-05-14 Eric Christopher <echristo@redhat.com> * combine.c: Fix header comments. (distribute_notes): Remove usage of elim_i1, elim_i2. Propagate to all calls and prototype.
Note on the mainline on PPC-darwin we are about twice as fast 1.5 seconds compared to 3.3 (3.2 seconds).
The patch was put in to stop erroneous REG_DEAD notes from being created where they shouldn't IIRC. Now, we may be able to rerun cfg as Paolo suggests, but I don't know for certain. Unless we can prove that new speedups do the right thing all the time I don't think we can put them back in - there was even a note in the old code that we didn't always do the right thing.
Perhaps someone can give an update on this bug?
I thought I did on the 7th. I'm not sure there's anything we can do about this. The behavior is correct and the previous behavior was proved to be not correct. If someone has an idea that's guaranteed to be correct in all cases I'm willing to try to implement it though.
Is there an even way to fix this to even produce correct code?
Thought I did last time.. otherwise I'm not sure what the question is.
You'd almost think Andrew is not a native speaker. What I think he asks is: "Can the patch that caused this be done in a different way such that the code is still correct but the compile time regression goes away."
(In reply to comment #25) (Well considering, I asked the same question which Eric was asking, or really I was asking the same question). And Eric put this into waiting for a reason and I am keeping it there.
We aren't waiting for anything, so move out of WAITING state.
Moving to 4.0.2 pre Mark.
I think this has been fixed on the mainline: 600 .7s 1200 1.899s 1800 3.7s 2400 5.945s This is all with checking still enabled.
The code in combine is no better than it ever was. What has changed is this patch: http://gcc.gnu.org/ml/gcc-patches/2005-07/msg02021.html which causes the optimization of incrementing a memory location to be done at expand time rather than at combine time. Thus in 4.1 this particular test case is no longer slow. I will attach a new test case. I can see some straightforward potential fixes. Instead of doing the loop to look for a place to put the REG_DEAD note, just set the bit in refresh_blocks to rerun life analysis. Or if keeping the loop seems useful (after all, the loop is itself an attempt to control compilation time), at least limit it to looking back just a few instructions, to avoid the quadratic behaviour. A more complex way to fix it would be keep track of which registers are completely set by i1 and i2. If we find a REG_DEAD note for a register which is set by i1 or i2, and that register no longer appears in the new i2 and i3, then we don't care about that REG_DEAD note.
Created attachment 9817 [details] New test case New test case which continues to show the problem in gcc 4.1. This test case also shows the problem in gcc 4.0.
Eric, your patch which caused the quadratic behaviour is here: http://gcc.gnu.org/ml/gcc-patches/2003-05/msg01358.html I know this is a real long-shot, but do you have any recollection of what problem you were solving? That is, which test case was failing before you checked in your patch? And how was it failing? The failure case looks relatively benign to me--we think that a register lives longer than it really does. To me that old code looks pretty much right--in fact it is one of the suggestions I made in my previous comment. In comment #20 you say there was a comment in the code which indicated that it didn't always do the right thing, and in fact you removed that comment in your patch: - - there are extremely rare cases (see distribute_regnotes) when a - REG_DEAD note is lost But in fact that comment not only was correct but still is correct. In some cases, a REG_DEAD note will be lost, and in those cases the code sets a bit in refresh_blocks and reruns life analysis. Actually, I can see that there is a bug in the original code in some unusual cases. It relies on reg_set_p to see whether the register is set and therefore the REG_DEAD note can be discarded. But really we can only casually remove the REG_DEAD note if the register is completely set, and reg_set_p can return true if the register is partially set. For example, if a STRICT_LOW_PART is used when setting the register, we may wind up losing the REG_DEAD note incorrectly. I think that we can only ignore the register if dead_or_set_p returns true for it.
I found the bug report which led to the patch which is the issue in this PR. It starts here: http://gcc.gnu.org/ml/gcc-patches/2003-05/msg00796.html and ends with this patch: http://gcc.gnu.org/ml/gcc-patches/2003-05/msg01358.html I will try to recreate this problem. It looks like we should be able to fix this specific type of problem without losing any speed. We should just should not skip adding the REG_DEAD note if we don't directly set the register. It actually looks like the code works that way anyhow, but with luck I will be able to recreate the problem and figure out what is really happening.
I am able to recreate the problem with a hppa64-hp-hpux11 cross-compiler, in the sense that combine does lose the REG_DEAD note when compiling the gcc.c-torture/compile/920501-13.c test case with -O1. In the current compiler, the compilation succeeds. A REG_UNUSED note is added by the recompute_reg_usage pass, the unused register is allocated, and the allocated register is later eliminated during the combine_stack_adjustments pass. So it works but the overall effect is not ideal since we allocate an unneccessary register. I'll see if we can avoid losing the REG_DEAD note in the combine pass.
Created attachment 9952 [details] Patch This patch fixes the problem in this PR, and also does not lose the REG_DEAD note in the hppa64-hp-hpux11 test case. I won't try to checkin until it has been tested on hppa64-hp-hpux11. Unfortunately, that target reportedly does not bootstrap at the moment for unrelated reasons.
Subject: Bug 13931 CVSROOT: /cvs/gcc Module name: gcc Changes by: ian@gcc.gnu.org 2005-10-11 23:45:57 Modified files: gcc : ChangeLog combine.c Log message: PR rtl-optimization/13931 * combine.c: Revert patch of 2003-05-14, and: (try_combine): Only set elim_i1 and elim_i2 if the destination is completely killed in the appropriate insn. (distribute_notes): Don't skip multiple hard register test for elim_i1 and elim_i2. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.10141&r2=2.10142 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/combine.c.diff?cvsroot=gcc&r1=1.504&r2=1.505
Fixed on mainline. This is a minor compilation time speedup, and I don't think there is much benefit to porting back to the release branches.