Bug 13931 - [3.4/4.0/4.1 Regression] combiner much slower on big basic blocks
Summary: [3.4/4.0/4.1 Regression] combiner much slower on big basic blocks
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 3.4.0
: P3 normal
Target Milestone: 4.1.0
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog
Depends on:
Blocks:
 
Reported: 2004-01-30 12:41 UTC by Paolo Bonzini
Modified: 2005-10-11 23:47 UTC (History)
4 users (show)

See Also:
Host: i686-pc-linux-gnu
Target: i686-pc-linux-gnu
Build: i686-pc-linux-gnu
Known to work: 3.2.3 3.3 3.3.2
Known to fail: 3.3.3 3.4.0 4.0.0
Last reconfirmed: 2005-01-28 18:49:55


Attachments
testcase (85 bytes, text/plain)
2004-01-30 12:42 UTC, Paolo Bonzini
Details
New test case (126 bytes, text/plain)
2005-09-27 05:14 UTC, Ian Lance Taylor
Details
Patch (2.58 KB, patch)
2005-10-10 18:19 UTC, Ian Lance Taylor
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Paolo Bonzini 2004-01-30 12:41:07 UTC
The combiner jumped up from 3.3.2's 8% to 3.4/3.5's 56% in the attached function. 
 
No. of lines in block            Run time 
600                                      0.36 
1200				    2.33 
1800                                    5.77 
2400                                   10.56 
 
There seems to be a quadratic bottleneck in combine.  distribute_notes has a loop that 
goes over all previous instructions.  distribute_notes calls reg_set_p, which calls set_of, 
which calls note_stores, which calls set_of_1, which calls reg_overlap_mentioned_p, 
which calls refers_to_regno_p... and all these show up quite at the top of the profile. 
 
# of instructions       function 
224,203,200  rtlanal.c:refers_to_regno_p 
150,433,200  rtlanal.c:reg_overlap_mentioned_p 
 75,561,636  rtlanal.c:note_stores 
 56,985,600  rtlanal.c:reg_referenced_p 
 49,576,316  regrename.c:validate_value_data 
 40,026,779  rtl.c:rtx_equal_p 
 28,740,658  flow.c:mark_set_1 
 27,562,800  combine.c:distribute_notes 
 25,928,400  rtlanal.c:set_of_1 
 15,840,000  rtlanal.c:reg_set_p 
 
This might be a duplicate of PR13775.
Comment 1 Paolo Bonzini 2004-01-30 12:42:29 UTC
Created attachment 5617 [details]
testcase
Comment 2 Andrew Pinski 2004-01-30 17:23:22 UTC
Confirmed, on i686, combine takes most of the time, while on powerpc:

 scheduling            :  22.06 (21%) usr   1.10 (34%) sys  49.98 (19%) wall
 rename registers      :  46.87 (44%) usr   0.38 (12%) sys 122.39 (47%) wall
 scheduling 2          :  20.88 (20%) usr   0.85 (27%) sys  55.33 (21%) wall 
Takes most of the time.
Comment 3 Paolo Bonzini 2004-01-30 19:24:01 UTC
It is probably because on powerpc the combination attempts fail, while on i686 
they succeed (there is an increment-a-memory-location instruction).  Gee, the 
timings for PPC are even worse. :-(

As far as the combiner is concerned, it seems to be worst case quadratic in the 
number of succeeded combinations in a basic block.  Maybe it is possible to 
just propagate notes just 50 or 100 instruction behind in the basic block after 
a succeeded combination?  Or maybe it screws up the liveness information?

It may also be ok to simply rerun data-flow analysis after combine (instead of 
doing distribute_notes) after 50 or 100 succeeded combinations.

(I don't have a tree here, otherwise I'd try out these two possibilities).
Comment 4 Paolo Bonzini 2004-01-30 19:26:09 UTC
Ah, is it a regression on PPC?
Comment 5 Andrew Pinski 2004-02-04 16:39:53 UTC
Actually the regression for PPC-darwin is only for darwin and not PPC linux and is filed in a different PR.
Comment 6 Mark Mitchell 2004-03-08 23:03:11 UTC
This needs to be fixed, but it would be too risky to try before 3.4.0.

Postponed until 3.4.1.  
Comment 7 Andrew Pinski 2004-06-07 03:24:11 UTC
I will take a look at this next week, it might be already helped by the merge of the tree-ssa but I doubt 
that though.
Comment 8 Andrew Pinski 2004-06-07 19:08:06 UTC
Actually look into this there is a compile time regression in 3.3.3 also from 3.2.3:
tin:~/src/gnu/gcctest>time gcc -S -O1 pr13931.c
2.050u 0.040s 0:02.15 97.2%     0+0k 0+0io 942pf+0w
tin:~/src/gnu/gcctest>time ~/ia32_linux_gcc3_2/bin/gcc -S -O1 pr13931.c
0.350u 0.040s 0:00.45 86.6%     0+0k 0+0io 687pf+0w
tin:~/src/gnu/gcctest>time ~/ia32_linux_gcc3_3/bin/gcc -S -O1 pr13931.c
1.700u 0.030s 0:01.77 97.7%     0+0k 0+0io 708pf+0w
tin:~/src/gnu/gcctest>time ~/ia32_linux_gcc3_4/bin/gcc -S -O1 pr13931.c
1.790u 0.030s 0:01.86 97.8%     0+0k 0+0io 734pf+0w
Comment 9 Andrew Pinski 2004-06-07 20:44:37 UTC
For powerpc-apple-darwin, most of the time is being taking in get_cse_reg_info.  But I will note that 
powerpc-apple-darwin is now faster than even the 3.3 release or even Apple's 3.3

Hmm I noticed something, for 3.3.1 20030707 on i686-pc-linux-gnu, the speed of the compile is fast 
but on a released version of 3.3.3 it is just as slow as 3.4.0, so there was a patch between them which 
made this slower.
Comment 10 Andrew Pinski 2004-06-07 20:53:56 UTC
If someone could find the patch which make this slower I could do something about it.
There has only been one patch to combine after 3.3 which might have caused this:

2003-10-06  Eric Botcazou  <ebotcazou@libertysurf.fr>

        PR optimization/11637
        * combine.c (adjust_for_new_dest): New function to adjust the
        notes and LOG_LINKS when the dest of an insn has changed.
        (try_combine): Use it when deleting the first insn of a two-insn
        parallel or splitting a two-load parallel.
Comment 11 Eric Botcazou 2004-06-07 21:12:59 UTC
Does reverting it fix the problem?  If so, please assign the PR to me.
Comment 12 Mark Mitchell 2004-06-18 23:47:01 UTC
Postponed until GCC 3.4.2.
Comment 13 Mark Mitchell 2004-08-29 18:47:42 UTC
Postponed until GCC 3.4.3.
Comment 14 Mark Mitchell 2004-10-30 20:02:10 UTC
Postponed until GCC 3.4.4.
Comment 15 Mark Mitchell 2004-10-30 20:02:26 UTC
Postponed until GCC 3.4.4.
Comment 16 Eric Botcazou 2004-11-15 12:11:15 UTC
Investigating.
Comment 17 Eric Botcazou 2004-11-16 08:05:19 UTC
> There has only been one patch to combine after 3.3 which might have caused
> this:
> 
> 2003-10-06  Eric Botcazou  <ebotcazou@libertysurf.fr>
> 
>         PR optimization/11637
>         * combine.c (adjust_for_new_dest): New function to adjust the
>         notes and LOG_LINKS when the dest of an insn has changed.
>         (try_combine): Use it when deleting the first insn of a two-insn
>         parallel or splitting a two-load parallel.

No, this patch is not responsible for the quadratic behaviour, which is still
present as of today on AMD64: reverting it doesn't change anything on mainline.
Digging a little further would have showed that the patch only adds a call to
distribute_links, not distribute_notes, which contains this comment:

	 Note that this correctly handles the link that used to point from
	 I3 to I2.  Also note that not much searching is typically done here
	 since most links don't point very far away.  */
Comment 18 Eric Botcazou 2004-11-16 08:21:58 UTC
And 5 minutes of additional work would have been sufficient to spot the obvious
culprit:

2003-05-14  Eric Christopher  <echristo@redhat.com>

	* combine.c: Fix header comments.
	(distribute_notes): Remove usage of elim_i1, elim_i2. Propagate
	to all calls and prototype.

containing most notably:

@@ -12550,19 +12530,14 @@ reg_bitfield_target_p (x, body)
    as appropriate.  I3 and I2 are the insns resulting from the combination
    insns including FROM (I2 may be zero).
 
-   ELIM_I2 and ELIM_I1 are either zero or registers that we know will
-   not need REG_DEAD notes because they are being substituted for.  This
-   saves searching in the most common cases.
-
    Each note in the list is either ignored or placed on some insns, depending
    on the type of note.  */

Reverting it makes the quadratic behaviour vanish on mainline.

It was backported between 3.3.2 and 3.3.3 by:

2003-12-16  Zack Weinberg  <zack@codesourcery.com>

	Backport the following patches from mainline.
        [...]

    2003-05-14  Eric Christopher  <echristo@redhat.com>

	* combine.c: Fix header comments.
	(distribute_notes): Remove usage of elim_i1, elim_i2. Propagate
	to all calls and prototype.
Comment 19 Andrew Pinski 2004-11-18 20:46:03 UTC
Note on the mainline on PPC-darwin we are about twice as fast 1.5 seconds compared to 3.3 (3.2 
seconds).
Comment 20 Eric Christopher 2004-12-07 19:54:32 UTC
The patch was put in to stop erroneous REG_DEAD notes from being created where
they shouldn't IIRC. Now, we may be able to rerun cfg as Paolo suggests, but I
don't know for certain. Unless we can prove that new speedups do the right thing
all the time I don't think we can put them back in - there was even a note in
the old code that we didn't always do the right thing.
Comment 21 Steven Bosscher 2004-12-22 19:23:22 UTC
Perhaps someone can give an update on this bug? 
 
 
Comment 22 Eric Christopher 2004-12-22 19:48:48 UTC
I thought I did on the 7th. I'm not sure there's anything we can do about this.
The behavior is correct and the previous behavior was proved to be not correct.
If someone has an idea that's guaranteed to be correct in all cases I'm willing
to try to implement it though.
Comment 23 Andrew Pinski 2005-01-15 06:11:26 UTC
Is there an even way to fix this to even produce correct code?
Comment 24 Eric Christopher 2005-01-15 07:31:16 UTC
Thought I did last time.. otherwise I'm not sure what the question is.
Comment 25 Steven Bosscher 2005-01-15 12:25:26 UTC
You'd almost think Andrew is not a native speaker.

What I think he asks is: "Can the patch that caused this be done in a different
way such that the code is still correct but the compile time regression goes away."

Comment 26 Andrew Pinski 2005-01-15 16:23:20 UTC
(In reply to comment #25)
(Well considering, I asked the same question which Eric was asking, or really I was asking the same 
question).

And Eric put this into waiting for a reason and I am keeping it there.
Comment 27 Ian Lance Taylor 2005-01-28 18:49:55 UTC
We aren't waiting for anything, so move out of WAITING state.
Comment 28 Andrew Pinski 2005-07-22 21:13:07 UTC
Moving to 4.0.2 pre Mark.
Comment 29 Andrew Pinski 2005-07-25 02:59:57 UTC
I think this has been fixed on the mainline:
600       .7s
1200    1.899s
1800    3.7s
2400    5.945s

This is all with checking still enabled.
Comment 30 Ian Lance Taylor 2005-09-27 05:10:51 UTC
The code in combine is no better than it ever was.  What has changed is this patch:
    http://gcc.gnu.org/ml/gcc-patches/2005-07/msg02021.html
which causes the optimization of incrementing a memory location to be done at
expand time rather than at combine time.  Thus in 4.1 this particular test case
is no longer slow.  I will attach a new test case.

I can see some straightforward potential fixes.  Instead of doing the loop to
look for a place to put the REG_DEAD note, just set the bit in refresh_blocks to
rerun life analysis.  Or if keeping the loop seems useful (after all, the loop
is itself an attempt to control compilation time), at least limit it to looking
back just a few instructions, to avoid the quadratic behaviour.

A more complex way to fix it would be keep track of which registers are
completely set by i1 and i2.  If we find a REG_DEAD note for a register which is
set by i1 or i2, and that register no longer appears in the new i2 and i3, then
we don't care about that REG_DEAD note.
Comment 31 Ian Lance Taylor 2005-09-27 05:14:32 UTC
Created attachment 9817 [details]
New test case

New test case which continues to show the problem in gcc 4.1.  This test case
also shows the problem in gcc 4.0.
Comment 32 Ian Lance Taylor 2005-09-27 05:33:02 UTC
Eric, your patch which caused the quadratic behaviour is here:
    http://gcc.gnu.org/ml/gcc-patches/2003-05/msg01358.html

I know this is a real long-shot, but do you have any recollection of what
problem you were solving?  That is, which test case was failing before you
checked in your patch?  And how was it failing?  The failure case looks
relatively benign to me--we think that a register lives longer than it really does.

To me that old code looks pretty much right--in fact it is one of the
suggestions I made in my previous comment.  In comment #20 you say there was a
comment in the code which indicated that it didn't always do the right thing,
and in fact you removed that comment in your patch:
-   - there are extremely rare cases (see distribute_regnotes) when a
-     REG_DEAD note is lost
But in fact that comment not only was correct but still is correct.  In some
cases, a REG_DEAD note will be lost, and in those cases the code sets a bit in
refresh_blocks and reruns life analysis.

Actually, I can see that there is a bug in the original code in some unusual
cases.  It relies on reg_set_p to see whether the register is set and therefore
the REG_DEAD note can be discarded.  But really we can only casually remove the
REG_DEAD note if the register is completely set, and reg_set_p can return true
if the register is partially set.  For example, if a STRICT_LOW_PART is used
when setting the register, we may wind up losing the REG_DEAD note incorrectly.
 I think that we can only ignore the register if dead_or_set_p returns true for it.
Comment 33 Ian Lance Taylor 2005-10-08 05:04:36 UTC
I found the bug report which led to the patch which is the issue in this PR.  It starts here:
    http://gcc.gnu.org/ml/gcc-patches/2003-05/msg00796.html
and ends with this patch:
    http://gcc.gnu.org/ml/gcc-patches/2003-05/msg01358.html

I will try to recreate this problem.  It looks like we should be able to fix this specific type of problem without losing any speed.  We should just should not skip adding the REG_DEAD note if we don't directly set the register.  It actually looks like the code works that way anyhow, but with luck I will be able to recreate the problem and figure out what is really happening.
Comment 34 Ian Lance Taylor 2005-10-08 05:51:36 UTC
I am able to recreate the problem with a hppa64-hp-hpux11 cross-compiler, in the sense that combine does lose the REG_DEAD note when compiling the gcc.c-torture/compile/920501-13.c test case with -O1.  In the current compiler, the compilation succeeds.  A REG_UNUSED note is added by the recompute_reg_usage pass, the unused register is allocated, and the allocated register is later eliminated during the combine_stack_adjustments pass.  So it works but the overall effect is not ideal since we allocate an unneccessary register.

I'll see if we can avoid losing the REG_DEAD note in the combine pass.
Comment 35 Ian Lance Taylor 2005-10-10 18:19:48 UTC
Created attachment 9952 [details]
Patch

This patch fixes the problem in this PR, and also does not lose the REG_DEAD note in the hppa64-hp-hpux11 test case.  I won't try to checkin until it has been tested on hppa64-hp-hpux11.  Unfortunately, that target reportedly does not bootstrap at the moment for unrelated reasons.
Comment 36 GCC Commits 2005-10-11 23:46:05 UTC
Subject: Bug 13931

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	ian@gcc.gnu.org	2005-10-11 23:45:57

Modified files:
	gcc            : ChangeLog combine.c 

Log message:
	PR rtl-optimization/13931
	* combine.c: Revert patch of 2003-05-14, and:
	(try_combine): Only set elim_i1 and elim_i2 if the destination is
	completely killed in the appropriate insn.
	(distribute_notes): Don't skip multiple hard register test for
	elim_i1 and elim_i2.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.10141&r2=2.10142
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/combine.c.diff?cvsroot=gcc&r1=1.504&r2=1.505

Comment 37 Ian Lance Taylor 2005-10-11 23:47:18 UTC
Fixed on mainline.  This is a minor compilation time speedup, and I don't think there is much benefit to porting back to the release branches.