This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/17863] [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)
- From: "hubicka at gcc dot gnu dot org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 30 Jan 2008 17:58:59 -0000
- Subject: [Bug tree-optimization/17863] [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)
- References: <bug-17863-2109@http.gcc.gnu.org/bugzilla/>
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
------- Comment #35 from hubicka at gcc dot gnu dot org 2008-01-30 17:58 -------
Now for a more thorough analysis. The testcase is quite challenging for the
inlining heuristics: by introducing early inlining and reducing the call cost,
we now inline less than we did at the time I claimed that we inline everything.
However, making everything inlined again still does not solve the problem.
For inline decisions, the problematic bit seems to be accu1 and friends. They
are templates that use simpler templates of the same form. For n=1:
double accu1(const double*, const double*) [with int n = 0] (p1, p2)
{
double D.4655;
double D.4654;
double D.4653;
<bb 2>:
D.4654_2 = *p1_1(D);
D.4655_4 = *p2_3(D);
D.4653_5 = D.4654_2 * D.4655_4;
return D.4653_5;
}
With n>1 we simply copy the body a few times:
double accu1(const double*, const double*) [with int n = 1] (p1, p2)
{
double D.17506;
double D.17507;
double D.17505;
double D.17505;
double d;
double D.6664;
double D.6663;
double D.6662;
<bb 2>:
D.6662_2 = *p1_1(D);
D.6663_4 = *p2_3(D);
d_5 = D.6662_2 * D.6663_4;
p2_6 = p2_3(D) + 8;
p1_7 = p1_1(D) + 8;
D.17506_11 = *p1_7;
D.17507_12 = *p2_6;
D.17505_13 = D.17506_11 * D.17507_12;
D.6664_9 = d_5 + D.17505_13;
return D.6664_9;
}
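For orientation, a template of this shape could be written roughly as follows.
This is only a sketch reconstructed from the dumps in this comment, not the
actual source in ttest.cc: the exact declaration, the use of a full
specialization for the base case and the `inline` keyword are all assumptions.
Each level multiplies one pair of elements, advances both pointers by one
double (the "+ 8" in the dumps) and adds the result of the next simpler
instance.

```cpp
// Hypothetical reconstruction of an accu1-style recursive template.
// accu1<n> accumulates the products of n + 1 consecutive element pairs.
template <int n>
inline double accu1(const double* p1, const double* p2)
{
    double d = *p1 * *p2;                     // product of the current pair
    return accu1<n - 1>(p1 + 1, p2 + 1) + d;  // recurse into the simpler instance
}

// Base case: a single multiply, matching the smallest dump above.
template <>
inline double accu1<0>(const double* p1, const double* p2)
{
    return *p1 * *p2;
}
```

With every level inlined, accu1<n> collapses into a straight-line sequence of
loads, multiplies and adds, which is the form the second dump above shows.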
The early inliner handles this well until the function grows too large; that
happens at n=4, and for n=5 we end up not inlining:
double accu1(const double*, const double*) [with int n = 5] (p1, p2)
{
double d;
double D.6697;
double D.6696;
double D.6695;
double D.6694;
<bb 2>:
D.6694_2 = *p1_1(D);
D.6695_4 = *p2_3(D);
d_5 = D.6694_2 * D.6695_4;
p2_6 = p2_3(D) + 8;
p1_7 = p1_1(D) + 8;
D.6697_8 = accu1 (p1_7, p2_6);
D.6696_9 = D.6697_8 + d_5;
return D.6696_9;
}
This is as expected: for n=4 the code is definitely longer than the call
sequence, having 4 FP multiplies, 4 adds and 8 loads, and I don't think a
simple heuristic can reasonably expect it to simplify.
We inline these functions later, in late inlining, as expected, but since there
are just too many calls to them, we eventually run into the large-function and
large-unit limits.
Now, to get everything inlined one needs --param inline-call-cost=9999 --param
max-inline-insns-single=999999 (the second is needed for DCubic::DCubic, which
is just big IMO).
Now with this:
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time
/aux/hubicka/gcc-install/bin/g++ -O3 ttest.cc -fpermissive --static
-march=athlon-xp -Winline --param inline-call-cost=9999 --param
max-inline-insns-single=999999
ttest.cc: In function 'void testv4c()':
ttest.cc:21: warning: inlining failed in call to 'tcdata::tcdata()': --param
inline-unit-growth limit reached
ttest.cc:468: warning: called from here
real 1m0.934s
user 0m59.736s
sys 0m1.204s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./a.out
real 0m7.055s
user 0m7.052s
sys 0m0.000s
We are still a long way from GCC 3.4 performance (5s, see my previous post). I
suspect that aliasing simply gives up. Setting inline-call-cost to 1 (the other
extreme) leads to 6.9s.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17863