This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug tree-optimization/17863] [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)

From: "hubicka at gcc dot gnu dot org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: 30 Jan 2008 17:58:59 -0000
Subject: [Bug tree-optimization/17863] [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)
References: <bug-17863-2109@http.gcc.gnu.org/bugzilla/>
Reply-to: gcc-bugzilla at gcc dot gnu dot org


------- Comment #35 from hubicka at gcc dot gnu dot org  2008-01-30 17:58 -------
So, for a more proper analysis: the testcase is quite challenging for the
inlining heuristics, and by introducing early inlining and reducing the call
cost we now inline less than we did at the time I claimed that we inline
everything.  However, making everything inlined again still does not solve the
problem.

For inline decisions, the problematic bit seems to be accu1 and friends.  They
are templates built from simpler templates of the same form.  For n=1:
double accu1(const double*, const double*) [with int n = 0] (p1, p2)
{
  double D.4655;
  double D.4654;
  double D.4653;

<bb 2>:
  D.4654_2 = *p1_1(D);
  D.4655_4 = *p2_3(D);
  D.4653_5 = D.4654_2 * D.4655_4;
  return D.4653_5;

}

With n>1 we simply copy the body a few times:
double accu1(const double*, const double*) [with int n = 1] (p1, p2)
{
  double D.17506;
  double D.17507;
  double D.17505;
  double D.17505;
  double d;
  double D.6664;
  double D.6663;
  double D.6662;

<bb 2>:
  D.6662_2 = *p1_1(D);
  D.6663_4 = *p2_3(D);
  d_5 = D.6662_2 * D.6663_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.17506_11 = *p1_7;
  D.17507_12 = *p2_6;
  D.17505_13 = D.17506_11 * D.17507_12;
  D.6664_9 = d_5 + D.17505_13;
  return D.6664_9;

}
The early inliner handles this well until the function grows too large; that
happens at n=4, and for n=5 we end up not inlining:
double accu1(const double*, const double*) [with int n = 5] (p1, p2)
{
  double d;
  double D.6697;
  double D.6696;
  double D.6695;
  double D.6694;

<bb 2>:
  D.6694_2 = *p1_1(D);
  D.6695_4 = *p2_3(D);
  d_5 = D.6694_2 * D.6695_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.6697_8 = accu1 (p1_7, p2_6);
  D.6696_9 = D.6697_8 + d_5;
  return D.6696_9;

}
This is as expected: for n=4 the code is definitely longer than the call
sequence, having 4 FP multiplies, 4 adds and 8 loads, and I don't think a
simple heuristic can reasonably expect it to simplify.
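
For readers without the testcase at hand, the dumps above are consistent with
a recursive template of roughly the following shape.  This is only a sketch
reconstructed from the GIMPLE; the real source in ttest.cc may well differ, and
the helper dot4 is a hypothetical name added purely for illustration:

```cpp
// Assumed reconstruction of accu1 from the dumps; not the actual ttest.cc code.
// Each level multiplies one pair of elements and recurses on the next pair.
template <int n>
inline double accu1(const double* p1, const double* p2)
{
    double d = *p1 * *p2;
    // One call per level: this is the call the early inliner must decide on.
    return accu1<n - 1>(p1 + 1, p2 + 1) + d;
}

// Base case, matching the "[with int n = 0]" dump: a single multiply.
template <>
inline double accu1<0>(const double* p1, const double* p2)
{
    return *p1 * *p2;
}

// Hypothetical helper: with this shape, n = 3 covers four element pairs.
inline double dot4(const double* a, const double* b)
{
    return accu1<3>(a, b);
}
```

With this shape every instantiation contains exactly one call to the next
smaller one, so each inlined level adds a multiply, an add and two loads to
the caller's body.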

We inline these functions later, in late inlining, as expected, but since there
are just too many calls to them we eventually hit the large-function and
large-unit growth limits.

Now, to get everything inlined one needs --param inline-call-cost=9999 --param
max-inline-insns-single=999999 (the second is needed for DCubic::DCubic, which
is just big IMO).

Now with this:
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time
/aux/hubicka/gcc-install/bin/g++  -O3 ttest.cc  -fpermissive --static
-march=athlon-xp  -Winline --param inline-call-cost=9999 --param
max-inline-insns-single=999999
ttest.cc: In function 'void testv4c()':
ttest.cc:21: warning: inlining failed in call to 'tcdata::tcdata()': --param
inline-unit-growth limit reached
ttest.cc:468: warning: called from here

real    1m0.934s
user    0m59.736s
sys     0m1.204s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./a.out

real    0m7.055s
user    0m7.052s
sys     0m0.000s

We still have a long way to go to reach GCC 3.4 performance (5s, see my
previous post).  I suspect that aliasing simply gives up.  Setting
inline-call-cost to 1 (the other extreme) leads to 6.9s.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17863
