[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression
rguenth at tat dot physik dot uni-tuebingen dot de
gcc-bugzilla@gcc.gnu.org
Tue Dec 7 15:09:00 GMT 2004
------- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-07 15:09 -------
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression
On 7 Dec 2004, hubicka at ucw dot cz wrote:
> > Yes, it seems so. Really nice improvement. Though profiling is
> > sloooooow. I guess you avoid doing any CFG changing transformation
> > for the profiling stage? I.e. not even inline the simplest functions?
>
> I can inline but only after actually instrumenting the functions. That
> should minimize the costs, but I also noticed that tramp3d is
> surprisingly a lot slower with profiling.
>
> > That would be the reason the Intel compiler is unusable with profiling
> > for me. -fprofile-generate comes with a 50-fold increase in runtime!
>
> -fprofile-generate is actually a package of
> -fprofile-arcs/-fprofile-values + -fprofile-values-transformations
> It might be interesting to figure out whether -fprofile-arcs itself
> brings a similar slowdown. The only reason I can think of why this can
> happen is that after instrumenting we again inline a lot less, or we
> produce too many redundant counters. Perhaps it would make sense to
> think about inlining functions that reduce code size before
> instrumenting, as we would do that anyway, but it will be tricky to get
> gcov output and -f* flags independence right then.
Hm. There are a lot of counters - maybe it is possible to merge
the counters themselves? The resulting asm of tramp3d-v3 consists
of 30% addl/adcl lines for adding the profiling counts - where
the total number of lines is just wc -l of a -S -fverbose-asm compilation.
That is a lot. And the additions are in a cache-unfriendly sequence,
too - dunno which optimization pass could improve this, though. Consider
    static inline void foo() {}
    void bar() { foo(); }
which for -O2 -fprofile-generate produces
    bar:
        addl $1, .LPBX1
        pushl %ebp
        movl %esp, %ebp
        adcl $0, .LPBX1+4
        addl $1, .LPBX1+16
        popl %ebp
        adcl $0, .LPBX1+20
        addl $1, .LPBX1+8
        adcl $0, .LPBX1+12
        ret
that should be
    bar:
        addl $1, .LPBX1
        pushl %ebp
        movl %esp, %ebp
        adcl $0, .LPBX1+4
        addl $1, .LPBX1+8
        adcl $0, .LPBX1+12
        addl $1, .LPBX1+16
        adcl $0, .LPBX1+20
        ret
And of course all three counters could be merged. But that
would need a changed gcov file format that somehow represents a
callgraph with merged edges.
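To make the counter argument concrete, here is a minimal C sketch of what the instrumentation amounts to on 32-bit x86. It only assumes what the addl/adcl pairs above suggest, namely that each counter is a 64-bit value updated as two 32-bit halves. The names counter_pair, lpbx, bump, counter_value and run_bar are made up for illustration and are not gcov internals.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit profile counter kept as two 32-bit
   halves, the way the addl/adcl pairs treat the .LPBX1 slots.  */
struct counter_pair {
    uint32_t lo;
    uint32_t hi;
};

/* Three slots, standing in for .LPBX1, .LPBX1+8 and .LPBX1+16.  */
static struct counter_pair lpbx[3];

/* Equivalent of: addl $1, lo ; adcl $0, hi  */
static void bump(struct counter_pair *c)
{
    uint32_t old = c->lo;
    c->lo = old + 1;           /* addl: wraps to 0 on overflow */
    c->hi += (c->lo < old);    /* adcl: adds the carry, if there was one */
}

static uint64_t counter_value(const struct counter_pair *c)
{
    return ((uint64_t)c->hi << 32) | c->lo;
}

/* Instrumented bar() after foo() has been inlined to nothing: only the
   three increments remain, here in ascending address order.  */
static void bar(void)
{
    for (int i = 0; i < 3; i++)
        bump(&lpbx[i]);
}

/* Call bar() n times, then check that the three slots agree.  They are
   always bumped together, so one merged counter would carry the same
   information.  Returns the common counter value.  */
static uint64_t run_bar(uint64_t n)
{
    for (uint64_t i = 0; i < n; i++)
        bar();
    assert(counter_value(&lpbx[0]) == counter_value(&lpbx[1]));
    assert(counter_value(&lpbx[1]) == counter_value(&lpbx[2]));
    return counter_value(&lpbx[0]);
}
```

Calling run_bar() with any count leaves all three slots equal, which is the redundancy meant above. Whether the real counters can be merged still depends on the gcov file format, as noted.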
The Intel compiler is so much worse here because all the
counter adding is done in a thread-safe way in a library (i.e. they
have an extra call for every edge and do not do any inlining).
> How does our profiling performance compare to ICC?
ICC is a lot worse. ICC with -prof_gen causes a 10000-fold slowdown
(if the current snapshot of icc doesn't segfault compiling the tramp3d
testcase) - ICC is completely unusable for me. So - GCC is great!
> > > It would be nice to experiment with this a little - in general the
> > > heuristics can be viewed as having three players. There are the limits
> > > (specified via --param) that it must obey, there is the cost model
> > > (estimated growth for inlining into all callers without profiling and
> > > the execute_count to estimated growth for inlining to one call with
> > > profiling) and the bin packing algorithm optimizing the gains while
> > > obeying the limits.
> > >
> > > With profiling in, the cost model is pretty much realistic and it would
> > > be nice to figure out how the performance behaves when the individual
> > > limits are changed, and why. If you have some time for experimentation,
> > > it would be very useful. I am trying to do the same with SPEC and GCC
> > > but I have difficulty playing with pooma or Gerald's application as I
> > > have little understanding of what is going on there. I will try it myself
> > > next but any feedback can be very useful here.
> >
> > I can produce some numbers for the tramp testcase.
> Thanks! Note that when changing the flags you should not need to
> re-profile now so you can save quite a lot of time.
Ah, that's indeed nice.
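For anyone experimenting with the limits, the three players from the quoted description (limits, cost model, packing) can be pictured with a toy selector. This is only a sketch under assumptions: the struct fields, the benefit measure (execute_count divided by growth), and the single max_growth limit are invented for illustration and are not GCC's data structures or --param names. It is also plain greedy selection, not an optimal packing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical inline candidate: one call site.  */
struct candidate {
    const char *name;
    long long execute_count;  /* profile count of the call */
    long long growth;         /* estimated size growth if inlined; assumed > 0 */
    int inlined;              /* set by select_inlines() */
};

/* Cost model: order by execute_count / growth, highest ratio first.
   Cross-multiplied to avoid division; assumes the products do not
   overflow.  */
static int by_benefit(const void *pa, const void *pb)
{
    const struct candidate *a = pa;
    const struct candidate *b = pb;
    long long lhs = a->execute_count * b->growth;
    long long rhs = b->execute_count * a->growth;
    return (lhs < rhs) - (lhs > rhs);
}

/* Packing under a limit: walk the candidates from best to worst ratio
   and take each one that still fits into max_growth.  Returns the
   growth actually spent.  */
static long long select_inlines(struct candidate *c, size_t n,
                                long long max_growth)
{
    long long spent = 0;

    assert(max_growth >= 0);
    qsort(c, n, sizeof *c, by_benefit);
    for (size_t i = 0; i < n; i++) {
        c[i].inlined = 0;
        if (spent + c[i].growth <= max_growth) {
            c[i].inlined = 1;
            spent += c[i].growth;
        }
    }
    return spent;
}
```

Changing max_growth while keeping the same candidate array mimics changing a limit without re-profiling: the counts stay fixed and only the selection changes.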
Richard.
--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704