[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

Tue Dec 7 15:09:00 GMT 2004

------- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de  2004-12-07 15:09 -------
Subject: Re:  [4.0 Regression] Inlining limits
 cause 340% performance regression

On 7 Dec 2004, hubicka at ucw dot cz wrote:

> > Yes, it seems so.  Really nice improvement.  Though profiling is
> > sloooooow.  I guess you avoid doing any CFG changing transformation
> > for the profiling stage?  I.e. not even inline the simplest functions?
>
> I can inline but only after actually instrumenting the functions.  That
> should minimize the costs, but I also noticed that tramp3d is
> surprisingly a lot slower with profiling.
>
> > That would be the reason the Intel compiler is unusable with profiling
> > for me.  -fprofile-generate comes with a 50-fold increase in runtime!
>
> -fprofile-generate is actually a package of
> -fprofile-arcs/-fprofile-values + -fprofile-values-transformations
> It might be interesting to figure out whether -fprofile-arcs itself
> brings a similar slowdown.  The only reason I can think of why this can
> happen is that after instrumenting we again inline a lot less, or we
> produce too many redundant counters.  Perhaps it would make sense to
> think about inlining functions that reduce code size before
> instrumenting, as we would do that anyway, but it will be tricky to get
> gcov output and -f* flags independence right then.

Hm.  There are a lot of counters - maybe it is possible to merge
the counters themselves?  About 30% of the lines in the resulting asm
of tramp3d-v3 are addl/adcl instructions that update the profiling
counters (the total line count is just wc -l on the output of a
-S -fverbose-asm compilation).  That is quite a lot.  The additions are
also emitted in a cache-unfriendly order - I don't know which
optimization pass could improve this, though.  Consider

static inline void foo() {}
void bar() { foo(); }

which for -O2 -fprofile-generate produces

bar:
        addl    $1, .LPBX1
        pushl   %ebp
        movl    %esp, %ebp
        adcl    $0, .LPBX1+4
        addl    $1, .LPBX1+16
        popl    %ebp
        adcl    $0, .LPBX1+20
        addl    $1, .LPBX1+8
        adcl    $0, .LPBX1+12
        ret

that should be (counter updates in address order)

bar:
        addl    $1, .LPBX1
        pushl   %ebp
        movl    %esp, %ebp
        adcl    $0, .LPBX1+4
        addl    $1, .LPBX1+8
        adcl    $0, .LPBX1+12
        addl    $1, .LPBX1+16
        adcl    $0, .LPBX1+20
        ret

And of course all three counters could be merged.  But that
would need a changed gcov file format that somehow represents a
callgraph with merged edges.
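
To make the merging idea concrete, here is a minimal C sketch.  It is
not what GCC emits and says nothing about how gcov stores its data; the
struct and function names are made up for illustration, and I assume
only that each addl/adcl pair above is one 64-bit counter increment.
Because the three counters in bar() sit on the same straight-line path,
they always hold the same value, so one increment per call is enough to
reconstruct all three afterwards:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the .LPBX1 array: three 64-bit counters. */
struct separate_counters { int64_t c[3]; };

/* Hypothetical merged representation: one counter for the whole path. */
struct merged_counter { int64_t c; };

/* What the instrumented bar() does today: three increments per call. */
static void bar_separate(struct separate_counters *p)
{
    p->c[0]++;
    p->c[1]++;
    p->c[2]++;
}

/* Merged variant: a single increment per call. */
static void bar_merged(struct merged_counter *p)
{
    p->c++;
}

/* Rebuild the three per-edge counts from the merged one.  This is only
   valid because all three edges lie on one straight-line path. */
static void expand_merged(const struct merged_counter *m,
                          struct separate_counters *out)
{
    assert(m != NULL && out != NULL);
    for (int i = 0; i < 3; i++)
        out->c[i] = m->c;
}

/* Call both variants n times; return 1 if the rebuilt counts match. */
static int merged_matches_separate(int n)
{
    struct separate_counters sep = { { 0, 0, 0 } };
    struct separate_counters rebuilt = { { 0, 0, 0 } };
    struct merged_counter mer = { 0 };

    for (int i = 0; i < n; i++) {
        bar_separate(&sep);
        bar_merged(&mer);
    }
    expand_merged(&mer, &rebuilt);

    for (int i = 0; i < 3; i++)
        if (sep.c[i] != rebuilt.c[i])
            return 0;
    return 1;
}
```

The expand_merged() step is the part that would have to live in the
profile reader, which is why the file format would need to know which
edges were merged.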

The Intel compiler is so much worse here because all the
counter updates are done thread-safely in a library (i.e. they
have an extra call for every edge and do not do any inlining).
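
A rough C sketch of the difference, for illustration only.  I do not
know how the ICC runtime is actually implemented; the function
profile_lib_count_edge() below is a hypothetical stand-in for "a
thread-safe library call per edge", written with C11 atomics, and
count_edge_inline() stands in for the inline addl/adcl update GCC emits:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Inline, non-atomic update: roughly what an addl/adcl pair amounts to. */
static inline void count_edge_inline(uint64_t *counter)
{
    (*counter)++;
}

/* Hypothetical out-of-line, thread-safe update.  In a real runtime this
   would live in a separate library, so every edge pays for a call plus
   an atomic read-modify-write. */
void profile_lib_count_edge(atomic_uint_fast64_t *counter)
{
    assert(counter != NULL);
    atomic_fetch_add(counter, 1);
}

/* Drive n edge executions through both schemes from a single thread;
   return 1 if the resulting counts agree. */
int schemes_agree(unsigned n)
{
    uint64_t plain = 0;
    atomic_uint_fast64_t shared;

    atomic_init(&shared, 0);
    for (unsigned i = 0; i < n; i++) {
        count_edge_inline(&plain);
        profile_lib_count_edge(&shared);
    }
    return plain == atomic_load(&shared);
}
```

Both schemes count the same thing in the single-threaded case; the
library variant just costs far more per edge, and the inline variant
gives up exactness when several threads hit the same counter.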

> How does our profiling performance compare to ICC?

ICC is a lot worse.  ICC with -prof_gen causes a 10000-fold slowdown
(if the current snapshot of icc doesn't segfault compiling the tramp3d
testcase) - ICC is completely unusable for me.  So - GCC is great!

> > > It would be nice to experiment with this a little - in general the
> > > heuristics can be viewed as having three players.  There are the limits
> > > (specified via --param) that it must obey, there is the cost model
> > > (estimated growth for inlining into all callees without profiling and
> > > the execute_count to estimated growth for inlining to one call with
> > > profiling) and the bin packing algorithm optimizing the gains while
> > > obeying the limits.
> > >
> > > With profiling, the cost model is pretty much realistic and it would
> > > be nice to figure out how the performance behaves when the individual
> > > limits are changed and why.  If you have some time for experimentation,
> > > it would be very useful.  I am trying to do the same with SPEC and GCC
> > > but I have difficulty playing with pooma or Gerald's application as I
> > > have little understanding of what is going on there.  I will try it
> > > myself next but any feedback can be very useful here.
> >
> > I can produce some numbers for the tramp testcase.
> Thanks!  Note that when changing the flags you should not need to
> re-profile now so you can save quite a lot of time.

Ah, that's indeed nice.

Richard.

--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704