
Re: 3.0 -finline-limit and c++


> Date: Fri, 25 May 2001 13:23:11 +0200
> From: Teemu Torma <tot@trema.com>

> At least in my mind, this is not a very desirable effect,

Only on some code.  On other code, the effect is highly desirable.
Unfortunately, being a good general-purpose compiler means that you
have to work reasonably well on all code, even if that costs a little
by not being quite so good on certain code.

This is the case here.

When a better solution is found to the original problem that motivated
this code (see the archives), we can replace the current solution with
it.

For a recap of the original problem, let's say that a user has
available hundreds of thousands of really small building blocks, and
uses them to build things.  These building blocks use the same
hundreds of thousands of building blocks to implement their own
functionality; at the core are really simple operations like
int += int, and that's it.  All the building blocks are exposed to the
compiler as inline functions.  The idea is that you want to inline the
smallest core blocks to form a larger, concrete intermediate block,
and have the code run using that.  This way, scheduling, register
allocation and so on all have a chance to squeeze the most out of the
CPU.  The problem is that past a certain point, you kill the
instruction cache, the memory bandwidth spent fetching instructions
and so on, beyond the point at which the efficiencies gained by
eliminating call/return overhead, instruction scheduling, register
allocation and so on can recover.
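
To make this concrete, here is a minimal C++ sketch of the kind of
code in question (the names are made up for illustration): tiny inline
building blocks implemented in terms of other tiny inline building
blocks, with a plain int += int at the core.

    #include <cstdio>

    // The innermost building block: just int += int.
    inline void accumulate(int &acc, int x) { acc += x; }

    // A block built from the core block.
    inline void add_pair(int &acc, int a, int b)
    {
        accumulate(acc, a);
        accumulate(acc, b);
    }

    // A block built from that one, and so on for many more layers.
    inline int sum4(int a, int b, int c, int d)
    {
        int acc = 0;
        add_pair(acc, a, b);
        add_pair(acc, c, d);
        return acc;
    }

    int main()
    {
        // Fully inlined, sum4 collapses to a handful of adds that the
        // scheduler and register allocator can then work on freely.
        std::printf("%d\n", sum4(1, 2, 3, 4));
        return 0;
    }

With hundreds of thousands of blocks and many more layers, inlining
everything blows the generated code up far past what the instruction
cache and instruction fetch bandwidth can absorb.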

The question is how best to select which routines to inline, and
where, so as to get performance as close to optimal as we can.

The tentative answer we have today is to artificially pick an n and
use the heuristic that we keep inlining as long as the body we are
generating into is less than n instructions long.  I know this seems
backwards and wrong from the normal C perspective, but it isn't; it is
merely orthogonal and complementary.  If you run the algorithm in your
head on the testcase above, you will notice that it works rather
nicely, even near-optimally in some cases, and covers the complete
solution space rather well.
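
A minimal sketch of that heuristic in C++ (hypothetical types and
names, not the actual GCC implementation):

    #include <cstdio>
    #include <vector>

    // A function body that calls are being inlined into.
    struct Function {
        int size;                       // current body size, in instructions
        std::vector<Function *> calls;  // callees that are inline candidates
    };

    // Splice callees into fn as long as the body being generated
    // stays under n instructions; calls that would push it over the
    // limit are left as ordinary call instructions.
    void inline_into(Function &fn, int n)
    {
        for (Function *callee : fn.calls) {
            if (fn.size + callee->size >= n)
                continue;
            fn.size += callee->size;    // stands in for copying the body
        }
    }

    int main()
    {
        Function leaf1 = { 4, {} };
        Function leaf2 = { 4, {} };
        Function caller = { 10, { &leaf1, &leaf2 } };
        inline_into(caller, 16);
        // leaf1 fits (10 + 4 < 16); leaf2 would not (14 + 4 >= 16).
        std::printf("%d\n", caller.size);   // prints 14
        return 0;
    }

This n is roughly what the -finline-limit option exposes on the
command line, e.g. g++ -O2 -finline-limit=600 file.cc.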

Hope this helps explain the issue.  If someone wants to extend the
documentation (or source code) with the reasoning behind this...
Better yet, if someone wants to think about the problem and submit a
solution that is monotonically better than what we have and is at
least as general...

