This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC/patch] Callgraph based inlining heuristics


> Jan Hubicka wrote:
> 
> >Hi,
> >I finally got C++ unit-at-a-time into shape so it don't consume all my free
> >time and got into implementing first try of my proposed inlining heuristics
> >that uses callgraph to decide on inlining top-down while inline using 
> >current
> >code bottom-up hopefully getting better from the both worlds.
> >
> >I am attaching somewhat raw implementation of the idea.  At the moment I am
> >using cgraph edges to flag on whether particular inlining will happen so
> >function can be inlinined into once caller but not inlined into another, 
> >but
> >the decisions must be transitive.
> >
> I am not sure what exactly you're saying in the last sentence here.

Assume that function A calls B that calls C.
If we decide to make A leaf, we must also inline all calls to C into B
in the other incarnations of B as there is only one edge in between A
and B.
> 
> >(that is not the case of attribute leafify proposed - doing the 
> >datastructure
> >in full generality allowing each particular walk in the callgraph  to be 
> >either
> >collapsed or not is probably too expensive so attribute leafify can be
> >implemented as a special case).
> >
> Assuming we really want such an attribute...
It would be better to do it wihout, yes :)
I am thinking about making the inline limits less strict when doing so
would make function leaf and making less strict limit for inlining leaf
functions.
I implemented the second and it seems to reduce call density.
> >After flag is set the size of function bodies is update by adding the size 
> >of
> >inlined body minus 1.  (this is the same as estimation in tree-inline.c.  
> >Not
> >very good.  We can consider choosing functions that may inline and trying 
> >to
> >analyze them harder, in the extreme case even produce rtl and profile and
> >copmpute average amount of insns executed).
> >
> If you use the same estimate for the function size as in tree-inline.c, 
> can't you just share that code with tree-inline?  I was thinking about 
> splitting up inlinable_function_p() into a function that looks for 
> inline inhibiting functions and a function to estimate the size of the 
> inlinable function body (to get rid of that ugly "nolimit" flag).  Then 
> if we come up with a better way to estimate the function body size, we 
> only have to rework the code in one place.

Actually I am dreaming about simply droipping the current heuristics in
tree-inline, making inlinable_function_p to always do "nolimit" so it
just return whether function X can be inlined and moving everything to
unit-at-a-time code.
> 
> (BTW tree-inline.c needs to be cleaned up in other ways too, e.g. move 
> walk_tree out to tree.c...)
> 
> >In the second stage I proceed topologically again inline small functions
> >updating the size of bodies and in the third stage I decide on inlining 
> >once.
> >
> Does that mean we no longer inline large functions into small ones if 
> the large function is called more than once?  That is one of the things 
> people have suggested many times to reduce the compile time cost of 
> inlining.
I do something like this.
When X is marked always_inline, inline it to all callers but avoid
cycles
when X is small *after* we inline into it, inline it to all callers that
don't grow up too much (to avoid cases where relatively small function
is inlined 1000000 times into another function).
when X is called just once, it is local, then inline it into caller but
also watch for the size of caller.
> 
> >I've added new constant specifying large functions (30000 statements) and 
> >I am
> >limiting growth of the large functions by factor of 2.  This avoids 
> >quadratic
> >bottleneck seen on insn-recog during bootstrap.
> >
> 30000 statements still seems rather large.  How did you get to this 
> number?  Are you still always inlining really small functions even if 
Just dreamed it up.  It was first shot and I didn't tried to change it.
> the large function is larger than 2 times its original size?
> 
> >I tested libstdc++ compilation with -O2 and get 10% savings of the .so 
> >library
> >size (with debug info in) and about 5% compiler speedup.  I also tried 
> >PR/8362 testcase.
> >
> PR 8361 :-)
> >
> That looks interesting.  "expand" goes down from 13 to 4 seconds, so 
> you're producing fewer insns there already, very nice!
> Do you know where the difference comes from, ie. which functions don't 
> get inlined with your code that used to be inlined before?

>From the dump, the heuristics preventing large function bodies does not
fire until I increase limit of small functions to 1000, so it is always
the case wher function body after inlining grows over the limit.
> 
> >I also tested it on some of C SPEC2000 sources and it does not seem to 
> >supress
> >any important inlining.  I hope to do some behchmarking tomorrow as I need 
> >some
> >sleep now.
> >
> >So :) any comments are welcome.  Also I would like to know about other
> >testcases where inliner behaves in broken way.
> >
> PR10160 is a good one.  You can of course try POOMA and Deal II, and 
> maybe you can get someone to try Linux kernel without the always_inline 
> hack they now need for GCC 3.3.
To make always_inline go away, we actually need to increase inlining
limit from current 300.  As I was mentioning in the other post, I hope I
will be able to do that as new heuristics does not get crazy then.
> >+ #define INSNS_PER_STMT 10
> >+ #define INSNS_PER_CALL 10
> >
> IIRC there is some code in integrate.c that takes into account the 
> number of arguments and how they are passed in estimating the size of a 
> call.  I have no idea if there's any benefit in a more accurate 
> estimate, though, as long as we estimate 10 insns for all statements no 
> matter what.
> 
> Why do you think there should be a separate INSNS_PER_CALL anyway? You 
> estimate 10 insns for calls in the function body you're expanding 
> inline, so why have a different number (potentially) for the call you're 
> inlining??
> 
> Oh, and again, it would be nice to just have this in tree-inline.c

At the moment i just use the same code as tree-inline.c to make these
two comparable.  The way of computing statements to estimate size is
just crazy.  With gimple we should be able to do better, I can try to
put together something for current trees, but definitly I want to do
> >+ #define LARGE_FUNCTION_INSNS 30000
> >+ #define LARGE_FUNCTION_GROWTH 200
> >
> Maybe add comment here that these are multiples of 100.  Actually, why 
> is LARGE_FUNCTION_GROWTH not just 2?  Every time you multiply something 
It is in percents as I think something like 120% would make more sense.
These two should go into params.def but I didn't get time to do that
yet.
> >
> Are you ever supposed to get here if flag_inline_trees is not set?  
> Languages that do not use the tree inliner can never do unit-at-a-time 
> either, and unit-at-a-time implies flag_inline_functions, doesn't it?
Hmm, probably yes :)
> >! 
> >!    We do simple backgracing to get INLINED_TIMES right.  This should not 
> >be
> >
> What is backgracing?
backtracing.
> >!    expensive as we limit amount of inlining.  Alternatively we may first
> >
> _the_ amount of ...
> 
> >!    discover set of nodes, to topological sort on these and propagate
> >!    information.  */
> >
> discover the set of nodes, topologically sort these and propagate 
> information.
> 
> What information exactly BTW?
The amount of times function WHAT will be inlined into TO.
> >!       struct cgraph_node *caller;
> >! 
> >!       /* Look at the edge on the top of the stack.  */
> >!       e = stack[sp - 1];
> >!       caller = e->caller;
> >
> Purely cosmetic: If you always use (sp - 1) to get the top of the stack, 
> you might as well start with sp=-1 and check for sp<0 to see if 
> something is left on the stack....
I just cut&pasted this from same code in cfganal.c, perhaps I can
cleanup both :)
> 
> >! 	      for (e1 = e->next_caller; e1; e1 = e1->next_caller)
> >! 		if (e1->inline_call)
> >! 		  break;
> >
> I see these loops all the time.  Maybe this deserves a macro...

These are supposed to get different once I start to do something smart,
but perhaps I can do that...
> 
> >! /* Estimate size of the function after inlining WHAT into TO.  */
> >! static int
> >! cgraph_estimate_size_after_inlining (int times,
> >! 				     struct cgraph_node *to,
> >! 				     struct cgraph_node *what)
> >! {
> >!   return to->global.insns + (what->global.insns - INSNS_PER_CALL * 
> >times);
> >! }
> >
> Should be in tree-inline

It depends in cgraph datastructures that are not available in
tree-inline all the time :(
> >! static bool
> >! cgraph_default_inline_p (struct cgraph_node *n)
> >! {
> >!   if (!DECL_INLINE (n->decl))
> >!     return false;
> >!   if (DID_INLINE_FUNC (n->decl))
> >!     return n->global.insns < MAX_INLINE_INSNS_AUTO;
> >!   else
> >!     return n->global.insns < MAX_INLINE_INSNS_SINGLE;
> >! }
> >
> Here you don't seem to take care of really small functions, ie. 
> one-liners.  Is that deliberate?  In tree-inline, we have 
> MIN_INLINE_INSNS for that, and at its current value, inlining such 
> functions is almost always beneficial (because 100 of the 130 insns for 
> MIN_INLINE_INSNS are  for statements that don't produce insns, ie. 
> SCOPE_STMT, stuff like that).

I think I don't need MIN_INLINE_INSNS as I don't decrease the inline
limits with inlining depth as tree inlining does, so when
MAX_INLINE_INSNS_AUTO/SINGLE is more than 20, I will always inline one
statement functions.

> Not sure what you're doing here.  First you set node->output for 
> functions with a DECL_SAVED_TREE, ie. deferred functions.  Isn't that 
> true for all functions in unit-at-a-time? Then you reset node->output 

No, callgraph also contains functions that are not in the current unit
(external functions).  These don't have DECL_SAVED_TREE.  I am not
interested in seeing these as I can't inline them anyway.

> for all functions, but cgraph_postorder already seems to reset it for 
> functions which have it set...  Maybe needs a comment?

I've already cleaned this up in another version of patch I sent this
morning.
> >! 	      warning_with_decl (node->decl,
> >! 				 "Always inline flag ignored in respect to 
> >`%s' to avoid cycle");
> >
> Can't you instead do:
> 
>  warning ("%Halways_inline attribute ignored for `%D' to allow recursion",
>           &DECL_SOURCE_LOCATION (node->decl), decl);

Perhaps, but it is not exactly the case - I don't ignore it in all
calees, just in the calees where I would get into infinite recursion
while inlining.
> 
> >!   /* In the first pass mark all always_inline edges.  Do this with a 
> >priority
> >!      so no our decisions makes this impossible.  */
> >
> 
> This is the same comment as before.  What's the idea here, is this the 
> second pass??  It's hard to tell from the code without any comments below...
Cut&pasting :)
> 
> Looks like you've worked really hard on this, and that it could be a 
> real improvement.  Do you have some memory statistics as well, because 
> having all function bodies hanging around as trees may be expensive.

With the 8361 the memory peak is about 20% lower than without
unit-at-a-time.  C++ in most cases keeps the bodies around anyway so
this should not affect that much, but I will try to test it more.
I don't know about the usual case as I am not quite sure how to measure
it....

Honza
> 
> Gr.
> Steven
> 
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]