This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [RFC/patch] Callgraph based inlining heuristics

From: Steven Bosscher <s dot bosscher at student dot tudelft dot nl>
To: Jan Hubicka <jh at suse dot cz>
Cc: gcc at gcc dot gnu dot org, mark at codesourcery dot com, rth at redhat dot com,aj at suse dot de
Date: Mon, 23 Jun 2003 15:13:22 +0200
Subject: Re: [RFC/patch] Callgraph based inlining heuristics
References: <20030622232032.GC3266@kam.mff.cuni.cz>

Jan Hubicka wrote:

Hi,
I finally got C++ unit-at-a-time into shape so it don't consume all my free
time and got into implementing first try of my proposed inlining heuristics
that uses callgraph to decide on inlining top-down while inline using current
code bottom-up hopefully getting better from the both worlds.

I am attaching somewhat raw implementation of the idea.  At the moment I am
using cgraph edges to flag on whether particular inlining will happen so
function can be inlinined into once caller but not inlined into another, but
the decisions must be transitive.

I am not sure what exactly you're saying in the last sentence here.

(that is not the case of attribute leafify proposed - doing the datastructure
in full generality allowing each particular walk in the callgraph  to be either
collapsed or not is probably too expensive so attribute leafify can be
implemented as a special case).

Assuming we really want such an attribute...

I sort the functions topologically and then do 3 stages - in the first stage I
set inline to all callers of always_inline functions watching for the cycles.
After flag is set the size of function bodies is update by adding the size of
inlined body minus 1.  (this is the same as estimation in tree-inline.c.  Not
very good.  We can consider choosing functions that may inline and trying to
analyze them harder, in the extreme case even produce rtl and profile and
copmpute average amount of insns executed).

If you use the same estimate for the function size as in tree-inline.c, can't you just share that code with tree-inline? I was thinking about splitting up inlinable_function_p() into a function that looks for inline inhibiting functions and a function to estimate the size of the inlinable function body (to get rid of that ugly "nolimit" flag). Then if we come up with a better way to estimate the function body size, we only have to rework the code in one place.

(BTW tree-inline.c needs to be cleaned up in other ways too, e.g. move walk_tree out to tree.c...)

In the second stage I proceed topologically again inline small functions
updating the size of bodies and in the third stage I decide on inlining once.

Does that mean we no longer inline large functions into small ones if the large function is called more than once? That is one of the things people have suggested many times to reduce the compile time cost of inlining.

I've added new constant specifying large functions (30000 statements) and I am
limiting growth of the large functions by factor of 2.  This avoids quadratic
bottleneck seen on insn-recog during bootstrap.

30000 statements still seems rather large. How did you get to this number? Are you still always inlining really small functions even if the large function is larger than 2 times its original size?

I tested libstdc++ compilation with -O2 and get 10% savings of the .so library
size (with debug info in) and about 5% compiler speedup.  I also tried PR/8362 testcase.

PR 8361 :-)

With -O2 -fno-unit-at-a-time I get:

---- 8< ----
TOTAL                 : 134.67             2.86           137.79

---- 8< ----
With -O2 -funit-at-a-time I get:

---- 8< ----
TOTAL                 :  92.61             3.19            96.50

And I get 24% fewer instructions produced.

That looks interesting. "expand" goes down from 13 to 4 seconds, so you're producing fewer insns there already, very nice! Do you know where the difference comes from, ie. which functions don't get inlined with your code that used to be inlined before?

I also tested it on some of C SPEC2000 sources and it does not seem to supress
any important inlining.  I hope to do some behchmarking tomorrow as I need some
sleep now.

So :) any comments are welcome.  Also I would like to know about other
testcases where inliner behaves in broken way.

PR10160 is a good one. You can of course try POOMA and Deal II, and maybe you can get someone to try Linux kernel without the always_inline hack they now need for GCC 3.3.

*** cgraph.h.in Sun Jun 22 21:56:33 2003 --- cgraph.h Mon Jun 23 01:36:16 2003 *************** struct cgraph_local_info ---- 8< ---- --- 30,42 ---- /* Set when function function is visiable in current compilation unit only and it's address is never taken. */ bool local; ! ! /* False when there is something making inlining impossible (such as va_arg) */

Comments have a period followed by two spaces.

!   bool inlinable;
!   /* True when function should be inlined independently on it's size.  */
!   bool disgread_inline_limits;

disregard_inline_limits

*************** struct cgraph_global_info *** 44,49 **** --- 46,54 ---- { /* Set when the function will be inlined exactly once. */ bool inline_once; + + /* Estimated size of the function after inlining. */ + int insns; };

Why not call this estimated_size or something else more informative?

*** cgraphunit.c.in Sun Jun 22 21:56:26 2003 --- cgraphunit.c Mon Jun 23 01:38:47 2003 ---- 8< ---- --- 35,53 ---- #include "cgraph.h" #include "varpool.h" #include "diagnostic.h" + #include "c-common.h" + #include "params.h" + + #define INSNS_PER_STMT 10 + #define INSNS_PER_CALL 10

IIRC there is some code in integrate.c that takes into account the number of arguments and how they are passed in estimating the size of a call. I have no idea if there's any benefit in a more accurate estimate, though, as long as we estimate 10 insns for all statements no matter what.

Why do you think there should be a separate INSNS_PER_CALL anyway? You estimate 10 insns for calls in the function body you're expanding inline, so why have a different number (potentially) for the call you're inlining??

Oh, and again, it would be nice to just have this in tree-inline.c

+ #define LARGE_FUNCTION_INSNS 30000
+ #define LARGE_FUNCTION_GROWTH 200

Maybe add comment here that these are multiples of 100. Actually, why is LARGE_FUNCTION_GROWTH not just 2? Every time you multiply something by it, you immediately devide by 100 again.

*************** cgraph_optimize_function (node) *** 255,260 **** --- 262,269 ---- { tree decl = node->decl; + /* optimize_inline_calls avoid inlining of current_function_decl. */ + current_function_decl = 0;

avoids?

   if (flag_inline_trees)
     optimize_inline_calls (decl);

Are you ever supposed to get here if flag_inline_trees is not set? Languages that do not use the tree inliner can never do unit-at-a-time either, and unit-at-a-time implies flag_inline_functions, doesn't it?

! /* Expand all functions that must be output. ! ! Attempt to topologically sort the nodes so function is output when ! all called functions are already assembled to allow data to be ! propagated accross the callgraph. Use a stack to get smaller distance ! between a function and it's callees (later we may choose to use a more ! sophisticated algorithm for function reordering; we will likely want ! to use subsections to make the output functions appear in top-down ! order. */

Right parenthesis missing.

--- 355,818 ---- } } } ! free (stack); ! return order_pos; ! } ! ! #define INLINED_TIMES(node) ((size_t)(node)->aux) ! #define SET_INLINED_TIMES(node,times) ((node)->aux = (void *)(times)) ! ! /* Return list of nodes we decided to inline NODE into, set their output ! flag and compute INLINED_TIMES. ! ! We do simple backgracing to get INLINED_TIMES right. This should not be

What is backgracing?

! expensive as we limit amount of inlining. Alternatively we may first

_the_ amount of ...

!    discover set of nodes, to topological sort on these and propagate
!    information.  */

discover the set of nodes, topologically sort these and propagate information.

What information exactly BTW?

! /* Allocate stack for back-tracking up callgraph. */ ! stack = xmalloc ((cgraph_n_nodes + 1) * sizeof (struct cgraph_edge)); ! sp = 0; ! ! /* Push the first edge on to the stack. */ ! stack[sp++] = e; ! ! while (sp) { ! struct cgraph_node *caller; ! ! /* Look at the edge on the top of the stack. */ ! e = stack[sp - 1]; ! caller = e->caller;

Purely cosmetic: If you always use (sp - 1) to get the top of the stack, you might as well start with sp=-1 and check for sp<0 to see if something is left on the stack....

! 	      for (e1 = e->next_caller; e1; e1 = e1->next_caller)
! 		if (e1->inline_call)
! 		  break;

I see these loops all the time. Maybe this deserves a macro...

! /* Estimate size of the function after inlining WHAT into TO.  */
! static int
! cgraph_estimate_size_after_inlining (int times,
! 				     struct cgraph_node *to,
! 				     struct cgraph_node *what)
! {
!   return to->global.insns + (what->global.insns - INSNS_PER_CALL * times);
! }

Should be in tree-inline

! /* Return false when inlining WHAT into TO is not good idea as it would cause ! too large growth of function bodies. */ ! static bool ! cgraph_check_inline_limits (struct cgraph_node *to, ! struct cgraph_node *what, ! struct cgraph_node **inlined, int ninlined) ! { ! int i; ! int times = 0; ! struct cgraph_edge *e; ! int newsize; ! ! for (e = to->callees; e; e = e->next_callee) ! if (e->callee == what) ! times++; ! ! to->global.insns = cgraph_estimate_size_after_inlining (times, to, what); ! newsize = cgraph_estimate_size_after_inlining (times, to, what); ! if (newsize > LARGE_FUNCTION_INSNS ! && newsize > to->local.self_insns * LARGE_FUNCTION_GROWTH / 100) ! return false; ! for (i = 0; i < ninlined; i++) ! { ! newsize = ! cgraph_estimate_size_after_inlining (INLINED_TIMES (inlined[i]) * ! times, inlined[i], what); ! if (newsize > LARGE_FUNCTION_INSNS ! && newsize > ! inlined[i]->local.self_insns * LARGE_FUNCTION_GROWTH / 100)

Weird line. The devisions of LARGE_FUNCTION_GROWTH / 100 seem redundant, see earlier remark.

! return false; ! } ! return true; ! } ! ! /* Return true when function N is small enought to be inlined. */

enough

! static bool
! cgraph_default_inline_p (struct cgraph_node *n)
! {
!   if (!DECL_INLINE (n->decl))
!     return false;
!   if (DID_INLINE_FUNC (n->decl))
!     return n->global.insns < MAX_INLINE_INSNS_AUTO;
!   else
!     return n->global.insns < MAX_INLINE_INSNS_SINGLE;
! }

Here you don't seem to take care of really small functions, ie. one-liners. Is that deliberate? In tree-inline, we have MIN_INLINE_INSNS for that, and at its current value, inlining such functions is almost always beneficial (because 100 of the 130 insns for MIN_INLINE_INSNS are for statements that don't produce insns, ie. SCOPE_STMT, stuff like that).

! static void ! cgraph_decide_inlining (void) { struct cgraph_node *node; + int nnodes; + struct cgraph_node **order = + xcalloc (sizeof (struct cgraph_node *), cgraph_n_nodes); + struct cgraph_node **inlined = + xcalloc (sizeof (struct cgraph_node *), cgraph_n_nodes); + int ninlined; + int i, y; ! /* Cgraph_postorder ignore nodes with no node->output set. */ ! for (node = cgraph_nodes; node; node = node->next) ! { ! node->global.insns = node->local.self_insns; ! if (DECL_SAVED_TREE (node->decl)) ! node->output = 1; ! } ! ! nnodes = cgraph_postorder (order); for (node = cgraph_nodes; node; node = node->next) + node->output = 0;

Not sure what you're doing here. First you set node->output for functions with a DECL_SAVED_TREE, ie. deferred functions. Isn't that true for all functions in unit-at-a-time? Then you reset node->output for all functions, but cgraph_postorder already seems to reset it for functions which have it set... Maybe needs a comment?

+ /* In the first pass mark all always_inline edges. Do this with a priority + so no our decisions makes this impossible. */ + if (!quiet_flag) + fprintf (stderr, "\n\nDeciding on always_inline functions:"); + for (i = nnodes - 1; i >= 0; i--) { ! struct cgraph_edge *e; ! ! node = order[i]; ! ! for (e = node->callees; e; e = e->next_callee) ! if (e->callee->local.disgread_inline_limits) ! break; ! if (!e) ! continue; ! ninlined = cgraph_inlined_into (order[i], inlined); ! for (; e; e = e->next_callee) ! { ! if (e->inline_call || !e->callee->local.disgread_inline_limits) ! continue; ! if (e->callee->output || e->callee == node) ! { ! current_function_decl = e->callee->decl; ! warning_with_decl (node->decl, ! "Always inline flag ignored in respect to `%s' to avoid cycle");

Can't you instead do:

 warning ("%Halways_inline attribute ignored for `%D' to allow recursion",
          &DECL_SOURCE_LOCATION (node->decl), decl);

!   /* In the first pass mark all always_inline edges.  Do this with a priority
!      so no our decisions makes this impossible.  */

This is the same comment as before. What's the idea here, is this the second pass?? It's hard to tell from the code without any comments below...

! if (!quiet_flag) ! fprintf (stderr, "\n\nDeciding on inlining:"); ! for (i = nnodes - 1; i >= 0; i--) ! { ! struct cgraph_edge *e;

Looks like you've worked really hard on this, and that it could be a real improvement. Do you have some memory statistics as well, because having all function bodies hanging around as trees may be expensive.

Gr.
Steven

Follow-Ups:
- Re: [RFC/patch] Callgraph based inlining heuristics
  - From: Jan Hubicka

References:
- [RFC/patch] Callgraph based inlining heuristics
  - From: Jan Hubicka

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]