This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



API for callgraph and IPA passes for whole program optimization


Hi,
since it seems that we are getting ready with LTO, I think it is time to write
an updated design document for the callgraph and for the implementation of the
whole program optimizer as outlined in the WHOPR document
http://gcc.gnu.org/wiki/whopr
As usual, all comments or ideas are welcome ;)

Basic organization
==================

As described in the WHOPR document, to have a link-time optimizer that scales
to whole programs of reasonable size, we need to:

  A) Push as much as possible into the compilation stage, including preparing
  the "summary" information for interprocedural optimization, and write that
  to disk.
  B) At the link-time stage, make all interprocedural optimization decisions
  based solely on the callgraph, the variable pool and the summaries present
  in object files, without actually needing to examine all function bodies.
  We have to expect that all function bodies won't fit in memory or would
  take too long to load.
  C) The final compilation should apply the decisions of the interprocedural
  optimization and do local optimization; it should be possible to distribute
  it.

So every interprocedural optimization pass should be split into 3 phases:
  1) local analysis, producing a simple and small function summary based on
  the function body analyzed alone.
  2) global execution, which, based on the function summaries, callgraph and
  variable pool, makes the optimization decisions, records them in the
  callgraph and varpool, and possibly updates the local information of later
  IPA passes.  The optimization summaries produced should again be small,
  simple and possible to save to disk.
  3) local modification, which takes the function body or variable
  initializer and applies the changes based on the optimization summary
  produced in 2).

This is just a rough outline: in the real implementation we will probably
want to do some limited interprocedural optimization at the compilation stage
(especially early inlining, which enables a lot more optimization in C++),
and we will probably also want to do some late interprocedural optimization
on fragments of the whole program, such as IPA register allocation, that
can't use a summary produced so early in the optimization pipeline.

Current implementation status and short term plans
==================================================

The IPA organization into separate analyze/execute/modify passes was part of
the original cgraph design.  The design was outlined in the 2004 GCC summit
paper http://www.gccsummit.org/2004/2004-GCC-Summit-Proceedings.pdf

The inliner was organized this way; later passes weren't, and enforcing this
organization was difficult given that link-time optimization wasn't in sight
(http://gcc.gnu.org/ml/gcc/2004-11/msg00694.html), so the code was partly
removed from mainline.

Since the inliner is one of the more interesting passes from this point of
view, reorganizing the rest of the optimizer shouldn't be that difficult at
all.

I planned to open an IPA branch and do the transition back to this
organization; however, it now seems better to do it partly in stage1 directly
on mainline (most of the changes were prototyped on the IPA branch and worked
well) and partly on the LTO branch, which now offers a way to actually test
the whole framework.  This way the amount of changes on the LTO branch
remains localized, and we won't run into a conflict where the IPA branch
would need to be based on the LTO branch to progress, and vice versa.

The plan would be to update the pass manager first and transition the main
IPA passes (inliner, constant propagation, alias analysis) on mainline.  The
more advanced IPA stuff, such as struct-reorg, can go later, since it is not
a showstopper for the first incarnation of the whole program optimizer
anyway.  On the LTO branch we can do the changes related to memory management
and overall pass queue organization.

Basic datastructure organization
================================

The callgraph and variable pool nodes are now split into the generic parts
(such as callgraph edges) and the per-optimization summaries.  Those should
all be split into local and global parts:

  - a local part containing the function summary computed by stage A,
  - a global part containing the result of the computation.

Temporary storage needed by the pass itself should be allocated in an
on-the-side array indexed by cgraph UIDs, or in the aux pointers.
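An on-the-side array indexed by cgraph UIDs might look like the following
sketch; the structure names are invented for illustration and this is not
the actual cgraph interface:

```c
/* Hypothetical per-pass temporary storage kept on the side, indexed by
   cgraph node UIDs rather than embedded in the nodes themselves.  */
#include <assert.h>
#include <stdlib.h>

struct cgraph_node_sketch { int uid; };  /* stand-in for cgraph_node */

struct pass_temp { int visited; };       /* this pass's scratch data */

static struct pass_temp *temp_array;

/* Allocate scratch slots for all nodes at the start of the pass.  */
static void
pass_temp_init (int max_uid)
{
  temp_array = calloc (max_uid, sizeof (struct pass_temp));
}

/* Look up this pass's scratch data for a node by its UID.  */
static struct pass_temp *
pass_temp_get (const struct cgraph_node_sketch *node)
{
  return &temp_array[node->uid];
}

/* Free everything when the pass is done; nothing lingers in the nodes.  */
static void
pass_temp_fini (void)
{
  free (temp_array);
  temp_array = NULL;
}
```

Keeping scratch data out of the nodes means a pass leaves no trace in the
long-lived callgraph once it finishes.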

The local and global data are available via the cgraph_local_info /
cgraph_global_info accessors; the plan is to un-embed them from the cgraph
nodes once the memory consumption justifies it, and the accessor functions
also perform sanity checking.

So the implementation plan here is just to enforce this split for all passes
that weren't implemented exactly this way.

Basic pass manager changes
==========================

At the moment the pass manager operates in two modes:

  1) IPA mode for toplevel passes: those passes are executed in sequence,
  once per compilation.
  2) local mode for nested passes: those passes (that is, all nested passes
  in sequence up to the next IPA pass) are executed on each function, in
  topological order of the callgraph.

I would like to (re)-add the following hooks:

 /* For IPA pass only. Analyze the given function body and produce summary
    information.  */
 void analyze_function (struct cgraph_node *);
 /* For IPA pass only. Analyze the given variable initializer and produce
    summary information.  */
 void analyze_variable (struct varpool_node *);
 /* For IPA pass only. Read summary from disk.  */
 void read_function_local_info (struct cgraph_node *, whatever parameters needed
 		                by LTO implementation);
 void read_variable_local_info (struct varpool_node *, whatever parameters needed
 		              by LTO implementation);
 /* For IPA pass only. Apply changes to given function body.  */
 void modify_function (struct cgraph_node *);
 /* For IPA pass only. Apply changes to given variable body.  */
 void modify_body (struct varpool_node *);
 /* For IPA pass only. Write summary to disk.  */
 void write_function_local_info (struct cgraph_node *, whatever parameters
 				 needed by LTO implementation);
 void write_variable_local_info (struct varpool_node *, whatever parameters
 				 needed by LTO implementation);

For implementing stage C of the WHOPR document (i.e. being able to produce .o
files with the decisions from the global optimization in them), we would also
need two extra hooks for reading and writing ipa_optimization_info, but I
would leave this out for now.

I would propose doing this change along with killing the RTL dump letter
fields, since the most annoying part of this change is actually updating all
the initializers of all GCC passes by hand.

   BTW, what about, instead of adding 8 NULL fields to each initializer,
   adding simple macros like
     IPA_PASS (analyze_fun, analyze_var, write_fun, write_var, read_fun,
               read_var, execute, modify_fun, modify_var)
     LOCAL_PASS (execute_fun)
     RTL_PASS (execute_fun)
   so we don't have to go over it all again?
   I would be happy to do the non-macroized change, however.
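Such initializer macros might look like the sketch below; the field layout of
the pass structure is made up for illustration (the real pass structure has
many more fields):

```c
/* Hypothetical pass structure and initializer macros.  LOCAL_PASS and
   RTL_PASS fill the IPA-only hook fields with NULL so that plain passes
   don't have to spell out every field.  */
#include <assert.h>
#include <stddef.h>

struct pass_sketch {
  void (*analyze_fun) (void);   /* IPA-only hooks ...  */
  void (*analyze_var) (void);
  void (*execute) (void);       /* ... every pass has execute */
  void (*modify_fun) (void);
};

#define IPA_PASS(af, av, ex, mf)  { af, av, ex, mf }
#define LOCAL_PASS(ex)            { NULL, NULL, ex, NULL }
#define RTL_PASS(ex)              { NULL, NULL, ex, NULL }

static void dummy_execute (void) {}

/* A local pass only needs its execute function spelled out.  */
static struct pass_sketch my_local_pass = LOCAL_PASS (dummy_execute);
```

The win is that adding a ninth hook later means touching three macros, not
every pass initializer in the tree.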

With these extra hooks, the pass manager queue could be organized as follows:
   all_lowering_passes: executed per function, as done now.
   all_early_ipa_passes: a queue consisting of IPA passes with only the
     execute hook set.  Here we will do things like early inlining, early
     optimization and similar passes.
   all_interunit_ipa_passes: IPA passes with the analyze/execute/modify
     hooks.  The pass manager will execute them after all_early_ipa_passes
     and will call all the analyze hooks first.  These are followed either by
     the write hooks and exit, or by the execute hooks and finally the modify
     hooks, depending on whether we do LTO or not.
   all_late_ipa_passes: if we opt for having a late small IPA optimizer, we
     can put its passes here.  Probably not in the initial implementation.
   all_passes: local optimization passes, executed in topological order as we
     do now.  This can also be a subpass of the last pass of
     all_interunit_ipa_passes.
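The driver loop over all_interunit_ipa_passes could then be sketched as
below; this is an illustration of the proposed ordering only, with invented
function names standing in for the hooks:

```c
/* Sketch of the proposed all_interunit_ipa_passes driver: all analyze
   hooks run first, then either the write hooks (when producing LTO
   summaries and exiting) or the execute hooks followed by the modify
   hooks.  The trace string records the order for checking.  */
#include <assert.h>
#include <string.h>

static char trace[64];

static void analyze (void)       { strcat (trace, "A"); }
static void write_summary (void) { strcat (trace, "W"); }
static void execute (void)       { strcat (trace, "E"); }
static void modify (void)        { strcat (trace, "M"); }

static void
run_interunit_queue (int npasses, int writing_lto)
{
  /* Every pass analyzes every body before any pass executes.  */
  for (int i = 0; i < npasses; i++)
    analyze ();
  if (writing_lto)
    /* Compilation stage of WHOPR: dump summaries and stop.  */
    for (int i = 0; i < npasses; i++)
      write_summary ();
  else
    {
      /* Non-LTO: make all decisions, then apply them.  */
      for (int i = 0; i < npasses; i++)
        execute ();
      for (int i = 0; i < npasses; i++)
        modify ();
    }
}
```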

With LTO link-time optimization, the queue will start with
all_interunit_ipa_passes, with the read hooks followed by the execute and
modify hooks.

Memory management of function bodies
====================================

On the LTO branch I would like to use cgraph_function_body_needed to bring a
function body into memory on demand, and cgraph_function_body_not_needed to
schedule a function body for removal from memory (probably done lazily at the
next GGC cycle).

The functions do simple reference counting: the idea is that late-stage
passes, like inlining, can simply ask for a body as needed, and can do so
several times, for example when inlining recursively.

Finally, there is the function cgraph_function_body_modified, which locks the
body in memory so it is not re-read from disk; the body is removed only after
the function is compiled to disk or declared no longer needed.

This is used in particular by the pass manager when doing the final local
optimization, to keep the function currently being worked on in memory.
However, it is also intended for use by passes that, during their execute
stage, need to produce new functions or modify them in complicated ways.

Basic locking/unlocking will thus be driven by the pass manager for local
passes, while passes needing additional bodies, such as the inliner, will
make their own requests.
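The proposed reference counting could work roughly as in this sketch; the
function names cgraph_function_body_needed / _not_needed / _modified are from
the proposal above, but the state layout and logic here are an illustration
only:

```c
/* Sketch of reference-counted function body management.  */
#include <assert.h>

struct body_state {
  int refcount;
  int in_memory;
  int locked;     /* set by the "modified" operation */
};

/* cgraph_function_body_needed: first request reads the body in.  */
static void
body_needed (struct body_state *b)
{
  if (b->refcount++ == 0)
    b->in_memory = 1;           /* brought in from disk on demand */
}

/* cgraph_function_body_not_needed: last release schedules removal
   (in GCC this would happen lazily at the next GGC cycle).  */
static void
body_not_needed (struct body_state *b)
{
  if (--b->refcount == 0 && !b->locked)
    b->in_memory = 0;
}

/* cgraph_function_body_modified: the body may no longer be re-read
   from disk, so keep it until it is compiled or discarded.  */
static void
body_modified (struct body_state *b)
{
  b->locked = 1;
}
```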

Interaction of optimization passes and virtual clones
=====================================================

Now to the interesting part.  Since all function bodies are analyzed by all
passes first, the passes, while designed to execute in sequence, conceptually
all execute in parallel.

This means that a pass deciding to do a particular optimization at its
execution stage is responsible for updating the summaries of all later
passes, so that they are not confused by out-of-date information.

It would be nice if the passes were actually able to derive benefits from the
earlier analysis: for example, if constant propagation modifies a function so
that some indirect call becomes direct, we want the inliner to be able to
decide to inline it.  This is a must for devirtualization.

It seems that all the IPA passes we are thinking about fall into these
categories:
  1) Purely analyzing passes that do not modify the functions (such as alias
  analysis or pure/const discovery).  This is the easy category.
  2) Passes doing function cloning based on IPA analysis (inlining, IPCP).
  3) Passes doing simple replacement (such as discovery of constant static
  variables).
  4) More complex stuff (such as struct-reorg, which modifies the data
  structures and the algorithms in functions to match the changes).

I think we can have a few generic ways to perform the changes we want:

 Virtual clones
 ==============

Fortunately, most of the basic passes we plan can be summarized as function
cloning: either the passes actually do cloning, like IPCP, or they can be
seen that way.  Inlining, for example, can be seen as a pass producing a
function clone that ends up in the body of the other function.

The callgraph implements so-called virtual clones.  Real clones, as currently
done by IPCP, are implemented as ordinary functions from the cgraph point of
view: the function body is fully copied, a new declaration is created, and
all call statements in the program are rewritten.  A virtual clone, by
contrast, consists of a callgraph node pointing to the declaration of a
different function body, plus a description of how to produce the function
based on it.  So creating the clone implies only the creation of a new cgraph
node and the redirection of edges, without touching the statements
themselves.
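The difference can be sketched as below; the structure layout and names are
invented for illustration and do not match the real cgraph_node:

```c
/* Hypothetical sketch: a virtual clone shares the body of its origin
   node and only carries a description of how to derive its own body
   (e.g. "parameter 1 is the constant 5" for an IPCP clone).  */
#include <assert.h>
#include <stddef.h>

struct clone_desc {
  int param_index;
  int const_value;
};

struct cgraph_node_sketch {
  const char *body;                    /* NULL for virtual clones    */
  struct cgraph_node_sketch *clone_of; /* node owning the real body  */
  struct clone_desc desc;
};

/* Creating a virtual clone touches no statements: it only creates a
   new node pointing at the origin's body plus a description.  Callers
   are then redirected to this node.  */
static struct cgraph_node_sketch
make_virtual_clone (struct cgraph_node_sketch *origin,
                    struct clone_desc desc)
{
  struct cgraph_node_sketch c = { NULL, origin, desc };
  return c;
}

/* The clone's body is materialized from the origin only when needed.  */
static const char *
node_body (const struct cgraph_node_sketch *n)
{
  return n->clone_of ? n->clone_of->body : n->body;
}
```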

The inlining decisions are currently represented as an inline plan: when a
function is decided to be inlined, a new virtual clone is built and placed in
the callgraph.  All inline clones have just one caller, and if the offline
function is not needed, the last inline clone replaces it.

After the removal of the analyze/modify hooks from the pass manager, the
inliner was split into 3 passes (pass_inline_parameters, pass_inline and
pass_apply_inline) doing the analyze/execute/modify sequence.  This helps
keep memory usage down by not constructing all the inlined function bodies in
memory at once.

The clones seem like a good abstraction, since they are quite transparent to
most IPA passes: by copying the function summary over to the clone, we
basically update the summary, and the execute portion of a pass will actually
get the same results as if the summaries of the caller and the inlined callee
were somehow merged.  This can all be done without any knowledge of what the
other passes are doing.

So the plan is to convert IPCP and other passes from doing real cloning to
doing the same virtual cloning.

 Late introduction of functions
 ==============================

In some other cases, it seems that IPA passes need a means of introducing new
functions late.  This is currently done by profiling (which introduces a
constructor), for functions calling constructors, or by OpenMP.

All function bodies now transition through the following stages:

  New: the function is around but nothing has been done about it yet.
  Finalized: the function body is built and passed from the frontend to the
    callgraph.
  Lowered: the function body is gimplified and brought into low GIMPLE via
    all_lowering_passes in the pass manager.
  Analyzed: the callgraph edges are built.
  Locally optimized: the early optimization passes are run and the function
    is in SSA form.
  Compiled: rest_of_compilation (all passes) is run.
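The stages above form an ordered progression, which is what makes late
function addition tractable: a new function just has to be advanced from
whatever stage it is in up to the stage of everyone else.  A sketch, with
enum names invented for illustration (not the actual cgraph state values):

```c
/* Hypothetical ordered stages of a function body.  */
#include <assert.h>

enum body_stage {
  STAGE_NEW,
  STAGE_FINALIZED,
  STAGE_LOWERED,
  STAGE_ANALYZED,
  STAGE_LOCALLY_OPTIMIZED,
  STAGE_COMPILED
};

/* Bring a function from whatever stage it is in up to TARGET, as
   cgraph_add_new_function must do for functions added late.  */
static enum body_stage
advance_to (enum body_stage current, enum body_stage target)
{
  while (current < target)
    current++;    /* a real implementation runs that stage's passes */
  return current;
}
```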

As the compilation transitions from parsing to the cgraph_optimize stages,
and later as the IPA pass queue is executed, all the function bodies are
brought to the appropriate stages in parallel.

New functions are irritating, since the pass introducing one would be
required to bring it to the proper stage, which is a bit difficult.  This is
handled by cgraph_add_new_function, which basically schedules the function
for addition; at the end of each IPA pass the function is brought from its
current stage (whatever it is) to the same stage as the other functions.

This can be extended to IPA passes too: if an IPA pass calls
cgraph_add_new_function, we can schedule all the analyze hooks so that later
passes will see the function as if it had been in the callgraph forever.

The function will need to be locked in memory, and thus it would be the
responsibility of the IPA pass not to build too many functions this way.
This is, however, consistent with the idea that IPA passes should not grow
the program too much anyway.

 Reanalysis of function
 ======================

Similarly to how new functions can be added late, we may decide to modify a
function early in a complicated way and lock it in memory.

This can be done by running all the modify hooks of the earlier passes on the
function in question, applying the changes, and re-calling the analyze hooks
of all later passes, noting in the cgraph_node that this was done, so that at
the end of compilation only the later modify hooks are executed.
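The reanalysis sequence for one function could be sketched as below; this is
an illustration of the proposed ordering only, with invented names, and the
trace string records which hooks ran:

```c
/* Hypothetical sketch of reanalyzing one function: the modify hooks
   of passes before CHANGED_AT run early, the complicated change is
   applied, the analyze hooks of the later passes are re-run, and the
   returned index remembers which modify hooks are still pending for
   the end of compilation.  */
#include <assert.h>
#include <string.h>

#define NPASSES 4
static char trace[64];

static void run_modify (int pass)  { (void) pass; strcat (trace, "M"); }
static void run_analyze (int pass) { (void) pass; strcat (trace, "A"); }

/* Returns the index of the first pass whose modify hook must still run
   at the end of compilation (recorded in the cgraph_node in the real
   scheme).  */
static int
reanalyze (int changed_at)
{
  for (int p = 0; p < changed_at; p++)
    run_modify (p);             /* results of earlier passes applied */
  /* ... apply the complicated change to the locked body here ...  */
  for (int p = changed_at; p < NPASSES; p++)
    run_analyze (p);            /* later passes see the new body */
  return changed_at;
}
```

This only works if each analyze hook tolerates running more than once on the
same function, which is the requirement stated below.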

This way the later passes will see the results of the earlier pass.  All we
need is to ensure that the analyze hook of a given pass can be executed
multiple times on each function, which is easy to maintain.

Again, this implies locking the function in memory, and it should not be used
in passes that make a lot of changes, which are better dealt with
differently; but it seems like a good solution to me for some specific
situations.

 The rest
 ========
I don't have many plans for the rest: I guess we need to handle them case by
case, and probably build some knowledge of all the other passes into the more
difficult transformations, or not do the more difficult transformations at
this level of IPA at all, using the "small IPA" plan described earlier.  For
all current passes, with the exception of struct-reorg, it is quite clear we
won't need this.  For struct-reorg we can probably just apply the reanalysis
idea if we don't do too large scale changes.

Late IPA passes
===============

One idea is that, because the large IPA passes have the significant
pass-ordering difficulties outlined above, and because the results of IPA
optimization might simplify the program significantly so that more IPA
becomes possible, it seems reasonable to think about late IPA passes.

Those would operate just on a portion of the program that fits in memory,
based on some region choice done by the big IPA passes, and do things like:
  1) local optimization of function bodies after the results of the big IPA
  passes have been applied
  2) more precise alias analysis/inlining/whatever we find works better
  3) IPA register allocation and other stuff that needs information at a very
  low (RTL) level to operate effectively

That's about it.  I would welcome any comments, and if it seems useful I can
turn this into a wiki page, adding details to the current implementation plan
on the wiki.

Honza

