LIPO - Profile Feedback Based Lightweight IPO

(Our Acronym Sucks)

Contents

LIPO - Profile Feedback Based Lightweight IPO

Overview & Motivation

Two of the most important compiler techniques boosting application performance are CMO/IPO (cross module optimizations) and FDO (Feedback Directed Optimizations). IPO is proven to be a very effective optimization technique to improve application performance. Among all code transformations in IPO, cross module inlining (CMI) is one of the most important. CMI eliminates procedure boundaries which otherwise are not possible in single module compilation. Program performance can be improved with procedure boundary elimination (inlining) due to elimination of call overhead, added context sensitivity, larger optimization region, and larger scheduling region. Added context sensitivity allows more constant propagation, redundancy elimination, dead code and unreachable code elimination. There are also enabler transformations for CMI such as indirect call promotion. The remaining benefits mostly come from whole program analysis such as pointer analysis, mod/ref analysis, etc. However IPO does come with a cost increased compile time and large disk storage requirement (for saving intermediate representation of the sources). The increased compile time is mostly due to the complexity of IPA analysis algorithm, lack of parallelism in IPA, and increased IO activities. Profile feedback guides compiler in making better inlining decisions, loop unrolling decisions. It also helps instruction cache performance by function splitting/outlining, block reordering, function reordering. It also allows better block ordering/tracing to improve branch prediction. Register allocator can also be guided to make better decision in spill candidate selection. Lots of value profiling based optimizations including indirect call promotions, memOp specialization, div/rem specializations are also made possible.

For users who want the benefits of CMO with FDO but do not want to pay for the overhead of the increased compile time, it would be desirable for the compiler to provide a lightweight mechanism to achieve that. In this wiki, a lightweight IPO (LIPO) is proposed, which is a new compilation model that combines the two techniques in a seamless fashion. It shifts important IPA analysis (e.g. inlining analysis) from compile time to training runtime, and saves the result of dynamic IPA analysis including module grouping information into the program data base. With the results of dynamic IPA performed at runtime, the profile-use pass of the FDO compilation can therefore performs cross module optimizations. LIPO represents a major paradigm shift in program compilation – it makes IPO implicitly available with FDO. This new compilation model is transparent, and does not require user change their build system (if the build system already incorporates FDO, this is certainly true. This will be even more true when LIPO is incorporated with sampled FDO when no additional instrumention step is needed). It is scalable and does not lose build time parallelism.

The Basic Design

For traditional IPO (i.e, link-time-optimization LTO), cross module inline analysis is delayed to link time because it is the first point in the whole compilation where all modules (translation units) are available. This requires serialization of intermediate representations of all source modules for the program.

When FDO is used, additional steps are introduced. The program build comprises of the following steps:

build target program with instrumentation
run the instrumented program with representative input data set; dump the program profile into profile database
build the program with profile feedback collected in the training run.

The additional steps for FDO are usually a hassle to users, but it provides a way to perform traditional cross module analysis without using LTO.

The basic idea to perform cross module analysis (IPA) at runtime of the profile collection run. On exit of the normal execution, instead of simply dumping the profile data into the profile database, the program can perform additional analysis. For instance, cross module inlining analysis. It can build up the dynamic callgraph, annotate the dynamic call graph edges with call counts, and group file modules with caller and callees connected with hot call edges. The group information (e.g. source location of the enclosing translation units of the functions) is then stored into the profile database.

During the profile use compilation, the module grouping information is read in before parsing starts, and for a given primary source, the auxiliary source modules needed for CMI are also read in and parsed. The auxiliary functions participate in inline transformations, after which auxiliary module functions which are not needed are removed to reduce backend compile time. There is one module group for each original source module, which is called the primary source of the group. Each module group contains zero or more auxiliary modules. Conceptually, with LIPO compilation, auxiliary modules are included into the primary module as headers with functions defined as extern inline. The header inclusion is done by the compiler automatically and smartly based on cross module analysis.

With this design, there is no need for serialization of the persistent intermediate representation of the source modules. The diagram of the lightweight IPO with FDO is shown Figure 1. In the diagram we can see that the IPA analysis (module/function partitioning) phase is merged into the profile collection run phase. The resulting module grouping information will then be used in the profile use compilation phase to guide the compilation.

Figure 1: Lightweight Model for IPO with FDO

Figure 1: Lightweight Model for IPO with FDO

A Trivial Example

The small program contains 3 source modules. Module b.c contains a call to a function defined in module a.c which is hot. Module c.c contains a cold call to the function defined in a.c.

     int foo (int *ap, int i)
     {
        return ap[i] + ap[i + 1];
     }

     #include <stdio.h>
     #include <stdlib.h>
     extern int foo (int *, int);
     extern int bar (void);

     int main (int argc, char **argv)
     {
       int n, i, s = 0;
       int *data;
       FILE *data_file;

       n = atoi (argv[1]);
       data = (int *) malloc (sizeof (int) * n);
       if ((data_file = fopen ("data.dat","r")) == 0)
          return -1;
       fread (data, sizeof (int), n, data_file);

       for (i = 0; i < n - 1; i++)
          s += foo (data, i);

       s += bar ();
       fprintf (stderr, "result = %d\n", s);
       return 0;
     }

     extern int foo (int *, int);
     int a[4] = {3,4,6,10};
     int bar (void)
     {
        return foo (&a[0], 0) + foo (&a[2], 0);
     }

Assuming runtime inlining analysis decides that is beneficial to inline foo into main (it is a hot call, and inlining the call can also lead to SCR opportunity), the module group for a.c will be { a.c }. b.c's module group will be {b.c, a.c}, and for c.c, the module group is { c.c }. In profile-use compilation, a.c and c.c will be compiled as usual (trivial module group), while compiling b.c will suck in a.c as the auxiliary module so that foo can be inlined into main.

LIPO Usage Model

LIPO integrates IPO into FDO seamlessly, so the usage model of LIPO is almost identical to that of FDO. The only difference is one additional option is needed to turn on the IPO functionality in both profile-gen and profile-use passes.

Compiler/Linker Options

Option for instrumentation compilation

       -fprofile-generate -fripa

Option for instrumentation linking

       -fprofile-generate

Option for profile use compilation

       -fprofile-use -fripa

In the current implementation, -fripa only turns on cross module inlining analysis. To optionally turn on additional full program analysis (future), suboptions can be used.

       -fripa=global:pointsto:modref:structlay

Dynamic IPA Runtime Parameters

In the current implementation, runtime control parameters for dynamic IPA are passed via environment variables. The following environment variables are provided:

GCOV_DYN_CGRAPH_DUMP : used to control dumping of the dynamic callgraph. When the value is 1, it is dumped in raw form; value 2 tells the program to dump in .dot format

       GCOV_DYN_CGRAPH_DUMP=1
       GCOV_DYN_CGRAPH_DUMP=2

GCOV_DYN_CGRAPH_RANDOM_GROUP : used for LIPO stress testing. For details see section below.
GCOV_DYN_CGRAPH_CUTOFF: used for control module grouping. In the current hotness based dynamic inline analysis, the module grouping is done according to the hotness of the call edges. A call edge is considered hot if its count exceeds a certain threshold which is dynamically computed. Given a sorted (descending order) call edge array, the threshold count is the count of the edge such that the the percentage accumulative count (from the hottest edge to this edge) over the total edge count exceeds a given cutoff value. This cutoff value can be specified via this environment variable. The default value is 95(%).

      GCOV_DYN_CGRAPH_CUTOFF=percent_value

More On LIPO Model: Comparison with Existing CMO Frameworks

Comparison with IMA

gcc C FE supports the -combine option. It is an all-in-memory compilation model. The main drawbacks of this compilation models are:

Since every translation units are kept in memory, the approach is not very scalable;
User can choose to manually partition all source modules into smaller sets, but this can be labor intensive, and be intrusive to the existing makefiles;
IMA implementation tends to report false incompatible declaration errors for declarations in different modules. This can greatly impact its usability;
Extending IMA to other languages other than C (e.g. C++) can be very difficult

Compared with IMA, LIPO has the following characteristics:

It does not partition source modules into smaller sets. Instead, it extends the compilation scope for each source module by including needed auxiliary source modules;
Auxiliary source modules are not top level entities, and most functions defined in them can be eliminated after ipa-inline pass, thus saving compile time;
In a module group, each source module is parsed independently -- there is no name sharing between frontend symbol tables across modules (other than builtins). This makes it very robust and source tolerant. Symbol merging is done after file parsing. In this aspect, it behaves similar to LTO, but done in memory-core (no serialization).

Comparison With LTO

In the LTO compilation model, the program compilation is divided into three phases:

FE phase: This phase is very similar to non-IPO compilation where individual source files are parsed. However, instead of generating relocatable objects, the compiler generates object files with source file's intermediate representations plus the summary data wrapped in target system's object file format (e.g. ELF). In some implementation, the compiler can choose to generate fat object files which are regular relocatables plus additional data sections carrying the IR and summary data according the user options. Such fat object files can be consumed by both regular link and IPO link. The primary role of the phase is to perform collection of summary data needed by inter-procedural analysis. The design of the summary data not only has impact on the precision of the IPA anlayses, but also on the IPA compile time. Common summary data include IR symbol table, type table, callsite table, and analysis specific data such as ICP jump functions, etc. To obtain summary data with better precision, pre-IPA scalar optimization phases are usually enabled in the FE phases.
IPA phase: This phase performs inter-procedural analysis (and some transformations such as cross-module inlining depending on the implementation) based on the summary data collected. This phase is performed at link time (of the intermediate ELF files) -- thus IPO is also called LTO (Link Time Optimization). This phase is the key bottle neck (in terms of compile time) of the whole compilation -- due to the lack of parallelism.
BE phase: After the IPA phase, the IPA driver invokes backend driver (in parallel) to perform IPA transformations, backend optimizations, and code generation. Each invocation of the backend driver is fed with one or a subset of intermediate files plus program data containing the IPA analysis result (the program database can exist in various forms depending on the implementation). After all the real object files are created, the linker takes back the control and produces the final executable or shared library.

As we can see from the description, this LTO model requires the generation of intermediate files due to the nature of this compilation pipelines. In terms of object file sizes, an intermediate object file is usually 3-6X the size of the a regular relocatable object. For very large programs, LTO compilation imposes very large compilation overhead not only in terms of disk storage, but also in worse compile time due to excessive IO involved.

On the other hand, for the lightweight IPO (LIPO),

The compilation is IELFless -- no serialization of intermediate representation is needed;
The compilation maintains fully the build parallelism at the primary source module level -- there is no single bottleneck holding up the build time;
Debug information can be easily maintained
Relative easy to add support for incremental build
The implementation is not intrusive to the gcc code base and does not have any negative impact on the default compilation paths
The implementation can leverage most of the existing single file IPA infrastructure.
IPO actually requires FDO to achieve its full potential -- LIPO is a natural way to combine them together.

The drawbacks of LIPO are

It is tied to FDO (which is also an advantage -- see above)
It is currently partial IPA -- no full program analysis is available for now (see future items). (For very large C++ programs, this is probably a moot point).
Mixed language CMO is not supported -- mainly because the file mixing happens in the front end phase. (some combinations can be implemented, but probably not worth the effort).

LIPO Quality Infrastructure

Robustness Testing Strategy

There are two problems related to LIPO's test coverage

LIPO requires profile feedback
The module grouping is determined by runtime profiling, and may limit what can be tested

To overcome the limitations, mechanisms are introduced to allow stress testing.

A compiler option (currently implemented as parameter value) is added to force the compiler to operate in the LIPO mode even without the profile data.

     --param force-lipo=1

Random grouping: instead of creating module groups using profile data, random groups with arbitrary sizes can be created and written into the profile data file. The environment variable is used by the training program run.

     LIPO_RANDOM_GROUPING=seed:max_group_size

The random grouping feature has helped uncover many bugs in the implementation.

Runtime Error Triaging Support

A couple of compiler options have added to help triaging runtime failures with binaries built using LIPO. A common triage strategy to find the bad optimized object is to use binary search via mix and match object files from the bad build with object files from a good build. However, non-LIPO object files can usually not be mixed with LIPO built object files due to static promotions (see details section below) -- link time error may occur. To solve this problem, a compiler option is introduced (via an environment variable --- should probably be a parameter) to limit the max group size for each compilation. For triaging purpose, the max group size can be set to 1. This will in most cases generate a good set of object files that can be mixed with LIPO objects:

     --param max-lipo-group=<value>

Two other debug counts are also added which prove to be useful:

debug count for number of inlining

      -fdbg-cnt=inline:N

debug count for the number alias queries -- when the count is exceeded, may_alias will be returned always.

      -fdbg-cnt=alias:N

Compile Time Overhead

In the new compilation model, one source file can be an auxiliary module for more than one primary source modules. This introduces redundancy in compilation, but this is unlikely to become a problem because

The auxiliary source modules only need to be parsed in the compilation process for the primary module, i.e., the optimization and codegen passes for most of the functions in the auxiliary modules will be skipped. Compiler usually spends a smaller percentage of time in parsing.
If possible, selective parsing capability can further reduce compile time.
The size of the module grouping is usually small -- the majority of the module groups are trivial (containing only one module) -- see data below.
What is most important, compilation processes for different primary modules can be done in parallel. The higher the the level of build parallelism, the less impact the overall build time.

Here is a data point. The largest SPEC2006 C++ program is xalanbmk. Total number of source modules for this program is 693. On a two core 2.4Ghz core-2 machine, the build time (wall clock):

Build Time Table
	LIPO	FDO
-j 1	16m	12m
-j 4	11m	9m

For the LIPO build, most of the module groups are trivial (single file module). Cumulative edge count cutoff is set to 95%.

Module Groups
Total Number	693
Total Trivial	664
Max Module Group Size	19
Average Non-Trival Group Size	6.4

Some Detailed Notes

Runtime Callgraph Build

At the end of the program execution (profile collection run), and before dumping of the profile data, the program analysis starts. The first task performed is to build the callgraph for the program. In the current version which only supports cross module inlining (the most important one), there is no need for building the static callgraph.

To build dynamic call graph at runtime (profile collection), the profile information on how many times each static call site is dynamically called need to be computed or collected. For an indirect call, the call counts to the callee targets need to use the value profile counters for that callsite. For a direct function call, there are two alternatives. One way is to infer the direct call count via edge/basic block execution count. For this to work, the raw edge counts need additional annotation, i.e., the basic block graph with callsite information. This can complicate the implementation. The other alternative is to directly rely on direct function call profiling. This mechanism is implemented in LIPO.

New Indirect Call Profiling

The existing indirect call profiling implementation can not be used for LIPO. The main reason is that the function target id is only unique within a given source module. Besides, it is based one value profiling which can be weak.

A new indirect call profiling is implemented for LIPO. It can track N hottest targets (currently set to 2). The function identification uses the so called global unique id. A global unique id is comprised of two parts: the module id and the funcdef_no which is the intra-module function id (The old implementation uses callgraph node id). The module ids are allocated dynamically during program initialization when modules are registered.

Direct Function Call Profiling

This is needed for building runtime callgraph. To reduce overhead, the instrumentation of direct call profiling is performed after ipa-inline phase which is different from edge profiling. The profile counters are only used for guiding runtime analysis, but not used for guiding -fprofile-use compilation. To support profile merging, direct call profiling counters still need to be written out to gcda files.

Source Module Affinity Analysis (from CMI)

Source module affinity analysis is done via function grouping in the dynamic call graph. A greedy algorithm is used: for each node in the dynamic callgraph, follow the outgoing edges and put the callee into the same group if the call edge is hotter than a threshold. This is done transitively. For all the callee functions in the group, if the defining module is different from the caller node, the module is added as one auxiliary module in the module group if it is not yet added. Note that the same callee function is allowed to be in multiple different caller's function group.

Module Grouping Data Format

The module grouping information will need to be written into the profile data file. For a primary module, it is in the form of an array of module information record. Each module info record contains the following information:

The dynamically allocated module id
A flag indicating whether it is a primary module or auxiliary module
If it is a primary module, there is a flag indicating if the module is imported by some other module groups. The flag is important to decide if any static variable/functions in this module needs to be promoted to global or not
The real path of the source file location
The real path of the gcda file for the source module
An array of system/user include paths for building the module
An array of -D/-U for building the module

Multiple Module Parsing Support

For certain languages (such as C++), the name lookup rules are complicated. To parse multiple source modules, simply combining all the sources and treat them as one extended translation unit can be very tricky and error prone. Even for the simple language such as C, the -combine implementation suffers from the weak tolerance in the user sources -- often in times, false errors are emitted on incompatible variables/types, etc. Going through this route is avoided from the very beginning (In fact, this was true originally for LIPO-CPP. LIPO-C originally leveraged -combine implementation, but got switched to the C++ LIPO model after encountering usability issues: failures in FE when building gcc etc).

The adopted parsing model is simple: namespace/FE symbol table isolation/separation. In other words, each source module maintains its own FE symbol table (decls/types/name bindings), and name lookup does not happen across modules. To achieve this effect, after parsing of each module, bindings of names for global entities need to be cleared. FE declared builtins are handled specially -- their names are shared. However before parsing of an auxiliary module starts, LIPO driver restores their values to the states before primary module starts. An alternative is to declare a different set of builtins per module -- but this requires significant driver changes. The current scheme works fine.

Multiple Module Profile Annotation Support

Auxiliary functions also need profile data for FDO, so the profile counter hashtable need to be set up in two steps in LIPO mode. The gcda file for the primary module is first read in. Depending on the module information records recorded in the profile data file, more files may be added to the input file list. If there is at least one auxiliary module added, the profile counter table for the primary module is now rehashed (using the extended function id as the key), and auxiliary module's gcda files are read in and entered into the hash table. The profile counter lookup is done the same way as non-LIPO compilation, but with function global id as the lookup key.

In-core Linking

In LIPO's parsing support, name lookup won't find the same symbol (function/variable) across source module boundaries, multiple declarations may be created for the same symbol/entity in the compilation. After parsing is done, the symbols will be to be unified. This is done by cgraph node linking and varpool node linking. A undefined cgraph node is resolved to a node in another module if it is defined in the module. There is no actual fixups in the gimple code stream. The cgraph linking needs to be done before call graph edge build so that edge destinations can be resolved to the real targets.

Type unification is also needed. For structure types, the equivalence is based on qualified type names, not on structural equivalence. For type based aliasing purpose, the latter scheme can be fragile, i.e, failure to detect structural equivalence can lead to false aliasing (no-alias) information. On the other hand, name based method is more robust -- the worst case scenario is conservative aliasing being provided. To be more user friendly, additional check can be performed to warn about inconsistent type definitions with the same names.

Function Linkage

Functions defined in auxiliary modules have special linkage. For most of the functions, they behave like extern inline functions. They are not not needed (for code expansion), so they can be deleted after ipa-inlining. The exceptions are

Functions in comdat groups. They need to be expanded if they are still reachable after ipa-inline phase -- there is no guarantee the reference can be satisfied anywhere, including its hosting primary module. They may be multiple copies of a comdat function in one LIPO compilation. The expansion/output keeps track of emitted functions and skip the duplicates.
Function clones. These functions are created by the compiler and can never be 'external'. They can be deleted if there is no reference after inlining.

Static Promotion and Global Externalization

Static variables, static functions need special handling in both primary module and an auxiliary source module. Global variables in auxiliary modules also need special handling. Static here means the entity has internal linkage, the name of which is either file scoped or function scoped.

static variables in the primary module: When the primary module is marked as imported, the function referencing a static variable may be inlined into another module in a different compilation, thus creating an external reference to the static variable. In other words, the variable can no longer be internal and needs to be promoted into a global. The key is how to create a unique, non-conflicting linker id (across all modules) for that static variable so that both caller and callee module agree upon. The solution is to use the original mangled name of the variable appended with lipo/cmo keyword plus the hosting module's id. It is possible static variables in different scopes have the same name. For such cases, sequential numbers will be assigned to discriminate them (which reflects their declaration order in the module).
static variables in an auxiliary module: they need to be promoted the same way as those in an imported primary module. The difference is they also need to be externalized. See last bullet item below.
static functions in the primary module: a static function in an imported primary module may be directly reference by another module if indirect->direct call promotion happens. To make sure potential external references be satisfied, the static function needs to be promoted to be a global function. The naming convention is similar that of static variables.
static functions in an auxiliary module: the handling of static functions in an auxiliary module is similar to an imported primary module.
global variables in an auxiliary module: global variables defined in an auxiliary module should be treated as extern and should be externalized.

Current Status

Functional Quality

It is fully functional for C.
It is fully functional for C++.
It is fully tested with SPEC2000, and SPEC2006
It is stress tested using random large groups with SPEC2000 and SPEC2006
It is tested with very large user applications (much larger than largest SPEC program)

In summary, the major performance delivery functionality is almost ready for mainline checkin. A public branch is needed for independent enhancements (see Future work section).

Runtime Performance

LIPO options
- PASS1 : -fprofile-generate -O2 -fripa
- PASS2 : -fprofile-use -O2 -fripa
FDO options
- PASS1 : -fprofile-generate -O2
- PASS2 : -fprofile-use -O2
255.vortex: -fno-strict-aliasing
400.perlbench: -fno-strict-aliasing
experiments performed on 2.40Ghz core2 machine.
number of trials 3, build target: linux/x86-64
The number shown does not represent the full potential of LIPO -- no SPEC specific tuning (mainly inline heuristics) is done.

SPEC2000 Numbers

SPECint2000 Performance (LIPO vs FDO)
	LIPO	FDO	improvement
164.gzip	1392	1309	6.3%
175.vpr	1804	1784	1.1%
176.gcc	2034	2023	0.5%
181.mcf	1764	1772	-0.4%
186.crafty	2509	2324	8.0%
197.parser	1665	1630	2.1%
252.eon	2650	2488	6.5%
253.perlbmk	2519	2510	0.4%
254.gap	2156	2055	4.9%
255.vortex	3246	2649	22.5%
256.bzip2	1810	1790	1.1%
300.twolf	2398	2395	0.1%
geomean	2106	2020	4.3%

SPEC2006 Numbers

SPEC2006 Performance (LIPO vs FDO)
	LIPO	FDO	improvement
400.perlbench	19.3	19.0	1.6%
401.bzip2	13.6	13.7	-0.7%
403.gcc	13.4	13.2	1.5%
429.mcf	14.3	14.3	0.0%
445.gobmk	15.2	14.7	3.4%
456.hmmer	15.4	15.3	0.6%
458.sjeng	16.4	15.6	5.1%
462.libquantum	19.5	19.6	-0.5%
464.h264ref	23.2	21.1	10.0%
471.omnetpp	11.7	11.1	5.4%
473.astar	11.0	10.9	0.9%
483.xalancbmk	8.52	8.21	3.8%
geomean	14.62	14.26	2.5%
433.milc	11.2	11.2	0.0%
444.namd	12.3	12.3	0.0%
447.dealII	21.3	21.3	0.0%
453.povray	19.9	17.3	15.0%
453.soplex	14.9	14.5	2.8%
470.lbm	14.2	14.2	0.0%
482.sphinx	15.3	15.3	0.0%
geomean	15.21	14.86	2.4%

gcc branch

The branch is svn/gcc/branches/lw-ipo. It is maintained by Xinliang David Li (davidxl@google.com).

(Note: Until it is updated, it is recommended to configure gcc with LIPO support without enabling type checking.)

Contacts

Xinliang Li (davidxl@google.com)
Raksit Ashok (raksit@google.com)
Robert Hundt (rhundt@google.com)

Future Work

Support for more language FEs, for instance gfortran.
Makedep tool. Gcov-dump tool is already enhanced so that auxiliary module source paths can be dumped. Besides, instead of dumping gcda file, an text file containing module imports are also dumped. The remaining work is to wrap this under a makedep tool so that extra source dependencies can be created on the fly during make.
Shared library support: Profile data merging for multiple runs of the same binaries are supported, but not yet for different binaries. This can be limited for shared libraries as it is desirable to accumulate profile data for a share library from training runs of multiple binaries.
Full Program Analysis Support. This requires storing additional static summary data in the instrumented binary so that static analysis can be performed at runtime.
Working with Sampling Based FDO : The lightweight IPO model described above assumes instrumentation based profile collection, but it can be extended to work with sampling based profiler. The key is to move the analysis into the profiler so that it not only dumps profile data to be consumed by profile use phase, but also performs additional analysis. This may require additional binary annotation data for source file information.