LIPO - Profile Feedback Based Lightweight IPO

(Our Acronym Sucks)

Overview & Motivation

Two of the most important compiler techniques for boosting application performance are CMO/IPO (cross module, or interprocedural, optimization) and FDO (feedback directed optimization). IPO has proven to be a very effective technique for improving application performance. Among all code transformations in IPO, cross module inlining (CMI) is one of the most important. CMI eliminates procedure boundaries that cannot be removed in single module compilation. Eliminating procedure boundaries (inlining) improves program performance by removing call overhead, adding context sensitivity, and enlarging the optimization and scheduling regions. The added context sensitivity allows more constant propagation, redundancy elimination, and dead and unreachable code elimination. There are also enabler transformations for CMI, such as indirect call promotion. The remaining benefits of IPO mostly come from whole program analyses such as pointer analysis, mod/ref analysis, etc. However, IPO does come with a cost: increased compile time and a large disk storage requirement (for saving the intermediate representation of the sources). The increased compile time is mostly due to the complexity of the IPA analysis algorithms, the lack of parallelism in IPA, and increased IO activity.

Profile feedback guides the compiler to make better inlining and loop unrolling decisions. It helps instruction cache performance through function splitting/outlining, block reordering, and function reordering, and it allows better block ordering/tracing to improve branch prediction. The register allocator can also be guided to make better spill candidate selections. Many value profiling based optimizations, including indirect call promotion, memOp specialization, and div/rem specialization, are also made possible.

For users who want the benefits of CMO combined with FDO but do not want to pay the overhead of increased compile time, it would be desirable for the compiler to provide a lightweight mechanism to achieve that. In this wiki, a lightweight IPO (LIPO) is proposed: a new compilation model that combines the two techniques in a seamless fashion. It shifts important IPA analyses (e.g. inlining analysis) from compile time to the training run, and saves the result of the dynamic IPA analysis, including module grouping information, into the program database. With the results of the dynamic IPA performed at runtime, the profile-use pass of the FDO compilation can then perform cross module optimizations. LIPO represents a major paradigm shift in program compilation -- it makes IPO implicitly available with FDO. The new compilation model is transparent and does not require users to change their build system (this is certainly true if the build system already incorporates FDO, and it will be even more true once LIPO is combined with sampled FDO, where no separate instrumentation step is needed). It is scalable and does not lose build time parallelism.

The Basic Design

For traditional IPO (i.e., link time optimization, LTO), cross module inline analysis is delayed until link time because that is the first point in the compilation where all modules (translation units) are available. This requires serialization of the intermediate representation of every source module in the program.

When FDO is used, additional steps are introduced. The program build comprises the following steps:

  1. build target program with instrumentation
  2. run the instrumented program with representative input data set; dump the program profile into profile database
  3. build the program with profile feedback collected in the training run.

The additional steps for FDO are usually a hassle for users, but the training run provides a natural point to perform traditional cross module analysis without resorting to LTO.

The basic idea is to perform cross module analysis (IPA) at runtime, during the profile collection run. On exit from normal execution, instead of simply dumping the profile data into the profile database, the program can perform additional analysis -- for instance, cross module inlining analysis. It can build up the dynamic callgraph, annotate the dynamic call graph edges with call counts, and group file modules whose callers and callees are connected by hot call edges. The grouping information (e.g. the source locations of the enclosing translation units of the functions) is then stored into the profile database.

During the profile-use compilation, the module grouping information is read in before parsing starts. There is one module group for each original source module, which is called the primary module of the group; each group contains zero or more auxiliary modules. For a given primary source module, the auxiliary modules needed for CMI are read in and parsed as well. The auxiliary functions participate in inline transformations, after which auxiliary functions that are no longer needed are removed to reduce backend compile time. Conceptually, with LIPO compilation, auxiliary modules are included in the primary module like headers, with their functions treated as if defined extern inline. The header inclusion is done automatically by the compiler, based on the cross module analysis.
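
Conceptually, the effect on the trivial example below is as if the compiler had textually imported the auxiliary module's function into the primary module as an extern inline definition. Here is a hand-written sketch of that view (the helper sum3 is illustrative, not actual compiler output):

```c
/* Sketch: what the compiler conceptually sees when compiling a primary
   module together with an auxiliary module.  The auxiliary function
   behaves like an extern inline: its body is available for inlining
   into the primary module's code.  Illustration only. */

/* Imported from the auxiliary module (a.c in the example below): */
extern inline int foo (int *ap, int i)
{
  return ap[i] + ap[i + 1];
}

/* Primary module code can now inline foo directly: */
int sum3 (int *data)
{
  int i, s = 0;
  for (i = 0; i < 2; i++)
    s += foo (data, i);   /* candidate for cross module inlining */
  return s;
}
```

Just as with extern inline, the out-of-line copy of the auxiliary function can be kept or discarded depending on whether it is still needed after inlining.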

With this design, there is no need to serialize a persistent intermediate representation of the source modules. The diagram of lightweight IPO with FDO is shown in Figure 1. In the diagram, the IPA analysis (module/function partitioning) phase is merged into the profile collection run. The resulting module grouping information is then used in the profile-use compilation phase to guide the compilation.


Figure 1: Lightweight Model for IPO with FDO


A Trivial Example

This small program contains three source modules: a.c, b.c, and c.c. Module b.c contains a hot call (from main) to the function foo defined in module a.c. Module c.c contains a cold call (from bar) to the same function.

     /* a.c */
     int foo (int *ap, int i)
     {
        return ap[i] + ap[i + 1];
     }

     /* b.c */
     #include <stdio.h>
     #include <stdlib.h>
     extern int foo (int *, int);
     extern int bar (void);

     int main (int argc, char **argv)
     {
       int n, i, s = 0;
       int *data;
       FILE *data_file;

       n = atoi (argv[1]);
       data = (int *) malloc (sizeof (int) * n);
       if ((data_file = fopen ("data.dat","r")) == 0)
          return -1;
       fread (data, sizeof (int), n, data_file);

       for (i = 0; i < n - 1; i++)
          s += foo (data, i);

       s += bar ();
       fprintf (stderr, "result = %d\n", s);
       return 0;
     }

     /* c.c */
     extern int foo (int *, int);
     int a[4] = {3,4,6,10};
     int bar (void)
     {
        return foo (&a[0], 0) + foo (&a[2], 0);
     }

Assuming the runtime inlining analysis decides that it is beneficial to inline foo into main (it is a hot call, and inlining the call can also expose an SCR opportunity), the module group for a.c will be { a.c }, b.c's module group will be { b.c, a.c }, and c.c's module group will be { c.c }. In the profile-use compilation, a.c and c.c are compiled as usual (trivial module groups), while compiling b.c pulls in a.c as an auxiliary module so that foo can be inlined into main.

LIPO Usage Model

LIPO integrates IPO into FDO seamlessly, so the usage model of LIPO is almost identical to that of FDO. The only difference is that one additional option is needed to turn on the IPO functionality in both the profile-gen and profile-use passes.

Compiler/Linker Options

       -fprofile-generate -fripa        (instrumentation compile)

       -fprofile-generate               (link of the instrumented binary)

       -fprofile-use -fripa             (profile-use compile)

In the current implementation, -fripa turns on only the cross module inlining analysis. To optionally turn on additional whole program analyses (future work), suboptions can be used:

       -fripa=global:pointsto:modref:structlay

Dynamic IPA Runtime Parameters

In the current implementation, runtime control parameters for dynamic IPA are passed via environment variables. The following environment variables are provided:

       GCOV_DYN_CGRAPH_DUMP=1
       GCOV_DYN_CGRAPH_DUMP=2

      GCOV_DYN_CGRAPH_CUTOFF=percent_value   (cumulative edge count cutoff used in hot edge selection)
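
The cutoff operates on the cumulative distribution of call edge counts. Below is a minimal sketch of such a selection rule, assuming edges are already sorted hottest first (the function name and the exact stopping rule are assumptions for illustration, not the actual libgcov code):

```c
/* Sketch: keep the hottest call edges until their cumulative count
   covers cutoff_percent of the total count.  counts[] is assumed to be
   sorted in descending order.  Returns the number of edges kept. */
static int
select_hot_edges (const long long *counts, int n, int cutoff_percent)
{
  long long total = 0, cum = 0;
  int i;

  for (i = 0; i < n; i++)
    total += counts[i];

  for (i = 0; i < n; i++)
    {
      if (cum * 100 >= total * (long long) cutoff_percent)
        break;                /* cutoff reached; remaining edges are cold */
      cum += counts[i];
    }
  return i;
}
```

With the 95% cutoff used in the build time experiment below, an edge count profile of {90, 8, 1, 1} would keep only the two hottest edges.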

More On LIPO Model: Comparison with Existing CMO Frameworks

Comparison with IMA

gcc's C front end supports the -combine option, which is an all-in-memory compilation model: all the input sources are combined and compiled in a single invocation. The main drawbacks of this compilation model are that it does not scale to large programs (all modules are held in memory at once), it loses build parallelism, and it is available for C only.

Compared with IMA, LIPO compiles each module group independently, so it scales to large programs and preserves build parallelism, and only the modules relevant to cross module inlining are grouped together.

Comparison With LTO

In the LTO compilation model, the program compilation is divided into three phases: (1) each source module is compiled into intermediate representation and written to an intermediate object file; (2) at link time, whole program interprocedural analysis is performed over all modules; (3) the backend performs the transformations and final code generation.

As can be seen from this description, the LTO model requires the generation of intermediate files due to the nature of its compilation pipeline. In terms of object file sizes, an intermediate object file is usually 3-6X the size of a regular relocatable object. For very large programs, LTO compilation imposes very large overhead, not only in disk storage but also in compile time, due to the excessive IO involved.

On the other hand, lightweight IPO (LIPO) requires no serialization of intermediate representation, preserves build parallelism, and adds very little compile time overhead, because the IPA analysis is shifted to the training run.

The main drawbacks of LIPO are that it is only available together with FDO, and that a source module may be compiled redundantly as an auxiliary module of several groups (see Compile Time Overhead below).

LIPO Quality Infrastructure

Robustness Testing Strategy

There are two problems related to LIPO's test coverage: regular (non-FDO) testing does not exercise the LIPO code paths at all, and even with FDO, most module groups from typical training runs are trivial, so the multi-module paths get little coverage.

To overcome these limitations, mechanisms are introduced to allow stress testing:

     --param force-lipo=1              (force the cross module paths to be exercised)

     LIPO_RANDOM_GROUPING=seed:max_group_size   (form random module groups of up to max_group_size modules)

The random grouping feature has helped uncover many bugs in the implementation.

Runtime Error Triaging Support

A couple of compiler options have been added to help triage runtime failures in binaries built with LIPO. A common triage strategy for finding the badly optimized object is binary search: mix and match object files from the bad build with object files from a good build. However, non-LIPO object files usually cannot be mixed with LIPO-built object files due to static promotions (see the Static Promotion and Global Externalization section below) -- link time errors may occur. To solve this problem, a compiler option is introduced (currently via an environment variable -- it should probably be a parameter) to limit the maximum group size for each compilation. For triaging purposes, the max group size can be set to 1. This in most cases generates a set of good object files that can be mixed with LIPO objects:

     --param max-lipo-group=<value>

Two debug counters have also been added, which have proven useful:

      -fdbg-cnt=inline:N

      -fdbg-cnt=alias:N

Compile Time Overhead

In the new compilation model, one source file can be an auxiliary module for more than one primary module. This introduces redundancy in compilation, but it is unlikely to become a problem, because most module groups are trivial or small in practice (see the data below), and unneeded auxiliary functions are deleted early to limit backend compile time.

Here is a data point. The largest SPEC2006 C++ program is 483.xalancbmk, with 693 source modules in total. On a two-core 2.4GHz Core 2 machine, the build times (wall clock) are:

Build Time Table

             LIPO    FDO
     -j 1    16m     12m
     -j 4    11m      9m

For the LIPO build, most of the module groups are trivial (single-module groups). The cumulative edge count cutoff is set to 95%.

Module Groups

     Total Number                     693
     Total Trivial                    664
     Max Module Group Size             19
     Average Non-Trivial Group Size   6.4

Some Detailed Notes

Runtime Callgraph Build

At the end of the program execution (the profile collection run), before the profile data is dumped, the program analysis starts. The first task performed is building the callgraph of the program. In the current version, which supports only cross module inlining (the most important IPA transformation), there is no need to build a static callgraph.

To build the dynamic call graph at runtime (during profile collection), the number of times each static call site is dynamically executed needs to be computed or collected. For an indirect call, the call counts of the callee targets come from the value profile counters of that callsite. For a direct function call, there are two alternatives. One is to infer the direct call count from edge/basic block execution counts; for this to work, the raw edge counts need additional annotation, i.e., the basic block graph with callsite information, which complicates the implementation. The other alternative is to rely directly on direct function call profiling. The latter mechanism is what LIPO implements.

New Indirect Call Profiling

The existing indirect call profiling implementation cannot be used for LIPO. The main reason is that the function target id it records is only unique within a given source module. Besides, it is based on value profiling, which can be lossy.

A new indirect call profiling scheme is implemented for LIPO. It can track the N hottest targets (N is currently set to 2). Function identification uses a so-called globally unique id. A globally unique id comprises two parts: the module id, and the funcdef_no, which is the intra-module function id (the old implementation used the callgraph node id). Module ids are allocated dynamically during program initialization, when the modules are registered.
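
Assuming, for illustration, a 64-bit id with the module id in the upper half (the actual encoding in the implementation may differ), the composition could look like:

```c
#include <stdint.h>

/* Sketch: a globally unique function id composed of the module id and
   the intra-module funcdef_no.  The 32/32 bit split shown here is an
   assumption for illustration, not necessarily the real encoding. */
static inline uint64_t
make_global_id (uint32_t module_id, uint32_t funcdef_no)
{
  return ((uint64_t) module_id << 32) | funcdef_no;
}

static inline uint32_t
gid_module (uint64_t gid)
{
  return (uint32_t) (gid >> 32);
}

static inline uint32_t
gid_funcdef (uint64_t gid)
{
  return (uint32_t) gid;
}
```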

Direct Function Call Profiling

Direct call profiling is needed for building the runtime callgraph. To reduce overhead, the instrumentation for direct call profiling is performed after the ipa-inline phase, unlike edge profiling. The profile counters are used only to guide the runtime analysis, not the -fprofile-use compilation. To support profile merging, however, the direct call profiling counters still need to be written out to the gcda files.
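
Conceptually, a direct call profile counter is just a per-callsite counter bumped when the call executes. The sketch below illustrates the idea (the macro and counter array are hypothetical; the real instrumentation is inserted internally by the compiler):

```c
/* Sketch of direct call profiling: every static direct callsite gets a
   counter that is incremented when the call executes.  PROFILED_CALL
   and direct_call_counters are illustrative names, not gcc's real
   instrumentation. */
typedef long long gcov_type;

#define N_CALLSITES 16
static gcov_type direct_call_counters[N_CALLSITES];

#define PROFILED_CALL(site_id, call_expr) \
  (direct_call_counters[site_id]++, (call_expr))

static int
square (int x)
{
  return x * x;
}

int
sum_of_squares (int n)
{
  int i, s = 0;
  for (i = 0; i < n; i++)
    s += PROFILED_CALL (0, square (i));  /* callsite 0 records n calls */
  return s;
}
```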

Source Module Affinity Analysis (from CMI)

Source module affinity analysis is done via function grouping on the dynamic call graph. A greedy algorithm is used: for each node in the dynamic callgraph, follow the outgoing edges and put the callee into the same group if the call edge is hotter than a threshold. This is done transitively. For each callee function in the group whose defining module differs from that of the caller node, the defining module is added as an auxiliary module of the module group, if not already present. Note that the same callee function is allowed to appear in the function groups of multiple different callers.
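
The greedy grouping described above can be sketched as follows (the data structures are simplified stand-ins for the runtime's actual representation):

```c
#include <string.h>

/* Sketch of greedy module grouping: starting from a function node,
   transitively follow call edges hotter than a threshold, and record
   the defining module of every reached callee as a member of the start
   node's module group.  Illustration only. */

#define MAX_NODES 8
#define MAX_MODULES 8

struct edge { int caller, callee; long long count; };

/* module_of[f] is the defining module of function f; on return,
   in_group[m] is nonzero iff module m belongs to the group. */
static void
group_modules (const struct edge *edges, int n_edges,
               const int *module_of, int start_func,
               long long threshold, char in_group[MAX_MODULES])
{
  char visited[MAX_NODES];
  int stack[MAX_NODES], top = 0;

  memset (visited, 0, sizeof visited);
  memset (in_group, 0, MAX_MODULES);
  stack[top++] = start_func;
  visited[start_func] = 1;
  in_group[module_of[start_func]] = 1;  /* the primary module itself */

  while (top > 0)
    {
      int f = stack[--top];
      for (int i = 0; i < n_edges; i++)
        if (edges[i].caller == f && edges[i].count > threshold
            && !visited[edges[i].callee])
          {
            visited[edges[i].callee] = 1;
            in_group[module_of[edges[i].callee]] = 1;
            stack[top++] = edges[i].callee;
          }
    }
}
```

For the trivial example, starting from main with a threshold below the loop trip count would pull foo's defining module a.c into b.c's group, while c.c's cold edge leaves its group trivial.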

Module Grouping Data Format

The module grouping information needs to be written into the profile data file. For a primary module, it takes the form of an array of module information records, one per module in the group, identifying the module (e.g. the source location of the enclosing translation unit).

Multiple Module Parsing Support

For certain languages (such as C++), the name lookup rules are complicated. To parse multiple source modules, simply combining all the sources and treating them as one extended translation unit is very tricky and error prone. Even for a simple language such as C, the -combine implementation suffers from weak tolerance of inconsistencies in user sources -- oftentimes, false errors are emitted about incompatible variables/types, etc. This route was avoided from the very beginning (in fact, this was true originally only for LIPO-CPP; LIPO-C originally leveraged the -combine implementation, but was switched to the C++ LIPO model after usability issues were encountered: front end failures when building gcc, etc.).

The adopted parsing model is simple: namespace/FE symbol table isolation. In other words, each source module maintains its own FE symbol table (decls/types/name bindings), and name lookup does not happen across modules. To achieve this, after each module is parsed, the bindings of names for global entities are cleared. FE-declared builtins are handled specially -- their names are shared. However, before parsing of an auxiliary module starts, the LIPO driver restores their values to the state before parsing of the primary module started. An alternative would be to declare a different set of builtins per module, but this would require significant driver changes; the current scheme works fine.

Multiple Module Profile Annotation Support

Auxiliary functions also need profile data for FDO, so in LIPO mode the profile counter hash table is set up in two steps. The gcda file for the primary module is read in first. Depending on the module information records found in the profile data file, more files may be added to the input file list. If at least one auxiliary module is added, the profile counter table of the primary module is rehashed using the extended (global) function id as the key, and the auxiliary modules' gcda files are read in and entered into the hash table. Profile counter lookup then works the same way as in non-LIPO compilation, but with the global function id as the lookup key.
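
The two-step setup can be pictured as one table keyed by the global function id, populated first from the primary module and then from each auxiliary module. A toy linear-probing table stands in for gcc's real data structure (all names here are illustrative, and the module_id << 32 | funcdef_no key encoding is an assumption):

```c
#include <stdint.h>

/* Sketch: profile counter lookup keyed by the global function id.
   Counters from the primary module and each auxiliary module are all
   entered into the same table.  Illustration only. */

#define TABLE_SIZE 64   /* power of two; toy capacity */

struct counter_entry { uint64_t gid; const long long *counters; };
static struct counter_entry table[TABLE_SIZE];

static unsigned
gid_hash (uint64_t gid)
{
  return (unsigned) (gid * 0x9e3779b97f4a7c15ULL >> 58) & (TABLE_SIZE - 1);
}

static void
enter_counters (uint64_t gid, const long long *counters)
{
  unsigned i = gid_hash (gid);
  while (table[i].counters != 0)        /* linear probing */
    i = (i + 1) & (TABLE_SIZE - 1);
  table[i].gid = gid;
  table[i].counters = counters;
}

static const long long *
lookup_counters (uint64_t gid)
{
  unsigned i = gid_hash (gid);
  while (table[i].counters != 0)
    {
      if (table[i].gid == gid)
        return table[i].counters;
      i = (i + 1) & (TABLE_SIZE - 1);
    }
  return 0;                             /* no profile for this function */
}
```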

In-core Linking

Because name lookup in LIPO's parsing model does not find the same symbol (function/variable) across source module boundaries, multiple declarations may be created for the same entity in one compilation. After parsing is done, these symbols need to be unified. This is done by cgraph node linking and varpool node linking: an undefined cgraph node is resolved to the node of the module that defines it. There are no actual fixups in the gimple code stream. The cgraph linking needs to be done before call graph edges are built, so that edge destinations can be resolved to the real targets.
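
A simplified sketch of the linking step (the structures are illustrative, not gcc's actual cgraph):

```c
#include <string.h>

/* Sketch of in-core cgraph linking: an undefined function node in one
   module is resolved to the node that defines the same symbol in
   another module of the group.  Illustration only. */

struct cg_node
{
  const char *name;
  int defined;               /* has a body in its own module */
  struct cg_node *resolved;  /* the node supplying the definition */
};

static void
link_cgraph (struct cg_node *nodes, int n)
{
  for (int i = 0; i < n; i++)
    {
      nodes[i].resolved = &nodes[i];
      if (!nodes[i].defined)
        for (int j = 0; j < n; j++)
          if (nodes[j].defined && strcmp (nodes[i].name, nodes[j].name) == 0)
            {
              nodes[i].resolved = &nodes[j];  /* resolve to the real target */
              break;
            }
    }
}
```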

Type unification is also needed. For structure types, the equivalence is based on qualified type names, not on structural equivalence. For type based aliasing purposes, the structural scheme can be fragile: a failure to detect structural equivalence can lead to false no-alias information. The name based method, on the other hand, is more robust -- the worst case scenario is overly conservative aliasing. To be more user friendly, an additional check can be performed to warn about inconsistent type definitions with the same name.

Function Linkage

Functions defined in auxiliary modules have special linkage. Most of them behave like extern inline functions: their bodies are needed only for inline expansion, not for code emission, so they can be deleted after ipa-inlining. There are a few exceptions.

Static Promotion and Global Externalization

Static variables and static functions need special handling in both the primary module and the auxiliary modules; global variables in auxiliary modules also need special handling. Static here means the entity has internal linkage, with a name that is either file scoped or function scoped.

Current Status

Functional Quality

In summary, the major performance delivery functionality is almost ready for mainline checkin. A public branch is needed for independent enhancements (see the Future Work section).

Runtime Performance

SPEC2000 Numbers

SPECint2000 Performance (LIPO vs FDO)

     Benchmark        LIPO    FDO    Improvement
     164.gzip         1392    1309     6.3%
     175.vpr          1804    1784     1.1%
     176.gcc          2034    2023     0.5%
     181.mcf          1764    1772    -0.4%
     186.crafty       2509    2324     8.0%
     197.parser       1665    1630     2.1%
     252.eon          2650    2488     6.5%
     253.perlbmk      2519    2510     0.4%
     254.gap          2156    2055     4.9%
     255.vortex       3246    2649    22.5%
     256.bzip2        1810    1790     1.1%
     300.twolf        2398    2395     0.1%
     geomean          2106    2020     4.3%

SPEC2006 Numbers

SPEC2006 Performance (LIPO vs FDO)

Integer benchmarks:

     Benchmark        LIPO    FDO    Improvement
     400.perlbench    19.3    19.0     1.6%
     401.bzip2        13.6    13.7    -0.7%
     403.gcc          13.4    13.2     1.5%
     429.mcf          14.3    14.3     0.0%
     445.gobmk        15.2    14.7     3.4%
     456.hmmer        15.4    15.3     0.6%
     458.sjeng        16.4    15.6     5.1%
     462.libquantum   19.5    19.6    -0.5%
     464.h264ref      23.2    21.1    10.0%
     471.omnetpp      11.7    11.1     5.4%
     473.astar        11.0    10.9     0.9%
     483.xalancbmk    8.52    8.21     3.8%
     geomean          14.62   14.26    2.5%

Floating point benchmarks:

     Benchmark        LIPO    FDO    Improvement
     433.milc         11.2    11.2     0.0%
     444.namd         12.3    12.3     0.0%
     447.dealII       21.3    21.3     0.0%
     453.povray       19.9    17.3    15.0%
     450.soplex       14.9    14.5     2.8%
     470.lbm          14.2    14.2     0.0%
     482.sphinx3      15.3    15.3     0.0%
     geomean          15.21   14.86    2.4%

gcc branch

The branch is svn/gcc/branches/lw-ipo. It is maintained by Xinliang David Li (davidxl@google.com).

(Note: Until it is updated, it is recommended to configure gcc with LIPO support without enabling type checking.)

Contacts

Future Work

None: LightweightIpo (last edited 2013-07-03 14:41:48 by ManuelLopezIbanez)