Automatic Feedback Directed Optimizer

What is AutoFDO

AutoFDO uses sampling based profile to drive feedback directed optimizations.

AutoFDO uses perf to collect sample profiles. A standalone tool is used to convert the perf.data file into gcov format. GCC reads in the gcov file and interprets the profile into a set of hashmaps. A standalone pass is added to use the processed profile data to annotate the basic block counts and estimate branch probabilities.

Phases

There are three phases in AutoFDO:

Phase 1: Read profile from the profile data file

The following info is read from the profile gcov file:

Phase 1 just reads in data without processing it. It is invoked before tree parsing because LIPO needs module profile before tree parsing. (read_aux_modules)

Phase 2: Process profile to build internal data structure (hashmap)

This is done after tree parsing, because the processing requires the map from function name to its debug name (bfd_name). The following hashmaps is used to store profile.

Phase 3: Annotate control flow graph

AutoFDO invokes a standalone pass over the control flow graph to:

Beyond Phase 3: Changes in other optimizations to make AutoFDO work

After the above 3 phases, all profile is readily annotated on the GCC IR. AutoFDO tries to reuse all FDO infrastructure as much as possible to make use of the profile. E.g. it uses existing mechanism to calculate the basic block/edge frequency, as well as the cgraph node/edge count.

However, AutoFDO still differs from FDO in the following aspects:

Accuracy

The profile used by AutoFDO is not accurate because:

Because of the inaccuracy, some optimization needs to be adjusted:

Cloned function body

The major difference between AutoFDO and FDO is that AutoFDO profiles on optimized binary instead of instrumented binary. This makes it very different in handling cloned functions.

Suppose function bar() was called by both foo1() and foo2(). Assume neither callsite is inlined in early inliner. There will be only one copy of bar() profile in the FDO profile. During the FDO optimize build, if bar() is inlined, its profile count will be scaled proportional to its callsite count.

In AutoFDO, the O2-optimized binary is used for profile. Let us assume bar() is inlined into both foo1() and foo2() in the O2-optimized binary. Then there will be two copies of bar() profile in the AutoFDO profile. Both copies reside in callers' profile, noted as foo1:bar and foo2:bar. Meanwhile there is no standalone profile for bar(). During the AutoFDO optimize build, when annotating bar(), there is no profile available.

One may think merging profiles from all copies into one can solve the problem. But new issue arises for very small functions that are inline in the early inline phase. Each of these small functions may have hundreds of copies in the O2-optimized binary, and each one of them may have completely different behavior (that is why FDO instrumentation happens after early inline). As a result, merging profiles for these small functions and scale them during inlining will cause significant accuracy loss. Unfortunately, because AutoFDO does not instrument at all, it cannot distinguish those early-inlined copies with those ipa-inlined copies.

In AutoFDO implementation, we do not merge profiles for different copies. Instead, when a function body is cloned, we will find if the cloned body has profile available in the O2-optimized binary and use it to annotate the cloned function. If no profile is found in the O2-optimized binary, then GCC falls back to use the scaling to annotate the cloned function. This promotes the following changes in GCC:

None: AutoFDO (last edited 2012-10-09 21:12:47 by 216)