Automatic Feedback Directed Optimizer
What is AutoFDO
AutoFDO uses a sampling-based profile to drive feedback-directed optimizations.
AutoFDO uses perf to collect sample profiles. A standalone tool converts the perf.data file into gcov format. GCC reads the gcov file and interprets the profile into a set of hashmaps. A standalone pass then uses the processed profile data to annotate basic block counts and estimate branch probabilities.
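To make the pipeline concrete, a typical end-to-end use looks roughly like the following. The binary name, workload input, and perf event are illustrative; create_gcov is the standalone converter from the external autofdo tool suite:

```shell
# 1. Build an optimized binary with debug info (AutoFDO maps samples via debug info).
gcc -O2 -g -o myapp myapp.c

# 2. Collect a sample profile with perf; -b records branch stacks, which
#    improves accuracy on hardware that supports it (event name is Intel-specific).
perf record -b -e br_inst_retired.near_taken -- ./myapp benchmark-input

# 3. Convert perf.data into the gcov format that GCC expects.
create_gcov --binary=./myapp --profile=perf.data --gcov=myapp.afdo

# 4. Rebuild, letting GCC read the sample profile.
gcc -O2 -fauto-profile=myapp.afdo -o myapp myapp.c
```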
Phases
There are three phases in AutoFDO:
Phase 1: Read profile from the profile data file
The following info is read from the profile gcov file:
- Function names and file names.
- Source level profile, which is a mapping from inline stack to its sample counts.
- Module profile: Module to aux-modules mapping
Phase 1 just reads in the data without processing it. It is invoked before tree parsing because LIPO needs the module profile before tree parsing (read_aux_modules).
Phase 2: Process profile to build internal data structure (hashmap)
This is done after tree parsing, because the processing requires the map from function name to debug name (bfd_name). The following hashmaps are used to store the profile:
- function_htab: map from function_name to its entry_bb count
- stack_htab: map from inline stack to its sample count
- bfd_name_htab: map from function name to its debug name (bfd_name)
- module_htab: map from module name to its aux-module names
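The four tables can be pictured with ordinary C++ containers. This is a simplified sketch mirroring the list above, not GCC's actual declarations (GCC uses its own htab/hash_map types):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// An inline stack: outermost caller first, each frame a (function, line) pair.
using inline_stack = std::vector<std::pair<std::string, int>>;

// function_htab: function name -> entry basic-block sample count.
std::map<std::string, uint64_t> function_htab;

// stack_htab: inline stack -> sample count attributed to that stack.
std::map<inline_stack, uint64_t> stack_htab;

// bfd_name_htab: function name -> debug (bfd) name.
std::map<std::string, std::string> bfd_name_htab;

// module_htab: module name -> auxiliary module names pulled in by LIPO.
std::map<std::string, std::vector<std::string>> module_htab;
```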
Phase 3: Annotate control flow graph
AutoFDO invokes a standalone pass over the control flow graph to:
- Annotate basic block count
- Estimate branch probability
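The core of the annotation step can be sketched as follows: once block counts are annotated, each outgoing edge receives a probability proportional to its destination block's count. This is a simplification of what the pass actually does, with illustrative types:

```cpp
#include <cstdint>
#include <vector>

struct edge { int dest; double probability; };
struct basic_block { uint64_t count; std::vector<edge> succs; };

// Estimate branch probabilities from annotated block counts: each successor
// edge's probability is its destination's count divided by the sum of all
// successors' counts. With no samples, assume an even split.
void estimate_branch_probabilities(std::vector<basic_block>& cfg, int bb_index) {
    basic_block& bb = cfg[bb_index];
    uint64_t total = 0;
    for (const edge& e : bb.succs)
        total += cfg[e.dest].count;
    for (edge& e : bb.succs)
        e.probability = total ? (double)cfg[e.dest].count / total
                              : 1.0 / bb.succs.size();
}
```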
Beyond Phase 3: Changes in other optimizations to make AutoFDO work
After the above three phases, the entire profile is annotated on the GCC IR. AutoFDO reuses the existing FDO infrastructure as much as possible to make use of the profile. E.g., it uses the existing mechanisms to calculate basic block/edge frequencies as well as cgraph node/edge counts.
However, AutoFDO still differs from FDO in the following aspects:
Accuracy
The profile used by AutoFDO is not accurate because:
- AutoFDO profile is collected using a statistical approach, which may have random errors. This is significant when the total sample count is small.
- AutoFDO profile is a scaled profile; as a result, 0 samples does not mean the code was never executed.
- AutoFDO profile is represented using debug info, which may be inaccurate in optimized binaries. As a result, some hot basic blocks may have a zero sample count.
Because of this inaccuracy, some optimizations need to be adjusted:
- In register allocation, REG_FREQ_FROM_BB checks whether the entry_bb count is zero to decide whether to use frequency info. In AutoFDO mode, this check should be disabled.
- In loop unrolling, a loop header may have the same count as the loop latch, so the loop iteration count is estimated incorrectly. We disable loop unrolling for such loops.
- When cloning a function body, FDO uses a 100% scaling factor when the callee count is 0. However, AutoFDO uses a 0% scaling factor, because the function body may still be hot even if the entry count is 0.
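The scaling-factor choice in the last point can be sketched as below. The function name and signature are hypothetical, not GCC's:

```cpp
#include <cstdint>

// Scale factor applied to body counts when cloning a callee into a callsite.
// When the callee's entry count is zero, FDO keeps the counts as-is (100%):
// an instrumented zero entry count means the body was never run. AutoFDO
// zeroes the clone (0%) instead, because with a sampled profile the body
// counts may be nonzero (hot) even though the entry sampled as zero, so
// scaling by callsite/entry would be meaningless.
double clone_scale(uint64_t callsite_count, uint64_t callee_entry_count,
                   bool auto_fdo) {
    if (callee_entry_count == 0)
        return auto_fdo ? 0.0 : 1.0;
    return (double)callsite_count / callee_entry_count;
}
```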
Cloned function body
The major difference between AutoFDO and FDO is that AutoFDO profiles an optimized binary instead of an instrumented one. This makes the handling of cloned functions very different.
Suppose function bar() is called by both foo1() and foo2(), and assume neither callsite is inlined by the early inliner. There will be only one copy of bar()'s profile in the FDO profile. During the FDO optimized build, if bar() is inlined, its profile counts are scaled in proportion to the callsite count.
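The FDO scaling can be shown with a small worked example: if bar()'s entry count is 1000 and the callsite in foo1() is reached 300 times, each block count in the inlined copy is scaled by 300/1000 (a sketch; helper name is illustrative):

```cpp
#include <cstdint>
#include <vector>

// FDO keeps one profile for the callee; when it is inlined, every block
// count in the inlined copy is scaled by callsite_count / callee_entry_count.
std::vector<uint64_t> scale_inlined_counts(const std::vector<uint64_t>& bb_counts,
                                           uint64_t callsite_count,
                                           uint64_t callee_entry_count) {
    std::vector<uint64_t> scaled;
    for (uint64_t c : bb_counts)
        scaled.push_back(c * callsite_count / callee_entry_count);
    return scaled;
}
```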
In AutoFDO, the O2-optimized binary is used for profiling. Let us assume bar() is inlined into both foo1() and foo2() in the O2-optimized binary. Then there will be two copies of bar()'s profile in the AutoFDO profile. Both copies reside in the callers' profiles, denoted foo1:bar and foo2:bar, and there is no standalone profile for bar(). During the AutoFDO optimized build, when annotating bar(), no profile is available.
One might think that merging the profiles of all copies into one would solve the problem. But a new issue arises for very small functions that are inlined in the early-inline phase. Each of these small functions may have hundreds of copies in the O2-optimized binary, and each of them may have completely different behavior (that is why FDO instrumentation happens after early inlining). As a result, merging the profiles of these small functions and scaling them during inlining would cause a significant loss of accuracy. Unfortunately, because AutoFDO does not instrument at all, it cannot distinguish early-inlined copies from ipa-inlined copies.
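Since profile copies are keyed by their inline location (foo1:bar, foo2:bar), annotating a clone amounts to a lookup with a scaling fallback. A sketch, with illustrative key format and names:

```cpp
#include <cstdint>
#include <map>
#include <string>

// AutoFDO profile: the copy of bar() inlined inside foo1() is stored
// under a combined name such as "foo1:bar" (key format is illustrative).
using profile_map = std::map<std::string, uint64_t>;

// Prefer the dedicated profile copy recorded for this callsite in the
// O2-optimized binary; fall back to proportional scaling only if no
// such copy exists.
uint64_t annotate_clone(const profile_map& profile,
                        const std::string& caller, const std::string& callee,
                        uint64_t callsite_count, uint64_t callee_entry_count) {
    auto it = profile.find(caller + ":" + callee);
    if (it != profile.end())
        return it->second;  // dedicated copy found: use its count directly
    if (callee_entry_count == 0 || profile.count(callee) == 0)
        return 0;           // no profile and no samples: annotate as cold
    // Fallback: scale the standalone callee count by the callsite ratio.
    return profile.at(callee) * callsite_count / callee_entry_count;
}
```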
In the AutoFDO implementation, we do not merge the profiles of different copies. Instead, when a function body is cloned, we check whether the cloned body has a profile available from the O2-optimized binary and, if so, use it to annotate the cloned function. If no profile is found in the O2-optimized binary, GCC falls back to scaling to annotate the cloned function. This prompts the following changes in GCC:
- When calculating the badness factor for a cgraph edge, even if its count is zero, the callsite may still be hot because it was inlined in the O2-optimized binary. However, it is difficult to get the entry block count of the inlined callee (i.e. the cgraph edge count). As a result, we add another badness, afdo_badness, which compares the total count inside the callee to the maximum total count. We compare afdo_badness with the traditional badness and choose the minimum as the adjusted badness factor.
- When ipa-inline checks whether an edge is hot (edge_hot_enough_p), AutoFDO also checks whether the total count inside the callsite is larger than the hotness threshold.
- When ipa-inline decides to inline a callsite, AutoFDO tries to find a dedicated profile copy of the inlined callee node and uses its total_count to update the callee node (possibly cloned). If no dedicated profile copy is found, it falls back to scaling to update the inlined callee count, and adds the scaling factor to a hashmap so that when the function body is cloned (in tree-inline.c), the correct ratio to scale the profile can be found.