[PATCH] ipa-inline: Improve growth accumulation for recursive calls
Fri Aug 14 09:04:47 GMT 2020
On 2020/8/13 20:52, Jan Hubicka wrote:
>> Since there are no other callers outside of these specialized nodes, the
>> guessed profile count should be same equal? Perf tool shows that even
>> each specialized node is called only once, none of them take same time for
>> each call:
>> 40.65% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.4
>> 16.31% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.3
>> 10.91% exchange2_gcc.o libgfortran.so.5.0.0 [.] _gfortran_mminloc0_4_i4
>> 5.41% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.6
>> 4.68% exchange2_gcc.o exchange2_gcc.orig.slow [.] __logic_MOD_new_solver
>> 3.76% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.5
>> 1.07% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.7
>> 0.84% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.0
>> 0.47% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.2
>> 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.1
>> 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_covered.constprop.0
>> 0.11% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_reflected.constprop.0
>> 0.00% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.1
>> digits_2.constprop.4 & digits_2.constprop.3 takes most of the execution time,
>> So profile count and frequency seem not very helpful for this case?
> Yep, you can not really determine the time spent on each of recursion
> levels from the recursive edge probability since you can not assume it
> to be the same on each level of recursion (here we know it is 0 at level
> 10 and it is slowly dropping as the recursion increases because there
> are fewer posiblities to fill up the partly filled sudoku:).
> Especially you can not assume it after the ipa-cp did the work to figure
> out that there is recursion depth counter that affect the function.
> Thinking of the ipa-cp profile updating changes, I did not consider safe
> the approach of adding extra weight to the recursive call. The problem
> is that the recursive call frequency estimate, when comming from static
> profile stimation, is likely completely off, just like static profile
> estimator can not be used to predict realistically the number of
> iterations of loops. However even with profile feedback this is hard to
> I was wondering if we can not simply detect this scenario and avoid
> using the recursive edge probability in the profile updating.
> We could simply scale by the number of copies.
> This means that if we produce 10 clones, we could just set each clone to
> 1/10th of the frequency. This implies that the hottest spot will not be
> underestimated expontentially much as can happen now, but just 10 times
> at worst, wich is probably still OK. We play similar games in loop
> optimizers: static profile estimators expect every loop to trip around 3
> times so unroller/vectorizer/etc would make no sense on them.
> By scaling down according to number of copies we will keep the
> frequencies of calls to function called from clones "sane". We will
> still run into problems when we inline one clone to another (sine its
> proflie will get scaled by the recursive edge frequency), but perhaps
> that can be special cases in profiler or make ipa-cp to adjust the
> recursive edges to get frequency close to 1 as a hack.
> This is not pretty, but at least it looks safer to me than the original
> profile updating patch adding extra weight to the frequency.
Really appreciate for your detailed explanation. BTW, My previous patch
for PGO build on exchange2 takes this similar method by setting each cloned
node to 1/10th of the frequency several month agao :)
More information about the Gcc-patches