[PATCH] ipa-inline: Improve growth accumulation for recursive calls

Wed Aug 12 17:53:07 GMT 2020

Hello,
with Martin we spent some time looking into exchange2 and my
understanding of the problem is the following:

There is the self recursive function digits_2 with the property that it
has 10 nested loops and calls itself from the innermost.
Now we do not do amazing job on guessing the profile since it is quite
atypical. First observation is that the callback frequencly needs to be
less than 1 otherwise the program never terminates, however with 10
nested loops one needs to predict every loop to iterate just few times
and conditionals guarding them as not very likely. For that we added
PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
(causing regression in exhange since the bad profile turned out to
disable some harmful vectorization) and I also now added a cap to the
self recursive frequency so things to not get mispropagated by ipa-cp.

Now if ipa-cp decides to duplicate digits few times we have a new
problem.  The tree of recursion is orgnaized in a way that the depth is
bounded by 10 (which GCC does not know) and moreover most time is not
spent on very deep levels of recursion.

For that you have the patch which increases frequencies of recursively
cloned nodes, however it still seems to me as very specific hack for
exchange: I do not see how to guess where most of time is spent. 
Even for very regular trees, by master theorem, it depends on very
little differences in the estimates of recursion frequency whether most
of time is spent on the top of tree, bottom or things are balanced.

With algorithms doing backtracing, like exhchange, the likelyness of
recusion reduces with deeper recursion level, but we do not know how
quickly and what the level is.

> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> 
> For SPEC2017 exchange2, there is a large recursive functiondigits_2(function
> size 1300) generates specialized node from digits_2.1 to digits_2.8 with added
> build option:
> 
> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> 
> ipa-inline pass will consider inline these nodes called only once, but these
> large functions inlined too deeply will cause serious register spill and
> performance down as followed.
> 
> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8
> 
> Performance diff:
> inlineB is ~25% faster than inlineA;
> inlineC is ~20% faster than inlineB;
> inlineD is ~30% faster than inlineC.
> 
> The master GCC code now generates inline sequence like inlineB, this patch
> makes the ipa-inline pass behavior like inlineD by:
>  1) The growth acumulation for recursive calls by adding the growth data
> to the edge when edge's caller is inlined into another function to avoid
> inline too deeply;
>  2) And if caller and callee are both specialized from same node, the edge
> should also be considered as recursive edge.
> 
> SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without exchange2).
> Any comments?  Thanks.
> 
> 523.xalancbmk_r +1.32%
> 541.leela_r     +1.51%
> 548.exchange2_r +31.87%
> 507.cactuBSSN_r +0.80%
> 526.blender_r   +1.25%
> 538.imagick_r   +1.82%
> 
> gcc/ChangeLog:
> 
> 2020-08-12  Xionghu Luo  <luoxhu@linux.ibm.com>
> 
> 	* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
> 	callee and specialized from same node.
> 	* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
> 	inlined_to growth to edge whose caller is inlined.
> ---
>  gcc/cgraph.h              | 2 ++
>  gcc/ipa-inline-analysis.c | 3 +++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> index 0211f08964f..11903ac1960 100644
> --- a/gcc/cgraph.h
> +++ b/gcc/cgraph.h
> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>    cgraph_node *c = callee->ultimate_alias_target ();
>    if (caller->inlined_to)
>      return caller->inlined_to->decl == c->decl;
> +  else if (caller->clone_of && c->clone_of)
> +    return caller->clone_of->decl == c->clone_of->decl;
>    else
>      return caller->decl == c->decl;

If you clone the function so it is no longer self recursive, it does not
make much sense to lie to optimizers that the function is still
recursive.

The inlining would be harmful even if the programer did cloning by hand.
I guess main problem is the extreme register pressure issue combining
loop depth of 10 in caller with loop depth of 10 in callee just because
the function is called once.

The negative effect is most likely also due to wrong profile estimate
which drives IRA to optimize wrong spot.  But I wonder if we simply
don't want to teach inlining function called once to not construct large
loop depths?  Something like do not inline if caller&callee loop depth
is over 3 or so?

Honza
>  }
> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
> index 148efbc09ef..ba0cf836364 100644
> --- a/gcc/ipa-inline-analysis.c
> +++ b/gcc/ipa-inline-analysis.c
> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
>  	  continue;
>  	}
>        d->growth += estimate_edge_growth (e);
> +      if (e->caller->inlined_to)
> +	d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
> +
>        if (d->growth > d->cap)
>  	return true;
>      }
> -- 
> 2.25.1
>