Bug 92074 - [10 regression] 26% performance regression on Spec2017 548.exchange2_r
Summary: [10 regression] 26% performance regression on Spec2017 548.exchange2_r
Status: CLOSED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa (show other bugs)
Version: 10.0
: P3 normal
Target Milestone: 10.0
Assignee: Jan Hubicka
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
  Show dependency treegraph
 
Reported: 2019-10-12 02:09 UTC by Kewen Lin
Modified: 2019-11-01 09:35 UTC (History)
5 users (show)

See Also:
Host:
Target: powerpc64le-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-10-14 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Kewen Lin 2019-10-12 02:09:10 UTC
We found r276516 caused 26% performance degradation on SPEC2017 548.exchange2_r on powerpc64le.

commit a3a17776a15a425662183c39700210bc76928a43
Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Oct 3 15:08:21 2019 +0000

            * params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT,
            PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2): New.
            * doc/invoke.texi (inline-heuristics-hint-percent,
            inline-heuristics-hint-percent-O2): Document.
            * tree-inline.c (inline_insns_single, inline_insns_auto): Add new
            hint attribute.
            (can_inline_edge_by_limits_p): Use it.


    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@276516 138bc75d-0d04-0410-961f-82ee72b054a4

The option set used is: -Ofast -mcpu=power9 -fpeel-loops -funroll-loops -mrecip

cpu info: POWER9, altivec supported
Comment 1 Bill Schmidt 2019-10-14 14:58:17 UTC
This is a pretty serious regression.  Should the patch be reverted until the problem can be sorted out?
Comment 2 Jan Hubicka 2019-10-14 20:06:53 UTC
The regression is because we now inline covered into digits2:

IPA function summary for digits_2/29 inlinable
  global time:     1553.078985
  self size:       1295
  global size:     1295
  min size:       0
  self stack:      261
  global stack:    261
    size:981.000000, time:1505.442572
    size:3.000000, time:1.999121,  executed if:(not inlined)
    size:0.500000, time:0.500000,  executed if:(not inlined),  nonconst if:(op0[ref offset: 0] changed) && (not inlined)
    size:210.500000, time:27.456610,  nonconst if:(op0[ref offset: 0] changed)
    size:21.000000, time:3.795164,  executed if:(op0[ref offset: 0] == 5)
    size:6.000000, time:0.334389,  executed if:(op0[ref offset: 0] != 8)
    size:1.000000, time:0.033237,  executed if:(op0[ref offset: 0] != 8),  nonconst if:(op0[ref offset: 0] changed) && (op0[ref offset: 0] != 8)
    size:66.000000, time:13.130882,  executed if:(op0[ref offset: 0] == 8)
  loop iterations:(op0[ref offset: 0] changed)
  calls:
    digits_2/29 function not considered for inlining
      loop depth: 9 freq:0.03 size: 2 time: 11callee size:647 stack:261 predicate: (op0[ref offset: 0] != 8)
       op0 is compile time invariant
    covered.constprop/93 function not considered for inlining
      loop depth: 9 freq:0.00 size: 4 time: 13callee size:214 stack:1472 predicate: (op0[ref offset: 0] == 8)
       op0 is compile time invariant
       op1 is compile time invariant

digits_2 is quite deeply recursive and inlining quite expensive function "covered" does not help. 

This can be solved by --param inline-heuristics-hint-percent=600
the current default of 1600 is way too high and I scheduled some benchmarks to tune it down but unfortunately our LNT benchmarking is down currently. (I would like to see it reduced to even lower value if polyhedron and SPEC testing is happy about that)

Generally it would be nice if inliner understood that inlining into self recursive functions on the path that is not going to recursion may be harmful. This we do not model and thus this works/does not work sort of randomly.
Comment 3 Xiong Hu XS Luo 2019-10-15 08:17:31 UTC
(In reply to Jan Hubicka from comment #2)
> The regression is because we now inline covered into digits2:
> 
> IPA function summary for digits_2/29 inlinable
>   global time:     1553.078985
>   self size:       1295
>   global size:     1295
>   min size:       0
>   self stack:      261
>   global stack:    261
>     size:981.000000, time:1505.442572
>     size:3.000000, time:1.999121,  executed if:(not inlined)
>     size:0.500000, time:0.500000,  executed if:(not inlined),  nonconst
> if:(op0[ref offset: 0] changed) && (not inlined)
>     size:210.500000, time:27.456610,  nonconst if:(op0[ref offset: 0]
> changed)
>     size:21.000000, time:3.795164,  executed if:(op0[ref offset: 0] == 5)
>     size:6.000000, time:0.334389,  executed if:(op0[ref offset: 0] != 8)
>     size:1.000000, time:0.033237,  executed if:(op0[ref offset: 0] != 8), 
> nonconst if:(op0[ref offset: 0] changed) && (op0[ref offset: 0] != 8)
>     size:66.000000, time:13.130882,  executed if:(op0[ref offset: 0] == 8)
>   loop iterations:(op0[ref offset: 0] changed)
>   calls:
>     digits_2/29 function not considered for inlining
>       loop depth: 9 freq:0.03 size: 2 time: 11callee size:647 stack:261
> predicate: (op0[ref offset: 0] != 8)
>        op0 is compile time invariant
>     covered.constprop/93 function not considered for inlining
>       loop depth: 9 freq:0.00 size: 4 time: 13callee size:214 stack:1472
> predicate: (op0[ref offset: 0] == 8)
>        op0 is compile time invariant
>        op1 is compile time invariant
> 
> digits_2 is quite deeply recursive and inlining quite expensive function
> "covered" does not help. 

Hi Honza,

I am analyzing the exchange2 of the recursive call digits_2(int k), this is not relevant with current PR. Sorry for distracting. 

In Fortran, k is pass by reference instead of pass by value, the new IPA-SRA could do the SRA and convert it to pass by value with some workaround, but ipa-sra is running after ipa-cp, and ipa-cp is not able to leverage the SRA results in WPA stage.

As digits_2 consumes most of the run time, and the input param value increases from 1 to 9, if manually convert the recursive call to non-recursive call like:
 case(1) call digits_2_1(); ... case(9) call digits_2_9();

The performance will go up for about 60%.

So there may be possible methods to do such kind of optimization:
1. Enable profile with value range and probability, save the input param k's value range to be [1, 9] 90%, ~[1, 9] 10%, then ipa-cp and ipa-sra could do recursive const propagation for digits_2 to generate digits_2.constprop1, digits_2.constprop2, etc. It would be a combined optimization of ipa-profile, ipa-cp, ipa-sra. This would be complicated as ipa-cp doesn't support recursive const prop and pass by reference prop with operands yet(like *(&k)+1).
2. Or use an independent pass(I am not sure whether it already exists in current GCC) to do the recursive to non-recursive call conversion like manual way for HOT recursive calls, then ipa-cp could do the const prop as usual. 

Any suggestion about this, please? Thanks.



> 
> This can be solved by --param inline-heuristics-hint-percent=600
> the current default of 1600 is way too high and I scheduled some benchmarks
> to tune it down but unfortunately our LNT benchmarking is down currently. (I
> would like to see it reduced to even lower value if polyhedron and SPEC
> testing is happy about that)
> 
> Generally it would be nice if inliner understood that inlining into self
> recursive functions on the path that is not going to recursion may be
> harmful. This we do not model and thus this works/does not work sort of
> randomly.
Comment 4 Hongtao.liu 2019-10-22 05:30:19 UTC
Same regression on skylake.
Comment 5 Martin Liška 2019-10-22 09:33:04 UTC
(In reply to Hongtao.liu from comment #4)
> Same regression on skylake.

Confirmed and same happens for znver1:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=32.407.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=33.407.0
Comment 6 Jan Hubicka 2019-10-23 14:45:56 UTC
Author: hubicka
Date: Wed Oct 23 14:45:24 2019
New Revision: 277333

URL: https://gcc.gnu.org/viewcvs?rev=277333&root=gcc&view=rev
Log:
	PR ipa/92074
	* params.def (inline-heuristics-hint-percent): Set to 600.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/params.def
Comment 7 Kewen Lin 2019-10-24 03:01:58 UTC
Verified and confirm the commit can recover the number.
Comment 8 Kewen Lin 2019-11-01 09:35:12 UTC
Closed it.