Reduce inline limits a bit to compensate changes in inlining metrics

Mon Feb 12 09:22:00 GMT 2018

On Fri, 9 Feb 2018, Jan Hubicka wrote:

> Hi,
> this patch addresses the code size regression by reducing 
> max-inline-insns-auto 40->30 and increasing inline-min-speedup 8->15.
> 
> The main reasons why we need retuning are the following:
> 
>  - inline-min-speedup works in a way that if the expected runtime
>    of the caller+callee combo after inlining drops by more than 8%,
>    the inliner is going to bypass inline-insns-auto (because it knows the
>    inlining is beneficial rather than just inlining in the hope it will be).
>    The decrease happens either because the callee is very lightweight on
>    average or because we can track that it will optimize well.
> 
>    During GCC 8 development I switched time estimates from int to sreal.
>    The original estimates capped time at about 1000 instructions, so
>    large functions rarely saw a speedup because it was computed by comparing
>    capped numbers.  With sreals we can now track benefits better.
> 
>  - We made quite some progress on early optimizations, making function
>    bodies appear smaller to the inliner, which in turn inlines more of them.
>    This is the reason why we want to decrease max-inline-insns-auto to gain
>    some code size back.
> 
>    The code size estimate difference at the beginning of inlining is about
>    6% compared to GCC 6 and about 12% compared to GCC 4.9.
> 
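
To make the bypass rule described above concrete, here is a minimal sketch
in C.  It is not the inliner's actual code: the struct, the function names
and the exact comparison operators are assumptions made for illustration;
only the two parameters and their old and new values (40/8 and 30/15) come
from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two --param values touched by the patch.  */
struct inline_params
{
  int max_insns_auto;   /* size limit for automatically inlined callees */
  int min_speedup_pct;  /* speedup in percent that lets us ignore the limit */
};

/* Percentage by which the caller+callee time estimate drops after
   inlining.  Returns 0 when there is no meaningful baseline.  */
static double
speedup_pct (double time_before, double time_after)
{
  if (time_before <= 0.0)
    return 0.0;
  return 100.0 * (time_before - time_after) / time_before;
}

/* Simplified decision (an assumption, not GCC's real heuristic): a callee
   within the size limit is a candidate; a larger one is still accepted
   when the estimated speedup exceeds the threshold.  */
static bool
want_inline_auto (const struct inline_params *p, int callee_size,
                  double time_before, double time_after)
{
  assert (p != NULL);
  if (callee_size <= p->max_insns_auto)
    return true;
  return speedup_pct (time_before, time_after) > p->min_speedup_pct;
}
```

With the old values a callee of size 35 fits the limit outright.  With the
new values it exceeds the limit of 30, so it is inlined only if the
estimated speedup is above 15%.
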
> I have benchmarked the patch on Haswell with SPEC2000, SPEC2006, polyhedron
> and our C++ benchmarks.  Here I found no off-noise changes on SPEC2000/2006.
> I know that reducing inline-insns-auto to 10 still produces no regressions
> and even improves facerec 6600->8000, but that seems to be a bit of good
> luck (it also depends on the setting of branch predictor weights and needs
> to be analyzed independently).  min-speedup can be increased to 30 without
> measurable effects as well.
> 
> On the C++ benchmark suite I know that cray degrades with min-speedup set to
> 30 (it needs a value of 22).  There is also a degradation with
> profile-generate on tramp3d.
> 
> So overall I believe that for Haswell the reduction of inline limits gives a
> very consistent code size improvement without performance tradeoffs.
> 
> I also tested Itanium and here things are slightly more sensitive.  The
> reduction of limits affects gzip 337->332 (-1.5%), vpr 1000->980 (-2%), crafty
> 925->935 (+2%) and vortex 1165->1180 (+1%).  So overall it is specint2000
> neutral.  Reducing inline-insns-auto to 10 brings an off-noise overall
> degradation of 1%, and 20 is in between.
> 
> specfp2000 reacts positively by improving applu 520->525 (+1%) and mgrid
> 391->397 (+1.3%).  It would let me reduce inline-insns-auto to 10 without
> any other regressions.
> 
> C++ benchmarks do not show any off-noise changes.
> 
> I have also done some limited testing on ppc and arm.  They reacted
> similarly to Haswell, also showing no important changes from reducing the
> inlining limits.
> 
> Now reducing inline limits triggers a failure of testsuite/g++.dg/pr83239.C
> which tests that inlining happens.  The reason why it does not happen is
> that ipa-fnsplit is trying to second-guess whether the inliner will eventually
> consider the function for inlining, and the test is out of date.  I decided
> to hack around it for stage4 and will try to clean these things up next stage1.
> 
> Bootstrapped/regtested x86_64-linux.  I know it is late in stage4, but would
> it be OK for GCC 8?

Ok.

Richard.

> 	PR middle-end/83665
> 	* params.def (inline-min-speedup): Increase from 8 to 15.
> 	(max-inline-insns-auto): Decrease from 40 to 30.
> 	* ipa-split.c (consider_split): Add some buffer for function to
> 	be considered inlining candidate.
> 	* invoke.texi (max-inline-insns-auto, inline-min-speedup): Update
> 	default values.
> Index: params.def
> ===================================================================
> --- params.def	(revision 257520)
> +++ params.def	(working copy)
> @@ -52,13 +52,13 @@ DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCO
>  DEFPARAM (PARAM_INLINE_MIN_SPEEDUP,
>  	  "inline-min-speedup",
>  	  "The minimal estimated speedup allowing inliner to ignore inline-insns-single and inline-insns-auto.",
> -	  8, 0, 0)
> +	  15, 0, 0)
>  
>  /* The single function inlining limit. This is the maximum size
>     of a function counted in internal gcc instructions (not in
>     real machine instructions) that is eligible for inlining
>     by the tree inliner.
> -   The default value is 450.
> +   The default value is 400.
>     Only functions marked inline (or methods defined in the class
>     definition for C++) are affected by this.
>     There are more restrictions to inlining: If inlined functions
> @@ -77,11 +77,11 @@ DEFPARAM (PARAM_MAX_INLINE_INSNS_SINGLE,
>     that is applied to functions marked inlined (or defined in the
>     class declaration in C++) given by the "max-inline-insns-single"
>     parameter.
> -   The default value is 40.  */
> +   The default value is 30.  */
>  DEFPARAM (PARAM_MAX_INLINE_INSNS_AUTO,
>  	  "max-inline-insns-auto",
>  	  "The maximum number of instructions when automatically inlining.",
> -	  40, 0, 0)
> +	  30, 0, 0)
>  
>  DEFPARAM (PARAM_MAX_INLINE_INSNS_RECURSIVE,
>  	  "max-inline-insns-recursive",
> Index: ipa-split.c
> ===================================================================
> --- ipa-split.c	(revision 257520)
> +++ ipa-split.c	(working copy)
> @@ -558,10 +558,13 @@ consider_split (struct split_point *curr
>  		 "  Refused: split size is smaller than call overhead\n");
>        return;
>      }
> +  /* FIXME: The logic here is not very precise, because inliner does use
> +     inline predicates to reduce function body size.  We add 10 to anticipate
> +     that.  Next stage1 we should try to be more meaningful here.  */
>    if (current->header_size + call_overhead
>        >= (unsigned int)(DECL_DECLARED_INLINE_P (current_function_decl)
>  			? MAX_INLINE_INSNS_SINGLE
> -			: MAX_INLINE_INSNS_AUTO))
> +			: MAX_INLINE_INSNS_AUTO) + 10)
>      {
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	fprintf (dump_file,
> @@ -574,7 +577,7 @@ consider_split (struct split_point *curr
>       Limit this duplication.  This is consistent with limit in tree-sra.c  
>       FIXME: with LTO we ought to be able to do better!  */
>    if (DECL_ONE_ONLY (current_function_decl)
> -      && current->split_size >= (unsigned int) MAX_INLINE_INSNS_AUTO)
> +      && current->split_size >= (unsigned int) MAX_INLINE_INSNS_AUTO + 10)
>      {
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	fprintf (dump_file,
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi	(revision 257520)
> +++ doc/invoke.texi	(working copy)
> @@ -10131,13 +10131,14 @@ a lot of functions that would otherwise
>  by the compiler are investigated.  To those functions, a different
>  (more restrictive) limit compared to functions declared inline can
>  be applied.
> -The default value is 40.
> +The default value is 30.
>  
>  @item inline-min-speedup
>  When estimated performance improvement of caller + callee runtime exceeds this
>  threshold (in percent), the function can be inlined regardless of the limit on
>  @option{--param max-inline-insns-single} and @option{--param
>  max-inline-insns-auto}.
> +The default value is 15.
>  
>  @item large-function-insns
>  The limit specifying really large functions.  For functions larger than this
> 
> 
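
For reference, the effect of the "+ 10" buffer in the first consider_split
hunk can be sketched as a standalone C function.  This is only an
illustration: the function name, the constant name and the parameter list
are hypothetical, and the real code reads the limits from the params
machinery rather than taking them as arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* The slack added by the patch; the name is made up for this sketch.  */
enum { SPLIT_INLINE_BUFFER = 10 };

/* Returns true when the split would be refused because the header part,
   plus the call overhead, reaches the inline limit plus the buffer.
   Mirrors the shape of the patched condition, nothing more.  */
static bool
header_too_large (unsigned int header_size, unsigned int call_overhead,
                  bool declared_inline, unsigned int max_insns_single,
                  unsigned int max_insns_auto)
{
  /* Limits are expected to be positive in this sketch.  */
  assert (max_insns_single > 0 && max_insns_auto > 0);
  unsigned int limit = (declared_inline ? max_insns_single : max_insns_auto)
                       + SPLIT_INLINE_BUFFER;
  return header_size + call_overhead >= limit;
}
```

With max-inline-insns-auto at 30, a header of size 30 with call overhead 5
is still accepted (35 < 40), whereas without the buffer it would have been
refused (35 >= 30).
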

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)