Remove overall growth from badness metrics

Tue Jan 8 11:46:00 GMT 2019

> > I plan to commit the patch tomorrow after re-testing everything after
> > the bugfixes from today and yesterday.  In addition to this have found
> > that current inline-unit-growth is too small for LTO of large programs
> > (especially Firefox:) and there are important improvements when
> > increased from 20 to 30 or 40.  I am re-running C++ benchmarks and other
> > tests to decide about precise setting.  Finally I plan to increase
> > the new parameters for bit more inlining at -O2 and -Os.
> 
> Usually increasing these parameters might increase the compilation time and the 
> final code size, do you have any data for compilation time and code size impact from
> these parameter change?

Yes, currently LNT is down because some machines apparently ran out of
disk space after christmas, so I can not show you data on that, but I
can show Firefox.  Will make summary of LNT too once it restarts.

In general this parameter affects primarily -O3 builds, becuase -O2
hardly hits the limit. From -O3 only programs with very large units are
affected (-O2 units hits the limit only if you do have a lot of inline
hints in the code).

In my test bed this included Firefox with or without LTO becuase they do
"poor man's" LTO by #including multiple .cpp files into single unified
source which makes average units large.  Also tramp3d, DLV from our C++
benhcmark is affected. 

I have some data on Firefox and I will build remainin ones:

 growth		LTO+PGO	   PGO	     LTO        none      -finline-functions
 20 (default)   83752215   94390023  93085455  103437191  94351191
 40             85299111   97220935  101600151 108910311  115311719
 clang          111520431            114863807 108437807

Build times are within noise of my setup, but they are less pronounced
than the code size difference. I think at most 1 minute out of 100.
Note that Firefox consists of 6% Rust code that is not built by GCC and
and building that consumes over half of the build time.

Problem I am trying to solve here are is to get consistent LTO
performance improvements compared to non-LTO. Currently there are
some regressions:
 https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6ba1ebfe913d152989495d8cb450bce02f27d44&newProject=try&newRevision=c7bd18804e328ed490eab707072b3cf59da91042&framework=1&showOnlyComparable=1&showOnlyImportant=1
All those regressions goes away with limit increase.

I tracked them down to the fact that we do not inline some very small
functions already (such as IsHTMLWhitespace .  In GCC 5 timeframe I
tuned this parameter to 20% based on Firefox LTO benchmarks but I was
not that serious about performance since my setup was not giving very
reproducible results for sub 5% differences on tp5o. Since we plan to
enable LTO by default for Tumbleweed I need to find something that does
not cause too many regression while keeping code size advantage of
non-LTO.

Honza
> 
> thanks.
> 
> Qing
> > 
> > Bootstrapped/regtested x86_64-linux, will commit it tomorrow.
> > 
> > 	* ipa-inline.c (edge_badness): Do not account overall_growth into
> > 	badness metrics.
> > Index: ipa-inline.c
> > ===================================================================
> > --- ipa-inline.c	(revision 267612)
> > +++ ipa-inline.c	(working copy)
> > @@ -1082,8 +1082,8 @@ edge_badness (struct cgraph_edge *edge,
> >   /* When profile is available. Compute badness as:
> > 
> >                  time_saved * caller_count
> > -     goodness =  -------------------------------------------------
> > -	         growth_of_caller * overall_growth * combined_size
> > +     goodness =  --------------------------------
> > +	         growth_of_caller * combined_size
> > 
> >      badness = - goodness
> > 
> > @@ -1094,7 +1094,6 @@ edge_badness (struct cgraph_edge *edge,
> > 	   || caller->count.ipa ().nonzero_p ())
> >     {
> >       sreal numerator, denominator;
> > -      int overall_growth;
> >       sreal inlined_time = compute_inlined_call_time (edge, edge_time);
> > 
> >       numerator = (compute_uninlined_call_time (edge, unspec_edge_time)
> > @@ -1106,73 +1105,6 @@ edge_badness (struct cgraph_edge *edge,
> >       else if (caller->count.ipa ().initialized_p ())
> > 	numerator = numerator >> 11;
> >       denominator = growth;
> > -
> > -      overall_growth = callee_info->growth;
> > -
> > -      /* Look for inliner wrappers of the form:
> > -
> > -	 inline_caller ()
> > -	   {
> > -	     do_fast_job...
> > -	     if (need_more_work)
> > -	       noninline_callee ();
> > -	   }
> > -	 Withhout panilizing this case, we usually inline noninline_callee
> > -	 into the inline_caller because overall_growth is small preventing
> > -	 further inlining of inline_caller.
> > -
> > -	 Penalize only callgraph edges to functions with small overall
> > -	 growth ...
> > -	*/
> > -      if (growth > overall_growth
> > -	  /* ... and having only one caller which is not inlined ... */
> > -	  && callee_info->single_caller
> > -	  && !edge->caller->global.inlined_to
> > -	  /* ... and edges executed only conditionally ... */
> > -	  && edge->sreal_frequency () < 1
> > -	  /* ... consider case where callee is not inline but caller is ... */
> > -	  && ((!DECL_DECLARED_INLINE_P (edge->callee->decl)
> > -	       && DECL_DECLARED_INLINE_P (caller->decl))
> > -	      /* ... or when early optimizers decided to split and edge
> > -		 frequency still indicates splitting is a win ... */
> > -	      || (callee->split_part && !caller->split_part
> > -		  && edge->sreal_frequency () * 100
> > -		     < PARAM_VALUE
> > -			  (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY)
> > -		  /* ... and do not overwrite user specified hints.   */
> > -		  && (!DECL_DECLARED_INLINE_P (edge->callee->decl)
> > -		      || DECL_DECLARED_INLINE_P (caller->decl)))))
> > -	{
> > -	  ipa_fn_summary *caller_info = ipa_fn_summaries->get (caller);
> > -	  int caller_growth = caller_info->growth;
> > -
> > -	  /* Only apply the penalty when caller looks like inline candidate,
> > -	     and it is not called once and.  */
> > -	  if (!caller_info->single_caller && overall_growth < caller_growth
> > -	      && caller_info->inlinable
> > -	      && caller_info->size
> > -		 < (DECL_DECLARED_INLINE_P (caller->decl)
> > -		    ? MAX_INLINE_INSNS_SINGLE : MAX_INLINE_INSNS_AUTO))
> > -	    {
> > -	      if (dump)
> > -		fprintf (dump_file,
> > -			 "     Wrapper penalty. Increasing growth %i to %i\n",
> > -			 overall_growth, caller_growth);
> > -	      overall_growth = caller_growth;
> > -	    }
> > -	}
> > -      if (overall_growth > 0)
> > -        {
> > -	  /* Strongly preffer functions with few callers that can be inlined
> > -	     fully.  The square root here leads to smaller binaries at average.
> > -	     Watch however for extreme cases and return to linear function
> > -	     when growth is large.  */
> > -	  if (overall_growth < 256)
> > -	    overall_growth *= overall_growth;
> > -	  else
> > -	    overall_growth += 256 * 256 - 256;
> > -	  denominator *= overall_growth;
> > -        }
> >       denominator *= ipa_fn_summaries->get (caller)->self_size + growth;
> > 
> >       badness = - numerator / denominator;
> > @@ -1182,18 +1114,14 @@ edge_badness (struct cgraph_edge *edge,
> > 	  fprintf (dump_file,
> > 		   "      %f: guessed profile. frequency %f, count %" PRId64
> > 		   " caller count %" PRId64
> > -		   " time w/o inlining %f, time with inlining %f"
> > -		   " overall growth %i (current) %i (original)"
> > -		   " %i (compensated)\n",
> > +		   " time w/o inlining %f, time with inlining %f\n",
> > 		   badness.to_double (),
> > 		   edge->sreal_frequency ().to_double (),
> > 		   edge->count.ipa ().initialized_p () ? edge->count.ipa ().to_gcov_type () : -1,
> > 		   caller->count.ipa ().initialized_p () ? caller->count.ipa ().to_gcov_type () : -1,
> > 		   compute_uninlined_call_time (edge,
> > 						unspec_edge_time).to_double (),
> > -		   inlined_time.to_double (),
> > -		   estimate_growth (callee),
> > -		   callee_info->growth, overall_growth);
> > +		   inlined_time.to_double ());
> > 	}
> >     }
> >   /* When function local profile is not available or it does not give
>