Bug 55674 - [4.8 Regression] >20% size increase of lto/pgo binaries since r193747
Summary: [4.8 Regression] >20% size increase of lto/pgo binaries since r193747
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: gcov-profile (show other bugs)
Version: 4.8.0
Importance: P3 normal
Target Milestone: 4.8.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Duplicates: 55669 (view as bug list)
Depends on:
Blocks:
 
Reported: 2012-12-13 12:56 UTC by Markus Trippelsdorf
Modified: 2016-10-02 19:58 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Description Markus Trippelsdorf 2012-12-13 12:56:29 UTC
See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 onwards.

Binaries built with "-flto -fprofile-use" increased more than 20% in size
since revision 193747 went in.

libxul's size increased ~20%:
34MB before, 42MB now.

This also happens for simpler testcases, e.g. tramp3d-v4.cpp (30% increase):
0.9MB before, 1.3MB now.

Setting "--param hot-bb-count-ws-permille" to lower values doesn't help.
Comment 1 Teresa Johnson 2012-12-13 14:45:01 UTC
I'm really surprised that using --param hot-bb-count-ws-permille=950 didn't help, since even fewer things should look hot enough to inline than before the revision.

Would you send me the new -fdump-ipa-inline output? 

Thanks,
Teresa
Comment 2 Markus Trippelsdorf 2012-12-13 14:46:31 UTC
In the tramp3d-v4 case, when I run with this simple debug patch:

diff --git a/gcc/predict.c b/gcc/predict.c
index 5d3de29..bf3a259 100644
--- a/gcc/predict.c
+++ b/gcc/predict.c
@@ -147,6 +147,7 @@ maybe_hot_count_p (struct function *fun, gcov_type count)
       gcc_assert (ws);
       min_count = ws->min_counter;
     }
+  fprintf (stderr, "count=%i min_count=%i profile_info->sum_max=%i\n", count, min_count, profile_info->sum_max);
   return (count >= min_count);
 }

it shows that in the middle of the output, min_count changes from 
8585 to 0:

... ~600000 similar lines with  min_count=8585
count=3 min_count=8585 profile_info->sum_max=257406300
count=3 min_count=8585 profile_info->sum_max=257406300
count=3 min_count=8585 profile_info->sum_max=257406300
count=3 min_count=8585 profile_info->sum_max=257406300
count=40 min_count=0 profile_info->sum_max=257406300
count=61 min_count=0 profile_info->sum_max=257406300
count=40 min_count=0 profile_info->sum_max=257406300
count=40 min_count=0 profile_info->sum_max=257406300
... ~600000 similar lines with  min_count=0
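The min_count value in this output comes from the gcov working-set machinery: for a permille cutoff p, it is roughly the smallest counter among the hottest blocks that together cover p/1000 of the total profiled count. A minimal sketch of that idea (an illustration only; `ws_min_counter` and the pre-sorted input are invented for the example, not GCC's actual `find_working_set`):

```c
#include <assert.h>

typedef long long gcov_type;

/* Hypothetical sketch: return the working-set minimum counter for a
   permille cutoff, i.e. the smallest counter value among the hottest
   blocks that together cover `permille`/1000 of the total profiled
   count.  Counters are assumed sorted in descending order.  */
static gcov_type
ws_min_counter (const gcov_type *sorted_desc, int n, int permille)
{
  gcov_type total = 0, cum = 0;
  for (int i = 0; i < n; i++)
    total += sorted_desc[i];

  gcov_type cutoff = total * permille / 1000;
  for (int i = 0; i < n; i++)
    {
      cum += sorted_desc[i];
      if (cum >= cutoff)
        return sorted_desc[i];
    }
  return 0;
}
```

Note how skewed profiles behave: with one dominant counter, a 900‰ cutoff is covered by the hottest block alone (large min_count, almost nothing else is hot), while 999‰ reaches far down into the cold tail (small min_count, much more looks hot). This is why raising the permille increases inlining.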
Comment 3 Teresa Johnson 2012-12-13 14:49:19 UTC
Hi Markus,

Are you sure you have my subsequent fixes patched in, to make sure the
histogram is getting streamed through the LTO files? This was the
behavior I saw when I was debugging the original issue that I fixed
with those patches. Basically, the hotness checks before we went into
LTO were good, and after LTO the min count was 0.

Teresa

Comment 4 Markus Trippelsdorf 2012-12-13 14:52:37 UTC
(In reply to comment #3)
> Hi Markus,
> 
> Are you sure you have my subsequent fixes patched in, to make sure the
> histogram is getting streamed through the LTO files? This was the
> behavior I saw when I was debugging the original issue that I fixed
> with those patches. Basically, the hotness checks before we went into
> LTO were good, and after LTO the min count was 0.
> 
Hi Teresa,

Yes, my gcc is up-to-date (from today's git).
Comment 5 Teresa Johnson 2012-12-13 15:02:55 UTC
Ok, I will download tramp3d-v4 right now and see what is going on. Can
you send me the full set of options you are using to compile it?

Teresa

Comment 6 Markus Trippelsdorf 2012-12-13 15:06:12 UTC
(In reply to comment #5)
> Ok, I will download tramp3d-v4 right now and see what is going on. Can
> you send me the full set of options you are using to compile it?
> 

g++ -w -O3 -fprofile-generate -march=native tramp3d-v4.cpp
./a.out --cartvis 1.0 0.0 --rhomin 1e-8 -n 20 
g++ -w -O3 -fprofile-use -flto=4 -march=native tramp3d-v4.cpp
Comment 7 Teresa Johnson 2012-12-13 15:50:05 UTC
Reproduced. Looks like somehow my fix to stream this through LTO is
not working properly. I see that the min count is valid when
generating the .o file, but goes to zero when we do the link (LTO)
phase. Hopefully this means that the heuristic itself is not broken.
=)  I'll see if I can root cause this and have a fix today.

Teresa

Comment 8 Teresa Johnson 2012-12-13 18:23:08 UTC
A dumb mistake in my previous fix to the LTO support. Here is the patch
that fixes it; I will submit it for review after regression testing
completes:


Index: lto-cgraph.c
===================================================================
--- lto-cgraph.c (revision 194404)
+++ lto-cgraph.c (working copy)
@@ -1368,7 +1368,9 @@
          so we need to account for a non-zero histogram entry at new_ix.  */
       unsigned new_ix = gcov_histo_index (scaled_min);
       lto_gcov_summary.histogram[new_ix].min_value
-          = MIN (lto_gcov_summary.histogram[new_ix].min_value, scaled_min);
+          = (lto_gcov_summary.histogram[new_ix].num_counters
+             ? MIN (lto_gcov_summary.histogram[new_ix].min_value, scaled_min)
+             : scaled_min);
       /* Some of the scaled counter values would ostensibly need to be placed
          into different (larger) histogram buckets, but we keep things simple
          here and place the scaled cumulative counter value in the bucket

Please let me know how this affects the mozilla size.

Thanks,
Teresa

Comment 9 Markus Trippelsdorf 2012-12-13 19:10:40 UTC
*** Bug 55669 has been marked as a duplicate of this bug. ***
Comment 10 Markus Trippelsdorf 2012-12-13 22:14:01 UTC
(In reply to comment #8)
> Please let me know how this affects the mozilla size.

Looks much better now:
 39748288 hot-bb-count-ws-permille=999 (default)
 34573408 hot-bb-count-ws-permille=890
 34072808 without lto/pgo
Comment 11 Teresa Johnson 2012-12-13 22:16:19 UTC
Do you happen to know what it was with lto/pgo before the change? It should be
roughly equivalent to hot-bb-count-ws-permille=970, from what I saw in your
profiles. What size increase is acceptable?

Thanks,
Teresa


Comment 12 Markus Trippelsdorf 2012-12-13 22:35:33 UTC
(In reply to comment #11)
> Do you happen to know what it was with lto/pgo before the change? Should be
> roughly equivalent to hot-bb-count-ws-permille=970 from what I saw in your
> profiles

Yes, it was ~34MB before the change.

> What size increase is acceptable?

It's hard to say in the case of Firefox, because the only thing
that one can reliably measure is JavaScript performance,
and that varies only very slightly with different compiler options.
So there is no way to measure up to which point more inlining
is still beneficial.
Comment 13 tejohnson 2012-12-14 15:11:00 UTC
Author: tejohnson
Date: Fri Dec 14 15:10:45 2012
New Revision: 194502

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=194502
Log:
2012-12-14  Teresa Johnson  <tejohnson@google.com>

        PR gcov-profile/55674
	* lto-cgraph.c (merge_profile_summaries): Set min correctly the
        first time we merge into a histogram entry.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lto-cgraph.c
Comment 14 Markus Trippelsdorf 2012-12-14 16:31:21 UTC
Fixed.
Comment 15 Jan Hubicka 2012-12-18 15:40:34 UTC
> It's hard to say in case of Firefox, because the only thing
> that one can reliably measure is the JavaScript performance.
> And this varies only very slightly with different compiler options.

One way to get better differences is to disable the JIT :)
Doesn't snappy allow us to measure things like startup time and
other stuff where not much JITed code is involved?

> So you have no way to measure up to which point more inlining
> is still beneficial.

I will do some testing on SPEC to see when the performance starts dropping.
Setting the limit to 890 seems a bit dangerous.  -Os can slow down the code
considerably, so if the program spends 10% of its time running -Os code, we
will probably get measurable slowdowns overall.

Thanks!
Honza
Comment 16 Markus Trippelsdorf 2012-12-18 17:03:52 UTC
I did some measurements with tramp3d and in this case
the default (999) gives the best performance:

par. size    time
--------------------
999  955859  3.71752
990  933390  3.73969
980  904718  3.84547
...    "        "
750  904718  3.84769
740  837654  7.67177
600  836024  8.80879
Comment 17 Jan Hubicka 2012-12-18 17:25:37 UTC
> I did some measurements with tramp3d and in this case
> the default (999) gives the best performance:
> 

Yep, tramp3d is unfortunately quite a special case: we measure the number of
instructions prior to late optimization, while in tramp3d over 90% of the code
disappears as a result of inlining and further simplification, so we get a
GIGO problem...

I am not sure how to handle these things in any reasonable way...

I will test a couple of values on SPEC2k this week and let's see how things
scale elsewhere.

Honza
Comment 18 Teresa Johnson 2012-12-19 16:44:21 UTC

As another data point, in our internal benchmarks I had tried a few
values and 99.9% gave the best performance. Just going down to 99.0%
reduced the inlining too much, even compared to the old static cutoff
count, missing some key inlines and reducing performance.

Thanks,
Teresa

Comment 19 Jan Hubicka 2012-12-21 16:15:34 UTC
> As another data point, in our internal benchmarks I had tried a few
> values and 99.9% gave the best performance. Just going down to 99.0%
> reduced the inlining too much, even compared to the old static cutoff
> count, missing some key inlines and reducing performance.
This really should not happen too much.  I still think something along the
following lines is desirable.  Does it help to set a more reasonable threshold?

Honza

Index: predict.c
===================================================================
*** predict.c	(revision 194655)
--- predict.c	(working copy)
*************** maybe_hot_count_p (struct function *fun,
*** 145,151 ****
      {
        ws = find_working_set (PARAM_VALUE (HOT_BB_COUNT_WS_PERMILLE));
        gcc_assert (ws);
!       min_count = ws->min_counter;
      }
    return (count >= min_count);
  }
--- 145,156 ----
      {
        ws = find_working_set (PARAM_VALUE (HOT_BB_COUNT_WS_PERMILLE));
        gcc_assert (ws);
! 
!       /* We want all counters above ws->min_counter * profile_info->runs
! 	 to be safely identified as hot regions.  This may be spoiled
! 	 by optimizations such as unrolling that reduce counts of the
! 	 body, thus divide by 32.  */
!       min_count = ws->min_counter / 32;
      }
    return (count >= min_count);
  }
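The intent of the change is that any count within a factor of 32 below the working-set minimum still classifies as hot. As a rough standalone sketch (a hypothetical helper for illustration, not the patched maybe_hot_count_p itself):

```c
#include <assert.h>

typedef long long gcov_type;

/* Hypothetical helper: a count classifies as hot if it is within a
   factor of 32 below the working-set minimum counter.  The division
   tolerates count-reducing transforms (e.g. unrolling) that shrink
   body counts relative to the profiled minimum.  */
static int
maybe_hot (gcov_type count, gcov_type ws_min_counter)
{
  gcov_type min_count = ws_min_counter / 32;
  return count >= min_count;
}
```

With the min_count=8585 value from the tramp3d dump above, the strict test rejects a count of 300, while the relaxed test accepts anything down to 269.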
Comment 20 Teresa Johnson 2012-12-21 16:26:17 UTC

I'll give this patch a try and let you know how it affects the
performance I see. But unrolling shouldn't affect inlining, since all
unrolling is after inlining, right?

Thanks,
Teresa

Comment 21 Jan Hubicka 2012-12-22 23:20:37 UTC
> I'll give this patch a try and let you know how it affects the
> performance I see. But unrolling shouldn't affect inlining, since all
> unrolling is after inlining, right?

Well, it is, but we do other transformations that affect the profile prior to
inlining, plus we use hot/cold information all across the compilation, not just
to drive inlining.

Honza
Comment 22 Teresa Johnson 2013-01-11 18:18:48 UTC
Hi Honza,

I ran a number of experiments at different thresholds, and found that
performance starts dropping pretty quickly as the working set
threshold is dropped, even to 98.5% (2 working sets lower than the
99.9% threshold). This appears to be due to a reduction in key
inlines. If I use your divide by 32 heuristic, the performance is
recovered for most of the apps I looked at (with the 98.5% threshold),
but is still reduced for others.

When I looked at the ratio of the min counter values between the
98.28% bucket (selected when I use the 98.5% threshold) and the 99.9%
bucket, I can see that the apps that still degrade with 98.5% + your
divide by 32 heuristic are those with the largest ratio (100-300). For
the apps that no longer degrade from the 98.5% thresholds when
dividing by 32, the ratio is small and often less than 32 (so in
effect the resulting min counter is even smaller than with the current
99.9% threshold).

What threshold were you thinking of changing it to, based on your
experiments? For the google branches, based on the data I collected,
we plan to leave it at 99.9%.

Thanks,
Teresa
