See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 onwards. Binaries build with "-flto -fprofile-use" increased more than 20% in size since revision 193747 went in. libxul's size increased ~20%: 34MB before, 42MB now. This also happens for simpler testcases, e.g tramp3d-v4.cpp (30% increase): 0.9MB before, 1.3M now. Setting "--param hot-bb-count-ws-permille" to lower values doesn't help.
I'm really surprised that using --param hot-bb-count-ws-permille=950 didn't help, since even fewer things should look hot enough to inline than before the revision. Would you send me the new -fdump-ipa-inline output? Thanks, Teresa
In the tramp3d-v4 case, when I run with this simple debug patch: diff --git a/gcc/predict.c b/gcc/predict.c index 5d3de29..bf3a259 100644 --- a/gcc/predict.c +++ b/gcc/predict.c @@ -147,6 +147,7 @@ maybe_hot_count_p (struct function *fun, gcov_type count) gcc_assert (ws); min_count = ws->min_counter; } + fprintf (stderr, "count=%i min_count=%i profile_info->sum_max=%i\n", count, min_count, profile_info->sum_max); return (count >= min_count); } it shows that in the middle of the output, min_count changes from 8585 to 0: ... ~600000 similar lines with min_count=8585 count=3 min_count=8585 profile_info->sum_max=257406300 count=3 min_count=8585 profile_info->sum_max=257406300 count=3 min_count=8585 profile_info->sum_max=257406300 count=3 min_count=8585 profile_info->sum_max=257406300 count=40 min_count=0 profile_info->sum_max=257406300 count=61 min_count=0 profile_info->sum_max=257406300 count=40 min_count=0 profile_info->sum_max=257406300 count=40 min_count=0 profile_info->sum_max=257406300 ... ~600000 similar lines with min_count=0
Hi Markus, Are you sure you have my subsequent fixes patched in, to make sure the histogram is getting streamed through the LTO files? This was the behavior I saw when I was debugging the original issue that I fixed with those patches. Basically, the hotness checks before we went into LTO were good, and after LTO the min count was 0. Teresa On Thu, Dec 13, 2012 at 6:46 AM, markus at trippelsdorf dot de <gcc-bugzilla@gcc.gnu.org> wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674 > > --- Comment #2 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-13 14:46:31 UTC --- > In the tramp3d-v4 case, when I run with this simple debug patch: > > diff --git a/gcc/predict.c b/gcc/predict.c > index 5d3de29..bf3a259 100644 > --- a/gcc/predict.c > +++ b/gcc/predict.c > @@ -147,6 +147,7 @@ maybe_hot_count_p (struct function *fun, gcov_type count) > gcc_assert (ws); > min_count = ws->min_counter; > } > + fprintf (stderr, "count=%i min_count=%i profile_info->sum_max=%i\n", count, > min_count, profile_info->sum_max); > return (count >= min_count); > } > > it shows that in the middle of the output, min_count changes from > 8585 to 0: > > ... ~600000 similar lines with min_count=8585 > count=3 min_count=8585 profile_info->sum_max=257406300 > count=3 min_count=8585 profile_info->sum_max=257406300 > count=3 min_count=8585 profile_info->sum_max=257406300 > count=3 min_count=8585 profile_info->sum_max=257406300 > count=40 min_count=0 profile_info->sum_max=257406300 > count=61 min_count=0 profile_info->sum_max=257406300 > count=40 min_count=0 profile_info->sum_max=257406300 > count=40 min_count=0 profile_info->sum_max=257406300 > ... ~600000 similar lines with min_count=0 > > -- > Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug.
(In reply to comment #3) > Hi Markus, > > Are you sure you have my subsequent fixes patched in, to make sure the > histogram is getting streamed through the LTO files? This was the > behavior I saw when I was debugging the original issue that I fixed > with those patches. Basically, the hotness checks before we went into > LTO were good, and after LTO the min count was 0. > Hi Teresa, yes my gcc is up-to-date (from todays git).
Ok, I will download tramp3d-v4 right now and see what is going on. Can you send me the full set of options you are using to compile it? Teresa On Thu, Dec 13, 2012 at 6:52 AM, markus at trippelsdorf dot de <gcc-bugzilla@gcc.gnu.org> wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674 > > --- Comment #4 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-13 14:52:37 UTC --- > (In reply to comment #3) >> Hi Markus, >> >> Are you sure you have my subsequent fixes patched in, to make sure the >> histogram is getting streamed through the LTO files? This was the >> behavior I saw when I was debugging the original issue that I fixed >> with those patches. Basically, the hotness checks before we went into >> LTO were good, and after LTO the min count was 0. >> > Hi Teresa, > > yes my gcc is up-to-date (from todays git). > > -- > Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug.
(In reply to comment #5) > Ok, I will download tramp3d-v4 right now and see what is going on. Can > you send me the full set of options you are using to compile it? > g++ -w -O3 -fprofile-generate -march=native tramp3d-v4.cpp ./a.out --cartvis 1.0 0.0 --rhomin 1e-8 -n 20 g++ -w -O3 -fprofile-use -flto=4 -march=native tramp3d-v4.cpp
Reproduced. Looks like somehow my fix to stream this through LTO is not working properly. I see that the min count is valid when generating the .o file, but goes to zero when we do the link(lto) phase. Hopefully this means that the heuristic itself is not broken. =) I'll see if I can root cause this and have a fix today. Teresa On Thu, Dec 13, 2012 at 7:06 AM, markus at trippelsdorf dot de <gcc-bugzilla@gcc.gnu.org> wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674 > > --- Comment #6 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-13 15:06:12 UTC --- > (In reply to comment #5) >> Ok, I will download tramp3d-v4 right now and see what is going on. Can >> you send me the full set of options you are using to compile it? >> > > g++ -w -O3 -fprofile-generate -march=native tramp3d-v4.cpp > ./a.out --cartvis 1.0 0.0 --rhomin 1e-8 -n 20 > g++ -w -O3 -fprofile-use -flto=4 -march=native tramp3d-v4.cpp > > -- > Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug.
Dumb mistake in my previous fix to the lto support. Here is the patch that fixes it, I will submit for review after regression testing completes: Index: lto-cgraph.c =================================================================== --- lto-cgraph.c (revision 194404) +++ lto-cgraph.c (working copy) @@ -1368,7 +1368,9 @@ so we need to account for a non-zero histogram entry at new_ix. */ unsigned new_ix = gcov_histo_index (scaled_min); lto_gcov_summary.histogram[new_ix].min_value - = MIN (lto_gcov_summary.histogram[new_ix].min_value, scaled_min); + = (lto_gcov_summary.histogram[new_ix].num_counters + ? MIN (lto_gcov_summary.histogram[new_ix].min_value, scaled_min) + : scaled_min); /* Some of the scaled counter values would ostensibly need to be placed into different (larger) histogram buckets, but we keep things simple here and place the scaled cumulative counter value in the bucket Please let me know how this affects the mozilla size. Thanks, Teresa On Thu, Dec 13, 2012 at 7:49 AM, Teresa Johnson <tejohnson@google.com> wrote: > Reproduced. Looks like somehow my fix to stream this through LTO is > not working properly. I see that the min count is valid when > generating the .o file, but goes to zero when we do the link(lto) > phase. Hopefully this means that the heuristic itself is not broken. > =) I'll see if I can root cause this and have a fix today. > > Teresa > > On Thu, Dec 13, 2012 at 7:06 AM, markus at trippelsdorf dot de > <gcc-bugzilla@gcc.gnu.org> wrote: >> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674 >> >> --- Comment #6 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-13 15:06:12 UTC --- >> (In reply to comment #5) >>> Ok, I will download tramp3d-v4 right now and see what is going on. Can >>> you send me the full set of options you are using to compile it? >>> >> >> g++ -w -O3 -fprofile-generate -march=native tramp3d-v4.cpp >> ./a.out --cartvis 1.0 0.0 --rhomin 1e-8 -n 20 >> g++ -w -O3 -fprofile-use -flto=4 -march=native tramp3d-v4.cpp >> >> -- >> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email >> ------- You are receiving this mail because: ------- >> You are on the CC list for the bug. > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
*** Bug 55669 has been marked as a duplicate of this bug. ***
(In reply to comment #8) > Please let me know how this affects the mozilla size. Looks much better now: 39748288 hot-bb-count-ws-permille=999 (default) 34573408 hot-bb-count-ws-permille=890 34072808 without lto/pgo
Do you happen to know what it was with lto/pgo before the change? Should be roughly equivalent to hot-bb-count-ws-permille=970 from what I saw in your profiles. What size increase is acceptable? Thanks, Teresa On Thu, Dec 13, 2012 at 2:14 PM, markus at trippelsdorf dot de < gcc-bugzilla@gcc.gnu.org> wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674 > > --- Comment #10 from Markus Trippelsdorf <markus at trippelsdorf dot de> > 2012-12-13 22:14:01 UTC --- > (In reply to comment #8) > > Please let me know how this affects the mozilla size. > > Looks much better now: > 39748288 hot-bb-count-ws-permille=999 (default) > 34573408 hot-bb-count-ws-permille=890 > 34072808 without lto/pgo > > -- > Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug. >
(In reply to comment #11) > Do you happen to know what it was with lto/pgo before the change? Should be > roughly equivalent to hot-bb-count-ws-permille=970 from what I saw in your > profiles Yes, it was ~34MB before the change. > What size increase is acceptable? It's hard to say in case of Firefox, because the only thing that one can reliably measure is the JavaScript performance. And this varies only very slightly with different compiler options. So you have no way to measure up to which point more inlining is still beneficial.
Author: tejohnson Date: Fri Dec 14 15:10:45 2012 New Revision: 194502 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=194502 Log: 2012-12-14 Teresa Johnson <tejohnson@google.com> PR gcov-profile/55674 * lto-cgraph.c (merge_profile_summaries): Set min correctly the first time we merge into a histogram entry. Modified: trunk/gcc/ChangeLog trunk/gcc/lto-cgraph.c
fixed
> It's hard to say in case of Firefox, because the only thing > that one can reliably measure is the JavaScript performance. > And this varies only very slightly with different compiler options. One way to get better differences is to disable the JIT :) Doesn't snappy allow us to measure things like startup time and other stuff where not much of JITed code is involved? > So you have no way to measure up to which point more inlining > is still beneficial. I will do some testing on SPEC on when the performance starts dropping. Setting limit to 890 seems bit dangerous. -Os can slow down the code considerably, so if 10% of time program will be running -Os code, we probably get measurable slowdowns overall. Thanks! Honza
I did some measurements with tramp3d and in this case the default (999) gives the best performance: par. size time -------------------- 999 955859 3.71752 990 933390 3.73969 980 904718 3.84547 ... " " 750 904718 3.84769 740 837654 7.67177 600 836024 8.80879
> I did some measurements with tramp3d and in this case > the default (999) gives the best performance: > > par. size time > -------------------- > 999 955859 3.71752 > 990 933390 3.73969 > 980 904718 3.84547 > ... " " > 750 904718 3.84769 > 740 837654 7.67177 > 600 836024 8.80879 Yep, tramp3d is unforutnately quite special case: we measure the number of instructions prior late optimization, while in tramp3d over 90% of code disappear as a result of inlining and further simplification, so we get GIGO problem... I am not sure how to handle these things in any resonable way.... I will test couple of values on spec2k this week and lets see how things scale elsewhere. Honza
On Tue, Dec 18, 2012 at 9:25 AM, hubicka at ucw dot cz <gcc-bugzilla@gcc.gnu.org> wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674 > > --- Comment #17 from Jan Hubicka <hubicka at ucw dot cz> 2012-12-18 17:25:37 UTC --- >> I did some measurements with tramp3d and in this case >> the default (999) gives the best performance: >> >> par. size time >> -------------------- >> 999 955859 3.71752 >> 990 933390 3.73969 >> 980 904718 3.84547 >> ... " " >> 750 904718 3.84769 >> 740 837654 7.67177 >> 600 836024 8.80879 > > Yep, tramp3d is unforutnately quite special case: we measure the number of > instructions prior > late optimization, while in tramp3d over 90% of code disappear as a result of > inlining and further > simplification, so we get GIGO problem... > > I am not sure how to handle these things in any resonable way.... > > I will test couple of values on spec2k this week and lets see how things scale > elsewhere. As another data point, in our internal benchmarks I had tried a few values and 99.9% gave the best performance. Just going down to 99.0% reduced the inlining too much, even compared to the old static cutoff count, missing some key inlines and reducing performance. Thanks, Teresa > > Honza > > -- > Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug. -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> As another data point, in our internal benchmarks I had tried a few > values and 99.9% gave the best performance. Just going down to 99.0% > reduced the inlining too much, even compared to the old static cutoff > count, missing some key inlines and reducing performance. this really should not happen too much. I still think something along the following lines is desirable. Does it helps setting more resonable threshold? Honza Index: predict.c =================================================================== *** predict.c (revision 194655) --- predict.c (working copy) *************** maybe_hot_count_p (struct function *fun, *** 145,151 **** { ws = find_working_set (PARAM_VALUE (HOT_BB_COUNT_WS_PERMILLE)); gcc_assert (ws); ! min_count = ws->min_counter; } return (count >= min_count); } --- 145,156 ---- { ws = find_working_set (PARAM_VALUE (HOT_BB_COUNT_WS_PERMILLE)); gcc_assert (ws); ! ! /* We want all counters above ws->min_counter * profile_info->runs ! to be safely identified as hot regions. This may be spoiled ! by optimizations such as unrolling that reduce counts of the ! body, thus divide by 32. */ ! min_count = ws->min_counter / 32; } return (count >= min_count); }
On Fri, Dec 21, 2012 at 8:15 AM, hubicka at ucw dot cz <gcc-bugzilla@gcc.gnu.org> wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674 > > --- Comment #19 from Jan Hubicka <hubicka at ucw dot cz> 2012-12-21 16:15:34 UTC --- >> As another data point, in our internal benchmarks I had tried a few >> values and 99.9% gave the best performance. Just going down to 99.0% >> reduced the inlining too much, even compared to the old static cutoff >> count, missing some key inlines and reducing performance. > this really should not happen too much. I still think something along the > following lines > is desirable. Does it helps setting more resonable threshold? I'll give this patch a try and let you know how it affects the performance I see. But unrolling shouldn't affect inlining, since all unrolling is after inlining, right? Thanks, Teresa > > Honza > > Index: predict.c > =================================================================== > *** predict.c (revision 194655) > --- predict.c (working copy) > *************** maybe_hot_count_p (struct function *fun, > *** 145,151 **** > { > ws = find_working_set (PARAM_VALUE (HOT_BB_COUNT_WS_PERMILLE)); > gcc_assert (ws); > ! min_count = ws->min_counter; > } > return (count >= min_count); > } > --- 145,156 ---- > { > ws = find_working_set (PARAM_VALUE (HOT_BB_COUNT_WS_PERMILLE)); > gcc_assert (ws); > ! > ! /* We want all counters above ws->min_counter * profile_info->runs > ! to be safely identified as hot regions. This may be spoiled > ! by optimizations such as unrolling that reduce counts of the > ! body, thus divide by 32. */ > ! min_count = ws->min_counter / 32; > } > return (count >= min_count); > } > > -- > Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug.
> I'll give this patch a try and let you know how it affects the > performance I see. But unrolling shouldn't affect inlining, since all > unrolling is after inlining, right? Well, it is, but we do other transformations that affect profile prior inlining + we use hot/cold code all across compilation, not just to drive inlining. Honza
Hi Honza, I ran a number of experiments at different thresholds, and found that performance starts dropping pretty quickly as the working set threshold is dropped, even to 98.5% (2 working sets lower than the 99.9% threshold). This appears to be due to a reduction in key inlines. If I use your divide by 32 heuristic, the performance is recovered for most of the apps I looked at (with the 98.5% threshold), but is still reduced for others. When I looked at the ratio of the min counter values between the 98.28% bucket (selected when I use the 98.5% threshold) and the 99.9% bucket, I can see that the apps that still degrade with 98.5% + your divide by 32 heuristic are those with the largest ratio (100-300). For the apps that no longer degrade from the 98.5% thresholds when dividing by 32, the ratio is small and often less than 32 (so in effect the resulting min counter is even smaller than with the current 99.9% threshold). What threshold were you thinking of changing it to, based on your experiments? For the google branches, based on the data I collected, we plan to leave it at 99.9%. Thanks, Teresa On Sat, Dec 22, 2012 at 3:20 PM, hubicka at ucw dot cz <gcc-bugzilla@gcc.gnu.org> wrote: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674 > > --- Comment #21 from Jan Hubicka <hubicka at ucw dot cz> 2012-12-22 23:20:37 UTC --- >> I'll give this patch a try and let you know how it affects the >> performance I see. But unrolling shouldn't affect inlining, since all >> unrolling is after inlining, right? > > Well, it is, but we do other transformations that affect profile prior inlining > + we use hot/cold code all across compilation, not just to drive inlining. > > Honza > > -- > Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug.