This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition


> Hi Honza,
> 
> I am finally getting back to working on this after a few weeks of
> working on some other priorities.

I am also trying to return to this, so good timing ;)
Martin has got smaller C++ programs (Inkscape) to not touch the cold segment
during startup with FDO (w/o partitioning). Firefox still does; I think
the problem is lost samples due to different linker decisions even with LTO
(i.e. the linker picks an object from a .a library at profile-generate time
that it never passes later).

I plan to look into that today.
> 
> Did you mean to commit the above change? I see that it went in as part
> of r202258 but doesn't show up in the ChangeLog entry for that
> revision.

Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog.  I will
fix that.
> 
> >
> > In other cases it was mostly loop unrolling in combination with jump threading. So
> > I modified my script to separately report when failure happens for test trained
> > once and test trained hundred times.
> 
> Thanks for the linker script. I reproduced your results. I looked at a
> couple cases. The first was one that failed after 1 training run only
> (20000910-2.c). It was due to jump threading, which you noted was a
> problem. For this one I think we can handle it in the partitioning,
> since there is an FDO insanity that we could probably treat more
> conservatively when splitting.

We should fix the roundoff issues - when I was introducing the
frequency/probability/count system I made the assumption that parts of the
program with very low counts do not matter, since they are not part of the hot
spot (and I cared only about the hot spot).  Now we care about identifying
unlikely executed spots, so we need to fix this.
> 
> I looked at one that failed after 100 as well (20031204-1.c). In this
> case, it was due to expansion which was creating multiple branches/bbs
> from a logical OR and guessing incorrectly on how to assign the
> counts:
> 
>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
> 
> The (*cp == ':' || *cp == '\0') part looked like the following going
> into RTL expansion:
> 
>   [20031204-1.c : 31:33] _29 = _28 == 58;
>   [20031204-1.c : 31:33] _30 = _28 == 0;
>   [20031204-1.c : 31:33] _31 = _29 | _30;
>   [20031204-1.c : 31:18] if (_31 != 0)
>     goto <bb 16>;
>   else
>     goto <bb 19>;
> 
> where the result of the OR was always true, so bb 16 had a count of
> 100 and bb 19 a count of 0. When it was expanded, the expanded version
> of the above turned into 2 bbs with a branch in between. Both
> comparisons were done in the first bb, but the first bb checked
> whether the result of the *cp == '\0' compare was true, and if not
> branched to the check for whether the *cp == ':' compare was true. It
> gave the branch to the second check against ':' a count of 0, so that
> bb got a count of 0 and was split out, and put the count of 100 on the
> fall through assuming the compare with '\0' always evaluated to true.
> In reality, this OR condition was always true because *cp was ':', not
> '\0'. Therefore, the count of 0 on the second block with the check for
> ':' was incorrect, we ended up trying to execute it, and failed.
> 
> Presumably we had the correct profile data for both blocks, but the
> accuracy was reduced when the OR was represented as a logical
> computation with a single branch. We could change the expansion code
> to do something different, e.g. treat as a 50-50 branch. But we would
> still end up with integer truncation issues when there was a single
> training run. But that could be dealt with conservatively in the
> bbpart code as I suggested for the jump threading issue above. I.e. a
> cold block with incoming non-cold edges conservatively not marked cold
> for splitting.
> 
> >
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> >
> > FAIL1 is failure after one run, FAIL is failure after 100 train runs.
> > We should take a look at the FAILs and see if there are bugs to fix.  For FAIL1
> > I think it is a kind of design problem: while implementing counts&frequencies
> > the idea was that small counts do not matter, so integer arithmetic is all
> > right.
> >
> > I wonder if with the current C++ wonderland we can't simply switch counts
> > to a better representation.  Either sreal or fixed point with capping
> > (the integer overflow issues are tiring, too).
> 
> It also seems like we should be able to detect the profile insanities
> caused by integer truncation and handle them conservatively. That
> being said, I see some sreal uses already in the profile.c code, so
> presumably we could use this for the counts as well if it turns out to
> be necessary?

Yes, I was thinking about this, too.  We would need to do some evaluation of
the compile-time implications, but switching counts from gcov_type to a
profile_counter_t that is a typedef to sreal seems a sane idea.

We could switch the CFG code first.  There should not be many hot spots where
counts are involved.  We can out-of-line the common calculations we already
moved to macros.

We will also need to invent a REG note representation for them.  Now we have
INT_LIST for that; we may have an SREAL list and introduce SREAL as a valid
RTX argument.  This can be done incrementally.
> 
> BTW, Rong also implemented his runtime patch to do the COMDAT profile
> merging. However, that ended up having some issues, that were solvable
> but would have caused us to lose all context sensitivity from COMDATS
> inlined during the profile-gen build. I am going to go back to solving
> this in the profile-use phase as we discussed in the separate thread
> on the COMDAT inlining patch I had been working on.

Yes, let's move ahead with this, too.  I think I should dig out the change
that made frequencies be guessed again.

As for COMDAT merging, I would like to see the patch.  I am experimenting
now with a patch to also privatize COMDATs during -fprofile-generate to
avoid the problems with lost profiles mentioned above.

As for context sensitivity, one approach would be to have two sets of
counters for every comdat - one merged globally and one counting local
instances.  We can then always privatize, and at profile read-in just
clone every comdat so we have two instances - one for the offline copy
and one for inlining.

This is not different from how I always wanted to handle GNU extern inlines
(which also have this issue - when you do not inline one, the unit does not see
any profile for it).

We can just tie the two functions together so the "inline" version stays prior
to inlining, and then have the linker redirect to the inline version instead of
the offline version in such cases.  It already knows how to skip aliases, and
this is not terribly different from that.

Honza
> 
> Thanks,
> Teresa
> 
> >
> > Honza
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

