This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition


On Tue, Sep 24, 2013 at 5:31 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hi Honza,
>>
>> I am finally getting back to working on this after a few weeks of
>> working on some other priorities.
>
> I am also trying to return to this, so good timming ;)
> Martin has got smaller C++ programs (Inkscape) to not touch cold segment
> during the startup with FDO (w/o partitioning). Firefox still does, I think
> the problem are lost samples due to different linker decisions even with LTO.
> (i.e. linker pick an object from .a libraryat profile-generate time that i never
> passes later.).
>
> I plan to look into that today.
>>
>> Did you mean to commit the above change? I see that it went in as part
>> of r202258 but doesn't show up in the ChangeLog entry for that
>> revision.
>
> Yes, I meant to check it in, but did not mean to do so w/o Changelog.  I wil
> fix that.
>>
>> >
>> > In other cases it was mostly loop unrolling in combination with jump threading. So
>> > I modified my script to separately report when failure happens for test trained
>> > once and test trained hundred times.
>>
>> Thanks for the linker script. I reproduced your results. I looked at a
>> couple cases. The first was one that failed after 1 training run only
>> (20000910-2.c). It was due to jump threading, which you noted was a
>> problem. For this one I think we can handle it in the partitioning,
>> since there is an FDO insanity that we could probably treat more
>> conservatively when splitting.
>
> We should fix the roundoff issues - when I was introducing the
> frequency/probability/count system I made an assumptions that parts of programs
> with very low counts do not matter, since they are not part of hot spot (and I
> cared only about the hot spot).  Now we care about identifying unlikely
> executed spots and we need to fix this.
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges conservatively not marked cold
>> for splitting.
>>
>> >
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> >
>> > FAIL1 is failure after one run, FIAL is failure after 100 train runs.
>> > We should take look at FAILs and see if there are bugs to fix. For FAIL1
>> > I think it is kind of design problem: while implementing counts&frequencies
>> > the idea was that small counts do not matter, so integer arithmetic is all
>> > right.
>> >
>> > I wonder if with current C++ wonderland we can't simply switch count
>> > to a better representation. Either sreal or fixedpoint with capping
>> > (the integer overflow issues are tiring, too).
>>
>> It also seems like we should be able to detect the profile insanities
>> caused by integer truncation and handle them conservatively. That
>> being said, I see some sreal uses already in the profile.c code, so
>> presumably we could use this for the counts as well if it turns out to
>> be necessary?
>
> Yes, I was thinking about this, too.  We would need to do some evaulation of
> compile time implications but switching counts from gcov_type to
> profile_counter_t that is typedef to sreal seems sane idea.
>
> We could switch CFG code first.  there should not be many hot spots where
> counts are involved.  We can offline the common calculation we already moved to
> macros that
>
> We will need to invent also REG representation for them.  Now we have INT_LIST
> for that we may have SREAL list and introduce SREAL as valid RTX argument.
> This can be done incrementally.
>>
>> BTW, Rong also implemented his runtime patch to do the COMDAT profile
>> merging. However, that ended up having some issues, that were solvable
>> but would have caused us to lose all context sensitivity from COMDATS
>> inlined during the profile-gen build. I am going to go back to solving
>> this in the profile-use phase as we discussed in the separate thread
>> on the COMDAT inlining patch I had been working on.
>
> Yes, lets move ahead with this, too.  I think i should dig out the change
> that made frequencies to be guessed again.
>
> As for COMDAT merging, i would like to see the patch.  I am experimenting
> now with a patch to also privatize COMDATs during -fprofile-generate to
> avoid problems with lost profiles mentioned above.
>

Do you mean you privatize every COMDAT function in the profile-generate?
We discussed this idea internally and we thought it would not work for
large applications (like in google) due to size.

> As for context sensitivity, one approach would be to have two sets of
> counters for every comdat - one merged globally and one counting local
> instances.  We can then privatize always and at profile read in stage
> just clone every comdat and have two instances - one for offline copy
> and one for inlining.
>

In my implementation, I also allow multiple sets of COMDAT profile
co-existing in one compilation.
Due to the auxiliary modules in LIPO, I actually have more than two.

But I'm wondering how do you determine which profile to use for each
call-site -- the inline decision may not
be the same for profile-generate and profile-use compilation.

> This is not different from how I always wanted to handle GNU extern inlines
> (that also do have this issue - when you do not inline it, the unit do not see
> any profile of it).
>
> We can just tie the two functions together so "inline" version stay prior
> inlining and then have linker to redirect to inline version instead of offline
> version in such cases.  It already knows how to skip aliases and this is not
> terribly different from that.
>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]