This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition


On Fri, Aug 9, 2013 at 2:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Thu, Aug 8, 2013 at 3:23 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Hi,
>> > Martin Liska was kind enough to generate disk seeking graph of gimp statrup with his function reordering.
>> > His code simply measures time of firest execution of a function and orders functions in the given order.
>> > The functions stay in the subsections (unlikely/startup/exit/hot/normal) that are then glued together
>> > in this order.
>> >
>> > I am attaching disk seeking with and without -freorder-blocks-and-partition (with your patch).
>> >
>> > In 2.pdf you can see two increasing sequences in the text segment.  If I am not mistaken the bottom
>> > one comes for hot and the top one for normal section.  The big unused part on bottom is unlikely
>> > section since most of gimp is not trained.
>>
>> 2.pdf is reordered with Martin's technique?
>>
>> >
>> > Now 1.pdf is with -freorder-blocks-and-partition and your patch.  You can see there is third sequence
>> > near bottom of the text seciton. that is beggining of unlikely section, so it tracks jumps where we
>> > fall into cold section of function.
>>
>> 1.pdf is generated using the usual FDO +
>> -freorder-blocks-and-partition (i.e. not using Martin's technique)?
>
> 2.pdf is Martin's reordering (that works ortoghonally to what we already have -
> it just orders the functions inside idividual subsections. This make the
> subsections more visible than without his patch).
> 1.pdf is Marting's rerodering + yours patch (I asked him to double check it is
> the latest verision) + -freorder-blocks-and-partition.
>
> He simply trains and measures the gimp startup, nothing else, so there should not
> be problem with representativity of the data.

Ok, so a single training run, and it is essentially the same as what
is being used to create the graph after optimization.

>>
>> >
>> > It still seems rather bad (i.e. good part of unlikely section is actually used).  I think the dominator
>> > based approach is not going to work too reliably (I can "fix" my testcase to contain multiple nested
>> > conditionals and then the heuristic about predecestors won't help).
>>
>> Yes, this doesn't look good. Did you use the latest version of my
>> patch that doesn't walk the dominators?
>>
>> Do you know how many training runs are done for this benchmark? I
>> think a lot of the issues that you pointed out with the hot loop
>> preceded by non-looping conditional code as in your earlier example,
>> or multiple nested conditionals, comes from the fact that the cold
>> cutoff is not 0, but some number less than the number of training
>> runs. Perhaps the cutoff for splitting should be 0. Then the main
>> issue that needs to be corrected is profile insanities, not code that
>> is executed once (since that would not be marked cold).
>
> Hmm, compute_function_frequency uses probably_never_executed_bb_p that requires
> the count of basic block to be less than number of train runs.  In Martin's
> setup that will be 0.
>
> This is the same as what -frerorder-block-of-partition does?

Right, it simply puts blocks that are probably_never_executed_bb_p
into the cold section. But it sounds like that is not an issue here
since Martin is doing a single training run so the cutoff is
essentially 0.

>>
>> The only other issue that I can think of here is that the training
>> data was not representative and didn't execute these blocks.
>>
>> >
>> > What about simply walking the CFG from entry through all edges with non-0 counts and making all reachable
>> > blocks hot + forcingly make any hot blocks not reachable this way reachable?
>>
>> Is this different than what I currently have + changing the cold
>> cutoff to 0? In that case any blocks reachable through non-0 edges
>> should be non-0 and marked hot, and the current patch forces the most
>> frequent paths to all hot blocks to be hot.
>
> Do we sanity check that the cold partition does not contain any blocks of
> count 0?  It may be that the profile is broken enough to make partitioning
> not work.

Do you mean sanity check that the cold partition does not contain any
blocks of count > 0? (they should all be zero) I don't think that
sanity check is there, but I can try adding that.

The issue with such a sanity check may be due to the later fixup I
have in this patch (fixup_partitions). It is invoked after certain
optimizations on the cfg that may make hot blocks previously reached
by both hot and cold edges only reachable by cold blocks. These blocks
are remarked cold. If the profile data hasn't been updated correctly
it is possible that they would still have a non-0 count, although they
are essentially cold after the cfg transformation.

But certainly such a sanity check should always succeed after the
original partitioning.

> I can think of inlining where the count gets scaled all way down to 0.  Perhaps
> count scaling code can be modified to never round towards 0 for block executing
> non-0 times...

This reminds me of why this situation could happen. When I have been
testing this on the google branch I found situations where COMDAT
routines have 0 profile counts (if I remember correctly, this happens
when profile-gen binary has call to out-of-line copy of COMDAT in
module A, linker chooses the out-of-line copy from module B, therefore
the profile data for COMDAT in module A is 0). When the COMDAT gets
inlined, the 0 counts on its bbs are scaled to 0, even though the
callsite is non-zero. I have a patch that I was planning to send as a
follow-up that handles this case by propagating the callsite bb's
count to the inlined code when it has 0 counts, scaling by the edge
frequencies. I can either include that patch in this one, or send it
for review separately right now. Do you want to give it a try with
this one to see if it addresses the issue?

Also, can you send me reproduction instructions for gimp? I don't
think I need Martin's patch, but which version of gimp and what is the
equivalent way for me to train it? I have some scripts to generate a
similar type of instruction heat map graph that I have been using to
tune partitioning and function reordering. Essentially it uses linux
perf to sample on instructions_retired and then munge the data in
several ways to produce various stats and graphs. One thing that has
been useful has been to combine the perf data with nm output to
determine which cold functions are being executed at runtime.

However, for this to tell me which split cold bbs are being executed I
need to use a patch that Sri sent for review several months back that
gives the split cold section its own name:
  http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
Steven had some follow up comments that Sri hasn't had a chance to address yet:
  http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
(cc'ing Sri as we should probably revive this patch soon to address
gdb and other issues with detecting split functions properly)

Thanks!
Teresa

>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]