This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [PATCH] AutoFDO patch for trunk


> >> Index: gcc/bb-reorder.c
> >> ===================================================================
> >> --- gcc/bb-reorder.c  (revision 210180)
> >> +++ gcc/bb-reorder.c  (working copy)
> >> @@ -2663,7 +2663,7 @@ pass_partition_blocks::gate (function *fun)
> >>       user defined section attributes.  Don't call it if either case
> >>       arises.  */
> >>    return (flag_reorder_blocks_and_partition
> >> -       && optimize
> >> +       && optimize &&!flag_auto_profile
> >
> > Formatting error. Why do we want !flag_auto_profile? Can't we just arrange
> > for flag_reorder_blocks_and_partition to default to false?
> >
> > In fact I would like to see flag_reorder_blocks_and_partition offload obviously
> > cold code (in particular the EH cleanups) even without a profile. That should work
> > well with autofdo.
> 
> This will cause bzip2 performance to degrade by 6%. I haven't had time to
> triage the problem. Will investigate this later.

Still, I would prefer to make flag_reorder_blocks_and_partition default to
false with auto_profile.  We could do that incrementally; let's just drop
this from the initial patch.
> >> -  if (flag_branch_probabilities)
> >> +  if (flag_auto_profile)
> >> +    init_auto_profile ();
> >
> > Perhaps initialization can happen from toplev.c instead?
> 
> This actually reads the afdo profile. I've renamed the function to
> make it consistent.

OK.

> 
> >> Index: gcc/cgraphclones.c
> >> ===================================================================
> >> --- gcc/cgraphclones.c        (revision 210180)
> >> +++ gcc/cgraphclones.c        (working copy)
> >> @@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec
> >>      }
> >>    else
> >>      count_scale = 0;
> >> +  /* In AutoFDO, if edge count is larger than callee's entry block
> >> +     count, we will not update the original callee because it may
> >> +     mistakenly mark some hot function as cold.  */
> >> +  if (flag_auto_profile && count >= n->count)
> >> +    update_original = false;
> >
> > This looks like a hack. Let's go with it later - I see what you are shooting for here,
> > and non-autoFDO has kind of the same problem, too.  Counts are not all that reliable
> > after some inlining.
> >
> > I wonder how this hack would fare with -fprofile-use.
> 
> It does not affect -fprofile-use performance in my testing. However,
> for AutoFDO, this happens much more often.

Yep, let's however drop it from the initial patch and make the tweaks go in later.
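
For reference, here is a minimal sketch of the situation the check guards
against.  The numbers and names are made up for illustration; this is not the
patch's code, just the arithmetic:

  #include <algorithm>
  #include <cstdio>

  /* When a clone is made for an inlined call site, the original node's
     count is normally reduced by the call edge's count.  With AutoFDO
     the sampled edge count can exceed the callee's sampled entry count,
     so the subtraction would drive the original to zero and make a
     genuinely hot function look cold.  */
  int main ()
  {
    long callee_entry_count = 1000;  /* sampled entry count of the callee */
    long call_edge_count = 1500;     /* sampled count of one call site */

    /* Naive update: the original now looks cold even though it is hot.  */
    long updated = std::max (0L, callee_entry_count - call_edge_count);
    printf ("naive update: %ld\n", updated);  /* prints 0 */

    /* The patch instead skips the update when edge >= entry.  */
    bool update_original = !(call_edge_count >= callee_entry_count);
    printf ("update_original: %d\n", (int) update_original);
    return 0;
  }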
> 
> >> @@ -1286,6 +1285,9 @@ del_node_map (void)
> >>  struct cgraph_node*
> >>  find_func_by_profile_id (int profile_id)
> >>  {
> >> +  if (flag_auto_profile)
> >> +    return cgraph_node_for_asm (
> >> +     get_identifier ((const char *)(size_t)profile_id));
> >
> > I think with LTO we will need to make this machinery safe for symbol mangling.  Any plans on this?
> 
> No plan yet. Let's get AutoFDO in first, and make AutoFDO+LTO work as a
> separate effort?

OK.
> 
> >> Index: gcc/ipa-inline.h
> >> ===================================================================
> >> --- gcc/ipa-inline.h  (revision 210180)
> >> +++ gcc/ipa-inline.h  (working copy)
> >> @@ -345,3 +345,5 @@ reset_edge_growth_cache (struct cgraph_edge *edge)
> >>        edge_growth_cache[edge->uid] = zero;
> >>      }
> >>  }
> >> +
> >> +unsigned int early_inliner (function *fun);
> >
> > Can't this be handled by scheduling the early inliner as a subpass of autofdo or something similar?
> > I would rather make this explicit in the PM.
> 
> See my comment below in auto-profile.c
> 
> >> Index: gcc/auto-profile.c
> >> ===================================================================
> >> --- gcc/auto-profile.c        (revision 0)
> >> +++ gcc/auto-profile.c        (revision 0)
> >> +   After the above 3 phases, all profile is readily annotated on the GCC IR.
> >> +   AutoFDO tries to reuse all FDO infrastructure as much as possible to make
> >> +   use of the profile. E.g. it uses existing mechanism to calculate the basic
> >> +   block/edge frequency, as well as the cgraph node/edge count.
> >> +*/
> >
> > I suppose the phases can be made explicit to the PM, so the early inliner can stay
> > a pass.
> 
> 
> Unfortunately, this is more complicated:
> 
> while(changed) {
>   vpt();
>   einline();
>   annotate();
> }

Hmm, the PM does not support iteration (though it would make sense in some cases, at least for
expensive optimizations).  Why do you need to iterate einline with the other two?
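
To make the shape of the iteration concrete, here is a schematic of the
fixed-point loop being discussed.  The function names follow the quoted
pseudocode; the bodies are stand-ins, not the patch's implementation:

  /* Each round of value-profile transformation (vpt) can turn indirect
     calls into direct ones, which gives the early inliner new edges to
     inline; inlining in turn copies call sites from the callee whose
     profile must be annotated and which may again contain promotable
     indirect calls.  Iterate until a round changes nothing.  */
  static bool vpt () { return false; }      /* stub: promote indirect calls */
  static bool einline () { return false; }  /* stub: inline now-direct calls */
  static void annotate () {}                /* stub: re-annotate counts */

  static void
  afdo_fixed_point ()
  {
    bool changed = true;
    while (changed)
      {
        changed = false;
        if (vpt ())
          changed = true;
        if (einline ())
          changed = true;
        annotate ();
      }
  }

  int main () { afdo_fixed_point (); return 0; }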
> 
> 
> >> +/* Store a string array, indexed by string position in the array.  */
> >> +class string_table {
> >> +public:
> >> +  static string_table *create ();
> >> +
> >> +  /* For a given string, returns its index.  */
> >> +  int get_index (const char *name) const;
> >> +
> >> +  /* For a given decl, returns the index of the decl name.  */
> >> +  int get_index_by_decl (tree decl) const;
> >> +
> >> +  /* For a given index, returns the string.  */
> >> +  const char *get_name (int index) const;
> >
> > I wonder how these should work with LTO.  I suppose we can tell the user not to LTO
> > the train run; we should teach the table to ignore LTO-generated suffixes as
> > well as random seeds (like the coverage code already does).
> >
> > What happens when two static functions have the same name?
> 
> Shall we first focus on the non-LTO version, with a separate
> effort later to make AutoFDO+LTO work?
> 
> For static functions, there will only be one copy in the profile; the
> profiles will be merged. I agree that this could potentially cause
> problems. But OTOH, in good practice, these functions should *not* be
> hot functions. At least in our practice, we didn't see any hot
> functions like this.

You mean that static functions in different units that have the same name tend
not to be hot? I don't really follow that...
> 
> I agree. My original intention is to keep the function name handling
> as simple as possible. The current implementation is based on the assumption that
> all important (hot) functions deserve a separate name (not just a suffix
> difference). We could definitely refine the function name handling in
> future patches.
> 
> > I would definitely merge the logic here with the logic in gcov that already knows how to skip
> > random seeds and other cases that make filenames diverge...
> 
> Could you point me to the gcov logic that handles skipping function name suffixes?

See coverage_checksum_string in coverage.c
> 
> >> +
> >> +/* Return the combined location, which is a 32bit integer in which
> >> +   higher 16 bits stores the line offset of LOC to the start lineno
> >> +   of DECL, The lower 16 bits stores the discrimnator.  */
> >> +
> >> +static unsigned
> >> +get_combined_location (location_t loc, tree decl)
> >> +{
> >> +  return ((LOCATION_LINE (loc) - DECL_SOURCE_LINE (decl)) << 16);
> >> +}
> >
> > What happens on overflows here? I guess they are possible - 65536 lines is not infinity
> > these days.
> 
> If a function is longer than 65535 lines, then some lines may have
> incorrect counts attributed. But in practice, we did not find any
> function that is over 10K lines.

In autogenerated code that may happen.  E.g. insn-recog.c is 300k lines of code, and it is
pure luck that the generator artificially splits the large decision tree into subfunctions
(probably since the early 90's).

But yep, this is also something that can be dealt with incrementally.
Perhaps we could add a TODO comment at the beginning of the file describing the main
limitations.
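
As a worked illustration of the packing (and of where the 16-bit line offset
overflows), assuming the layout described in the quoted comment; the helper
name below is made up:

  #include <cstdio>

  /* High 16 bits: line offset from the function's first line.
     Low 16 bits: discriminator.  Offsets of 65536 and above wrap.  */
  static unsigned
  combine_location (unsigned line_offset, unsigned discriminator)
  {
    return (line_offset << 16) | (discriminator & 0xffff);
  }

  int main ()
  {
    unsigned loc = combine_location (42, 3);
    printf ("offset=%u discriminator=%u\n", loc >> 16, loc & 0xffff);

    /* A 70000-line offset does not fit in 16 bits and silently wraps
       to 70000 - 65536 = 4464, attributing the count to the wrong line.  */
    printf ("wrapped offset=%u\n", combine_location (70000, 0) >> 16);
    return 0;
  }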
> >> +
> >> +/* Special propagation for circuit expressions. Because GCC translates
> >> +   control flow into data flow for circuit expressions. E.g.
> >> +   BB1:
> >> +   if (a && b)
> >> +     BB2
> >> +   else
> >> +     BB3
> >> +
> >> +   will be translated into:
> >> +
> >> +   BB1:
> >> +     if (a)
> >> +       goto BB.t1
> >> +     else
> >> +       goto BB.t3
> >> +   BB.t1:
> >> +     if (b)
> >> +       goto BB.t2
> >> +     else
> >> +       goto BB.t3
> >> +   BB.t2:
> >> +     goto BB.t3
> >> +   BB.t3:
> >> +     tmp = PHI (0 (BB1), 0 (BB.t1), 1 (BB.t2)
> >> +     if (tmp)
> >> +       goto BB2
> >> +     else
> >> +       goto BB3
> >> +
> >> +   In this case, we need to propagate through PHI to determine the edge
> >> +   count of BB1->BB.t1, BB.t1->BB.t2.  */
> >
> > So here you have known probabilities of BB1, BB2 and BB3 and you need to guess all the rest?
> > Why does it need so much gimple matching?
> 
> This is basically because some basic blocks have no stmts, so we
> cannot get their counts from the source-level profile. However, we can get
> edge counts through PHI nodes.
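
To make that concrete, here is a worked count example for the CFG in the
quoted comment.  The numbers are made up; this only illustrates the inference,
not the pass's code:

  #include <cstdio>

  int main ()
  {
    /* Count known from the source-level profile.  */
    long bb2 = 30;   /* "then" block of  if (a && b)  */

    /* In  tmp = PHI <0 (BB1), 0 (BB.t1), 1 (BB.t2)>  only the BB.t2
       argument produces a nonzero tmp, so every execution that reaches
       BB2 must have flowed through BB.t2.  */
    long bbt2 = bb2;              /* count of BB.t2 */
    long edge_bbt1_bbt2 = bbt2;   /* and hence of the edge BB.t1->BB.t2 */

    /* With BB.t2 pinned down, the ordinary propagation rounds can use
       flow conservation to fill in more of the remaining edges.  */
    printf ("BB.t1->BB.t2 = %ld\n", edge_bbt1_bbt2);
    return 0;
  }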
> 
> >> +  while (changed && i++ < 10)
> >> +    {
> >> +      changed = false;
> >> +
> >> +      if (afdo_propagate_edge (true))
> >> +     changed = true;
> >> +      if (afdo_propagate_edge (false))
> >> +     changed = true;
> >> +      afdo_propagate_circuit ();
> >> +    }
> >
> > I wonder if there is any code to share with the same (except for the circuit thing) code in profile.c
> > and gcov.c? profile.c also has the minimal flow algorithm for fixing broken profiles; perhaps
> > it could be useful here?
> 
> The initial SampleFDO implementation used the MCF algorithm to calculate
> edge counts (the current mcf.c actually comes from that effort).
> However, from our performance tuning effort, we found that MCF is
> overkill for AutoFDO's purposes, and it's extremely slow and hard to
> debug. Thus we implemented this ad-hoc heuristic for propagation.

OK, I do not see, however, why we do not share one of these two solutions for both
cases.  Again, something that may be cleaned up later.
How slow is MCF in practice?
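
For comparison, the core rule this kind of propagation relies on is flow
conservation (the gcov machinery uses the same invariant when it fills in
uninstrumented edges): when a block's count is known and all but one incident
edge count is known, the remaining edge is the difference.  A minimal sketch
of one such step, with made-up types and names rather than the patch's
afdo_propagate_edge:

  #include <cstdio>
  #include <vector>

  struct edge_info
  {
    bool known;
    long count;
  };

  /* One propagation step over a block's successor edges: if the block's
     count is known and exactly one outgoing edge count is unknown, that
     edge must make up the difference.  Returns true if it inferred one.  */
  static bool
  propagate_one (long bb_count, std::vector<edge_info> &succs)
  {
    long known_sum = 0;
    int unknown = -1;
    for (unsigned i = 0; i < succs.size (); i++)
      {
        if (succs[i].known)
          known_sum += succs[i].count;
        else if (unknown < 0)
          unknown = (int) i;
        else
          return false;  /* more than one unknown edge: give up for now */
      }
    if (unknown < 0)
      return false;      /* nothing left to infer for this block */
    succs[unknown].count = bb_count > known_sum ? bb_count - known_sum : 0;
    succs[unknown].known = true;
    return true;
  }

  int main ()
  {
    std::vector<edge_info> succs = { { true, 60 }, { false, 0 } };
    propagate_one (100, succs);
    printf ("inferred edge count: %ld\n", succs[1].count);  /* prints 40 */
    return 0;
  }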
> 
> >> +  FOR_ALL_BB_FN (bb, cfun)
> >> +    {
> >> +      edge e;
> >> +      edge_iterator ei;
> >> +
> >> +      FOR_EACH_EDGE (e, ei, bb->succs)
> >> +     e->count =
> >> +             (double) bb->count * e->probability / REG_BR_PROB_BASE;
> >> +      bb->aux = NULL;
> >> +    }
> >> +
> >> +  loop_optimizer_finalize ();
> >> +  free_dominance_info (CDI_DOMINATORS);
> >> +  free_dominance_info (CDI_POST_DOMINATORS);
> >> +}
> >
> > I wonder if using the estimated profile to fill in the missing gaps would make sense?
> > I.e. running the standard branch probabilities->frequencies propagation code, but this
> > time taking the determined counts as known values?
> 
> Not sure if I follow. Could you elaborate?
> 
> >> +
> >> +/* Perform value profile transformation using AutoFDO profile. Add the
> >> +   promoted stmts to PROMOTED_STMTS. Return TRUE if there is any
> >> +   indirect call promoted.  */
> >> +
> >> +static bool
> >> +afdo_vpt_for_early_inline (stmt_set *promoted_stmts)
> >
> > What exactly does this function do? Does it inline according to the inline decisions
> > made at train run time and recorded in the profile feedback?
> 
> As replied above, we need the iterative vpt/einline loop to make sure the IR
> looks exactly the same as in the profiling binary before annotation.

So the main reason is that via vpt you turn indirect calls into direct calls?
Perhaps this could happen during einline itself (so we do not go through the
redundant inline transform passes)?
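
For readers following along, at the source level the promotion amounts to
guarding a direct call on the hot target recorded in the profile, which is
what makes the call visible to the early inliner.  A rough sketch with
made-up names (the real pass works on GIMPLE and uses the profile's target
histogram, not a hard-coded comparison):

  #include <cstdio>

  static int hot_target (int x) { return x + 1; }

  /* Before promotion: the early inliner cannot see through the pointer.  */
  static int
  call_indirect (int (*fn) (int), int x)
  {
    return fn (x);
  }

  /* After promotion: a guarded direct call that early inlining can process;
     the comparison keeps other runtime targets correct.  */
  static int
  call_promoted (int (*fn) (int), int x)
  {
    if (fn == hot_target)
      return hot_target (x);  /* direct call, now inlinable */
    return fn (x);            /* fallback for the remaining targets */
  }

  int main ()
  {
    printf ("%d %d\n", call_indirect (hot_target, 1),
            call_promoted (hot_target, 1));
    return 0;
  }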

I will look in detail into the patch soon.
Honza

