This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition


On Sat, Aug 17, 2013 at 1:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> patch for updating counts based on estimated frequencies to address
>> inlined comdats with 0 profile counts:
>>
>> 2013-08-16  Teresa Johnson  <tejohnson@google.com>
>>
>>         * tree-inline.c (copy_bb): Compute count based on frequency.
>>         (copy_edges_for_bb): Ditto.
>>         (copy_cfg_body): Ditto.
>>         (copy_body): Pass down frequency.
>>         (expand_call_inline): Ditto.
>>         (tree_function_versioning): Ditto.
>>         * predict.c (init_and_estimate_bb_frequencies): New function.
>>         (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>>         * predict.h (init_and_estimate_bb_frequencies): Declare.
>>         * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>>         * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>>         counts.
>>         (clone_inlined_nodes): Compute edge count scale if needed.
>
> I do not see why the inliner needs to care about scaling more than it does right
> now.  So you have init_and_estimate_bb_frequencies that forces profile guessing
> on a given function body. In addition to that I think you need something like
> freqs_to_counts that will compute counts based on freqs with a given scale
> (actually you can do that as part of propagation, before frequencies are scaled
> to the usual 0...FREQ_MAX scale and precision is lost).
>
> Because an offline COMDAT function will be produced for every COMDAT used, I think
> it is bad to produce any COMDAT (or any function reachable via calls with a non-0
> count) that has an empty profile (either because it got lost in COMDAT merging
> or because of a profile read mismatch).

The approach this patch takes is simply to treat those functions the
same as we would if we hadn't fed back profile data in the first
place, by using the frequencies. This is sufficient except when one
of them is inlined, which is why I have the special handling in the
inliner itself.
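To illustrate the fallback described above, here is a standalone model of the two-way choice the patch adds in copy_cfg_body/copy_bb (hypothetical names and plain types, not the actual GCC internals; the real change is in the tree-inline.c hunks below): scale by a count ratio when the callee has real counts, otherwise distribute the incoming call count over the statically guessed frequencies.

```c
#include <assert.h>

/* Standalone model (hypothetical names) of the patch's fallback:
   if the callee's entry count is 0 (e.g. a COMDAT whose profile was
   lost), derive copied-bb counts from guessed frequencies instead of
   from a count ratio.  */
#define MODEL_BB_FREQ_MAX 10000

typedef long long model_gcov_type;

/* Rounded division, like GCC's RDIV macro.  */
static model_gcov_type
model_rdiv (model_gcov_type a, model_gcov_type b)
{
  return (a + b / 2) / b;
}

static model_gcov_type
model_copied_bb_count (model_gcov_type call_count,  /* count at the inline site */
                       model_gcov_type entry_count, /* callee entry bb count */
                       int entry_frequency,         /* callee entry bb guessed frequency */
                       int bb_frequency,            /* this bb's guessed frequency */
                       model_gcov_type bb_count)    /* this bb's profiled count */
{
  if (entry_count)
    /* Normal case: scale the profiled count by call_count / entry_count.  */
    return model_rdiv (bb_count * call_count, entry_count);
  /* Fallback for an all-zero profile: derive the count from the
     statically guessed frequencies.  */
  return model_rdiv ((model_gcov_type) bb_frequency * call_count,
                     entry_frequency);
}
```

For example, with an all-zero callee profile, a call count of 50, an entry frequency of 10000, and a bb frequency of 5000, the bb gets a synthesized count of 25.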

>
> So I guess you can just check functions with a 0 count and non-0 count callers
> and initialize their guessed profiles.
> Some capping will probably be needed to not propagate insanely large numbers...

But at profile read time we don't have access to the inter-module
calls. Presumably having guessed profiles for these routines should
help the O2 profile-use case as well (i.e. a better-optimized
out-of-line copy), so I wouldn't want to limit it to IPO or LIPO
compiles, where we can identify inter-module call counts at some point
in the compilation.

>
> Since new direct calls can be discovered later, the inliner may want to do that
> again each time it inlines a non-0 count call of a COMDAT with a 0 count...

How about an approach like this:
- Invoke init_and_estimate_bb_frequencies as I am doing to guess the
profiles at profile read time for functions with 0 counts.
- At inline time, invoke some kind of freqs_to_counts routine for any
0-count routine that is reached by non-zero call edges. It would take
the sum of all incoming call edge counts and synthesize counts for the
bbs using the guessed profile frequencies applied earlier by
init_and_estimate_bb_frequencies. Then the inliner can do its normal
bb count scaling.

Does that seem like a reasonable approach?
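As a rough sketch of what such a freqs_to_counts routine could compute, here is a standalone model (hypothetical struct and names, not the real cgraph/basic_block types): sum the incoming call edge counts, then synthesize each bb's count from its guessed frequency.

```c
#include <assert.h>

#define MODEL_BB_FREQ_MAX 10000

typedef long long model_gcov_type;

/* Hypothetical stand-in for a function body: per-bb guessed
   frequencies on the 0..MODEL_BB_FREQ_MAX scale, as left by
   init_and_estimate_bb_frequencies.  */
struct model_func
{
  int nbbs;
  int *bb_frequency;          /* guessed frequencies (input) */
  model_gcov_type *bb_count;  /* synthesized counts (output) */
};

/* Model of the proposed freqs_to_counts step: distribute the sum of
   incoming call edge counts over the guessed frequencies.  */
static void
model_freqs_to_counts (struct model_func *fn,
                       model_gcov_type incoming_count_sum)
{
  for (int i = 0; i < fn->nbbs; i++)
    fn->bb_count[i]
      = (incoming_count_sum * fn->bb_frequency[i] + MODEL_BB_FREQ_MAX / 2)
        / MODEL_BB_FREQ_MAX;
}
```

With incoming_count_sum = 200 and guessed frequencies {10000, 5000}, this yields counts {200, 100}, after which the inliner can do its normal bb count scaling.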

There is one other fix in this patch:
- The clone_inlined_nodes/update_noncloned_frequencies changes below
handle a different case: a 0-count call edge in this module, with a
non-zero callee node count due to calls from other modules. They
allow update_noncloned_frequencies to scale down the edge counts in
the callee's cloned call tree. This was a fix I made for the
callgraph-based linker plugin function reordering, not for splitting
(since reordering uses both the node and edge weights to make
ordering decisions). Here's a description of the issue from when I
was debugging it:

----
In this case, because the callee we are inlining does not have any
other callers and is not external, we call
update_noncloned_frequencies from clone_inlined_nodes instead of
creating a clone. This routine does not attempt to scale the outgoing
edge weight counters on the callee, since the assumption must be that
there are no other callers so all the weight is attributed to the
current edge that we are inlining.

In this case this is clearly not correct, because the caller's count
is 0. I'm assuming that this is happening because the callee we are
inlining is a comdat, so its non-zero weights must have come from a
different module. It seems like update_noncloned_frequencies should go
ahead and scale the counts too.
----

For the above case, I think the right place to fix this is probably
during clone_inlined_nodes/update_noncloned_frequencies, as scaling is
handled by cgraph_clone_node in the case where we need cloning (also
called from clone_inlined_nodes).
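The scaling in question can be modeled in isolation (again a hedged, standalone sketch with hypothetical names; the real change is in the ipa-inline-transform.c hunk below): compute the edge-to-callee count ratio on a REG_BR_PROB_BASE-style fixed-point scale, then apply it to each outgoing edge count in the callee's call tree.

```c
#include <assert.h>

/* Model of the count scaling added to update_noncloned_frequencies:
   when the callee's node count exceeds the inlined edge's count (e.g.
   a COMDAT whose profile partly came from other modules), outgoing
   edge counts in the callee are scaled by edge_count / callee_count.  */
#define MODEL_PROB_BASE 10000

typedef long long model_gcov_type;

static model_gcov_type
model_compute_scale (model_gcov_type inlined_edge_count,
                     model_gcov_type callee_count)
{
  /* Like GCOV_COMPUTE_SCALE: the ratio as a fixed-point fraction.  */
  if (callee_count <= inlined_edge_count)
    return MODEL_PROB_BASE;  /* no down-scaling needed */
  return (inlined_edge_count * MODEL_PROB_BASE + callee_count / 2)
         / callee_count;
}

static model_gcov_type
model_apply_scale (model_gcov_type count, model_gcov_type scale)
{
  return (count * scale + MODEL_PROB_BASE / 2) / MODEL_PROB_BASE;
}
```

In the problem case above, a 0-count inlined edge into a callee with a non-zero count yields a scale of 0, so every outgoing edge count in the cloned tree scales to 0 rather than keeping weight that belongs to other modules.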

Teresa

>
> Honza
>>
>> Index: tree-inline.c
>> ===================================================================
>> --- tree-inline.c       (revision 201644)
>> +++ tree-inline.c       (working copy)
>> @@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id
>>
>>  static basic_block
>>  copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
>> -         gcov_type count_scale)
>> +         gcov_type count_scale, gcov_type freq_to_count_scale)
>>  {
>>    gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
>>    basic_block copy_basic_block;
>> @@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
>>       basic_block_info automatically.  */
>>    copy_basic_block = create_basic_block (NULL, (void *) 0,
>>                                           (basic_block) prev->aux);
>> -  copy_basic_block->count = apply_scale (bb->count, count_scale);
>> +  copy_basic_block->count
>> +      = (count_scale
>> +         ? apply_scale (bb->count, count_scale)
>> +         /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +            that didn't get profile counts) then we compute the new bb counts
>> +            via the statically-estimated frequencies.  */
>> +         : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));
>>
>>    /* We are going to rebuild frequencies from scratch.  These values
>>       have just small importance to drive canonicalize_loop_headers.  */
>> @@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
>>     debug stmts are left after a statement that must end the basic block.  */
>>
>>  static bool
>> -copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
>> +copy_edges_for_bb (basic_block bb, gcov_type count_scale,
>> +                   basic_block ret_bb,
>>                    bool can_make_abnormal_goto)
>>  {
>>    basic_block new_bb = (basic_block) bb->aux;
>> @@ -1912,7 +1919,14 @@ static bool
>>             && old_edge->dest->aux != EXIT_BLOCK_PTR)
>>           flags |= EDGE_FALLTHRU;
>>         new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
>> -       new_edge->count = apply_scale (old_edge->count, count_scale);
>> +        basic_block new_src_bb = (basic_block) old_edge->src->aux;
>> +       new_edge->count
>> +            = (count_scale
>> +               ? apply_scale (old_edge->count, count_scale)
>> +               // The bb counts have already been scaled with freq_to_count_scale
>> +               // when that is non-zero, so just scale that new bb count by
>> +               // the edge probability.
>> +               : apply_probability (new_src_bb->count, old_edge->probability));
>>         new_edge->probability = old_edge->probability;
>>        }
>>
>> @@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
>>     another function.  Walks FN via CFG, returns new fndecl.  */
>>
>>  static tree
>> -copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
>> +copy_cfg_body (copy_body_data * id, gcov_type count,
>> +              int frequency, int frequency_scale,
>>                basic_block entry_block_map, basic_block exit_block_map,
>>                bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2293,15 +2308,20 @@ static tree
>>    basic_block bb;
>>    tree new_fndecl = NULL;
>>    bool need_debug_cleanup = false;
>> -  gcov_type count_scale;
>> +  gcov_type count_scale = 0;
>> +  gcov_type freq_to_count_scale = 0;
>>    int last;
>>    int incoming_frequency = 0;
>>    gcov_type incoming_count = 0;
>>
>> -  if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
>> -    count_scale
>> -        = GCOV_COMPUTE_SCALE (count,
>> -                              ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
>> +  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
>> +  if (entry_bb->count)
>> +    count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
>> +  /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +     that didn't get profile counts) then we compute the new bb counts
>> +     via the statically-estimated frequencies.  */
>> +  else if (entry_bb->frequency)
>> +    freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
>>    else
>>      count_scale = REG_BR_PROB_BASE;
>>
>> @@ -2323,7 +2343,13 @@ static tree
>>             incoming_frequency += EDGE_FREQUENCY (e);
>>             incoming_count += e->count;
>>           }
>> -      incoming_count = apply_scale (incoming_count, count_scale);
>> +      incoming_count
>> +          = (count_scale
>> +             ? apply_scale (incoming_count, count_scale)
>> +             /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +                that didn't get profile counts) then we compute the new bb counts
>> +                via the statically-estimated frequencies.  */
>> +             : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
>>        incoming_frequency
>>         = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
>>        ENTRY_BLOCK_PTR->count = incoming_count;
>> @@ -2350,7 +2376,8 @@ static tree
>>    FOR_EACH_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
>>        {
>> -       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
>> +       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
>> +                                     freq_to_count_scale);
>>         bb->aux = new_bb;
>>         new_bb->aux = bb;
>>         new_bb->loop_father = entry_block_map->loop_father;
>> @@ -2364,7 +2391,8 @@ static tree
>>    FOR_ALL_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy
>>          || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
>> -      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
>> +      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
>> +                                              exit_block_map,
>>                                                can_make_abormal_goto);
>>
>>    if (new_entry)
>> @@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
>>     another function.  */
>>
>>  static tree
>> -copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
>> +copy_body (copy_body_data *id, gcov_type count, int frequency,
>> +          int frequency_scale,
>>            basic_block entry_block_map, basic_block exit_block_map,
>>            bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2571,7 +2600,8 @@ static tree
>>
>>    /* If this body has a CFG, walk CFG and copy.  */
>>    gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
>> -  body = copy_cfg_body (id, count, frequency_scale, entry_block_map, exit_block_map,
>> +  body = copy_cfg_body (id, count, frequency, frequency_scale,
>> +                       entry_block_map, exit_block_map,
>>                         blocks_to_copy, new_entry);
>>    copy_debug_stmts (id);
>>
>> @@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
>>       function in any way before this point, as this CALL_EXPR may be
>>       a self-referential call; if we're calling ourselves, we need to
>>       duplicate our body before altering anything.  */
>> -  copy_body (id, bb->count,
>> +  copy_body (id, bb->count, bb->frequency,
>>              GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
>>              bb, return_block, NULL, NULL);
>>
>> @@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
>>      }
>>
>>    /* Copy the Function's body.  */
>> -  copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
>> -            ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
>> +  copy_body (&id, old_entry_block->count, old_entry_block->frequency,
>> +            REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
>> +            blocks_to_copy, new_entry);
>>
>>    /* Renumber the lexical scoping (non-code) blocks consecutively.  */
>>    number_blocks (new_decl);
>> Index: predict.c
>> ===================================================================
>> --- predict.c   (revision 201644)
>> +++ predict.c   (working copy)
>> @@ -2976,6 +2976,24 @@ make_pass_strip_predict_hints (gcc::context *ctxt)
>>    return new pass_strip_predict_hints (ctxt);
>>  }
>>
>> +/* Initialize loop edges and compute estimated bb frequencies when there
>> +   is no profile data available.  */
>> +
>> +void
>> +init_and_estimate_bb_frequencies (void)
>> +{
>> +  if (profile_status == PROFILE_READ && counts_to_freqs ())
>> +    return;
>> +
>> +  loop_optimizer_init (0);
>> +  add_noreturn_fake_exit_edges ();
>> +  mark_irreducible_loops ();
>> +  connect_infinite_loops_to_exit ();
>> +  estimate_bb_frequencies ();
>> +  remove_fake_exit_edges ();
>> +  loop_optimizer_finalize ();
>> +}
>> +
>>  /* Rebuild function frequencies.  Passes are in general expected to
>>     maintain profile by hand, however in some cases this is not possible:
>>     for example when inlining several functions with loops freuqencies might run
>> @@ -2986,15 +3004,7 @@ rebuild_frequencies (void)
>>  {
>>    timevar_push (TV_REBUILD_FREQUENCIES);
>>    if (profile_status == PROFILE_GUESSED)
>> -    {
>> -      loop_optimizer_init (0);
>> -      add_noreturn_fake_exit_edges ();
>> -      mark_irreducible_loops ();
>> -      connect_infinite_loops_to_exit ();
>> -      estimate_bb_frequencies ();
>> -      remove_fake_exit_edges ();
>> -      loop_optimizer_finalize ();
>> -    }
>> +    init_and_estimate_bb_frequencies ();
>>    else if (profile_status == PROFILE_READ)
>>      counts_to_freqs ();
>>    else
>> Index: predict.h
>> ===================================================================
>> --- predict.h   (revision 201644)
>> +++ predict.h   (working copy)
>> @@ -38,6 +38,7 @@ enum prediction
>>  extern void predict_insn_def (rtx, enum br_predictor, enum prediction);
>>  extern int counts_to_freqs (void);
>>  extern void estimate_bb_frequencies (void);
>> +extern void init_and_estimate_bb_frequencies (void);
>>  extern const char *predictor_name (enum br_predictor);
>>  extern tree build_predict_expr (enum br_predictor, enum prediction);
>>  extern void tree_estimate_probability (void);
>> Index: profile.c
>> ===================================================================
>> --- profile.c   (revision 201644)
>> +++ profile.c   (working copy)
>> @@ -1305,6 +1305,12 @@ branch_prob (void)
>>
>>    values.release ();
>>    free_edge_list (el);
>> +
>> +  /* Call after setting profile_status to PROFILE_READ, will then
>> +     invoke counts_to_freqs and if the sum of the counts is zero, will
>> +     estimate the frequencies.  */
>> +  init_and_estimate_bb_frequencies ();
>> +
>>    coverage_end_function (lineno_checksum, cfg_checksum);
>>  }
>>  ^L
>> Index: ipa-inline-transform.c
>> ===================================================================
>> --- ipa-inline-transform.c      (revision 201644)
>> +++ ipa-inline-transform.c      (working copy)
>> @@ -51,7 +51,7 @@ int nfunctions_inlined;
>>
>>  static void
>>  update_noncloned_frequencies (struct cgraph_node *node,
>> -                             int freq_scale)
>> +                             gcov_type count_scale, int freq_scale)
>>  {
>>    struct cgraph_edge *e;
>>
>> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>>      freq_scale = 1;
>>    for (e = node->callees; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>>        if (!e->inline_failed)
>> -        update_noncloned_frequencies (e->callee, freq_scale);
>> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>>      }
>>    for (e = node->indirect_calls; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>>             }
>>           duplicate = false;
>>           e->callee->symbol.externally_visible = false;
>> -          update_noncloned_frequencies (e->callee, e->frequency);
>> +          // In the case of a COMDAT, the callee's count may be from other
>> +          // modules, and we need to scale it for the current module's calls
>> +          // (e.g. e->count may be 0 despite e->callee->count > 0).
>> +          gcov_type count_scale = REG_BR_PROB_BASE;
>> +          if (e->callee->count > e->count)
>> +            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
>> +          update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>>         }
>>        else
>>         {



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

