
Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition


On Sat, Aug 17, 2013 at 1:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I added both of these and ran into issues due to profile maintenance.
>> For example, there were non-zero blocks in the cold section because
>> pro_and_epilogue split a simple return block that was previously reached
>> by both hot and cold paths. The new return block that was then only
>> reached via the cold path did not have its count properly updated to
>> reflect this, and since with this patch, blocks dominated by cold
>> blocks are remarked cold, we ended up with a non-zero count block in
>> the cold section. And there were 0 count blocks reached by non-zero
>> edges because copyprop did not clean up edge weights after removing
>> some branches and blocks, leading to non-zero edge weights that had
>> previously targeted a branch that was removed, now targeting a 0 count
>> block that the removed branch always branched around.
>
> I see, can you please send fixes for the problems you identified?
> Thanks for working on this!

I don't have fixes at this point - I just identified the phase and
transformation from looking at the dump. But I'll try to fix them soon
while I'm working on performance tuning for splitting. I have a
feeling there are probably a bunch of places where the profile isn't
getting updated properly, unfortunately.
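
For reference, here is a minimal sketch (a hypothetical helper, not part
of the patch) of the kind of consistency check that would catch the
copyprop problem above - a block whose count disagrees with the sum of
its incoming edge counts:

  /* Sketch only: report blocks whose count does not match the sum of
     their incoming edge counts.  */
  static void
  check_bb_count_sanity (void)
  {
    basic_block bb;
    FOR_EACH_BB (bb)
      {
        edge e;
        edge_iterator ei;
        gcov_type sum = 0;
        FOR_EACH_EDGE (e, ei, bb->preds)
          sum += e->count;
        if (sum != bb->count && dump_file)
          fprintf (dump_file,
                   "bb %i: count " HOST_WIDEST_INT_PRINT_DEC
                   " != incoming edge sum " HOST_WIDEST_INT_PRINT_DEC "\n",
                   bb->index, (HOST_WIDEST_INT) bb->count,
                   (HOST_WIDEST_INT) sum);
      }
  }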

>>
>> In any case, the good news is that in the cases I looked at, the
>> splitting code is doing the right thing and these blocks that were
>> marked cold really were cold. It would be great to fix the profile
>> maintenance issues, but in the meantime the above sanity checks
>> are too aggressive.
>
> We can keep them and output the info into the dump file - that is what
> most of the profile sanity checking does anyway.

Ok, I will add that.
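
Something along these lines, keeping the checks but reporting into the
dump file instead of asserting (a sketch only; the function name is made
up, and the two conditions are the ones discussed above):

  static void
  dump_partition_insanities (void)
  {
    basic_block bb;

    if (!dump_file)
      return;
    FOR_EACH_BB (bb)
      {
        edge e;
        edge_iterator ei;
        /* A block placed in the cold section should have a zero count.  */
        if (BB_PARTITION (bb) == BB_COLD_PARTITION && bb->count)
          fprintf (dump_file, "non-zero count bb %i in cold section\n",
                   bb->index);
        /* A zero-count block should not be reached by non-zero edges.  */
        FOR_EACH_EDGE (e, ei, bb->preds)
          if (e->count && !bb->count)
            fprintf (dump_file, "non-zero edge %i->%i into 0-count bb %i\n",
                     e->src->index, e->dest->index, bb->index);
      }
  }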

>
> Did you try to use Martin's linker script to make the text.unlikely
> section non-executable?  I think that way we will easily find what really
> causes us to use it during startup of a trained application (just like
> Martin does for gimp).

I haven't - where can I get that script?

>>
>> I think it makes sense to commit the current patch if possible, as it
>> makes the splitting more sane.
>
> My only concern about the patch is that I am not convinced the dominator
> based code has a chance to work reliably enough that we won't see too many
> accesses into the cold section.

Remember it isn't using dominance anymore. The latest patch instead
ensures that the most frequent path between each hot block and the
entry/exit is marked hot. That should be better than the dominance
approach used in the earlier version.
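
In outline, the idea is the following (a simplified sketch of the
approach as described above, not the actual patch code): from each hot
block, walk the most frequent incoming edge chain back toward the entry
block, remarking cold blocks on that chain as hot (and similarly toward
the exit via successor edges):

  /* Sketch: keep the most frequent path from BB back to the entry
     block in the hot partition.  */
  static void
  mark_path_to_entry_hot (basic_block bb)
  {
    while (bb != ENTRY_BLOCK_PTR)
      {
        edge e, best = NULL;
        edge_iterator ei;

        /* Pick the most frequent incoming edge.  */
        FOR_EACH_EDGE (e, ei, bb->preds)
          if (!best || e->count > best->count)
            best = e;
        /* Stop at the entry block or once we reach a block that is
           already hot.  */
        if (!best || BB_PARTITION (best->src) != BB_COLD_PARTITION)
          break;
        BB_SET_PARTITION (best->src, BB_HOT_PARTITION);
        bb = best->src;
      }
  }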

> We can commit it and work on better solution incrementally but it will
> probably mean replacing it later.  If you think it makes things easier
> to work on it incrementally, I think the patch is OK.

Yes, I think this is a big step forward from what is there now for
splitting, which makes the decision purely on each bb's count in
isolation. I don't have a much better solution in mind yet.
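
(For comparison, the current selection in bb-reorder.c is roughly just
this per-block test, with no propagation between blocks:

  FOR_EACH_BB (bb)
    if (probably_never_executed_bb_p (bb))
      BB_SET_PARTITION (bb, BB_COLD_PARTITION);

so a block whose count happens to be zero goes cold regardless of what
its neighbors look like.)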

>>
>> > - I'll try building and profiling gimp myself to see if I can
>> > reproduce the issue with code executing out of the cold section.
>>
>> I have spent some time this week trying to get the latest gimp Martin
>> pointed me to configured and built, but it took a while to track down
>> and configure/build all of the required versions of dependent
>> packages. I'm still hitting some issues trying to get it compiled, so
>> it may not yet be configured properly. I'll take a look again early
>> next week.
>
> I do not think there is anything special about gimp.  You can probably
> take any other bigger app, like GCC itself. With profiledbootstrap
> and the linker script to lock the unlikely section you should get ICEs
> where we jump into the cold section and should not.

Ok, please point me to the linker script and I will try gcc
profiledbootstrap as well. I wanted to try gimp if possible as I
haven't seen this much jumping to the cold section in some of the
internal apps I tried.

Thanks,
Teresa

>>
>> Teresa
>>
>> patch for updating counts based on estimated frequencies to address
>> inlined comdats with 0 profile counts:
>>
>> 2013-08-16  Teresa Johnson  <tejohnson@google.com>
>>
>>         * tree-inline.c (copy_bb): Compute count based on frequency.
>>         (copy_edges_for_bb): Ditto.
>>         (copy_cfg_body): Ditto.
>>         (copy_body): Pass down frequency.
>>         (expand_call_inline): Ditto.
>>         (tree_function_versioning): Ditto.
>>         * predict.c (init_and_estimate_bb_frequencies): New function.
>>         (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>>         * predict.h (init_and_estimate_bb_frequencies): Declare.
>>         * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>>         * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>>         counts.
>>         (clone_inlined_nodes): Compute edge count scale if needed.
>
> I do not see why the inliner needs to care about scaling more than it does
> right now.  So you have init_and_estimate_bb_frequencies that forces profile
> guessing on a given function body. In addition to that I think you need
> something like freqs_to_counts that will compute counts based on freqs with
> a given scale (actually you can do that as part of propagation before
> frequencies are scaled to the usual 0...FREQ_MAX scale and precision is
> lost).
>
> Because an offline COMDAT function will be produced for every COMDAT used,
> I think it is bad to produce any COMDAT (or any function reachable via
> calls with non-0 count) that has an empty profile (either because it got
> lost by COMDAT merging or because of a reading mismatch).
>
> So I guess you can just check functions with 0 count and non-0 count
> callers and initialize their guessed profile.  Some capping will probably
> be needed so we do not propagate insanely large numbers.
>
> Since new direct calls can be discovered later, the inliner may want to do
> that again each time it inlines a non-0 count call of a COMDAT with 0
> count...
>
> Honza
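
A minimal sketch of the freqs_to_counts idea Honza describes (the name
and placement are hypothetical; the scaling follows the RDIV/frequency
computation used in the patch below):

  /* Sketch: synthesize counts for FN from its guessed frequencies,
     given COUNT incoming at the entry block.  */
  static void
  freqs_to_counts (struct function *fn, gcov_type count)
  {
    basic_block bb;
    int entry_freq = ENTRY_BLOCK_PTR_FOR_FUNCTION (fn)->frequency;

    if (!entry_freq)
      return;
    FOR_ALL_BB_FN (bb, fn)
      {
        edge e;
        edge_iterator ei;
        bb->count = RDIV (count * bb->frequency, entry_freq);
        FOR_EACH_EDGE (e, ei, bb->succs)
          e->count = apply_probability (bb->count, e->probability);
      }
  }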
>>
>> Index: tree-inline.c
>> ===================================================================
>> --- tree-inline.c       (revision 201644)
>> +++ tree-inline.c       (working copy)
>> @@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id
>>
>>  static basic_block
>>  copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
>> -         gcov_type count_scale)
>> +         gcov_type count_scale, gcov_type freq_to_count_scale)
>>  {
>>    gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
>>    basic_block copy_basic_block;
>> @@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
>>       basic_block_info automatically.  */
>>    copy_basic_block = create_basic_block (NULL, (void *) 0,
>>                                           (basic_block) prev->aux);
>> -  copy_basic_block->count = apply_scale (bb->count, count_scale);
>> +  copy_basic_block->count
>> +      = (count_scale
>> +         ? apply_scale (bb->count, count_scale)
>> +         /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +            that didn't get profile counts) then we compute the new bb counts
>> +            via the statically-estimated frequencies.  */
>> +         : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));
>>
>>    /* We are going to rebuild frequencies from scratch.  These values
>>       have just small importance to drive canonicalize_loop_headers.  */
>> @@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
>>     debug stmts are left after a statement that must end the basic block.  */
>>
>>  static bool
>> -copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
>> +copy_edges_for_bb (basic_block bb, gcov_type count_scale,
>> +                   basic_block ret_bb,
>>                    bool can_make_abnormal_goto)
>>  {
>>    basic_block new_bb = (basic_block) bb->aux;
>> @@ -1912,7 +1919,14 @@ static bool
>>             && old_edge->dest->aux != EXIT_BLOCK_PTR)
>>           flags |= EDGE_FALLTHRU;
>>         new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
>> -       new_edge->count = apply_scale (old_edge->count, count_scale);
>> +        basic_block new_src_bb = (basic_block) old_edge->src->aux;
>> +       new_edge->count
>> +            = (count_scale
>> +               ? apply_scale (old_edge->count, count_scale)
>> +               // The bb counts have already been scaled with
>> +               // freq_to_count_scale when that is non-zero, so just scale
>> +               // that new bb count by the edge probability.
>> +               : apply_probability (new_src_bb->count, old_edge->probability));
>>         new_edge->probability = old_edge->probability;
>>        }
>>
>> @@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
>>     another function.  Walks FN via CFG, returns new fndecl.  */
>>
>>  static tree
>> -copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
>> +copy_cfg_body (copy_body_data * id, gcov_type count,
>> +              int frequency, int frequency_scale,
>>                basic_block entry_block_map, basic_block exit_block_map,
>>                bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2293,15 +2308,20 @@ static tree
>>    basic_block bb;
>>    tree new_fndecl = NULL;
>>    bool need_debug_cleanup = false;
>> -  gcov_type count_scale;
>> +  gcov_type count_scale = 0;
>> +  gcov_type freq_to_count_scale = 0;
>>    int last;
>>    int incoming_frequency = 0;
>>    gcov_type incoming_count = 0;
>>
>> -  if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
>> -    count_scale
>> -        = GCOV_COMPUTE_SCALE (count,
>> -                              ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
>> +  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
>> +  if (entry_bb->count)
>> +    count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
>> +  /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +     that didn't get profile counts) then we compute the new bb counts
>> +     via the statically-estimated frequencies.  */
>> +  else if (entry_bb->frequency)
>> +    freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
>>    else
>>      count_scale = REG_BR_PROB_BASE;
>>
>> @@ -2323,7 +2343,13 @@ static tree
>>             incoming_frequency += EDGE_FREQUENCY (e);
>>             incoming_count += e->count;
>>           }
>> -      incoming_count = apply_scale (incoming_count, count_scale);
>> +      incoming_count
>> +          = (count_scale
>> +             ? apply_scale (incoming_count, count_scale)
>> +             /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +                that didn't get profile counts) then we compute the new bb
>> +                counts via the statically-estimated frequencies.  */
>> +             : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
>>        incoming_frequency
>>         = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
>>        ENTRY_BLOCK_PTR->count = incoming_count;
>> @@ -2350,7 +2376,8 @@ static tree
>>    FOR_EACH_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
>>        {
>> -       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
>> +       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
>> +                                     freq_to_count_scale);
>>         bb->aux = new_bb;
>>         new_bb->aux = bb;
>>         new_bb->loop_father = entry_block_map->loop_father;
>> @@ -2364,7 +2391,8 @@ static tree
>>    FOR_ALL_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy
>>          || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
>> -      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
>> +      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
>> +                                              exit_block_map,
>>                                                can_make_abormal_goto);
>>
>>    if (new_entry)
>> @@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
>>     another function.  */
>>
>>  static tree
>> -copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
>> +copy_body (copy_body_data *id, gcov_type count, int frequency,
>> +          int frequency_scale,
>>            basic_block entry_block_map, basic_block exit_block_map,
>>            bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2571,7 +2600,8 @@ static tree
>>
>>    /* If this body has a CFG, walk CFG and copy.  */
>>    gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
>> -  body = copy_cfg_body (id, count, frequency_scale, entry_block_map, exit_block_map,
>> +  body = copy_cfg_body (id, count, frequency, frequency_scale,
>> +                       entry_block_map, exit_block_map,
>>                         blocks_to_copy, new_entry);
>>    copy_debug_stmts (id);
>>
>> @@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
>>       function in any way before this point, as this CALL_EXPR may be
>>       a self-referential call; if we're calling ourselves, we need to
>>       duplicate our body before altering anything.  */
>> -  copy_body (id, bb->count,
>> +  copy_body (id, bb->count, bb->frequency,
>>              GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
>>              bb, return_block, NULL, NULL);
>>
>> @@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
>>      }
>>
>>    /* Copy the Function's body.  */
>> -  copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
>> -            ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
>> +  copy_body (&id, old_entry_block->count, old_entry_block->frequency,
>> +            REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
>> +            blocks_to_copy, new_entry);
>>
>>    /* Renumber the lexical scoping (non-code) blocks consecutively.  */
>>    number_blocks (new_decl);
>> Index: predict.c
>> ===================================================================
>> --- predict.c   (revision 201644)
>> +++ predict.c   (working copy)
>> @@ -2976,6 +2976,24 @@ make_pass_strip_predict_hints (gcc::context *ctxt)
>>    return new pass_strip_predict_hints (ctxt);
>>  }
>>
>> +/* Initialize loop edges and compute estimated bb frequencies when there
>> +   is no profile data available.  */
>> +
>> +void
>> +init_and_estimate_bb_frequencies (void)
>> +{
>> +  if (profile_status == PROFILE_READ && counts_to_freqs ())
>> +    return;
>> +
>> +  loop_optimizer_init (0);
>> +  add_noreturn_fake_exit_edges ();
>> +  mark_irreducible_loops ();
>> +  connect_infinite_loops_to_exit ();
>> +  estimate_bb_frequencies ();
>> +  remove_fake_exit_edges ();
>> +  loop_optimizer_finalize ();
>> +}
>> +
>>  /* Rebuild function frequencies.  Passes are in general expected to
>>     maintain profile by hand, however in some cases this is not possible:
>>     for example when inlining several functions with loops freuqencies might run
>> @@ -2986,15 +3004,7 @@ rebuild_frequencies (void)
>>  {
>>    timevar_push (TV_REBUILD_FREQUENCIES);
>>    if (profile_status == PROFILE_GUESSED)
>> -    {
>> -      loop_optimizer_init (0);
>> -      add_noreturn_fake_exit_edges ();
>> -      mark_irreducible_loops ();
>> -      connect_infinite_loops_to_exit ();
>> -      estimate_bb_frequencies ();
>> -      remove_fake_exit_edges ();
>> -      loop_optimizer_finalize ();
>> -    }
>> +    init_and_estimate_bb_frequencies ();
>>    else if (profile_status == PROFILE_READ)
>>      counts_to_freqs ();
>>    else
>> Index: predict.h
>> ===================================================================
>> --- predict.h   (revision 201644)
>> +++ predict.h   (working copy)
>> @@ -38,6 +38,7 @@ enum prediction
>>  extern void predict_insn_def (rtx, enum br_predictor, enum prediction);
>>  extern int counts_to_freqs (void);
>>  extern void estimate_bb_frequencies (void);
>> +extern void init_and_estimate_bb_frequencies (void);
>>  extern const char *predictor_name (enum br_predictor);
>>  extern tree build_predict_expr (enum br_predictor, enum prediction);
>>  extern void tree_estimate_probability (void);
>> Index: profile.c
>> ===================================================================
>> --- profile.c   (revision 201644)
>> +++ profile.c   (working copy)
>> @@ -1305,6 +1305,12 @@ branch_prob (void)
>>
>>    values.release ();
>>    free_edge_list (el);
>> +
>> +  /* Call after setting profile_status to PROFILE_READ, will then
>> +     invoke counts_to_freqs and if the sum of the counts is zero, will
>> +     estimate the frequencies.  */
>> +  init_and_estimate_bb_frequencies ();
>> +
>>    coverage_end_function (lineno_checksum, cfg_checksum);
>>  }
>>  ^L
>> Index: ipa-inline-transform.c
>> ===================================================================
>> --- ipa-inline-transform.c      (revision 201644)
>> +++ ipa-inline-transform.c      (working copy)
>> @@ -51,7 +51,7 @@ int nfunctions_inlined;
>>
>>  static void
>>  update_noncloned_frequencies (struct cgraph_node *node,
>> -                             int freq_scale)
>> +                             gcov_type count_scale, int freq_scale)
>>  {
>>    struct cgraph_edge *e;
>>
>> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>>      freq_scale = 1;
>>    for (e = node->callees; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>>        if (!e->inline_failed)
>> -        update_noncloned_frequencies (e->callee, freq_scale);
>> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>>      }
>>    for (e = node->indirect_calls; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>>             }
>>           duplicate = false;
>>           e->callee->symbol.externally_visible = false;
>> -          update_noncloned_frequencies (e->callee, e->frequency);
>> +          // In the case of a COMDAT, the callee's count may be from other
>> +          // modules, and we need to scale it for the current module's calls
>> +          // (e.g. e->count may be 0 despite e->callee->count > 0).
>> +          gcov_type count_scale = REG_BR_PROB_BASE;
>> +          if (e->callee->count > e->count)
>> +            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
>> +          update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>>         }
>>        else
>>         {



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

