This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
- From: Teresa Johnson <tejohnson@google.com>
- To: Jan Hubicka <hubicka@ucw.cz>
- Cc: Bernhard Reutner-Fischer <rep.dot.nop@gmail.com>, "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, Steven Bosscher <stevenb.gcc@gmail.com>, Jeff Law <law@redhat.com>, "marxin.liska" <marxin.liska@gmail.com>, Sriraman Tallam <tmsriram@google.com>
- Date: Mon, 19 Aug 2013 06:34:23 -0700
- Subject: Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
- References: <20130802150529.GC15776@kam.mff.cuni.cz> <CAAe5K+WiR02Rs1jYMFRF2F8ey60UO7LwRa8WWq7coQ5Pq8HhiQ@mail.gmail.com> <CAAe5K+WToUznwFFfm5beapXAOOrOgxHR8LXmYBTL70C4VVsT+w@mail.gmail.com> <20130808222332.GA31755@kam.mff.cuni.cz> <CAAe5K+W+8borbPkt4BB1ayRgDbFBtd6oyZsGuUiC854o9t0Rjg@mail.gmail.com> <20130809095843.GC31755@kam.mff.cuni.cz> <CAAe5K+XXT6t5CXBDXPWMNSrLWwqfw8F_J2fNUAN2afqb5qPhzQ@mail.gmail.com> <20130809152804.GA6579@kam.mff.cuni.cz> <CAAe5K+UcMevsDcXOeq5tu0+u-FrLVVWR6Wd20ZhBHNWdNsQ4Zw@mail.gmail.com> <CAAe5K+W+VEERrsaCCjCD=n40_MO2EPashDp5qnKoS8SLaSjBjQ@mail.gmail.com> <20130817204408.GA16557@kam.mff.cuni.cz>
On Sat, Aug 17, 2013 at 1:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I added both of these and ran into issues due to profile maintenance.
>> For example, there were non-zero blocks in the cold section because
>> pro_and_epilogue split a simple return block that was previously reached
>> by both hot and cold paths. The new return block that was then only
>> reached via the cold path did not have its count properly updated to
>> reflect this, and since with this patch, blocks dominated by cold
>> blocks are remarked cold, we ended up with a non-zero count block in
>> the cold section. And there were 0 count blocks reached by non-zero
>> edges because copyprop did not clean up edge weights after removing
>> some branches and blocks, leading to non-zero edge weights that had
>> previously targeted a branch that was removed, now targeting a 0 count
>> block that the removed branch always branched around.
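[Editor's note: the invariant these bugs violate can be modeled in a few lines of standalone C. This is a simplified sketch; the struct and function names are invented for illustration and are not GCC internals.]

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t gcov_type;   /* profile counts, as in GCC */

struct edge { gcov_type count; };

/* A block's count should equal the sum of its incoming edge counts,
   and a zero-count (cold) block must not be targeted by non-zero
   edges.  The copyprop breakage described above violates exactly
   this: a stale non-zero edge ends up pointing at a 0-count block.  */
static int bb_profile_sane (gcov_type bb_count,
                            const struct edge *preds, int n_preds)
{
  gcov_type sum = 0;
  for (int i = 0; i < n_preds; i++)
    sum += preds[i].count;
  if (bb_count == 0 && sum != 0)
    return 0;                 /* non-zero edge into a 0-count block */
  return sum == bb_count;
}
```

After pro_and_epilogue splits a return block, the cold copy's count would need to be re-derived from its (now all-cold) predecessors for this check to hold.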
>
> I see, can you please send fixes for the problems you identified?
> Thanks for working on this!
I don't have fixes at this point - I just identified the phase and
transformation from looking at the dump. But I'll try to fix them soon
while I'm working on performance tuning for splitting. I have a
feeling there are probably a bunch of places where the profile isn't
getting updated properly, unfortunately.
>>
>> In any case, the good news is in that the cases I looked at, the
>> splitting code is doing the right thing and these blocks that were
>> marked cold really were cold. It would be great to fix the profile
>> maintenance issues, but in the meantime the above sanity checks
>> are too aggressive.
>
> We can keep them and output info into dump file - it is what most of
> the profile sanity checking does anyway.
Ok, I will add that.
>
> Did you try to use Martin's linker script to make the text.unlikely
> section non-executable? I think that way we will easily find what really
> causes us to use it during startup of the trained application (just like
> Martin does for gimp).
I haven't - where can I get that script?
>>
>> I think it makes sense to commit the current patch if possible, as it
>> is making the splitting more sane.
>
> My only concern about the patch is that I am not convinced the dominator
> based code has a chance to work reliably enough that we won't see too many
> accesses into the cold section.
Remember it isn't using dominance anymore. The latest patch was
instead ensuring the most frequent path between hot blocks and the
entry/exit are marked hot. That should be better than the dominance
approach used in the earlier version.
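[Editor's note: a rough standalone sketch of that idea follows. The graph representation and names here are invented for illustration; the actual patch works on GCC's CFG and edge probabilities.]

```c
#include <assert.h>

#define N 5                 /* blocks, 0 = entry */
enum { COLD = 0, HOT = 1 };

/* prob[i][j]: branch probability of edge i->j, scaled to 10000;
   0 means no edge.  */
static int prob[N][N];
static int partition[N];

/* Walk from hot block BB back toward the entry along the most
   probable predecessor edge at each step, forcing every block on
   that path hot, so the hot block is reachable without touching
   the cold section.  */
static void ensure_hot_path_to_entry (int bb)
{
  while (bb != 0)
    {
      int best = -1, best_p = -1;
      for (int pred = 0; pred < N; pred++)
        if (prob[pred][bb] > 0 && prob[pred][bb] > best_p)
          {
            best_p = prob[pred][bb];
            best = pred;
          }
      if (best < 0)
        break;              /* no predecessor; give up */
      partition[best] = HOT;
      bb = best;
    }
}
```

A symmetric walk toward the exit would handle the other direction; the point is that only the single most frequent path is forced hot, unlike the dominance approach, which marked whole dominated regions.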
> We can commit it and work on better solution incrementally but it will
> probably mean replacing it later. If you think it makes things easier
> to work on it incrementally, I think the patch is OK.
Yes, I think this is a big step forward from what is there now for
splitting, which does the splitting purely based on bb count in
isolation. I don't have a much better solution in mind yet.
>>
>> > - I'll try building and profiling gimp myself to see if I can
>> > reproduce the issue with code executing out of the cold section.
>>
>> I have spent some time this week trying to get the latest gimp Martin
>> pointed me to configured and built, but it took a while to track down
>> and configure/build all of the required versions of dependent
>> packages. I'm still hitting some issues trying to get it compiled, so
>> it may not yet be configured properly. I'll take a look again early
>> next week.
>
> I do not think there is anything special about gimp. You can probably
> take any other bigger app, like GCC itself. With a profiledbootstrap
> and the linker script to lock the unlikely section you should get ICEs where
> we jump into the cold section and should not.
Ok, please point me to the linker script and I will try gcc
profiledbootstrap as well. I wanted to try gimp if possible as I
haven't seen this much jumping to the cold section in some of the
internal apps I tried.
Thanks,
Teresa
>>
>> Teresa
>>
>> patch for updating counts based on estimated frequencies to address
>> inlined comdats with 0 profile counts:
>>
>> 2013-08-16 Teresa Johnson <tejohnson@google.com>
>>
>> * tree-inline.c (copy_bb): Compute count based on frequency.
>> (copy_edges_for_bb): Ditto.
>> (copy_cfg_body): Ditto.
>> (copy_body): Pass down frequency.
>> (expand_call_inline): Ditto.
>> (tree_function_versioning): Ditto.
>> * predict.c (init_and_estimate_bb_frequencies): New function.
>> (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>> * predict.h (init_and_estimate_bb_frequencies): Declare.
>> * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>> * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>> counts.
>> (clone_inlined_nodes): Compute edge count scale if needed.
>
> I do not see why the inliner needs to care about scaling more than it does right
> now. So you have init_and_estimate_bb_frequencies that forces profile guessing
> on a given function body. In addition to that I think you need something like
> freqs_to_counts that will compute counts based on freqs with a given scale
> (actually you can do that as part of propagation before frequencies are scaled
> to the usual 0...FREQ_MAX scale and precision is lost).
>
> Because an offline COMDAT function will be produced for every COMDAT used, I think
> it is bad to produce any COMDAT (or any function reachable via calls with non-0
> count) that has an empty profile (either because it got lost by COMDAT merging
> or because of a reading mismatch).
>
> So I guess you can just check functions with 0 count and non-0 count callers
> and initialize their guessed profile.
> Some capping will probably be needed to avoid propagating insanely large numbers.
>
> Since new direct calls can be discovered later, inline may want to do that
> again each time it inlines non-0 count call of COMDAT with 0 count...
>
> Honza
>>
>> Index: tree-inline.c
>> ===================================================================
>> --- tree-inline.c (revision 201644)
>> +++ tree-inline.c (working copy)
>> @@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id
>>
>> static basic_block
>> copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
>> - gcov_type count_scale)
>> + gcov_type count_scale, gcov_type freq_to_count_scale)
>> {
>> gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
>> basic_block copy_basic_block;
>> @@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
>> basic_block_info automatically. */
>> copy_basic_block = create_basic_block (NULL, (void *) 0,
>> (basic_block) prev->aux);
>> - copy_basic_block->count = apply_scale (bb->count, count_scale);
>> + copy_basic_block->count
>> + = (count_scale
>> + ? apply_scale (bb->count, count_scale)
>> + /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> + that didn't get profile counts) then we compute the new bb counts
>> + via the statically-estimated frequencies. */
>> + : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));
>>
>> /* We are going to rebuild frequencies from scratch. These values
>> have just small importance to drive canonicalize_loop_headers. */
>> @@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
>> debug stmts are left after a statement that must end the basic block. */
>>
>> static bool
>> -copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
>> +copy_edges_for_bb (basic_block bb, gcov_type count_scale,
>> + basic_block ret_bb,
>> bool can_make_abnormal_goto)
>> {
>> basic_block new_bb = (basic_block) bb->aux;
>> @@ -1912,7 +1919,14 @@ static bool
>> && old_edge->dest->aux != EXIT_BLOCK_PTR)
>> flags |= EDGE_FALLTHRU;
>> new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
>> - new_edge->count = apply_scale (old_edge->count, count_scale);
>> + basic_block new_src_bb = (basic_block) old_edge->src->aux;
>> + new_edge->count
>> + = (count_scale
>> + ? apply_scale (old_edge->count, count_scale)
>> + // The bb counts have already been scaled with freq_to_count_scale
>> + // when that is non-zero, so just scale that new bb count by
>> + // the edge probability.
>> + : apply_probability (new_src_bb->count, old_edge->probability));
>> new_edge->probability = old_edge->probability;
>> }
>>
>> @@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
>> another function. Walks FN via CFG, returns new fndecl. */
>>
>> static tree
>> -copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
>> +copy_cfg_body (copy_body_data * id, gcov_type count,
>> + int frequency, int frequency_scale,
>> basic_block entry_block_map, basic_block exit_block_map,
>> bitmap blocks_to_copy, basic_block new_entry)
>> {
>> @@ -2293,15 +2308,20 @@ static tree
>> basic_block bb;
>> tree new_fndecl = NULL;
>> bool need_debug_cleanup = false;
>> - gcov_type count_scale;
>> + gcov_type count_scale = 0;
>> + gcov_type freq_to_count_scale = 0;
>> int last;
>> int incoming_frequency = 0;
>> gcov_type incoming_count = 0;
>>
>> - if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
>> - count_scale
>> - = GCOV_COMPUTE_SCALE (count,
>> - ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
>> + basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
>> + if (entry_bb->count)
>> + count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
>> + /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> + that didn't get profile counts) then we compute the new bb counts
>> + via the statically-estimated frequencies. */
>> + else if (entry_bb->frequency)
>> + freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
>> else
>> count_scale = REG_BR_PROB_BASE;
>>
>> @@ -2323,7 +2343,13 @@ static tree
>> incoming_frequency += EDGE_FREQUENCY (e);
>> incoming_count += e->count;
>> }
>> - incoming_count = apply_scale (incoming_count, count_scale);
>> + incoming_count
>> + = (count_scale
>> + ? apply_scale (incoming_count, count_scale)
>> + /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> + that didn't get profile counts) then we compute the new bb counts
>> + via the statically-estimated frequencies. */
>> + : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
>> incoming_frequency
>> = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
>> ENTRY_BLOCK_PTR->count = incoming_count;
>> @@ -2350,7 +2376,8 @@ static tree
>> FOR_EACH_BB_FN (bb, cfun_to_copy)
>> if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
>> {
>> - basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
>> + basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
>> + freq_to_count_scale);
>> bb->aux = new_bb;
>> new_bb->aux = bb;
>> new_bb->loop_father = entry_block_map->loop_father;
>> @@ -2364,7 +2391,8 @@ static tree
>> FOR_ALL_BB_FN (bb, cfun_to_copy)
>> if (!blocks_to_copy
>> || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
>> - need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
>> + need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
>> + exit_block_map,
>> can_make_abnormal_goto);
>>
>> if (new_entry)
>> @@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
>> another function. */
>>
>> static tree
>> -copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
>> +copy_body (copy_body_data *id, gcov_type count, int frequency,
>> + int frequency_scale,
>> basic_block entry_block_map, basic_block exit_block_map,
>> bitmap blocks_to_copy, basic_block new_entry)
>> {
>> @@ -2571,7 +2600,8 @@ static tree
>>
>> /* If this body has a CFG, walk CFG and copy. */
>> gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
>> - body = copy_cfg_body (id, count, frequency_scale, entry_block_map, exit_block_map,
>> + body = copy_cfg_body (id, count, frequency, frequency_scale,
>> + entry_block_map, exit_block_map,
>> blocks_to_copy, new_entry);
>> copy_debug_stmts (id);
>>
>> @@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
>> function in any way before this point, as this CALL_EXPR may be
>> a self-referential call; if we're calling ourselves, we need to
>> duplicate our body before altering anything. */
>> - copy_body (id, bb->count,
>> + copy_body (id, bb->count, bb->frequency,
>> GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
>> bb, return_block, NULL, NULL);
>>
>> @@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
>> }
>>
>> /* Copy the Function's body. */
>> - copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
>> - ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
>> + copy_body (&id, old_entry_block->count, old_entry_block->frequency,
>> + REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
>> + blocks_to_copy, new_entry);
>>
>> /* Renumber the lexical scoping (non-code) blocks consecutively. */
>> number_blocks (new_decl);
>> Index: predict.c
>> ===================================================================
>> --- predict.c (revision 201644)
>> +++ predict.c (working copy)
>> @@ -2976,6 +2976,24 @@ make_pass_strip_predict_hints (gcc::context *ctxt)
>> return new pass_strip_predict_hints (ctxt);
>> }
>>
>> +/* Initialize loop edges and compute estimated bb frequencies when there
>> + is no profile data available. */
>> +
>> +void
>> +init_and_estimate_bb_frequencies (void)
>> +{
>> + if (profile_status == PROFILE_READ && counts_to_freqs ())
>> + return;
>> +
>> + loop_optimizer_init (0);
>> + add_noreturn_fake_exit_edges ();
>> + mark_irreducible_loops ();
>> + connect_infinite_loops_to_exit ();
>> + estimate_bb_frequencies ();
>> + remove_fake_exit_edges ();
>> + loop_optimizer_finalize ();
>> +}
>> +
>> /* Rebuild function frequencies. Passes are in general expected to
>> maintain profile by hand, however in some cases this is not possible:
>> for example when inlining several functions with loops frequencies might run
>> @@ -2986,15 +3004,7 @@ rebuild_frequencies (void)
>> {
>> timevar_push (TV_REBUILD_FREQUENCIES);
>> if (profile_status == PROFILE_GUESSED)
>> - {
>> - loop_optimizer_init (0);
>> - add_noreturn_fake_exit_edges ();
>> - mark_irreducible_loops ();
>> - connect_infinite_loops_to_exit ();
>> - estimate_bb_frequencies ();
>> - remove_fake_exit_edges ();
>> - loop_optimizer_finalize ();
>> - }
>> + init_and_estimate_bb_frequencies ();
>> else if (profile_status == PROFILE_READ)
>> counts_to_freqs ();
>> else
>> Index: predict.h
>> ===================================================================
>> --- predict.h (revision 201644)
>> +++ predict.h (working copy)
>> @@ -38,6 +38,7 @@ enum prediction
>> extern void predict_insn_def (rtx, enum br_predictor, enum prediction);
>> extern int counts_to_freqs (void);
>> extern void estimate_bb_frequencies (void);
>> +extern void init_and_estimate_bb_frequencies (void);
>> extern const char *predictor_name (enum br_predictor);
>> extern tree build_predict_expr (enum br_predictor, enum prediction);
>> extern void tree_estimate_probability (void);
>> Index: profile.c
>> ===================================================================
>> --- profile.c (revision 201644)
>> +++ profile.c (working copy)
>> @@ -1305,6 +1305,12 @@ branch_prob (void)
>>
>> values.release ();
>> free_edge_list (el);
>> +
>> + /* Call after setting profile_status to PROFILE_READ, will then
>> + invoke counts_to_freqs and if the sum of the counts is zero, will
>> + estimate the frequencies. */
>> + init_and_estimate_bb_frequencies ();
>> +
>> coverage_end_function (lineno_checksum, cfg_checksum);
>> }
>> ^L
>> Index: ipa-inline-transform.c
>> ===================================================================
>> --- ipa-inline-transform.c (revision 201644)
>> +++ ipa-inline-transform.c (working copy)
>> @@ -51,7 +51,7 @@ int nfunctions_inlined;
>>
>> static void
>> update_noncloned_frequencies (struct cgraph_node *node,
>> - int freq_scale)
>> + gcov_type count_scale, int freq_scale)
>> {
>> struct cgraph_edge *e;
>>
>> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>> freq_scale = 1;
>> for (e = node->callees; e; e = e->next_callee)
>> {
>> + e->count = apply_scale (e->count, count_scale);
>> e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>> if (e->frequency > CGRAPH_FREQ_MAX)
>> e->frequency = CGRAPH_FREQ_MAX;
>> if (!e->inline_failed)
>> - update_noncloned_frequencies (e->callee, freq_scale);
>> + update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>> }
>> for (e = node->indirect_calls; e; e = e->next_callee)
>> {
>> + e->count = apply_scale (e->count, count_scale);
>> e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>> if (e->frequency > CGRAPH_FREQ_MAX)
>> e->frequency = CGRAPH_FREQ_MAX;
>> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>> }
>> duplicate = false;
>> e->callee->symbol.externally_visible = false;
>> - update_noncloned_frequencies (e->callee, e->frequency);
>> + // In the case of a COMDAT, the callee's count may be from other
>> + // modules, and we need to scale it for the current module's calls
>> + // (e.g. e->count may be 0 despite e->callee->count > 0).
>> + gcov_type count_scale = REG_BR_PROB_BASE;
>> + if (e->callee->count > e->count)
>> + count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
>> + update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>> }
>> else
>> {
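[Editor's note: the COMDAT count scaling in clone_inlined_nodes above boils down to the following arithmetic. This standalone model redefines GCC's helper macros locally with their usual values; it is a simplification for illustration, not the patch itself.]

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t gcov_type;

#define REG_BR_PROB_BASE 10000
#define RDIV(a, b) (((a) + (b) / 2) / (b))
#define GCOV_COMPUTE_SCALE(num, den) \
  ((den) ? RDIV ((gcov_type) (num) * REG_BR_PROB_BASE, (den)) \
         : REG_BR_PROB_BASE)
#define apply_scale(count, scale) \
  RDIV ((gcov_type) (count) * (scale), REG_BR_PROB_BASE)

/* When inlining a COMDAT whose count includes calls from other
   modules, scale the callee's internal edge counts down to this
   call site's share: edge_count / callee_count of the total.  */
static gcov_type comdat_edge_count (gcov_type edge_count,
                                    gcov_type callee_count,
                                    gcov_type callee_edge_count)
{
  gcov_type count_scale = REG_BR_PROB_BASE;
  if (callee_count > edge_count)
    count_scale = GCOV_COMPUTE_SCALE (edge_count, callee_count);
  return apply_scale (callee_edge_count, count_scale);
}
```

For example, a COMDAT with a total count of 1000, inlined at a call site with count 250, gets its edge counts scaled to a quarter of their recorded values.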
--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413