[PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components

Tue Sep 27 21:25:00 GMT 2016

On 09/23/2016 02:21 AM, Segher Boessenkool wrote:
> Deciding what blocks should run with a certain component active so that
> the total cost of executing the prologues (and epilogues) is optimal, is
> not a computationally feasible problem.  Instead, for each basic block,
> we estimate the cost of putting a prologue right before the block, and
> if that is cheaper than the total cost of putting prologues optimally
> (according to the estimated cost) in the dominator subtrees strictly
> dominated by this first block, place it at the first block instead.
> This simple procedure places the components optimally for any dominator
> sub tree where the root node's cost does not depend on anything outside
> its subtree.
>
> The cost is the execution frequency of all edges into the block coming
> from blocks that do not have this component active.  The estimated cost
> is the execution frequency of the block, minus the execution frequency
> of any backedges (which by definition are coming from subtrees, so if
> the "head" block gets a prologue, the source block of any backedge has
> that component active as well).
>
> Currently, the epilogues are placed as late as possible, given the
> constraints.  This does not matter for execution cost, but we could
> save a little bit of code size by placing the epilogues in a smarter
> way.  This is a possible future optimisation.
>
> Now all that is left is inserting prologues and epilogues on all edges
> that jump into resp. out of the "active" set of blocks.  Often we need
> to insert some components' prologues (or epilogues) on all edges into
> (or out of) a block.  In theory cross-jumping can unify all such, but
> in practice that often fails; besides, that is a lot of work.  So in
> this case we insert the prologue and epilogue components at the "head"
> or "tail" of a block, instead.
>
> As a final optimisation, if a block needs a prologue and its immediate
> dominator has the block as a post-dominator, that immediate dominator
> gets the prologue as well.
>
>
> 2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* function.c (thread_prologue_and_epilogue_insns): Recompute the
> 	live info.  Call try_shrink_wrapping_separate.  Compute the
> 	prologue_seq afterwards, if it has possibly changed.  Compute the
> 	split_prologue_seq and epilogue_seq later, too.
> 	* shrink-wrap.c: #include cfgbuild.h.
> 	(dump_components): New function.
> 	(struct sw): New struct.
> 	(SW): New function.
> 	(init_separate_shrink_wrap): New function.
> 	(fini_separate_shrink_wrap): New function.
> 	(place_prologue_for_one_component): New function.
> 	(spread_components): New function.
> 	(disqualify_problematic_components): New function.
> 	(emit_common_heads_for_components): New function.
> 	(emit_common_tails_for_components): New function.
> 	(insert_prologue_epilogue_for_components): New function.
> 	(try_shrink_wrapping_separate): New function.
> 	* shrink-wrap.h: Declare try_shrink_wrapping_separate.
>
> ---
>  gcc/function.c    |   9 +-
>  gcc/shrink-wrap.c | 729 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/shrink-wrap.h |   1 +
>  3 files changed, 737 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/function.c b/gcc/function.c
> index 53bad87..79e7b4f 100644
> --- a/gcc/function.c
> +++ b/gcc/function.c
> @@ -5920,9 +5920,7 @@ thread_prologue_and_epilogue_insns (void)
>    edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
>    edge orig_entry_edge = entry_edge;
>
> -  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
>    rtx_insn *prologue_seq = make_prologue_seq ();
> -  rtx_insn *epilogue_seq = make_epilogue_seq ();
>
>    /* Try to perform a kind of shrink-wrapping, making sure the
>       prologue/epilogue is emitted only around those parts of the
> @@ -5930,6 +5928,13 @@ thread_prologue_and_epilogue_insns (void)
>
>    try_shrink_wrapping (&entry_edge, prologue_seq);
>
> +  try_shrink_wrapping_separate (entry_edge->dest);
> +
> +  if (crtl->shrink_wrapped_separate)
> +    prologue_seq = make_prologue_seq ();
Note this implies that make_prologue_seq (and consequently the target 
specific bits for prologue generation) can be safely called more than 
once.  That's probably OK, but might be worth a comment -- your decision 
whether or not to add the comment.

> diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
> index b85b1c3..0dced22 100644
> --- a/gcc/shrink-wrap.c
> +++ b/gcc/shrink-wrap.c

> +/* Place the prologue for component WHICH, in the basic blocks dominated
> +   by HEAD.  Do a DFS over the dominator tree, and set bit WHICH in the
> +   HAS_COMPONENTS of a block if either the block has that bit set in
> +   NEEDS_COMPONENTS, or it is cheaper to place the prologue here than in all
> +   dominator subtrees separately.  */
> +static void
> +place_prologue_for_one_component (unsigned int which, basic_block head)
> +{
> +  /* The block we are currently dealing with.  */
> +  basic_block bb = head;
> +  /* Is this the first time we visit this block, i.e. have we just gone
> +     down the tree.  */
> +  bool first_visit = true;
> +
> +  /* Walk the dominator tree, visit one block per iteration of this loop.
> +     Each basic block is visited twice: once before visiting any children
> +     of the block, and once after visiting all of them (leaf nodes are
> +     visited only once).  As an optimization, we do not visit subtrees
> +     that can no longer influence the prologue placement.  */
> +  for (;;)
Is there some reason you wrote this as a loop rather than recursion? 
IMHO it makes this function (and spread_components) more difficult to 
reason about than it needs to be.

> +
> +/* Mark HAS_COMPONENTS for every block dominated by at least one block with
> +   PRO_COMPONENTS set for the respective components, starting at HEAD.  */
> +static void
> +spread_components (basic_block head)
I don't see any references to PRO_COMPONENTS in the actual code.  If I'm 
reading the code correctly, you're just accumulating/propagating 
HAS_COMPONENTS from parents to children via a DFS walk of the dominator 
tree, right?

> +
> +/* Place code for prologues and epilogues for COMPONENTS where we can put
> +   that code at the end of basic blocks.  */
> +static void
> +emit_common_tails_for_components (sbitmap components)
[ Snip. ]
> +
> +      /* Put the code at the end of the BB, but before any final jump.  */
> +      if (!bitmap_empty_p (epi))
So what if the final jump uses hard registers in one way or another?   I 
don't immediately see anything that verifies it is safe to transpose the 
epilogue and the final jump.

Conceptually want the epilogue to occur on the outgoing edge(s).  But 
you want to actually transpose the epilogue and the control flow insn so 
that you can insert the epilogue in one place.

But naive transposition isn't safe.  The control flow insn may use one 
or more registers that you were restoring or you may be on a cc0 target. 
   I think you need to handle the former more cleanly.  The latter I'd 
be comfortable filtering out in try_shrink_wrapping_separate.

This has similarities to some of the problems we've had in the past with 
rtl-gcse insertion as well as output reloads on insns with control flow.

With transposition issue addressed, the only blocker I see are some 
simple testcases we can add to the suite.  They don't have to be real 
extensive.  And one motivating example for the list archives, ideally 
the glibc malloc case.

Jeff