This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components


On 07/31/2016 07:42 PM, Segher Boessenkool wrote:

Deciding what blocks should run with a certain component active so that
the total cost of executing the prologues (and epilogues) is optimal, is
not a computationally feasible problem.
Really? It's just a dataflow problem is it not and one that ought to converge very quickly I'd think. Or is it more a function of having to run it on so many independent components? I'm still pondering the value of having every GPR be an independent component :-)

ISTM this adds a fair amount of complexity to the implementation in that prologues and epilogues for a particular component can run more than once. Can you give a concrete example where this happens so that we can all understand it better?

If we keep this aspect of the implementation it seems like a note in the developer section would be in order. It's certainly non-intuitive.

I only glanced over the code that seems related to this aspect of the implementation. If we decide to go forward, I'd like to look at it again more closely.


Now all that is left is inserting prologues and epilogues on all edges
that jump into resp. out of the "active" set of blocks.  Often we need
to insert some components' prologues (or epilogues) on all edges into
(or out of) a block.  In theory cross-jumping can unify all such, but
in practice that often fails; besides, that is a lot of work.  So in
this case we insert the prologue and epilogue components at the "head"
or "tail" of a block, instead.
Cross jumping is rather simplistic, so I'm not surprised that it doesn't catch all this stuff.


As a final optimisation, if a block needs a prologue and its immediate
dominator has the block as a post-dominator, the dominator gets the
prologue as well.
So why not just put it in the idom and not in the dominated block? Doesn't this just end up running that component's prologue twice?


2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* function.c (thread_prologue_and_epilogue_insns): Recompute the
	live info.  Call try_shrink_wrapping_separate.  Compute the
	prologue_seq afterwards, if it has possibly changed.  Compute the
	split_prologue_seq and epilogue_seq later, too.
	* shrink-wrap.c: #include cfgbuild.h.
	(dump_components): New function.
	(struct sw): New struct.
	(SW): New function.
	(init_separate_shrink_wrap): New function.
	(fini_separate_shrink_wrap): New function.
	(place_prologue_for_one_component): New function.
	(spread_components): New function.
	(disqualify_problematic_components): New function.
	(emit_common_heads_for_components): New function.
	(emit_common_tails_for_components): New function.
	(insert_prologue_epilogue_for_components): New function.
	(try_shrink_wrapping_separate): New function.
	* shrink-wrap.h: Declare try_shrink_wrapping_separate.
---
 gcc/function.c    |  15 +-
 gcc/shrink-wrap.c | 715 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/shrink-wrap.h |   1 +
 3 files changed, 729 insertions(+), 2 deletions(-)

diff --git a/gcc/function.c b/gcc/function.c
index bba0705..390e9a6 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -5912,6 +5912,12 @@ make_epilogue_seq (void)
 void
 thread_prologue_and_epilogue_insns (void)
 {
+  if (optimize > 1)
+    {
+      df_live_add_problem ();
+      df_live_set_all_dirty ();
+    }
Perhaps conditional on separate shrink wrapping?

@@ -5922,9 +5928,7 @@ thread_prologue_and_epilogue_insns (void)
   edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
   edge orig_entry_edge = entry_edge;

-  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
   rtx_insn *prologue_seq = make_prologue_seq ();
-  rtx_insn *epilogue_seq = make_epilogue_seq ();

   /* Try to perform a kind of shrink-wrapping, making sure the
      prologue/epilogue is emitted only around those parts of the
@@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void)

   try_shrink_wrapping (&entry_edge, prologue_seq);

+  df_analyze ();
+  try_shrink_wrapping_separate (entry_edge->dest);
+  if (crtl->shrink_wrapped_separate)
+    prologue_seq = make_prologue_seq ();
Perhaps push the df_analyze call into try_shrink_wrapping_separate?

ANd if it was successful, do you need to update the DF information? Is that perhaps the cause of some of the issues we're seeing with DCE, regrename, the scheduler?



diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index b85b1c3..643e375 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "tree-pass.h"
 #include "cfgrtl.h"
+#include "cfgbuild.h"
 #include "params.h"
 #include "bb-reorder.h"
 #include "shrink-wrap.h"
@@ -1006,3 +1007,717 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq)
   BITMAP_FREE (bb_with);
   free_dominance_info (CDI_DOMINATORS);
 }
+
+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.  The following code does this, for prologue and epilogue
+   components that can be put in more than one location, and where those
+   components can be executed more than once (the epilogue component will
+   always be executed before the prologue component is executed a second
+   time).
+
+   What exactly is a component is target-dependent.  The more usual
+   components are simple saves/restores to/from the frame of callee-saved
+   registers.  This code treats components abstractly (as an sbitmap),
+   letting the target handle all details.
+
+   Prologue components are placed in such a way that for every component
+   the prologue is executed as infrequently as possible.  We do this by
+   walking the dominator tree, comparing the cost of placing a prologue
+   component before a block to the sum of costs determined for all subtrees
+   of that block.
+
+   From this placement, we then determine for each component all blocks
+   where at least one of this block's dominators (including itself) will
+   get a prologue inserted.  That then is how the components are placed.
+   We could place the epilogue components a bit smarter (we can save a
+   bit of code size sometimes); this is a possible future improvement.
+
+   Prologues and epilogues are preferably placed into a block, either at
+   the beginning or end of it, if it is needed for all predecessor resp.
+   successor edges; or placed on the edge otherwise.
+
+   If the placement of any prologue/epilogue leads to a situation we cannot
+   handle (for example, an abnormal edge would need to be split, or some
+   targets want to use some specific registers that may not be available
+   where we want to put them), separate shrink-wrapping for the components
+   in that prologue/epilogue is aborted.  */
+
+
+/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the
+   label LABEL.  */
+static void
+dump_components (const char *label, sbitmap components)
+{
+  if (bitmap_empty_p (components))
+    return;
+
+  fprintf (dump_file, " [%s", label);
+
+  for (unsigned int j = 0; j < components->n_bits; j++)
+    if (bitmap_bit_p (components, j))
+      fprintf (dump_file, " %u", j);
+
+  fprintf (dump_file, "]");
Consider allowing the target to provide a mapping from the component to a symbolic name of some kind and using that in the dumps. Related, consider using an enum rather than magic constants in the target bits (I noted seeing component #0 as a magic constant in the ppc implementation).


+
+/* A helper function for accessing the pass-specific info.  */
+static inline struct sw *
+SW (basic_block bb)
+{
+  gcc_assert (bb->aux);
+  return (struct sw *) bb->aux;
+}
+
+/* Create the pass-specific data structures for separately shrink-wrapping
+   with components COMPONENTS.  */
+static void
+init_separate_shrink_wrap (sbitmap components)
So it seems like there's a toplevel list of components that's owned by the target and state at each block components that are owned by the generic code. That's fine. Just make sure we doc that the toplevel list of components is allocated by the backend (and where does it get freed?)

Consider using auto_sbitmap rather than manually managing allocation/releasing of the per-block structures. In fact, can't all of SW become a class and we lose the explicit init/fini routines in favor of a ctor/dtor?



+
+/* Place code for prologues and epilogues for COMPONENTS where we can put
+   that code at the start of basic blocks.  */
+static void
+emit_common_heads_for_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      /* Find which prologue resp. epilogue components are needed for all
+	 predecessor edges to this block.  */
Nit. Avoid ".resp". I actually had to look that abbreviation up :-) Being a bit more verbose in the comments would be preferred over ".resp".


Having looked at this in more detail now, I'm wondering if we want to talk at all in the docs about when selection vs disqualifying happens. ISTM that we build the set of components we want to insert for each edge/block. When that's complete, we then prune those results with the disqualifying sets.

For the PPC R0 vs LR is the only thing that causes disqualification right? Can't that be handled when we build the set of components we want to insert for each edge/block? Is there some advantage to handling disqualifications after all the potential insertion points have been handled?

Jeff


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]