This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components

From: Jeff Law <law at redhat dot com>
To: Segher Boessenkool <segher at kernel dot crashing dot org>, gcc-patches at gcc dot gnu dot org
Cc: bschmidt at redhat dot com
Date: Thu, 8 Sep 2016 12:34:07 -0600
Subject: Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
Authentication-results: sourceware.org; auth=none
References: <cover.1470015604.git.segher@kernel.crashing.org> <e21c6260ba313fd4905f00fa0da29f92af14b13e.1470015604.git.segher@kernel.crashing.org>

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:


Deciding what blocks should run with a certain component active so that
the total cost of executing the prologues (and epilogues) is optimal, is
not a computationally feasible problem.

Really? It's just a dataflow problem is it not and one that ought toconverge very quickly I'd think. Or is it more a function of having torun it on so many independent components? I'm still pondering the valueof having every GPR be an independent component :-)

ISTM this adds a fair amount of complexity to the implementation in thatprologues and epilogues for a particular component can run more thanonce. Can you give a concrete example where this happens so that we canall understand it better?

If we keep this aspect of the implementation it seems like a note in thedeveloper section would be in order. It's certainly non-intuitive.

I only glanced over the code that seems related to this aspect of theimplementation. If we decide to go forward, I'd like to look at itagain more closely.


Now all that is left is inserting prologues and epilogues on all edges
that jump into resp. out of the "active" set of blocks.  Often we need
to insert some components' prologues (or epilogues) on all edges into
(or out of) a block.  In theory cross-jumping can unify all such, but
in practice that often fails; besides, that is a lot of work.  So in
this case we insert the prologue and epilogue components at the "head"
or "tail" of a block, instead.

Cross jumping is rather simplistic, so I'm not surprised that it doesn'tcatch all this stuff.


As a final optimisation, if a block needs a prologue and its immediate
dominator has the block as a post-dominator, the dominator gets the
prologue as well.

So why not just put it in the idom and not in the dominated block?Doesn't this just end up running that component's prologue twice?


2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* function.c (thread_prologue_and_epilogue_insns): Recompute the
	live info.  Call try_shrink_wrapping_separate.  Compute the
	prologue_seq afterwards, if it has possibly changed.  Compute the
	split_prologue_seq and epilogue_seq later, too.
	* shrink-wrap.c: #include cfgbuild.h.
	(dump_components): New function.
	(struct sw): New struct.
	(SW): New function.
	(init_separate_shrink_wrap): New function.
	(fini_separate_shrink_wrap): New function.
	(place_prologue_for_one_component): New function.
	(spread_components): New function.
	(disqualify_problematic_components): New function.
	(emit_common_heads_for_components): New function.
	(emit_common_tails_for_components): New function.
	(insert_prologue_epilogue_for_components): New function.
	(try_shrink_wrapping_separate): New function.
	* shrink-wrap.h: Declare try_shrink_wrapping_separate.
---
 gcc/function.c    |  15 +-
 gcc/shrink-wrap.c | 715 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/shrink-wrap.h |   1 +
 3 files changed, 729 insertions(+), 2 deletions(-)

diff --git a/gcc/function.c b/gcc/function.c
index bba0705..390e9a6 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -5912,6 +5912,12 @@ make_epilogue_seq (void)
 void
 thread_prologue_and_epilogue_insns (void)
 {
+  if (optimize > 1)
+    {
+      df_live_add_problem ();
+      df_live_set_all_dirty ();
+    }

Perhaps conditional on separate shrink wrapping?

@@ -5922,9 +5928,7 @@ thread_prologue_and_epilogue_insns (void)
   edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
   edge orig_entry_edge = entry_edge;

-  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
   rtx_insn *prologue_seq = make_prologue_seq ();
-  rtx_insn *epilogue_seq = make_epilogue_seq ();

   /* Try to perform a kind of shrink-wrapping, making sure the
      prologue/epilogue is emitted only around those parts of the
@@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void)

   try_shrink_wrapping (&entry_edge, prologue_seq);

+  df_analyze ();
+  try_shrink_wrapping_separate (entry_edge->dest);
+  if (crtl->shrink_wrapped_separate)
+    prologue_seq = make_prologue_seq ();

Perhaps push the df_analyze call into try_shrink_wrapping_separate?

ANd if it was successful, do you need to update the DF information? Isthat perhaps the cause of some of the issues we're seeing with DCE,regrename, the scheduler?

diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index b85b1c3..643e375 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "tree-pass.h"
 #include "cfgrtl.h"
+#include "cfgbuild.h"
 #include "params.h"
 #include "bb-reorder.h"
 #include "shrink-wrap.h"
@@ -1006,3 +1007,717 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq)
   BITMAP_FREE (bb_with);
   free_dominance_info (CDI_DOMINATORS);
 }
+
+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.  The following code does this, for prologue and epilogue
+   components that can be put in more than one location, and where those
+   components can be executed more than once (the epilogue component will
+   always be executed before the prologue component is executed a second
+   time).
+
+   What exactly is a component is target-dependent.  The more usual
+   components are simple saves/restores to/from the frame of callee-saved
+   registers.  This code treats components abstractly (as an sbitmap),
+   letting the target handle all details.
+
+   Prologue components are placed in such a way that for every component
+   the prologue is executed as infrequently as possible.  We do this by
+   walking the dominator tree, comparing the cost of placing a prologue
+   component before a block to the sum of costs determined for all subtrees
+   of that block.
+
+   From this placement, we then determine for each component all blocks
+   where at least one of this block's dominators (including itself) will
+   get a prologue inserted.  That then is how the components are placed.
+   We could place the epilogue components a bit smarter (we can save a
+   bit of code size sometimes); this is a possible future improvement.
+
+   Prologues and epilogues are preferably placed into a block, either at
+   the beginning or end of it, if it is needed for all predecessor resp.
+   successor edges; or placed on the edge otherwise.
+
+   If the placement of any prologue/epilogue leads to a situation we cannot
+   handle (for example, an abnormal edge would need to be split, or some
+   targets want to use some specific registers that may not be available
+   where we want to put them), separate shrink-wrapping for the components
+   in that prologue/epilogue is aborted.  */
+
+
+/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the
+   label LABEL.  */
+static void
+dump_components (const char *label, sbitmap components)
+{
+  if (bitmap_empty_p (components))
+    return;
+
+  fprintf (dump_file, " [%s", label);
+
+  for (unsigned int j = 0; j < components->n_bits; j++)
+    if (bitmap_bit_p (components, j))
+      fprintf (dump_file, " %u", j);
+
+  fprintf (dump_file, "]");

Consider allowing the target to provide a mapping from the component toa symbolic name of some kind and using that in the dumps. Related,consider using an enum rather than magic constants in the target bits (Inoted seeing component #0 as a magic constant in the ppc implementation).

+
+/* A helper function for accessing the pass-specific info.  */
+static inline struct sw *
+SW (basic_block bb)
+{
+  gcc_assert (bb->aux);
+  return (struct sw *) bb->aux;
+}
+
+/* Create the pass-specific data structures for separately shrink-wrapping
+   with components COMPONENTS.  */
+static void
+init_separate_shrink_wrap (sbitmap components)

So it seems like there's a toplevel list of components that's owned bythe target and state at each block components that are owned by thegeneric code. That's fine. Just make sure we doc that the toplevellist of components is allocated by the backend (and where does it getfreed?)

Consider using auto_sbitmap rather than manually managingallocation/releasing of the per-block structures. In fact, can't all ofSW become a class and we lose the explicit init/fini routines in favorof a ctor/dtor?

+
+/* Place code for prologues and epilogues for COMPONENTS where we can put
+   that code at the start of basic blocks.  */
+static void
+emit_common_heads_for_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      /* Find which prologue resp. epilogue components are needed for all
+	 predecessor edges to this block.  */

Nit. Avoid ".resp". I actually had to look that abbreviation up :-)Being a bit more verbose in the comments would be preferred over ".resp".

Having looked at this in more detail now, I'm wondering if we want totalk at all in the docs about when selection vs disqualifying happens.ISTM that we build the set of components we want to insert for eachedge/block. When that's complete, we then prune those results with thedisqualifying sets.

For the PPC R0 vs LR is the only thing that causes disqualificationright? Can't that be handled when we build the set of components wewant to insert for each edge/block? Is there some advantage to handlingdisqualifications after all the potential insertion points have beenhandled?


Jeff

Follow-Ups:
- Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
  - From: Segher Boessenkool

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]