This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PATCH, PR 10474] Split live-ranges of function arguments to help shrink-wrapping


Hi,

in spring I have suggested to shedule pass_cprop_hardreg before
pass_thread_prologue_and_epilogue in order to create many more
shrink-wrapping opportunities.  The problem is that formal arguments
of a functions which need to be saved across calls on slow paths often
get assigned callee saved registers and the very first BB thus needs
prologue, which makes any attempt at shrink-wrapping difficult.

Steven suggested that splitting live-ranges of these registers might
be a better approach (and Vlad suggested that I should look at
http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01137.html, a big thanks
to both) and this is what the patch below does.

The basic idea is to split live ranges of problematic pseudos in the
common dominator of all non-sibling calls and all uses of those
pseudos in BBs which are reachable from calls, if such dominator is
not in a loop and if it does not post-dominate the entry BB.

As far as the number of performed shrink-wrappings is concerned, it
has even bigger effect to scheduling pass_cprop_hardreg earlier that I
have proposed in spring.  Apart from looking at bootstrap and SPEC
2006 (at -Ofast) I have tried compiling Python and Ruby (with whatever
their default flags were, I believe mostly -O3) because David
suggested at his Cauldron talk last year that shrink-wrapping may be
important:

    Number of performed shrink-wrappings:
    | Source                  | Trunk | With patch |       % |
    |-------------------------+-------+------------+---------|
    | SPEC 2006 Int           |   734 |       1670 | +127.52 |
    | SPEC 2006 FP (*)        |   575 |       1022 |  +77.74 |
    | C/CPP bootstrap stage 2 |  1748 |       3015 |  +72.48 |
    | Python 2.7.5            |   231 |        401 |  +73.59 |
    | Python 3.3.2            |   234 |        416 |  +77.78 |
    | Ruby 2.0.0-p247         |   331 |        464 |  +40.18 |

(*) I forgot to increase the stack limit and I believe that is the
reason why 410.bwaves, 416.gamess, 447.dealII and 481.wrf SPEC 2006 FP
benchmarks failed, regardless of my patch, but I have not investigated
what exactly happened, so there might have even been compile-time
issues.

As far as run-times are concerned, I have tried running the benchmarks
from hg.python.org/benchmarks but I do not have anything to report.
David claimed that shrink-wrapping in function lookdict_string is
likely to help but this patch alone does not make that happen, I am
investigating what else needs to be done for this to happen.  I have
not yet tried benchmarking ruby.  Spec 2006 (*) run times are a bit
encouraging.

    Run-times (seconds, median of three runs):
    | Benchmark      | Trunk | Live range splitting |     % |
    |----------------+-------+----------------------+-------|
    | 400.perlbench  |   294 |                  290 | -1.36 |
    | 401.bzip2      |   384 |                  387 | +0.78 |
    | 403.gcc        |   239 |                  237 | -0.84 |
    | 429.mcf        |   227 |                  228 | +0.44 |
    | 445.gobmk      |   392 |                  390 | -0.51 |
    | 456.hmmer      |   343 |                  342 | -0.29 |
    | 458.sjeng      |   418 |                  411 | -1.67 |
    | 462.libquantum |   287 |                  288 | +0.35 |
    | 464.h264ref    |   417 |                  418 | +0.24 |
    | 471.omnetpp    |   275 |                  280 | +1.82 |
    | 473.astar      |   312 |                  295 | -5.45 |
    | 483.xalancbmk  |   183 |                  183 |  0.00 |
    |----------------+-------+----------------------+-------|
    | 433.milc       |   336 |                  335 | -0.30 |
    | 434.zeusmp     |   247 |                  246 | -0.40 |
    | 435.gromacs    |   259 |                  259 |  0.00 |
    | 436.cactusADM  |   192 |                  186 | -3.12 |
    | 437.leslie3d   |   227 |                  226 | -0.44 |
    | 444.namd       |   315 |                  315 |  0.00 |
    | 450.soplex     |   202 |                  200 | -0.99 |
    | 453.povray     |   136 |                  134 | -1.47 |
    | 454.calculix   |   300 |                  300 |  0.00 |
    | 459.GemsFDTD   |   273 |                  267 | -2.20 |
    | 465.tonto      |   685 |                  685 |  0.00 |
    | 470.lbm        |   212 |                  213 | +0.47 |
    | 482.sphinx3    |   363 |                  360 | -0.83 |

The patch is speculative, quite a few modifications do not lead to
shrink wrappings, as outlined in the following table which shows how
many of the changed functions still needed their prologue in the first
BB.

    Number of functions touched:
    |                         |  Modified |         |       |
    | Source                  | functions | In vain |     % |
    |-------------------------+-----------+---------+-------|
    | SPEC 2006 Int           |      2004 |     745 | 37.18 |
    | SPEC 2006 FP (*)        |      1491 |     881 | 59.09 |
    | C/C++ bootstrap stage 2 |      2864 |     935 | 32.65 |
    | Python 2.7.5            |       351 |     121 | 34.47 |
    | Python 3.3.2            |       410 |     151 | 36.83 |
    | Ruby 2.0.0-p247         |       336 |     132 | 39.29 |

I believe we can do even more shrink-wrappings and thus make the above
percentages smaller if attempt to fix a similar problem with the
return value.  The parameters are however clearly the biggest obstacle
and need to be dealt with somehow.

The patch also adds quite some more computations, especially for
functions which are not rejected early on.  The last table below
summarizes how much extra work is done.  Total is the number of
functions on which split_live_ranges_for_shrink_wrap was called, IIUC
function ira, that excludes large functions.

      Number of functions for which:
        1 - all BBs were scanned for calls
        2 - the first BB was scanned for register moves
        3 - dominators were calculated
        4 - loops were calculated
        5 - post-dominators were calculated
        6 - which were modified

    | Source                  | Total |     1 |     % |     2 |     % |    3 |     % |    4 |     % |    5 |     % |    6 |    % |
    |-------------------------+-------+-------+-------+-------+-------+------+-------+------+-------+------+-------+------+------|
    | SPEC 2006 Int           | 30816 | 11888 | 38.58 |  9902 | 32.13 | 7889 | 25.60 | 2911 |  9.45 | 2692 |  8.74 | 2004 | 6.50 |
    | SPEC 2006 FP (*)        | 17957 |  9239 | 51.45 |  7016 | 39.07 | 6507 | 36.24 | 3640 | 20.27 | 3496 | 19.47 | 1491 | 8.30 |
    | C/C++ bootstrap stage 2 | 48163 | 19756 | 41.02 | 16211 | 33.66 | 9598 | 19.93 | 4535 |  9.42 | 4174 |  8.67 | 2864 | 5.95 |
    | Python 2.7.5            |  5472 |  2911 | 53.20 |  2437 | 44.54 | 1987 | 36.31 |  632 | 11.55 |  587 | 10.73 |  351 | 6.41 |
    | Python 3.3.2            |  6156 |  3246 | 52.73 |  2656 | 43.14 | 2165 | 35.17 |  669 | 10.87 |  629 | 10.22 |  410 | 6.66 |
    | Ruby 2.0.0-p247         |  7866 |  3834 | 48.74 |  3333 | 42.37 | 2588 | 32.90 |  797 | 10.13 |  731 |  9.29 |  336 | 4.27 |


The patch itself is below.  It passes bootstrap and testing on
x86_64-linux.  Because it is basically my first RTL-optimization
patch, I expect loads of comments and requests for changes.
Nevertheless, I think it would be nice to have it (or some later
version) in the upcoming release.

Thanks,

Martin


2013-10-18  Martin Jambor  <mjambor@suse.cz>

	PR rtl-optimization/10474
	* ira.c (interesting_dest_for_shprep): New function.
	(split_live_ranges_for_shrink_wrap): Likewise.
	(ira): Call split_live_ranges_for_shrink_wrap.

testsuite/
	* gcc.dg/pr10474.c: New testcase.
	* gcc.dg/ira-shrinkwrap-prep-1.c: Likewise.
	* gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.

diff --git a/gcc/ira.c b/gcc/ira.c
index 203fbff..fe208e2 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -4314,6 +4314,197 @@ find_moveable_pseudos (void)
   free_dominance_info (CDI_DOMINATORS);
 }
 
+
+/* If insn is interesting for parameter range-splitting shring-wrapping
+   preparation, i.e. it is a single set from a hard register to a pseudo, which
+   is live at CALL_DOM, return the destination.  Otherwise return NULL.  */
+
+static rtx
+interesting_dest_for_shprep (rtx insn, basic_block call_dom)
+{
+  rtx set = single_set (insn);
+  if (!set)
+    return NULL;
+  rtx src = SET_SRC (set);
+  rtx dest = SET_DEST (set);
+  if (!REG_P (src) || !HARD_REGISTER_P (src)
+      || !REG_P (dest) || HARD_REGISTER_P (dest)
+      || (call_dom && !bitmap_bit_p (df_get_live_in (call_dom), REGNO (dest))))
+    return NULL;
+  return dest;
+}
+
+/* Split live ranges of pseudos that are loaded from hard registers in the
+   first BB in a BB that dominates all non-sibling call if such a BB can be
+   found and is not in a loop.  */
+
+static void
+split_live_ranges_for_shrink_wrap (void)
+{
+  basic_block bb, call_dom = NULL;
+  basic_block first = single_succ (ENTRY_BLOCK_PTR);
+  rtx insn, last_interesting_insn = NULL;
+  bitmap_head need_new, reachable;
+  vec<basic_block> queue;
+
+  if (!flag_shrink_wrap)
+    return;
+
+  bitmap_initialize (&need_new, 0);
+  bitmap_initialize (&reachable, 0);
+  queue.create (n_basic_blocks);
+
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS (bb, insn)
+      if (CALL_P (insn) && !SIBLING_CALL_P (insn))
+	{
+	  if (bb == first)
+	    {
+	      bitmap_clear (&need_new);
+	      bitmap_clear (&reachable);
+	      queue.release ();
+	      return;
+	    }
+
+	  bitmap_set_bit (&need_new, bb->index);
+	  bitmap_set_bit (&reachable, bb->index);
+	  queue.quick_push (bb);
+	  break;
+	}
+
+  if (queue.is_empty ())
+    {
+      bitmap_clear (&need_new);
+      bitmap_clear (&reachable);
+      queue.release ();
+      return;
+    }
+
+  while (!queue.is_empty ())
+    {
+      edge e;
+      edge_iterator ei;
+
+      bb = queue.pop ();
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	if (e->dest != EXIT_BLOCK_PTR
+	    && bitmap_set_bit (&reachable, e->dest->index))
+	  queue.quick_push (e->dest);
+    }
+  queue.release ();
+
+  FOR_BB_INSNS (first, insn)
+    {
+      rtx dest = interesting_dest_for_shprep (insn, NULL);
+      if (!dest)
+	continue;
+
+      if (DF_REG_DEF_COUNT (REGNO (dest)) > 1)
+	{
+	  bitmap_clear (&need_new);
+	  bitmap_clear (&reachable);
+	  return;
+	}
+
+      for (df_ref use = DF_REG_USE_CHAIN (REGNO(dest));
+	   use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  if (NONDEBUG_INSN_P (DF_REF_INSN (use))
+	      && GET_CODE (DF_REF_REG (use)) == SUBREG)
+	    {
+	      /* This is necessary to avoid hitting an assert at
+		 postreload.c:2294 in libstc++ testcases on x86_64-linux.  I'm
+		 not really sure what the probblem actually is there.  */
+	      bitmap_clear (&need_new);
+	      bitmap_clear (&reachable);
+	      return;
+	    }
+
+	  rtx uin = DF_REF_INSN (use);
+	  int ubbi = BLOCK_FOR_INSN (uin)->index;
+	  if (bitmap_bit_p (&reachable, ubbi))
+	    bitmap_set_bit (&need_new, ubbi);
+	}
+      last_interesting_insn = insn;
+    }
+
+  bitmap_clear (&reachable);
+  if (!last_interesting_insn)
+    {
+      bitmap_clear (&need_new);
+      return;
+    }
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  call_dom = nearest_common_dominator_for_set (CDI_DOMINATORS, &need_new);
+  bitmap_clear (&need_new);
+  if (call_dom == first)
+    goto out;
+
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  while (bb_loop_depth (call_dom) > 0)
+    call_dom = get_immediate_dominator (CDI_DOMINATORS, call_dom);
+  loop_optimizer_finalize ();
+
+  if (call_dom == first)
+      goto out;
+
+  calculate_dominance_info (CDI_POST_DOMINATORS);
+  if (dominated_by_p (CDI_POST_DOMINATORS, first, call_dom))
+    {
+      free_dominance_info (CDI_POST_DOMINATORS);
+      goto out;
+    }
+  free_dominance_info (CDI_POST_DOMINATORS);
+
+  if (dump_file)
+    fprintf (dump_file, "Will split live ranges of parameters at BB %i\n",
+	     call_dom->index);
+
+  FOR_BB_INSNS (first, insn)
+    {
+      rtx dest = interesting_dest_for_shprep (insn, call_dom);
+      if (!dest)
+	continue;
+
+      rtx newreg = NULL_RTX;
+      df_ref use, next;
+      for (use = DF_REG_USE_CHAIN (REGNO(dest)); use; use = next)
+	{
+	  rtx uin = DF_REF_INSN (use);
+	  next = DF_REF_NEXT_REG (use);
+
+	  basic_block ubb = BLOCK_FOR_INSN (uin);
+	  if (ubb == call_dom
+	      || dominated_by_p (CDI_DOMINATORS, ubb, call_dom))
+	    {
+	      if (!newreg)
+		newreg = ira_create_new_reg (dest);
+	      validate_change (uin, DF_REF_LOC (use), newreg, true);
+	    }
+	}
+
+      if (newreg)
+	{
+	  rtx new_move = gen_move_insn (newreg, dest);
+	  emit_insn_after (new_move, bb_note (call_dom));
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Split live-range of register ");
+	      print_rtl_single (dump_file, dest);
+	    }
+	}
+
+      if (insn == last_interesting_insn)
+	break;
+    }
+  apply_change_group();
+
+ out:
+  free_dominance_info (CDI_DOMINATORS);
+}
+
 /* Perform the second half of the transformation started in
    find_moveable_pseudos.  We look for instances where the newly introduced
    pseudo remains unallocated, and remove it by moving the definition to
@@ -4522,7 +4713,10 @@ ira (FILE *f)
      allocation because of -O0 usage or because the function is too
      big.  */
   if (ira_conflicts_p)
-    find_moveable_pseudos ();
+    {
+      find_moveable_pseudos ();
+      split_live_ranges_for_shrink_wrap ();
+    }
 
   max_regno_before_ira = max_reg_num ();
   ira_setup_eliminable_regset (true);
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
new file mode 100644
index 0000000..fe497c2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-rtl-ira -fdump-rtl-pro_and_epilogue"  } */
+
+int __attribute__((noinline, noclone))
+foo (int a)
+{
+  return a + 5;
+}
+
+static int g;
+
+int __attribute__((noinline, noclone))
+bar (int a)
+{
+  int r;
+
+  if (a)
+    {
+      r = foo (a);
+      g = r + a;
+    }
+  else
+    r = a+1;
+  return r;
+}
+
+/* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira"  } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira"  } } */
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue"  } } */
+/* { dg-final { cleanup-rtl-dump "ira" } } */
+/* { dg-final { cleanup-rtl-dump "pro_and_epilogue" } } */
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
new file mode 100644
index 0000000..872a757
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-rtl-ira -fdump-rtl-pro_and_epilogue"  } */
+
+int __attribute__((noinline, noclone))
+foo (int a)
+{
+  return a + 5;
+}
+
+static int g;
+
+int __attribute__((noinline, noclone))
+bar (int a)
+{
+  int r;
+
+  if (a)
+    {
+      r = a;
+      while (r < 500)
+	if (r % 2)
+	  r = foo (r);
+	else
+	  r = foo (r+1);
+      g = r + a;
+    }
+  else
+    r = g+1;
+  return r;
+}
+
+/* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira"  } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira"  } } */
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue"  } } */
+/* { dg-final { cleanup-rtl-dump "ira" } } */
+/* { dg-final { cleanup-rtl-dump "pro_and_epilogue" } } */
diff --git a/gcc/testsuite/gcc.dg/pr10474.c b/gcc/testsuite/gcc.dg/pr10474.c
new file mode 100644
index 0000000..ee085c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr10474.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-rtl-pro_and_epilogue"  } */
+
+void f(int *i)
+{
+	if (!i)
+		return;
+	else
+	{
+		__builtin_printf("Hi");
+		*i=0;
+	}
+}
+
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue"  } } */
+/* { dg-final { cleanup-rtl-dump "pro_and_epilogue" } } */


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]