This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Improve unrolled size estimates


On Mon, 11 May 2009, Jan Hubicka wrote:

> > 
> > That is, I still dislike that you only handle constant index loads
> > from strings.  How do the benchmark numbers look if you
> > remove that?
> 
> I handle both STRING_CST and DECL with CONSTANT flag and DECL_INITIAL.

Oh, indeed you do.  Ok, change

+      if ((DECL_P (base) && TREE_CONSTANT (base) && DECL_INITIAL (base))
+         || TREE_CODE (base) == STRING_CST)

to what get_symbol_constant_value checks, namely

+      if ((DECL_P (base)
            && TREE_READONLY (base)
            && DECL_INITIAL (base)
            && targetm.binds_local_p (base))
+         || CONSTANT_CLASS_P (base))

and in the loop

+         while (handled_component_p (base)
+                || (TREE_CODE (base) == ARRAY_REF
+                    && constant_after_peeling (TREE_OPERAND (base, 1), stmt, loop)))
+           base = TREE_OPERAND (base, 0);
+         return (DECL_P (base) || TREE_CODE (base) == STRING_CST);

instead bail out for non-constant indices like

+         while (handled_component_p (base))
            if (TREE_CODE (base) == ARRAY_REF
                && !constant_after_peeling (TREE_OPERAND (base, 1), ...)
              return false;
          return true;

that removes the fancy duplicate return condition.
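
Putting the two suggestions together, the memory-reference handling in
constant_after_peeling could end up looking roughly like the following
(untested sketch; op, stmt and loop are assumed to be the predicate's
parameters as in your patch, and the exact arrangement of the checks is
just one possibility):

          tree base = op;

          /* Take a quick look whether a constant object is accessed at all.  */
          while (handled_component_p (base))
            base = TREE_OPERAND (base, 0);
          if ((DECL_P (base)
               && TREE_READONLY (base)
               && DECL_INITIAL (base)
               && targetm.binds_local_p (base))
              || CONSTANT_CLASS_P (base))
            {
              /* If so, verify that every array index is constant after
                 peeling; bail out as soon as one is not.  */
              base = op;
              while (handled_component_p (base))
                {
                  if (TREE_CODE (base) == ARRAY_REF
                      && !constant_after_peeling (TREE_OPERAND (base, 1),
                                                  stmt, loop))
                    return false;
                  base = TREE_OPERAND (base, 0);
                }
              return true;
            }
          return false;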

Further spelling fixes are noted on the new version below; apply them to the
original.

> > > I am afraid that I don't have time right now to do all that, there are
> > > still too many other things on my TODO list.
> > > 
> > > I've dropped the patch into pretty-ipa.  There are several improvements:
> > >   SPECINT  EON 2130->2465 (base only).  GEOM is up 1433->1454
> > >   EON has a loop zeroing a small vector in the inner loop.  I guess it
> > >   now gets fully unrolled.  Size seems the same.
> > >   SPECFP base Wupwise 1485->1491, swim 1529->1540, applu 1250->1275,
> > >   equake 1680->1775, facerec 1991->2030, 
> > >   (equake is important)
> > >   SPECFP peak wupwise 1395->1417, applu 1326->1355, mesa 1472->1500
> > >   art 2219->2248, 
> > > 
> > > There is also a code size regression on botan
> > > http://gcc.opensuse.org/c++bench-frescobaldi-ipa-64/botan/
> > > and raytracer
> > > http://gcc.opensuse.org/c++bench-frescobaldi-ipa-64/raytracer/
> > > 
> > > So I guess it is worthwhile to play with it.  I would however be happy
> > > if someone beats me to it ;)))
> > 
> > Dissecting the patch would be nice.  The formula adjustments look
> > pretty obvious, as do the last-iteration fixes...
> 
> OK, this lets us move ahead ;)
> I've regtested this patch and am doing full testing now on x86_64-linux.
> This patch however adds two new failures:
> FAIL: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce2 "c.array"
> FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized "my_array"
> 
> In the first case there is a loop computing the sum of a constant array
> of 3 elements that is supposed to be optimized into a constant; in the
> other there are two loops iterating 4 times, reading one array and
> computing another, which are no longer guessed to decrease in size by
> the cunrolli pass and thus wait for the cunroll pass, which unrolls
> them.  The testcase checks that the temporary array computed by the
> first loop and consumed by the second is optimized away.
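
(Just to illustrate the first case for other readers -- this is not the
actual testcase, merely the kind of loop the description above refers to:
a sum over a small constant array that should fold to a constant once the
loop is fully unrolled and the constant loads are propagated:)

  static const int c_array[3] = { 1, 2, 3 };

  int
  sum (void)
  {
    int i, s = 0;
    for (i = 0; i < 3; i++)
      s += c_array[i];  /* Should simplify to "return 6" after unrolling
                           plus constant propagation.  */
    return s;
  }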
> 
> Both show a real missed optimization.  In gcc.dg/tree-ssa/loop-37.c the
> problem is that we don't do CCP after unrolling, so we don't fold the
> constant access to the constant array.  In gcc.dg/tree-ssa/loop-36.c we
> end up with a TARGET_MEM_EXPR doing the same access as a MEM_EXPR, which
> prevents us from realizing that those can be promoted to registers even
> if an extra FRE is scheduled after loop unrolling.
> 
> Both testcases will fail on mainline if the number of iterations is
> increased by 1, and will pass with the patch if the number of iterations
> is decreased to 2, so I can just update the testsuite accordingly, xfail
> them, or we need to implement some hack here.
> 
> Honza
> 
> 	* tree-ssa-loop-ivcanon.c (tree_num_loop_insns): Fix off-by-one error.
> 	(struct loop_size): New structure.
> 	(constant_after_peeling): New predicate.
> 	(tree_estimate_loop_size): New function.
> 	(estimated_unrolled_size): Rewrite for new estimates.
> 	(try_unroll_loop_completely): Use new estimates.
> 
> 	* testsuite/gcc.dg/tree-ssa/pr21829.c: Simplify matching since
> 	we now optimize better.
> Index: tree-ssa-loop-ivcanon.c
> ===================================================================
> --- tree-ssa-loop-ivcanon.c	(revision 147349)
> +++ tree-ssa-loop-ivcanon.c	(working copy)
> @@ -118,7 +118,7 @@ tree_num_loop_insns (struct loop *loop, 
>  {
>    basic_block *body = get_loop_body (loop);
>    gimple_stmt_iterator gsi;
> -  unsigned size = 1, i;
> +  unsigned size = 0, i;
>  
>    for (i = 0; i < loop->num_nodes; i++)
>      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> @@ -128,28 +128,115 @@ tree_num_loop_insns (struct loop *loop, 
>    return size;
>  }
>  
> -/* Estimate number of insns of completely unrolled loop.  We assume
> -   that the size of the unrolled loop is decreased in the
> -   following way (the numbers of insns are based on what
> -   estimate_num_insns returns for appropriate statements):
> -
> -   1) exit condition gets removed (2 insns)
> -   2) increment of the control variable gets removed (2 insns)
> -   3) All remaining statements are likely to get simplified
> -      due to constant propagation.  Hard to estimate; just
> -      as a heuristics we decrease the rest by 1/3.
> +/* Describe size of loop as detected by tree_estimate_loop_size.  */
> +struct loop_size
> +{
> +  /* Number of instructions in the loop.  */
> +  int overall;

vertical space

> +  /* Number of instructions that will be likely optimized out in
> +     peeled iterations of loop  (i.e. computation based on induction
> +     variable where induction variable starts at known constant.)  */
> +  int eliminated_by_peeling;

likewise

> +  /* Same statistics for last iteration of loop: it is smaller because
> +     instructions after exit are not executed.  */
> +  int last_iteration;
> +  int last_iteration_eliminated_by_peeling;
> +};
> +
> +/* Computes an estimated number of insns in LOOP, weighted by WEIGHTS.
> +   Return results in SIZE, estimate benefits for complette unrolling existing by EXIT.  */

complete

exiting

> +
> +static void
> +tree_estimate_loop_size (struct loop *loop, edge exit, struct loop_size *size)
> +{
> +  basic_block *body = get_loop_body (loop);
> +  gimple_stmt_iterator gsi;
> +  unsigned int i;
> +  bool after_exit;
> +
> +  size->overall = 0;
> +  size->eliminated_by_peeling = 0;
> +  size->last_iteration = 0;
> +  size->last_iteration_eliminated_by_peeling = 0;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "Estimating sizes for loop %i\n", loop->num);
> +  for (i = 0; i < loop->num_nodes; i++)
> +    {
> +      if (exit && body[i] != exit->src
> +	  && dominated_by_p (CDI_DOMINATORS, body[i], exit->src))
> +	after_exit = true;
> +      else
> +	after_exit = false;
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, " BB: %i, after_exit: %i\n", body[i]->index, after_exit);
> +
> +      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> +	{
> +	  gimple stmt = gsi_stmt (gsi);
> +	  int num = estimate_num_insns (stmt, &eni_size_weights);
> +	  bool likely_eliminated = false;
> +
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    {
> +	      fprintf (dump_file, "  size: %3i ", num);
> +	      print_gimple_stmt (dump_file, gsi_stmt (gsi), 0, 0);
> +	    }
> +
> +	  /* Look for reasons why we might optimize this stmt away. */
> +
> +	  /* Exit conditional.  */
> +	  if (body[i] == exit->src && stmt == last_stmt (exit->src))
> +	    {
> +	      if (dump_file && (dump_flags & TDF_DETAILS))
> +	        fprintf (dump_file, "   Exit condition will be eliminated.\n");
> +	      likely_eliminated = true;
> +	    }
> +
> +	  size->overall += num;
> +	  if (likely_eliminated)
> +	    size->eliminated_by_peeling += num;
> +	  if (!after_exit)
> +	    {
> +	      size->last_iteration += num;
> +	      if (likely_eliminated)
> +		size->last_iteration_eliminated_by_peeling += num;
> +	    }
> +	}
> +    }
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "size: %i-%i, last_iteration: %i-%i\n", size->overall,
> +    	     size->eliminated_by_peeling, size->last_iteration,
> +	     size->last_iteration_eliminated_by_peeling);
> +  
> +  free (body);
> +}
>  
> -   NINSNS is the number of insns in the loop before unrolling.
> -   NUNROLL is the number of times the loop is unrolled.  */
> +/* Estimate number of insns of completely unrolled loop.
> +   It is (NUNROLL + 1) * size of loop body with taking into account
> +   the fact that in last copy everything after exit conditoinal

conditional

> +   is dead and that some instructions will be eliminated after
> +   peeling.
> +
> +   Loop body is likely going to simplify futher, this is dificult

difficult

> +   to guess, we just decrease the result by 1/3.  */
>  
>  static unsigned HOST_WIDE_INT
> -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +estimated_unrolled_size (struct loop_size *size,
>  			 unsigned HOST_WIDE_INT nunroll)
>  {
> -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  HOST_WIDE_INT unr_insns = ((nunroll)
> +  			     * (HOST_WIDE_INT) (size->overall
> +			     			- size->eliminated_by_peeling));
> +  /* Decrease cost of induction variable increment.  */
> +  unr_insns --;
> +  if (!nunroll)
> +    unr_insns = 0;
> +  unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
> +
> +  unr_insns = unr_insns * 2 / 3;
>    if (unr_insns <= 0)
>      unr_insns = 1;
> -  unr_insns *= (nunroll + 1);
>  
>    return unr_insns;
>  }
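
Just as a sanity check on the new formula: assuming (made-up numbers)
size->overall == 12, size->eliminated_by_peeling == 4,
size->last_iteration == 10, size->last_iteration_eliminated_by_peeling == 4
and nunroll == 3, this gives 3 * (12 - 4) - 1 = 23 insns for the peeled
copies, plus 10 - 4 = 6 for the last iteration, i.e. 29 in total, which the
2/3 factor then reduces to an estimate of 19 insns.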
> @@ -165,6 +252,7 @@ try_unroll_loop_completely (struct loop 
>  {
>    unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
>    gimple cond;
> +  struct loop_size size;
>  
>    if (loop->inner)
>      return false;
> @@ -182,9 +270,10 @@ try_unroll_loop_completely (struct loop 
>        if (ul == UL_SINGLE_ITER)
>  	return false;
>  
> -      ninsns = tree_num_loop_insns (loop, &eni_size_weights);
> +      tree_estimate_loop_size (loop, exit, &size);
> +      ninsns = size.overall;
>  
> -      unr_insns = estimated_unrolled_size (ninsns, n_unroll);
> +      unr_insns = estimated_unrolled_size (&size, n_unroll);
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	{
>  	  fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
> Index: testsuite/gcc.dg/tree-ssa/pr21829.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/pr21829.c	(revision 147349)
> +++ testsuite/gcc.dg/tree-ssa/pr21829.c	(working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-optimized -fdump-tree-cddce2" } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
>  
>  int test(int v)
>  {
> @@ -16,33 +16,7 @@ int test(int v)
>    return x;
>  }
>  
> -/* This should be optimized to
> +/* This should be unrolled and optimized into conditional set of return value "v < 0".  */
>  
> -    if (v <= 0) goto <L1>; else goto <L3>;
> -
> -   <L1>:;
> -
> -    # x_1 = PHI <0(3), 1(1)>;
> -   <L3>:;
> -    return x_1;
> -
> -   retaining only a single conditional.  This doesn't work as nobody
> -   combines the two tests
> -
> -    if (v < 0) goto <bb 4>; else goto <bb 3>;
> -
> -   <bb 3>:
> -
> -    if (v <= 0) goto <bb 4>; else goto <bb 5>;
> -
> -   this late in the game.  tree-ssa-ifcombine.c would do it if we would
> -   unroll the loop during early loop unrolling though.
> -
> -   For now vrp2 does all the needed folding and threading and cddce2
> -   provides a nice IL to scan.  */
> -
> -/* { dg-final { scan-tree-dump-times "if " 1 "optimized" { xfail *-*-* } } } */
> -/* { dg-final { scan-tree-dump-times "if " 2 "cddce2" } } */
> -/* { dg-final { scan-tree-dump "x_. = PHI <0\\\(.\\\), 1\\\(.\\\)>" "cddce2" } } */
> -/* { dg-final { cleanup-tree-dump "cddce2" } } */
> +/* { dg-final { scan-tree-dump-not "if \\(" "optimized" } } */
>  /* { dg-final { cleanup-tree-dump "optimized" } } */


Thus, the original version with the above changes is ok if it still
bootstraps & tests.

Thanks for your patience,
Richard.

