This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[patch] Partial fix for PR 18048


Hello,

this patch contains several improvements to ivopts and related
optimizations that improve the code produced for the mgrid benchmark
quite a bit (although even with the patch, ivopts still spoils the code).

The improvements are:

1) Ivopts produces expressions like

   *(&a[start] + 8 * index)

   and assumes that the memory access created for them will include the
   multiplication by 8.  This fails if there are several similar
   accesses, since dom will CSE the "8 * index" part (thus adding an
   unnecessary multiplication instruction and increasing register
   pressure).

   The part of the patch in fold-const.c ensures that such an
   expression is folded to a[start + index], thus achieving what ivopts
   intended.  This solution unfortunately is not generic enough to catch
   all cases where something like this may happen, and it does not work
   on LP64 platforms, where this optimization might be incorrect, but it
   seems to be the only possibility within the current framework.  I
   will start a discussion about a more appropriate solution for 4.1 in
   a separate thread.

2) The algorithm used for finding the best set of bivs for the loop is
   quite primitive (and cannot be much better due to compile-time
   constraints), and as such it is likely to get stuck in local
   minima.  This happens in mgrid, causing ivopts to create about
   30 bivs...

   The patch to try_add_cand_for makes the initial set of candidates
   much smaller (based primarily on the important candidates).
   The idea is that getting stuck in a local minimum with too many bivs
   is a disaster, but if it happens with a few bivs, it just means that
   the iv uses cannot be expressed much more cheaply than they already
   are, so we are not that far from the optimum anyway.

3) A small improvement in number_of_iterations_cond.
   build_int_cst (type, -1) is not suitable for producing an unsigned
   constant with all bits one, since the bits outside of the precision
   of the type are set to 1 as well, which causes fold to fail to
   perform some of the optimizations with such constants.

Bootstrapped & regtested on ppc, i686 and x86_64.  Benchmark results
from i686 (-O2, base without the patch, peak with) (quick summary --
minor variations except for the huge gain on mgrid, on average the
results improve even if mgrid is not taken into account):

   Benchmarks    Ref Time  Run Time   Ratio    Ref Time  Run Time   Ratio
   ------------  --------  --------  --------  --------  --------  --------
   164.gzip          1400   209       670    *     1400   211       663    *
   175.vpr           1400   390       359    *     1400   395       354    *
   176.gcc           1100        --          X     1100        --          X
   181.mcf           1800   733       245    *     1800   728       247    *
   186.crafty        1000   124       804    *     1000   123       814    *
   197.parser        1800   401       449    *     1800   400       450    *
   252.eon           1300        --          X     1300        --          X
   253.perlbmk       1800   208       865    *     1800   205       879    *
   254.gap           1100   179       616    *     1100   176       627    *
   255.vortex        1900   231       822    *     1900   236       805    *
   256.bzip2         1500   351       427    *     1500   351       427    *
   300.twolf         3000   799       376    *     3000   800       375    *

   168.wupwise       1600   295       543    *     1600   286       559    *
   171.swim          3100   720       430    *     3100   725       428    *
   172.mgrid         1800   574       314    *     1800   504       357    *
   173.applu         2100   528       398    *     2100   520       404    *
   177.mesa          1400   207       675    *     1400   207       675    *
   178.galgel        2900        --          X     2900        --          X
   179.art           2600  1458       178    *     2600  1463       178    *
   183.equake        1300   294       442    *     1300   294       442    *
   187.facerec       1900   473       401    *     1900   467       406    *
   188.ammp          2200   615       357    *     2200   609       361    *
   189.lucas         2000   436       458    *     2000   437       458    *
   191.fma3d         2100   379       554    *     2100   382       549    *
   200.sixtrack      1100   270       408    *     1100   265       415    *
   301.apsi          2600   671       387    *     2600   665       391    *

Zdenek

	PR tree-optimization/18048
	* fold-const.c (try_move_mult_to_index): New function.
	(fold): Use try_move_mult_to_index.
	* tree-ssa-loop-ivopts.c (try_add_cand_for): Prefer common candidates.
	* tree-ssa-loop-niter.c (number_of_iterations_cond): Produce
	an all-ones unsigned constant without extra bits.

Index: fold-const.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/fold-const.c,v
retrieving revision 1.468
diff -c -3 -p -r1.468 fold-const.c
*** fold-const.c	11 Oct 2004 03:42:00 -0000	1.468
--- fold-const.c	23 Oct 2004 15:13:53 -0000
*************** tree_swap_operands_p (tree arg0, tree ar
*** 5967,5972 ****
--- 5967,6050 ----
    return 0;
  }
  
+ /* Tries to replace &a[idx] CODE s * delta with &a[idx CODE delta], if s is
+    the step of the array.  TYPE is the type of the expression.  ADDR is the
+    MULT is the multiplicative expression.  If the function succeeds, the new
+    address expression is returned.  Otherwise NULL_TREE is returned.  */
+ 
+ static tree
+ try_move_mult_to_index (tree type, enum tree_code code, tree addr, tree mult)
+ {
+   tree s, delta, step;
+   tree arg0 = TREE_OPERAND (mult, 0), arg1 = TREE_OPERAND (mult, 1);
+   tree ref = TREE_OPERAND (addr, 0), pref;
+   tree ret, pos;
+   tree itype;
+ 
+   STRIP_NOPS (arg0);
+   STRIP_NOPS (arg1);
+   
+   if (TREE_CODE (arg0) == INTEGER_CST)
+     {
+       s = arg0;
+       delta = arg1;
+     }
+   else if (TREE_CODE (arg1) == INTEGER_CST)
+     {
+       s = arg1;
+       delta = arg0;
+     }
+   else
+     return NULL_TREE;
+ 
+   for (;; ref = TREE_OPERAND (ref, 0))
+     {
+       if (TREE_CODE (ref) == ARRAY_REF)
+ 	{
+ 	  step = array_ref_element_size (ref);
+ 
+ 	  if (TREE_CODE (step) != INTEGER_CST)
+ 	    continue;
+ 
+ 	  itype = TREE_TYPE (step);
+ 
+ 	  /* If the type sizes do not match, we might run into problems
+ 	     when one of them would overflow.  */
+ 	  if (TYPE_PRECISION (itype) != TYPE_PRECISION (type))
+ 	    continue;
+ 
+ 	  if (!operand_equal_p (step, fold_convert (itype, s), 0))
+ 	    continue;
+ 
+ 	  delta = fold_convert (itype, delta);
+ 	  break;
+ 	}
+ 
+       if (!handled_component_p (ref))
+ 	return NULL_TREE;
+     }
+ 
+   /* We found the suitable array reference.  So copy everything up to it,
+      and replace the index.  */
+ 
+   pref = TREE_OPERAND (addr, 0);
+   ret = copy_node (pref);
+   pos = ret;
+ 
+   while (pref != ref)
+     {
+       pref = TREE_OPERAND (pref, 0);
+       TREE_OPERAND (pos, 0) = copy_node (pref);
+       pos = TREE_OPERAND (pos, 0);
+     }
+ 
+   TREE_OPERAND (pos, 1) = fold (build2 (code, itype,
+ 					TREE_OPERAND (pos, 1),
+ 					delta));
+ 
+   return build1 (ADDR_EXPR, type, ret);
+ }
+ 
  /* Perform constant folding and related simplification of EXPR.
     The related simplifications include x*1 => x, x*0 => 0, etc.,
     and application of the associative law.
*************** fold (tree expr)
*** 6602,6607 ****
--- 6680,6703 ----
  						   alt0, alt1)),
  				     same));
  	    }
+ 
+ 	  /* Try replacing &a[i1] + c * i2 with &a[i1 + i2], if c is the
+ 	     step of the array.  The loop optimizer sometimes produces this
+ 	     type of expression.  */
+ 	  if (TREE_CODE (arg0) == ADDR_EXPR
+ 	      && TREE_CODE (arg1) == MULT_EXPR)
+ 	    {
+ 	      tem = try_move_mult_to_index (type, PLUS_EXPR, arg0, arg1);
+ 	      if (tem)
+ 		return fold (tem);
+ 	    }
+ 	  else if (TREE_CODE (arg1) == ADDR_EXPR
+ 		   && TREE_CODE (arg0) == MULT_EXPR)
+ 	    {
+ 	      tem = try_move_mult_to_index (type, PLUS_EXPR, arg1, arg0);
+ 	      if (tem)
+ 		return fold (tem);
+ 	    }
  	}
        else
  	{
*************** fold (tree expr)
*** 6974,6979 ****
--- 7070,7086 ----
  				     &diff))
  	  return build_int_cst_type (type, diff);
        }
+ 	  
+       /* Try replacing &a[i1] - c * i2 with &a[i1 - i2], if c is the
+ 	 step of the array.  The loop optimizer sometimes produces this
+ 	 type of expression.  */
+       if (TREE_CODE (arg0) == ADDR_EXPR
+ 	  && TREE_CODE (arg1) == MULT_EXPR)
+ 	{
+ 	  tem = try_move_mult_to_index (type, MINUS_EXPR, arg0, arg1);
+ 	  if (tem)
+ 	    return fold (tem);
+ 	}
  
        if (TREE_CODE (arg0) == MULT_EXPR
  	  && TREE_CODE (arg1) == MULT_EXPR
Index: tree-ssa-loop-ivopts.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-ivopts.c,v
retrieving revision 2.20
diff -c -3 -p -r2.20 tree-ssa-loop-ivopts.c
*** tree-ssa-loop-ivopts.c	18 Oct 2004 18:01:09 -0000	2.20
--- tree-ssa-loop-ivopts.c	23 Oct 2004 15:13:53 -0000
*************** try_add_cand_for (struct ivopts_data *da
*** 3603,3622 ****
    bitmap act_inv = BITMAP_XMALLOC ();
    unsigned i;
    struct cost_pair *cp;
  
    bitmap_copy (best_ivs, ivs);
    bitmap_copy (best_inv, inv);
  
!   for (i = 0; i < use->n_map_members; i++)
      {
!       cp = use->cost_map + i;
!       if (cp->cost == INFTY)
  	continue;
  
        bitmap_copy (act_ivs, ivs);
!       bitmap_set_bit (act_ivs, cp->cand->id);
!       if (cp->depends_on)
! 	bitmap_a_or_b (act_inv, inv, cp->depends_on);
        else
  	bitmap_copy (act_inv, inv);
        act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
--- 3603,3634 ----
    bitmap act_inv = BITMAP_XMALLOC ();
    unsigned i;
    struct cost_pair *cp;
+   bitmap_iterator bi;
+   struct iv_cand *cand;
+   bitmap depends_on;
  
    bitmap_copy (best_ivs, ivs);
    bitmap_copy (best_inv, inv);
  
!   /* First try important candidates.  Only if it fails, try the specific ones.
!      Rationale -- in loops with many variables the best choice often is to use
!      just one generic biv.  If we added here many ivs specific to the uses,
!      the optimization algorithm later would be likely to get stuck in a local
!      minimum, thus causing us to create too many ivs.  The approach from
!      few ivs to more seems more likely to be successful -- starting from few
!      ivs, replacing an expensive use by a specific iv should always be a
!      win.  */
!   EXECUTE_IF_SET_IN_BITMAP (data->important_candidates, 0, i, bi)
      {
!       cand = iv_cand (data, i);
! 
!       if (get_use_iv_cost (data, use, cand, &depends_on) == INFTY)
  	continue;
  
        bitmap_copy (act_ivs, ivs);
!       bitmap_set_bit (act_ivs, cand->id);
!       if (depends_on)
! 	bitmap_a_or_b (act_inv, inv, depends_on);
        else
  	bitmap_copy (act_inv, inv);
        act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
*************** try_add_cand_for (struct ivopts_data *da
*** 3629,3634 ****
--- 3641,3675 ----
  	}
      }
  
+   if (best_cost == INFTY)
+     {
+       for (i = 0; i < use->n_map_members; i++)
+ 	{
+ 	  cp = use->cost_map + i;
+ 	  if (cp->cost == INFTY)
+ 	    continue;
+ 
+ 	  /* Already tried this.  */
+ 	  if (cp->cand->important)
+ 	    continue;
+ 
+ 	  bitmap_copy (act_ivs, ivs);
+ 	  bitmap_set_bit (act_ivs, cp->cand->id);
+ 	  if (cp->depends_on)
+ 	    bitmap_a_or_b (act_inv, inv, cp->depends_on);
+ 	  else
+ 	    bitmap_copy (act_inv, inv);
+ 	  act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
+ 
+ 	  if (act_cost < best_cost)
+ 	    {
+ 	      best_cost = act_cost;
+ 	      bitmap_copy (best_ivs, act_ivs);
+ 	      bitmap_copy (best_inv, act_inv);
+ 	    }
+ 	}
+     }
+ 
    bitmap_copy (ivs, best_ivs);
    bitmap_copy (inv, best_inv);
  
Index: tree-ssa-loop-niter.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-niter.c,v
retrieving revision 2.13
diff -c -3 -p -r2.13 tree-ssa-loop-niter.c
*** tree-ssa-loop-niter.c	22 Oct 2004 17:05:08 -0000	2.13
--- tree-ssa-loop-niter.c	23 Oct 2004 15:13:53 -0000
*************** number_of_iterations_cond (tree type, tr
*** 419,427 ****
        d = EXEC_BINARY (LSHIFT_EXPR, niter_type,
  		       build_int_cst_type (niter_type, 1), bits);
        s = EXEC_BINARY (RSHIFT_EXPR, niter_type, step0, bits);
!       bound = EXEC_BINARY (RSHIFT_EXPR, niter_type,
! 			   build_int_cst (niter_type, -1),
! 			   bits);
  
        assumption = fold (build2 (FLOOR_MOD_EXPR, niter_type, base1, d));
        assumption = fold (build2 (EQ_EXPR, boolean_type_node,
--- 419,427 ----
        d = EXEC_BINARY (LSHIFT_EXPR, niter_type,
  		       build_int_cst_type (niter_type, 1), bits);
        s = EXEC_BINARY (RSHIFT_EXPR, niter_type, step0, bits);
!       bound = EXEC_UNARY (BIT_NOT_EXPR, niter_type,
! 			  build_int_cst_type (niter_type, 0));
!       bound = EXEC_BINARY (RSHIFT_EXPR, niter_type, bound, bits);
  
        assumption = fold (build2 (FLOOR_MOD_EXPR, niter_type, base1, d));
        assumption = fold (build2 (EQ_EXPR, boolean_type_node,

