Fixes to estimate_num_insns from pretty-ipa branch

Jan Hubicka hubicka@ucw.cz
Thu Apr 9 14:18:00 GMT 2009


Hi,
this patch brings several fixes to estimate_num_insns that accumulated on the pretty-ipa branch.
In particular
  1) Loads and stores all have a non-zero cost, the same as we already account
     for non-temporary copies.
  2) FILTER_EXPR/EXC_PTR_EXPR loads/stores are free.  They will usually get
     optimized out, and they really are the same as other reg-reg copies,
     which are accounted as zero cost too.
  3) When making the original cost model I simply listed the expensive nodes,
     including DIV_EXPR.  This is very unrealistic for expressions like a/8 or
     a/12 that are common in my dumps.  Now I basically just mark division by
     a variable as expensive; divisions by constants are cheap.
     (This is not quite true for particularly ugly constants, but those are
     rare and we don't have an easy way to ask the division expansion code.)
  4) Switch expressions were accounted as taking time proportional to the
     number of cases.  This is not realistic since they are in fact quite cheap.
  5) BUILTIN_EXPECT had a non-zero cost; it is now free.
  6) estimated_unrolled_size had an off-by-one bug: when a loop iterates 4
     times, it estimated the unrolled body as containing 5 copies.  (A small
     standalone sketch of this follows the list.)
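
A minimal standalone sketch of item 6, assuming the same 2/3 simplification
heuristic as tree-ssa-loop-ivcanon.c (plain C for illustration only, not the
GCC sources; the real change is in the patch below):

#include <stdio.h>

/* A loop iterating NUNROLL times turns into NUNROLL copies of the body
   after complete unrolling, not NUNROLL + 1.  */
static long
estimated_unrolled_size_sketch (long ninsns, unsigned nunroll)
{
  /* The exit test and IV increment disappear (4 insns); assume the rest
     simplifies by 1/3 due to constant propagation.  */
  long unr_insns = 2 * (ninsns - 4) / 3;
  if (unr_insns <= 0)
    unr_insns = 1;
  return unr_insns * nunroll;	/* previously: * (nunroll + 1) */
}

int
main (void)
{
  /* A 10-insn body unrolled 4 times is now estimated at 16 insns, not 20.  */
  printf ("%ld\n", estimated_unrolled_size_sketch (10, 4));
  return 0;
}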

All these fixes brought overall improvements on pretty-ipa.  I am re-testing them on mainline on haydn
now and will report once the results are done.  I am however sending this early since the patch causes
several testsuite regressions that I am not quite sure how to fix properly:

./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized "my_array"

The problem here is that, with the load accounted for, this loop:

static const int my_array [3] = { 4, 5, 6 };

  for (j = 0; j < 3; j ++)
    sum += my_array [j];

is no longer unrolled when optimizing for code size.  This is because loads now
have a cost, so they account for more than the expected savings from the
conditional and the IV.  This is a more or less sane estimate, and moreover the
loop does get fully unrolled later, but since we don't have a ccp pass after the
"cunroll" pass, we never propagate the constant loads from the array into
constants.

I think this ought to be fixed by adding code that estimates which statements
will fold after complete unrolling, so we can compute the cost of the unrolled
loop better here.  I do similar tricks in the inliner and would suggest XFAILing
this testcase until we do so.  I can work on this incrementally.  (A rough
sketch of the idea follows below.)
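
For reference, a rough standalone sketch of that idea (toy data structures and
made-up names, not GCC code): pretend the IV is a compile-time constant and do
not count statements whose operands then all fold, since constant propagation
will delete them:

#include <stdbool.h>
#include <stdio.h>

struct stmt
{
  bool folds_when_iv_is_constant;	/* e.g. a load from a constant array */
  int cost;
};

/* Estimated size of the loop after complete unrolling, ignoring
   statements that constant propagation will remove.  */
static int
unrolled_cost_sketch (const struct stmt *body, int nstmts, int niter)
{
  int per_iter = 0;
  int i;
  for (i = 0; i < nstmts; i++)
    if (!body[i].folds_when_iv_is_constant)
      per_iter += body[i].cost;
  return per_iter * niter;
}

int
main (void)
{
  /* Toy model of the loop-37.c body: the my_array load and the IV update
     fold away; only the accumulation survives.  */
  struct stmt body[] = {
    { true, 1 },	/* tmp = my_array[j]   */
    { false, 1 },	/* sum += tmp          */
    { true, 1 },	/* j++ and exit test   */
  };
  printf ("%d\n", unrolled_cost_sketch (body, 3, 3));	/* prints 3 */
  return 0;
}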

./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21829.c scan-tree-dump-times cddce2 "if " 2
./gcc/testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21829.c scan-tree-dump cddce2 "x_. = PHI <0\(.\), 1\(.\)>"

Here we completely unroll the loop in cunrolli and simplify the whole body to:

;; Function test (test)

test (int v)
{
<bb 2>:
  return v <= 0;

}

instead of

;; Function test (test)

test (int v)
{
  int x;

<bb 2>:
  if (v < 0)
    goto <bb 5>;
  else
    goto <bb 3>;

<bb 3>:
  if (v <= 0)
    goto <bb 5>;
  else
    goto <bb 4>;

<bb 4>:
  x = 0;
  goto <bb 6>;

<bb 5>:
  x = 1;

<bb 6>:
  return x;

}

I am not sure whether the testcase can be fixed to still exercise the same simplification,
or whether I should just update the pattern to accept the version with no PHIs.

./gcc/testsuite/g++/g++.sum:FAIL: g++.dg/vect/pr36648.cc scan-tree-dump-times vect "vectorized 1 loops" 1
./gcc/testsuite/g++/g++.sum:FAIL: g++.dg/vect/pr36648.cc scan-tree-dump-times vect "vectorizing stmts using SLP" 1

This testcase depends on completely unrolling, in cunrolli, this loop:
<bb 3>:
  D.2177_3->x = 0.0;
  D.2177_3->y = 0.0;
  D.2177_3->z = 0.0;
  D.2177_4 = D.2177_3 + 12;
  D.2178_6 = D.2178_5 + -1;

<bb 4>:
  # D.2177_3 = PHI <&foo.array_of_vectors[0](2), D.2177_4(3)>
  # D.2178_5 = PHI <3(2), D.2178_6(3)>
  if (D.2178_5 != -1)
    goto <bb 3>;
  else
    goto <bb 5>;

This loop is actually quite big, but since we account stores as cheap, we unroll it when optimizing for size
on mainline.  We probably ought to fix this on the vectorizer side: blocking vectorization of a hot loop
just because the inner loops had not been completely unrolled yet at that point seems stupid at best.
Again, I would suggest just XFAILing this until we fix it somehow.

Index: testsuite/gcc.dg/ipa/ipacost-1.c
===================================================================
--- testsuite/gcc.dg/ipa/ipacost-1.c	(revision 145799)
+++ testsuite/gcc.dg/ipa/ipacost-1.c	(working copy)
@@ -46,6 +46,8 @@ i_can_not_be_propagated_fully2 (int *a)
 main()
 {
   i_can_be_propagated_fully2 (array);
+  i_can_be_propagated_fully2 (array);
+  i_can_not_be_propagated_fully2 (array);
   i_can_not_be_propagated_fully2 (array);
 }
 
Index: testsuite/gcc.dg/ipa/ipacost-2.c
===================================================================
--- testsuite/gcc.dg/ipa/ipacost-2.c	(revision 145799)
+++ testsuite/gcc.dg/ipa/ipacost-2.c	(working copy)
@@ -47,6 +47,8 @@ i_can_not_be_propagated_fully2 (int *a)
 main()
 {
   i_can_be_propagated_fully2 (array);
+  i_can_be_propagated_fully2 (array);
+  i_can_not_be_propagated_fully2 (array);
   i_can_not_be_propagated_fully2 (array);
 }
 
@@ -54,7 +56,7 @@ main()
 /* { dg-final { scan-ipa-dump-times "versioned function i_can_be_propagated_fully " 1 "cp"  } } */
 /* { dg-final { scan-ipa-dump-times "versioned function i_can_not_be_propagated_fully2" 1 "cp"  } } */
 /* { dg-final { scan-ipa-dump-times "versioned function i_can_not_be_propagated_fully " 1 "cp"  } } */
-/* { dg-final { scan-tree-dump-not "i_can_be_propagated" "optimized"  } } */
-/* { dg-final { scan-tree-dump-not "i_can_be_propagated" "optimized"  } } */
+/* { dg-final { scan-tree-dump-not "i_can_be_propagated_fully \\(" "optimized"  } } */
+/* { dg-final { scan-tree-dump-not "i_can_be_propagated_fully2 \\(" "optimized"  } } */
 /* { dg-final { cleanup-ipa-dump "cp" } } */
 /* { dg-final { cleanup-tree-dump "optimized" } } */
Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c	(revision 145799)
+++ tree-ssa-loop-ivcanon.c	(working copy)
@@ -136,8 +136,8 @@ tree_num_loop_insns (struct loop *loop, 
    1) exit condition gets removed (2 insns)
    2) increment of the control variable gets removed (2 insns)
    3) All remaining statements are likely to get simplified
-      due to constant propagation.  Hard to estimate; just
-      as a heuristics we decrease the rest by 1/3.
+      due to constant propagation and optimization on memory stores.
+      Hard to estimate; just as a heuristics we decrease the rest by 1/3.
 
    NINSNS is the number of insns in the loop before unrolling.
    NUNROLL is the number of times the loop is unrolled.  */
@@ -149,7 +149,7 @@ estimated_unrolled_size (unsigned HOST_W
   HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
   if (unr_insns <= 0)
     unr_insns = 1;
-  unr_insns *= (nunroll + 1);
+  unr_insns *= nunroll;
 
   return unr_insns;
 }
Index: tree-inline.c
===================================================================
--- tree-inline.c	(revision 145799)
+++ tree-inline.c	(working copy)
@@ -2749,7 +2749,8 @@ estimate_move_cost (tree type)
 /* Returns cost of operation CODE, according to WEIGHTS  */
 
 static int
-estimate_operator_cost (enum tree_code code, eni_weights *weights)
+estimate_operator_cost (enum tree_code code, eni_weights *weights,
+			tree op1 ATTRIBUTE_UNUSED, tree op2)
 {
   switch (code)
     {
@@ -2859,7 +2860,9 @@ estimate_operator_cost (enum tree_code c
     case FLOOR_MOD_EXPR:
     case ROUND_MOD_EXPR:
     case RDIV_EXPR:
-      return weights->div_mod_cost;
+      if (TREE_CODE (op2) != INTEGER_CST)
+        return weights->div_mod_cost;
+      return 1;
 
     default:
       /* We expect a copy assignment with no operator.  */
@@ -2896,6 +2899,7 @@ estimate_num_insns (gimple stmt, eni_wei
   unsigned cost, i;
   enum gimple_code code = gimple_code (stmt);
   tree lhs;
+  tree rhs;
 
   switch (code)
     {
@@ -2919,16 +2923,35 @@ estimate_num_insns (gimple stmt, eni_wei
 	 of moving something into "a", which we compute using the function
 	 estimate_move_cost.  */
       lhs = gimple_assign_lhs (stmt);
+      rhs = gimple_assign_rhs1 (stmt);
+
+      /* EH magic stuff is most probably going to be optimized out.
+         We rarely really need to save EH info for unwinding
+         nested exceptions.  */
+      if (TREE_CODE (lhs) == FILTER_EXPR
+	  || TREE_CODE (lhs) == EXC_PTR_EXPR
+          || TREE_CODE (rhs) == FILTER_EXPR
+	  || TREE_CODE (rhs) == EXC_PTR_EXPR)
+	return 0;
       if (is_gimple_reg (lhs))
 	cost = 0;
       else
 	cost = estimate_move_cost (TREE_TYPE (lhs));
 
-      cost += estimate_operator_cost (gimple_assign_rhs_code (stmt), weights);
+      if (!is_gimple_reg (rhs) && !is_gimple_min_invariant (rhs))
+	cost += estimate_move_cost (TREE_TYPE (rhs));
+
+      cost += estimate_operator_cost (gimple_assign_rhs_code (stmt), weights,
+      				      gimple_assign_rhs1 (stmt),
+				      get_gimple_rhs_class (gimple_assign_rhs_code (stmt))
+				      == GIMPLE_BINARY_RHS
+				      ? gimple_assign_rhs2 (stmt) : NULL);
       break;
 
     case GIMPLE_COND:
-      cost = 1 + estimate_operator_cost (gimple_cond_code (stmt), weights);
+      cost = 1 + estimate_operator_cost (gimple_cond_code (stmt), weights,
+      				         gimple_op (stmt, 0),
+				         gimple_op (stmt, 1));
       break;
 
     case GIMPLE_SWITCH:
@@ -2937,7 +2960,10 @@ estimate_num_insns (gimple stmt, eni_wei
 
 	 TODO: once the switch expansion logic is sufficiently separated, we can
 	 do better job on estimating cost of the switch.  */
-      cost = gimple_switch_num_labels (stmt) * 2;
+      if (weights->time_based)
+        cost = floor_log2 (gimple_switch_num_labels (stmt)) * 2;
+      else
+        cost = gimple_switch_num_labels (stmt) * 2;
       break;
 
     case GIMPLE_CALL:
@@ -2960,8 +2986,7 @@ estimate_num_insns (gimple stmt, eni_wei
 	    case BUILT_IN_CONSTANT_P:
 	      return 0;
 	    case BUILT_IN_EXPECT:
-	      cost = 0;
-	      break;
+	      return 0;
 
 	    /* Prefetch instruction is not expensive.  */
 	    case BUILT_IN_PREFETCH:
@@ -2975,6 +3000,8 @@ estimate_num_insns (gimple stmt, eni_wei
 	if (decl)
 	  funtype = TREE_TYPE (decl);
 
+	if (!VOID_TYPE_P (TREE_TYPE (funtype)))
+	  cost += estimate_move_cost (TREE_TYPE (funtype));
 	/* Our cost must be kept in sync with
 	   cgraph_estimate_size_after_inlining that does use function
 	   declaration to figure out the arguments.  */
@@ -3095,11 +3122,13 @@ init_inline_once (void)
   eni_inlining_weights.target_builtin_call_cost = 1;
   eni_inlining_weights.div_mod_cost = 10;
   eni_inlining_weights.omp_cost = 40;
+  eni_inlining_weights.time_based = true;
 
   eni_size_weights.call_cost = 1;
   eni_size_weights.target_builtin_call_cost = 1;
   eni_size_weights.div_mod_cost = 1;
   eni_size_weights.omp_cost = 40;
+  eni_size_weights.time_based = false;
 
   /* Estimating time for call is difficult, since we have no idea what the
      called function does.  In the current uses of eni_time_weights,
@@ -3109,6 +3138,7 @@ init_inline_once (void)
   eni_time_weights.target_builtin_call_cost = 10;
   eni_time_weights.div_mod_cost = 10;
   eni_time_weights.omp_cost = 40;
+  eni_time_weights.time_based = true;
 }
 
 /* Estimate the number of instructions in a gimple_seq. */
Index: tree-inline.h
===================================================================
--- tree-inline.h	(revision 145799)
+++ tree-inline.h	(working copy)
@@ -125,6 +125,11 @@ typedef struct eni_weights_d
 
   /* Cost for omp construct.  */
   unsigned omp_cost;
+
+  /* True when the time of the statement should be estimated; then e.g.
+     the cost of a switch statement is logarithmic rather than linear in
+     the number of cases.  */
+  bool time_based;
 } eni_weights;
 
 /* Weights that estimate_num_insns uses for heuristics in inlining.  */


