This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Fixes to estimate_num_insns try II
- From: Richard Guenther <rguenther at suse dot de>
- To: Jan Hubicka <hubicka at ucw dot cz>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Tue, 12 May 2009 11:11:30 +0200 (CEST)
- Subject: Re: Fixes to estimate_num_insns try II
- References: <20090511213932.GA5012@kam.mff.cuni.cz>
On Mon, 11 May 2009, Jan Hubicka wrote:
> Hi,
> this is update of earlier patch for cost metrics I sent last month. This time
> there are not that many failures to handle since unroller is no longer too confused.
> There is still one case were we now miss optimization, loop-36.c:
>
> for (d=0; d<4; ++d)
> c.array[d] = a.array[d] * b.array[d];
> for (d=0; d<4; ++d)
> s+=c.array[d];
>
> expect both loops to be unrolled early in order for c.array to be optimized out by FRV.
> With new metrics we account the three memory operations in first loop:
>
> Analyzing # of iterations of loop 1
> exit condition [0, + , 1] <= 3
> bounds on difference of bases: 3 ... 3
> result:
> # of iterations 4, bounded by 4
> (set_nb_iterations_in_loop = 4))
> Loop 1 iterates 4 times.
> Estimating sizes for loop 1
> BB: 4, after_exit: 0
> size: 2 if (d_2 <= 3)
> Exit condition will be eliminated.
> BB: 3, after_exit: 1
> size: 1 D.1605_6 = a.array[d_2];
> size: 1 D.1606_7 = b.array[d_2];
> size: 1 D.1607_8 = D.1605_6 * D.1606_7;
> size: 1 c.array[d_2] = D.1607_8;
> size: 1 d_9 = d_2 + 1;
> Induction variable computation will be folded away.
> size: 7-3, last_iteration: 2-2
> Loop size: 7
> Estimated size after unrolling: 10
>
> So it is estimated that fully unrolling the loop 4 times is not going to be win
> code size wise, that is true. Even with c.array optimized out, the resulting
> loop have 8 memory loads and 4 multiplies compared to 7 instructions in
> original loop.
>
> The second loop is unrolled though:
>
> Loop 2 iterates 4 times.
> Estimating sizes for loop 2
> BB: 8, after_exit: 0
> size: 2 if (d_3 <= 3)
> Exit condition will be eliminated.
> BB: 5, after_exit: 1
> size: 1 D.1608_11 = c.array[d_3];
> size: 1 s_12 = D.1608_11 + s_1;
> size: 1 d_13 = d_3 + 1;
> Induction variable computation will be folded away.
> size: 5-3, last_iteration: 2-2
> Loop size: 5
> Estimated size after unrolling: 5
>
> I think it is not purpose of simple minded cunrolli heuristic to handle complex
> case like this, so I would propose simply decreasing number of iterations so
> the testcase still works that require reducing from 4 to 2.
>
> The remaining testcase updates are quite easy; we need to prevent in lining
> in ipacost testcases and slp-3.c that got unrolled after previous change is no
> longer getting unrolled interanl loop (iterating 8 times containing one load
> and one store, so it is sane decionsion to not do so).
>
> Bootstrapped/regtested x86_64-linux, OK?
Ok.
Thanks,
Richard.
> Honza
>
> * gcc.dg/tree-ssa/loop-36.c: Reduce amount of iterations to 2 so unrolling
> still happens.
> * gcc.dg/ipa/ipacost-1.c: Prevent inlining
> * gcc.dg/ipa/ipacost-2.c: Likewise.
> * gcc.dg/vect/slp-3.c: Loop is no longer unrolled.
>
> * tree-inline.c (estimate_operator_cost): Add operands;
> when division happens by constant, it is cheap.
> (estimate_num_insns): Loads and stores are not having cost of 0;
> EH magic stuff is cheap; when computing runtime cost of switch,
> use log2 base of amount of its cases; builtin_expect has cost of 0;
> compute cost for moving return value of call.
> (init_inline_once): Initialize time_based flags.
> * tree-inline.h (eni_weights_d): Add time_based flag.
> Index: testsuite/gcc.dg/tree-ssa/loop-36.c
> ===================================================================
> *** testsuite/gcc.dg/tree-ssa/loop-36.c (revision 147395)
> --- testsuite/gcc.dg/tree-ssa/loop-36.c (working copy)
> ***************
> *** 1,7 ****
> /* { dg-do compile } */
> /* { dg-options "-O2 -fdump-tree-dce2" } */
>
> ! struct X { float array[4]; };
>
> struct X a,b;
>
> --- 1,7 ----
> /* { dg-do compile } */
> /* { dg-options "-O2 -fdump-tree-dce2" } */
>
> ! struct X { float array[2]; };
>
> struct X a,b;
>
> *************** float foobar () {
> *** 9,17 ****
> float s = 0;
> unsigned int d;
> struct X c;
> ! for (d=0; d<4; ++d)
> c.array[d] = a.array[d] * b.array[d];
> ! for (d=0; d<4; ++d)
> s+=c.array[d];
> return s;
> }
> --- 9,17 ----
> float s = 0;
> unsigned int d;
> struct X c;
> ! for (d=0; d<2; ++d)
> c.array[d] = a.array[d] * b.array[d];
> ! for (d=0; d<2; ++d)
> s+=c.array[d];
> return s;
> }
> Index: testsuite/gcc.dg/ipa/ipacost-1.c
> ===================================================================
> *** testsuite/gcc.dg/ipa/ipacost-1.c (revision 147395)
> --- testsuite/gcc.dg/ipa/ipacost-1.c (working copy)
> *************** i_can_not_be_propagated_fully2 (int *a)
> *** 46,51 ****
> --- 46,53 ----
> main()
> {
> i_can_be_propagated_fully2 (array);
> + i_can_be_propagated_fully2 (array);
> + i_can_not_be_propagated_fully2 (array);
> i_can_not_be_propagated_fully2 (array);
> }
>
> Index: testsuite/gcc.dg/ipa/ipacost-2.c
> ===================================================================
> *** testsuite/gcc.dg/ipa/ipacost-2.c (revision 147395)
> --- testsuite/gcc.dg/ipa/ipacost-2.c (working copy)
> *************** i_can_not_be_propagated_fully2 (int *a)
> *** 47,52 ****
> --- 47,54 ----
> main()
> {
> i_can_be_propagated_fully2 (array);
> + i_can_be_propagated_fully2 (array);
> + i_can_not_be_propagated_fully2 (array);
> i_can_not_be_propagated_fully2 (array);
> }
>
> *************** main()
> *** 54,60 ****
> /* { dg-final { scan-ipa-dump-times "versioned function i_can_be_propagated_fully " 1 "cp" } } */
> /* { dg-final { scan-ipa-dump-times "versioned function i_can_not_be_propagated_fully2" 1 "cp" } } */
> /* { dg-final { scan-ipa-dump-times "versioned function i_can_not_be_propagated_fully " 1 "cp" } } */
> ! /* { dg-final { scan-tree-dump-not "i_can_be_propagated" "optimized" } } */
> ! /* { dg-final { scan-tree-dump-not "i_can_be_propagated" "optimized" } } */
> /* { dg-final { cleanup-ipa-dump "cp" } } */
> /* { dg-final { cleanup-tree-dump "optimized" } } */
> --- 56,62 ----
> /* { dg-final { scan-ipa-dump-times "versioned function i_can_be_propagated_fully " 1 "cp" } } */
> /* { dg-final { scan-ipa-dump-times "versioned function i_can_not_be_propagated_fully2" 1 "cp" } } */
> /* { dg-final { scan-ipa-dump-times "versioned function i_can_not_be_propagated_fully " 1 "cp" } } */
> ! /* { dg-final { scan-tree-dump-not "i_can_be_propagated_fully \\(" "optimized" } } */
> ! /* { dg-final { scan-tree-dump-not "i_can_be_propagated_fully2 \\(" "optimized" } } */
> /* { dg-final { cleanup-ipa-dump "cp" } } */
> /* { dg-final { cleanup-tree-dump "optimized" } } */
> Index: testsuite/gcc.dg/vect/slp-3.c
> ===================================================================
> *** testsuite/gcc.dg/vect/slp-3.c (revision 147395)
> --- testsuite/gcc.dg/vect/slp-3.c (working copy)
> *************** int main (void)
> *** 142,149 ****
> return 0;
> }
>
> ! /* One of the loops gets complettely unrolled. */
> ! /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail vect_no_align } } } */
> ! /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { xfail vect_no_align } } } */
> /* { dg-final { cleanup-tree-dump "vect" } } */
>
> --- 142,148 ----
> return 0;
> }
>
> ! /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { xfail vect_no_align } } } */
> ! /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { xfail vect_no_align } } } */
> /* { dg-final { cleanup-tree-dump "vect" } } */
>
> Index: tree-inline.c
> ===================================================================
> *** tree-inline.c (revision 147395)
> --- tree-inline.c (working copy)
> *************** estimate_move_cost (tree type)
> *** 2783,2789 ****
> /* Returns cost of operation CODE, according to WEIGHTS */
>
> static int
> ! estimate_operator_cost (enum tree_code code, eni_weights *weights)
> {
> switch (code)
> {
> --- 2783,2790 ----
> /* Returns cost of operation CODE, according to WEIGHTS */
>
> static int
> ! estimate_operator_cost (enum tree_code code, eni_weights *weights,
> ! tree op1 ATTRIBUTE_UNUSED, tree op2)
> {
> switch (code)
> {
> *************** estimate_operator_cost (enum tree_code c
> *** 2893,2899 ****
> case FLOOR_MOD_EXPR:
> case ROUND_MOD_EXPR:
> case RDIV_EXPR:
> ! return weights->div_mod_cost;
>
> default:
> /* We expect a copy assignment with no operator. */
> --- 2894,2902 ----
> case FLOOR_MOD_EXPR:
> case ROUND_MOD_EXPR:
> case RDIV_EXPR:
> ! if (TREE_CODE (op2) != INTEGER_CST)
> ! return weights->div_mod_cost;
> ! return 1;
>
> default:
> /* We expect a copy assignment with no operator. */
> *************** estimate_num_insns (gimple stmt, eni_wei
> *** 2930,2935 ****
> --- 2933,2939 ----
> unsigned cost, i;
> enum gimple_code code = gimple_code (stmt);
> tree lhs;
> + tree rhs;
>
> switch (code)
> {
> *************** estimate_num_insns (gimple stmt, eni_wei
> *** 2953,2968 ****
> of moving something into "a", which we compute using the function
> estimate_move_cost. */
> lhs = gimple_assign_lhs (stmt);
> if (is_gimple_reg (lhs))
> cost = 0;
> else
> cost = estimate_move_cost (TREE_TYPE (lhs));
>
> ! cost += estimate_operator_cost (gimple_assign_rhs_code (stmt), weights);
> break;
>
> case GIMPLE_COND:
> ! cost = 1 + estimate_operator_cost (gimple_cond_code (stmt), weights);
> break;
>
> case GIMPLE_SWITCH:
> --- 2957,2991 ----
> of moving something into "a", which we compute using the function
> estimate_move_cost. */
> lhs = gimple_assign_lhs (stmt);
> + rhs = gimple_assign_rhs1 (stmt);
> +
> + /* EH magic stuff is most probably going to be optimized out.
> + We rarely really need to save EH info for unwinding
> + nested exceptions. */
> + if (TREE_CODE (lhs) == FILTER_EXPR
> + || TREE_CODE (lhs) == EXC_PTR_EXPR
> + || TREE_CODE (rhs) == FILTER_EXPR
> + || TREE_CODE (rhs) == EXC_PTR_EXPR)
> + return 0;
> if (is_gimple_reg (lhs))
> cost = 0;
> else
> cost = estimate_move_cost (TREE_TYPE (lhs));
>
> ! if (!is_gimple_reg (rhs) && !is_gimple_min_invariant (rhs))
> ! cost += estimate_move_cost (TREE_TYPE (rhs));
> !
> ! cost += estimate_operator_cost (gimple_assign_rhs_code (stmt), weights,
> ! gimple_assign_rhs1 (stmt),
> ! get_gimple_rhs_class (gimple_assign_rhs_code (stmt))
> ! == GIMPLE_BINARY_RHS
> ! ? gimple_assign_rhs2 (stmt) : NULL);
> break;
>
> case GIMPLE_COND:
> ! cost = 1 + estimate_operator_cost (gimple_cond_code (stmt), weights,
> ! gimple_op (stmt, 0),
> ! gimple_op (stmt, 1));
> break;
>
> case GIMPLE_SWITCH:
> *************** estimate_num_insns (gimple stmt, eni_wei
> *** 2971,2977 ****
>
> TODO: once the switch expansion logic is sufficiently separated, we can
> do better job on estimating cost of the switch. */
> ! cost = gimple_switch_num_labels (stmt) * 2;
> break;
>
> case GIMPLE_CALL:
> --- 2994,3003 ----
>
> TODO: once the switch expansion logic is sufficiently separated, we can
> do better job on estimating cost of the switch. */
> ! if (weights->time_based)
> ! cost = floor_log2 (gimple_switch_num_labels (stmt)) * 2;
> ! else
> ! cost = gimple_switch_num_labels (stmt) * 2;
> break;
>
> case GIMPLE_CALL:
> *************** estimate_num_insns (gimple stmt, eni_wei
> *** 2994,3001 ****
> case BUILT_IN_CONSTANT_P:
> return 0;
> case BUILT_IN_EXPECT:
> ! cost = 0;
> ! break;
>
> /* Prefetch instruction is not expensive. */
> case BUILT_IN_PREFETCH:
> --- 3020,3026 ----
> case BUILT_IN_CONSTANT_P:
> return 0;
> case BUILT_IN_EXPECT:
> ! return 0;
>
> /* Prefetch instruction is not expensive. */
> case BUILT_IN_PREFETCH:
> *************** estimate_num_insns (gimple stmt, eni_wei
> *** 3009,3014 ****
> --- 3034,3041 ----
> if (decl)
> funtype = TREE_TYPE (decl);
>
> + if (!VOID_TYPE_P (TREE_TYPE (funtype)))
> + cost += estimate_move_cost (TREE_TYPE (funtype));
> /* Our cost must be kept in sync with
> cgraph_estimate_size_after_inlining that does use function
> declaration to figure out the arguments. */
> *************** init_inline_once (void)
> *** 3133,3143 ****
> --- 3160,3172 ----
> eni_inlining_weights.target_builtin_call_cost = 1;
> eni_inlining_weights.div_mod_cost = 10;
> eni_inlining_weights.omp_cost = 40;
> + eni_inlining_weights.time_based = true;
>
> eni_size_weights.call_cost = 1;
> eni_size_weights.target_builtin_call_cost = 1;
> eni_size_weights.div_mod_cost = 1;
> eni_size_weights.omp_cost = 40;
> + eni_size_weights.time_based = false;
>
> /* Estimating time for call is difficult, since we have no idea what the
> called function does. In the current uses of eni_time_weights,
> *************** init_inline_once (void)
> *** 3147,3152 ****
> --- 3176,3182 ----
> eni_time_weights.target_builtin_call_cost = 10;
> eni_time_weights.div_mod_cost = 10;
> eni_time_weights.omp_cost = 40;
> + eni_time_weights.time_based = true;
> }
>
> /* Estimate the number of instructions in a gimple_seq. */
> Index: tree-inline.h
> ===================================================================
> *** tree-inline.h (revision 147395)
> --- tree-inline.h (working copy)
> *************** typedef struct eni_weights_d
> *** 130,135 ****
> --- 130,140 ----
>
> /* Cost for omp construct. */
> unsigned omp_cost;
> +
> + /* True when time of statemnt should be estimated. Thus i.e
> + cost of switch statement is logarithmic rather than linear in number
> + of cases. */
> + bool time_based;
> } eni_weights;
>
> /* Weights that estimate_num_insns uses for heuristics in inlining. */
>
>
--
Richard Guenther <rguenther@suse.de>
Novell / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 - GF: Markus Rex