Re: Improve unrolled size estimates
On Mon, 11 May 2009, Jan Hubicka wrote:
> >
> > That is, I still dislike that you only handle constant index loads
> > from strings. How do the benchmark numbers look if you
> > remove that?
>
> I handle both STRING_CST and DECL with CONSTANT flag and DECL_INITIAL.
Oh, indeed you do. Ok, change
+ if ((DECL_P (base) && TREE_CONSTANT (base) && DECL_INITIAL (base))
+ || TREE_CODE (base) == STRING_CST)
to what get_symbol_constant_value checks, namely
+  if ((DECL_P (base)
+       && TREE_READONLY (base)
+       && DECL_INITIAL (base)
+       && targetm.binds_local_p (base))
+      || CONSTANT_CLASS_P (base))
and in the loop
+  while (handled_component_p (base)
+         || (TREE_CODE (base) == ARRAY_REF
+             && constant_after_peeling (TREE_OPERAND (base, 1),
+                                        stmt, loop)))
+    base = TREE_OPERAND (base, 0);
+  return (DECL_P (base) || TREE_CODE (base) == STRING_CST);
instead bail out for non-constant indices like
+  while (handled_component_p (base))
+    {
+      if (TREE_CODE (base) == ARRAY_REF
+          && !constant_after_peeling (TREE_OPERAND (base, 1), ...))
+        return false;
+      base = TREE_OPERAND (base, 0);
+    }
+  return true;
That removes the fancy duplicate return condition.
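For reference, here is roughly how the two pieces would fit together.  This is
only an outline (it assumes base already points at the memory operand being
analyzed and that the rest of constant_after_peeling stays as in the patch
below):

  /* Walk down the reference; every variable array index must itself be
     constant after peeling, otherwise the load cannot be folded.  */
  while (handled_component_p (base))
    {
      if (TREE_CODE (base) == ARRAY_REF
          && !constant_after_peeling (TREE_OPERAND (base, 1), stmt, loop))
        return false;
      base = TREE_OPERAND (base, 0);
    }

  /* The base object must be read-only constant memory we can look into,
     the same condition get_symbol_constant_value checks.  */
  return ((DECL_P (base)
           && TREE_READONLY (base)
           && DECL_INITIAL (base)
           && targetm.binds_local_p (base))
          || CONSTANT_CLASS_P (base));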
Further spelling fixes are noted on the new version below; apply them to the
original.
> > > I am afraid that I don't have time right now to do all that; there are
> > > still too many other things on my TODO list.
> > >
> > > I've dropped the patch into pretty-ipa. There are several improvements:
> > > SPECINT EON 2130->2465 (base only). GEOM is up 1433->1454
> > > EON has a loop zeroing a small vector in the inner loop. I guess it
> > > now gets fully unrolled. Size seems the same.
> > > SPECFP base Wupwise 1485->1491, swim 1529->1540, applu 1250->1275,
> > > equake 1680->1775, facerec 1991->2030,
> > > (equake is important)
> > > SPECFP peak wupwise 1395->1417, applu 1326->1355, mesa 1472->1500
> > > art 2219->2248,
> > >
> > > There is also a code size regression on botan
> > > http://gcc.opensuse.org/c++bench-frescobaldi-ipa-64/botan/
> > > and raytracer
> > > http://gcc.opensuse.org/c++bench-frescobaldi-ipa-64/raytracer/
> > >
> > > So I guess it is worthwhile to play with it. I would however be happy
> > > if someone beats me to it ;)))
> >
> > Dissecting the patch would be nice. The formula adjustments look
> > pretty obvious, as do the last-iteration fixes...
>
> OK, this lets us move ahead ;)
> I've regtested this patch and am doing full testing now on x86_64-linux.
> This patch however adds two new failures:
> FAIL: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce2 "c.array"
> FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized "my_array"
>
> In the first case there is a loop computing the sum of a constant array
> of 3 elements that is supposed to be optimized into a constant.  In the
> other there are two loops iterating 4 times, one reading an array and
> computing another, which are no longer guessed to decrease in size by
> the cunrolli pass and thus wait for the cunroll pass to unroll them.
> The testcase checks that the temporary array computed by the first loop
> and consumed by the second is optimized away.
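(For the record, the shape of gcc.dg/tree-ssa/loop-36.c is roughly the
following; this is an illustrative sketch based on the description above,
not the literal testcase:

/* Illustrative sketch, not the actual gcc.dg/tree-ssa/loop-36.c.  */
static const struct { int array[3]; } c = { { 1, 2, 3 } };

int
sum (void)
{
  int i, s = 0;
  for (i = 0; i < 3; i++)
    s += c.array[i];
  /* After complete unrolling this becomes
     c.array[0] + c.array[1] + c.array[2], which constant folding should
     reduce to 6, so no reference to "c.array" should survive into the
     dce2 dump.  */
  return s;
}
)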
>
> Both show a real missed optimization.  In gcc.dg/tree-ssa/loop-37.c the
> problem is that we don't do CCP after unrolling, so we don't fold the
> constant access to the constant array.  In gcc.dg/tree-ssa/loop-36.c we
> end up with a TARGET_MEM_EXPR doing the same access as a MEM_EXPR, which
> prevents us from realizing that those can be promoted to registers even
> if an extra FRE is scheduled after loop unrolling.
>
> Both testcases will fail on mainline if the number of iterations is
> increased by 1, and will pass with the patch if the number of iterations
> is decreased to 2, so I can just update the testsuite accordingly, or
> xfail them, or we need to implement some hack here.
>
> Honza
>
> * tree-ssa-loop-ivcanon.c (tree_num_loop_insns): Fix off-by-one error.
> (struct loop_size): New structure.
> (constant_after_peeling): New predicate.
> (tree_estimate_loop_size): New function.
> (estimated_unrolled_size): Rewrite for new estimates.
> (try_unroll_loop_completely): Use new estimates.
>
> * testsuite/gcc.dg/tree-ssa/pr21829.c: Simplify matching since
> we now optimize better.
> Index: tree-ssa-loop-ivcanon.c
> ===================================================================
> --- tree-ssa-loop-ivcanon.c (revision 147349)
> +++ tree-ssa-loop-ivcanon.c (working copy)
> @@ -118,7 +118,7 @@ tree_num_loop_insns (struct loop *loop,
> {
> basic_block *body = get_loop_body (loop);
> gimple_stmt_iterator gsi;
> - unsigned size = 1, i;
> + unsigned size = 0, i;
>
> for (i = 0; i < loop->num_nodes; i++)
> for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> @@ -128,28 +128,115 @@ tree_num_loop_insns (struct loop *loop,
> return size;
> }
>
> -/* Estimate number of insns of completely unrolled loop. We assume
> - that the size of the unrolled loop is decreased in the
> - following way (the numbers of insns are based on what
> - estimate_num_insns returns for appropriate statements):
> -
> - 1) exit condition gets removed (2 insns)
> - 2) increment of the control variable gets removed (2 insns)
> - 3) All remaining statements are likely to get simplified
> - due to constant propagation. Hard to estimate; just
> - as a heuristics we decrease the rest by 1/3.
> +/* Describe size of loop as detected by tree_estimate_loop_size. */
> +struct loop_size
> +{
> + /* Number of instructions in the loop. */
> + int overall;
vertical space
> + /* Number of instructions that will be likely optimized out in
> + peeled iterations of loop (i.e. computation based on induction
> + variable where induction variable starts at known constant.) */
> + int eliminated_by_peeling;
likewise
> + /* Same statistics for last iteration of loop: it is smaller because
> + instructions after exit are not executed. */
> + int last_iteration;
> + int last_iteration_eliminated_by_peeling;
> +};
> +
> +/* Computes an estimated number of insns in LOOP, weighted by WEIGHTS.
> + Return results in SIZE, estimate benefits for complette unrolling existing by EXIT. */
complete
exiting
> +
> +static void
> +tree_estimate_loop_size (struct loop *loop, edge exit, struct loop_size *size)
> +{
> + basic_block *body = get_loop_body (loop);
> + gimple_stmt_iterator gsi;
> + unsigned int i;
> + bool after_exit;
> +
> + size->overall = 0;
> + size->eliminated_by_peeling = 0;
> + size->last_iteration = 0;
> + size->last_iteration_eliminated_by_peeling = 0;
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "Estimating sizes for loop %i\n", loop->num);
> + for (i = 0; i < loop->num_nodes; i++)
> + {
> + if (exit && body[i] != exit->src
> + && dominated_by_p (CDI_DOMINATORS, body[i], exit->src))
> + after_exit = true;
> + else
> + after_exit = false;
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, " BB: %i, after_exit: %i\n", body[i]->index, after_exit);
> +
> + for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> + {
> + gimple stmt = gsi_stmt (gsi);
> + int num = estimate_num_insns (stmt, &eni_size_weights);
> + bool likely_eliminated = false;
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, " size: %3i ", num);
> + print_gimple_stmt (dump_file, gsi_stmt (gsi), 0, 0);
> + }
> +
> + /* Look for reasons why we might optimize this stmt away. */
> +
> + /* Exit conditional. */
> + if (body[i] == exit->src && stmt == last_stmt (exit->src))
> + {
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, " Exit condition will be eliminated.\n");
> + likely_eliminated = true;
> + }
> +
> + size->overall += num;
> + if (likely_eliminated)
> + size->eliminated_by_peeling += num;
> + if (!after_exit)
> + {
> + size->last_iteration += num;
> + if (likely_eliminated)
> + size->last_iteration_eliminated_by_peeling += num;
> + }
> + }
> + }
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "size: %i-%i, last_iteration: %i-%i\n", size->overall,
> + size->eliminated_by_peeling, size->last_iteration,
> + size->last_iteration_eliminated_by_peeling);
> +
> + free (body);
> +}
>
> - NINSNS is the number of insns in the loop before unrolling.
> - NUNROLL is the number of times the loop is unrolled. */
> +/* Estimate number of insns of completely unrolled loop.
> + It is (NUNROLL + 1) * size of loop body with taking into account
> + the fact that in last copy everything after exit conditoinal
conditional
> + is dead and that some instructions will be eliminated after
> + peeling.
> +
> + Loop body is likely going to simplify futher, this is dificult
difficult
> + to guess, we just decrease the result by 1/3. */
>
> static unsigned HOST_WIDE_INT
> -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +estimated_unrolled_size (struct loop_size *size,
> unsigned HOST_WIDE_INT nunroll)
> {
> - HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> + HOST_WIDE_INT unr_insns = ((nunroll)
> + * (HOST_WIDE_INT) (size->overall
> + - size->eliminated_by_peeling));
> + /* Decrease cost of induction variable increment. */
> + unr_insns --;
> + if (!nunroll)
> + unr_insns = 0;
> + unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
> +
> + unr_insns = unr_insns * 2 / 3;
> if (unr_insns <= 0)
> unr_insns = 1;
> - unr_insns *= (nunroll + 1);
>
> return unr_insns;
> }
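Just to sanity-check the new formula with made-up numbers: for a body with
overall = 12 insns of which eliminated_by_peeling = 6, last_iteration = 8
with last_iteration_eliminated_by_peeling = 6, and nunroll = 3, the full
copies contribute 3 * (12 - 6) - 1 = 17 insns, the last iteration adds
8 - 6 = 2, and the 1/3 simplification guess leaves 19 * 2 / 3 = 12, which
looks sane.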
> @@ -165,6 +252,7 @@ try_unroll_loop_completely (struct loop
> {
> unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
> gimple cond;
> + struct loop_size size;
>
> if (loop->inner)
> return false;
> @@ -182,9 +270,10 @@ try_unroll_loop_completely (struct loop
> if (ul == UL_SINGLE_ITER)
> return false;
>
> - ninsns = tree_num_loop_insns (loop, &eni_size_weights);
> + tree_estimate_loop_size (loop, exit, &size);
> + ninsns = size.overall;
>
> - unr_insns = estimated_unrolled_size (ninsns, n_unroll);
> + unr_insns = estimated_unrolled_size (&size, n_unroll);
> if (dump_file && (dump_flags & TDF_DETAILS))
> {
> fprintf (dump_file, " Loop size: %d\n", (int) ninsns);
> Index: testsuite/gcc.dg/tree-ssa/pr21829.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/pr21829.c (revision 147349)
> +++ testsuite/gcc.dg/tree-ssa/pr21829.c (working copy)
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-optimized -fdump-tree-cddce2" } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
>
> int test(int v)
> {
> @@ -16,33 +16,7 @@ int test(int v)
> return x;
> }
>
> -/* This should be optimized to
> +/* This should be unrolled and optimized into conditional set of return value "v < 0". */
>
> - if (v <= 0) goto <L1>; else goto <L3>;
> -
> - <L1>:;
> -
> - # x_1 = PHI <0(3), 1(1)>;
> - <L3>:;
> - return x_1;
> -
> - retaining only a single conditional. This doesn't work as nobody
> - combines the two tests
> -
> - if (v < 0) goto <bb 4>; else goto <bb 3>;
> -
> - <bb 3>:
> -
> - if (v <= 0) goto <bb 4>; else goto <bb 5>;
> -
> - this late in the game. tree-ssa-ifcombine.c would do it if we would
> - unroll the loop during early loop unrolling though.
> -
> - For now vrp2 does all the needed folding and threading and cddce2
> - provides a nice IL to scan. */
> -
> -/* { dg-final { scan-tree-dump-times "if " 1 "optimized" { xfail *-*-* } } } */
> -/* { dg-final { scan-tree-dump-times "if " 2 "cddce2" } } */
> -/* { dg-final { scan-tree-dump "x_. = PHI <0\\\(.\\\), 1\\\(.\\\)>" "cddce2" } } */
> -/* { dg-final { cleanup-tree-dump "cddce2" } } */
> +/* { dg-final { scan-tree-dump-not "if \\(" "optimized" } } */
> /* { dg-final { cleanup-tree-dump "optimized" } } */
Thus, the original version with the above changes is ok if it still
bootstraps & tests.
Thanks for your patience,
Richard.