This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[patch] Partial fix for PR 18048
- From: Zdenek Dvorak <rakdver at atrey dot karlin dot mff dot cuni dot cz>
- To: gcc-patches at gcc dot gnu dot org
- Date: Sun, 24 Oct 2004 11:41:12 +0200
- Subject: [patch] Partial fix for PR 18048
Hello,
this patch contains several improvements to ivopts and related
optimizations that improve the code produced for the mgrid benchmark
quite a bit (although even with the patch, ivopts still spoils the code).
The improvements are:
1) Ivopts produces expressions like
*(&a[start] + 8 * index)
and believes that the memory access created for it will include the
multiplication by 8.  This fails if there are several similar
accesses, since dom will CSE the "8 * index" part (thus adding an
unnecessary multiplication instruction and increasing register
pressure).
The part of the patch in fold-const.c ensures that such an
expression is folded to a[start + index], thus achieving what ivopts
intended.  Unfortunately this solution is not generic enough to catch
all cases where something like this may happen, and it does not work
on LP64 platforms, where this optimization might be incorrect, but it
seems to be the only possibility within the current framework.  I
will start a discussion about a more appropriate solution for 4.1 in
a separate thread.
2) The algorithm used for finding the best set of bivs for the loop is
quite primitive (and cannot be much better due to compile-time
constraints), and as such it is likely to get stuck in local
minima.  This happens in mgrid, causing ivopts to create about
30 bivs...
The patch to try_add_cand_for makes the initial set of candidates
much smaller (based primarily on the important candidates).
The idea is that getting stuck in a local minimum with too many bivs
is a disaster, but if it happens with a few bivs, it just means that
the iv uses cannot be expressed much more cheaply than they are, so we
are not that far from the optimum anyway.
3) A small improvement in number_of_iterations_cond.
build_int_cst (type, -1) is not suitable for producing an unsigned
constant with all bits one, since the bits outside the precision
of the type are set to 1 as well, which makes fold fail to
perform some of the optimizations with such constants.
Bootstrapped & regtested on ppc, i686 and x86_64.  Benchmark results
from i686 (-O2; base is without the patch, peak is with it).  Quick
summary: minor variations except for the huge gain on mgrid; on average
the results improve even if mgrid is not taken into account:
                      Base (without patch)          Peak (with patch)
Benchmarks      Ref Time  Run Time  Ratio     Ref Time  Run Time  Ratio
------------    --------  --------  -----     --------  --------  -----
164.gzip            1400       209    670 *       1400       211    663 *
175.vpr             1400       390    359 *       1400       395    354 *
176.gcc             1100        --      X         1100        --      X
181.mcf             1800       733    245 *       1800       728    247 *
186.crafty          1000       124    804 *       1000       123    814 *
197.parser          1800       401    449 *       1800       400    450 *
252.eon             1300        --      X         1300        --      X
253.perlbmk         1800       208    865 *       1800       205    879 *
254.gap             1100       179    616 *       1100       176    627 *
255.vortex          1900       231    822 *       1900       236    805 *
256.bzip2           1500       351    427 *       1500       351    427 *
300.twolf           3000       799    376 *       3000       800    375 *
168.wupwise         1600       295    543 *       1600       286    559 *
171.swim            3100       720    430 *       3100       725    428 *
172.mgrid           1800       574    314 *       1800       504    357 *
173.applu           2100       528    398 *       2100       520    404 *
177.mesa            1400       207    675 *       1400       207    675 *
178.galgel          2900        --      X         2900        --      X
179.art             2600      1458    178 *       2600      1463    178 *
183.equake          1300       294    442 *       1300       294    442 *
187.facerec         1900       473    401 *       1900       467    406 *
188.ammp            2200       615    357 *       2200       609    361 *
189.lucas           2000       436    458 *       2000       437    458 *
191.fma3d           2100       379    554 *       2100       382    549 *
200.sixtrack        1100       270    408 *       1100       265    415 *
301.apsi            2600       671    387 *       2600       665    391 *
Zdenek
PR tree-optimization/18048
* fold-const.c (try_move_mult_to_index): New function.
(fold): Use try_move_mult_to_index.
* tree-ssa-loop-ivopts.c (try_add_cand_for): Prefer common candidates.
* tree-ssa-loop-niter.c (number_of_iterations_cond): Produce
an all-ones unsigned constant without extra bits.
Index: fold-const.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/fold-const.c,v
retrieving revision 1.468
diff -c -3 -p -r1.468 fold-const.c
*** fold-const.c 11 Oct 2004 03:42:00 -0000 1.468
--- fold-const.c 23 Oct 2004 15:13:53 -0000
*************** tree_swap_operands_p (tree arg0, tree ar
*** 5967,5972 ****
--- 5967,6050 ----
return 0;
}
+ /* Tries to replace &a[idx] CODE s * delta with &a[idx CODE delta], if s is
+ the step of the array.  TYPE is the type of the expression.  ADDR is the
+ address.  MULT is the multiplicative expression.  If the function succeeds,
+ the new address expression is returned; otherwise NULL_TREE is returned.  */
+
+ static tree
+ try_move_mult_to_index (tree type, enum tree_code code, tree addr, tree mult)
+ {
+ tree s, delta, step;
+ tree arg0 = TREE_OPERAND (mult, 0), arg1 = TREE_OPERAND (mult, 1);
+ tree ref = TREE_OPERAND (addr, 0), pref;
+ tree ret, pos;
+ tree itype;
+
+ STRIP_NOPS (arg0);
+ STRIP_NOPS (arg1);
+
+ if (TREE_CODE (arg0) == INTEGER_CST)
+ {
+ s = arg0;
+ delta = arg1;
+ }
+ else if (TREE_CODE (arg1) == INTEGER_CST)
+ {
+ s = arg1;
+ delta = arg0;
+ }
+ else
+ return NULL_TREE;
+
+ for (;; ref = TREE_OPERAND (ref, 0))
+ {
+ if (TREE_CODE (ref) == ARRAY_REF)
+ {
+ step = array_ref_element_size (ref);
+
+ if (TREE_CODE (step) != INTEGER_CST)
+ continue;
+
+ itype = TREE_TYPE (step);
+
+ /* If the type sizes do not match, we might run into problems
+ when one of them would overflow. */
+ if (TYPE_PRECISION (itype) != TYPE_PRECISION (type))
+ continue;
+
+ if (!operand_equal_p (step, fold_convert (itype, s), 0))
+ continue;
+
+ delta = fold_convert (itype, delta);
+ break;
+ }
+
+ if (!handled_component_p (ref))
+ return NULL_TREE;
+ }
+
+ /* We found the suitable array reference. So copy everything up to it,
+ and replace the index. */
+
+ pref = TREE_OPERAND (addr, 0);
+ ret = copy_node (pref);
+ pos = ret;
+
+ while (pref != ref)
+ {
+ pref = TREE_OPERAND (pref, 0);
+ TREE_OPERAND (pos, 0) = copy_node (pref);
+ pos = TREE_OPERAND (pos, 0);
+ }
+
+ TREE_OPERAND (pos, 1) = fold (build2 (code, itype,
+ TREE_OPERAND (pos, 1),
+ delta));
+
+ return build1 (ADDR_EXPR, type, ret);
+ }
+
/* Perform constant folding and related simplification of EXPR.
The related simplifications include x*1 => x, x*0 => 0, etc.,
and application of the associative law.
*************** fold (tree expr)
*** 6602,6607 ****
--- 6680,6703 ----
alt0, alt1)),
same));
}
+
+ /* Try replacing &a[i1] + c * i2 with &a[i1 + i2], if c is the step
+ of the array.  The loop optimizer sometimes produces this type of
+ expression.  */
+ if (TREE_CODE (arg0) == ADDR_EXPR
+ && TREE_CODE (arg1) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, PLUS_EXPR, arg0, arg1);
+ if (tem)
+ return fold (tem);
+ }
+ else if (TREE_CODE (arg1) == ADDR_EXPR
+ && TREE_CODE (arg0) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, PLUS_EXPR, arg1, arg0);
+ if (tem)
+ return fold (tem);
+ }
}
else
{
*************** fold (tree expr)
*** 6974,6979 ****
--- 7070,7086 ----
&diff))
return build_int_cst_type (type, diff);
}
+
+ /* Try replacing &a[i1] - c * i2 with &a[i1 - i2], if c is the step
+ of the array.  The loop optimizer sometimes produces this type of
+ expression.  */
+ if (TREE_CODE (arg0) == ADDR_EXPR
+ && TREE_CODE (arg1) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, MINUS_EXPR, arg0, arg1);
+ if (tem)
+ return fold (tem);
+ }
if (TREE_CODE (arg0) == MULT_EXPR
&& TREE_CODE (arg1) == MULT_EXPR
Index: tree-ssa-loop-ivopts.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-ivopts.c,v
retrieving revision 2.20
diff -c -3 -p -r2.20 tree-ssa-loop-ivopts.c
*** tree-ssa-loop-ivopts.c 18 Oct 2004 18:01:09 -0000 2.20
--- tree-ssa-loop-ivopts.c 23 Oct 2004 15:13:53 -0000
*************** try_add_cand_for (struct ivopts_data *da
*** 3603,3622 ****
bitmap act_inv = BITMAP_XMALLOC ();
unsigned i;
struct cost_pair *cp;
bitmap_copy (best_ivs, ivs);
bitmap_copy (best_inv, inv);
! for (i = 0; i < use->n_map_members; i++)
{
! cp = use->cost_map + i;
! if (cp->cost == INFTY)
continue;
bitmap_copy (act_ivs, ivs);
! bitmap_set_bit (act_ivs, cp->cand->id);
! if (cp->depends_on)
! bitmap_a_or_b (act_inv, inv, cp->depends_on);
else
bitmap_copy (act_inv, inv);
act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
--- 3603,3634 ----
bitmap act_inv = BITMAP_XMALLOC ();
unsigned i;
struct cost_pair *cp;
+ bitmap_iterator bi;
+ struct iv_cand *cand;
+ bitmap depends_on;
bitmap_copy (best_ivs, ivs);
bitmap_copy (best_inv, inv);
! /* First try important candidates.  Only if that fails, try the specific
! ones.  Rationale -- in loops with many variables the best choice often
! is to use just one generic biv.  If we added many ivs specific to the
! uses here, the optimization algorithm later would be likely to get stuck
! in a local minimum, thus causing us to create too many ivs.  Going from
! few ivs to more seems more likely to be successful -- starting from few
! ivs, replacing an expensive use by a specific iv should always be a
! win.  */
! EXECUTE_IF_SET_IN_BITMAP (data->important_candidates, 0, i, bi)
{
! cand = iv_cand (data, i);
!
! if (get_use_iv_cost (data, use, cand, &depends_on) == INFTY)
continue;
bitmap_copy (act_ivs, ivs);
! bitmap_set_bit (act_ivs, cand->id);
! if (depends_on)
! bitmap_a_or_b (act_inv, inv, depends_on);
else
bitmap_copy (act_inv, inv);
act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
*************** try_add_cand_for (struct ivopts_data *da
*** 3629,3634 ****
--- 3641,3675 ----
}
}
+ if (best_cost == INFTY)
+ {
+ for (i = 0; i < use->n_map_members; i++)
+ {
+ cp = use->cost_map + i;
+ if (cp->cost == INFTY)
+ continue;
+
+ /* Already tried this. */
+ if (cp->cand->important)
+ continue;
+
+ bitmap_copy (act_ivs, ivs);
+ bitmap_set_bit (act_ivs, cp->cand->id);
+ if (cp->depends_on)
+ bitmap_a_or_b (act_inv, inv, cp->depends_on);
+ else
+ bitmap_copy (act_inv, inv);
+ act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
+
+ if (act_cost < best_cost)
+ {
+ best_cost = act_cost;
+ bitmap_copy (best_ivs, act_ivs);
+ bitmap_copy (best_inv, act_inv);
+ }
+ }
+ }
+
bitmap_copy (ivs, best_ivs);
bitmap_copy (inv, best_inv);
Index: tree-ssa-loop-niter.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-niter.c,v
retrieving revision 2.13
diff -c -3 -p -r2.13 tree-ssa-loop-niter.c
*** tree-ssa-loop-niter.c 22 Oct 2004 17:05:08 -0000 2.13
--- tree-ssa-loop-niter.c 23 Oct 2004 15:13:53 -0000
*************** number_of_iterations_cond (tree type, tr
*** 419,427 ****
d = EXEC_BINARY (LSHIFT_EXPR, niter_type,
build_int_cst_type (niter_type, 1), bits);
s = EXEC_BINARY (RSHIFT_EXPR, niter_type, step0, bits);
! bound = EXEC_BINARY (RSHIFT_EXPR, niter_type,
! build_int_cst (niter_type, -1),
! bits);
assumption = fold (build2 (FLOOR_MOD_EXPR, niter_type, base1, d));
assumption = fold (build2 (EQ_EXPR, boolean_type_node,
--- 419,427 ----
d = EXEC_BINARY (LSHIFT_EXPR, niter_type,
build_int_cst_type (niter_type, 1), bits);
s = EXEC_BINARY (RSHIFT_EXPR, niter_type, step0, bits);
! bound = EXEC_UNARY (BIT_NOT_EXPR, niter_type,
! build_int_cst_type (niter_type, 0));
! bound = EXEC_BINARY (RSHIFT_EXPR, niter_type, bound, bits);
assumption = fold (build2 (FLOOR_MOD_EXPR, niter_type, base1, d));
assumption = fold (build2 (EQ_EXPR, boolean_type_node,