This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] Fix PR46556 (poor address generation)
- From: Richard Guenther <rguenther at suse dot de>
- To: "William J. Schmidt" <wschmidt at linux dot vnet dot ibm dot com>
- Cc: gcc-patches at gcc dot gnu dot org, bergner at vnet dot ibm dot com, rguenth at gcc dot gnu dot org
- Date: Thu, 6 Oct 2011 12:13:08 +0200 (CEST)
- Subject: Re: [PATCH] Fix PR46556 (poor address generation)
- References: <1317831212.4199.45.camel@oc2474580526.ibm.com>
On Wed, 5 Oct 2011, William J. Schmidt wrote:
> This patch addresses the poor code generation in PR46556 for the
> following code:
>
> struct x
> {
> int a[16];
> int b[16];
> int c[16];
> };
>
> extern void foo (int, int, int);
>
> void
> f (struct x *p, unsigned int n)
> {
> foo (p->a[n], p->c[n], p->b[n]);
> }
>
> Prior to the fix for PR32698, gcc calculated the offset for accessing
> the array elements as: n*4; 64+n*4; 128+n*4.
>
> Following that fix, the offsets are calculated as: n*4; (n+16)*4; (n
> +32)*4. This led to poor code generation on powerpc64 targets, among
> others.
>
> The poor code generation was observed to not occur in loops, as the
> IVOPTS code does a good job of lowering these expressions to MEM_REFs.
> It was previously suggested that perhaps a general pass to lower memory
> accesses to MEM_REFs in GIMPLE would solve not only this, but other
> similar problems. I spent some time looking into various approaches to
> this, and reviewing some previous attempts to do similar things. In the
> end, I've concluded that this is a bad idea in practice because of the
> loss of useful aliasing information. In particular, early lowering of
> component references causes us to lose the ability to disambiguate
> non-overlapping references in the same structure, and there is no simple
> way to carry the necessary aliasing information along with the
> replacement MEM_REFs to avoid this. While some performance gains are
> available with GIMPLE lowering of memory accesses, there are also
> offsetting performance losses, and I suspect this would just be a
> continuous source of bug reports into the future.
>
> Therefore the current patch is a much simpler approach to solve the
> specific problem noted in the PR. There are two pieces to the patch:
>
> * The offending addressing pattern is matched in GIMPLE and transformed
> into a restructured MEM_REF that distributes the multiply, so that (n
> +32)*4 becomes 4*n+128 as before. This is done during the reassociation
> pass, for reasons described below. The transformation only occurs in
> non-loop blocks, since IVOPTS does a good job on such things within
> loops.
> * A tweak is added to the RTL forward-propagator to avoid propagating
> into memory references based on a single base register with no offset,
> under certain circumstances. This improves sharing of base registers
> for accesses within the same structure and slightly lowers register
> pressure.
>
> It would be possible to separate these into two patches if that's
> preferred. I chose to combine them because together they provide the
> ideal code generation that the new test cases test for.
>
> I initially implemented the pattern matcher during expand, but I found
> that the expanded code for two accesses to the same structure was often
> not being CSEd well. So I moved it back into the GIMPLE phases prior to
> DOM to take advantage of its CSE. To avoid an additional complete pass
> over the IL, I chose to piggyback on the reassociation pass. This
> transformation is not technically a reassociation, but it is related
> enough to not be a complete wart.
>
> One noob question about this: It would probably be preferable to have
> this transformation only take place during the second reassociation
> pass, so the ARRAY_REFs are seen by earlier optimization phases. Is
> there an easy way to detect that it's the second pass without having to
> generate a separate pass entry point?
>
> One other general question about the pattern-match transformation: Is
> this an appropriate transformation for all targets, or should it be
> somehow gated on available addressing modes on the target processor?
>
> Bootstrapped and regression tested on powerpc64-linux-gnu. Verified no
> performance degradations on that target for SPEC CPU2000 and CPU2006.
>
> I'm looking for eventual approval for trunk after any comments are
> resolved. Thanks!
People have already commented on pieces, so I'm looking only
at the tree-ssa-reassoc.c pieces (did you consider piggy-backing
on IVOPTs instead?). The idea is to expose additional CSE
opportunities, right? So it's sort-of a strength-reduction
optimization on scalar code (classically, strength reduction
in loops transforms for (i) { z = i*x; } into z = 0; for (i) { z += x; }).
That might be worthwhile in general, even for non-address cases.
So - if you rename that thing to tree-ssa-strength-reduce.c you
can get away without piggy-backing on anything ;) If you
structure it to detect a strength-reduction opportunity
(thus, you'd need to match two or more of the patterns at the same time)
that would be a bonus ... generalizing it a little bit would be
another.
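[Richard's classical strength-reduction example can be written out as a standalone C sketch (an editorial illustration for this archive, not GCC code); the function names are made up for the example:]

```c
#include <assert.h>

/* Naive form: a multiply on every iteration.  */
long sum_with_multiply (long x, int n)
{
  long total = 0;
  for (int i = 0; i < n; i++)
    total += i * x;          /* z = i*x recomputed each time */
  return total;
}

/* Strength-reduced form as Richard sketches it: z starts at 0
   and a single add per iteration replaces the multiply.  */
long sum_strength_reduced (long x, int n)
{
  long total = 0, z = 0;
  for (int i = 0; i < n; i++)
    {
      total += z;            /* z tracks i*x */
      z += x;                /* one add instead of one multiply */
    }
  return total;
}
```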
Now some comments on the patch ...
> Bill
>
>
> 2011-10-05 Bill Schmidt <wschmidt@linux.vnet.ibm.com>
>
> gcc:
>
> PR rtl-optimization/46556
> * fwprop.c (fwprop_bb_aux_d): New struct.
> (MEM_PLUS_REGS): New macro.
> (record_mem_plus_reg): New function.
> (record_mem_plus_regs): Likewise.
> (single_def_use_enter_block): Record mem-plus-reg patterns.
> (build_single_def_use_links): Allocate aux storage.
> (locally_poor_mem_replacement): New function.
> (forward_propagate_and_simplify): Call
> locally_poor_mem_replacement.
> (fwprop_init): Free storage.
> * tree.h (copy_ref_info): Expose existing function.
> * tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.
> * tree-ssa-reassoc.c (restructure_base_and_offset): New function.
> (restructure_mem_ref): Likewise.
> (reassociate_bb): Look for opportunities to call
> restructure_mem_ref; clean up immediate use lists.
>
> gcc/testsuite:
>
> PR rtl-optimization/46556
> * gcc.target/powerpc/ppc-pr46556-1.c: New testcase.
> * gcc.target/powerpc/ppc-pr46556-2.c: Likewise.
> * gcc.target/powerpc/ppc-pr46556-3.c: Likewise.
> * gcc.target/powerpc/ppc-pr46556-4.c: Likewise.
> * gcc.dg/tree-ssa/pr46556-1.c: Likewise.
> * gcc.dg/tree-ssa/pr46556-2.c: Likewise.
> * gcc.dg/tree-ssa/pr46556-3.c: Likewise.
> Index: gcc/tree-ssa-reassoc.c
> ===================================================================
> --- gcc/tree-ssa-reassoc.c (revision 179317)
> +++ gcc/tree-ssa-reassoc.c (working copy)
> @@ -2361,6 +2361,116 @@ break_up_subtract_bb (basic_block bb)
> break_up_subtract_bb (son);
> }
>
> +/* Given BASE and OFFSET derived from *EXPR occurring in stmt *GSI,
> + determine whether there is a profitable opportunity to restructure
> + address arithmetic within BASE and OFFSET. If so, build and
> + return the restructured MEM_REF; else return NULL_TREE. */
> +
> +static tree
> +restructure_base_and_offset (tree *expr, gimple_stmt_iterator *gsi,
> + tree base, tree offset, HOST_WIDE_INT bitpos)
> +{
> + tree mult_op0, mult_op1, t1, t2, offset_type, mult_expr, add_expr, mem_ref;
> + HOST_WIDE_INT c, c1, c2, c3, c4;
> +
> + /* Look for the following pattern:
> +
> + base = MEM_REF (T1, C1);
> + offset = MULT_EXPR (PLUS_EXPR (T2, C2), C3)
> + bitpos = C4 * BITS_PER_UNIT
> +
> + If found, convert it to:
> +
> + MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
> + C1 + (C2 * C3) + C4)
> + */
> + if (!base
> + || !offset
> + || TREE_CODE (base) != MEM_REF
> + || TREE_CODE (TREE_OPERAND (base, 1)) != INTEGER_CST
> + || TREE_CODE (offset) != MULT_EXPR)
> + return NULL_TREE;
> +
> + mult_op0 = TREE_OPERAND (offset, 0);
> + mult_op1 = TREE_OPERAND (offset, 1);
> +
> + if (TREE_CODE (mult_op0) != PLUS_EXPR
> + || TREE_CODE (mult_op1) != INTEGER_CST
> + || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
> + return NULL_TREE;
> +
> + t1 = TREE_OPERAND (base, 0);
> + t2 = TREE_OPERAND (mult_op0, 0);
> + c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
> + c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
> + c3 = TREE_INT_CST_LOW (mult_op1);
Before accessing TREE_INT_CST_LOW you need to make sure the
constants fit into a HWI using host_integerp () (which
conveniently includes the check for INTEGER_CST).
Note that you need to sign-extend the MEM_REF offset,
thus use mem_ref_offset (base).low instead of
TREE_INT_CST_LOW (TREE_OPERAND (base, 1)). It might be worth
adding a testcase with a negative offset ;)
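[The sign-extension hazard Richard points out can be shown with a minimal helper (an illustration for this archive; sign_extend_hwi is a made-up name, not GCC's mem_ref_offset): reading only the raw low word of a narrow negative offset yields a large positive constant unless the top bit is propagated.]

```c
#include <assert.h>

/* Sign-extend the low PREC bits of VAL to a full-width integer.
   Masking keeps the low PREC bits; XOR-ing with the sign bit and
   subtracting it propagates that bit through the upper bits.  */
long sign_extend_hwi (unsigned long val, unsigned int prec)
{
  unsigned long sign = 1ul << (prec - 1);
  val &= (sign << 1) - 1;            /* keep the low PREC bits
                                        (wraps to all-ones for prec == 64) */
  return (long) ((val ^ sign) - sign);
}
```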
> + c4 = bitpos / BITS_PER_UNIT;
> + c = c1 + c2 * c3 + c4;
And you don't know whether this operation overflows. Thus it's
probably easiest to use double_ints instead of HOST_WIDE_INTs
in all of the code.
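[The overflow Richard warns about in c1 + c2*c3 + c4 can be made concrete with checked arithmetic (an illustration only: double_int avoids the problem by computing in wider precision; the builtins below are GCC/Clang checked-arithmetic builtins, not what the patch should literally use):]

```c
#include <assert.h>
#include <stdbool.h>

/* Combine C1 + C2*C3 + C4, reporting failure instead of silently
   wrapping when any intermediate step overflows a long.  */
bool
combine_offsets (long c1, long c2, long c3, long c4, long *result)
{
  long t;
  if (__builtin_mul_overflow (c2, c3, &t))
    return false;
  if (__builtin_add_overflow (c1, t, &t))
    return false;
  if (__builtin_add_overflow (t, c4, result))
    return false;
  return true;
}
```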
> +
> + offset_type = TREE_TYPE (TREE_OPERAND (base, 1));
> +
> + mult_expr = fold_build2 (MULT_EXPR, sizetype, t2,
> + build_int_cst (sizetype, c3));
> + mult_expr = force_gimple_operand_gsi (gsi, mult_expr, true, NULL,
> + true, GSI_SAME_STMT);
> + add_expr = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (t1), t1, mult_expr);
> + add_expr = force_gimple_operand_gsi (gsi, add_expr, true, NULL,
> + true, GSI_SAME_STMT);
> + mem_ref = fold_build2 (MEM_REF, TREE_TYPE (*expr), add_expr,
> + build_int_cst (offset_type, c));
> + return mem_ref;
> +}
> +
> +/* If *EXPR contains a memory reference, determine whether there is an
> + opportunity to restructure the base and offset expressions for
> + improved performance. */
> +
> +static bool
> +restructure_mem_ref (tree *expr, gimple_stmt_iterator *gsi)
> +{
> + enum tree_code code = TREE_CODE (*expr);
> + tree base, offset, mem_ref;
> + HOST_WIDE_INT bitsize, bitpos;
> + enum machine_mode mode;
> + int unsignedp, volatilep;
> +
> + /* Only expressions that reference memory are of interest. Bitfield
> + references should eventually be lowered prior to this point and
> + are not dealt with. */
> + if (! handled_component_p (*expr)
> + || code == BIT_FIELD_REF
> + || (code == COMPONENT_REF && DECL_BIT_FIELD (TREE_OPERAND (*expr, 1))))
> + return false;
I'd say you want to handle references with a non-BLKmode only here,
thus || TYPE_MODE (TREE_TYPE (*expr)) == BLKmode
> + /* Determine whether the expression can be represented with base and
> + offset components. */
> + base = get_inner_reference (*expr, &bitsize, &bitpos, &offset, &mode,
> + &unsignedp, &volatilep, false);
> + if (!base || !offset)
> + return false;
> +
> + /* Look for a restructuring opportunity. */
> + if ((mem_ref = restructure_base_and_offset (expr, gsi, base,
> + offset, bitpos)) == NULL_TREE)
> + return false;
What I'm missing is a check whether the old address computation stmts
will be dead after the transform.
> + /* Found one. Record the replacement in the dump file and complete
> + the replacement. */
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "\nOriginal_expr = ");
> + print_generic_expr (dump_file, *expr, TDF_VOPS|TDF_MEMSYMS);
> + fprintf (dump_file, "\nmem_ref = ");
> + print_generic_expr (dump_file, mem_ref, TDF_VOPS|TDF_MEMSYMS);
> + fprintf (dump_file, "\n\n");
> + }
> +
> + copy_ref_info (mem_ref, *expr);
> + *expr = mem_ref;
> +
> + return true;
> +}
> +
> /* Reassociate expressions in basic block BB and its post-dominator as
> children. */
>
> @@ -2369,14 +2479,30 @@ reassociate_bb (basic_block bb)
> {
> gimple_stmt_iterator gsi;
> basic_block son;
> + bool chgd_mem_ref = false;
> + bool in_loop = loop_depth (bb->loop_father) != 0;
>
> for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi); gsi_prev (&gsi))
> {
> gimple stmt = gsi_stmt (gsi);
>
> - if (is_gimple_assign (stmt)
> - && !stmt_could_throw_p (stmt))
> + /* Look for restructuring opportunities within an expression
> + that references memory. We only do this for blocks not
> + contained in loops, since the ivopts machinery does a
> + good job on loop expressions, and we don't want to interfere
> + with other loop optimizations. */
> + if (!in_loop && gimple_vuse (stmt) && gimple_assign_single_p (stmt))
> {
> + tree *lhs, *rhs;
> + lhs = gimple_assign_lhs_ptr (stmt);
> + chgd_mem_ref = restructure_mem_ref (lhs, &gsi) || chgd_mem_ref;
> + rhs = gimple_assign_rhs1_ptr (stmt);
> + chgd_mem_ref = restructure_mem_ref (rhs, &gsi) || chgd_mem_ref;
It will either be a store or a load, but never both (unless it's an
aggregate copy, which I think we should not handle). So ...
if (gimple_vdef (stmt))
... lhs
else if (gimple_vuse (stmt))
... rhs
> + }
> +
> + else if (is_gimple_assign (stmt)
> + && !stmt_could_throw_p (stmt))
> + {
> tree lhs, rhs1, rhs2;
> enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
>
> @@ -2489,6 +2615,12 @@ reassociate_bb (basic_block bb)
> }
> }
> }
> + /* If memory references have been restructured, immediate uses need
> + to be cleaned up. */
> + if (chgd_mem_ref)
> + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> + update_stmt (gsi_stmt (gsi));
ICK. Definitely a no ;)
Why does an update_stmt () after the restructure_mem_ref call not work?
Richard.