This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] Fix PR46556 (poor address generation)
- From: Richard Guenther <rguenther at suse dot de>
- To: "William J. Schmidt" <wschmidt at linux dot vnet dot ibm dot com>
- Cc: gcc-patches at gcc dot gnu dot org, bergner at vnet dot ibm dot com, rguenth at gcc dot gnu dot org
- Date: Thu, 6 Oct 2011 12:13:08 +0200 (CEST)
- Subject: Re: [PATCH] Fix PR46556 (poor address generation)
- References: <1317831212.4199.45.camel@oc2474580526.ibm.com>
On Wed, 5 Oct 2011, William J. Schmidt wrote:
> This patch addresses the poor code generation in PR46556 for the
> following code:
>
> struct x
> {
> int a[16];
> int b[16];
> int c[16];
> };
>
> extern void foo (int, int, int);
>
> void
> f (struct x *p, unsigned int n)
> {
> foo (p->a[n], p->c[n], p->b[n]);
> }
>
> Prior to the fix for PR32698, gcc calculated the offset for accessing
> the array elements as: n*4; 64+n*4; 128+n*4.
>
> Following that fix, the offsets are calculated as: n*4; (n+16)*4; (n
> +32)*4. This led to poor code generation on powerpc64 targets, among
> others.
>
> The poor code generation was observed to not occur in loops, as the
> IVOPTS code does a good job of lowering these expressions to MEM_REFs.
> It was previously suggested that perhaps a general pass to lower memory
> accesses to MEM_REFs in GIMPLE would solve not only this, but other
> similar problems. I spent some time looking into various approaches to
> this, and reviewing some previous attempts to do similar things. In the
> end, I've concluded that this is a bad idea in practice because of the
> loss of useful aliasing information. In particular, early lowering of
> component references causes us to lose the ability to disambiguate
> non-overlapping references in the same structure, and there is no simple
> way to carry the necessary aliasing information along with the
> replacement MEM_REFs to avoid this. While some performance gains are
> available with GIMPLE lowering of memory accesses, there are also
> offsetting performance losses, and I suspect this would just be a
> continuous source of bug reports into the future.
>
> Therefore the current patch is a much simpler approach to solve the
> specific problem noted in the PR. There are two pieces to the patch:
>
> * The offending addressing pattern is matched in GIMPLE and transformed
> into a restructured MEM_REF that distributes the multiply, so that (n
> +32)*4 becomes 4*n+128 as before. This is done during the reassociation
> pass, for reasons described below. The transformation only occurs in
> non-loop blocks, since IVOPTS does a good job on such things within
> loops.
> * A tweak is added to the RTL forward-propagator to avoid propagating
> into memory references based on a single base register with no offset,
> under certain circumstances. This improves sharing of base registers
> for accesses within the same structure and slightly lowers register
> pressure.
>
> It would be possible to separate these into two patches if that's
> preferred. I chose to combine them because together they provide the
> ideal code generation that the new test cases test for.
>
> I initially implemented the pattern matcher during expand, but I found
> that the expanded code for two accesses to the same structure was often
> not being CSEd well. So I moved it back into the GIMPLE phases prior to
> DOM to take advantage of its CSE. To avoid an additional complete pass
> over the IL, I chose to piggyback on the reassociation pass. This
> transformation is not technically a reassociation, but it is related
> enough to not be a complete wart.
>
> One noob question about this: It would probably be preferable to have
> this transformation only take place during the second reassociation
> pass, so the ARRAY_REFs are seen by earlier optimization phases. Is
> there an easy way to detect that it's the second pass without having to
> generate a separate pass entry point?
>
> One other general question about the pattern-match transformation: Is
> this an appropriate transformation for all targets, or should it be
> somehow gated on available addressing modes on the target processor?
>
> Bootstrapped and regression tested on powerpc64-linux-gnu. Verified no
> performance degradations on that target for SPEC CPU2000 and CPU2006.
>
> I'm looking for eventual approval for trunk after any comments are
> resolved. Thanks!
People have already commented on pieces, so I'm looking only
at the tree-ssa-reassoc.c pieces (did you consider piggy-backing
on IVOPTs instead?). The idea is to expose additional CSE
opportunities, right? So it's sort-of a strength-reduction
optimization on scalar code (classically, strength reduction
in loops transforms for (i) { z = i*x; } into z = 0; for (i) { z += x; }).
That might be worthwhile in general, even for non-address cases.
So - if you rename that thing to tree-ssa-strength-reduce.c you
can get away without piggy-backing on anything ;) If you
structure it to detect a strength-reduction opportunity
(thus, you'd need to match two or more of the patterns at the same time)
that would be a bonus ... generalizing it a little bit would be
another.
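[Richard's classical strength-reduction example can be written out as a standalone C sketch (an editorial illustration for this archive, not GCC code); the function names are made up for the example:]

```c
#include <assert.h>

/* Naive form: a multiply on every iteration.  */
long sum_with_multiply (long x, int n)
{
  long total = 0;
  for (int i = 0; i < n; i++)
    total += i * x;          /* z = i*x recomputed each time */
  return total;
}

/* Strength-reduced form as Richard sketches it: z starts at 0
   and a single add per iteration replaces the multiply.  */
long sum_strength_reduced (long x, int n)
{
  long total = 0, z = 0;
  for (int i = 0; i < n; i++)
    {
      total += z;            /* z tracks i*x */
      z += x;                /* one add instead of one multiply */
    }
  return total;
}
```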
Now some comments on the patch ...
> Bill
>
>
> 2011-10-05 Bill Schmidt <wschmidt@linux.vnet.ibm.com>
>
> gcc:
>
> PR rtl-optimization/46556
> * fwprop.c (fwprop_bb_aux_d): New struct.
> (MEM_PLUS_REGS): New macro.
> (record_mem_plus_reg): New function.
> (record_mem_plus_regs): Likewise.
> (single_def_use_enter_block): Record mem-plus-reg patterns.
> (build_single_def_use_links): Allocate aux storage.
> (locally_poor_mem_replacement): New function.
> (forward_propagate_and_simplify): Call
> locally_poor_mem_replacement.
> (fwprop_init): Free storage.
> * tree.h (copy_ref_info): Expose existing function.
> * tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.
> * tree-ssa-reassoc.c (restructure_base_and_offset): New function.
> (restructure_mem_ref): Likewise.
> (reassociate_bb): Look for opportunities to call
> restructure_mem_ref; clean up immediate use lists.
>
> gcc/testsuite:
>
> PR rtl-optimization/46556
> * gcc.target/powerpc/ppc-pr46556-1.c: New testcase.
> * gcc.target/powerpc/ppc-pr46556-2.c: Likewise.
> * gcc.target/powerpc/ppc-pr46556-3.c: Likewise.
> * gcc.target/powerpc/ppc-pr46556-4.c: Likewise.
> * gcc.dg/tree-ssa/pr46556-1.c: Likewise.
> * gcc.dg/tree-ssa/pr46556-2.c: Likewise.
> * gcc.dg/tree-ssa/pr46556-3.c: Likewise.
> Index: gcc/tree-ssa-reassoc.c
> ===================================================================
> --- gcc/tree-ssa-reassoc.c (revision 179317)
> +++ gcc/tree-ssa-reassoc.c (working copy)
> @@ -2361,6 +2361,116 @@ break_up_subtract_bb (basic_block bb)
> break_up_subtract_bb (son);
> }
>
> +/* Given BASE and OFFSET derived from *EXPR occurring in stmt *GSI,
> + determine whether there is a profitable opportunity to restructure
> + address arithmetic within BASE and OFFSET. If so, build and
> + return the restructured MEM_REF; else return NULL_TREE. */
> +
> +static tree
> +restructure_base_and_offset (tree *expr, gimple_stmt_iterator *gsi,
> + tree base, tree offset, HOST_WIDE_INT bitpos)
> +{
> + tree mult_op0, mult_op1, t1, t2, offset_type, mult_expr, add_expr, mem_ref;
> + HOST_WIDE_INT c, c1, c2, c3, c4;
> +
> + /* Look for the following pattern:
> +
> + base = MEM_REF (T1, C1);
> + offset = MULT_EXPR (PLUS_EXPR (T2, C2), C3)
> + bitpos = C4 * BITS_PER_UNIT
> +
> + If found, convert it to:
> +
> + MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
> + C1 + (C2 * C3) + C4)
> + */
> + if (!base
> + || !offset
> + || TREE_CODE (base) != MEM_REF
> + || TREE_CODE (TREE_OPERAND (base, 1)) != INTEGER_CST
> + || TREE_CODE (offset) != MULT_EXPR)
> + return NULL_TREE;
> +
> + mult_op0 = TREE_OPERAND (offset, 0);
> + mult_op1 = TREE_OPERAND (offset, 1);
> +
> + if (TREE_CODE (mult_op0) != PLUS_EXPR
> + || TREE_CODE (mult_op1) != INTEGER_CST
> + || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
> + return NULL_TREE;
> +
> + t1 = TREE_OPERAND (base, 0);
> + t2 = TREE_OPERAND (mult_op0, 0);
> + c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
> + c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
> + c3 = TREE_INT_CST_LOW (mult_op1);
Before accessing TREE_INT_CST_LOW you need to make sure the
constants fit into a HWI using host_integerp () (which
conveniently includes the check for INTEGER_CST).
Note that you need to sign-extend the MEM_REF offset,
thus use mem_ref_offset (base).low instead of
TREE_INT_CST_LOW (TREE_OPERAND (base, 1)). It might be worth
adding a testcase with a negative offset ;)
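[The sign-extension hazard Richard points out can be shown with a minimal helper (an illustration for this archive; sign_extend_hwi is a made-up name, not GCC's mem_ref_offset): reading only the raw low word of a narrow negative offset yields a large positive constant unless the top bit is propagated.]

```c
#include <assert.h>

/* Sign-extend the low PREC bits of VAL to a full-width integer.
   Masking keeps the low PREC bits; XOR-ing with the sign bit and
   subtracting it propagates that bit through the upper bits.  */
long sign_extend_hwi (unsigned long val, unsigned int prec)
{
  unsigned long sign = 1ul << (prec - 1);
  val &= (sign << 1) - 1;            /* keep the low PREC bits
                                        (wraps to all-ones for prec == 64) */
  return (long) ((val ^ sign) - sign);
}
```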
> + c4 = bitpos / BITS_PER_UNIT;
> + c = c1 + c2 * c3 + c4;
And you don't know whether this operation overflows. Thus it's
probably easiest to use double_ints instead of HOST_WIDE_INTs
in all of the code.
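[The overflow Richard warns about in c1 + c2*c3 + c4 can be made concrete with checked arithmetic (an illustration only: double_int avoids the problem by computing in wider precision; the builtins below are GCC/Clang checked-arithmetic builtins, not what the patch should literally use):]

```c
#include <assert.h>
#include <stdbool.h>

/* Combine C1 + C2*C3 + C4, reporting failure instead of silently
   wrapping when any intermediate step overflows a long.  */
bool
combine_offsets (long c1, long c2, long c3, long c4, long *result)
{
  long t;
  if (__builtin_mul_overflow (c2, c3, &t))
    return false;
  if (__builtin_add_overflow (c1, t, &t))
    return false;
  if (__builtin_add_overflow (t, c4, result))
    return false;
  return true;
}
```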
> +
> + offset_type = TREE_TYPE (TREE_OPERAND (base, 1));
> +
> + mult_expr = fold_build2 (MULT_EXPR, sizetype, t2,
> + build_int_cst (sizetype, c3));
> + mult_expr = force_gimple_operand_gsi (gsi, mult_expr, true, NULL,
> + true, GSI_SAME_STMT);
> + add_expr = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (t1), t1, mult_expr);
> + add_expr = force_gimple_operand_gsi (gsi, add_expr, true, NULL,
> + true, GSI_SAME_STMT);
> + mem_ref = fold_build2 (MEM_REF, TREE_TYPE (*expr), add_expr,
> + build_int_cst (offset_type, c));
> + return mem_ref;
> +}
> +
> +/* If *EXPR contains a memory reference, determine whether there is an
> + opportunity to restructure the base and offset expressions for
> + improved performance. */
> +
> +static bool
> +restructure_mem_ref (tree *expr, gimple_stmt_iterator *gsi)
> +{
> + enum tree_code code = TREE_CODE (*expr);
> + tree base, offset, mem_ref;
> + HOST_WIDE_INT bitsize, bitpos;
> + enum machine_mode mode;
> + int unsignedp, volatilep;
> +
> + /* Only expressions that reference memory are of interest. Bitfield
> + references should eventually be lowered prior to this point and
> + are not dealt with. */
> + if (! handled_component_p (*expr)
> + || code == BIT_FIELD_REF
> + || (code == COMPONENT_REF && DECL_BIT_FIELD (TREE_OPERAND (*expr, 1))))
> + return false;
I'd say you want to handle references with a non-BLKmode only here,
thus || TYPE_MODE (TREE_TYPE (*expr)) == BLKmode
> + /* Determine whether the expression can be represented with base and
> + offset components. */
> + base = get_inner_reference (*expr, &bitsize, &bitpos, &offset, &mode,
> + &unsignedp, &volatilep, false);
> + if (!base || !offset)
> + return false;
> +
> + /* Look for a restructuring opportunity. */
> + if ((mem_ref = restructure_base_and_offset (expr, gsi, base,
> + offset, bitpos)) == NULL_TREE)
> + return false;
What I'm missing is a check whether the old address computation stmts
will be dead after the transform.
> + /* Found one. Record the replacement in the dump file and complete
> + the replacement. */
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "\nOriginal_expr = ");
> + print_generic_expr (dump_file, *expr, TDF_VOPS|TDF_MEMSYMS);
> + fprintf (dump_file, "\nmem_ref = ");
> + print_generic_expr (dump_file, mem_ref, TDF_VOPS|TDF_MEMSYMS);
> + fprintf (dump_file, "\n\n");
> + }
> +
> + copy_ref_info (mem_ref, *expr);
> + *expr = mem_ref;
> +
> + return true;
> +}
> +
> /* Reassociate expressions in basic block BB and its post-dominator as
> children. */
>
> @@ -2369,14 +2479,30 @@ reassociate_bb (basic_block bb)
> {
> gimple_stmt_iterator gsi;
> basic_block son;
> + bool chgd_mem_ref = false;
> + bool in_loop = loop_depth (bb->loop_father) != 0;
>
> for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi); gsi_prev (&gsi))
> {
> gimple stmt = gsi_stmt (gsi);
>
> - if (is_gimple_assign (stmt)
> - && !stmt_could_throw_p (stmt))
> + /* Look for restructuring opportunities within an expression
> + that references memory. We only do this for blocks not
> + contained in loops, since the ivopts machinery does a
> + good job on loop expressions, and we don't want to interfere
> + with other loop optimizations. */
> + if (!in_loop && gimple_vuse (stmt) && gimple_assign_single_p (stmt))
> {
> + tree *lhs, *rhs;
> + lhs = gimple_assign_lhs_ptr (stmt);
> + chgd_mem_ref = restructure_mem_ref (lhs, &gsi) || chgd_mem_ref;
> + rhs = gimple_assign_rhs1_ptr (stmt);
> + chgd_mem_ref = restructure_mem_ref (rhs, &gsi) || chgd_mem_ref;
It will either be a store or a load, but never both (unless it's an
aggregate copy, which I think we should not handle). So ...
if (gimple_vdef (stmt))
... lhs
else if (gimple_vuse (stmt))
... rhs
> + }
> +
> + else if (is_gimple_assign (stmt)
> + && !stmt_could_throw_p (stmt))
> + {
> tree lhs, rhs1, rhs2;
> enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
>
> @@ -2489,6 +2615,12 @@ reassociate_bb (basic_block bb)
> }
> }
> }
> + /* If memory references have been restructured, immediate uses need
> + to be cleaned up. */
> + if (chgd_mem_ref)
> + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> + update_stmt (gsi_stmt (gsi));
ICK. Definitely a no ;)
Why does an update_stmt () after the restructure_mem_ref call not work?
Richard.