[PATCH] Improve { x, x + 3, x + 6, x + 9 } expansion (take 2)

Richard Biener <rguenther@suse.de>
Thu Nov 21 13:29:00 GMT 2013


On Thu, 21 Nov 2013, Jakub Jelinek wrote:

> On Thu, Nov 21, 2013 at 07:43:35AM +1000, Richard Henderson wrote:
> > On 11/20/2013 07:44 PM, Jakub Jelinek wrote:
> > > On Wed, Nov 20, 2013 at 10:31:38AM +0100, Richard Biener wrote:
> > >> Aww ;)  Nice improvement.  Generally when I see this I always wonder
> > >> whether we want to do this kind of stuff pre RTL expansion.
> > >> 1st to not rely on being able to TER, 2nd to eventually
> > >> get rid of TER.
> > >>
> > >> These patches are unfortunately a step backward for #2.
> > >>
> > >> As for the patch, do we have a way to query whether the target
> > >> can efficiently broadcast?  If so this IMHO belongs in generic
> > > 
> > > We don't.  Perhaps if we'd add an optab for vec_dup<mode> and mentioned
> > > clearly in the documentation that it should be used only if it is reasonably
> > > efficient.  But still, even with an optab, it would probably be better to do
> > > it in the veclower* passes than in the vectorizer itself.
> > 
> > I think we can assume that broadcast is relatively efficient, whether or not
> > vec_dup is present.  I'd lean to making the transformation generic to start
> > with, so that you don't need extra handling in the i386 backend.
> 
> Ok, here is a generic veclower implementation without looking at any optabs;
> so far it only handles PLUS_EXPR.  What operation other than MULT_EXPR would
> make sense here?  Though handling MULT_EXPR as well would complicate the code
> slightly (it would need to handle, say:
>   _2 = _1(D) + 1;
>   _3 = _2 + 2;
>   _4 = _3 * 2;
>   _5 = _4 * 3;
>   _6 = { _3, _4, _5, _4 };
> where we could at first think the operation is PLUS_EXPR, but it is
> actually MULT_EXPR with _3 as the base).  Also, for MULT_EXPR, presumably
> we could handle elements that are constant 0, like in:
>   _2 = _1(D) * 5;
>   _3 = _2 * 2;
>   _4 = _1(D) * 10;
>   _5 = { _3, 0, _4, _2, _1(D), 0, _4, _2 };
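
A hedged aside (not part of the patch or this thread): at the source level,
using GCC's generic vector extensions, that constant-0 MULT_EXPR constructor
amounts to multiplying a splat of _1 by a constant vector, with multiplier 0
for the constant-0 elements.  A minimal sketch (the typedef and function
names are purely illustrative):

  typedef int v8si __attribute__ ((vector_size (8 * sizeof (int))));

  /* Piecewise form corresponding to the GIMPLE example above.  */
  v8si
  build_piecewise (int x)
  {
    int x5 = x * 5;     /* _2 = _1(D) * 5   */
    int x10 = x5 * 2;   /* _3 = _2 * 2      */
    int x10b = x * 10;  /* _4 = _1(D) * 10  */
    return (v8si) { x10, 0, x10b, x5, x, 0, x10b, x5 };
  }

  /* Equivalent splat + constant-vector multiply.  */
  v8si
  build_splat_mult (int x)
  {
    v8si splat = { x, x, x, x, x, x, x, x };
    return splat * (v8si) { 10, 0, 10, 5, 1, 0, 10, 5 };
  }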
> 
> Bootstrap/regtest pending; is this ok at least as a start, to be improved
> later on?

Ok, this should catch most of the vectorizer cases.

Zero could also be handled for PLUS_EXPR, likewise one for MULT_EXPR.
I think for induction it's common to have { base, base + 1, base + 2, ... }.
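
A minimal sketch of that induction shape (again using GCC vector extensions;
not taken from the patch, and the names are illustrative): the base element
itself gets addend 0 in the constant vector, which matches the build_zero_cst
initialization in the patch below:

  typedef int v4si __attribute__ ((vector_size (4 * sizeof (int))));

  /* Piecewise induction initializer as the vectorizer builds it.  */
  v4si
  induction_piecewise (int base)
  {
    return (v4si) { base, base + 1, base + 2, base + 3 };
  }

  /* Equivalent splat + constant-vector add; note the 0 addend for the
     base element itself.  */
  v4si
  induction_splat (int base)
  {
    v4si splat = { base, base, base, base };
    return splat + (v4si) { 0, 1, 2, 3 };
  }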

More comments below.

> 2013-11-21  Jakub Jelinek  <jakub@redhat.com>
> 
> 	* tree-vect-generic.c (optimize_vector_constructor): New function.
> 	(expand_vector_operations_1): Call it.
> 
> 	* gcc.dg/vect/vect-124.c: New test.
> 
> --- gcc/tree-vect-generic.c.jj	2013-11-19 21:56:40.000000000 +0100
> +++ gcc/tree-vect-generic.c	2013-11-21 11:17:55.146118161 +0100
> @@ -988,6 +988,84 @@ expand_vector_operation (gimple_stmt_ite
>  				    gimple_assign_rhs1 (assign),
>  				    gimple_assign_rhs2 (assign), code);
>  }
> +
> +/* Try to optimize
> +   a_5 = { b_7, b_7 + 3, b_7 + 6, b_7 + 9 };
> +   style stmts into:
> +   _9 = { b_7, b_7, b_7, b_7 };
> +   a_5 = _9 + { 0, 3, 6, 9 };
> +   because vector splat operation is usually more efficient
> +   than piecewise initialization of the vector.  */
> +
> +static void
> +optimize_vector_constructor (gimple_stmt_iterator *gsi)
> +{
> +  gimple stmt = gsi_stmt (*gsi);
> +  tree lhs = gimple_assign_lhs (stmt);
> +  tree rhs = gimple_assign_rhs1 (stmt);
> +  tree type = TREE_TYPE (rhs);
> +  unsigned int i, j, nelts = TYPE_VECTOR_SUBPARTS (type);
> +  bool all_same = true;
> +  constructor_elt *elt;
> +  tree *cst;
> +  gimple g;
> +  tree base = NULL_TREE;
> +
> +  if (nelts <= 2 || CONSTRUCTOR_NELTS (rhs) != nelts)
> +    return;
> +  FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (rhs), i, elt)
> +    if (TREE_CODE (elt->value) != SSA_NAME
> +	|| TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> +      return;
> +    else
> +      {
> +	tree this_base = elt->value;
> +	if (this_base != CONSTRUCTOR_ELT (rhs, 0)->value)
> +	  all_same = false;
> +	for (j = 0; j < nelts + 1; j++)
> +	  {
> +	    g = SSA_NAME_DEF_STMT (this_base);
> +	    if (is_gimple_assign (g)
> +		&& gimple_assign_rhs_code (g) == PLUS_EXPR
> +		&& TREE_CODE (gimple_assign_rhs2 (g)) == INTEGER_CST
> +		&& TREE_CODE (gimple_assign_rhs1 (g)) == SSA_NAME
> +		&& !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (gimple_assign_rhs1 (g)))
> +	      this_base = gimple_assign_rhs1 (g);
> +	    else
> +	      break;
> +	  }

Why loop here?  Do you want to catch base + 1 + 2?  I think that's
hiding a missed optimization elsewhere for no good reason.
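
A hedged sketch of the simpler, loop-free check this suggests (not the
committed code; it is a fragment that just reuses the names and calls from
the patch above):

  /* Look at a single PLUS_EXPR def statement per element instead of
     walking a chain of them; chained constant additions should have
     been folded by earlier passes already.  */
  tree this_base = elt->value;
  gimple g = SSA_NAME_DEF_STMT (this_base);
  if (is_gimple_assign (g)
      && gimple_assign_rhs_code (g) == PLUS_EXPR
      && TREE_CODE (gimple_assign_rhs2 (g)) == INTEGER_CST
      && TREE_CODE (gimple_assign_rhs1 (g)) == SSA_NAME
      && !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (gimple_assign_rhs1 (g)))
    this_base = gimple_assign_rhs1 (g);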

> +	if (i == 0)
> +	  base = this_base;
> +	else if (this_base != base)
> +	  return;
> +      }
> +  if (all_same)
> +    return;
> +  cst = XALLOCAVEC (tree, nelts);
> +  for (i = 0; i < nelts; i++)
> +    {
> +      tree this_base = CONSTRUCTOR_ELT (rhs, i)->value;
> +      cst[i] = build_zero_cst (TREE_TYPE (base));
> +      while (this_base != base)
> +	{
> +	  g = SSA_NAME_DEF_STMT (this_base);
> +	  cst[i] = fold_binary (PLUS_EXPR, TREE_TYPE (base),
> +				cst[i], gimple_assign_rhs2 (g));
> +	  if (cst[i] == NULL_TREE
> +	      || TREE_CODE (cst[i]) != INTEGER_CST
> +	      || TREE_OVERFLOW (cst[i]))
> +	    return;
> +	  this_base = gimple_assign_rhs1 (g);
> +	}

Likewise.

I'd be more comfortable with leaving this magic out.

Ok with that change.

Thanks,
Richard.

> +    }
> +  for (i = 0; i < nelts; i++)
> +    CONSTRUCTOR_ELT (rhs, i)->value = base;
> +  g = gimple_build_assign (make_ssa_name (type, NULL), rhs);
> +  gsi_insert_before (gsi, g, GSI_SAME_STMT);
> +  g = gimple_build_assign_with_ops (PLUS_EXPR, lhs, gimple_assign_lhs (g),
> +				    build_vector (type, cst));
> +  gsi_replace (gsi, g, false);
> +}
>  
>  /* Return a type for the widest vector mode whose components are of type
>     TYPE, or NULL_TREE if none is found.  */
> @@ -1278,6 +1356,17 @@ expand_vector_operations_1 (gimple_stmt_
>        expand_vector_condition (gsi);
>        return;
>      }
> +
> +  if (code == CONSTRUCTOR
> +      && TREE_CODE (lhs) == SSA_NAME
> +      && VECTOR_MODE_P (TYPE_MODE (TREE_TYPE (lhs)))
> +      && !gimple_clobber_p (stmt)
> +      && optimize)
> +    {
> +      optimize_vector_constructor (gsi);
> +      return;
> +    }
> +
>    if (rhs_class != GIMPLE_UNARY_RHS && rhs_class != GIMPLE_BINARY_RHS)
>      return;
>  
> --- gcc/testsuite/gcc.dg/vect/vect-124.c.jj	2013-11-20 09:27:35.760798649 +0100
> +++ gcc/testsuite/gcc.dg/vect/vect-124.c	2013-11-20 09:29:32.219199794 +0100
> @@ -0,0 +1,28 @@
> +#include "tree-vect.h"
> +
> +#ifndef N
> +#define N 64
> +#endif
> +
> +int a[N];
> +
> +__attribute__((noinline, noclone)) void
> +foo (int x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++, x += 3)
> +    a[i] = x;
> +}
> +
> +int
> +main ()
> +{
> +  int i;
> +  
> +  check_vect ();
> +  foo (6);
> +  for (i = 0; i < N; i++)
> +    if (a[i] != i * 3 + 6)
> +      abort ();
> +  return 0;
> +}
> 
> 
> 	Jakub
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer


