[patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful

Richard Earnshaw (lists) Richard.Earnshaw@arm.com
Thu May 17 09:45:00 GMT 2018


On 16/05/18 09:37, Kyrill Tkachov wrote:
> 
> On 15/05/18 10:58, Richard Biener wrote:
>> On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com>
>> wrote:
>>
>>> Hi all,
>>> This is a respin of James's patch from:
>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>> The original patch was approved and committed but was later reverted
>>> because of failures on big-endian.
>>> This tweaked version fixes the big-endian failures in
>>> aarch64_expand_vector_init by picking the right element of VALS to
>>> move into the low part of the vector register depending on
>>> endianness.  The rest of the patch stays the same.  I'm looking for
>>> approval on the aarch64 parts, as they are the ones that have changed
>>> since the last approved version of the patch.
>>> -----------------------------------------------------------------------
>>> In the testcase in this patch we create an SLP vector with only two
>>> elements. Our current vector initialisation code will first duplicate
>>> the first element to both lanes, then overwrite the top lane with a new
>>> value.
>>> This duplication can be clunky and wasteful.
>>> Better would be to simply use the fact that we will always be
>>> overwriting the remaining bits, and move the first element to the
>>> correct place (implicitly zeroing all other bits).
>>> This reduces the code generated for this case, and can allow more
>>> efficient addressing modes and other second-order benefits for
>>> AArch64 code which has been vectorized to V2DI mode.
>>> Note that the change is generic enough to catch the case for any vector
>>> mode, but is expected to be most useful for 2x64-bit vectorization.
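>>>
>>> For example, initialising a V2DI vector from two X-registers previously
>>> emitted a splat of the first element followed by a lane overwrite;
>>> with this change the first element is moved straight into lane zero
>>> (an illustrative sketch of the codegen; exact registers and
>>> instruction selection will vary):
>>>
>>>     before:                        after:
>>>         dup   v0.2d, x0                fmov  d0, x0
>>>         ins   v0.d[1], x1              ins   v0.d[1], x1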
>>> Unfortunately, on its own, this would cause failures in
>>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>>> gcc.target/aarch64/store_v2vec_lanes.c, which expect to see a
>>> vec_merge and vec_duplicate for their simplifications to apply.  To
>>> fix this, add a special case to the AArch64 code if we are loading
>>> from two memory addresses, and use the load_pair_lanes patterns
>>> directly.
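>>>
>>> For instance, a function of this shape (a sketch using the GNU vector
>>> extension; the names are illustrative) can now be initialised with a
>>> single 128-bit load via load_pair_lanes<mode>, because the two loads
>>> are adjacent in memory:
>>>
>>>     typedef long long v2di __attribute__ ((vector_size (16)));
>>>
>>>     v2di
>>>     make_pair (long long *p)
>>>     {
>>>       /* p[0] and p[1] are adjacent, so this initialisation can become
>>>          one 128-bit load rather than a dup plus a lane insert.  */
>>>       return (v2di) { p[0], p[1] };
>>>     }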
>>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
>>> to catch:
>>>      (vec_merge:OUTER
>>>         (vec_duplicate:OUTER x:INNER)
>>>         (subreg:OUTER y:INNER 0)
>>>         (const_int N))
>>> And simplify it to:
>>>      (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>>> This is similar to the existing patterns which are tested in this
>>> function,
>>> without requiring the second operand to also be a vec_duplicate.
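>>>
>>> As a concrete (illustrative) instance for V2DI, with N == 1 selecting
>>> lane zero from the first operand:
>>>
>>>     (vec_merge:V2DI (vec_duplicate:V2DI (reg:DI x))
>>>                     (subreg:V2DI (reg:DI y) 0)
>>>                     (const_int 1))
>>>
>>> simplifies to
>>>
>>>     (vec_concat:V2DI (reg:DI x) (reg:DI y))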
>>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>>> aarch64-none-elf.
>>> Note that this requires
>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>> if we don't want to ICE creating broken vector zero extends.
>>> Are the non-AArch64 parts OK?
>> Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
>> you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
>> also doesn't handle the swapped operand case.
>>
>> Otherwise the middle-end parts looks ok.
> 
> I don't see any explicit canonicalisation code for it.
> I've updated the simplify-rtx part to handle the swapped operand case.
> Is the attached patch better in this regard? I couldn't think of a clean
> way to avoid
> duplicating some logic (beyond creating a new function away from the
> callsite).
> 
> Thanks,
> Kyrill
> 
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> James
>>> ---
>>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>>                Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>            * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
>>>            code generation for cases where splatting a value is not useful.
>>>            * simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
>>>            across a vec_duplicate and a paradoxical subreg forming a vector
>>>            mode to a vec_concat.
>>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>>            * gcc.target/aarch64/vect-slp-dup.c: New.
> 

I'm surprised we don't seem to have a function in the compiler that
performs this check:

+		  && rtx_equal_p (XEXP (x1, 0),
+				  plus_constant (Pmode,
+						 XEXP (x0, 0),
+						 GET_MODE_SIZE (inner_mode))))

without generating dead RTL (plus_constant will rarely be able to return
a subexpression of the original pattern).  I would have thought this
sort of test was not that uncommon.
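Something along these lines would express the test without building a
throwaway rtx (a hypothetical sketch only -- the name and corner-case
handling are made up, this is not an existing GCC function):

    /* Return true if ADDR2 equals ADDR1 plus OFFSET, without
       constructing the intermediate PLUS rtx that plus_constant
       would create.  */
    static bool
    address_plus_offset_equal_p (rtx addr1, rtx addr2, HOST_WIDE_INT offset)
    {
      /* Fold any constant term of either address into OFFSET.  */
      while (GET_CODE (addr1) == PLUS && CONST_INT_P (XEXP (addr1, 1)))
	{
	  offset += INTVAL (XEXP (addr1, 1));
	  addr1 = XEXP (addr1, 0);
	}
      while (GET_CODE (addr2) == PLUS && CONST_INT_P (XEXP (addr2, 1)))
	{
	  offset -= INTVAL (XEXP (addr2, 1));
	  addr2 = XEXP (addr2, 0);
	}
      /* The addresses are OFFSET apart iff the residual bases match
	 and the folded offset is now zero.  */
      return offset == 0 && rtx_equal_p (addr1, addr2);
    }

The call in the patch would then read something like
address_plus_offset_equal_p (XEXP (x0, 0), XEXP (x1, 0),
GET_MODE_SIZE (inner_mode)).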

However, I don't think that needs to hold up this patch.

OK.

R.
> 
> vec-splat.patch
> 
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index a2003fe52875f1653d644347bafd7773d1f01e91..6bf6c05535b61eef1021d46bcd8448fb3a0b25f4 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -13916,9 +13916,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>  	    maxv = matches[i][1];
>  	  }
>  
> -      /* Create a duplicate of the most common element.  */
> -      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> -      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> +      /* Create a duplicate of the most common element, unless all elements
> +	 are equally useless to us, in which case just immediately set the
> +	 vector register using the first element.  */
> +
> +      if (maxv == 1)
> +	{
> +	  /* For vectors of two 64-bit elements, we can do even better.  */
> +	  if (n_elts == 2
> +	      && (inner_mode == E_DImode
> +		  || inner_mode == E_DFmode))
> +
> +	    {
> +	      rtx x0 = XVECEXP (vals, 0, 0);
> +	      rtx x1 = XVECEXP (vals, 0, 1);
> +	      /* Combine can pick up this case, but handling it directly
> +		 here leaves clearer RTL.
> +
> +		 This is load_pair_lanes<mode>, and also gives us a clean-up
> +		 for store_pair_lanes<mode>.  */
> +	      if (memory_operand (x0, inner_mode)
> +		  && memory_operand (x1, inner_mode)
> +		  && !STRICT_ALIGNMENT
> +		  && rtx_equal_p (XEXP (x1, 0),
> +				  plus_constant (Pmode,
> +						 XEXP (x0, 0),
> +						 GET_MODE_SIZE (inner_mode))))
> +		{
> +		  rtx t;
> +		  if (inner_mode == DFmode)
> +		    t = gen_load_pair_lanesdf (target, x0, x1);
> +		  else
> +		    t = gen_load_pair_lanesdi (target, x0, x1);
> +		  emit_insn (t);
> +		  return;
> +		}
> +	    }
> +	  /* The subreg-move sequence below will move into lane zero of the
> +	     vector register.  For big-endian we want that position to hold
> +	     the last element of VALS.  */
> +	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> +	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +	}
> +      else
> +	{
> +	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> +	}
>  
>        /* Insert the rest.  */
>        for (int i = 0; i < n_elts; i++)
> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index e96c9d1b441fcfcbddcbffb0c1c7c0e2a871a2a3..d2714db7ae8ef946a6ce035bea59ddbec890e905 100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -5891,6 +5891,60 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>  		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>  	    }
>  
> +	  /* Replace:
> +
> +	      (vec_merge:outer (vec_duplicate:outer x:inner)
> +			       (subreg:outer y:inner 0)
> +			       (const_int N))
> +
> +	     with (vec_concat:outer x:inner y:inner) if N == 1,
> +	     or (vec_concat:outer y:inner x:inner) if N == 2.
> +
> +	     Implicitly, this means we have a paradoxical subreg, but such
> +	     a check is cheap, so make it anyway.
> +
> +	     Only applies for vectors of two elements.  */
> +	  if (GET_CODE (op0) == VEC_DUPLICATE
> +	      && GET_CODE (op1) == SUBREG
> +	      && GET_MODE (op1) == GET_MODE (op0)
> +	      && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
> +	      && paradoxical_subreg_p (op1)
> +	      && subreg_lowpart_p (op1)
> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
> +	      && IN_RANGE (sel, 1, 2))
> +	    {
> +	      rtx newop0 = XEXP (op0, 0);
> +	      rtx newop1 = SUBREG_REG (op1);
> +	      if (sel == 2)
> +		std::swap (newop0, newop1);
> +	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
> +	    }
> +
> +	  /* Same as above but with switched operands:
> +		Replace (vec_merge:outer (subreg:outer x:inner 0)
> +					 (vec_duplicate:outer y:inner)
> +			       (const_int N))
> +
> +	     with (vec_concat:outer x:inner y:inner) if N == 1,
> +	     or (vec_concat:outer y:inner x:inner) if N == 2.  */
> +	  if (GET_CODE (op1) == VEC_DUPLICATE
> +	      && GET_CODE (op0) == SUBREG
> +	      && GET_MODE (op0) == GET_MODE (op1)
> +	      && GET_MODE (SUBREG_REG (op0)) == GET_MODE (XEXP (op1, 0))
> +	      && paradoxical_subreg_p (op0)
> +	      && subreg_lowpart_p (op0)
> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
> +	      && IN_RANGE (sel, 1, 2))
> +	    {
> +	      rtx newop0 = SUBREG_REG (op0);
> +	      rtx newop1 = XEXP (op1, 0);
> +	      if (sel == 2)
> +		std::swap (newop0, newop1);
> +	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
> +	    }
> +
>  	  /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
>  				 (const_int n))
>  	     with (vec_concat x y) or (vec_concat y x) depending on value
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +
> +/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
> +
> +void bar (double);
> +
> +void
> +foo (double *restrict in, double *restrict in2,
> +     double *restrict out1, double *restrict out2)
> +{
> +  for (int i = 0; i < 1024; i++)
> +    {
> +      out1[i] = in[i] + 2.0 * in[i+128];
> +      out1[i+1] = in[i+1] + 2.0 * in2[i];
> +      bar (in[i]);
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
> +
> 


