[SVE] PR89007 - Implement generic vector average expansion

Richard Sandiford richard.sandiford@arm.com
Wed Nov 20 11:10:00 GMT 2019


Hi,

Thanks for doing this.  Adding Richard on cc:, since the SVE subject
tag might have put him off.  There's not really anything SVE-specific
here apart from the testcase.

> 2019-11-19  Prathamesh Kulkarni  <prathamesh.kulkarni@linaro.org>
>
> 	PR tree-optimization/89007
> 	* tree-vect-patterns.c (vect_recog_average_pattern): If there is no
> 	target support available, generate code to distribute rshift over plus
> 	and add one depending upon floor or ceil rounding.
>
> testsuite/
> 	* gcc.target/aarch64/sve/pr89007-1.c: New test.
> 	* gcc.target/aarch64/sve/pr89007-2.c: New test.
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
> new file mode 100644
> index 00000000000..32095c63c61
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
> @@ -0,0 +1,29 @@
> +/* { dg-do assemble { target aarch64_asm_sve_ok } } */
> +/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve --save-temps" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#define N 1024
> +unsigned char dst[N];
> +unsigned char in1[N];
> +unsigned char in2[N];
> +
> +/*
> +**foo:
> +**	...
> +**	lsr	(z[0-9]+\.b), z[0-9]+\.b, #1
> +**	lsr	(z[0-9]+\.b), z[0-9]+\.b, #1
> +**	add	(z[0-9]+\.b), \1, \2
> +**	orr	(z[0-9]+)\.d, z[0-9]+\.d, z[0-9]+\.d
> +**	and	(z[0-9]+\.b), \4\.b, #0x1
> +**	add	z0\.b, \3, \5

It'd probably be more future-proof to allow (\1, \2|\2, \1) and
(\3, \5|\5, \3), since nothing guarantees which order the add operands
end up in.  Same for the other testcase.
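
Note that the extra capture groups shift the later backreference
numbers along, so the tail end would become something like:

  **	add	(z[0-9]+\.b), (\1, \2|\2, \1)
  **	orr	(z[0-9]+)\.d, z[0-9]+\.d, z[0-9]+\.d
  **	and	(z[0-9]+\.b), \5\.b, #0x1
  **	add	z0\.b, (\3, \6|\6, \3)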

> +**	...
> +*/
> +void
> +foo ()
> +{
> +  for (int x = 0; x < N; x++)
> +    dst[x] = (in1[x] + in2[x] + 1) >> 1;
> +}
> +
> +/* { dg-final { scan-assembler-not {\tuunpklo\t} } } */
> +/* { dg-final { scan-assembler-not {\tuunpkhi\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
> new file mode 100644
> index 00000000000..cc40f45046b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
> @@ -0,0 +1,29 @@
> +/* { dg-do assemble { target aarch64_asm_sve_ok } } */
> +/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve --save-temps" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#define N 1024
> +unsigned char dst[N];
> +unsigned char in1[N];
> +unsigned char in2[N];
> +
> +/*
> +**foo:
> +**	...
> +**	lsr	(z[0-9]+\.b), z[0-9]+\.b, #1
> +**	lsr	(z[0-9]+\.b), z[0-9]+\.b, #1
> +**	add	(z[0-9]+\.b), \1, \2
> +**	and	(z[0-9]+)\.d, z[0-9]+\.d, z[0-9]+\.d
> +**	and	(z[0-9]+\.b), \4\.b, #0x1
> +**	add	z0\.b, \3, \5
> +**	...
> +*/
> +void
> +foo ()
> +{
> +  for (int x = 0; x < N; x++)
> +    dst[x] = (in1[x] + in2[x]) >> 1;
> +}
> +
> +/* { dg-final { scan-assembler-not {\tuunpklo\t} } } */
> +/* { dg-final { scan-assembler-not {\tuunpkhi\t} } } */
> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> index 8ebbcd76b64..7025a3b4dc2 100644
> --- a/gcc/tree-vect-patterns.c
> +++ b/gcc/tree-vect-patterns.c
> @@ -2019,22 +2019,59 @@ vect_recog_average_pattern (stmt_vec_info last_stmt_info, tree *type_out)
>
>    /* Check for target support.  */
>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
> -  if (!new_vectype
> -      || !direct_internal_fn_supported_p (ifn, new_vectype,
> -					  OPTIMIZE_FOR_SPEED))
> +
> +  if (!new_vectype)
>      return NULL;
> 
> +  bool ifn_supported
> +    = direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED);
> +
>    /* The IR requires a valid vector type for the cast result, even though
>       it's likely to be discarded.  */
>    *type_out = get_vectype_for_scalar_type (vinfo, type);
>    if (!*type_out)
>      return NULL;
>
> -  /* Generate the IFN_AVG* call.  */
>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>    tree new_ops[2];
>    vect_convert_inputs (last_stmt_info, 2, new_ops, new_type,
>  		       unprom, new_vectype);
> +
> +  if (!ifn_supported)

I guess this is personal preference, but I'm not sure there's much
benefit to splitting this out of the "if" into a separate variable.
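
I.e. just use:

  if (!direct_internal_fn_supported_p (ifn, new_vectype,
				       OPTIMIZE_FOR_SPEED))

directly as the condition of the fallback "if" below.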

> +    {
> +      /* If there is no target support available, generate code
> +	 to distribute rshift over plus and add one depending
> +	 upon floor or ceil rounding.  */

Maybe "and add a carry"?  It'd be good to write out the expansion in
pseudo-code too.
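
E.g. something like:

  /* As a fallback, distribute the shift over the addition and add a
     carry for the discarded low bits:

       floor: (op0 + op1) >> 1      ->  (op0 >> 1) + (op1 >> 1) + ((op0 & op1) & 1)
       ceil:  (op0 + op1 + 1) >> 1  ->  (op0 >> 1) + (op1 >> 1) + ((op0 | op1) & 1)

     (op0/op1 standing for the two converted inputs).  */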

The transform is only valid on unsigned types.  We still need to
bail out for signed types because the carry would act in the wrong
direction for negative results.  E.g.:

  (-3 + -1) >> 1 == -2
  (-3 >> 1) == -1,  (-1 >> 1) == 0,  ((-3 & -1) & 1) == 1,  total == 0

Handling that case could be future work.

> +      tree one_cst = build_one_cst (new_type);
> +
> +      tree tmp1 = vect_recog_temp_ssa_var (new_type, NULL);
> +      gassign *g1 = gimple_build_assign (tmp1, RSHIFT_EXPR, new_ops[0], one_cst);
> +
> +      tree tmp2 = vect_recog_temp_ssa_var (new_type, NULL);
> +      gassign *g2 = gimple_build_assign (tmp2, RSHIFT_EXPR, new_ops[1], one_cst);
> +
> +      tree tmp3 = vect_recog_temp_ssa_var (new_type, NULL);
> +      gassign *g3 = gimple_build_assign (tmp3, PLUS_EXPR, tmp1, tmp2);
> +
> +      tree tmp4 = vect_recog_temp_ssa_var (new_type, NULL);
> +      tree_code c = (ifn == IFN_AVG_CEIL) ? BIT_IOR_EXPR : BIT_AND_EXPR;
> +      gassign *g4 = gimple_build_assign (tmp4, c, new_ops[0], new_ops[1]);
> +
> +      tree tmp5 = vect_recog_temp_ssa_var (new_type, NULL);
> +      gassign *g5 = gimple_build_assign (tmp5, BIT_AND_EXPR, tmp4, one_cst);
> +
> +      gassign *g6 = gimple_build_assign (new_var, PLUS_EXPR, tmp3, tmp5);
> +
> +      append_pattern_def_seq (last_stmt_info, g1, new_vectype);
> +      append_pattern_def_seq (last_stmt_info, g2, new_vectype);
> +      append_pattern_def_seq (last_stmt_info, g3, new_vectype);
> +      append_pattern_def_seq (last_stmt_info, g4, new_vectype);
> +      append_pattern_def_seq (last_stmt_info, g5, new_vectype);

Again, this is personal preference, but IMO it'd be clearer to give
the temporaries more meaningful names.  E.g.:

tmp1 -> shifted_op0
tmp2 -> shifted_op1
tmp3 -> sum_of_shifted
tmp4 -> unmasked_carry
tmp5 -> carry

But maybe that's less necessary if we have the pseudo-code in a comment.

I think it'd also be better to add statements as we go rather than
save them all till the end.  That way it's easier to add conditional
paths later (e.g. to handle the signed case).
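
Untested, but roughly:

      tree shifted_op0 = vect_recog_temp_ssa_var (new_type, NULL);
      gassign *g = gimple_build_assign (shifted_op0, RSHIFT_EXPR,
					new_ops[0], one_cst);
      append_pattern_def_seq (last_stmt_info, g, new_vectype);

      tree shifted_op1 = vect_recog_temp_ssa_var (new_type, NULL);
      g = gimple_build_assign (shifted_op1, RSHIFT_EXPR, new_ops[1], one_cst);
      append_pattern_def_seq (last_stmt_info, g, new_vectype);

      tree sum_of_shifted = vect_recog_temp_ssa_var (new_type, NULL);
      g = gimple_build_assign (sum_of_shifted, PLUS_EXPR,
			       shifted_op0, shifted_op1);
      append_pattern_def_seq (last_stmt_info, g, new_vectype);

and so on for unmasked_carry, carry and the final sum.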

Thanks,
Richard


