[PATCH][GCC][aarch64] Generation of adjusted ldp/stp for vector types

Mon Jul 13 16:12:37 GMT 2020

Hi,

Sorry for the slow review.

Przemyslaw Wirkus <Przemyslaw.Wirkus@arm.com> writes:
> Hi,
>
> Introduce simple peephole2 optimization which substitutes a sequence of
> four consecutive load or store (LDR, STR) instructions with two load or
> store pair (LDP, STP) instructions for 2 element supported vector modes
> (V2SI, V2SF, V2DI, and V2DF).
> Generated load / store pair instruction offset is adjusted accordingly.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Example:
> $ cat stp_vec_v2sf.c
> typedef float __attribute__((vector_size(8))) vec;
>
> void
> store_adjusted(vec *out, vec x, vec y)
> {
>   out[400] = x;
>   out[401] = y;
>   out[402] = y;
>   out[403] = x;
> }
>
> Example compiled with:
> $ ./aarch64-none-linux-gnu-gcc -S -O2 stp_vec_v2sf.c -dp
>
> Before the patch:
>
> store_adjusted:
>     str     d0, [x0, 3200]    // 9    [c=4 l=4]  *aarch64_simd_movv2si/2
>     str     d1, [x0, 3208]    // 11   [c=4 l=4]  *aarch64_simd_movv2si/2
>     str     d1, [x0, 3216]    // 13   [c=4 l=4]  *aarch64_simd_movv2si/2
>     str     d0, [x0, 3224]    // 15   [c=4 l=4]  *aarch64_simd_movv2si/2
>     ret       // 26       [c=0 l=4]  *do_return
>
> After the patch:
>
> store_adjusted:
>     add     x1, x0, 3200    // 27   [c=4 l=4]  *adddi3_aarch64/0
>     stp     d0, d1, [x1]    // 28   [c=0 l=4]  vec_store_pairv2siv2si
>     stp     d1, d0, [x1, 16]        // 29   [c=0 l=4]  vec_store_pairv2siv2si
>     ret             // 22   [c=0 l=4]  *do_return
>
>
> OK for master ?
>
> kind regards,
> Przemyslaw

Thanks for doing this, looks good.

My only real comment is that I wonder if we really need the nunits
parameter, or if we should instead change the scalar_mode to a general
machine_mode and just pass the vector mode to that.

Passing nunits means that we don't need a to_constant here:

> @@ -21970,7 +21973,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>    for (int i = 0; i < num_insns; i++)
>      offvals[i] = INTVAL (offset[i]);
>  
> -  msize = GET_MODE_SIZE (mode);
> +  msize = GET_MODE_SIZE (mode) * nunits;
>  
>    /* Check if the offsets can be put in the right order to do a ldp/stp.  */
>    qsort (offvals, num_insns, sizeof (HOST_WIDE_INT),

but I think adding to_constant is fine in this context, since it's
inherently non-SVE code.

That would also avoid having to recalculate the mode:

> @@ -22114,8 +22118,11 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
>    replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8],
>  						  new_off_3 + msize), true);
>  
> -  if (!aarch64_mem_pair_operand (mem_1, mode)
> -      || !aarch64_mem_pair_operand (mem_3, mode))
> +  /* If nunits > 1 we are adjusting for vector mode. In this case we should
> +     generate mode for vector built from nunits and scalar_mode provided.  */
> +  mem_mode = (nunits == 1) ? mode : mode_for_vector(mode, nunits).else_void();
> +  if (!aarch64_mem_pair_operand (mem_1, mem_mode)
> +      || !aarch64_mem_pair_operand (mem_3, mem_mode))
>      return false;
>  
>    if (code == ZERO_EXTEND)

…here.

One other (very) minor thing is that some of the lines were over the
80 character limit, but removing the nunits parameter might fix that :-)

> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_vec_v2sf.c b/gcc/testsuite/gcc.target/aarch64/ldp_vec_v2sf.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..f46dea1f748a094509ecfa0292a7c54e94164c9a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_vec_v2sf.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +typedef float __attribute__((vector_size(8))) vec;
> +
> +vec
> +load_long(vec *v) {
> +  return v[110] + v[111] + v[112] + v[113];
> +}
> +
> +/* { dg-final { scan-assembler "add\tx\[0-9\]+, x\[0-9\]+, 880" } } */
> +/* { dg-final { scan-assembler "ldp\td\[0-9\]+, d\[0-9\]+, \\\[x\[0-9\]+\\\]" } } */
> +/* { dg-final { scan-assembler "ldp\td\[0-9\]+, d\[0-9\]+, \\\[x\[0-9\]+, 16\\\]" } } */

FWIW, it's possible to avoid many of these backslashes by quoting the
regexp with {…} rather than "…".  E.g.:

/* { dg-final { scan-assembler {ldp\td[0-9]+, d[0-9]+, \[x[0-9]+, 16\]} } } */

The above is fine too though.

Thanks,
Richard