[PATCH] aarch64: Don't include vec_select in SIMD multiply cost
Richard Sandiford
richard.sandiford@arm.com
Thu Jul 22 17:16:38 GMT 2021
Jonathan Wright <Jonathan.Wright@arm.com> writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can take various forms - multiplying full vector registers of values
> or multiplying one vector by a single element of another. Regardless
> of the form used, these instructions have the same cost, and this
> should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match the vec_select used by the lane-referencing forms of the
> instructions already mentioned. This traversal prevents the cost of
> the vec_select from being added into the cost of the multiply -
> meaning that these instructions can now be emitted in the combine
> pass as they are no longer deemed prohibitively expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19 Jonathan Wright <jonathan.wright@arm.com>
>
> * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
> RTL tree to prevent vec_select from being added into Neon
> multiply cost.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
> op0 = XEXP (op0, 0);
> else if (GET_CODE (op1) == VEC_DUPLICATE)
> op1 = XEXP (op1, 0);
> + /* The same argument applies to the VEC_SELECT when using the lane-
> + referencing forms of the MUL/MLA/MLS instructions. Without the
> + traversal here, the combine pass deems these patterns too
> + expensive and subsequently does not emit the lane-referencing
> + forms of the instructions. In addition, canonical form is for the
> + VEC_SELECT to be the second argument of the multiply - thus only
> + op1 is traversed. */
> + if (GET_CODE (op1) == VEC_SELECT
> + && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> + op1 = XEXP (op1, 0);
> + else if ((GET_CODE (op1) == ZERO_EXTEND
> + || GET_CODE (op1) == SIGN_EXTEND)
> + && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
> + && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> + op1 = XEXP (XEXP (op1, 0), 0);
I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
if block, since the condition is never true otherwise. We can probably
skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
it would be better to add them to the existing VEC_DUPLICATE tests rather
than restrict them to the VEC_SELECT ones.
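Concretely (purely a sketch on my part, not tested against the surrounding
code), folding the lane handling into the existing tests could look
something like:

```
      else if (GET_CODE (op1) == VEC_DUPLICATE
	       || (GET_CODE (op1) == VEC_SELECT
		   && is_a<scalar_mode> (GET_MODE (op1))))
	op1 = XEXP (op1, 0);
      else if ((GET_CODE (op1) == ZERO_EXTEND
		|| GET_CODE (op1) == SIGN_EXTEND)
	       && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
	       && is_a<scalar_mode> (GET_MODE (XEXP (op1, 0))))
	op1 = XEXP (XEXP (op1, 0), 0);
```

Since canonicalisation puts the VEC_SELECT in op1, only op1 needs the
extra cases, as in your patch.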
Also, although this is in Advanced SIMD-specific code, I think it'd be
better to use:

  is_a<scalar_mode> (GET_MODE (op1))

instead of:

  GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1
Do you have a testcase?
Thanks,
Richard