aarch64: Add internal tune flag to minimise VL-based scalar ops
This patch introduces an internal tune flag that breaks up VL-based scalar
ops into a separate read of VL and a scalar op on general-purpose registers.
This can be preferable on some CPUs.
I went for a tune param rather than extending the RTX costs because our RTX
cost tables aren't set up to track this intricacy.
I've confirmed that on the simple loop:
void vadd (int *dst, int *op1, int *op2, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] = op1[i] + op2[i];
}
we now split the incw into a cntw outside the loop and the add inside.
+ cntw x5
...
loop:
- incw x4
+ add x4, x4, x5
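As a rough, portable illustration of the difference (not code from the patch),
the two loop shapes can be modelled in plain C. The helper model_cntw and both
function names are hypothetical, and the vector length is fixed at 4 words
purely as an assumption so the sketch runs anywhere:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the SVE "cntw" instruction: the number of
   32-bit elements per vector.  Fixed at 4 (a 128-bit vector) purely for
   illustration; on real hardware the value is only known at run time.  */
static size_t
model_cntw (void)
{
  return 4;
}

/* Without the flag: the induction variable is bumped by a VL-based
   operation on every iteration (modelling "incw x4").  */
size_t
count_iters_incw (size_t count)
{
  size_t iters = 0;
  for (size_t i = 0; i < count; i += model_cntw ())
    iters++;
  return iters;
}

/* With the flag: VL is read once before the loop (modelling "cntw x5")
   and the loop body uses a plain GP-register add ("add x4, x4, x5").  */
size_t
count_iters_hoisted (size_t count)
{
  size_t iters = 0;
  size_t vl = model_cntw ();
  assert (vl > 0);  /* A zero step would never terminate.  */
  for (size_t i = 0; i < count; i += vl)
    iters++;
  return iters;
}
```

Both functions visit the same iterations; the only difference is where the
VL read happens, which is the property the tune flag is meant to control.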
gcc/ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (cse_sve_vl_constants):
Define.
* config/aarch64/aarch64.md (add<mode>3): Force CONST_POLY_INT immediates
into a register when the above is enabled.
* config/aarch64/aarch64.c (neoversev1_tunings):
Use AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS.
(aarch64_rtx_costs): Use AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/cse_sve_vl_constants_1.c: New test.