]> gcc.gnu.org Git - gcc.git/commit
aarch64: Avoid same input and output Z register for gather loads
authorKyrylo Tkachov <kyrylo.tkachov@arm.com>
Wed, 21 Jun 2023 12:43:26 +0000 (13:43 +0100)
committerKyrylo Tkachov <kyrylo.tkachov@arm.com>
Wed, 21 Jun 2023 12:43:26 +0000 (13:43 +0100)
commitb375c5340b55d6e7ee703717831dd27528962570
tree5d740c7e8f8ac1f4cb763e4c0298b7980a2a8c48
parent4d9d207c668e757adefcfe5368d86bcc008fa4db
aarch64: Avoid same input and output Z register for gather loads

The architecture recommends that load-gather instructions avoid using the same
Z register for the load address and the destination, and the Software Optimization
Guides for Arm cores recommend that as well.
This means that for code like:

svuint64_t
food (svbool_t p, uint64_t *in, svint64_t offsets, svuint64_t a)
{
  return svadd_u64_x (p, a, svld1_gather_offset(p, in, offsets));
}

we'll want to avoid generating the current:
food:
        ld1d    z0.d, p0/z, [x0, z0.d] // Z0 reused as input and output.
        add     z0.d, z1.d, z0.d
        ret

However, we still want to avoid generating extra moves where there were
none before, so the tight aarch64-sve-acle.exp tests for load gathers
should still pass as they are.

This patch implements that recommendation for the load gather patterns by:
* duplicating the alternatives
* marking the output operand as early clobber
* Tying the input Z register operand in the original alternatives to 0
* Penalising the original alternatives with '?'

This results in a large-ish patch in terms of diff lines but the new
compact syntax (thanks Tamar) makes it quite a readable an regular change.

The benchmark numbers on a Neoverse V1 on fprate look okay:
        diff
503.bwaves_r 0.00%
507.cactuBSSN_r 0.00%
508.namd_r 0.00%
510.parest_r 0.55%
511.povray_r 0.22%
519.lbm_r 0.00%
521.wrf_r 0.00%
526.blender_r 0.00%
527.cam4_r 0.56%
538.imagick_r 0.00%
544.nab_r 0.00%
549.fotonik3d_r 0.00%
554.roms_r 0.00%
fprate         0.10%

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md (mask_gather_load<mode><v_int_container>):
Add alternatives to prefer to avoid same input and output Z register.
(mask_gather_load<mode><v_int_container>): Likewise.
(*mask_gather_load<mode><v_int_container>_<su>xtw_unpacked): Likewise.
(*mask_gather_load<mode><v_int_container>_sxtw): Likewise.
(*mask_gather_load<mode><v_int_container>_uxtw): Likewise.
(@aarch64_gather_load_<ANY_EXTEND:optab><SVE_4HSI:mode><SVE_4BHI:mode>):
Likewise.
(@aarch64_gather_load_<ANY_EXTEND:optab><SVE_2HSDI:mode><SVE_2BHSI:mode>):
Likewise.
(*aarch64_gather_load_<ANY_EXTEND:optab><SVE_2HSDI:mode>
<SVE_2BHSI:mode>_<ANY_EXTEND2:su>xtw_unpacked): Likewise.
(*aarch64_gather_load_<ANY_EXTEND:optab><SVE_2HSDI:mode>
<SVE_2BHSI:mode>_sxtw): Likewise.
(*aarch64_gather_load_<ANY_EXTEND:optab><SVE_2HSDI:mode>
<SVE_2BHSI:mode>_uxtw): Likewise.
(@aarch64_ldff1_gather<mode>): Likewise.
(@aarch64_ldff1_gather<mode>): Likewise.
(*aarch64_ldff1_gather<mode>_sxtw): Likewise.
(*aarch64_ldff1_gather<mode>_uxtw): Likewise.
(@aarch64_ldff1_gather_<ANY_EXTEND:optab><VNx4_WIDE:mode>
<VNx4_NARROW:mode>): Likewise.
(@aarch64_ldff1_gather_<ANY_EXTEND:optab><VNx2_WIDE:mode>
<VNx2_NARROW:mode>): Likewise.
(*aarch64_ldff1_gather_<ANY_EXTEND:optab><VNx2_WIDE:mode>
<VNx2_NARROW:mode>_sxtw): Likewise.
(*aarch64_ldff1_gather_<ANY_EXTEND:optab><VNx2_WIDE:mode>
<VNx2_NARROW:mode>_uxtw): Likewise.
* config/aarch64/aarch64-sve2.md (@aarch64_gather_ldnt<mode>): Likewise.
(@aarch64_gather_ldnt_<ANY_EXTEND:optab><SVE_FULL_SDI:mode>
<SVE_PARTIAL_I:mode>): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/gather_earlyclobber.c: New test.
* gcc.target/aarch64/sve2/gather_earlyclobber.c: New test.
gcc/config/aarch64/aarch64-sve.md
gcc/config/aarch64/aarch64-sve2.md
gcc/testsuite/gcc.target/aarch64/sve/gather_earlyclobber.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve2/gather_earlyclobber.c [new file with mode: 0644]
This page took 0.092654 seconds and 6 git commands to generate.