This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
[PATCH][AArch64] Use Q-reg loads/stores in movmem expansion
- From: Kyrill Tkachov <kyrylo dot tkachov at foss dot arm dot com>
- To: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Cc: "Richard Earnshaw (lists)" <richard dot earnshaw at arm dot com>, James Greenhalgh <james dot greenhalgh at arm dot com>, Marcus Shawcroft <marcus dot shawcroft at arm dot com>
- Date: Fri, 21 Dec 2018 12:30:49 +0000
- Subject: [PATCH][AArch64] Use Q-reg loads/stores in movmem expansion
Hi all,
Our movmem expansion currently emits TImode loads and stores when copying 128-bit chunks.
This generates X-register LDP/STP sequences, as X registers are the preferred register class for that mode.
For the purpose of copying memory, however, we want to prefer Q-registers.
Each 128-bit chunk then needs one Q-register rather than two X-registers, which helps with register pressure.
It also allows 256-bit and larger copies to be merged into Q-reg LDP/STP, further reducing code size.
The implementation is easy: we just use a 128-bit vector mode (V4SImode in this patch)
rather than TImode.
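To illustrate the mode choice described above, here is a minimal standalone C sketch. It is not the GCC code: the enum, `chunk_bits` and `pick_chunk_mode` are hypothetical names that only model the idea of picking the widest integer chunk that fits and then swapping a 128-bit integer chunk for a vector one when SIMD is available. The 128-bit limit used in the usage note is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for machine modes; not GCC's real enum.  */
enum chunk_mode
{
  CHUNK_NONE, CHUNK_QI, CHUNK_HI, CHUNK_SI, CHUNK_DI, CHUNK_TI, CHUNK_V4SI
};

/* Width in bits of each modelled chunk.  */
static unsigned
chunk_bits (enum chunk_mode m)
{
  switch (m)
    {
    case CHUNK_QI: return 8;
    case CHUNK_HI: return 16;
    case CHUNK_SI: return 32;
    case CHUNK_DI: return 64;
    case CHUNK_TI:
    case CHUNK_V4SI: return 128;
    default: return 0;
    }
}

/* Pick the widest integer chunk that fits in both the remaining bits and
   copy_limit, then prefer the vector chunk for 128-bit copies when SIMD
   is available.  Returns CHUNK_NONE if nothing fits.  */
static enum chunk_mode
pick_chunk_mode (unsigned remaining_bits, unsigned copy_limit, bool have_simd)
{
  static const enum chunk_mode int_modes[]
    = { CHUNK_QI, CHUNK_HI, CHUNK_SI, CHUNK_DI, CHUNK_TI };
  unsigned limit = remaining_bits < copy_limit ? remaining_bits : copy_limit;
  enum chunk_mode cur = CHUNK_NONE;

  for (size_t i = 0; i < sizeof int_modes / sizeof int_modes[0]; i++)
    if (chunk_bits (int_modes[i]) <= limit)
      cur = int_modes[i];

  if (have_simd && chunk_bits (cur) == chunk_bits (CHUNK_TI))
    cur = CHUNK_V4SI;

  return cur;
}
```

In this model, `pick_chunk_mode (256, 128, true)` yields `CHUNK_V4SI`, while the same call with `have_simd` false stays with `CHUNK_TI`. Smaller chunks are unaffected.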
With this patch the testcase:
#define N 8
int src[N], dst[N];

void
foo (void)
{
  __builtin_memcpy (dst, src, N * sizeof (int));
}
generates:
foo:
        adrp    x1, src
        add     x1, x1, :lo12:src
        adrp    x0, dst
        add     x0, x0, :lo12:dst
        ldp     q1, q0, [x1]
        stp     q1, q0, [x0]
        ret
instead of:
foo:
        adrp    x1, src
        add     x1, x1, :lo12:src
        adrp    x0, dst
        add     x0, x0, :lo12:dst
        ldp     x2, x3, [x1]
        stp     x2, x3, [x0]
        ldp     x2, x3, [x1, 16]
        stp     x2, x3, [x0, 16]
        ret
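The difference between the two sequences is simple arithmetic, sketched below in C. The helper `pair_insn_count` is a hypothetical name, and the model assumes the copy size is a multiple of the pair size and that every access ends up in an LDP/STP, which in practice depends on target tuning and later passes.

```c
#include <stddef.h>

/* Hypothetical helper: number of LDP plus STP instructions needed to copy
   BYTES when one pair instruction moves PAIR_BYTES.  Assumes BYTES is a
   multiple of PAIR_BYTES and that every access is paired.  */
static size_t
pair_insn_count (size_t bytes, size_t pair_bytes)
{
  size_t blocks = bytes / pair_bytes;
  return 2 * blocks;  /* One load-pair and one store-pair per block.  */
}
```

For the 32-byte copy in the testcase, an X-register pair moves 16 bytes, so `pair_insn_count (32, 16)` is 4, while a Q-register pair moves 32 bytes, so `pair_insn_count (32, 32)` is 2. This matches the two listings.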
Bootstrapped and tested on aarch64-none-linux-gnu.
I hope this is a small enough change for GCC 9.
One could argue that it finishes the work done this cycle to support Q-register LDP/STPs.
I've seen this give about a 1.8% improvement on 541.leela_r on Cortex-A57, with other changes in SPEC2017 in the noise,
but there is a reduction in code size everywhere (due to more LDP/STP-Q pairs being formed).
Ok for trunk?
Thanks,
Kyrill
2018-12-21 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* config/aarch64/aarch64.c (aarch64_expand_movmem): Use V4SImode for
128-bit moves.
2018-12-21 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* gcc.target/aarch64/movmem-q-reg_1.c: New test.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 88b14179a4cbc5357dfabe21227ff9c8a111804c..a8dcdd4c9e22a7583a197372e500c787c91fe459 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16448,6 +16448,16 @@ aarch64_expand_movmem (rtx *operands)
if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
cur_mode = mode_iter.require ();
+ /* If we want to use 128-bit chunks use a vector mode to prefer the use
+ of Q registers. This is preferable to using load/store-pairs of X
+ registers as we need 1 Q-register vs 2 X-registers.
+ Also, for targets that prefer it, further passes can create
+ LDP/STP of Q-regs to further reduce the code size. */
+ if (TARGET_SIMD
+ && known_eq (GET_MODE_SIZE (cur_mode), GET_MODE_SIZE (TImode)))
+ cur_mode = V4SImode;
+
+
gcc_assert (cur_mode != BLKmode);
mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
diff --git a/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c b/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..09afad59712b939e25519f02153b5156ddacbf5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define N 8
+int src[N], dst[N];
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, N * sizeof (int));
+}
+
+/* { dg-final { scan-assembler {ld[rp]\tq[0-9]*} } } */
+/* { dg-final { scan-assembler-not {ld[rp]\tx[0-9]*} } } */
+/* { dg-final { scan-assembler {st[rp]\tq[0-9]*} } } */
+/* { dg-final { scan-assembler-not {st[rp]\tx[0-9]*} } } */
\ No newline at end of file