This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
- From: James Greenhalgh <james dot greenhalgh at arm dot com>
- To: Tamar Christina <Tamar dot Christina at arm dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, nd <nd at arm dot com>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>
- Date: Tue, 19 Jun 2018 22:23:04 +0100
- Subject: Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
- References: <20180619140924.GA28606@arm.com>
On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,
>
> This changes the movmem code in AArch64 that copies data between 4 and 7
> bytes to use the smallest possible modes capable of copying the remaining bytes.
>
> This means that if we're copying 5 bytes we would issue an SImode and QImode
> load instead of two SImode loads.
>
> This does smaller memory accesses but also gives the mid-end a chance to realise
> that it can CSE the loads in certain circumstances. e.g. when you have something
> like
>
> return foo;
>
> where foo is a struct. This would be transformed by the mid-end into SSA form as
>
> D.XXXX = foo;
>
> return D.XXXX;
>
> This movmem routine will handle the first copy (D.XXXX = foo), but that copy
> is usually not needed: for the 5-byte example the mid-end would itself do
> SImode and QImode stores into X0, but unless the first copy uses those same
> modes it cannot tell that the stores are redundant.
>
> This will generate
>
> fun5:
> sub sp, sp, #16
> adrp x0, .LANCHOR0
> add x1, x0, :lo12:.LANCHOR0
> ldr w0, [x0, #:lo12:.LANCHOR0]
> str w0, [sp, 8]
> ldrb w0, [x1, 4]
> strb w0, [sp, 12]
> add sp, sp, 16
> ret
>
> instead of
>
> fun5:
> sub sp, sp, #16
> adrp x0, .LANCHOR0
> add x1, x0, :lo12:.LANCHOR0
> ldr w0, [x0, #:lo12:.LANCHOR0]
> str w0, [sp, 8]
> ldr w0, [x1, 1]
> str w0, [sp, 9]
> add sp, sp, 16
> ret
>
> for a 5 byte copy.
So... Given that I wrote the code to do the overlapping memmove back in 2014
(https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html), I'm confused about
why we want this transformation. I can see that we come out about even for a
5 or 6 byte copy, but your patch would also force it for a 7 byte copy, adding
an extra load instruction. That said, the 5 and 6 byte cases probably weren't
thought through that closely.
I'd like to see a testcase which shows the advantages for CSE you describe
in your introduction to the patch; I'm not sure your testcase below gives
that.
> Regtested on aarch64-none-elf and .. issues.
Do I draw a zero on the line on my screen with a sharpie or what ;-)
> Ok for trunk?
Justification and performance impact would be appreciated. Right now I'm not
convinced this is forward progress.
> gcc/
> 2018-06-19 Tamar Christina <tamar.christina@arm.com>
>
> * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
>
> gcc/testsuite/
> 2018-06-19 Tamar Christina <tamar.christina@arm.com>
>
> * gcc.target/aarch64/struct_cpy.c: New.
>
> --
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> bool
> aarch64_expand_movmem (rtx *operands)
> {
> - unsigned int n;
> + int n, mode_bits;
> rtx dst = operands[0];
> rtx src = operands[1];
> rtx base;
> + machine_mode cur_mode = BLKmode, next_mode;
> bool speed_p = !optimize_function_for_size_p (cfun);
>
> /* When optimizing for size, give a better estimate of the length of a
> - memcpy call, but use the default otherwise. */
> - unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> + memcpy call, but use the default otherwise. Moves larger than 8 bytes
> + will always require an even number of instructions to do now. And each
> +     operation requires both a load+store, so divide the max number by 2. */
> + int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
Why did we go from unsigned to signed int?
Thanks,
James