This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
- From: James Greenhalgh <james dot greenhalgh at arm dot com>
- To: Tamar Christina <Tamar dot Christina at arm dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, nd <nd at arm dot com>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>
- Date: Tue, 19 Jun 2018 22:23:04 +0100
- Subject: Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
- References: <20180619140924.GA28606@arm.com>
On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,
>
> This changes the movmem code in AArch64 that copies data between 4 and 7
> bytes to use the smallest possible modes capable of copying the remaining bytes.
>
> This means that if we're copying 5 bytes we would issue an SImode and QImode
> load instead of two SImode loads.
>
> This does smaller memory accesses but also gives the mid-end a chance to realise
> that it can CSE the loads in certain circumstances. e.g. when you have something
> like
>
> return foo;
>
> where foo is a struct. This would be transformed by the mid-end into SSA form as
>
> D.XXXX = foo;
>
> return D.XXXX;
>
> This movmem routine will handle the first copy (D.XXXX = foo), but that copy
> is usually not needed: for the 5-byte example the mid-end would itself do
> SImode and QImode stores into X0, but unless the first copy uses those same
> modes it cannot tell that the stores are redundant.
>
> This will generate
>
> fun5:
> sub sp, sp, #16
> adrp x0, .LANCHOR0
> add x1, x0, :lo12:.LANCHOR0
> ldr w0, [x0, #:lo12:.LANCHOR0]
> str w0, [sp, 8]
> ldrb w0, [x1, 4]
> strb w0, [sp, 12]
> add sp, sp, 16
> ret
>
> instead of
>
> fun5:
> sub sp, sp, #16
> adrp x0, .LANCHOR0
> add x1, x0, :lo12:.LANCHOR0
> ldr w0, [x0, #:lo12:.LANCHOR0]
> str w0, [sp, 8]
> ldr w0, [x1, 1]
> str w0, [sp, 9]
> add sp, sp, 16
> ret
>
> for a 5 byte copy.
So... Given that I wrote the code to do the overlapping memmove back in 2014
(https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html), I'm confused about
why we want this transformation. I can see that we come out about even for a
5 or 6 byte copy, but your patch would also force it for a 7 byte copy, adding
an extra load instruction. That said, the 5 and 6 byte cases probably weren't
thought through that closely.
I'd like to see a testcase which shows the advantages for CSE you describe
in your introduction to the patch; I'm not sure your testcase below gives
that.
> Regtested on aarch64-none-elf and .. issues.
Do I draw a zero on the line on my screen with a sharpie or what ;-)
> Ok for trunk?
Justification and performance impact would be appreciated. Right now I'm not
convinced this is forward progress.
> gcc/
> 2018-06-19 Tamar Christina <tamar.christina@arm.com>
>
> * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
>
> gcc/testsuite/
> 2018-06-19 Tamar Christina <tamar.christina@arm.com>
>
> * gcc.target/aarch64/struct_cpy.c: New.
>
> --
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> bool
> aarch64_expand_movmem (rtx *operands)
> {
> - unsigned int n;
> + int n, mode_bits;
> rtx dst = operands[0];
> rtx src = operands[1];
> rtx base;
> + machine_mode cur_mode = BLKmode, next_mode;
> bool speed_p = !optimize_function_for_size_p (cfun);
>
> /* When optimizing for size, give a better estimate of the length of a
> - memcpy call, but use the default otherwise. */
> - unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> + memcpy call, but use the default otherwise. Moves larger than 8 bytes
> + will always require an even number of instructions to do now. And each
> +     operation requires both a load+store, so divide the max number by 2. */
> + int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
Why did we go from unsigned to signed int?
Thanks,
James