[AArch64] Implement movmem for the benefit of inline memcpy

James Greenhalgh james.greenhalgh@arm.com
Thu Aug 7 14:20:00 GMT 2014


On Tue, Aug 05, 2014 at 08:05:00AM +0100, Andrew Pinski wrote:
> On Fri, Aug 1, 2014 at 2:21 AM,  <pinskia@gmail.com> wrote:
> >> On Jun 6, 2014, at 1:50 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> >>
> >>
> >> Hi,
> >>
> >> The move_by_pieces infrastructure performs a copy by repeatedly trying
> >> the largest safe copy it can make. So for a 15-byte copy we might see:
> >>
> >> offset   amount  bytes copied
> >> 0        8       0-7
> >> 8        4       8-11
> >> 12       2       12-13
> >> 14       1       14
> >>
> >> However, we can implement a 15-byte copy like so:
> >>
> >> offset   amount  bytes copied
> >> 0        8       0-7
> >> 7        8       7-14
> >>
> >> This can prove more efficient for both size and speed.
> >>
> >> In this patch we set MOVE_RATIO low to avoid using move_by_pieces, and
> >> implement the movmem pattern name to expand small block copy cases. Note, this
> >> optimization does not apply for -mstrict-align targets, which must continue
> >> copying byte-by-byte.
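
To make the overlapping scheme concrete, the 15-byte case above boils down
to something like the following C sketch (purely illustrative; each memcpy
stands in for a single unaligned 8-byte load or store, and note that byte 7
is read and written twice, which is relevant to the volatile discussion
below):

#include <stdint.h>
#include <string.h>

/* Illustrative only: copy exactly 15 bytes using two overlapping
   8-byte accesses.  Each memcpy here models one unaligned 8-byte
   load or store instruction; byte 7 is covered by both accesses.  */
void
copy15 (void *dst, const void *src)
{
  uint64_t lo, hi;
  memcpy (&lo, src, 8);                        /* read bytes 0-7   */
  memcpy (&hi, (const char *) src + 7, 8);     /* read bytes 7-14  */
  memcpy (dst, &lo, 8);                        /* write bytes 0-7  */
  memcpy ((char *) dst + 7, &hi, 8);           /* write bytes 7-14 */
}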
> >
> > Why not change move_by_pieces instead of having target-specific code?
> > This seems like a better option. You can check the slow-unaligned-access
> > target macro to see if you want to do this optimization too.  As I
> > mentioned in the other email, make sure you check the volatileness of
> > the from and to before doing this optimization.

Hi Andrew,

If we are converting these volatile copies to memcpy calls, then there is an
additional bug there. Nothing in the C standard requires that memcpy/memmove
read and write each byte exactly once, and it seems reasonable to assume that
the movmem optab inherits this lack of restriction.  This gotcha either needs
to be fixed, or at least documented for the movmem optab.

If I'm going to fix this, I have to do it in the back end: failing the
expand will just result in a buggy call to memcpy.  Having said that, I'm
not sure I've ever seen a good definition of the semantics of a volatile
struct copy. It feels to me that an element-by-element copy is more likely
to match user expectations than a chunk-by-chunk copy. By the time we reach
movmem it is probably too late to produce that, so the fix would have to be
pushed earlier (probably somewhere generic?).

Do you know of a good write-up/argument/discussion on volatile struct
copy semantics? The testcase you provided is obviously broken; what is less
obvious is what should happen for:

struct __attribute__((packed)) t16{
  long long t8;
  int t4;
  short t2;
  unsigned char t1;
  unsigned char t1a;
};
volatile struct t16 t16;
void f (struct t16 *a)
{
  t16 = *a;
}

We have at least two choices...
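
Sketching the two expansions for the testcase above (the function names are
just for illustration, both reuse struct t16 and the volatile t16 object
from the testcase, and this glosses over ordering and access-width
guarantees entirely):

#include <string.h>

/* Choice 1: element-by-element; one volatile access per member, each
   of its natural width.  */
void
f_by_element (struct t16 *a)
{
  t16.t8  = a->t8;
  t16.t4  = a->t4;
  t16.t2  = a->t2;
  t16.t1  = a->t1;
  t16.t1a = a->t1a;
}

/* Choice 2: chunk-by-chunk, as a movmem/memcpy-style expansion would
   do it: the 16 packed bytes move as (say) two 8-byte accesses, so no
   single access corresponds to any one member and the access widths
   ignore the member boundaries (the casts also discard volatile,
   which is itself part of the problem).  */
void
f_by_chunk (struct t16 *a)
{
  unsigned long long c0, c1;
  memcpy (&c0, (char *) a, 8);
  memcpy (&c1, (char *) a + 8, 8);
  memcpy ((void *) &t16, &c0, 8);
  memcpy ((char *) &t16 + 8, &c1, 8);
}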

> Attached is the patch which does what I mentioned; I also changed
> store_by_pieces to implement a similar optimization there (for memset
> and strcpy).  Also, since I used SLOW_UNALIGNED_ACCESS, this is a
> generic optimization.

I'm not sure this helps your situation on AArch64. There are still AArch64
implementations for which we will want to bypass move_by_pieces and provide
a back-end implementation.

We could more reasonably control this with MOVE_BY_PIECES_P, but that is
only a thin wrapper around MOVE_RATIO, so the result for you is much the
same (pending a patch fixing SRA not to read MOVE_RATIO, it should make no
difference whether we disable this through MOVE_RATIO or MOVE_BY_PIECES_P).
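
For concreteness, either of the following in a port's header would have the
effect of keeping move_by_pieces out of the way, so that the movmem expander
(or a libcall) is used instead. This is only a sketch, using roughly the
macro shapes from the internals documentation of the time; exact forms may
differ:

/* Sketch only: make by-pieces copies look unprofitable...  */
#define MOVE_RATIO(speed) 2

/* ...or answer the by-pieces query directly.  */
#define MOVE_BY_PIECES_P(SIZE, ALIGN) 0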

Have you done much micro- or macro-benchmarking to show that this is indeed
a sensible optimization for !SLOW_UNALIGNED_ACCESS targets? The documentation
suggests that SLOW_UNALIGNED_ACCESS should be set if unaligned accesses are
"many times" slower. That is a bit of a blunt hammer: there are likely
targets which will suffer from this optimization but which do not set
SLOW_UNALIGNED_ACCESS. Maybe you need some finer-grained cost function?
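
For reference, the kind of gate I assume the generic code ends up with looks
roughly like this (a sketch only, not lifted from your patch):

/* Sketch of the gate being discussed: only use the overlapping-tail
   trick when the target does not require strict alignment and does
   not claim unaligned accesses in this mode are slow.  */
if (!STRICT_ALIGNMENT && !SLOW_UNALIGNED_ACCESS (mode, align))
  {
    /* Emit one final full-width access that overlaps the previous
       one, instead of descending to narrower modes for the tail.  */
  }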

Thanks,
James

> I had tested an earlier version on x86_64-linux-gnu and I am in the
> middle of bootstrap/testing on this one.
> 
> Thanks,
> Andrew Pinski
> 
> * expr.c (move_by_pieces): Take the min of max_size and len to speed
> things up and to take advantage of the mode in move_by_pieces_1.
> (move_by_pieces_1): Read/write the leftovers using overlapping
> memory locations to reduce the number of reads/writes.
> (store_by_pieces_1): Take the min of max_size and len to speed things
> up and to take advantage of the mode in store_by_pieces_2.
> (store_by_pieces_2): Write the leftovers using an overlapping
>


