[Bug tree-optimization/50417] regression: memcpy with known alignment

Tue Jul 12 08:32:00 GMT 2016

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417

--- Comment #22 from npl at chello dot at ---
(In reply to Richard Biener from comment #21)
> (In reply to Georg-Johann Lay from comment #18)
> > (In reply to rguenther@suse.de from comment #12)
> > > On Fri, 8 Jul 2016, olegendo at gcc dot gnu.org wrote:
> > > 
> > > > void test (const int *a, int *b)
> > > > {
> > > >   a[100] = 1;
> > > >   b[200] = 2;
> > > > 
> > > >   std::memcpy ((char *)b, (char *)a, t);
> > > > }
> > > > 
> > > > where a[100] and b[200] both would result in 32 bit accesses, not 4x1 byte or
> > > > something, because the base pointer is assumed to be int aligned.
> > > 
> > > No, because the access is performed as 'int'.
> > > 
> > > >  Why should memcpy be any different?
> > > 
> > > Because the memcpy stmt doesn't constitute a memory access but a function
> > > call.
> > 
> > What about a new command option like -fassume-aligned-xxx that's off per
> > default?
> > 
> > The user could assert that when she is using memcpy (and friends) with a
> > pointer of a specific type, then that also asserts that the data behind the
> > pointer is appropriately aligned and may be accessed accordingly.
> 
> But if you do
> 
> void copy (int *d, int *s)
> {
>   memcpy ((char *)d, (char *)s, 4);
> }
> 
> then you will get aligned accesses because all the middle-end sees is
> 
>   memcpy (d, s, 4);

The same thing already happens if you do this:

int d, s;
memcpy ((char *)&d, (char *)&s, 4);

It's also generally quite hard to force the compiler to do less-aligned
accesses, and I haven't seen this "solution" anywhere (probably because it
doesn't work on any current compiler).

> so as I said elsewhere the only way to reliably implement deriving alignment
> from pointer types is by the frontends inserting __builtin_assume_aligned
> calls before they possibly stripped any conversions.
> 
> Yes, if we simply say we strictly follow C11 6.3.2.3/7 we can do that
> using a simple flag.  But we won't get the optimization reliably because
> of the above issue for the variant with char * parameters and int * casts.

And you shouldn't get the optimization in this case. Casting char* to int* is
non-standard; the assume_aligned builtin is a good fit for that non-standard
stuff IMHO.
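
As a rough sketch of that use of the builtin (the helper name
load_u32_aligned and the 4-byte alignment are made up for illustration, and
this assumes GCC or a compatible compiler), the alignment promise is stated
explicitly on a char* instead of being derived from a cast:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the caller promises that buf is 4-byte aligned.
   Breaking that promise is undefined behaviour. */
static uint32_t load_u32_aligned(const char *buf)
{
    uint32_t v;

    /* Debug-build check of the promise; compiled out with -DNDEBUG. */
    assert(((uintptr_t)buf % 4) == 0);

    /* Tell the compiler about the alignment it cannot see from 'char *'. */
    const void *p = __builtin_assume_aligned(buf, 4);

    /* With the alignment known, this copy may be expanded to a single
       aligned load instead of a call or byte-wise accesses. */
    memcpy(&v, p, sizeof v);
    return v;
}
```

Whether the copy actually ends up as one load depends on the target and
the optimization level.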

> You could call that flag -fstrict-alignment (though maybe that would be
> confusing to people familiar with GCC internals STRICT_ALIGNMENT target
> macro).  A simple implementation could be in get_pointer_alignment_1,
> wrapping it with a function that inspects TYPE_ALIGN (TREE_TYPE (TREE_TYPE
> (exp)) and uses that if it trumps the wrapped fn return values.  But I
> expect much undesired fallout from such a change.

The internals are over my head, but with regard to fallout, the same thing could
be (and has been) said for -fstrict-aliasing. I'm a victim myself, and am
suffering from a paranoia where I have to replace pointer accesses with memcpy =)
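
For illustration, the usual shape of that replacement looks roughly like
this (float_bits is a made-up name, and a 32-bit float is an assumption of
the sketch):

```c
#include <assert.h>   /* assert() for the usage check mentioned below */
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(uint32_t) == sizeof(float),
               "this sketch assumes a 32-bit float");

/* Hypothetical helper: read the object representation of a float.
   Writing *(uint32_t *)&f instead would violate the aliasing rules. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* byte copy, no type-punned access */
    return u;
}
```

On a target with IEEE 754 single-precision floats, float_bits(1.0f) yields
0x3f800000. The expectation is that the compiler turns the memcpy into a
plain register move or load.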

(In reply to Richard Biener from comment #19)
> (In reply to npl from comment #17)
> > I got interrupted by a colleague at work, part 2 of the ramblings...
> > 
> > Everything you could argue against memcpy being replaced by simpler
> > instructions doesn't change that the same issue persists with the
> > __builtin_memcpy function, which is explicitly saying you want the
> > optimizations.
> > 
> > A pointer to a uint32 can be assumed to be properly aligned; CREATING such a
> > pointer that's not aligned is already undefined behaviour by the standard
> > (the compiler could zero out bits, for example). I don't think that what
> > happens afterwards with something that shouldn't exist in the first place is
> > an argument against optimizing proper code.
> > 
> > Further, I lack a consistent way of dealing with potentially aliasing
> > pointers. Using memcpy seems the sanest way, simply because it's standards
> > compliant, supported everywhere, and your code won't mysteriously break once
> > you use LTO or higher optimization settings.
> > Compilers have been able to reliably detect and replace memcpy for years
> > (ignoring this issue, which I would consider a bug), so there is no drawback.
> > It's a feature common pretty much everywhere, and a valid recommendation in
> > many discussions related to the topic.
> > 
> > Consider the example below for illustration. FIXEDMEMCPY is how the plain
> > memcpy should work, and already does work for archs with unaligned access.
> > (I had planned to post the code for 32-bit x86, but the assembly is rather
> > ugly; amd64 would work with "unsigned long" and "unsigned long long".)
> > 
> > I have already run into such issues, when different software components
> > define their own fixed-width types. It's a practical issue where pointing to
> > paragraphs of the standard doesn't help, unless you provide a proper solution
> > with it. The FIXEDMEMCPY hack is fine for gcc but compiler-specific.
> > 
> > In short:
> > * Optimizing memcpy to simple instructions is a reality and expected; the
> > behaviour (slow code) on arm (and other archs with required alignment) is an
> > unwelcome oddity
> > * memcpy is one of the few ways to deal with aliasing, and the most
> > standards compliant. (there are unions too, but that's not standards compliant)
> > * I don't see a problem in replacing standard functions (and __builtin_memcpy
> > has the same issue)
> > * I don't see a problem in expecting a correctly aligned pointer, and doing
> > undefined behaviour if the pointer could cause undefined behaviour.
> > 
> > 
> > 
> > typedef unsigned uint32_t;
> > typedef unsigned long uint32_alt;
> > _Static_assert(sizeof(uint32_t) == sizeof(uint32_alt),
> >   "you picked a bad architecture or typedefs for this example");
> > 
> > #define FIXEDMEMCPY(a, b, s) \
> >   __builtin_memcpy(__builtin_assume_aligned(a, __alignof__(*a)), \
> >                    __builtin_assume_aligned(b, __alignof__(*b)), s)
> > unsigned breakme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
> > {
> > 	/* normally in different compilation units, but LTO doesnt care */
> > 	*ptr = 0;
> > 	*ptr2 = a;
> > 	return *ptr;
> > }
> > 
> > unsigned fixme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
> > {
> > 	/* fixes aliasing, but should be as fast as simple accesses */
> > 	uint32_t val = 0;
> > 	FIXEDMEMCPY(ptr, &val, 4);
> > 	FIXEDMEMCPY(ptr2 , &a, 4);
> > 	uint32_t val2;
> > 	FIXEDMEMCPY(&val2, ptr, 4);
> > 	return val2;
> > }
> > 
> > 00000000 <breakme>:
> >    0:	e3a03000 	mov	r3, #0
> >    4:	e5803000 	str	r3, [r0]
> >    8:	e1a00003 	mov	r0, r3 // Oops: retval = 0
> >    c:	e5812000 	str	r2, [r1]
> >   10:	e12fff1e 	bx	lr
> > 
> > 00000014 <fixme>:
> >   14:	e3a03000 	mov	r3, #0
> >   18:	e5803000 	str	r3, [r0]
> >   1c:	e5812000 	str	r2, [r1]
> >   20:	e5900000 	ldr	r0, [r0] // The load thats missing above
> >   24:	e24dd010 	sub	sp, sp, #16 // Time for another 
> >   28:	e28dd010 	add	sp, sp, #16 // Bugreport ?
> >   2c:	e12fff1e 	bx	lr
> 
> It's not done on STRICT_ALIGNMENT platforms because not all of those expand
> misaligned moves correctly (IIRC).  Looking at RTL expansion at least the
> misaligned destination will work correctly.  The question that remains is
> what happens for -Os and for example both misaligned source and destination.
> Or on x86 where a simple rep; movb; is possible (plus the register setup
> of course).

Not sure what you mean; x86 has unaligned accesses and shouldn't be affected.
Also, I doubt there are many cases where the function call (and the register
and stack shuffling) will use less code than the aligned access.

The generated code for unaligned access could be improved in many cases (on ARM
at least), possibly fixed. But that's not generally an argument against
improving the builtins?