Bug 50417 - [12/13/14/15 regression]: memcpy with known alignment
Summary: [12/13/14/15 regression]: memcpy with known alignment
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.7.0
: P2 normal
Target Milestone: 12.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
: 52415 67676 101920 (view as bug list)
Depends on:
Blocks: 111502
  Show dependency treegraph
 
Reported: 2011-09-15 08:49 UTC by Wouter Vermaelen
Modified: 2024-07-19 12:55 UTC (History)
7 users (show)

See Also:
Host:
Target: sh*-*-* arm*-*-* powerpc64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-07-15 00:00:00


Attachments
patch to remove memcpy inline expansion misalign limit (1.16 KB, text/plain)
2016-07-12 07:25 UTC, Richard Biener
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Wouter Vermaelen 2011-09-15 08:49:31 UTC
Consider these functions:

void copy1(char* d, const char* s) {
	memcpy(d, s, 256);
}
void copy2(short* d, const short* s) {
	memcpy(d, s, 256);
}
void copy3(int* d, const int* s) {
	memcpy(d, s, 256);
}
void copy4(long* d, const long* s) {
	memcpy(d, s, 256);
}

g++-4.5.2 is able to generate better code for the later functions. But when I test with a recent snapshot (SVN revision 178875 on linux x86_64) it generates the same code for all versions (same as copy1()).
Comment 1 Richard Biener 2011-09-25 09:58:43 UTC
The middle-end does not know anything about the parameters alignment.
Comment 2 Richard Biener 2012-06-06 11:00:31 UTC
*** Bug 52415 has been marked as a duplicate of this bug. ***
Comment 3 Oleg Endo 2013-07-14 10:29:44 UTC
I ran into this issue when trying out something on SH, too.
(4.9 trunk rev 200571)
Example:

void test (const int* a, int* b)
{
  std::memcpy (b, a, 4);
}

Here I'd have expected this to result in a int load + int store, like...

        mov.l   @r4,r1
        rts
        mov.l   r1,@r5

However, this does not happen.  The SH's movmemsi pattern gets a 'common alignment' of '1'.  SH is a strict alignment target and thus the expander will fall back to memcpy library call.
If I remember correctly, on strict alignment targets a pointer to some type is by default assumed to satisfy the minimum alignment of that type.
For example

int test (const int* a)
{
  return a[0];
}

resuls in (as expected):

        rts
        mov.l   @r4,r0

In the memcpy case it should be safe to assume that the alignment of int* is 32 bit aligned.
As it has been mentioned in PR 52415, I've also tried out __builtin_assume_aligned and the problem disappears.  In this case the memcpy is eliminated early on and the movmemsi pattern is not used at all.
Comment 4 Georg-Johann Lay 2013-10-09 09:25:23 UTC
(In reply to Richard Biener from comment #1)
> The middle-end does not know anything about the parameters alignment.

But the following code will use SImode load / store, even on strict-alignment backends

void test (const int* a, int* b)
{
  *b = *a;
}

If SImode operations are in order here, why not in memcpy et al. provided it is feeded with int*?

In the case where an unaligned pointer is used in *b = *a and this causes, e.g.  a trap, this is in order.  Similar should apply for memcpy on strict-alignment machines if it gets an unaligned int* for example and the size is a multiple of the alignment requirement.

BTW: AVR is an 8-bit machine that copies in chunks of 1 byte, this applies also for 4.5 and older.  avr-gcc is /not/ a strict alignment backend.  Thul removed from the targets.
Comment 5 Richard Biener 2014-04-02 08:39:36 UTC
*** Bug 60737 has been marked as a duplicate of this bug. ***
Comment 6 Richard Biener 2015-09-22 08:04:44 UTC
*** Bug 67676 has been marked as a duplicate of this bug. ***
Comment 7 nolange79 2016-07-08 10:17:17 UTC
This seems to affect even the most trivial cases. I ran the following code witn arm gcc 4.8.4, 4.9.2 and 5.3.0:

#include <stdint.h>
#include <string.h>

uint32_t foo_noalign(const uint32_t *s) {
	uint32_t v;
	memcpy(&v, s, sizeof(v));
	return v;
}

uint32_t foo(const uint32_t *s) {
	uint32_t v;
	memcpy(&v, __builtin_assume_aligned(s, 4), sizeof(v));
	return v;
}

Which generates the following code:

00000000 <foo_noalign>:
   0:	e92d4007 	push	{r0, r1, r2, lr}
   4:	e3a02004 	mov	r2, #4
   8:	e1a01000 	mov	r1, r0
   c:	e08d0002 	add	r0, sp, r2
  10:	ebfffffe 	bl	0 <memcpy>
  14:	e59d0004 	ldr	r0, [sp, #4]
  18:	e28dd00c 	add	sp, sp, #12
  1c:	e49de004 	pop	{lr}		; (ldr lr, [sp], #4)
  20:	e12fff1e 	bx	lr

00000024 <foo>:
  24:	e5900000 	ldr	r0, [r0]
  28:	e12fff1e 	bx	lr

Thats really, really bad. clang has no problems generating the optimal code.
Comment 8 rguenther@suse.de 2016-07-08 11:21:22 UTC
On Fri, 8 Jul 2016, npl at chello dot at wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
> 
> npl at chello dot at changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |npl at chello dot at
> 
> --- Comment #7 from npl at chello dot at ---
> This seems to affect even the most trivial cases. I ran the following code witn
> arm gcc 4.8.4, 4.9.2 and 5.3.0:
> 
> #include <stdint.h>
> #include <string.h>
> 
> uint32_t foo_noalign(const uint32_t *s) {
>         uint32_t v;
>         memcpy(&v, s, sizeof(v));
>         return v;
> }
> 
> uint32_t foo(const uint32_t *s) {
>         uint32_t v;
>         memcpy(&v, __builtin_assume_aligned(s, 4), sizeof(v));
>         return v;
> }
> 
> Which generates the following code:
> 
> 00000000 <foo_noalign>:
>    0:   e92d4007        push    {r0, r1, r2, lr}
>    4:   e3a02004        mov     r2, #4
>    8:   e1a01000        mov     r1, r0
>    c:   e08d0002        add     r0, sp, r2
>   10:   ebfffffe        bl      0 <memcpy>
>   14:   e59d0004        ldr     r0, [sp, #4]
>   18:   e28dd00c        add     sp, sp, #12
>   1c:   e49de004        pop     {lr}            ; (ldr lr, [sp], #4)
>   20:   e12fff1e        bx      lr
> 
> 00000024 <foo>:
>   24:   e5900000        ldr     r0, [r0]
>   28:   e12fff1e        bx      lr
> 
> Thats really, really bad. clang has no problems generating the optimal code.

But there is no good reasoning that can be applied that it is a valid
transform.  What does clang do when s is void * and you cast that to
uint32_t *?  Note that memcpy prototype takes a void * argument.
Comment 9 Oleg Endo 2016-07-08 11:50:09 UTC
(In reply to rguenther@suse.de from comment #8)
> 
> But there is no good reasoning that can be applied that it is a valid
> transform.  What does clang do when s is void * and you cast that to
> uint32_t *?  Note that memcpy prototype takes a void * argument.

Does it matter whether there's a void* in between or not?

If we do a *(int32_T*)(void*)(int32_t*)p on a strict alignment target, it will result in a 32 bit mem access.

GCC knows memcpy and its semantics well enough so propagating the target's alignment rules into memcpy is most likely going to be a safe and sane thing to do.  So at least cases like 

void test (const int* a, int* b)
{
  std::memcpy (b, a, 4);
}

should work.  Shouldn't it?
Comment 10 rguenther@suse.de 2016-07-08 12:07:11 UTC
On Fri, 8 Jul 2016, olegendo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
> 
> --- Comment #9 from Oleg Endo <olegendo at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #8)
> > 
> > But there is no good reasoning that can be applied that it is a valid
> > transform.  What does clang do when s is void * and you cast that to
> > uint32_t *?  Note that memcpy prototype takes a void * argument.
> 
> Does it matter whether there's a void* in between or not?
> 
> If we do a *(int32_T*)(void*)(int32_t*)p on a strict alignment target, it will
> result in a 32 bit mem access.

No, changing the pointer type doesn't change the access - how the access
is implemented depends on memcpy.

> GCC knows memcpy and its semantics well enough so propagating the target's
> alignment rules into memcpy is most likely going to be a safe and sane thing to
> do.  So at least cases like 
> 
> void test (const int* a, int* b)
> {
>   std::memcpy (b, a, 4);
> }
> 
> should work.  Shouldn't it?

What makes the parameter type special?  Would

void test (const int *a, int *b)
{
  std::memcpy ((char *)b, (char *)a, t);
}

be invalid to optimize to an aligned copy in your eyes?

Yes, I know that C standard rule about pointer type casting, 6.3.2.3/7
in C11, but I guess that if we'd start to derive alignment from
this undefinedness we are going to break a lot of code.  To GCC
what matters is actual accesses, not just pointer conversions
which are all stripped anyway which makes the following testcase
not optimizable (easily)

void test (const char *a, char *b)                                                
{                                                                               
  std::memcpy ((int *)b, (int *)a, 4);                                        
}
Comment 11 Oleg Endo 2016-07-08 12:26:09 UTC
(In reply to rguenther@suse.de from comment #10)
> 
> What makes the parameter type special?  Would
> 
> void test (const int *a, int *b)
> {
>   std::memcpy ((char *)b, (char *)a, t);
> }
> 
> be invalid to optimize to an aligned copy in your eyes?

I think it's perfectly valid to optimize this case.  It could as well be something like:

void test (const int *a, int *b)
{
  a[100] = 1;
  b[200] = 2;

  std::memcpy ((char *)b, (char *)a, t);
}

where a[100] and b[200] both would result in 32 bit accesses, not 4x1 byte or something, because the base pointer is assumed to be int aligned.  Why should memcpy be any different?
Comment 12 rguenther@suse.de 2016-07-08 12:30:34 UTC
On Fri, 8 Jul 2016, olegendo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
> 
> --- Comment #11 from Oleg Endo <olegendo at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #10)
> > 
> > What makes the parameter type special?  Would
> > 
> > void test (const int *a, int *b)
> > {
> >   std::memcpy ((char *)b, (char *)a, t);
> > }
> > 
> > be invalid to optimize to an aligned copy in your eyes?
> 
> I think it's perfectly valid to optimize this case.  It could as well be
> something like:
> 
> void test (const int *a, int *b)
> {
>   a[100] = 1;
>   b[200] = 2;
> 
>   std::memcpy ((char *)b, (char *)a, t);
> }
> 
> where a[100] and b[200] both would result in 32 bit accesses, not 4x1 byte or
> something, because the base pointer is assumed to be int aligned.

No, because the access is performed as 'int'.

>  Why should memcpy be any different?

Because the memcpy stmt doesn't constitute a memory access but a function call.
Comment 13 Oleg Endo 2016-07-08 12:42:35 UTC
(In reply to rguenther@suse.de from comment #12)
> 
> No, because the access is performed as 'int'.

> 
> >  Why should memcpy be any different?
> 
> Because the memcpy stmt doesn't constitute a memory access but a function
> call.

Yeah, that might be how things are handled internally.  But that doesn't mean the optimization in question can't / shouldn't be done.  GCC does even more "evil" things with function calls such as replacing printf with puts, propagating arguments into functions, inlining...  So ... uhm ... I'm afraid I don't get the point.
Comment 14 rguenther@suse.de 2016-07-08 12:50:27 UTC
On Fri, 8 Jul 2016, olegendo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
> 
> --- Comment #13 from Oleg Endo <olegendo at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #12)
> > 
> > No, because the access is performed as 'int'.
> 
> > 
> > >  Why should memcpy be any different?
> > 
> > Because the memcpy stmt doesn't constitute a memory access but a function
> > call.
> 
> Yeah, that might be how things are handled internally.  But that doesn't mean
> the optimization in question can't / shouldn't be done.  GCC does even more
> "evil" things with function calls such as replacing printf with puts,
> propagating arguments into functions, inlining...  So ... uhm ... I'm afraid I
> don't get the point.

The point is the only reason you can do the optimization (on all targets)
is that C standard rule about pointer types (casting).

GCCs implementation details will make the application of the optimization
quite unreliable as well.

I've repeatedly said that I think relying on that part of the C standard
with respect to undefinedness would be a bad thing from a QOI perspective.
But I've changed my mind in the past - does -fsanitize=alignment
instrument pointer types/casts?
Comment 15 Richard Biener 2016-07-08 12:56:27 UTC
Doesn't sanitize anything for

int main()                                                                      
{                                                                               
  int i, j;                                                                     
  memcpy (&i, (int *)((char *)&i + 1), 4);                                               
  return 0;                                                                     
}

but note that the cast to (int *) is already removed during early folding
and you end up with GENERIC

  memcpy ((void *) &i, (const void *) (&i + 1), 4);
Comment 16 nolange79 2016-07-08 14:38:15 UTC
rguenther: Funny enough, I am using memcpy because thats the only standard conform way to move data around that might be aliased. And I use this often to parse binary streams from a network.

I think you are missing some key points, first that gcc readily replaces functions it knows - which can be disabled at will. On x86, for all those samples the call WILL BE replaced. (And in C++, RVO is explicitly allowed for example)
So its pointless to discuss whether or not gcc should provide builtins (atleast in the context of this bug). Instead talk about the replacing feature that seems to work inconsistent / suboptimal.


One more sample to demonstrate this (all compiled with O2):

#include <string.h>
int foo_ptr(const int *s)
{
	/* nope */
	int v;
	memcpy(&v, s, sizeof(v));
	return v;
}

int foo_ref(const int &s)
{
	/* nope */
	int v;
	memcpy(&v, &s, sizeof(v));
	return v;
}

int foo_refloc(const int &s)
{
	/* This actually works */
	int b = s;
	int v;
	memcpy(&v, &b, sizeof(v));
	return v;
}

int foo_badalias(const char *s)
{
	/* Nope, but then THIS shouldnt be be replaced by a aligned load */
	int v;
	memcpy(&v, (int *)s, sizeof(v));
	return v;
}

int foo_badalias2(const char *s)
{
	/* That works, but is undefined behaviour (casting from a pointer with less alignment)! */
	int v = *(int *)s;
	return v;
}
Comment 17 nolange79 2016-07-08 21:11:24 UTC
I got interrupted by a colleague at work, part 2 of the ramblings...

Everything you could argue against memcpy beeing replaced by simpler instructions, doesnt change that the same issue persists with the __builtin_memcpy function, which is explicitely saying you want the optimizations.

A pointer to a uint32 can be assumed to be proper aligned, CREATING such a pointer thats not aligned is already undefined behaviour by the standard (the compiler could zero out bits for example). I dont think that what happens afterwards with something that shouldn`t exist in the first place is an argument against optimizing proper code.

Further, I lack a consistent way of dealing with potential aliasing pointers. Using memcpy seems the sanest way, simply because its standards compliant, supported everywhere and your code wont mysteriously break once you use LTO or higher optimization settings.
Compilers can reliably detect this and replace memcpy since years (ignoring this issue, which I would consider a bug), so there is no draw back. Its a feature common pretty much everywhere, and a valid recommendation in many discussions related to the topic.

Consider the example below for illustration, FIXEDMEMCPY is how the plain memcpy should work and already does work for archs with unaligned access.
(I had planned to post the code for 32bit x86, but the assembly is rather ugly, amd64 would work with "unsigned long" and "unsigned long long").

I already ran in such issues, when different software components define their own fixedwidth types. Its a practical issue where pointing to paragraphs of the standard dont help, unless you provide a proper solution with it. The FIXEDMEMCPY hack is fine for gcc but compilerspecific.

In short:
* Optimizing memcpy to simple instructions is a reality and expected, the behaviour (slow code) on arm (and other archs with req. alignment) is a unwelcome oddity
* memcpy is one of the few ways to deal with aliasing, and the most standards compliant. (theres unions too, but thats not standards compliant)
* I dont see a problem in replacing standard functions (and __builtin_memcpy has the same issue)
* I dont see a problem in expecting a correctly aligned pointer, and doing undefined behaviour if the pointer could cause undefined behaviour.



typedef unsigned uint32_t;
typedef unsigned long uint32_alt;
_Static_assert(sizeof(uint32_t) == sizeof(uint32_alt), "you picked a bad architecture or typedefs for this example");

#define FIXEDMEMCPY(a, b, s) __builtin_memcpy(__builtin_assume_aligned(a, __alignof__(*a)), __builtin_assume_aligned(b, __alignof__(*b)), s)
unsigned breakme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
{
	/* normally in different compilation units, but LTO doesnt care */
	*ptr = 0;
	*ptr2 = a;
	return *ptr;
}

unsigned fixme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
{
	/* fixes aliasing, but should be as fast as simple accesses */
	uint32_t val = 0;
	FIXEDMEMCPY(ptr, &val, 4);
	FIXEDMEMCPY(ptr2 , &a, 4);
	uint32_t val2;
	FIXEDMEMCPY(&val2, ptr, 4);
	return val2;
}

00000000 <breakme>:
   0:	e3a03000 	mov	r3, #0
   4:	e5803000 	str	r3, [r0]
   8:	e1a00003 	mov	r0, r3 // Oops: retval = 0
   c:	e5812000 	str	r2, [r1]
  10:	e12fff1e 	bx	lr

00000014 <fixme>:
  14:	e3a03000 	mov	r3, #0
  18:	e5803000 	str	r3, [r0]
  1c:	e5812000 	str	r2, [r1]
  20:	e5900000 	ldr	r0, [r0] // The load thats missing above
  24:	e24dd010 	sub	sp, sp, #16 // Time for another 
  28:	e28dd010 	add	sp, sp, #16 // Bugreport ?
  2c:	e12fff1e 	bx	lr
Comment 18 Georg-Johann Lay 2016-07-09 12:57:57 UTC
(In reply to rguenther@suse.de from comment #12)
> On Fri, 8 Jul 2016, olegendo at gcc dot gnu.org wrote:
> 
> > void test (const int *a, int *b)
> > {
> >   a[100] = 1;
> >   b[200] = 2;
> > 
> >   std::memcpy ((char *)b, (char *)a, t);
> > }
> > 
> > where a[100] and b[200] both would result in 32 bit accesses, not 4x1 byte or
> > something, because the base pointer is assumed to be int aligned.
> 
> No, because the access is performed as 'int'.
> 
> >  Why should memcpy be any different?
> 
> Because the memcpy stmt doesn't constitute a memory access but a function
> call.

What about a new command option like -fassume-aligned-xxx that's off per default?

The user could assert that when she is using memcpy (and friends) with a pointer of a specific type, then that also asserts that the data behind the pointer is appropriately aligned and may be accessed accordingly.
Comment 19 Richard Biener 2016-07-12 07:21:02 UTC
(In reply to npl from comment #17)
> I got interrupted by a colleague at work, part 2 of the ramblings...
> 
> Everything you could argue against memcpy beeing replaced by simpler
> instructions, doesnt change that the same issue persists with the
> __builtin_memcpy function, which is explicitely saying you want the
> optimizations.
> 
> A pointer to a uint32 can be assumed to be proper aligned, CREATING such a
> pointer thats not aligned is already undefined behaviour by the standard
> (the compiler could zero out bits for example). I dont think that what
> happens afterwards with something that shouldn`t exist in the first place is
> an argument against optimizing proper code.
> 
> Further, I lack a consistent way of dealing with potential aliasing
> pointers. Using memcpy seems the sanest way, simply because its standards
> compliant, supported everywhere and your code wont mysteriously break once
> you use LTO or higher optimization settings.
> Compilers can reliably detect this and replace memcpy since years (ignoring
> this issue, which I would consider a bug), so there is no draw back. Its a
> feature common pretty much everywhere, and a valid recommendation in many
> discussions related to the topic.
> 
> Consider the example below for illustration, FIXEDMEMCPY is how the plain
> memcpy should work and already does work for archs with unaligned access.
> (I had planned to post the code for 32bit x86, but the assembly is rather
> ugly, amd64 would work with "unsigned long" and "unsigned long long").
> 
> I already ran in such issues, when different software components define
> their own fixedwidth types. Its a practical issue where pointing to
> paragraphs of the standard dont help, unless you provide a proper solution
> with it. The FIXEDMEMCPY hack is fine for gcc but compilerspecific.
> 
> In short:
> * Optimizing memcpy to simple instructions is a reality and expected, the
> behaviour (slow code) on arm (and other archs with req. alignment) is a
> unwelcome oddity
> * memcpy is one of the few ways to deal with aliasing, and the most
> standards compliant. (theres unions too, but thats not standards compliant)
> * I dont see a problem in replacing standard functions (and __builtin_memcpy
> has the same issue)
> * I dont see a problem in expecting a correctly aligned pointer, and doing
> undefined behaviour if the pointer could cause undefined behaviour.
> 
> 
> 
> typedef unsigned uint32_t;
> typedef unsigned long uint32_alt;
> _Static_assert(sizeof(uint32_t) == sizeof(uint32_alt), "you picked a bad
> architecture or typedefs for this example");
> 
> #define FIXEDMEMCPY(a, b, s) __builtin_memcpy(__builtin_assume_aligned(a,
> __alignof__(*a)), __builtin_assume_aligned(b, __alignof__(*b)), s)
> unsigned breakme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
> {
> 	/* normally in different compilation units, but LTO doesnt care */
> 	*ptr = 0;
> 	*ptr2 = a;
> 	return *ptr;
> }
> 
> unsigned fixme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
> {
> 	/* fixes aliasing, but should be as fast as simple accesses */
> 	uint32_t val = 0;
> 	FIXEDMEMCPY(ptr, &val, 4);
> 	FIXEDMEMCPY(ptr2 , &a, 4);
> 	uint32_t val2;
> 	FIXEDMEMCPY(&val2, ptr, 4);
> 	return val2;
> }
> 
> 00000000 <breakme>:
>    0:	e3a03000 	mov	r3, #0
>    4:	e5803000 	str	r3, [r0]
>    8:	e1a00003 	mov	r0, r3 // Oops: retval = 0
>    c:	e5812000 	str	r2, [r1]
>   10:	e12fff1e 	bx	lr
> 
> 00000014 <fixme>:
>   14:	e3a03000 	mov	r3, #0
>   18:	e5803000 	str	r3, [r0]
>   1c:	e5812000 	str	r2, [r1]
>   20:	e5900000 	ldr	r0, [r0] // The load thats missing above
>   24:	e24dd010 	sub	sp, sp, #16 // Time for another 
>   28:	e28dd010 	add	sp, sp, #16 // Bugreport ?
>   2c:	e12fff1e 	bx	lr

It's not done on STRICT_ALIGNMENT platforms because not all of those expand
misaligned moves correctly (IIRC).  Looking at RTL expansion at least the
misaligned destination will work correctly.  The question remains is what
happens for -Os and for example both misaligned source and destination.
Or on x86 where a simple rep; movb; is possible (plus the register setup
of course).
Comment 20 Richard Biener 2016-07-12 07:25:42 UTC
Created attachment 38879 [details]
patch to remove memcpy inline expansion misalign limit

This patch lifts the restrictions on memcpy to assignment folding WRT strict alignment platforms.  As said for -Os this might not be optimal (but it's
usually hard to tell given the cases we want the early folding there are
followup optimization opportunities like promoting either source or destination
to a register).
Comment 21 Richard Biener 2016-07-12 07:33:47 UTC
(In reply to Georg-Johann Lay from comment #18)
> (In reply to rguenther@suse.de from comment #12)
> > On Fri, 8 Jul 2016, olegendo at gcc dot gnu.org wrote:
> > 
> > > void test (const int *a, int *b)
> > > {
> > >   a[100] = 1;
> > >   b[200] = 2;
> > > 
> > >   std::memcpy ((char *)b, (char *)a, t);
> > > }
> > > 
> > > where a[100] and b[200] both would result in 32 bit accesses, not 4x1 byte or
> > > something, because the base pointer is assumed to be int aligned.
> > 
> > No, because the access is performed as 'int'.
> > 
> > >  Why should memcpy be any different?
> > 
> > Because the memcpy stmt doesn't constitute a memory access but a function
> > call.
> 
> What about a new command option like -fassume-aligned-xxx that's off per
> default?
> 
> The user could assert that when she is using memcpy (and friends) with a
> pointer of a specific type, then that also asserts that the data behind the
> pointer is appropriately aligned and may be accessed accordingly.

But if you do

void copy (int *d, int *s)
{
  memcpy ((char *)d, (char *)s, 4);
}

then you will get aligned accesses because all the middle-end sees is

  mempcy (d, s, 4);

so as I said elsewhere the only way to reliably implement deriving alignment
from pointer types is by the frontends inserting __builtin_assume_aligned
calls before they possibly stripped any conversions.

Yes, if we simply say we strictly follow C11 6.3.2.3/7 we can do that
using a simple flag.  But we won't get the optimization reliably because
of the above issue for the variant with char * parameters and int * casts.
You could call that flag -fstrict-alignment (though maybe that would be
confusing to people familiar with GCC internals STRICT_ALIGNMENT target
macro).  A simple implementation could be in get_pointer_alignment_1,
wrapping it with a function that inspects TYPE_ALIGN (TREE_TYPE (TREE_TYPE (exp)) and uses that if it trumps the wrapped fn return values.  But I
expect much undesired fallout from such a change.

Having a flag restricting the optimization behavior to memcpy-and-friends
is not going to work (without the frontends inserting extra IL when all
information is still present).
Comment 22 nolange79 2016-07-12 08:32:05 UTC
(In reply to Richard Biener from comment #21)
> (In reply to Georg-Johann Lay from comment #18)
> > (In reply to rguenther@suse.de from comment #12)
> > > On Fri, 8 Jul 2016, olegendo at gcc dot gnu.org wrote:
> > > 
> > > > void test (const int *a, int *b)
> > > > {
> > > >   a[100] = 1;
> > > >   b[200] = 2;
> > > > 
> > > >   std::memcpy ((char *)b, (char *)a, t);
> > > > }
> > > > 
> > > > where a[100] and b[200] both would result in 32 bit accesses, not 4x1 byte or
> > > > something, because the base pointer is assumed to be int aligned.
> > > 
> > > No, because the access is performed as 'int'.
> > > 
> > > >  Why should memcpy be any different?
> > > 
> > > Because the memcpy stmt doesn't constitute a memory access but a function
> > > call.
> > 
> > What about a new command option like -fassume-aligned-xxx that's off per
> > default?
> > 
> > The user could assert that when she is using memcpy (and friends) with a
> > pointer of a specific type, then that also asserts that the data behind the
> > pointer is appropriately aligned and may be accessed accordingly.
> 
> But if you do
> 
> void copy (int *d, int *s)
> {
>   memcpy ((char *)d, (char *)s, 4);
> }
> 
> then you will get aligned accesses because all the middle-end sees is
> 
>   mempcy (d, s, 4);

Same thing happens already if you do this:

int d, s;
mempcy ((char *)&d, (char *)&s, 4);

Its also generally quite hard to force the compiler to do less-aligned accesses, and I haven`t seen this "solution" anywhere. (Probably because it doesn`t work on any current compiler)

> so as I said elsewhere the only way to reliably implement deriving alignment
> from pointer types is by the frontends inserting __builtin_assume_aligned
> calls before they possibly stripped any conversions.
> 
> Yes, if we simply say we strictly follow C11 6.3.2.3/7 we can do that
> using a simple flag.  But we won't get the optimization reliably because
> of the above issue for the variant with char * parameters and int * casts.

And you shouldn`t get the optimization in this case. Casting char* to int* is non-standard, the assume_aligned builtin is a good fit for that non-standard stuff IMHO.

> You could call that flag -fstrict-alignment (though maybe that would be
> confusing to people familiar with GCC internals STRICT_ALIGNMENT target
> macro).  A simple implementation could be in get_pointer_alignment_1,
> wrapping it with a function that inspects TYPE_ALIGN (TREE_TYPE (TREE_TYPE
> (exp)) and uses that if it trumps the wrapped fn return values.  But I
> expect much undesired fallout from such a change.

The internals are above my head, but in regards to fallout the same thing could (has) be said for -fstrict-aliasing. Im a victim myself, and am suffering from a paranoia where I have to replace pointer-accesses with memcpy =)

(In reply to Richard Biener from comment #19)
> (In reply to npl from comment #17)
> > I got interrupted by a colleague at work, part 2 of the ramblings...
> > 
> > Everything you could argue against memcpy beeing replaced by simpler
> > instructions, doesnt change that the same issue persists with the
> > __builtin_memcpy function, which is explicitely saying you want the
> > optimizations.
> > 
> > A pointer to a uint32 can be assumed to be proper aligned, CREATING such a
> > pointer thats not aligned is already undefined behaviour by the standard
> > (the compiler could zero out bits for example). I dont think that what
> > happens afterwards with something that shouldn`t exist in the first place is
> > an argument against optimizing proper code.
> > 
> > Further, I lack a consistent way of dealing with potential aliasing
> > pointers. Using memcpy seems the sanest way, simply because its standards
> > compliant, supported everywhere and your code wont mysteriously break once
> > you use LTO or higher optimization settings.
> > Compilers can reliably detect this and replace memcpy since years (ignoring
> > this issue, which I would consider a bug), so there is no draw back. Its a
> > feature common pretty much everywhere, and a valid recommendation in many
> > discussions related to the topic.
> > 
> > Consider the example below for illustration, FIXEDMEMCPY is how the plain
> > memcpy should work and already does work for archs with unaligned access.
> > (I had planned to post the code for 32bit x86, but the assembly is rather
> > ugly, amd64 would work with "unsigned long" and "unsigned long long").
> > 
> > I already ran in such issues, when different software components define
> > their own fixedwidth types. Its a practical issue where pointing to
> > paragraphs of the standard dont help, unless you provide a proper solution
> > with it. The FIXEDMEMCPY hack is fine for gcc but compilerspecific.
> > 
> > In short:
> > * Optimizing memcpy to simple instructions is a reality and expected, the
> > behaviour (slow code) on arm (and other archs with req. alignment) is a
> > unwelcome oddity
> > * memcpy is one of the few ways to deal with aliasing, and the most
> > standards compliant. (theres unions too, but thats not standards compliant)
> > * I dont see a problem in replacing standard functions (and __builtin_memcpy
> > has the same issue)
> > * I dont see a problem in expecting a correctly aligned pointer, and doing
> > undefined behaviour if the pointer could cause undefined behaviour.
> > 
> > 
> > 
> > typedef unsigned uint32_t;
> > typedef unsigned long uint32_alt;
> > _Static_assert(sizeof(uint32_t) == sizeof(uint32_alt), "you picked a bad
> > architecture or typedefs for this example");
> > 
> > #define FIXEDMEMCPY(a, b, s) __builtin_memcpy(__builtin_assume_aligned(a,
> > __alignof__(*a)), __builtin_assume_aligned(b, __alignof__(*b)), s)
> > unsigned breakme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
> > {
> > 	/* normally in different compilation units, but LTO doesnt care */
> > 	*ptr = 0;
> > 	*ptr2 = a;
> > 	return *ptr;
> > }
> > 
> > unsigned fixme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
> > {
> > 	/* fixes aliasing, but should be as fast as simple accesses */
> > 	uint32_t val = 0;
> > 	FIXEDMEMCPY(ptr, &val, 4);
> > 	FIXEDMEMCPY(ptr2 , &a, 4);
> > 	uint32_t val2;
> > 	FIXEDMEMCPY(&val2, ptr, 4);
> > 	return val2;
> > }
> > 
> > 00000000 <breakme>:
> >    0:	e3a03000 	mov	r3, #0
> >    4:	e5803000 	str	r3, [r0]
> >    8:	e1a00003 	mov	r0, r3 // Oops: retval = 0
> >    c:	e5812000 	str	r2, [r1]
> >   10:	e12fff1e 	bx	lr
> > 
> > 00000014 <fixme>:
> >   14:	e3a03000 	mov	r3, #0
> >   18:	e5803000 	str	r3, [r0]
> >   1c:	e5812000 	str	r2, [r1]
> >   20:	e5900000 	ldr	r0, [r0] // The load thats missing above
> >   24:	e24dd010 	sub	sp, sp, #16 // Time for another 
> >   28:	e28dd010 	add	sp, sp, #16 // Bugreport ?
> >   2c:	e12fff1e 	bx	lr
> 
> It's not done on STRICT_ALIGNMENT platforms because not all of those expand
> misaligned moves correctly (IIRC).  Looking at RTL expansion at least the
> misaligned destination will work correctly.  The question remains is what
> happens for -Os and for example both misaligned source and destination.
> Or on x86 where a simple rep; movb; is possible (plus the register setup
> of course).

Not sure what you mean, x86 has unaligned accesses and shouldn't be affected. Also, I doubt there are many cases where the function-call (and the register and stack shuffling) will use less code than the aligned access.

The generated code for unaligned access could be improved in many cases (ARM atleast), possibly fixed. But thats not generally an argument against improving the builtins?
Comment 23 rguenther@suse.de 2016-07-12 09:11:34 UTC
On Tue, 12 Jul 2016, npl at chello dot at wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
> 
> --- Comment #22 from npl at chello dot at ---
> > > 00000014 <fixme>:
> > >   14:	e3a03000 	mov	r3, #0
> > >   18:	e5803000 	str	r3, [r0]
> > >   1c:	e5812000 	str	r2, [r1]
> > >   20:	e5900000 	ldr	r0, [r0] // The load thats missing above
> > >   24:	e24dd010 	sub	sp, sp, #16 // Time for another 
> > >   28:	e28dd010 	add	sp, sp, #16 // Bugreport ?
> > >   2c:	e12fff1e 	bx	lr
> > 
> > It's not done on STRICT_ALIGNMENT platforms because not all of those expand
> > misaligned moves correctly (IIRC).  Looking at RTL expansion at least the
> > misaligned destination will work correctly.  The question remains is what
> > happens for -Os and for example both misaligned source and destination.
> > Or on x86 where a simple rep; movb; is possible (plus the register setup
> > of course).
> 
> Not sure what you mean, x86 has unaligned accesses and shouldn't be affected.
> Also, I doubt there are many cases where the function-call (and the register
> and stack shuffling) will use less code than the aligned access.
> 
> The generated code for unaligned access could be improved in many cases (ARM
> atleast), possibly fixed. But thats not generally an argument against improving
> the builtins?

Sure.  It is just that I am usually (overly?) cautionous when introducing
unaligned accesses because historically they were handled wrong and
nowadays they might be handled very inefficient on strict-align targets.

I am going to propose the patch and add a testcase with most misalign
cases so we can see runtime fallout on the more weird targets.
Comment 24 Oleg Endo 2016-07-12 11:54:32 UTC
(In reply to npl from comment #22)
> 
> Its also generally quite hard to force the compiler to do less-aligned
> accesses, and I haven`t seen this "solution" anywhere. (Probably because it
> doesn`t work on any current compiler)

Hm, have you tried using the address of a member in a packed (1) struct?
Comment 25 nolange79 2016-07-12 12:42:41 UTC
Yes, that works fine. I just meant to say it needs more work than casting to a type with less alignment, and unless explicitly marked with this pragma you can expect a compiler to access like the (deduced) type was properly aligned.
Comment 26 rguenther@suse.de 2016-07-12 13:01:23 UTC
On Tue, 12 Jul 2016, npl at chello dot at wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
> 
> --- Comment #25 from npl at chello dot at ---
> Yes, that works fine. I just meant to say it needs more work than casting to a
> type with less alignment, and unless explicitly marked with this pragma you can
> expect a compiler to access like the (deduced) type was properly aligned.

sth like

typedef int acc __attribute__((may_alias, aligned(1)));

and an actual access like

  *(acc *)p;

works as well (type punning allowed via may_alias, alignment adjusted
to 1 byte).

Of course that's not portable.

Portable is to use memcpy which irrespective of the type of the
pointer has to assume an unaligned access.
Comment 27 Richard Biener 2016-08-03 10:53:53 UTC
GCC 4.9 branch is being closed
Comment 28 Jakub Jelinek 2017-10-10 13:27:45 UTC
GCC 5 branch is being closed
Comment 29 Jakub Jelinek 2018-10-26 10:12:33 UTC
GCC 6 branch is being closed
Comment 30 Richard Biener 2019-11-14 07:52:18 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 31 Jakub Jelinek 2020-03-04 09:33:52 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 32 Jakub Jelinek 2021-05-14 09:46:30 UTC
GCC 8 branch is being closed.
Comment 33 Richard Biener 2021-06-01 08:05:29 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 34 Andrew Pinski 2021-08-15 09:13:13 UTC
*** Bug 101920 has been marked as a duplicate of this bug. ***
Comment 35 Richard Biener 2022-05-27 09:34:28 UTC
GCC 9 branch is being closed
Comment 36 Jakub Jelinek 2022-06-28 10:30:08 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 37 Richard Biener 2023-07-07 10:29:36 UTC
GCC 10 branch is being closed.
Comment 38 Richard Biener 2024-07-19 12:55:30 UTC
GCC 11 branch is being closed.