Bug 67701 - Unnecessary/bad instructions for u32-casted access to external symbol (assumes misaligned, superfluous load)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 5.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-09-23 23:56 UTC by Julius Werner
Modified: 2016-04-15 08:52 UTC
2 users

See Also:
Host:
Target: armv7a
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-09-24 00:00:00


Description Julius Werner 2015-09-23 23:56:53 UTC
Consider the following sample program:

static inline void write32(void *addr, unsigned int value) {
        *(volatile unsigned int *)addr = value;
}

extern unsigned char testvalue[];
void main() {
        write32(&testvalue, 4);
}

GCC 5.2 (with -O2 -fno-stack-protector -ffreestanding -fomit-frame-pointer) generates:

main:
        mov     r1, #4
        mov     r2, #0
        ldr     r3, .L2
        ldrb    r0, [r3]        @ zero_extendqisi2
        strb    r1, [r3]
        ldrb    r1, [r3, #1]    @ zero_extendqisi2
        strb    r2, [r3, #1]
        ldrb    r1, [r3, #2]    @ zero_extendqisi2
        strb    r2, [r3, #2]
        ldrb    r1, [r3, #3]    @ zero_extendqisi2
        strb    r2, [r3, #3]
        bx      lr

Whereas GCC 4.9 (same arguments) generates:

main:
        movw    r3, #:lower16:testvalue
        movs    r2, #4
        movt    r3, #:upper16:testvalue
        ldr     r1, [r3]
        str     r2, [r3]
        bx      lr

I think there are two problems with this:

1. Both GCC 5.2 and 4.9 generate superfluous load instructions from |addr|, even though the pointer is never dereferenced as an rvalue (only used as the left-hand side of an assignment). This is (a) unnecessary and (b) dangerous, since the pointer is declared volatile (meaning it could be an MMIO register with read side effects).

2. GCC 5.2 seems to somehow assume it must treat the object as unaligned and generates byte-wise accesses. I don't understand why it would do that, since nothing here is declared __attribute__((packed)) and the explicit cast to a 4-byte-aligned type should remove any ambiguity for the compiler about which alignment it can assume. It is important for systems code to have a (portable) way to dereference an address with an explicit alignment regardless of where it came from (e.g. for MMIO registers that only support 32-bit wide accesses).

Both of these issues can be solved by using __builtin_assume_aligned(addr, sizeof(unsigned long)), but I still think the compiler shouldn't have done this in the first place (and having a portable solution to this problem is preferable).
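
For illustration, the workaround applied to the write32() helper above might look like this (a sketch; the actual change in the review linked below may differ):

static inline void write32(void *addr, unsigned int value) {
        /* Promise the compiler that addr is at least 4-byte aligned so
           it can emit a single word-wide access again. */
        addr = __builtin_assume_aligned(addr, sizeof(unsigned int));
        *(volatile unsigned int *)addr = value;
}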

Context: http://review.coreboot.org/#/c/11698
Comment 1 Julius Werner 2015-09-24 00:03:07 UTC
edit: target architecture in my example is armv7a, btw (although I doubt that this is architecture-specific)
Comment 2 Andrew Pinski 2015-09-24 00:27:56 UTC
(In reply to Julius Werner from comment #1)
> edit: target architecture in my example is armv7a, btw (although I doubt
> that this is architecture-specific)

I suspect this is an armv7-only issue, where support for misaligned loads/stores is missing.

Also, testvalue is aligned to 1 byte by default, so what GCC is doing is correct, as there is really no way to do volatile unaligned loads.
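
For completeness, if testvalue is in fact word-aligned, saying so on its declaration should be enough to get the word-wide access back (a sketch, not from the thread):

extern unsigned char testvalue[] __attribute__((aligned(4)));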
Comment 3 Julius Werner 2015-09-24 01:37:14 UTC
> I suspect this is an armv7-only issue, where support for misaligned loads/stores is missing.

Did a quick test on aarch64 and you're right, neither of the two issues appears there.

> Also, testvalue is aligned to 1 byte by default, so what GCC is doing is correct, as there is really no way to do volatile unaligned loads.

I thought alignment was always tied to a type (at least by default, unless overridden by __attribute__((aligned)) or the like)? It's true that the expected alignment of testvalue is 1, but I'm not dereferencing testvalue... I'm dereferencing *(volatile unsigned long *)&testvalue. Shouldn't the alignment of that always be 4? (FWIW, _Alignof(*(volatile unsigned long *)&testvalue) evaluates to 4 in both compiler versions.)
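
That claim is easy to check (a sketch using GCC's __alignof__ extension, which, unlike C11 _Alignof, also accepts expressions):

extern unsigned char testvalue[];
_Static_assert(__alignof__(*(volatile unsigned long *)&testvalue) == 4,
               "the lvalue's type requires 4-byte alignment");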

Another argument: if write32() were defined as a non-inline function in a separate compilation unit, this problem would not appear. Should inlining a function really change behavior in this way and produce code that is clearly less efficient (to work around a "problem" that the -O0 version of the same code would never have hit)?
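
To illustrate the point, with the helper out of line in its own file (a hypothetical write32.c, not from the report), GCC only sees an opaque pointer and honors the alignment of the access type:

void write32(void *addr, unsigned int value) {
        /* The cast promises 4-byte alignment, so this compiles to a
           single word-wide store. */
        *(volatile unsigned int *)addr = value;
}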
Comment 4 Richard Biener 2015-09-24 07:48:25 UTC
As said in another bug, GCC tries to be conservative (QOI) and fixes up programmer mistakes when it can see that the alignment of some memory is clearly _not_ what the language requires (as here, when dereferencing an int *).

In this case it sees the global 'unsigned char[]' and concludes that its alignment does not have to be larger than 1 byte.

We could change that behavior with a flag if people like (-fstrict-alignment?).
Comment 5 Eric Botcazou 2015-09-24 13:53:01 UTC
I wonder what the motivation to make this change was as, historically, GCC has never tried to rescue the programmer in this clearly invalid case.  Some obscure situation with SSE on x86?  Do other compilers do the same, e.g. on ARM?
Comment 6 rguenther@suse.de 2015-09-24 14:09:13 UTC
On Thu, 24 Sep 2015, ebotcazou at gcc dot gnu.org wrote:

> I wonder what the motivation to make this change was as, historically, GCC has
> never tried to rescue the programmer in this clearly invalid case.  Some
> obscure situation with SSE on x86?  Do other compilers do the same, e.g. on
> ARM?

Yes, AFAIK this was some obscure situation with SSE on x86.  IIRC it was
code doing unaligned scalar accesses (which is OK on x86) that was then
vectorized using peeling for alignment (which cannot succeed if the
element is not naturally aligned) and segfaulted on the emitted aligned
move instructions.

Of course the pessimistic assumption we make only applies when
the compiler sees the underlying decl, so only a fraction of
all possible cases is handled conservatively that way.

Maybe these days the legacy has been cleaned up enough so we can
remove that conservative handling again...  I think it also causes
us to handle

char c[4];

int main()
{
  if (!((unsigned long)c & 3))
    return *(int *)c;
  return c[0];
}

too conservatively as we expand

  _5 = MEM[(int *)&c];

and thus lose the flow-sensitive info.

Implementation-wise you'd have to adjust get_object_alignment_2 to
get at a MEM_REF base (get_inner_reference will look through MEM_REFs
with a &decl operand 0), possibly by adjusting get_inner_reference
itself to avoid doing the work twice.
Comment 7 Eric Botcazou 2015-09-24 15:13:58 UTC
> Yes, AFAIK this was some obscure situation with SSE on x86.  IIRC it was
> code doing unaligned scalar accesses (which is OK on x86) that was then
> vectorized using peeling for alignment (which cannot succeed if the
> element is not naturally aligned) and segfaulted on the emitted aligned
> move instructions.

I see, thanks for the insight.

> Maybe these days the legacy has been cleaned up enough so we can
> remove that conservative handling again...  I think it also causes
> us to handle
> 
> char c[4];
> 
> int main()
> {
>   if (!((unsigned long)c & 3))
>     return *(int *)c;
>   return c[0];
> }
> 
> too conservatively as we expand
> 
>   _5 = MEM[(int *)&c];
> 
> and thus lose the flow-sensitive info.

The problem is that, in order to fix a legitimate issue on x86, the change pessimizes the code for strict-alignment platforms, where the said issue doesn't exist: the unaligned accesses in the source code would trap there anyway.  And it pessimizes only them, since x86 has unaligned loads/stores.  So, in the end, this is a net loss for strict-alignment platforms.
Comment 8 rguenther@suse.de 2015-09-25 07:38:29 UTC
On Thu, 24 Sep 2015, ebotcazou at gcc dot gnu.org wrote:

> The problem is that, in order to fix a legitimate issue on x86, the change
> pessimizes the code for strict-alignment platforms, where the said issue
> doesn't exist: the unaligned accesses in the source code would trap there
> anyway.  And it pessimizes only them, since x86 has unaligned loads/stores.
> So, in the end, this is a net loss for strict-alignment platforms.

Agreed.  Looking at how to fix this in get_object_alignment_2 I wonder
if it makes sense to unify this function with get_inner_reference.  The
other choice would be to add a flag to get_inner_reference to stop
at MEM_REF/TARGET_MEM_REFs.
Comment 9 Richard Biener 2016-04-15 08:46:36 UTC
Note the issue should be partly mitigated by the fix for PR70424.  The issue
in comment #6 is still there, as we still trust the alignment of an underlying
decl more than the alignment info present on the MEM_REF.
Comment 10 Richard Biener 2016-04-15 08:52:39 UTC
The reason for the legacy handling on x86 targets is indeed not present on strict-alignment platforms, provided that unaligned accesses trap there (and are not merely slow).  Something like the following would keep trusting the access type's alignment on STRICT_ALIGNMENT targets even when the decl's alignment is known:

Index: gcc/builtins.c
===================================================================
--- gcc/builtins.c      (revision 235005)
+++ gcc/builtins.c      (working copy)
@@ -339,7 +339,8 @@ get_object_alignment_2 (tree exp, unsign
         Do so only if get_pointer_alignment_1 did not reveal absolute
         alignment knowledge and if using that alignment would
         improve the situation.  */
-      if (!addr_p && !known_alignment
+      if (!addr_p
+         && (!known_alignment || STRICT_ALIGNMENT)
          && TYPE_ALIGN (TREE_TYPE (exp)) > align)
        align = TYPE_ALIGN (TREE_TYPE (exp));
       else