Bug 42907 - -fstrict-aliasing breaks valid code
Summary: -fstrict-aliasing breaks valid code
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: c (show other bugs)
Version: 4.4.1
: P3 critical
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-01-30 21:10 UTC by fejj
Modified: 2010-02-03 02:16 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
list.c (301 bytes, text/plain)
2010-01-30 21:11 UTC, fejj
Details
gdb-log.txt (786 bytes, text/plain)
2010-01-30 21:13 UTC, fejj
Details

Note You need to log in before you can comment on or make changes to this bug.
Description fejj 2010-01-30 21:10:08 UTC
When compiling the attached C code with gcc 4.4.1 [gcc-4_4-branch revision 150839] w/ -O2 -fstrict-aliasing, gcc outputs broken machine code.

From what I can gather by stepping thru the generated program with gdb, gcc has dropped a number of statements that are required for the code to properly function. For example, it has dropped:

- the initialization of list
- the initialization of tail
- the updating of the tail pointer after each iteration thru the loop (if you pretend like gcc didn't unroll the loop)
- the comparison of list to NULL after the loop

I'll also attach my gdb session.
Comment 1 fejj 2010-01-30 21:11:50 UTC
Created attachment 19760 [details]
list.c

this is a small test-case program to illustrate the bug
Comment 2 fejj 2010-01-30 21:13:54 UTC
Created attachment 19761 [details]
gdb-log.txt

copy & paste of my gdb session
Comment 3 Steven Bosscher 2010-01-30 21:17:41 UTC
You get what you deserve:

$ gcc-4.5.0 -S -O2 -fstrict-aliasing -Wstrict-aliasing=2 q.c 
q.c: In function 'main':
q.c:15:2: warning: dereferencing type-punned pointer might break strict-aliasing rules

That is this line:

        tail = (Node *) &list;
Comment 4 Andrew Pinski 2010-01-30 21:19:54 UTC
More to the point, it is even worse than violating aliasing rules here as tail contains a pointer which can contain only another pointer for the size.
Comment 5 fejj 2010-01-30 21:23:29 UTC
the way I compiled it for that gdb session is as follows:

gcc -Wall -g -O2 -o list list.c

If I use -fno-strict-aliasing or use -O0 (or -O1), it works just fine.
Comment 6 fejj 2010-01-30 21:25:26 UTC
This code has been working since at least gcc 2.7 days and works fine with other compiles (Sun's, Microsoft's, etc).

Seems like this is a very real bug to me.
Comment 7 Richard Biener 2010-01-30 21:28:04 UTC
You violate GCC strict-aliasing rules.
Comment 8 Andrew Pinski 2010-01-30 21:28:58 UTC
Well the code is totally undefined because you are access a Node* via the struct Node.
Comment 9 fejj 2010-01-30 21:33:19 UTC
gcc should do what I've asked it to do.

if I change the cast to:

tail = (Node *) ((char *) &list);

it works, even with -fstrict-aliasing.

In other words, if I trick gcc into not being "smart", it works. gcc shouldn't try to outsmart the programmer - that is just wrong.
Comment 10 Richard Biener 2010-01-30 21:39:29 UTC
GCC is unfortunately not a DWIM compiler but a C compiler.

And obviously tail = (Node *) ((char *) &list); doesn't make it work.
Comment 11 Steven Bosscher 2010-01-30 21:41:57 UTC
It probably does, actually, because next is the first field.

But anyway, still INVALID.
Comment 12 fejj 2010-01-30 21:44:18 UTC
I'm not asking for gcc to be a DWIM compiler, I'm asking for it to do EXACTLY what I've told it to do.
Comment 13 Steven Bosscher 2010-01-30 21:45:54 UTC
fejj, re-open this again and I guess we'll have to ban you.
Comment 14 Richard Biener 2010-01-30 22:00:08 UTC
(In reply to comment #12)
> I'm not asking for gcc to be a DWIM compiler, I'm asking for it to do EXACTLY
> what I've told it to do.

You told it to use C type-based aliasing rules to optimize.  And exactly this
is what it is doing.

Comment 15 fejj 2010-01-30 22:11:10 UTC
Since the code works just fine with -O0, -O1 and/or -fno-strict-aliasing (when using -O2), then it seems to me that gcc is fully capable of doing what I've asked it to do (therefor DWIM doesn't even enter into the equation, here).

The fact that it doesn't as soon as -O2 optimizations (or above) are turned on is what is broken.

I understand that the point of the strict-aliasing functionality in gcc is trying to do some fancy optimizations that it couldn't do normally and I respect that - if I was explicitly turning on -fstrict-aliasing, I'd be inclined to agree with you.

However, this is not the case. gcc is auto-enabling strict-aliasing with -O2 and:

1. it never used to do so
2. it does so without warning (why should I have to explicitly ask it to warn me about something that it is obviously breaking?)

Here's a solution:

if -fstrict-aliasing is not EXPLICITLY requested on the command line (e.g. it is auto-enabled by means of -O2), then gcc should, when it encounters code like in my test case, it should compile that code as if -fno-strict-aliasing had been turned on (e.g. the old default behavior).

It shouldn't be difficult for you guys to implement because you obviously already have the ability to detect that type of code (e.g. -Wstrict-aliasing=2).

So in order to fix this, all you need is a tri-state flag for strict-aliasing optimizations. It seems like it could be done trivially and it would make me happy (and a lot of other people happy, from what I can tell from googling).

Please?
Comment 16 fejj 2010-01-30 22:13:10 UTC
> You told it to use C type-based aliasing rules to optimize.  And exactly this
> is what it is doing.

Richard, no, I didn't. All I asked was -O2 and -O2 never used to enable strict-aliasing (in fact, I never even asked for -O2 - that's just what distros typically build with).
Comment 17 Steven Bosscher 2010-01-30 22:15:13 UTC
GCC has had -fstrict-aliasing enabled since 1998.
Comment 18 fejj 2010-01-30 22:17:28 UTC
Steven: ok, well, it has never mis-compiled that code before 4.4 and so I never noticed it I guess.

But now it is mis-compiling the code, so it has come to my attention with sirens blaring.
Comment 19 Steven Bosscher 2010-01-30 22:20:48 UTC
FWIW the tri-state solution is problematic because there are cases where type punning doesn't lead to this kind of optimizations. For example if the casted pointer is casted back to the correct type, or if there is never a dereference. To disable strict aliasing in such cases leads to missed optimizations. That would make you happy, but it would also make another crowd unhappy again.

Whatever GCC does, it's going to make people unhappy somewhere. This seems to be the fate of compilers everywhere... 

This is a closed discussion. *Please* stop re-opening this bug. You will *not* get what you want. It's nothing personal, it's just that this has been discussed many, many times before and a decision has been made. (And I just read the Troll article on your blog... ;-)
Comment 20 fejj 2010-01-30 22:22:32 UTC
Just so we are all on the same page (in case I haven't been clear)

This bug ONLY manifests when using -O2+ if I don't also use -fno-strict-aliasing.

Meaning: if I do gcc -O0 -fstrict-aliasing, things still work.
Comment 21 fejj 2010-01-30 22:25:55 UTC
Steven:

> FWIW the tri-state solution is problematic because...

Fair enough. This is why you're the gcc developer and not me ;-)

However, correct output should be more important than optimized output.

Also, I fail to see how my blog is considered trolling. gcc *is* breaking code without warning.
Comment 22 Richard Biener 2010-01-30 22:41:15 UTC
(In reply to comment #21)
> Steven:
> 
> > FWIW the tri-state solution is problematic because...
> 
> Fair enough. This is why you're the gcc developer and not me ;-)
> 
> However, correct output should be more important than optimized output.

It is.

> Also, I fail to see how my blog is considered trolling. gcc *is* breaking code
> without warning.

Well, your code invokes undefined behavior according to GCCs interpretation
of the C standard.  So it is not "breaking" it.

Also we can't really do better in the sense of making use of the invariants
the C standards guarantee and at the same time not "break" code like yours.
If it would be trivial and obvious to do so we would do it - it's not like
we are breaking your code just because we can.  It's because the compiler
does not see what you are intending to do.  This giving leeway to code
invoking undefined behavior might improve or regress from release to release.
The most important thing is to generate correct code from well-defined
input (and there are still bugs with that, so ...)
Comment 23 Andrew Pinski 2010-01-30 22:51:14 UTC
GCC and aliasing is one area which talked about a lot.  The problem is what the standard allows to do and what people think the compiler should do (not based on anything except what they are taught) are two different things.  I don't remember any time during my schooling where my teacher even mentioned aliasing rules; though I knew they existed.  So maybe my point is that aliasing is not taught well in any C/C++ classes and we should look into ways of fixing that issue rather than saying GCC should change.  Yes warnings can help in teaching but not always.
Comment 24 fejj 2010-01-30 23:12:05 UTC
Richard: it seems weird to me that gcc's interpretation of the standard differs depending on optimization level. Shouldn't it be consistent?
Comment 25 Steven Bosscher 2010-01-30 23:19:12 UTC
It should be consistent for programs that do not trigger undefined behavior. The behavior of your program is undefined according to the C standard and according to GCC.

The alias discussion is just one of th many where different crowds are asking for different semantics for undefined behavior. But the point of "undefined" is that, well, the behavior is undefined. So, although it is counterintuitive (and in fact, quite unfriendly of the compiler) it is not incorrect to have different program behavior at different optimization levels for a program with undefined behavior.
Comment 26 Andrew Pinski 2010-01-30 23:20:43 UTC
Note another area which is undefined and talked a lot about with many different issues is signed integer overflow.  It has the same issue as aliasing except it is easier to explain in most cases.
Comment 27 fejj 2010-01-30 23:34:38 UTC
Steven: okay, thanks for the explanation.

My concern is that if I hit a similar case down the road and someone builds with different optimization flags than I do, that I'd never figure out the problem.

It was pure luck that I tracked it down to strict aliasing in this case because when I found that -O0 and -O2 produced different code, my first instinct was to google "gcc 4.4 broken" (or some such) and it turned up pages and pages of links to people with similar (although not exactly identical) problems on various mailing-lists and decided to try -fno-strict-aliasing and it worked.
Comment 28 Michael Matz 2010-01-31 20:34:18 UTC
fejj: To see how your initial code is broken, I've transformed it a bit to
show you how it already is "miscompiled" with earlier compilers.  I've manually unrolled the loop, and hoisted the two malloc calls to the front.
You hopefully agree that the function then is completely equivalent
to your initial testcase (assuming validity to start with).  Like so:

% cat broken.c
#include <stdio.h>
#include <stdlib.h>

typedef struct _node {
        struct _node *next;
        int value;
} Node;

int main (int argc, char **argv)
{
        Node *list, *node, *tail;
        Node *n1, *n2;
        int i;

        n1 = malloc (sizeof (Node));
        n2 = malloc (sizeof (Node));

        list = NULL;
        tail = (Node *) &list;

        node = n1;
        node->next = NULL;
        node->value = 0;
        tail->next = node;
        tail = node;

        node = n2;
        node->next = NULL;
        node->value = 2;
        tail->next = node;
        tail = node;

        if (list == NULL)
                printf ("oops, list is null???\n");
        return 0;
}

Compile this with e.g. gcc-4.3 -O2 and see it broken too.  From your blog
entry I gather that you think this has something to do with the recent gcc@
discussion about interpreting the C standard regarding accesses through unions
or similar.  This is not the case.  The code as written simply is completely
unspecified in _any_ interpretation of the standard.  To see this, add this
statement into the list:

    tail->value = tail->value;

This statement must be harmless iff tail indeed points to enough memory to
have a value member.  But in the first iteration tail points to list, a Node*,
hence it holds only enough memory for one pointer.  I.e. would you change
the order of the struct members ('value' first, 'next' second) already the
access to ->next in your original testcase would overwrite non-existing
memory (and depending on circumstance for instance valgrind would complain). 
The list variables simply is only one pointer long, not two.

I hope you see now that we don't even need to invoke aliasing arguments to
show that the testcase was invalid: In the first iteration tail points to
something that doesn't even have enough space to contain a whole Node.  We
don't even need to argue about the type of it.
Comment 29 Eamon Nerbonne 2010-02-01 10:16:42 UTC
What's particularly unfortunate about this instance is the fact that gcc fails to warn you about the erroneous code, despite the obvious signs and despite -Wall.

Line 15 is obviously potentially problematic, but it doesn't show up as such even when -Wall is used.  This means that any improvement in the optimizer which causes it break similarly invalid code may be terribly hard to debug - the code violates the spec, but there's just no hint that it's wrong (and worse, it works fine when compiled with less optmization, such as for debugging).

And additional factor here is that it's statically determinable that this code will violate aliasing rules; line 15 creates an alias, and the next usage of the variable is a write.  I realize that perhaps such read/write dependency analysis isn't available at the time that warnings are normally output (I have no idea about gcc innards), but that means that gcc is being presented with code that raises a red flag (line 15), knowably violates the spec (line 22), and yet doesn't even issue a warning even when compiled with -Wall.  That's bound to make a gcc-users life hard, particularly as there's a bunch of old c code lying around that's probably not entirely up-to-spec.

So, the OP's code is clearly at fault, but that doesn't necessarily mean gcc couldn't do more to prevent this kind of error.
Comment 30 Michael Matz 2010-02-01 12:24:28 UTC
See comment #3 for how (newer) GCC does warn at line 15.  Not with the default
options, though, but this has good reasons (too many false positives).

You are right that we should be able to improve our diagnostics in this case,
as, as you say, line 15 and 22 together create a definitely fishy situation.
Comment 31 fejj 2010-02-01 13:08:33 UTC
Michael:

I know that tail->value = <anything> is clearly going to evil things (if done before assigning one of the nodes to tail) because only the first sizeof(void*) bytes are valid memory addresses. However, if you read the code I wrote - OBVIOUSLY it only ever assigns to those addresses. Maybe it's not obvious to gcc (fair enough), but it should be obvious human programmers (or at least be able to figure it out).

Pretend, instead, that I had:

typedef struct _Node {
    struct _Node *next;
} Node;

Now it should be even more obvious that it should work.
Comment 32 Eamon Nerbonne 2010-02-01 13:24:08 UTC
I realize that you *can* enable a specific warning that might solve this; but that's a pretty unsatisfactory state of affairs.

The point is that if you've old (or external) C code *anywhere* in your app which breaks due to this behavior, there's no hint of what's wrong anywhere; enabling "all" warnings doesn't actually enable the warning you need, and you're stuck.

The problematic bit (to my untrained ears anyhow) isn't that the code breaks, it's that it breaks without warning and didn't previously (despite strict aliasing) and doesn't necessarily break in other compilers (just tried in msvc 10.0, for instance) - so the notion that the diagnostics could be improved should (in principle, anyhow) cover it.
Comment 33 Richard Biener 2010-02-01 17:04:17 UTC
(In reply to comment #32)
> I realize that you *can* enable a specific warning that might solve this; but
> that's a pretty unsatisfactory state of affairs.
> 
> The point is that if you've old (or external) C code *anywhere* in your app
> which breaks due to this behavior, there's no hint of what's wrong anywhere;
> enabling "all" warnings doesn't actually enable the warning you need, and
> you're stuck.
> 
> The problematic bit (to my untrained ears anyhow) isn't that the code breaks,
> it's that it breaks without warning and didn't previously (despite strict
> aliasing) and doesn't necessarily break in other compilers (just tried in msvc
> 10.0, for instance) - so the notion that the diagnostics could be improved
> should (in principle, anyhow) cover it.

Patches welcome.
Comment 34 fejj 2010-02-03 02:16:22 UTC
just an FYI, but if you take my original list.c attachment and uncomment the second loop (the one that prints the values of each node after the creation loop), even if you use `gcc -Wall -g -O2 -o list list.c`, the list will be non-NULL. Comment that loop, and it goes back to being NULL.