When compiling the attached C code with gcc 4.4.1 [gcc-4_4-branch revision 150839] w/ -O2 -fstrict-aliasing, gcc outputs broken machine code. From what I can gather by stepping thru the generated program with gdb, gcc has dropped a number of statements that are required for the code to properly function. For example, it has dropped: - the initialization of list - the initialization of tail - the updating of the tail pointer after each iteration thru the loop (if you pretend like gcc didn't unroll the loop) - the comparison of list to NULL after the loop I'll also attach my gdb session.
Created attachment 19760 [details] list.c this is a small test-case program to illustrate the bug
Created attachment 19761 [details] gdb-log.txt copy & paste of my gdb session
You get what you deserve: $ gcc-4.5.0 -S -O2 -fstrict-aliasing -Wstrict-aliasing=2 q.c q.c: In function 'main': q.c:15:2: warning: dereferencing type-punned pointer might break strict-aliasing rules That is this line: tail = (Node *) &list;
More to the point, it is even worse than violating aliasing rules here as tail contains a pointer which can contain only another pointer for the size.
the way I compiled it for that gdb session is as follows: gcc -Wall -g -O2 -o list list.c If I use -fno-strict-aliasing or use -O0 (or -O1), it works just fine.
This code has been working since at least gcc 2.7 days and works fine with other compiles (Sun's, Microsoft's, etc). Seems like this is a very real bug to me.
You violate GCC strict-aliasing rules.
Well the code is totally undefined because you are access a Node* via the struct Node.
gcc should do what I've asked it to do. if I change the cast to: tail = (Node *) ((char *) &list); it works, even with -fstrict-aliasing. In other words, if I trick gcc into not being "smart", it works. gcc shouldn't try to outsmart the programmer - that is just wrong.
GCC is unfortunately not a DWIM compiler but a C compiler. And obviously tail = (Node *) ((char *) &list); doesn't make it work.
It probably does, actually, because next is the first field. But anyway, still INVALID.
I'm not asking for gcc to be a DWIM compiler, I'm asking for it to do EXACTLY what I've told it to do.
fejj, re-open this again and I guess we'll have to ban you.
(In reply to comment #12) > I'm not asking for gcc to be a DWIM compiler, I'm asking for it to do EXACTLY > what I've told it to do. You told it to use C type-based aliasing rules to optimize. And exactly this is what it is doing.
Since the code works just fine with -O0, -O1 and/or -fno-strict-aliasing (when using -O2), then it seems to me that gcc is fully capable of doing what I've asked it to do (therefor DWIM doesn't even enter into the equation, here). The fact that it doesn't as soon as -O2 optimizations (or above) are turned on is what is broken. I understand that the point of the strict-aliasing functionality in gcc is trying to do some fancy optimizations that it couldn't do normally and I respect that - if I was explicitly turning on -fstrict-aliasing, I'd be inclined to agree with you. However, this is not the case. gcc is auto-enabling strict-aliasing with -O2 and: 1. it never used to do so 2. it does so without warning (why should I have to explicitly ask it to warn me about something that it is obviously breaking?) Here's a solution: if -fstrict-aliasing is not EXPLICITLY requested on the command line (e.g. it is auto-enabled by means of -O2), then gcc should, when it encounters code like in my test case, it should compile that code as if -fno-strict-aliasing had been turned on (e.g. the old default behavior). It shouldn't be difficult for you guys to implement because you obviously already have the ability to detect that type of code (e.g. -Wstrict-aliasing=2). So in order to fix this, all you need is a tri-state flag for strict-aliasing optimizations. It seems like it could be done trivially and it would make me happy (and a lot of other people happy, from what I can tell from googling). Please?
> You told it to use C type-based aliasing rules to optimize. And exactly this > is what it is doing. Richard, no, I didn't. All I asked was -O2 and -O2 never used to enable strict-aliasing (in fact, I never even asked for -O2 - that's just what distros typically build with).
GCC has had -fstrict-aliasing enabled since 1998.
Steven: ok, well, it has never mis-compiled that code before 4.4 and so I never noticed it I guess. But now it is mis-compiling the code, so it has come to my attention with sirens blaring.
FWIW the tri-state solution is problematic because there are cases where type punning doesn't lead to this kind of optimizations. For example if the casted pointer is casted back to the correct type, or if there is never a dereference. To disable strict aliasing in such cases leads to missed optimizations. That would make you happy, but it would also make another crowd unhappy again. Whatever GCC does, it's going to make people unhappy somewhere. This seems to be the fate of compilers everywhere... This is a closed discussion. *Please* stop re-opening this bug. You will *not* get what you want. It's nothing personal, it's just that this has been discussed many, many times before and a decision has been made. (And I just read the Troll article on your blog... ;-)
Just so we are all on the same page (in case I haven't been clear) This bug ONLY manifests when using -O2+ if I don't also use -fno-strict-aliasing. Meaning: if I do gcc -O0 -fstrict-aliasing, things still work.
Steven: > FWIW the tri-state solution is problematic because... Fair enough. This is why you're the gcc developer and not me ;-) However, correct output should be more important than optimized output. Also, I fail to see how my blog is considered trolling. gcc *is* breaking code without warning.
(In reply to comment #21) > Steven: > > > FWIW the tri-state solution is problematic because... > > Fair enough. This is why you're the gcc developer and not me ;-) > > However, correct output should be more important than optimized output. It is. > Also, I fail to see how my blog is considered trolling. gcc *is* breaking code > without warning. Well, your code invokes undefined behavior according to GCCs interpretation of the C standard. So it is not "breaking" it. Also we can't really do better in the sense of making use of the invariants the C standards guarantee and at the same time not "break" code like yours. If it would be trivial and obvious to do so we would do it - it's not like we are breaking your code just because we can. It's because the compiler does not see what you are intending to do. This giving leeway to code invoking undefined behavior might improve or regress from release to release. The most important thing is to generate correct code from well-defined input (and there are still bugs with that, so ...)
GCC and aliasing is one area which talked about a lot. The problem is what the standard allows to do and what people think the compiler should do (not based on anything except what they are taught) are two different things. I don't remember any time during my schooling where my teacher even mentioned aliasing rules; though I knew they existed. So maybe my point is that aliasing is not taught well in any C/C++ classes and we should look into ways of fixing that issue rather than saying GCC should change. Yes warnings can help in teaching but not always.
Richard: it seems weird to me that gcc's interpretation of the standard differs depending on optimization level. Shouldn't it be consistent?
It should be consistent for programs that do not trigger undefined behavior. The behavior of your program is undefined according to the C standard and according to GCC. The alias discussion is just one of th many where different crowds are asking for different semantics for undefined behavior. But the point of "undefined" is that, well, the behavior is undefined. So, although it is counterintuitive (and in fact, quite unfriendly of the compiler) it is not incorrect to have different program behavior at different optimization levels for a program with undefined behavior.
Note another area which is undefined and talked a lot about with many different issues is signed integer overflow. It has the same issue as aliasing except it is easier to explain in most cases.
Steven: okay, thanks for the explanation. My concern is that if I hit a similar case down the road and someone builds with different optimization flags than I do, that I'd never figure out the problem. It was pure luck that I tracked it down to strict aliasing in this case because when I found that -O0 and -O2 produced different code, my first instinct was to google "gcc 4.4 broken" (or some such) and it turned up pages and pages of links to people with similar (although not exactly identical) problems on various mailing-lists and decided to try -fno-strict-aliasing and it worked.
fejj: To see how your initial code is broken, I've transformed it a bit to show you how it already is "miscompiled" with earlier compilers. I've manually unrolled the loop, and hoisted the two malloc calls to the front. You hopefully agree that the function then is completely equivalent to your initial testcase (assuming validity to start with). Like so: % cat broken.c #include <stdio.h> #include <stdlib.h> typedef struct _node { struct _node *next; int value; } Node; int main (int argc, char **argv) { Node *list, *node, *tail; Node *n1, *n2; int i; n1 = malloc (sizeof (Node)); n2 = malloc (sizeof (Node)); list = NULL; tail = (Node *) &list; node = n1; node->next = NULL; node->value = 0; tail->next = node; tail = node; node = n2; node->next = NULL; node->value = 2; tail->next = node; tail = node; if (list == NULL) printf ("oops, list is null???\n"); return 0; } Compile this with e.g. gcc-4.3 -O2 and see it broken too. From your blog entry I gather that you think this has something to do with the recent gcc@ discussion about interpreting the C standard regarding accesses through unions or similar. This is not the case. The code as written simply is completely unspecified in _any_ interpretation of the standard. To see this, add this statement into the list: tail->value = tail->value; This statement must be harmless iff tail indeed points to enough memory to have a value member. But in the first iteration tail points to list, a Node*, hence it holds only enough memory for one pointer. I.e. would you change the order of the struct members ('value' first, 'next' second) already the access to ->next in your original testcase would overwrite non-existing memory (and depending on circumstance for instance valgrind would complain). The list variables simply is only one pointer long, not two. I hope you see now that we don't even need to invoke aliasing arguments to show that the testcase was invalid: In the first iteration tail points to something that doesn't even have enough space to contain a whole Node. We don't even need to argue about the type of it.
What's particularly unfortunate about this instance is the fact that gcc fails to warn you about the erroneous code, despite the obvious signs and despite -Wall. Line 15 is obviously potentially problematic, but it doesn't show up as such even when -Wall is used. This means that any improvement in the optimizer which causes it break similarly invalid code may be terribly hard to debug - the code violates the spec, but there's just no hint that it's wrong (and worse, it works fine when compiled with less optmization, such as for debugging). And additional factor here is that it's statically determinable that this code will violate aliasing rules; line 15 creates an alias, and the next usage of the variable is a write. I realize that perhaps such read/write dependency analysis isn't available at the time that warnings are normally output (I have no idea about gcc innards), but that means that gcc is being presented with code that raises a red flag (line 15), knowably violates the spec (line 22), and yet doesn't even issue a warning even when compiled with -Wall. That's bound to make a gcc-users life hard, particularly as there's a bunch of old c code lying around that's probably not entirely up-to-spec. So, the OP's code is clearly at fault, but that doesn't necessarily mean gcc couldn't do more to prevent this kind of error.
See comment #3 for how (newer) GCC does warn at line 15. Not with the default options, though, but this has good reasons (too many false positives). You are right that we should be able to improve our diagnostics in this case, as, as you say, line 15 and 22 together create a definitely fishy situation.
Michael: I know that tail->value = <anything> is clearly going to evil things (if done before assigning one of the nodes to tail) because only the first sizeof(void*) bytes are valid memory addresses. However, if you read the code I wrote - OBVIOUSLY it only ever assigns to those addresses. Maybe it's not obvious to gcc (fair enough), but it should be obvious human programmers (or at least be able to figure it out). Pretend, instead, that I had: typedef struct _Node { struct _Node *next; } Node; Now it should be even more obvious that it should work.
I realize that you *can* enable a specific warning that might solve this; but that's a pretty unsatisfactory state of affairs. The point is that if you've old (or external) C code *anywhere* in your app which breaks due to this behavior, there's no hint of what's wrong anywhere; enabling "all" warnings doesn't actually enable the warning you need, and you're stuck. The problematic bit (to my untrained ears anyhow) isn't that the code breaks, it's that it breaks without warning and didn't previously (despite strict aliasing) and doesn't necessarily break in other compilers (just tried in msvc 10.0, for instance) - so the notion that the diagnostics could be improved should (in principle, anyhow) cover it.
(In reply to comment #32) > I realize that you *can* enable a specific warning that might solve this; but > that's a pretty unsatisfactory state of affairs. > > The point is that if you've old (or external) C code *anywhere* in your app > which breaks due to this behavior, there's no hint of what's wrong anywhere; > enabling "all" warnings doesn't actually enable the warning you need, and > you're stuck. > > The problematic bit (to my untrained ears anyhow) isn't that the code breaks, > it's that it breaks without warning and didn't previously (despite strict > aliasing) and doesn't necessarily break in other compilers (just tried in msvc > 10.0, for instance) - so the notion that the diagnostics could be improved > should (in principle, anyhow) cover it. Patches welcome.
just an FYI, but if you take my original list.c attachment and uncomment the second loop (the one that prints the values of each node after the creation loop), even if you use `gcc -Wall -g -O2 -o list list.c`, the list will be non-NULL. Comment that loop, and it goes back to being NULL.