Code:

struct obj { int __pad; int i; };

/* aborts when called with NULL */
int assert_not_null(void *);

void bug(struct obj **root, struct obj *dso) {
    while (1) {
        struct obj *this = *root;
        if (dso == (void *)0) // should return here
            return;
        if (dso == this)
            return; // shouldn't reach here
        assert_not_null(dso);
        if (!&dso->i)
            break;
    }
}

// call like this: bug(&obj, NULL);

Result:
* -O1: ok
* -O1 -funswitch-loops: ok
* -O1 -fno-strict-overflow: ok
* -O1 -funswitch-loops -fno-strict-overflow: abort
The code is reduced from perf, source file util/dsos.c.
Regression from GCC 12.3 to 13.2.
Looks like the unswitch is happening when it should not be ...
The incorrect unswitch has been happening since at least GCC 5 ...
Confirmed, at least for the bad unswitch, which causes the other wrong code to happen.
trunk doesn't unswitch for me (needs bisection). Let me check what happens on the branch.
The IL after unswitching looks OK, but we assume that when `&dso->i` is NULL then dso == NULL, and when `&dso->i` is not NULL then dso isn't either. I think this is a ranger bug that has been fixed on trunk but possibly not yet backported, so we have a duplicate somewhere. Bisection will tell.
Note that on trunk we have jump-threaded this to move `dso == (void *)0 || dso == this` out of the loop, so there's nothing to unswitch and the bad circumstances likely do not trigger. I still remember the ranger bug though.
So, to add some details to what people said above, and to make sure I understood this correctly, what GCC did here is:

1. Move the `if (!&dso->i)` branch out of the loop and duplicate the loop for the then/else branches.
2. `&dso->i` cannot be 0, so the else branch is eliminated.
3. In the then branch, this condition confused the compiler: it thought that since `&dso->i` is not 0, `dso` is not 0 either, causing the bug.

I diffed the `-fdump` outputs and they match what I expected to see (minus is the correct one, plus the incorrect):

@@ -2686,7 +2686,11 @@
 ;; 4 succs { 5 6 }
 ;; 6 succs { 3 }
 ;; 5 succs { 1 }
-Removing basic block 6
+Folding predicate _1 == 0B to 0
+Exported global range table:
+============================
+_1 : [irange] int * [1, +INF]
+Merging blocks 4 and 6

This made me wonder whether loop unswitching is actually relevant here, since 12.3 also unswitches this loop but doesn't miscompile. So I manually unswitched the loop:

void bug(struct obj **root, struct obj *dso) {
    if (!&dso->i) {
        while (1) {
            struct obj *this = *root;
            if (dso == (void *)0) // should return here
                return;
            if (dso == this)
                return; // shouldn't reach here
            assert_not_null(dso);
            break;
        }
    } else {
        while (1) {
            struct obj *this = *root;
            if (dso == (void *)0) // should return here
                return;
            if (dso == this)
                return; // shouldn't reach here
            assert_not_null(dso);
        }
    }
}

and now the miscompile happens without -funswitch-loops. I wonder if this happens on trunk as well. Looks like this is a -fno-strict-overflow-related ranger bug.
the manually unswitched version can probably be reduced further.
Reduced it a bit:

void bug(struct obj **root, struct obj *dso) {
    if (&dso->i) {
        while (1) {
            struct obj *this = *root;
            if (dso == (void *)0) // should return here
                return;
            if (dso == this)
                return; // shouldn't reach here
            assert_not_null(dso);
        }
    }
}
I think this is the MRE:

void bug(struct obj *dso) {
    if (&dso->i) {
        if (dso == (void *)0)
            return;

        assert_not_null(dso);
    }
}
(In reply to Yuxuan Shui from comment #12)
> I think this is the MRE:
>
> void bug(struct obj *dso) {
>     if (&dso->i) {
>         if (dso == (void *)0)
>             return;
>
>         assert_not_null(dso);
>     }
> }

Except that is undefined ...
Manually unswitching introduces the undefined behavior into the code. So even though the code was unswitched incorrectly before GCC 13, GCC didn't take that into account beforehand.

I am 99% sure what is happening is that GCC sees `if (a+1)` and then assumes `a` can't be a null pointer, which is due to the undefinedness of adding 1 to a null pointer (for C, that is).

Basically the unswitch is the issue ... Or maybe we should turn `if (a+1)` into just `if (a)`, and likewise `if (&a->i)` into `if (a)`.
I don't see a complete testcase that I could bisect.
(In reply to Marek Polacek from comment #14)
> I don't see a complete testcase that I could bisect.

Please use the code sample in the original comment, since there are questions about whether the manually unswitched version is undefined. Link that code with this:

#include <stdlib.h>

struct obj { int __pad; int i; };

/* aborts when called with NULL */
int assert_not_null(void *n) {
    if (!n)
        abort();
    return 0;
}

void bug(struct obj **root, struct obj *dso);

int main() {
    struct obj x = {0};
    struct obj *y = &x;
    bug(&y, NULL);
    return 0;
}
(In reply to Andrew Pinski from comment #13)
> Except that is undefined ...
> Manually unswitching introduces the undefined behavior into the code.
> [...]
> I am 99% sure what is happening is that GCC sees `if (a+1)` and then
> assumes `a` can't be a null pointer, which is due to the undefinedness
> of adding 1 to a null pointer (for C, that is).
>
> Basically the unswitch is the issue ... Or maybe we should turn `if (a+1)`
> into just `if (a)`, and likewise `if (&a->i)` into `if (a)`.

I see. But if it's undefined, why was the `if (dso)` only removed when -fno-strict-overflow is enabled? And it still happens with `-fno-delete-null-pointer-checks`.
(In reply to Yuxuan Shui from comment #16)
> I see. But if it's undefined, why was the `if (dso)` only removed when
> -fno-strict-overflow is enabled? And it still happens with
> `-fno-delete-null-pointer-checks`.

-fno-strict-overflow implies -fwrapv-pointer, so pointer arithmetic is assumed to wrap, and then `a+1` could in theory wrap to a null pointer. So many different hooks/options. It is better to use -fwrapv if you only want signed integer overflow to be defined (as wrapping), rather than making pointer arithmetic overflow defined too.
(In reply to Andrew Pinski from comment #17)
> -fno-strict-overflow implies -fwrapv-pointer, so pointer arithmetic is
> assumed to wrap, and then `a+1` could in theory wrap to a null pointer.

Hmm, but why does that make the null check, which previously was not removable, removable?

Also, another observation: if I change `struct obj *dso` to `int *dso`, and `&dso->i` to `&dso[1]`, then the null check is preserved, despite this code still being undefined.
On Tue, 23 Jan 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113551
>
> --- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> (In reply to Yuxuan Shui from comment #12)
> > I think this is the MRE:
> >
> > void bug(struct obj *dso) {
> >     if (&dso->i) {
> >         if (dso == (void *)0)
> >             return;
> >
> >         assert_not_null(dso);
> >     }
> > }
>
> Except that is undefined ...
> [...]
> Basically the unswitch is the issue ... Or maybe we should turn `if (a+1)`
> into just `if (a)`, and likewise `if (&a->i)` into `if (a)`.

It's not undefined on GIMPLE. As said, I believe this was fixed for GCC 14 but apparently not backported. Bisection should tell.
For gcc-14 this test case was fixed by dc48d1d1d4458773f89f21b2f019f66ddf88f2e5, the first new commit:

commit dc48d1d1d4458773f89f21b2f019f66ddf88f2e5
Author: Andrew MacLeod <amacleod@redhat.com>
Date:   Thu Aug 17 11:13:14 2023 -0400

    Fix range-ops operator_addr.

    Lack of symbolic information prevents op1_range from being able to
    draw the same conclusions as fold_range can.

        PR tree-optimization/111009
        gcc/
        * range-op.cc (operator_addr_expr::op1_range): Be more restrictive.

That commit was backported to gcc-13 in 421311a31a12b96143eb901fde0e020771fe71d4, and that fixed this test case there too. Not sure if this is the proper fix or just makes the issue latent.
yes, it looks suspiciously close to what I'd expect.