Bug 113551 - [13 Regression] Miscompilation with -O1 -funswitch-loops -fno-strict-overflow
Summary: [13 Regression] Miscompilation with -O1 -funswitch-loops -fno-strict-overflow
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 13.2.0
: P2 normal
Target Milestone: 13.3
Assignee: Not yet assigned to anyone
URL:
Keywords: needs-bisection, wrong-code
Depends on:
Blocks:
 
Reported: 2024-01-23 02:25 UTC by Yuxuan Shui
Modified: 2024-05-14 14:31 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work: 13.2.1, 14.0
Known to fail: 13.2.0
Last reconfirmed: 2024-01-23 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Yuxuan Shui 2024-01-23 02:25:17 UTC
Code:

struct obj {
	int __pad;
	int i;
};

/* aborts when called with NULL */
int assert_not_null(void *); 

void bug(struct obj **root, struct obj *dso) {
	while (1) {
		struct obj *this = *root;

		if (dso == (void *)0)
			// should return here
			return;

		if (dso == this)
			return;

		// shouldn't reach here
		assert_not_null(dso);

		if (!&dso->i)
			break;
	}
}

// call like this: bug(&obj, NULL);

Result:

* -O1: ok
* -O1 -funswitch-loops: ok
* -O1 -fno-strict-overflow: ok
* -O1 -funswitch-loops -fno-strict-overflow: abort
Comment 1 Yuxuan Shui 2024-01-23 02:28:32 UTC
code is reduced from perf, source file util/dsos.c
Comment 2 Yuxuan Shui 2024-01-23 02:31:04 UTC
regression from 12.3 -> 13.2
Comment 3 Andrew Pinski 2024-01-23 02:31:22 UTC
Looks like the unswitch is happening when it should not be ...
Comment 4 Andrew Pinski 2024-01-23 02:32:17 UTC
The incorrect unswitch has been happening since at least GCC 5 ...
Comment 5 Andrew Pinski 2024-01-23 04:29:13 UTC
Confirmed at least for the bad unswitch which causes the other wrong code to happen.
Comment 6 Richard Biener 2024-01-23 07:54:19 UTC
trunk doesn't unswitch for me (needs bisection).  Let me check what happens on the branch.
Comment 7 Richard Biener 2024-01-23 08:03:49 UTC
The IL after unswitching looks OK, but we assume that when &dso->i is NULL
then dso == NULL and when &dso->i is not NULL then dso also isn't.

I think this is a ranger bug that has been fixed on trunk
but eventually not yet backported, thus we have a duplicate somewhere.

Bisection will tell.
Comment 8 Richard Biener 2024-01-23 08:06:16 UTC
Note on trunk we have jump-threaded this to move dso == (void *)0 || dso == this out of the loop so there's nothing to unswitch and the bad circumstances likely do not trigger.

I still remember the ranger bug though.
Comment 9 Yuxuan Shui 2024-01-23 23:23:38 UTC
So, to add some details to people said above, and to make sure I understood this correctly.

What gcc did here is:

1. move the `if (!&dso->i)` branch out of the loop, duplicate the loop for the then/else branches.
2. `&dso->i` cannot be 0, so the else branch is eliminated.
3. in the then branch, this condition confused the compiler. it thought since `&dso->i` is not 0, `dso` is not 0 either, causing the bug.

I diffed `-fdump` outputs and it does match what I expected to see.

(minus is the correct one, plus the incorrect)

@@ -2686,7 +2686,11 @@
 ;; 4 succs { 5 6 }
 ;; 6 succs { 3 }
 ;; 5 succs { 1 }
-Removing basic block 6
+Folding predicate _1 == 0B to 0
+Exported global range table:
+============================
+_1  : [irange] int * [1, +INF]
+Merging blocks 4 and 6

* * *

this made me think whether loop unswitching is actually relevant here. since 12.3 also unswitches this loop, but doesn't miscompile. so I manually unswitched the loop:


void bug(struct obj **root, struct obj *dso) {
	if (!&dso->i) {
		while (1) {
			struct obj *this = *root;

			if (dso == (void *)0)
				// should return here
				return;

			if (dso == this)
				return;

			// shouldn't reach here
			assert_not_null(dso);
			break;
		}
	} else {
		while (1) {
			struct obj *this = *root;

			if (dso == (void *)0)
				// should return here
				return;

			if (dso == this)
				return;

			// shouldn't reach here
			assert_not_null(dso);
		}
	}
}

and now the miscompile happens without -funswitch-loops. I wonder if this happens on trunk as well.

looks this this is a -fno-strict-overflow related ranger bug.
Comment 10 Yuxuan Shui 2024-01-23 23:24:14 UTC
the manually unswitched version can probably be reduced further.
Comment 11 Yuxuan Shui 2024-01-23 23:28:41 UTC
reduced it a bit:


void bug(struct obj **root, struct obj *dso) {
	if (&dso->i) {
		while (1) {
			struct obj *this = *root;

			if (dso == (void *)0)
				// should return here
				return;

			if (dso == this)
				return;

			// shouldn't reach here
			assert_not_null(dso);
		}
	}
}
Comment 12 Yuxuan Shui 2024-01-23 23:34:32 UTC
I think this is the MRE:


void bug(struct obj *dso) {
	if (&dso->i) {
		if (dso == (void *)0)
			return;

		assert_not_null(dso);
	}
}
Comment 13 Andrew Pinski 2024-01-23 23:40:37 UTC
(In reply to Yuxuan Shui from comment #12)
> I think this is the MRE:
> 
> 
> void bug(struct obj *dso) {
> 	if (&dso->i) {
> 		if (dso == (void *)0)
> 			return;
> 
> 		assert_not_null(dso);
> 	}
> }

Except that is undefined ...
Manually unswitching introduces the undefined behavior in the code.
So even though the code was unswitched before GCC 13 incorrectly, GCC didn't take into that account before hand.

I am 99% sure what is happening is GCC is see `if (a+1)` and then assuming a can't be a nullptr. Which is due to undefinedness there of adding 1 to an null ptr ... (for C that is).

Basically the unswitch is the issue ... Or we maybe we should turn `if (a+1)` into just `if (a)` ...  Likewise for `if (&a->i)` into `if (a)`
Comment 14 Marek Polacek 2024-01-23 23:42:26 UTC
I don't see a complete testcase that I could bisect.
Comment 15 Yuxuan Shui 2024-01-24 00:19:08 UTC
(In reply to Marek Polacek from comment #14)
> I don't see a complete testcase that I could bisect.

Please use the code sample in the original comment. since there are questions that the manually unswitched version is undefined.

link that code with this:

#include<stdlib.h>
struct obj {
	int __pad;
	int i;
};
/* aborts when called with NULL */
int assert_not_null(void *n) {
        if (!n)
                abort();
}
void bug(struct obj **root, struct obj *dso);

int main() {
        struct obj x = {};
        struct obj *y = &x;
        bug(y, NULL);
}
Comment 16 Yuxuan Shui 2024-01-24 00:21:39 UTC
(In reply to Andrew Pinski from comment #13)
> (In reply to Yuxuan Shui from comment #12)
>> ...
> 
> Except that is undefined ...
> Manually unswitching introduces the undefined behavior in the code.
> So even though the code was unswitched before GCC 13 incorrectly, GCC didn't
> take into that account before hand.
> 
> I am 99% sure what is happening is GCC is see `if (a+1)` and then assuming a
> can't be a nullptr. Which is due to undefinedness there of adding 1 to an
> null ptr ... (for C that is).
> 
> Basically the unswitch is the issue ... Or we maybe we should turn `if
> (a+1)` into just `if (a)` ...  Likewise for `if (&a->i)` into `if (a)`

I see. but if it's undefined, why was the `if (dso)` only removed when -fno-strict-overflow is enabled? and it still happens with `-fno-delete-null-pointer-checks`
Comment 17 Andrew Pinski 2024-01-24 00:27:26 UTC
(In reply to Yuxuan Shui from comment #16)
> I see. but if it's undefined, why was the `if (dso)` only removed when
> -fno-strict-overflow is enabled? and it still happens with
> `-fno-delete-null-pointer-checks`

So -fno-strict-overflow does -fno-wrapv-pointer so we can assume pointer arithmetic wraps now and then `a+1` could in theory wrap to nullptr.
So many different hooks/options. It is better to use -fwrapv if you only want signed integer overflow being defined (as wrapping) rather than pointer arithmetic overflow being defined too.
Comment 18 Yuxuan Shui 2024-01-24 00:39:47 UTC
(In reply to Andrew Pinski from comment #17)
> (In reply to Yuxuan Shui from comment #16)
> > ...
> 
> So -fno-strict-overflow does -fno-wrapv-pointer so we can assume pointer
> arithmetic wraps now and then `a+1` could in theory wrap to nullptr.

hmm, but why does that make the null check that previously was not removable, removable?

also another observation, if I change `struct obj *dso` to `int *dso`, and `&dso->i` to `&dso[1]`, then the null check will be preserved. despite this code still being undefined?
Comment 19 rguenther@suse.de 2024-01-24 08:17:26 UTC
On Tue, 23 Jan 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113551
> 
> --- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> (In reply to Yuxuan Shui from comment #12)
> > I think this is the MRE:
> > 
> > 
> > void bug(struct obj *dso) {
> > 	if (&dso->i) {
> > 		if (dso == (void *)0)
> > 			return;
> > 
> > 		assert_not_null(dso);
> > 	}
> > }
> 
> Except that is undefined ...
> Manually unswitching introduces the undefined behavior in the code.
> So even though the code was unswitched before GCC 13 incorrectly, GCC didn't
> take into that account before hand.
> 
> I am 99% sure what is happening is GCC is see `if (a+1)` and then assuming a
> can't be a nullptr. Which is due to undefinedness there of adding 1 to an null
> ptr ... (for C that is).
> 
> Basically the unswitch is the issue ... Or we maybe we should turn `if (a+1)`
> into just `if (a)` ...  Likewise for `if (&a->i)` into `if (a)`

It's not undefined on GIMPLE.  As said I believe this was fixed for GCC 14
but appearantly not backported.  bisection should tell
Comment 20 Mikael Pettersson 2024-05-14 13:56:05 UTC
For gcc-14 this test case was fixed by 

dc48d1d1d4458773f89f21b2f019f66ddf88f2e5 is the first new commit
commit dc48d1d1d4458773f89f21b2f019f66ddf88f2e5
Author: Andrew MacLeod <amacleod@redhat.com>
Date:   Thu Aug 17 11:13:14 2023 -0400

    Fix range-ops operator_addr.
    
    Lack of symbolic information prevents op1_range from beig able to draw
    the same conclusions as fold_range can.
    
            PR tree-optimization/111009
            gcc/
            * range-op.cc (operator_addr_expr::op1_range): Be more restrictive.

That commit was backported to gcc-13 in 421311a31a12b96143eb901fde0e020771fe71d4, and that fixed this test case there too.

Not sure if this is the proper fix or just makes the issue latent.
Comment 21 Richard Biener 2024-05-14 14:31:51 UTC
yes, it looks suspiciously close to what I'd expect.