Bug 94734 - [10 Regression] Program crashes when compiled with -O2 since r10-1892-gb9ef6a2e04bfd013
Summary: [10 Regression] Program crashes when compiled with -O2 since r10-1892-gb9ef6a...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 10.0
: P1 normal
Target Milestone: 10.0
Assignee: Jakub Jelinek
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2020-04-23 17:17 UTC by John Högberg
Modified: 2020-04-24 22:24 UTC (History)
7 users (show)

See Also:
Host:
Target:
Build:
Known to work: 9.3.0
Known to fail: 10.0
Last reconfirmed: 2020-04-23 00:00:00


Attachments
Test case program (710 bytes, text/x-csrc)
2020-04-23 17:17 UTC, John Högberg
Details
Improved testcase (681 bytes, text/x-csrc)
2020-04-23 18:12 UTC, John Högberg
Details
gcc10-pr94734.patch (1.21 KB, patch)
2020-04-24 09:29 UTC, Jakub Jelinek
Details | Diff
gcc10-pr94734.patch (1.43 KB, patch)
2020-04-24 10:37 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description John Högberg 2020-04-23 17:17:42 UTC
Created attachment 48361 [details]
Test case program

The attached program crashes when it has been compiled with GCC 10 and -O2. With -O3 or -Os it returns an incorrect value instead.

There are no warnings with `-Wall -Wextra` or `-fanalyzer`, and strangely enough `-fsanitize=undefined` hides the crash without reporting any errors.

-fno-aggressive-loop-optimizations also makes the crash go away which made me suspect that the code is wrong somehow, but `frama-c` insists that it's correct and should always return 0. If there's a bug in there, I'm not seeing it. :(

Steps to reproduce:

----
$ gcc -O2 test.c -o bug
$ ./bug
Segmentation fault (core dumped)
----

GCC version:

----
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/10.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.0.1 20200423 (experimental) (GCC) 
----

It appears to work fine with GCC 7, I tested it with:

----
$ gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
----
Comment 1 John Högberg 2020-04-23 18:12:25 UTC
Created attachment 48363 [details]
Improved testcase
Comment 2 Martin Liška 2020-04-23 19:19:37 UTC
Confirmed, ASAN is also fine. Started with r10-1892-gb9ef6a2e04bfd013.
Comment 3 Jeffrey A. Law 2020-04-23 21:24:27 UTC
Looks like we're running past the end of the array with the stores in the loop.
Comment 4 Jakub Jelinek 2020-04-23 23:24:20 UTC
+  int cstore_31;
+  int cstore_32;
 
   <bb 2> [local count: 114863530]:
   goto <bb 7>; [100.00%]
 
   <bb 3> [local count: 1014686026]:
   _1 = (long unsigned int) sum_a_7;
   _2 = _1 * 8;
   _3 = input_21(D) + _2;
   _4 = *_3;
   if (_4 == 0B)
     goto <bb 15>; [5.50%]
   else
     goto <bb 4>; [94.50%]
 
   <bb 4> [local count: 958878296]:
   if (sum_a_7 <= 1)
-    goto <bb 5>; [28.10%]
+    goto <bb 6>; [28.10%]
   else
-    goto <bb 6>; [71.90%]
+    goto <bb 5>; [71.90%]
 
-  <bb 5> [local count: 269444804]:
-  arr[sum_a_7] = 1;
+  <bb 5> [local count: 689433492]:
+  cstore_32 = MEM <int[2]> [(void *)&arr][sum_a_7];
 
   <bb 6> [local count: 958878296]:
+  # cstore_31 = PHI <1(4), cstore_32(5)>
+  MEM <int[2]> [(void *)&arr][sum_a_7] = cstore_31;
   sum_a_23 = sum_a_7 + 1;

done by cselim looks just plain wrong, there is no dominating load from that memory, so even when the variable is an automatic variable, there is no guarantee it won't be out of bounds and thus crash already on the load, or just modify random unrelated memory.
Comment 5 Jakub Jelinek 2020-04-23 23:25:48 UTC
All the PR89430 testcases have such a load.
Comment 6 Jeffrey A. Law 2020-04-24 00:48:18 UTC
THe whole point of that change is to not require a dominating load if the object comes from the stack.

We conditionally load from the same location, then have a PHI which selects that loaded value or "1" and we store the result.  That's precisely what is supposed to be happening.

The loop in question looks like:

;;   basic block 3, loop depth 1
;;    pred:       2
;;                6
  # ivtmp.13_9 = PHI <0(2), ivtmp.13_7(6)>
  _4 = MEM[base: in_21(D), index: ivtmp.13_9, step: 8, offset: 0B];
  if (_4 == 0B)
    goto <bb 13>; [5.50%]
  else
    goto <bb 4>; [94.50%]
;;    succ:       13
;;                4

;;   basic block 4, loop depth 1
;;    pred:       3
  if (ivtmp.13_9 != 2)
    goto <bb 6>; [28.10%]
  else
    goto <bb 5>; [71.90%]
;;    succ:       6
;;                5

;;   basic block 5, loop depth 1
;;    pred:       4
  cstore_32 = MEM[symbol: arr, index: ivtmp.13_9, step: 4, offset: 0B];
;;    succ:       6

;;   basic block 6, loop depth 1
;;    pred:       5
;;                4
  # cstore_31 = PHI <cstore_32(5), 1(4)>
  MEM[symbol: arr, index: ivtmp.13_9, step: 4, offset: 0B] = cstore_31;
  _40 = (unsigned int) ivtmp.13_9;
  _38 = _40 + 1;
  _37 = (int) _38;
  ivtmp.13_7 = ivtmp.13_9 + 1;
  sum_a_27 = (int) ivtmp.13_7;
  if (n_16(D) > sum_a_27)
    goto <bb 3>; [94.50%]
  else
    goto <bb 7>; [5.50%]
;;    succ:       3
;;                7

You can see the load from the same stock slot in bb5, the selecting PHI in bb6 and the store in bb6.

THe test in bb4 looks weird and is the source of the problem I believe.
Comment 7 Jakub Jelinek 2020-04-24 05:20:37 UTC
(In reply to Jeffrey A. Law from comment #6)
> THe whole point of that change is to not require a dominating load if the
> object comes from the stack.

Yeah, but I find that flawed.  One can do it after performing get_ref_base_and_extent and verifying the offset is const and the whole access must be within the object, or one could do it with VR verification again verifying the access is fully within the object, but otherwise it can't be done and one needs to find dominating load, just the noload needs to handle also ARRAY_REFs etc., not only MEM_REFs.
Comment 8 Richard Biener 2020-04-24 07:52:20 UTC
In particular tree_could_trap_p woudl consider the load possibly trapping
due to the variable indexing but the patch seems to override that which
I agree is bogus.  I think we need to revert it and re-implement for GCC 11.
All testcases contain a dominating load in the condition and what we _can_
do is treat loads of stack variables the same as stores as far as
non-trappingness is concerned.
Comment 9 Richard Biener 2020-04-24 07:55:20 UTC
(In reply to Richard Biener from comment #8)
> In particular tree_could_trap_p woudl consider the load possibly trapping
> due to the variable indexing but the patch seems to override that which
> I agree is bogus.  I think we need to revert it and re-implement for GCC 11.
> All testcases contain a dominating load in the condition and what we _can_
> do is treat loads of stack variables the same as stores as far as
> non-trappingness is concerned.

Actually the non-trapping machinery only considers plain SSA name
dereferences so there's much more to extend.  Definitely nothing
for GCC 10.
Comment 10 Jakub Jelinek 2020-04-24 08:34:01 UTC
Yeah.
add_or_mark_expr could be extended to handle more complex addresses (perhaps get_inner_reference and hash on the decl + offset expression and taking into account the bitpos/bitsize then?

Further testcase.  Both foo and bar are miscompiled, baz is fine because arr[7] is known to be always within the object bounds.

__attribute__((noipa)) int
foo (int n)
{
  int arr[16], s = 0;
  for (int i = 0; i < n; i++)
    {
      if (i < 16)
	arr[i] = i;
    }
  for (int i = 0; i < 16; i++)
    s += arr[i];
  return s;
}

__attribute__((noipa)) int
bar (int n, int x, unsigned long y, unsigned long z)
{
  int arr[16], s = 0;
  arr[4] = 42;
  for (int i = 0; i < n; i++)
    {
      if (x == (i & 0x25))
	arr[y] = i;
    }
  return arr[z];
}

__attribute__((noipa)) int
baz (int n, int x, unsigned long z)
{
  int arr[16], s = 0;
  arr[12] = 42;
  for (int i = 0; i < n; i++)
    {
      if (x == (i & 0x25))
	arr[7] = i;
    }
  return arr[z];
}

int
main ()
{
  if (foo (10374) != 15 * 16 / 2)
    __builtin_abort ();
  if (bar (25, 0x25, (unsigned long) 0xdeadbeefbeefdeadULL, 4) != 42)
    __builtin_abort ();
  if (bar (25, 4, 17, 17) != 22)
    __builtin_abort ();
  if (baz (25, 0x25, 12) != 42)
    __builtin_abort ();
  if (baz (25, 4, 7) != 22)
    __builtin_abort ();
  if (baz (25, 4, 12) != 42)
    __builtin_abort ();
  return 0;
}
Comment 11 Jakub Jelinek 2020-04-24 08:59:37 UTC
If we don't want to revert the change completely, could we perhaps do:
--- gcc/tree-ssa-phiopt.c.jj	2020-03-19 10:23:50.542872359 +0100
+++ gcc/tree-ssa-phiopt.c	2020-04-24 10:54:10.341716841 +0200
@@ -2237,10 +2237,26 @@ cond_store_replacement (basic_block midd
      whose value is not available readily, which we want to avoid.  */
   if (!nontrap->contains (lhs))
     {
-      /* If LHS is a local variable without address-taken, we could
-	 always safely move down the store.  */
-      tree base = get_base_address (lhs);
-      if (!auto_var_p (base) || TREE_ADDRESSABLE (base))
+      /* If LHS is an access to a local variable without address-taken
+	 or its part and the access is provably within the bounds of the
+	 local variable, we could always safely move down the store.  */
+      HOST_WIDE_INT offset, size, decl_size;
+      bool reverse;
+      tree base = get_ref_base_and_extent_hwi (lhs, &offset, &size,
+					       &reverse);
+      if (base == NULL_TREE || !auto_var_p (base) || TREE_ADDRESSABLE (base))
+	return false;
+      if (!DECL_SIZE (base)
+	  || !tree_fits_shwi_p (DECL_SIZE (base)))
+	return false;
+      decl_size = tree_to_shwi (DECL_SIZE (base));
+      if (offset < 0
+	  || size < 0
+	  || decl_size < 0
+	  || offset >= decl_size
+	  || size > decl_size
+	  || ((unsigned HOST_WIDE_INT) offset + size
+	      > (unsigned HOST_WIDE_INT) decl_size))
 	return false;
     }
 
+ xfail the tests from the PR89430 because they all need the nontrap ARRAY_REF etc. handling?
With that, in the above testcase baz is still cselim optimized , but foo and bar are not.  bar is still miscompiled by some other optimization though (and GCC 9 didn't do that), so we have some other regression.
Comment 12 rguenther@suse.de 2020-04-24 09:03:55 UTC
On Fri, 24 Apr 2020, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94734
> 
> --- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> If we don't want to revert the change completely, could we perhaps do:
> --- gcc/tree-ssa-phiopt.c.jj    2020-03-19 10:23:50.542872359 +0100
> +++ gcc/tree-ssa-phiopt.c       2020-04-24 10:54:10.341716841 +0200
> @@ -2237,10 +2237,26 @@ cond_store_replacement (basic_block midd
>       whose value is not available readily, which we want to avoid.  */
>    if (!nontrap->contains (lhs))
>      {
> -      /* If LHS is a local variable without address-taken, we could
> -        always safely move down the store.  */
> -      tree base = get_base_address (lhs);
> -      if (!auto_var_p (base) || TREE_ADDRESSABLE (base))
> +      /* If LHS is an access to a local variable without address-taken
> +        or its part and the access is provably within the bounds of the
> +        local variable, we could always safely move down the store.  */
> +      HOST_WIDE_INT offset, size, decl_size;
> +      bool reverse;
> +      tree base = get_ref_base_and_extent_hwi (lhs, &offset, &size,
> +                                              &reverse);
> +      if (base == NULL_TREE || !auto_var_p (base) || TREE_ADDRESSABLE (base))
> +       return false;
> +      if (!DECL_SIZE (base)
> +         || !tree_fits_shwi_p (DECL_SIZE (base)))
> +       return false;
> +      decl_size = tree_to_shwi (DECL_SIZE (base));
> +      if (offset < 0
> +         || size < 0
> +         || decl_size < 0
> +         || offset >= decl_size
> +         || size > decl_size
> +         || ((unsigned HOST_WIDE_INT) offset + size
> +             > (unsigned HOST_WIDE_INT) decl_size))
>         return false;
>      }

That should be roughly equivalent to

  if (!nontrap->contains (lhs)
      || !tree_could_trap_p (lhs))

which for ARRAY_REFs checks in_array_bounds_p.
Comment 13 Jakub Jelinek 2020-04-24 09:09:46 UTC
Even better.
Comment 14 Jakub Jelinek 2020-04-24 09:19:47 UTC
(In reply to Jakub Jelinek from comment #10)
> bar is still miscompiled by some other optimization though
> (and GCC 9 didn't do that), so we have some other regression.

Sorry for the false alarm, the testcase was buggy, 17 is out of bounds.
Let's sed -i -e 's/17, 17/15, 15/' in the testcase.
Comment 15 Richard Biener 2020-04-24 09:22:00 UTC
(In reply to Jakub Jelinek from comment #13)
> Even better.

Note none of the committed testcases would be handled with this.  There's
also the issue of store data races (not sure if the notrap handling is
correct there, it looks like not) - for auto variables that would not
be an issue (but would need to be checked extra).  A dominating store
to the same location isn't enough to prove validity either since
that might be inside a different protected region.
Comment 16 Jakub Jelinek 2020-04-24 09:29:57 UTC
Created attachment 48366 [details]
gcc10-pr94734.patch

So like this?  The store data races thing can be covered by the non-addressable auto var checks.
Comment 17 Jakub Jelinek 2020-04-24 10:37:03 UTC
Created attachment 48367 [details]
gcc10-pr94734.patch

Updated patch.
Comment 18 CVS Commits 2020-04-24 22:10:46 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:cf39dccf9284d2fd9f9aa7050760adea110c8d88

commit r10-7952-gcf39dccf9284d2fd9f9aa7050760adea110c8d88
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Sat Apr 25 00:10:01 2020 +0200

    cselim: Don't assume it is safe to cstore replace a store to a local variable with unknown offset [PR94734]
    
    As the new testcase shows, it is not safe to assume we can optimize
    a conditional store into an automatic non-addressable var, we can do it
    only if we can prove that the unconditional load or store actually will
    not be outside of the boundaries of the variable.
    If the offset and size are constant, we can, but this is already all
    checked in !tree_could_trap_p, otherwise we really need to check for
    a dominating unconditional store, or for the specific case of automatic
    non-addressable variables, it is enough if there is a dominating load
    (that is what those 4 testcases have).  tree-ssa-phiopt.c has some
    infrastructure for this already, see the add_or_mark_expr method etc.,
    but right now it handles only MEM_REFs with SSA_NAME first operand
    and some integral offset.  So, I think it can be for GCC11 extended
    to handle other memory references, possibly up to just doing
    get_inner_reference and hasing based on the base, offset expressions
    and bit_offset and bit_size, and have also a special case that for
    !TREE_ADDRESSABLE automatic variables it could ignore whether something
    is a load or store because the local stack should be always writable.
    But it feels way too dangerous to do this this late for GCC10, so this
    patch just restricts the optimization to the safe case (where lhs doesn't
    trap), and on Richi's request also ignores TREE_ADDRESSABLE bit if
    flag_store_data_races, because my understanding the reason for
    TREE_ADDRESSABLE check is that we want to avoid introducing
    store data races (if address of an automatic var escapes, some other thread
    could be accessing it concurrently).
    
    2020-04-25  Jakub Jelinek  <jakub@redhat.com>
                Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/94734
            PR tree-optimization/89430
            * tree-ssa-phiopt.c: Include tree-eh.h.
            (cond_store_replacement): Return false if an automatic variable
            access could trap.  If -fstore-data-races, don't return false
            just because an automatic variable is addressable.
    
            * gcc.dg/tree-ssa/pr89430-1.c: Add xfail.
            * gcc.dg/tree-ssa/pr89430-2.c: Add xfail.
            * gcc.dg/tree-ssa/pr89430-5.c: Add xfail.
            * gcc.dg/tree-ssa/pr89430-6.c: Add xfail.
            * gcc.c-torture/execute/pr94734.c: New test.
Comment 19 Jakub Jelinek 2020-04-24 22:24:15 UTC
Fixed.