Bug 98302

Summary: [9 Regression] Wrong code on aarch64
Product: gcc
Component: target
Version: 11.0
Target Milestone: 9.4
Host: aarch64-linux-gnu
Target: aarch64-linux-gnu
Status: RESOLVED FIXED
Severity: normal
Priority: P2
Keywords: wrong-code
Reporter: Martin Liška <marxin>
Assignee: Richard Sandiford <rsandifo>
CC: acoplan, rsandifo
Known to work: 11.0
Known to fail:
Last reconfirmed: 2020-12-16 00:00:00
Bug Depends on:
Bug Blocks: 103035
Attachments: test-case 1
             test case 2
             Packed yarpgen test-case

Description Martin Liška 2020-12-15 21:15:12 UTC
Started with a costing model change, and I can reproduce it only on aarch64.
The test-case is hard to reduce and comes from yarpgen:

$ g++ -O3 func.ii -c
$ g++ driver.ii -c
$ g++ *.o

$ ./a.out
a.out: driver.cpp:1143: void checksum(): Assertion `var_135 == (unsigned char)0' failed.
Aborted (core dumped)

It's caused by vect_loop: g++ -O3 func.ii -c -fdbg-cnt=vect_loop:1 is the first counter value that triggers the failure.
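
For reference, a hedged sketch of how such a -fdbg-cnt bisection can be driven
(file names as above; the upper bound and loop are assumptions, not the actual
commands used):

# Sketch only: -fdbg-cnt=vect_loop:N lets the vectoriser fire for the
# first N loops, so the first N for which ./a.out fails pinpoints the
# miscompiled loop.
for n in $(seq 1 64); do
  g++ -O3 func.ii -c -fdbg-cnt=vect_loop:$n
  g++ driver.ii -c
  g++ func.o driver.o -o a.out
  ./a.out || { echo "first failing value: $n"; break; }
done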
Comment 1 Martin Liška 2020-12-15 21:15:25 UTC
Created attachment 49771 [details]
test-case 1
Comment 2 Martin Liška 2020-12-15 21:15:35 UTC
Created attachment 49772 [details]
test case 2
Comment 3 Alex Coplan 2020-12-16 11:59:52 UTC
Hi Martin, can you post the yarpgen seed (or the original cpp files) and I will have a go at reproducing/reducing?
Comment 4 Martin Liška 2020-12-16 12:05:13 UTC
(In reply to Alex Coplan from comment #3)
> Hi Martin, can you post the yarpgen seed (or the original cpp files) and I
> will have a go at reproducing/reducing?

Sure, it's 202213617.
Thank you for the help.
Comment 5 Alex Coplan 2020-12-16 12:16:44 UTC
Can't repro with that seed (at least on aarch64-elf-gcc). I expect we're seeing different source files.

I see:
$ md5sum src/*
72fdf911a2c5f9cc21df5af3ffb4726e  src/driver.cpp
b8fdebf50f579fa5d7c93de5d42ae217  src/func.cpp
ce2afb8e50893400329f5fd1d3015caf  src/init.h

$ yarpgen --version
yarpgen version 2.0 (build 9cb35d3 on 2020:10:30)

I guess we need (at least) a matching (yarpgen commit, seed) pair to end up with the same source files? It might be easiest to just post the cpp files if possible.
Comment 6 Martin Liška 2020-12-16 12:20:33 UTC
Created attachment 49777 [details]
Packed yarpgen test-case
Comment 7 Alex Coplan 2020-12-16 12:23:41 UTC
Thanks, I can reproduce it now.
Comment 8 Martin Liška 2020-12-16 12:57:26 UTC
(In reply to Alex Coplan from comment #7)
> Thanks, I can reproduce it now.

Great. I started triaging it as follows:
gcc10: -Werror -fsanitize=address,undefined -fno-sanitize-recover=all - OK
gcc11: -O2 - OK
gcc11: -O3 - Assert happens
Comment 9 Martin Liška 2020-12-16 12:58:56 UTC
Btw, how powerful a machine do you use for reduction?
What's the wall time of the interestingness test you're going to use?
Comment 10 Alex Coplan 2020-12-16 17:08:21 UTC
Reduced to:

int c = 1705;
char a;
long f = 50887638;
unsigned long long *h(unsigned long long *k, unsigned long long *l) {
  return *k ? k : l;
}
void aa() {}
int main() {
  long d = f;
  for (char g = 0; g < (char)c - 10; g += 2) {
    unsigned long long i = d, j = 4;
    a = *h(&i, &j) << ((d ? 169392992 : 0) - 169392955LL);
  }
  if (a)
    __builtin_abort();
}

which is miscompiled at -O2 -ftree-vectorize or -O3.
Comment 11 Martin Liška 2020-12-16 19:04:41 UTC
> which is miscompiled at -O2 -ftree-vectorize or -O3.

What a great reduction! Can you please share how you achieved it?
Comment 12 Richard Sandiford 2020-12-30 18:23:33 UTC
This is caused by the overwidening pattern recognisers.
They correctly realise that in:

  unsigned long long x = ...;
  char y = x << 37;

only the low 8 bits of x << 37 are needed.  But they then
take it too far and try to do an 8-bit shift by 37, which is
undefined in gimple.
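
To illustrate (a minimal standalone sketch, not taken from the PR): narrowing
the shift is fine while the shift count is smaller than the narrow precision,
but the low byte of x << 37 is always zero, and an "8-bit shift by 37" has no
defined meaning:

#include <stdio.h>

int main (void)
{
  unsigned long long x = 0xffffffffffffffffULL;

  /* Valid narrowing: 3 < 8, so the low byte of x << 3 equals the
     low byte of an 8-bit value shifted by 3.  */
  unsigned char a = (unsigned char) (x << 3);
  unsigned char b = (unsigned char) ((unsigned char) x << 3);

  /* No valid narrowing: 37 >= 8, so the low byte of x << 37 is
     simply zero.  */
  unsigned char c = (unsigned char) (x << 37);

  printf ("%u %u %u\n", a, b, c);   /* prints: 248 248 0 */
  return 0;
}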

The optimal fix would be to get the vectoriser to replace the
shift result with zero instead, but that's a bit awkward to do
and should really happen before vectorisation.  The simplest fix
is to make sure that we don't narrow further than the shift amount
allows.

Testing a patch.
Comment 13 GCC Commits 2020-12-31 16:52:00 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:58a12b0eadac62e691fcf7325ab2bc2c93d46b61

commit r11-6381-g58a12b0eadac62e691fcf7325ab2bc2c93d46b61
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Thu Dec 31 16:51:34 2020 +0000

    vect: Avoid generating out-of-range shifts [PR98302]
    
    In this testcase we end up with:
    
      unsigned long long x = ...;
      char y = (char) (x << 37);
    
    The overwidening pattern realised that only the low 8 bits
    of x << 37 are needed, but then tried to turn that into:
    
      unsigned long long x = ...;
      char y = (char) x << 37;
    
    which gives an out-of-range shift.  In this case y can simply
    be replaced by zero, but as the comment in the patch says,
    it's kind-of awkward to do that in the middle of vectorisation.
    
    Most of the overwidening stuff is about keeping operations
    as narrow as possible, which is important for vectorisation
    but could be counter-productive for scalars (especially on
    RISC targets).  In contrast, optimising y to zero in the above
    feels like an independent optimisation that would benefit scalar
    code and that should happen before vectorisation.
    
    gcc/
            PR tree-optimization/98302
            * tree-vect-patterns.c (vect_determine_precisions_from_users): Make
            sure that the precision remains greater than the shift count.
    
    gcc/testsuite/
            PR tree-optimization/98302
            * gcc.dg/vect/pr98302.c: New test.
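
For reference, a hedged sketch of the idea in that ChangeLog entry
(illustrative only, not the actual tree-vect-patterns.c code; the names are
made up):

/* Keep the narrowed precision strictly greater than a known constant
   shift count, so an N-bit shift by >= N bits is never created.  */
static unsigned int
clamp_precision (unsigned int min_precision, unsigned long long shift_amount)
{
  if (shift_amount >= min_precision)
    min_precision = shift_amount + 1;  /* e.g. a shift by 37 needs >= 38 bits */
  return min_precision;
}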
Comment 14 Richard Sandiford 2020-12-31 16:57:30 UTC
Fixed on trunk so far.  The code goes back to GCC 9 and the
patch seems safe, so keeping open for now to see whether
we should backport.
Comment 15 Alex Coplan 2021-01-04 12:04:58 UTC
(In reply to Martin Liška from comment #11)
> > which is miscompiled at -O2 -ftree-vectorize or -O3.
> 
> What a great reduction! Can you please share how you achieved it?

Hi Martin,

Sorry for the late reply.

Generally, when reducing a yarpgen testcase, I start by checking whether
we can hit the bug with just one large C++ file, i.e. with:

$ cat init.h func.cpp driver.cpp > combined.cc

and then deleting the #include of "init.h" from combined.cc. For most
issues, you should be able to reproduce the bug with the combined file.

For a wrong-code bug, I usually proceed in three steps (two separate reductions):

1. Reduce the C++ file without preprocessing.
2. Attempt to manually convert the reduced C++ testcase to a C testcase.
3. Preprocess the C testcase and then reduce the preprocessed file.

For (2) this usually involves superficial syntax changes, using C
headers instead of their C++ equivalents, and replacing calls to
std::{min,max} with a C equivalent (using pointers instead of
references).
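
For example (a hypothetical stand-in, not taken from this testcase), a call
to std::min on longs can become:

/* C replacement for std::min<long>, using pointers instead of
   references.  */
static const long *
min_long (const long *a, const long *b)
{
  return (*b < *a) ? b : a;
}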

I use a relatively beefy x86 machine for reduction, using qemu to run
target code (aarch64, in this case). My reduction script broadly does
the following:

Sanitizer checks:

 * I run three separate compile+executes (2x gcc, 1x clang) on the x86
   host with sanitizer options enabled (-fsanitize=address
   -fsanitize=undefined -fno-sanitize-recover=undefined). I use the host
   GCC (7.5.0 on Ubuntu 18.04) and clang 10 for this. I run clang at -O0
   and GCC at both -O0 and -O1 -fno-common (the latter appears to be
   necessary to catch some global buffer overflows).

 * We fail the interestingness test if any of these executions fail. If
   possible, I also save the results of these executions for comparison
   with the target executions.

Target comparisons:

 * Then, using the target (aarch64) GCC, I compile+execute at various
   optimisation levels. In this case I used -O{0,1,2,s,g,3}, checking
   that the output (exit code and stdout/stderr) differs for the
   optimisation level exhibiting the bug (-O3 in this case) but is the
   same for all other cases.
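
A minimal sketch of that target-comparison step as an interestingness test
(the paths, triple, and timeouts are assumptions, not my actual script):

#!/bin/bash
# Interesting iff all non-buggy levels agree and -O3 differs.
run () {
  aarch64-linux-gnu-g++ "$1" combined.cc -o a.out || exit 1
  timeout 10 qemu-aarch64 ./a.out > out.txt 2>&1
  echo "exit:$?" >> out.txt
}
run -O0; cp out.txt ref.txt
for opt in -O1 -O2 -Os -Og; do
  run $opt
  cmp -s out.txt ref.txt || exit 1   # non-buggy levels must agree
done
run -O3
! cmp -s out.txt ref.txt             # -O3 must differ to stay interesting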

Finally, I run the reduced testcase through
https://cerberus.cl.cam.ac.uk/ to check for any undefined behaviour that
the sanitizers didn't manage to catch. More recently I've also been
experimenting with Frama-C which works on some testcases that are too
large for Cerberus to handle.

Currently the reduction script is a simple bash script that runs
everything serially. I've been experimenting with some python scripts
that can run things in parallel; I might also explore generating
make/ninja to parallelise the interestingness test at some point.

I'm sure that this process could be improved, but this is what I've been
using so far, and it seems to work. I hope this is of some use, let me
know if you have any more specific questions or thoughts.
Comment 16 Martin Liška 2021-01-05 09:43:01 UTC
(In reply to Alex Coplan from comment #15)
> (In reply to Martin Liška from comment #11)
> > > which is miscompiled at -O2 -ftree-vectorize or -O3.
> > 
> > What a great reduction! Can you please share how you achieved it?
> 
> Hi Martin,
> 
> Sorry for the late reply.

That's fine!

> 
> Generally, when reducing a yarpgen testcase, I start by checking whether
> we can hit the bug with just one large C++ file, i.e. with:
> 
> $ cat init.h func.cpp driver.cpp > combined.cc

Nice trick; I've just used it in PR98513 and it works.

> 
> and then deleting the #include of "init.h" from combined.cc. For most
> issues, you should be able to reproduce the bug with the combined file.
> 
> For a wrong-code bug, I usually proceed in three steps (two separate reductions):
> 
> 1. Reduce the C++ file without preprocessing.
> 2. Attempt to manually convert the reduced C++ testcase to a C testcase.
> 3. Preprocess the C testcase and then reduce the preprocessed file.
> 
> For (2) this usually involves superficial syntax changes, using C
> headers instead of their C++ equivalents, and replacing calls to
> std::{min,max} with a C equivalent (using pointers instead of
> references).

I can confirm that one can easily replace std::min/std::max with this implementation:

template<class T>
const T& min(const T& a, const T& b)
{
    return (b < a) ? b : a;
}

which removes the dependency.

> 
> I use a relatively beefy x86 machine for reduction, using qemu to run
> target code (aarch64, in this case). My reduction script broadly does
> the following:
> 
> Sanitizer checks:
> 
>  * I run three separate compile+executes (2x gcc, 1x clang) on the x86
>    host with sanitizer options enabled (-fsanitize=address
>    -fsanitize=undefined -fno-sanitize-recover=undefined). I use the host
>    GCC (7.5.0 on Ubuntu 18.04) and clang 10 for this. I run clang at -O0
>    and GCC at both -O0 and -O1 -fno-common (the latter appears to be
>    necessary to catch some global buffer overflows).

Heh :) In the mentioned PR, I first reduced only with sanitizers; -O3 was
miscompiled, but later I noticed that -O0 was crashing as well. Using clang
is also a nice trick. Moreover, I also use -Werror -Wall -Wextra, and 'timeout X'
helps as well, since it's quite common that loops iterate forever.

> 
>  * We fail the interestingness test if any of these executions fail. If
>    possible, I also save the results of these executions for comparison
>    with the target executions.
> 
> Target comparisons:
> 
>  * Then, using the target (aarch64) GCC, I compile+execute at various
>    optimisation levels. In this case I used -O{0,1,2,s,g,3}, checking
>    that the output (exit code and stdout/stderr) differs for the
>    optimisation level exhibiting the bug (-O3 in this case) but is the
>    same for all other cases.
> 
> Finally, I run the reduced testcase through
> https://cerberus.cl.cam.ac.uk/ to check for any undefined behaviour that
> the sanitizers didn't manage to catch. More recently I've also been
> experimenting with Frama-C which works on some testcases that are too
> large for Cerberus to handle.

I see!

> 
> Currently the reduction script is a simple bash script that runs
> everything serially. I've been experimenting with some python scripts
> that can run things in parallel, I might also explore generating
> make/ninja to parallelise the interestingness test at some point.

I also run it serially; in the case of the mentioned PR, I did:

cvise -c 'g++-10 combined.cc && timeout 2 ./a.out && g++-10 -fsanitize=address,undefined -fno-sanitize-recover=all combined.cc && timeout 2 ./a.out && g++ combined.cc -O3 && timeout 2 valgrind ./a.out 2>&1 | grep "Invalid "' combined.cc

> 
> I'm sure that this process could be improved, but this is what I've been
> using so far, and it seems to work. I hope this is of some use, let me
> know if you have any more specific questions or thoughts.

I may experiment with a parallel interestingness script next time. Right now, I've been
running yarpgen for a few hours a day on an AMD Epyc machine.

Anyway, thank you for sharing the knowledge. It's useful!
Comment 17 Richard Sandiford 2021-01-12 10:03:50 UTC
Fixed for GCC 10 by r10-9257.
Comment 18 GCC Commits 2021-04-25 13:51:34 UTC
The releases/gcc-9 branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:3fa4752e29a5b44219837ebad5bb09ec98af156e

commit r9-9462-g3fa4752e29a5b44219837ebad5bb09ec98af156e
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Sun Apr 25 14:51:14 2021 +0100

    vect: Avoid generating out-of-range shifts [PR98302]
    
    In this testcase we end up with:
    
      unsigned long long x = ...;
      char y = (char) (x << 37);
    
    The overwidening pattern realised that only the low 8 bits
    of x << 37 are needed, but then tried to turn that into:
    
      unsigned long long x = ...;
      char y = (char) x << 37;
    
    which gives an out-of-range shift.  In this case y can simply
    be replaced by zero, but as the comment in the patch says,
    it's kind-of awkward to do that in the middle of vectorisation.
    
    Most of the overwidening stuff is about keeping operations
    as narrow as possible, which is important for vectorisation
    but could be counter-productive for scalars (especially on
    RISC targets).  In contrast, optimising y to zero in the above
    feels like an independent optimisation that would benefit scalar
    code and that should happen before vectorisation.
    
    gcc/
            PR tree-optimization/98302
            * tree-vect-patterns.c (vect_determine_precisions_from_users): Make
            sure that the precision remains greater than the shift count.
    
    gcc/testsuite/
            PR tree-optimization/98302
            * gcc.dg/vect/pr98302.c: New test.
    
    (cherry picked from commit 58a12b0eadac62e691fcf7325ab2bc2c93d46b61)
Comment 19 Richard Sandiford 2021-04-25 13:53:33 UTC
Fixed.