Started with a costing model change; I can reproduce it only on aarch64. The test-case is hard to reduce and comes from yarpgen:

$ g++ -O3 func.ii -c
$ g++ driver.ii -c
$ g++ *.o
$ ./a.out
a.out: driver.cpp:1143: void checksum(): Assertion `var_135 == (unsigned char)0' failed.
Aborted (core dumped)

It's caused by vect_loop; g++ -O3 func.ii -c -fdbg-cnt=vect_loop:1 is the first counter value that triggers it.
Created attachment 49771 [details] test-case 1
Created attachment 49772 [details] test-case 2
Hi Martin, can you post the yarpgen seed (or the original cpp files) and I will have a go at reproducing/reducing?
(In reply to Alex Coplan from comment #3)
> Hi Martin, can you post the yarpgen seed (or the original cpp files) and I
> will have a go at reproducing/reducing?

Sure, it's 202213617. Thank you for the help.
Can't repro with that seed (at least on aarch64-elf-gcc). I expect we're seeing different source files. I see:

$ md5sum src/*
72fdf911a2c5f9cc21df5af3ffb4726e  src/driver.cpp
b8fdebf50f579fa5d7c93de5d42ae217  src/func.cpp
ce2afb8e50893400329f5fd1d3015caf  src/init.h
$ yarpgen --version
yarpgen version 2.0 (build 9cb35d3 on 2020:10:30)

I guess we need (at least) a matching (yarpgen commit, seed) pair to end up with the same source files? Might be easiest to just post the cpp files if possible.
Created attachment 49777 [details] Packed yarpgen test-case
Thanks, I can reproduce it now.
(In reply to Alex Coplan from comment #7)
> Thanks, I can reproduce it now.

Great. I tried to reduce that with:

gcc10: -Werror -fsanitize=address,undefined -fno-sanitize-recover=all - OK
gcc11: -O2 - OK
gcc11: -O3 - Assert happens
Btw, how powerful a machine do you use for reduction? And what's the wall time of the interestingness test you're going to use?
Reduced to:

int c = 1705;
char a;
long f = 50887638;
unsigned long long *h(unsigned long long *k, unsigned long long *l) {
  return *k ? k : l;
}
void aa() {}
int main() {
  long d = f;
  for (char g = 0; g < (char)c - 10; g += 2) {
    unsigned long long i = d, j = 4;
    a = *h(&i, &j) << ((d ? 169392992 : 0) - 169392955LL);
  }
  if (a)
    __builtin_abort();
}

which is miscompiled at -O2 -ftree-vectorize or -O3.
> which is miscompiled at -O2 -ftree-vectorize or -O3.

What a great reduction! Can you please share how you achieved it?
This is caused by the overwidening pattern recognisers. They correctly realise that in:

unsigned long long x = ...;
char y = x << 37;

only the low 8 bits of x << 37 are needed. But they then take it too far and try to do an 8-bit shift by 37, which is undefined in gimple.

The optimal fix would be to get the vectoriser to replace the shift result with zero instead, but that's a bit awkward to do and should really happen before vectorisation. The simplest fix is to make sure that we don't narrow further than the shift amount allows. Testing a patch.
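For the record, a standalone sketch (not part of the testcase) of why zero is the right replacement value: the 64-bit shift fills bit positions 0..36 of the result with zeros, so the low 8 bits that survive the narrowing to char are always zero.

#include <assert.h>

int main (void)
{
  /* x << 37 is a well-defined 64-bit shift: bits 0..36 of the result
     are the zeros shifted in, so truncating the result to 8 bits
     yields 0 whatever the value of x.  */
  for (unsigned long long x = 0; x < 1000; ++x)
    assert ((char) (x << 37) == 0);
  return 0;
}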
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:58a12b0eadac62e691fcf7325ab2bc2c93d46b61

commit r11-6381-g58a12b0eadac62e691fcf7325ab2bc2c93d46b61
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Thu Dec 31 16:51:34 2020 +0000

    vect: Avoid generating out-of-range shifts [PR98302]

    In this testcase we end up with:

        unsigned long long x = ...;
        char y = (char) (x << 37);

    The overwidening pattern realised that only the low 8 bits
    of x << 37 are needed, but then tried to turn that into:

        unsigned long long x = ...;
        char y = (char) x << 37;

    which gives an out-of-range shift.  In this case y can simply
    be replaced by zero, but as the comment in the patch says, it's
    kind-of awkward to do that in the middle of vectorisation.

    Most of the overwidening stuff is about keeping operations as
    narrow as possible, which is important for vectorisation but
    could be counter-productive for scalars (especially on RISC
    targets).  In contrast, optimising y to zero in the above feels
    like an independent optimisation that would benefit scalar code
    and that should happen before vectorisation.

    gcc/
            PR tree-optimization/98302
            * tree-vect-patterns.c (vect_determine_precisions_from_users):
            Make sure that the precision remains greater than the shift
            count.

    gcc/testsuite/
            PR tree-optimization/98302
            * gcc.dg/vect/pr98302.c: New test.
Fixed on trunk so far. The code goes back to GCC 9 and the patch seems safe, so keeping this open for now to see whether we should backport.
(In reply to Martin Liška from comment #11) > > which is miscompiled at -O2 -ftree-vectorize or -O3. > > What a great reduction, can you please share knowledge how did you achieve > that?! Hi Martin, Sorry for the late reply. Generally, when reducing a yarpgen testcase, I start by checking whether we can hit the bug with just one large C++ file, i.e. with: $ cat init.h func.cpp driver.cpp > combined.cc and then deleting the #include of "init.h" from combined.cc. For most issues, you should be able to reproduce the bug with the combined file. For a wrong code bug, I usually proceed with two separate reductions: 1. Reduce the C++ file without preprocessing. 2. Attempt to manually convert the reduced C++ testcase to a C testcase. 3. Preprocess the C testcase and then reduce the preprocessed file. For (2) this usually involves superficial syntax changes, using C headers instead of their C++ equivalents, and replacing calls to std::{min,max} with a C equivalent (using pointers instead of references). I use a relatively beefy x86 machine for reduction, using qemu to run target code (aarch64, in this case). My reduction script broadly does the following: Sanitizer checks: * I run three separate compile+executes (2x gcc, 1x clang) on the x86 host with sanitizer options enabled (-fsanitize=address -fsanitize=undefined -fno-sanitize-recover=undefined). I use the host GCC (7.5.0 on Ubuntu 18.04) and clang 10 for this. I run clang at -O0 and GCC at both -O0 and -O1 -fno-common (the latter appears to be necessary to catch some global buffer overflows). * We fail the interestingness test if any of these executions fail. If possible, I also save the results of these executions for comparison with the target executions. Target comparisons: * Then, using the target (aarch64) GCC, I compile+execute at various optimisation levels. In this case I used -O{0,1,2,s,g,3}, checking that the output (exit code and stdout/stderr) differs for the optimisation level exhibiting the bug (-O3 in this case) but is the same for all other cases. Finally, I run the reduced testcase through https://cerberus.cl.cam.ac.uk/ to check for any undefined behaviour that the sanitizers didn't manage to catch. More recently I've also been experimenting with Frama-C which works on some testcases that are too large for Cerberus to handle. Currently the reduction script is a simple bash script that runs everything serially. I've been experimenting with some python scripts that can run things in parallel, I might also explore generating make/ninja to parallelise the interestingness test at some point. I'm sure that this process could be improved, but this is what I've been using so far, and it seems to work. I hope this is of some use, let me know if you have any more specific questions or thoughts.
(In reply to Alex Coplan from comment #15)
> (In reply to Martin Liška from comment #11)
> > > which is miscompiled at -O2 -ftree-vectorize or -O3.
> >
> > What a great reduction! Can you please share how you achieved it?
>
> Hi Martin,
>
> Sorry for the late reply.

That's fine!

> Generally, when reducing a yarpgen testcase, I start by checking whether
> we can hit the bug with just one large C++ file, i.e. with:
>
> $ cat init.h func.cpp driver.cpp > combined.cc

Nice trick, I've just used it in PR98513 and it works.

> and then deleting the #include of "init.h" from combined.cc. For most
> issues, you should be able to reproduce the bug with the combined file.
>
> For a wrong-code bug, I usually proceed in three steps, involving two
> separate reductions:
>
> 1. Reduce the C++ file without preprocessing.
> 2. Attempt to manually convert the reduced C++ testcase to a C testcase.
> 3. Preprocess the C testcase and then reduce the preprocessed file.
>
> For (2) this usually involves superficial syntax changes, using C
> headers instead of their C++ equivalents, and replacing calls to
> std::{min,max} with a C equivalent (using pointers instead of
> references).

I can confirm that one can easily replace std::min and std::max with the implementation:

template<class T>
const T& min(const T& a, const T& b)
{
  return (b < a) ? b : a;
}

which removes the dependency.

> I use a relatively beefy x86 machine for reduction, using qemu to run
> target code (aarch64, in this case). My reduction script broadly does
> the following:
>
> Sanitizer checks:
>
> * I run three separate compile+executes (2x gcc, 1x clang) on the x86
>   host with sanitizer options enabled (-fsanitize=address
>   -fsanitize=undefined -fno-sanitize-recover=undefined). I use the host
>   GCC (7.5.0 on Ubuntu 18.04) and clang 10 for this. I run clang at -O0
>   and GCC at both -O0 and -O1 -fno-common (the latter appears to be
>   necessary to catch some global buffer overflows).

Heh :) In the mentioned PR, I first did only a reduction with sanitizers; -O3 was miscompiled, but later I noticed that -O0 was crashing as well. Using clang is also a nice trick. Moreover, I also use -Werror -Wall -Wextra, and 'timeout X' helps as well; it's quite common that loops iterate forever.

> * We fail the interestingness test if any of these executions fails. If
>   possible, I also save the results of these executions for comparison
>   with the target executions.
>
> Target comparisons:
>
> * Then, using the target (aarch64) GCC, I compile+execute at various
>   optimisation levels. In this case I used -O{0,1,2,s,g,3}, checking
>   that the output (exit code and stdout/stderr) differs for the
>   optimisation level exhibiting the bug (-O3 in this case) but is the
>   same for all other cases.
>
> Finally, I run the reduced testcase through
> https://cerberus.cl.cam.ac.uk/ to check for any undefined behaviour that
> the sanitizers didn't manage to catch. More recently I've also been
> experimenting with Frama-C, which works on some testcases that are too
> large for Cerberus to handle.

I see!

> Currently the reduction script is a simple bash script that runs
> everything serially. I've been experimenting with some python scripts
> that can run things in parallel; I might also explore generating
> make/ninja to parallelise the interestingness test at some point.
I do it also serially; in the case of the mentioned PR, I did:

cvise -c 'g++-10 combined.cc && timeout 2 ./a.out && g++-10 -fsanitize=address,undefined -fno-sanitize-recover=all combined.cc && timeout 2 ./a.out && g++ combined.cc -O3 && timeout 2 valgrind ./a.out 2>&1 | grep "Invalid "' combined.cc

> I'm sure that this process could be improved, but this is what I've been
> using so far, and it seems to work. I hope this is of some use; let me
> know if you have any more specific questions or thoughts.

I may experiment with a parallel interestingness script next time; a rough idea of what I have in mind is sketched below. Right now, I've been running yarpgen for a few hours a day on an AMD Epyc machine.

Anyway, thank you for sharing the knowledge. It's useful!
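Something like this, perhaps (untested, just a sketch with made-up file names):

#!/bin/bash
# Run the independent sanity checks as background jobs, then test for
# the bug itself.  A sketch only, not a tested script.

build_and_run () {   # $1 = output name, remaining args = compiler command
  local out=$1; shift
  "$@" combined.cc -o "$out" && timeout 2 "./$out"
}

build_and_run plain g++-10 &
pid_plain=$!
build_and_run san g++-10 -fsanitize=address,undefined -fno-sanitize-recover=all &
pid_san=$!

# Both sanity checks have to succeed...
wait $pid_plain || exit 1
wait $pid_san || exit 1

# ...and the -O3 build must still show the invalid access under valgrind.
g++ combined.cc -O3 -o bug &&
  timeout 2 valgrind ./bug 2>&1 | grep -q "Invalid "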
Fixed for GCC 10 by r10-9257.
The releases/gcc-9 branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:3fa4752e29a5b44219837ebad5bb09ec98af156e

commit r9-9462-g3fa4752e29a5b44219837ebad5bb09ec98af156e
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Sun Apr 25 14:51:14 2021 +0100

    vect: Avoid generating out-of-range shifts [PR98302]

    In this testcase we end up with:

        unsigned long long x = ...;
        char y = (char) (x << 37);

    The overwidening pattern realised that only the low 8 bits
    of x << 37 are needed, but then tried to turn that into:

        unsigned long long x = ...;
        char y = (char) x << 37;

    which gives an out-of-range shift.  In this case y can simply
    be replaced by zero, but as the comment in the patch says, it's
    kind-of awkward to do that in the middle of vectorisation.

    Most of the overwidening stuff is about keeping operations as
    narrow as possible, which is important for vectorisation but
    could be counter-productive for scalars (especially on RISC
    targets).  In contrast, optimising y to zero in the above feels
    like an independent optimisation that would benefit scalar code
    and that should happen before vectorisation.

    gcc/
            PR tree-optimization/98302
            * tree-vect-patterns.c (vect_determine_precisions_from_users):
            Make sure that the precision remains greater than the shift
            count.

    gcc/testsuite/
            PR tree-optimization/98302
            * gcc.dg/vect/pr98302.c: New test.

    (cherry picked from commit 58a12b0eadac62e691fcf7325ab2bc2c93d46b61)
Fixed.