Created attachment 51774 [details]
main_loop.c

Originally found the problem as a test failure in the GNU ed-1.17 test suite. The test fails on a gcc-12.0.0 20211107 snapshot and works on the gcc-11.2.0 release.

I shrank the test down to a single file, but was not able to figure out where the problem pops up:

# gcc-11, good:
$ ../result-1/bin/gcc -O2 main_loop.c -o a && ./a
element 1
element 2
element 3

# gcc-12, bad:
$ ../result-2/bin/gcc -O2 main_loop.c -o a && ./a
element 1
element 2

$ LANG=C ../result-1/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/nix/store/gsj6c88q5hf7fmaaym7023q92jvx1i20-gcc-11.2.0/bin/gcc
COLLECT_LTO_WRAPPER=/nix/store/gsj6c88q5hf7fmaaym7023q92jvx1i20-gcc-11.2.0/libexec/gcc/x86_64-unknown-linux-gnu/11.2.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with:
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.2.0 (GCC)

$ LANG=C ../result-2/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/nix/store/1g16njhrdlhq61qh30c063jdmmz3hv4d-gcc-12.0.0/bin/gcc
COLLECT_LTO_WRAPPER=/nix/store/1g16njhrdlhq61qh30c063jdmmz3hv4d-gcc-12.0.0/libexec/gcc/x86_64-unknown-linux-gnu/12.0.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with:
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.0.0 20211107 (experimental) (GCC)
Still broken as of r12-5142. Confirmed.
This seems jump-threading related, but I can't tell how. -fno-tree-vrp is enough to work around the bug, but I don't have anything more than that.
Created attachment 51776 [details]
main_loop_simpler.c

Attaching a seemingly simpler example. I applied most of the inlining by hand and simplified the loops as much as I could. It looks like VRP does something funny with:

  while (n > 0)
    {
      while (n-- > 0)
        {

Comparing the 039t.mergephi1 dumps with and without -fno-tree-vrp:

--- no-vrp/ed-main_loop.c.039t.mergephi1   2021-11-12 09:07:55.120821967 +0000
+++ yes-vrp/ed-main_loop.c.039t.mergephi1  2021-11-12 09:07:55.186822973 +0000
@@ -93,7 +93,7 @@
   # np_6 = PHI <np_7(10), np_40(6)>
   # n_8 = PHI <n_9(10), n_23(6)>
   n_23 = n_8 + -1;
-  if (n_8 > 0)
+  if (n_8 != 0)
     goto <bb 4>; [INV]
   else
     goto <bb 8>; [INV]

I think our loop is executed with n == -1 on real data and VRP provided an incorrect range.
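For reference, here is a minimal self-contained C sketch of that loop shape (illustrative only; the names and loop body are not taken from the attachment). The key property is that the inner "while (n-- > 0)" loop always exits with n == -1, which the outer "n > 0" guard must then reject:

  #include <stdio.h>

  int main (void)
  {
    int n = 3;
    while (n > 0)             /* outer guard must reject n == -1 */
      {
        while (n-- > 0)       /* post-decrement: loop exits with n == -1 */
          printf ("element %d\n", 3 - n);
      }
    return 0;
  }

With correct code generation this prints "element 1" through "element 3".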
(In reply to Sergei Trofimovich from comment #3)
> Created attachment 51776 [details]
> main_loop_simpler.c
> 
> Attaching a seemingly simpler example. I applied most of the inlining by
> hand and simplified the loops as much as I could. It looks like VRP does
> something funny with:
> 
>   while (n > 0)
>     {
>       while (n-- > 0)
>         {
> 
> Comparing the 039t.mergephi1 dumps with and without -fno-tree-vrp:
> 
> --- no-vrp/ed-main_loop.c.039t.mergephi1   2021-11-12 09:07:55.120821967 +0000
> +++ yes-vrp/ed-main_loop.c.039t.mergephi1  2021-11-12 09:07:55.186822973 +0000
> @@ -93,7 +93,7 @@
>    # np_6 = PHI <np_7(10), np_40(6)>
>    # n_8 = PHI <n_9(10), n_23(6)>
>    n_23 = n_8 + -1;
> -  if (n_8 > 0)
> +  if (n_8 != 0)
>      goto <bb 4>; [INV]
>    else
>      goto <bb 8>; [INV]
> 
> I think our loop is executed with n == -1 on real data and VRP provided an
> incorrect range.

But how would the inner loop be reachable with n == -1?
Started with r12-3876-g4a960d548b7d7d94.
Not looking at this yet, but disabling jump threading from all passes (dom included) makes the problem go away:

$ ./xgcc -B./ a.c -w -O2 -fno-thread-jumps && ./a.out
element 1
element 2
element 3

...so *maybe* threading related.

The minimal set of jump threads that reproduces this is:

$ ./xgcc -B./ a.c -w -O2 -fdbg-cnt=registered_jump_thread:1-1:3-3:5-5 && ./a.out
***dbgcnt: lower limit 1 reached for registered_jump_thread.***
***dbgcnt: upper limit 1 reached for registered_jump_thread.***
***dbgcnt: lower limit 3 reached for registered_jump_thread.***
***dbgcnt: upper limit 3 reached for registered_jump_thread.***
***dbgcnt: lower limit 5 reached for registered_jump_thread.***
***dbgcnt: upper limit 5 reached for registered_jump_thread.***
element 1
element 2

$ grep 'Register.*jump' a.c.*
a.c.111t.threadfull1:  [1] Registering jump thread: (6, 7) incoming edge;  (7, 9) nocopy;
a.c.126t.thread1:  [3] Registering jump thread: (2, 4) incoming edge;  (4, 12) nocopy;
a.c.191t.thread2:  [5] Registering jump thread: (13, 11) incoming edge;  (11, 10) nocopy;
(In reply to Aldy Hernandez from comment #6)
> Not looking at this yet, but disabling jump threading from all passes (dom
> included) makes the problem go away:
> 
> $ ./xgcc -B./ a.c -w -O2 -fno-thread-jumps && ./a.out
> element 1
> element 2
> element 3

a.c is main_loop_simpler.c.
The problem here is this thread:

[5] Registering jump thread: (13, 11) incoming edge;  (11, 10) nocopy;

BB11 has this:

=========== BB 11 ============
Imports: n_42
Exports: n_42
Relational : (n_45 < n_42)
<bb 11> [local count: 118111614]:
  # np_43 = PHI <np_19(13), &buffer_head(4)>
  # n_42 = PHI <m_31(13), addr_14(D)(4)>
  # m_31 = PHI <0(13), m_16(4)>
  n_45 = n_42 + -1;
  if (n_42 != 0)
    goto <bb 12>; [89.00%]
  else
    goto <bb 10>; [11.00%]

Because of the ordering of the import bitmap, we solve m_31 first, to 0. Then, when we solve n_42, we think we can use the m_31 in the cache, but the ordering is wrong: on the back edge, n_42's argument is the m_31 from the previous iteration, not the freshly solved one. PHIs must always be resolved first. We went through a similar exercise to get the relationals right and somehow missed this.
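To make the ordering hazard concrete, here is a toy C model of the parallel-PHI semantics (purely illustrative; this is not solver code, and the variable names only echo the dump above):

  #include <stdio.h>

  /* Model of the two PHIs on the back edge 13->11:

       n_42 = PHI <m_31(13), ...>
       m_31 = PHI <0(13), ...>

     PHI arguments are read in parallel on entry to the block, so
     n_42 must observe m_31's value from the end of BB13.  */
  int main (void)
  {
    int m_prev = 7;       /* m_31 as left by the previous iteration */

    /* Wrong: commit m_31 first (import-bitmap order), then let n_42
       read it back from the cache; n_42 sees this iteration's 0.  */
    int m_seq = 0;
    int n_seq = m_seq;

    /* Right: snapshot the incoming-edge arguments, then commit.  */
    int m_arg = 0, n_arg = m_prev;
    int m_par = m_arg;
    int n_par = n_arg;

    printf ("sequential: m=%d n=%d; parallel: m=%d n=%d\n",
            m_seq, n_seq, m_par, n_par);
    return 0;
  }

The sequential update yields n = 0 while the parallel semantics demand n = 7, which is how the stale cache entry turns into a wrong range.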
Created attachment 51780 [details]
patch in testing
(In reply to Aldy Hernandez from comment #9)
> Created attachment 51780 [details]
> patch in testing

The patch fixes the real ed-1.17 test suite as well. Thank you!
The master branch has been updated by Aldy Hernandez <aldyh@gcc.gnu.org>:

https://gcc.gnu.org/g:264f061997c0a5349cdce6d73f0dc167ac7fc8f4

commit r12-5216-g264f061997c0a5349cdce6d73f0dc167ac7fc8f4
Author: Aldy Hernandez <aldyh@redhat.com>
Date:   Fri Nov 12 16:08:01 2021 +0100

    path solver: Solve PHI imports first for ranges.

    PHIs must be resolved first while solving ranges in a block,
    regardless of where they appear in the import bitmap.  We went
    through a similar exercise for the relational code, but missed
    these.

    Tested on x86-64 & ppc64le Linux.

    gcc/ChangeLog:

            PR tree-optimization/103202
            * gimple-range-path.cc (path_range_query::compute_ranges_in_block):
            Solve PHI imports first.
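As a rough illustration of the discipline the commit message describes, a hedged two-pass sketch in C (a toy model with invented names; the real change lives in path_range_query::compute_ranges_in_block): every import defined by a PHI is resolved before any other import, regardless of where it sits in the bitmap:

  #include <stdio.h>
  #include <stdbool.h>

  /* Toy model of a two-pass walk over a block's imports.  The names
     mirror BB11 from comment #8, but the code is illustrative.  */
  struct import { const char *name; bool is_phi; };

  static void solve (const struct import *imp)
  {
    printf ("solving %s (%s)\n", imp->name, imp->is_phi ? "PHI" : "stmt");
  }

  int main (void)
  {
    struct import imports[] = {
      { "n_45", false },        /* n_45 = n_42 + -1 */
      { "m_31", true },         /* m_31 = PHI <0(13), m_16(4)> */
      { "n_42", true },         /* n_42 = PHI <m_31(13), addr_14(D)(4)> */
    };
    size_t n = sizeof imports / sizeof imports[0];

    for (size_t i = 0; i < n; i++)   /* pass 1: PHI imports only */
      if (imports[i].is_phi)
        solve (&imports[i]);
    for (size_t i = 0; i < n; i++)   /* pass 2: the remaining imports */
      if (!imports[i].is_phi)
        solve (&imports[i]);
    return 0;
  }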
Fixed.