GCC trunk as of 31924665c86d47af6b1f22a74f594f2e1dc0ed2d takes a very long time (probably a hang; I cancelled it after 12 minutes) compiling this file:

https://uclibc.org/~kraj/channels_combine.i

aarch64-yoe-linux/aarch64-yoe-linux-g++ channels_combine.i -O2 (hangs)

It compiles fine with -O0, -Os, -Og and -Oz, but not with -O1, -O2 or -O3.
Created attachment 54411: unreduced testcase
rtl_dce seems stuck and keeps adding to the worklist for:

;; Function carotene_o4t::combine2 (_ZN12carotene_o4t8combine2ERKNS_6Size2DEPKllS4_lPll, funcdef_no=5226, decl_uid=40999, cgraph_uid=4575, symbol_order=4584)
There's another endless DCE bug somewhere.
Reduced testcase (-O2):

#pragma GCC aarch64 "arm_neon.h"
typedef __Int64x1_t int64x1_t;
void foo (int64x1x4_t);

void
bar (int64x1_t a)
{
  for (;;)
    {
      int64x1x4_t b;
      b.val[3] = a;
      foo (b);
    }
}
The peephole2 dump keeps repeating

Finished finding needed instructions:
processing block 3 lr out = 31 [sp] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 40 [v8]
  Adding insn 12 to worklist
  Adding insn 36 to worklist
  Adding insn 35 to worklist
  Adding insn 34 to worklist
  Adding insn 33 to worklist
  Adding insn 8 to worklist
processing block 2 lr out = 31 [sp] 34 [v2] 35 [v3] 36 [v4] 40 [v8]
  Adding insn 2 to worklist
  Adding insn 38 to worklist
df_worklist_dataflow_doublequeue: n_basic_blocks 4 n_edges 3 count 4 ( 1)

forever. bb 3 is:

(code_label 13 4 7 3 2 (nil) [1 uses])
(note 7 13 9 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 9 7 8 3 (clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24)) "pr108681.C":10:14 -1
     (expr_list:REG_UNUSED (reg/v:V4x1DI 34 v2 [orig:92 b ] [92])
        (nil)))
(insn:TI 8 9 33 3 (set (reg:DI 37 v5 [ b+24 ])
        (reg:DI 40 v8 [orig:93 a ] [93])) "pr108681.C":10:14 65 {*movdi_aarch64}
     (nil))
(insn 33 8 34 3 (set (reg:DI 32 v0)
        (reg:DI 34 v2)) "pr108681.C":11:9 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 34 v2)
        (nil)))
(insn:TI 34 33 35 3 (set (reg:DI 33 v1)
        (reg:DI 35 v3)) "pr108681.C":11:9 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 35 v3)
        (nil)))
(insn 35 34 36 3 (set (reg:DI 34 v2)
        (reg:DI 36 v4)) "pr108681.C":11:9 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 36 v4)
        (nil)))
(insn:TI 36 35 11 3 (set (reg:DI 35 v3)
        (reg:DI 37 v5)) "pr108681.C":11:9 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 37 v5)
        (nil)))
(call_insn 11 36 12 3 (parallel [
            (call (mem:DI (symbol_ref:DI ("_Z3foo11int64x1x4_t") [flags 0x41] <function_decl 0x7fffe9ef9c00 foo>) [0 foo S8 A8])
                (const_int 0 [0]))
            (unspec:DI [
                    (const_int 0 [0])
                ] UNSPEC_CALLEE_ABI)
            (clobber (reg:DI 30 x30))
        ]) "pr108681.C":11:9 58 {*call_insn}
     (expr_list:REG_DEAD (reg:V4x1DI 32 v0)
        (expr_list:REG_CALL_DECL (symbol_ref:DI ("_Z3foo11int64x1x4_t") [flags 0x41] <function_decl 0x7fffe9ef9c00 foo>)
            (nil)))
    (expr_list (clobber (reg:DI 17 x17))
        (expr_list (clobber (reg:DI 16 x16))
            (expr_list:V4x1DI (use (reg:V4x1DI 32 v0))
                (nil)))))
(insn 12 11 30 3 (clobber (reg/v:V4x1DI 34 v2 [orig:92 b ] [92])) -1
     (expr_list:REG_UNUSED (reg:TI 36 v4)
        (nil)))
(jump_insn:TI 30 12 31 3 (set (pc)
        (label_ref 13)) 2 {jump}
     (nil)
 -> 13)

and bb 2

(note 5 1 37 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn/f:TI 37 5 38 2 (parallel [
            (set (reg/f:DI 31 sp)
                (plus:DI (reg/f:DI 31 sp)
                    (const_int -32 [0xffffffffffffffe0])))
            (set/f (mem:DI (plus:DI (reg/f:DI 31 sp)
                        (const_int -32 [0xffffffffffffffe0])) [0 S8 A8])
                (reg:DI 29 x29))
            (set/f (mem:DI (plus:DI (reg/f:DI 31 sp)
                        (const_int -24 [0xffffffffffffffe8])) [0 S8 A8])
                (reg:DI 30 x30))
        ]) "pr108681.C":7:1 127 {storewb_pairdi_di}
     (expr_list:REG_DEAD (reg:DI 30 x30)
        (expr_list:REG_DEAD (reg:DI 29 x29)
            (nil))))
(insn 38 37 39 2 (set (reg/f:DI 29 x29)
        (reg/f:DI 31 sp)) "pr108681.C":7:1 65 {*movdi_aarch64}
     (nil))
(insn:TI 39 38 40 2 (set (mem:BLK (scratch) [0 A8])
        (unspec:BLK [
                (reg/f:DI 31 sp)
                (reg/f:DI 29 x29)
            ] UNSPEC_PRLG_STK)) "pr108681.C":7:1 1140 {stack_tie}
     (expr_list:REG_DEAD (reg/f:DI 29 x29)
        (nil)))
(insn/f 40 39 41 2 (set (mem/c:DF (plus:DI (reg/f:DI 31 sp)
                (const_int 16 [0x10])) [3 S8 A8])
        (reg:DF 40 v8)) "pr108681.C":7:1 76 {*movdf_aarch64}
     (expr_list:REG_DEAD (reg:DF 40 v8)
        (nil)))
(note 41 40 2 2 NOTE_INSN_PROLOGUE_END)
(insn:TI 2 41 4 2 (set (reg:DI 40 v8 [orig:93 a ] [93])
        (reg:DI 32 v0 [ a ])) "pr108681.C":7:1 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 32 v0 [ a ])
        (nil)))
(note 4 2 13 2 NOTE_INSN_FUNCTION_BEG)
The repeated local/global changes seem to come from register 37 (v5): it keeps being added to the local_live bitmap and copied to DF_LR_IN, but then something changes it back.
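To picture how a register can keep entering and leaving a live set without ever settling, here is a toy backward-liveness model. It is only a sketch under stated assumptions: `Block`, `live_in`, `rounds_until_agreement` and `self_check` are hypothetical names, not GCC's df or DCE code, and the model does not claim which interpretation either GCC pass actually uses. The cap on rounds exists only so the example terminates.

```cpp
#include <bitset>
#include <cassert>

// Toy model only: these types and functions are hypothetical, not GCC's.
using RegSet = std::bitset<64>;

struct Block {
    RegSet use;             // registers read before any write in the block
    RegSet full_def;        // registers completely overwritten in the block
    RegSet subreg_clobber;  // registers named only through a subreg clobber
};

// Standard backward liveness transfer: live_in = use | (live_out & ~kill).
// `clobber_kills` selects one of two interpretations of the subreg clobber.
RegSet live_in(const Block& b, const RegSet& live_out, bool clobber_kills) {
    RegSet kill = b.full_def;
    if (clobber_kills)
        kill |= b.subreg_clobber;
    return b.use | (live_out & ~kill);
}

// A "solver" stores live_in under one interpretation; a "checker" recomputes
// it under another and requests a re-solve on any mismatch. Returns the
// round in which both agree, or -1 if `cap` rounds pass without agreement.
int rounds_until_agreement(const Block& b, const RegSet& live_out,
                           bool solver_kills, bool checker_kills, int cap) {
    for (int round = 1; round <= cap; ++round) {
        RegSet stored = live_in(b, live_out, solver_kills);
        RegSet local = live_in(b, live_out, checker_kills);
        if (stored == local)
            return round;
        // Mismatch: re-solving is deterministic, so the next round repeats.
    }
    return -1;
}

// Usage: register 37 is live out and is touched only by a subreg clobber.
void self_check() {
    Block b;
    b.subreg_clobber.set(37);
    RegSet out;
    out.set(37);
    assert(rounds_until_agreement(b, out, true, true, 100) == 1);    // consistent
    assert(rounds_until_agreement(b, out, true, false, 100) == -1);  // disagreement
}
```

When both sides interpret the clobber the same way, they agree in the first round. When they differ, every re-solve reproduces the same mismatch, so only the artificial cap ends the loop.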
(insn 9 7 8 3 (clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24)) "...":10:14 -1
     (nil))

looks suspicious. I would have expected that to be simplified or removed by LRA.
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:3cac06d84f334705ed0bce12fbc3a4cec4a8fd3b

commit r13-5972-g3cac06d84f334705ed0bce12fbc3a4cec4a8fd3b
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Feb 13 21:13:59 2023 +0000

    lra: Replace subregs in bare uses & clobbers [PR108681]

    In this PR we had a write to one vector of a 4-vector tuple. The vector had mode V1DI, and the target doesn't provide V1DI moves, so this was converted into:

        (clobber (subreg:V1DI (reg/v:V4x1DI 92 [ b ]) 24))

    followed by a DImode move. (The clobber isn't really necessary or helpful for a single word, but would be for wider moves.)

    The subreg in the clobber survived until after RA:

        (clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24))

    IMO this isn't well-formed. If a subreg of a hard register simplifies to a hard register, it should be replaced by the hard register. If the subreg doesn't simplify, then target-independent code can't be sure which parts of the register are affected and which aren't. A clobber of such a subreg isn't useful and (again IMO) should just be removed. Conversely, a use of such a subreg is effectively a use of the whole inner register.

    LRA has code to simplify subregs of hard registers, but it didn't handle bare uses and clobbers. The patch extends it to do that.

    One question was whether the final_p argument to alter_subregs should be true or false. True is IMO dangerous, since it forces replacements that might not be valid from a dataflow perspective, and uses and clobbers only exist for dataflow. As said above, I think the correct way of handling a failed simplification would be to delete clobbers and replace uses of subregs with uses of the inner register. But I didn't want to write untested code to do that.

    In the PR, the clobber caused an infinite loop in DCE, because of a disagreement about what effect the clobber had. But for the reasons above, I think that was GIGO rather than a bug in DF or DCE.

    gcc/
            PR rtl-optimization/108681
            * lra-spills.cc (lra_final_code_change): Extend subreg replacement code to handle bare uses and clobbers.

    gcc/testsuite/
            PR rtl-optimization/108681
            * gcc.target/aarch64/pr108681.c: New test.
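The rule stated in the commit message can be sketched as a small decision function. This is a toy model under loud assumptions, not LRA's code: `Kind`, `Action`, `simplify_subreg_toy`, `handle_bare_subreg` and `self_check` are hypothetical names, and the register arithmetic assumes a tuple laid out over consecutive hard registers of equal size, which matches the dump above only in that b+24 lives in register 37 (v5) while b starts at register 34 (v2). Note also that, per the commit message, the patch itself only extends the simplification step; the delete/replace fallback is what the author describes as the correct handling but deliberately did not implement.

```cpp
#include <cassert>
#include <optional>

// Toy model only: hypothetical names, not LRA's data structures.
enum class Kind { Use, Clobber };
enum class Action { ReplaceWithHardReg, DeleteClobber, UseInnerReg };

// Assumed layout: a multi-register value occupies `nregs` consecutive hard
// registers of `bytes_per_reg` bytes each, starting at `base_regno`.
// A subreg covering exactly one of those registers simplifies to it.
std::optional<int> simplify_subreg_toy(int base_regno, int nregs,
                                       int bytes_per_reg, int byte_offset,
                                       int subreg_bytes) {
    if (byte_offset < 0 || subreg_bytes != bytes_per_reg)
        return std::nullopt;
    if (byte_offset % bytes_per_reg != 0)
        return std::nullopt;
    int index = byte_offset / bytes_per_reg;
    if (index >= nregs)
        return std::nullopt;
    return base_regno + index;
}

// Decision described in the commit message for a bare use or clobber.
Action handle_bare_subreg(Kind kind, const std::optional<int>& simplified) {
    if (simplified)
        return Action::ReplaceWithHardReg;  // what the patch extends LRA to do
    // Fallback the author considers correct but left unimplemented:
    return kind == Kind::Clobber ? Action::DeleteClobber : Action::UseInnerReg;
}

// Usage: the subreg from the PR, offset 24 into a 4 x 8-byte tuple at reg 34.
void self_check() {
    std::optional<int> r = simplify_subreg_toy(34, 4, 8, 24, 8);
    assert(r.has_value() && r.value() == 37);
    assert(handle_bare_subreg(Kind::Clobber, r) == Action::ReplaceWithHardReg);
}
```

Under these assumptions the clobber from the PR would become a clobber of register 37 alone, which leaves no room for two passes to disagree about which part of the tuple is affected.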
Fixed.
(In reply to rsandifo@gcc.gnu.org from comment #9)
> Fixed.

Thanks, I can confirm that it fixes the hang in the original case (building carotene in opencv).
*** Bug 106041 has been marked as a duplicate of this bug. ***
As the duplicate shows, this also affects the GCC 12 branch (at least).
Hey everyone, has this been fixed in the gcc-12 branch as well? The summary states that it is a gcc-12 regression, that the fix went into gcc-13, and that 12.2.1 is known to work. Is this summary correct, or does the fix still need to be backported to the gcc-12 branch?
No, it's not in GCC 12 branch yet. I'm leaving it for a few weeks to see if there's any fallout on trunk before backporting. I've removed the misleading "Known to work", thanks for the heads up.
The releases/gcc-12 branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:61bdd3c38039e1e309d5cf78c16c4052f6e09bea

commit r12-9382-g61bdd3c38039e1e309d5cf78c16c4052f6e09bea
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Apr 3 09:57:08 2023 +0100

    lra: Replace subregs in bare uses & clobbers [PR108681]

    In this PR we had a write to one vector of a 4-vector tuple. The vector had mode V1DI, and the target doesn't provide V1DI moves, so this was converted into:

        (clobber (subreg:V1DI (reg/v:V4x1DI 92 [ b ]) 24))

    followed by a DImode move. (The clobber isn't really necessary or helpful for a single word, but would be for wider moves.)

    The subreg in the clobber survived until after RA:

        (clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24))

    IMO this isn't well-formed. If a subreg of a hard register simplifies to a hard register, it should be replaced by the hard register. If the subreg doesn't simplify, then target-independent code can't be sure which parts of the register are affected and which aren't. A clobber of such a subreg isn't useful and (again IMO) should just be removed. Conversely, a use of such a subreg is effectively a use of the whole inner register.

    LRA has code to simplify subregs of hard registers, but it didn't handle bare uses and clobbers. The patch extends it to do that.

    One question was whether the final_p argument to alter_subregs should be true or false. True is IMO dangerous, since it forces replacements that might not be valid from a dataflow perspective, and uses and clobbers only exist for dataflow. As said above, I think the correct way of handling a failed simplification would be to delete clobbers and replace uses of subregs with uses of the inner register. But I didn't want to write untested code to do that.

    In the PR, the clobber caused an infinite loop in DCE, because of a disagreement about what effect the clobber had. But for the reasons above, I think that was GIGO rather than a bug in DF or DCE.

    gcc/
            PR rtl-optimization/108681
            * lra-spills.cc (lra_final_code_change): Extend subreg replacement code to handle bare uses and clobbers.

    gcc/testsuite/
            PR rtl-optimization/108681
            * gcc.target/aarch64/pr108681.c: New test.

    (cherry picked from commit 3cac06d84f334705ed0bce12fbc3a4cec4a8fd3b)
Fixed for GCC 12 too.