108681 – [12 Regression] gcc hangs compiling opencv/channels_combine.cpp for aarch64

Summary: [12 Regression] gcc hangs compiling opencv/channels_combine.cpp for aarch64

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	13.0

Importance:	P3 normal
Target Milestone:	12.3
Assignee:	Richard Sandiford

URL:
Keywords:	compile-time-hog, needs-bisection

Duplicates (1):	106041
Depends on:
Blocks:

Reported:	2023-02-06 05:19 UTC by Khem Raj
Modified:	2024-09-02 21:45 UTC
CC List:	6 users

See Also:	106041 109794 116564
Host:
Target:	aarch64
Build:
Known to work:
Known to fail:	13.0
Last reconfirmed:	2023-02-06 00:00:00

Attachments
unreduced Testcase (110.33 KB, application/x-bzip) 2023-02-06 05:41 UTC, Andrew Pinski

Description Khem Raj 2023-02-06 05:19:31 UTC

GCC trunk as of 31924665c86d47af6b1f22a74f594f2e1dc0ed2d takes a very long time to compile this file (probably a hang; I cancelled it after 12 minutes):

https://uclibc.org/~kraj/channels_combine.i

aarch64-yoe-linux/aarch64-yoe-linux-g++ channels_combine.i -O2 (hangs)

It compiles fine with -O0, -Os, -Og and -Oz, but not with -O1, -O2 or -O3.

Comment 1 Andrew Pinski 2023-02-06 05:41:19 UTC

Created attachment 54411
unreduced Testcase

Comment 2 Andrew Pinski 2023-02-06 06:08:32 UTC

rtl_dce seems stuck and keeps on adding to the worklist for:

;; Function carotene_o4t::combine2 (_ZN12carotene_o4t8combine2ERKNS_6Size2DEPKllS4_lPll, funcdef_no=5226, decl_uid=40999, cgraph_uid=4575, symbol_order=4584)

Comment 3 Richard Biener 2023-02-06 07:56:43 UTC

There's another endless DCE bug somewhere.

Comment 4 Jakub Jelinek 2023-02-06 10:04:35 UTC

Reduced testcase (-O2):
#pragma GCC aarch64 "arm_neon.h"
typedef __Int64x1_t int64x1_t;
void foo (int64x1x4_t);

void
bar (int64x1_t a)
{
  for (;;) {
    int64x1x4_t b;
    b.val[3] = a;
    foo (b);
  }
}

Comment 5 Jakub Jelinek 2023-02-06 10:18:08 UTC

The peephole2 dump keeps repeating
Finished finding needed instructions:
processing block 3 lr out =  31 [sp] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 40 [v8]
  Adding insn 12 to worklist
  Adding insn 36 to worklist
  Adding insn 35 to worklist
  Adding insn 34 to worklist
  Adding insn 33 to worklist
  Adding insn 8 to worklist
processing block 2 lr out =  31 [sp] 34 [v2] 35 [v3] 36 [v4] 40 [v8]
  Adding insn 2 to worklist
  Adding insn 38 to worklist
df_worklist_dataflow_doublequeue: n_basic_blocks 4 n_edges 3 count 4 (    1)
forever.
bb 3 is:
(code_label 13 4 7 3 2 (nil) [1 uses])
(note 7 13 9 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 9 7 8 3 (clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24)) "pr108681.C":10:14 -1
     (expr_list:REG_UNUSED (reg/v:V4x1DI 34 v2 [orig:92 b ] [92])
        (nil)))
(insn:TI 8 9 33 3 (set (reg:DI 37 v5 [ b+24 ])
        (reg:DI 40 v8 [orig:93 a ] [93])) "pr108681.C":10:14 65 {*movdi_aarch64}
     (nil))
(insn 33 8 34 3 (set (reg:DI 32 v0)
        (reg:DI 34 v2)) "pr108681.C":11:9 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 34 v2)
        (nil)))
(insn:TI 34 33 35 3 (set (reg:DI 33 v1)
        (reg:DI 35 v3)) "pr108681.C":11:9 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 35 v3)
        (nil)))
(insn 35 34 36 3 (set (reg:DI 34 v2)
        (reg:DI 36 v4)) "pr108681.C":11:9 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 36 v4)
        (nil)))
(insn:TI 36 35 11 3 (set (reg:DI 35 v3)
        (reg:DI 37 v5)) "pr108681.C":11:9 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 37 v5)
        (nil)))
(call_insn 11 36 12 3 (parallel [
            (call (mem:DI (symbol_ref:DI ("_Z3foo11int64x1x4_t") [flags 0x41] <function_decl 0x7fffe9ef9c00 foo>) [0 foo S8 A8])
                (const_int 0 [0]))
            (unspec:DI [
                    (const_int 0 [0])
                ] UNSPEC_CALLEE_ABI)
            (clobber (reg:DI 30 x30))
        ]) "pr108681.C":11:9 58 {*call_insn}
     (expr_list:REG_DEAD (reg:V4x1DI 32 v0)
        (expr_list:REG_CALL_DECL (symbol_ref:DI ("_Z3foo11int64x1x4_t") [flags 0x41] <function_decl 0x7fffe9ef9c00 foo>)
            (nil)))
    (expr_list (clobber (reg:DI 17 x17))
        (expr_list (clobber (reg:DI 16 x16))
            (expr_list:V4x1DI (use (reg:V4x1DI 32 v0))
                (nil)))))
(insn 12 11 30 3 (clobber (reg/v:V4x1DI 34 v2 [orig:92 b ] [92])) -1
     (expr_list:REG_UNUSED (reg:TI 36 v4)
        (nil)))
(jump_insn:TI 30 12 31 3 (set (pc)
        (label_ref 13)) 2 {jump}
     (nil)
 -> 13)
and bb 2
(note 5 1 37 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn/f:TI 37 5 38 2 (parallel [
            (set (reg/f:DI 31 sp)
                (plus:DI (reg/f:DI 31 sp)
                    (const_int -32 [0xffffffffffffffe0])))
            (set/f (mem:DI (plus:DI (reg/f:DI 31 sp)
                        (const_int -32 [0xffffffffffffffe0])) [0  S8 A8])
                (reg:DI 29 x29))
            (set/f (mem:DI (plus:DI (reg/f:DI 31 sp)
                        (const_int -24 [0xffffffffffffffe8])) [0  S8 A8])
                (reg:DI 30 x30))
        ]) "pr108681.C":7:1 127 {storewb_pairdi_di}
     (expr_list:REG_DEAD (reg:DI 30 x30)
        (expr_list:REG_DEAD (reg:DI 29 x29)
            (nil))))
(insn 38 37 39 2 (set (reg/f:DI 29 x29)
        (reg/f:DI 31 sp)) "pr108681.C":7:1 65 {*movdi_aarch64}
     (nil))
(insn:TI 39 38 40 2 (set (mem:BLK (scratch) [0  A8])
        (unspec:BLK [
                (reg/f:DI 31 sp)
                (reg/f:DI 29 x29)
            ] UNSPEC_PRLG_STK)) "pr108681.C":7:1 1140 {stack_tie}
     (expr_list:REG_DEAD (reg/f:DI 29 x29)
        (nil)))
(insn/f 40 39 41 2 (set (mem/c:DF (plus:DI (reg/f:DI 31 sp)
                (const_int 16 [0x10])) [3  S8 A8])
        (reg:DF 40 v8)) "pr108681.C":7:1 76 {*movdf_aarch64}
     (expr_list:REG_DEAD (reg:DF 40 v8)
        (nil)))
(note 41 40 2 2 NOTE_INSN_PROLOGUE_END)
(insn:TI 2 41 4 2 (set (reg:DI 40 v8 [orig:93 a ] [93])
        (reg:DI 32 v0 [ a ])) "pr108681.C":7:1 65 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 32 v0 [ a ])
        (nil)))
(note 4 2 13 2 NOTE_INSN_FUNCTION_BEG)

Comment 6 Jakub Jelinek 2023-02-06 10:49:16 UTC

It seems the constant local/global changes happen because register 37 (v5) keeps being added to the local_live bitmap and copied to DF_LR_IN, but then something changes it back.
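
The back-and-forth described here can be pictured with a toy model. The sketch below is not GCC's code: RegSet, live_in and rounds_to_agree are hypothetical names, and the assumption that one analysis treats an insn as killing register 37 while the other does not is purely illustrative. It only shows why two liveness computations that disagree about a single insn never settle when each one overwrites the other's result.

```cpp
#include <bitset>
#include <cassert>

// Toy model only; these names are hypothetical and are not GCC data structures.
using RegSet = std::bitset<64>;

// Backward liveness transfer for one block: live_in = (live_out - kills) | uses.
RegSet live_in(const RegSet &live_out, const RegSet &kills, const RegSet &uses)
{
  return (live_out & ~kills) | uses;
}

// Two analyses take turns recomputing the same live-in set, each with its own
// idea of what the block kills.  Returns the round in which they agree, or -1
// if they still disagree after max_rounds.
int rounds_to_agree(const RegSet &kills_a, const RegSet &kills_b, int max_rounds)
{
  RegSet live_out, uses;
  live_out.set(37);   // v5 is live on exit (the block loops back to itself)
  live_out.set(40);   // v8 holds the loop-invariant input
  uses.set(40);

  RegSet stored = live_in(live_out, kills_b, uses);    // initial global solution
  for (int round = 1; round <= max_rounds; ++round)
    {
      RegSet local = live_in(live_out, kills_a, uses); // local recomputation
      if (local == stored)
        return round;                                  // fixed point reached
      // The local pass records its answer and requests a global re-solve,
      // which then puts the old answer back.
      stored = live_in(live_out, kills_b, uses);
    }
  return -1;
}
```

When both analyses use the same kill set, the loop ends in the first round; when only one of them thinks register 37 is killed, the iteration cap is the only thing that stops it.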

Comment 7 Richard Sandiford 2023-02-06 14:18:32 UTC

(insn 9 7 8 3 (clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24)) "...":10:14 -1
     (nil))

looks suspicious.  I would have expected that to be simplified or
removed by LRA.
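
For illustration, the arithmetic behind such a simplification can be sketched as follows. This assumes each 8-byte V1DI element of the V4x1DI tuple occupies one consecutive vector register, which matches the "v5 [ b+24 ]" annotation in the dump from comment 5. subreg_to_hard_regno is a hypothetical helper, not a GCC function, and a real target applies further mode and register-class checks.

```cpp
#include <cassert>
#include <optional>

// Hypothetical helper (not a GCC API).  Maps a subreg of a multi-register
// value to a single hard register number when the subreg covers exactly one
// of the registers.  Returns std::nullopt when it does not.
std::optional<int> subreg_to_hard_regno(int first_regno, int nregs,
                                        int bytes_per_reg, int byte_offset,
                                        int byte_size)
{
  if (bytes_per_reg <= 0 || byte_size != bytes_per_reg)
    return std::nullopt;              // not exactly one register wide
  if (byte_offset < 0 || byte_offset % bytes_per_reg != 0)
    return std::nullopt;              // straddles a register boundary
  int index = byte_offset / bytes_per_reg;
  if (index >= nregs)
    return std::nullopt;              // past the end of the inner value
  return first_regno + index;
}
```

With the values from this PR (first register 34, four registers of 8 bytes each, byte offset 24), the helper yields register 37, i.e. v5.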

Comment 8 GCC Commits 2023-02-13 21:14:13 UTC

The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:3cac06d84f334705ed0bce12fbc3a4cec4a8fd3b

commit r13-5972-g3cac06d84f334705ed0bce12fbc3a4cec4a8fd3b
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Feb 13 21:13:59 2023 +0000

    lra: Replace subregs in bare uses & clobbers [PR108681]
    
    In this PR we had a write to one vector of a 4-vector tuple.
    The vector had mode V1DI, and the target doesn't provide V1DI
    moves, so this was converted into:
    
        (clobber (subreg:V1DI (reg/v:V4x1DI 92 [ b ]) 24))
    
    followed by a DImode move.  (The clobber isn't really necessary
    or helpful for a single word, but would be for wider moves.)
    
    The subreg in the clobber survived until after RA:
    
        (clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24))
    
    IMO this isn't well-formed.  If a subreg of a hard register simplifies
    to a hard register, it should be replaced by the hard register.  If the
    subreg doesn't simplify, then target-independent code can't be sure
    which parts of the register are affected and which aren't.  A clobber
    of such a subreg isn't useful and (again IMO) should just be removed.
    Conversely, a use of such a subreg is effectively a use of the whole
    inner register.
    
    LRA has code to simplify subregs of hard registers, but it didn't
    handle bare uses and clobbers.  The patch extends it to do that.
    
    One question was whether the final_p argument to alter_subregs
    should be true or false.  True is IMO dangerous, since it forces
    replacements that might not be valid from a dataflow perspective,
    and uses and clobbers only exist for dataflow.  As said above,
    I think the correct way of handling a failed simplification would
    be to delete clobbers and replace uses of subregs with uses of
    the inner register.  But I didn't want to write untested code
    to do that.
    
    In the PR, the clobber caused an infinite loop in DCE, because
    of a disagreement about what effect the clobber had.  But for
    the reasons above, I think that was GIGO rather than a bug in
    DF or DCE.
    
    gcc/
            PR rtl-optimization/108681
            * lra-spills.cc (lra_final_code_change): Extend subreg replacement
            code to handle bare uses and clobbers.
    
    gcc/testsuite/
            PR rtl-optimization/108681
            * gcc.target/aarch64/pr108681.c: New test.
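
The handling argued for in the commit message above can be summarised as a small decision function. This is a toy model with hypothetical names (Kind, Action, resolve_bare_subreg), not the code in lra-spills.cc. Note in particular that the commit message says the two fallback branches were deliberately not implemented by the patch, so they appear here only to illustrate the stated reasoning.

```cpp
#include <cassert>
#include <optional>

// Toy model; hypothetical names, not GCC's RTL representation.
enum class Kind { Use, Clobber };
enum class Action { ReplaceWithHardReg, DeleteInsn, UseWholeInnerReg };

struct Decision
{
  Action action;
  int regno;   // register the insn refers to afterwards, or -1 if deleted
};

// `simplified` is the hard register the subreg simplifies to, if any.
Decision resolve_bare_subreg(Kind kind, int inner_regno,
                             std::optional<int> simplified)
{
  // A subreg that simplifies to a hard register is replaced by that register.
  if (simplified)
    return {Action::ReplaceWithHardReg, *simplified};
  // Fallbacks described (but not implemented) in the commit message:
  // a clobber of an unsimplifiable subreg carries no useful information...
  if (kind == Kind::Clobber)
    return {Action::DeleteInsn, -1};
  // ...while a use of one is effectively a use of the whole inner register.
  return {Action::UseWholeInnerReg, inner_regno};
}
```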

Comment 9 Richard Sandiford 2023-02-13 21:18:01 UTC

Fixed.

Comment 10 Martin Jansa 2023-02-14 08:01:34 UTC

(In reply to rsandifo@gcc.gnu.org from comment #9)
> Fixed.

Thanks, I can confirm that it fixes the hang in the original case (building carotene in opencv).

Comment 11 Richard Biener 2023-02-22 08:53:36 UTC

*** Bug 106041 has been marked as a duplicate of this bug. ***

Comment 12 Richard Biener 2023-02-22 08:54:31 UTC

As the duplicate shows this also affects the GCC 12 branch (at least).

Comment 13 tt_1 2023-03-02 13:17:54 UTC

Hey everyone, has this been fixed on the gcc-12 branch as well? The summary states it is a gcc-12 regression, the fix went into gcc-13, and 12.2.1 is listed as known to work.

Is this summary correct, or does the fix still need to be backported to the gcc-12 branch?

Comment 14 Richard Sandiford 2023-03-02 13:20:55 UTC

No, it's not on the GCC 12 branch yet.  I'm leaving it for a few weeks
to see if there's any fallout on trunk before backporting.

I've removed the misleading "Known to work", thanks for the heads up.

Comment 15 GCC Commits 2023-04-03 08:58:05 UTC

The releases/gcc-12 branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:61bdd3c38039e1e309d5cf78c16c4052f6e09bea

commit r12-9382-g61bdd3c38039e1e309d5cf78c16c4052f6e09bea
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Apr 3 09:57:08 2023 +0100

    lra: Replace subregs in bare uses & clobbers [PR108681]
    
    In this PR we had a write to one vector of a 4-vector tuple.
    The vector had mode V1DI, and the target doesn't provide V1DI
    moves, so this was converted into:
    
        (clobber (subreg:V1DI (reg/v:V4x1DI 92 [ b ]) 24))
    
    followed by a DImode move.  (The clobber isn't really necessary
    or helpful for a single word, but would be for wider moves.)
    
    The subreg in the clobber survived until after RA:
    
        (clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24))
    
    IMO this isn't well-formed.  If a subreg of a hard register simplifies
    to a hard register, it should be replaced by the hard register.  If the
    subreg doesn't simplify, then target-independent code can't be sure
    which parts of the register are affected and which aren't.  A clobber
    of such a subreg isn't useful and (again IMO) should just be removed.
    Conversely, a use of such a subreg is effectively a use of the whole
    inner register.
    
    LRA has code to simplify subregs of hard registers, but it didn't
    handle bare uses and clobbers.  The patch extends it to do that.
    
    One question was whether the final_p argument to alter_subregs
    should be true or false.  True is IMO dangerous, since it forces
    replacements that might not be valid from a dataflow perspective,
    and uses and clobbers only exist for dataflow.  As said above,
    I think the correct way of handling a failed simplification would
    be to delete clobbers and replace uses of subregs with uses of
    the inner register.  But I didn't want to write untested code
    to do that.
    
    In the PR, the clobber caused an infinite loop in DCE, because
    of a disagreement about what effect the clobber had.  But for
    the reasons above, I think that was GIGO rather than a bug in
    DF or DCE.
    
    gcc/
            PR rtl-optimization/108681
            * lra-spills.cc (lra_final_code_change): Extend subreg replacement
            code to handle bare uses and clobbers.
    
    gcc/testsuite/
            PR rtl-optimization/108681
            * gcc.target/aarch64/pr108681.c: New test.
    
    (cherry picked from commit 3cac06d84f334705ed0bce12fbc3a4cec4a8fd3b)

Comment 16 Richard Sandiford 2023-04-03 09:04:28 UTC

Fixed for GCC 12 too.