106617 – [13 Regression] gcc is very slow at ternary expressions since r13-322-g7f04b0d786e13ff5

Bug 106617 - [13 Regression] gcc is very slow at ternary expressions since r13-322-g7f04b0d786e13ff5

Summary: [13 Regression] gcc is very slow at ternary expressions since r13-322-g7f04b0...

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	13.0

Importance:	P3 normal
Target Milestone:	13.0
Assignee:	Richard Biener

URL:
Keywords:	compile-time-hog

Duplicates (1):	106642 (view as bug list)
Depends on:
Blocks:	106642
	Show dependency tree / graph

Reported:	2022-08-14 21:52 UTC by Sergei Trofimovich
Modified:	2022-08-18 14:04 UTC (History)
CC List:	5 users (show)

See Also:	106642
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2022-08-15 00:00:00

Attachments
fc_exch.c.c.orig.gz (295.39 KB, application/gzip) 2022-08-14 21:52 UTC, Sergei Trofimovich	Details
ring_buffer.c.gz (258.88 KB, application/gzip) 2022-08-15 07:22 UTC, Sergei Trofimovich	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sergei Trofimovich 2022-08-14 21:52:08 UTC

Created attachment 53454 [details]
fc_exch.c.c.orig.gz

Initially I observed slowness on ia64 trying to compile linux-4.19 using this week's gcc snapshot. x86_64 also seems to be affected.

Original (ia64-specific) example is attached as fc_exch.c.c.orig.gz. It shows the origin of this construct.

I reduced the example with 4s timeout and got the following reproducer:

// $ cat fc_exch.c.c
int nr_cpu_ids;
void fc_setup_exch_mgr() {
  (((((((1UL << (((0, 0)
                            ? ((1)
                                   ? (((nr_cpu_ids)) ? 0
                                      : ((nr_cpu_ids)) & (21) ? 21
                                      : ((nr_cpu_ids)) ? 20
                                      : ((nr_cpu_ids)) & (19) ? 19
                                      : ((nr_cpu_ids)) ? 18
                                      : ((nr_cpu_ids)) & (17) ? 17
                                      : ((nr_cpu_ids)) ? 16
                                      : ((nr_cpu_ids)) & (15) ? 15
                                      : ((nr_cpu_ids)) ? 14
                                      : ((nr_cpu_ids)) & (13) ? 13
                                      : ((nr_cpu_ids)) ? 12
                                      : ((nr_cpu_ids)) & (11) ? 11
                                      : ((nr_cpu_ids)) ? 10
                                      : ((nr_cpu_ids)) & (9)  ? 9
                                      : ((nr_cpu_ids))  ? 8
                                      : ((nr_cpu_ids)) & (7)  ? 7
                                      : ((nr_cpu_ids))  ? 6
                                      : ((nr_cpu_ids)) & (5)  ? 5
                                      : ((nr_cpu_ids))  ? 4
                                      : ((nr_cpu_ids)) & (3)
                                          ? 3
                                          : ((nr_cpu_ids)-1) & 1)
                                   : 1)
                            : 0) +
                       1))))) &
                 (1ULL << 2)
             ? 2
             : 1))
           );
}


$ time gcc-13.0.0 -Wno-address-of-packed-member -c fc_exch.c.c -o bug.o -O1

real    0m4,821s
user    0m4,726s
sys     0m0,077s

Almost 5s!

$ time gcc-12.1.0 -Wno-address-of-packed-member -c fc_exch.c.c -o bug.o -O1

real    0m0,019s
user    0m0,013s
sys     0m0,006s

$ gcc -v
Using built-in specs.
COLLECT_GCC=/<<NIX>>/gcc-13.0.0/bin/gcc
COLLECT_LTO_WRAPPER=/<<NIX>>/gcc-13.0.0/libexec/gcc/x86_64-unknown-linux-gnu/13.0.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with:
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 13.0.0 20220807 (experimental) (GCC)

On this example perf shows the following output:

     8.63%  cc1      cc1                   [.] operand_compare::operand_equal_p
     4.39%  cc1      cc1                   [.] operand_compare::verify_hash_value
     4.29%  cc1      cc1                   [.] wide_int_to_tree_1
     3.67%  cc1      cc1                   [.] fold_binary_loc
     3.17%  cc1      cc1                   [.] generic_simplify_NE_EXPR
     2.67%  cc1      cc1                   [.] tree_strip_nop_conversions
     2.64%  cc1      cc1                   [.] generic_simplify_EQ_EXPR
     2.47%  cc1      cc1                   [.] tree_operand_check
     2.39%  cc1      cc1                   [.] get_inner_reference
     2.37%  cc1      cc1                   [.] integer_zerop
     2.25%  cc1      cc1                   [.] wide_int_binop
     2.11%  cc1      cc1                   [.] tree_nop_convert
     1.84%  cc1      cc1                   [.] int_const_binop
     1.81%  cc1      cc1                   [.] int_cst_hasher::hash
     1.78%  cc1      cc1                   [.] ggc_internal_alloc
     1.70%  cc1      cc1                   [.] element_precision
     1.44%  cc1      cc1                   [.] wi::fits_to_tree_p<poly_int<1u, generic_wide_int<wide_int_ref_storage<false, true> > > >
     1.36%  cc1      cc1                   [.] contains_struct_check
     1.08%  cc1      cc1                   [.] build2
     1.05%  cc1      cc1                   [.] hash_table<int_cst_hasher, false, xcallocator>::find_slot_with_hash
     1.04%  cc1      cc1                   [.] generic_simplify
     1.03%  cc1      cc1                   [.] cache_wide_int_in_type_cache
     0.97%  cc1      cc1                   [.] get_int_cst_ext_nunits
     0.91%  cc1      cc1                   [.] bitmask_inv_cst_vector_p
     0.87%  cc1      cc1                   [.] tree_strip_sign_nop_conversions
     0.86%  cc1      libc.so.6             [.] __memset_avx2_unaligned_erms

Comment 1 Andrew Pinski 2022-08-14 21:55:23 UTC

I doubt this is a regression. I suspect if you compile the trunk with --enable-checking=release you would see the same compile time as 12.

Comment 2 Sergei Trofimovich 2022-08-14 22:02:59 UTC

(In reply to Andrew Pinski from comment #1)
> I doubt this is a regression. I suspect if you compile the trunk with
> --enable-checking=release you would see the same compile time as 12.

I think both my gcc-12.1.0 and gcc-13.0.0 are compiled with the same options (both are without --enable-checking= option). My unsubstantiated guess would be on recently added rules to fold & (or similar) which does not scale too well on long strings of ternary operators.

Comment 3 Sergei Trofimovich 2022-08-14 22:08:18 UTC

(In reply to Sergei Trofimovich from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > I doubt this is a regression. I suspect if you compile the trunk with
> > --enable-checking=release you would see the same compile time as 12.
> 
> I think both my gcc-12.1.0 and gcc-13.0.0 are compiled with the same options
> (both are without --enable-checking= option). My unsubstantiated guess would
> be on recently added rules to fold & (or similar) which does not scale too
> well on long strings of ternary operators.

Oh, only now I realized snapshots have a different default compared to releases proper. I'll add --enable-checking=release and re-test.

Comment 4 Andrew Pinski 2022-08-14 22:09:12 UTC

Gcc releases are always defaults to --enable-checking=release while the git trunk is =yes.

Comment 5 Sergei Trofimovich 2022-08-15 07:01:05 UTC

--enable-checking=release did not help much. Attached example still compiles 3.5 s. verify parts disappeared from profile, but generic_simplify are still there:

    11.20%  cc1      cc1                   [.] operand_compare::operand_equal_p
     4.32%  cc1      cc1                   [.] wide_int_to_tree_1
     4.23%  cc1      cc1                   [.] fold_binary_loc
     4.02%  cc1      cc1                   [.] generic_simplify_NE_EXPR
     3.64%  cc1      cc1                   [.] generic_simplify_EQ_EXPR
     3.62%  cc1      cc1                   [.] tree_strip_nop_conversions
     2.78%  cc1      cc1                   [.] get_inner_reference
     2.22%  cc1      cc1                   [.] wide_int_binop
     2.17%  cc1      cc1                   [.] int_const_binop
     2.00%  cc1      cc1                   [.] tree_nop_convert
     1.90%  cc1      cc1                   [.] operand_equal_p
     1.77%  cc1      cc1                   [.] wi::fits_to_tree_p<poly_int<1u, generic_wide_int<wide_int_ref_storage<false, true> > > >
     1.75%  cc1      cc1                   [.] integer_zerop
     1.72%  cc1      cc1                   [.] ggc_internal_alloc
     1.66%  cc1      cc1                   [.] element_precision
     1.38%  cc1      cc1                   [.] hash_table<int_cst_hasher, false, xcallocator>::find_slot_with_hash
     1.33%  cc1      cc1                   [.] build2
     1.30%  cc1      cc1                   [.] tree_with_possible_nonzero_bits2
     1.29%  cc1      cc1                   [.] fold_ternary_loc
     1.24%  cc1      cc1                   [.] wi::popcount
     1.19%  cc1      cc1                   [.] tree_truth_valued_p
     1.14%  cc1      cc1                   [.] tree_strip_sign_nop_conversions
     1.11%  cc1      cc1                   [.] generic_simplify
     1.11%  cc1      cc1                   [.] fold_truth_andor
     1.10%  cc1      cc1                   [.] bitmask_inv_cst_vector_p
     1.08%  cc1      cc1                   [.] __popcountdi2
     1.02%  cc1      cc1                   [.] get_nonzero_bits

And I also see hangups when building linux-4.19, this time on x86_64 as well. Will try to extract non-minimized example.

Comment 6 Sergei Trofimovich 2022-08-15 07:22:26 UTC

Created attachment 53455 [details]
ring_buffer.c.gz

ring_buffer.c.gz is a preprocessed file from linux-4.19.

$ time gcc-12.1.0 -O2 -c ring_buffer.c -o bug.o

real    0m0,308s
user    0m0,285s
sys     0m0,017s

$ gcc -O2 -c ring_buffer.c -o bug.o

# Runs for minutes, increases RAM usage, never finishes.

perf stats after 3 minutes' run:

  10,16%  cc1               [.] operand_compare::operand_equal_p
   7,38%  cc1               [.] fold_binary_loc
   7,01%  cc1               [.] generic_simplify_453
   5,63%  cc1               [.] generic_simplify_GT_EXPR
   3,72%  cc1               [.] wide_int_binop
   3,42%  cc1               [.] generic_simplify_COND_EXPR
   3,38%  cc1               [.] hash_table<int_cst_hasher, false, xcallocator>::find_slot_with_hash
   3,04%  cc1               [.] wide_int_to_tree_1
   3,00%  cc1               [.] tree_strip_nop_conversions
   2,82%  cc1               [.] int_const_binop
   2,47%  cc1               [.] get_inner_reference
   2,45%  cc1               [.] fold_build2_loc
   2,23%  cc1               [.] fold_ternary_loc
   2,07%  cc1               [.] tree_strip_sign_nop_conversions
   1,96%  cc1               [.] int_cst_hasher::hash
   1,83%  cc1               [.] ggc_internal_alloc
   1,81%  cc1               [.] fold_relational_const
   1,61%  cc1               [.] wi::fits_to_tree_p<poly_int<1u, generic_wide_int<wide_int_ref_storage<false, true> > > >
   1,57%  cc1               [.] generic_simplify_155
   1,51%  cc1               [.] operand_equal_p
   1,22%  cc1               [.] dbg_cnt
   1,20%  cc1               [.] element_precision
   1,10%  cc1               [.] ggc_free

Comment 7 Richard Biener 2022-08-15 08:55:05 UTC

Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
Applying pattern match.pd:5769, generic-match.cc:14472
...

that's the following pattern which GCC 12 doesn't have

/* From fold_binary_op_with_conditional_arg handle the case of
   rewriting (a ? b : c) > d to a ? (b > d) : (c > d) when the
   compares simplify.  */
(for cmp (simple_comparison)
 (simplify
  (cmp:c (cond @0 @1 @2) @3)
  /* Do not move possibly trapping operations into the conditional as this
     pessimizes code and causes gimplification issues when applied late.  */
  (if (!FLOAT_TYPE_P (TREE_TYPE (@3))
       || operation_could_trap_p (cmp, true, false, @3))
   (cond @0 (cmp! @1 @3) (cmp! @2 @3)))))

the testcase is essentially a degenerate case of this, applying the
pattern recursively.

The testcase is a bit too large to easily follow what happens, cutting
some lines in the middle shows we are eventually producing

  0, 0 ? nr_cpu_ids == 0 && ((nr_cpu_ids & 7) == 0 && (nr_cpu_ids == 0 && ((nr_cpu_ids & 5) == 0 && (nr_cpu_ids == 0 && ((nr_cpu_ids & 3) == 0 && (nr_cpu_ids + -1 & 1) != 0))))) : 0 ? 2 : 1;

not sure why we don't fold the (0, 0) ? ..., that seems to be a "bug" of
fully folding in the C frontend?   We do simplify 0 ? ... just fine.
Supposedly COMPOUND_EXPRs are never constant folded?

Comment 8 Richard Biener 2022-08-15 09:06:51 UTC

We don't fold because

    case COMPOUND_EXPR:
      /* When pedantic, a compound expression can be neither an lvalue
         nor an integer constant expression.  */
      if (TREE_SIDE_EFFECTS (arg0) || TREE_CONSTANT (arg1))
        return NULL_TREE;
      /* Don't let (0, 0) be null pointer constant.  */
      tem = integer_zerop (arg1) ? build1_loc (loc, NOP_EXPR, type, arg1)
                                 : fold_convert_loc (loc, type, arg1);
      return tem;

where we chicken out because of the fear of early folding.  At least the
C frontend should get this correct now.  Ideally the above would
just return arg1 if !TREE_SIDE_EFFECTS (arg0).  OTOH it's really sth for
language specific folding.

Of course the bug then would be still present when replacing (0, 0) with
something non-constant.

Comment 9 Martin Liška 2022-08-16 06:47:50 UTC

Started with r13-322-g7f04b0d786e13ff5.

Comment 10 Andrew Pinski 2022-08-16 14:48:11 UTC

*** Bug 106642 has been marked as a duplicate of this bug. ***

Comment 11 Peter Ivanov 2022-08-16 15:10:00 UTC

(In reply to Martin Liška from comment #9)
> Started with r13-322-g7f04b0d786e13ff5.

I do confirm that after reverting this commit there is no more hanging when compiling (bug 106642)

Comment 12 Richard Biener 2022-08-18 09:06:05 UTC

Btw, just for reference the original looks like

 fc_cpu_order = ( __builtin_constant_p(( __builtin_constant_p(nr_cpu_ids) ? ( ((nr_cpu_ids) == 1) ? 1 : (1UL << (( __builtin_constant_p((nr_cpu_ids) - 1) ? ( __builtin_constant_p((nr_cpu_ids) - 1) ? ( ((nr_cpu_ids) - 1) < 2 ? 0 : ((nr_cpu_ids) - 1) & (1ULL << 63) ? 63 : ((nr_cpu_ids) - 1) & (1ULL << 62) ? 62 : ((nr_cpu_ids) - 1) & (1ULL << 61) ? 61 : ((nr_cpu_ids) - 1) & (1ULL << 60) ? 60 : ((nr_cpu_ids) - 1) & (1ULL << 59) ? 59 : ((nr_cpu_ids) - 1) & (1ULL << 58) ? 58 : ((nr_cpu_ids) - 1) & (1ULL << 57) ? ...

so some fancy ceil_log2.

If you take a simplified

unsigned long fc_setup_exch_mgr(int nr_cpu_ids_m1)
{
  return ((((nr_cpu_ids_m1)) & (1<<21) ? 21
                  : ((nr_cpu_ids_m1)) & (1<<20) ? 20
                  : ((nr_cpu_ids_m1)) & (1<<19) ? 19
                  : ((nr_cpu_ids_m1)) & (1<<18) ? 18
                  : ((nr_cpu_ids_m1)) & (1<<17) ? 17
                  : ((nr_cpu_ids_m1)) & (1<<16) ? 16
                  : ((nr_cpu_ids_m1)) & (1<<15) ? 15
                  : ((nr_cpu_ids_m1)) & (1<<14) ? 14
                  : ((nr_cpu_ids_m1)) & (1<<13) ? 13
                  : ((nr_cpu_ids_m1)) & (1<<12) ? 12
                  : ((nr_cpu_ids_m1)) & (1<<11) ? 11  /* THIS */
                  : ((nr_cpu_ids_m1)) & (1<<9)  ? 9
                  : ((nr_cpu_ids_m1)) & (1<<8)  ? 8
                  : ((nr_cpu_ids_m1)) & (1<<7)  ? 7
                  : ((nr_cpu_ids_m1)) & (1<<6)  ? 6
                  : ((nr_cpu_ids_m1)) & (1<<5)  ? 5
                  : ((nr_cpu_ids_m1)) & (1<<4)  ? 4
                  : ((nr_cpu_ids_m1)) & (1<<3) ? 3
                  : 0)) != 11;
}

I see

./cc1 -quiet t2.c -fdump-tree-original-folding; grep 'Applying' t2.c.005t.original  | sort -u ; grep 'Applying' t2.c.005t.original  | wc -l
Applying pattern match.pd:4778, generic-match.cc:95676
Applying pattern match.pd:5809, generic-match.cc:24025
16383

but when I just remove the marked line it becomes

./cc1 -quiet t2.c -fdump-tree-original-folding; grep 'Applying' t2.c.005t.original  | sort -u ; grep 'Applying' t2.c.005t.original  | wc -l
Applying pattern match.pd:4778, generic-match.cc:95676
Applying pattern match.pd:5809, generic-match.cc:24025
34

hmm, and the issue is probably that we have this pattern both in
fold-const.cc and match.pd and thus when the pattern applies recursively,
like for

 (a ? (b ? d : e) : f) > g

and the original fold implementation would simplify because one of the
braches simplifes:

  /* Check that we have simplified at least one of the branches.  */
  if (!TREE_CONSTANT (arg) && !TREE_CONSTANT (lhs) && !TREE_CONSTANT (rhs))
    return NULL_TREE;

so it goes all the way, recursively to "simple".  But then the match.pd
variant rejects the result because it is EXPR_P for the lhs (but not
for the rhs).  We always first try the match.pd path but then will
try fold_binary_op_with_conditional_arg anyway which makes this effectively
at least quadratic.

This was all done to avoid regressing the COND_EXPR gimple IL refactoring.
One possibility to avoid the recursion with fold-const is to disable the
match.pd pattern for GENERIC.

Comment 13 GCC Commits 2022-08-18 14:04:10 UTC

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:ac68f904fe31baf80fa53218f1d8ee033bd8c79b

commit r13-2113-gac68f904fe31baf80fa53218f1d8ee033bd8c79b
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Aug 18 11:10:30 2022 +0200

    middle-end/106617 - fix fold_binary_op_with_conditional_arg pattern issue
    
    Now that we have parts of fold_binary_op_with_conditional_arg duplicated
    in match.pd and are using ! to take or throw away the result we have to
    be careful to not have both implementations play games which each other,
    causing quadratic behavior.  In particular the match.pd implementation
    requires both arms to simplify while the fold-const.cc is happy with
    just one arm simplifying (something we cannot express in match.pd).
    
    The fix is to simply not enable the match.pd pattern for GENERIC.
    
            PR middle-end/106617
            * match.pd ((a ? b : c) > d -> a ? (b > d) : (c > d)): Fix
            guard, disable on GENERIC to not cause quadratic behavior
            with the fold-const.cc implementation and the use of !
    
            * gcc.dg/pr106617.c: New testcase.

Comment 14 Richard Biener 2022-08-18 14:04:20 UTC

Should be fixed now.