102253 – scalability issues with large loop depth

Bug 102253 - scalability issues with large loop depth

Summary: scalability issues with large loop depth

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	12.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	compile-time-hog

Duplicates (1):	110396 (view as bug list)
Depends on:
Blocks:	110396
	Show dependency tree / graph

Reported:	2021-09-09 09:45 UTC by Richard Biener
Modified:	2023-06-24 20:13 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Richard Biener 2021-09-09 09:45:57 UTC

When investigating an improvement to LIMs fill_always_executed_in I created
the following testcase which creates a loop nest of depth N with conditionally
executed subloops.

extern void foobar (int);
template <int a>
struct bar
{
  static void baz(int b, int)
  {
    if (b & (1 << (a % 32)))
      for (int i = 0; i < 1024; ++i)
        bar<a-1>::baz (b, i);
  }
};
template <>
struct bar<0>
{
  static void baz (int, int i) { foobar (i); }
};
void __attribute__((flatten)) foo(int b)
{
#ifndef N
#define N 10
#endif
  bar<N>::baz (b, 0);
}

For N == 900 (the maximum unless you also specify -ftemplate-depth) and -O1 we see

 tree canonical iv                  :   1.42 ( 13%)   0.00 (  0%)   1.42 ( 13%)    28M ( 13%)
 complete unrolling                 :   2.80 ( 27%)   0.00 (  0%)   2.81 ( 26%)    42M ( 19%)
 integrated RA                      :   3.41 ( 32%)   0.32 ( 80%)   3.72 ( 34%)   640k (  0%)
 TOTAL                              :  10.54          0.40         10.96          224M

For N == 1800 and -O1 it is already

 tree canonical iv                  :  30.43 ( 28%)   0.05 ( 14%)  30.50 ( 28%)   116M ( 15%)
 complete unrolling                 :  63.96 ( 59%)   0.06 ( 17%)  64.04 ( 59%)   175M ( 22%)
 tree iv optimization               :   5.75 (  5%)   0.00 (  0%)   5.77 (  5%)   126M ( 16%)
 integrated RA                      :   1.40 (  1%)   0.12 ( 34%)   1.53 (  1%)  1754k (  0%)
 TOTAL                              : 108.35          0.35        108.75          796M

For reference compile-time with N == 450 is 2.5s with

 tree canonical iv                  :   0.18 (  7%)   0.00 (  0%)   0.19 (  7%)  6904k ( 10%)
 complete unrolling                 :   0.34 ( 14%)   0.00 (  0%)   0.34 ( 13%)  8412k ( 13%)

Comment 1 Richard Biener 2021-09-09 10:04:18 UTC

replacing the loop bound 1024 with 'b' moves the two offenders off the radar
leaving, with N == 10000

 ipa function summary               :  12.91 ( 11%)   0.00 (  0%)  12.92 ( 11%)  9800k (  0%)
 loop init                          :  11.08 (  9%)   0.12 ( 15%)  11.19 (  9%)  1677M ( 65%)
 branch prediction                  :  81.17 ( 68%)   0.05 (  6%)  81.12 ( 67%)    51M (  2%)
 TOTAL                              : 119.93          0.78        120.77         2567M

where IPA function summary might be because of the artificial testcase construction relying on inlining.

Comment 2 Richard Biener 2021-09-09 10:53:10 UTC

The unroll and ivcanon issues are all from invoking estimate_numbers_of_iterations which for each loop walks the whole loop body
(including conditionally executed parts) and on the signed IV increment
invokes infer_loop_bounds_from_signedness which performs SCEV analysis.

It's highly unlikely that there's useful info from stmts in inner loop
given we're looking for an evolution in the outer loop.

The code is also calling analyze_scalar_evolution with a loop that's not the
loop the stmt is in which might be a correctness issue.  idx_infer_loop_bounds
seems to do the correct thing here.

Possibly all SCEV analysis could be performed once and instantiated in the
outermost loop we're interested in.  That is, we'd walk the outermost loop
and fill in bounds for the whole subnest with a single IL walk.

While scalar evolutions are cached, the instantiations are not.

Comment 3 Andrew Pinski 2023-06-24 19:40:20 UTC

*** Bug 110396 has been marked as a duplicate of this bug. ***

Comment 4 Andrew Pinski 2023-06-24 19:41:49 UTC

VRP/ranger uses SCEV now so it might even be worse, the testcase from PR 110396 has that behavior too.

Comment 5 Andrew Pinski 2023-06-24 20:13:27 UTC

On the trunk with the original testcase here we get:
 tree copy headers                  :  12.16 ( 19%)   0.01 (  2%)  21.57 ( 28%)   771k (  0%)


(I suspect the rest is due to not setting release checking ...)