When investigating an improvement to LIMs fill_always_executed_in I created the following testcase which creates a loop nest of depth N with conditionally executed subloops. extern void foobar (int); template <int a> struct bar { static void baz(int b, int) { if (b & (1 << (a % 32))) for (int i = 0; i < 1024; ++i) bar<a-1>::baz (b, i); } }; template <> struct bar<0> { static void baz (int, int i) { foobar (i); } }; void __attribute__((flatten)) foo(int b) { #ifndef N #define N 10 #endif bar<N>::baz (b, 0); } For N == 900 (the maximum unless you also specify -ftemplate-depth) and -O1 we see tree canonical iv : 1.42 ( 13%) 0.00 ( 0%) 1.42 ( 13%) 28M ( 13%) complete unrolling : 2.80 ( 27%) 0.00 ( 0%) 2.81 ( 26%) 42M ( 19%) integrated RA : 3.41 ( 32%) 0.32 ( 80%) 3.72 ( 34%) 640k ( 0%) TOTAL : 10.54 0.40 10.96 224M For N == 1800 and -O1 it is already tree canonical iv : 30.43 ( 28%) 0.05 ( 14%) 30.50 ( 28%) 116M ( 15%) complete unrolling : 63.96 ( 59%) 0.06 ( 17%) 64.04 ( 59%) 175M ( 22%) tree iv optimization : 5.75 ( 5%) 0.00 ( 0%) 5.77 ( 5%) 126M ( 16%) integrated RA : 1.40 ( 1%) 0.12 ( 34%) 1.53 ( 1%) 1754k ( 0%) TOTAL : 108.35 0.35 108.75 796M For reference compile-time with N == 450 is 2.5s with tree canonical iv : 0.18 ( 7%) 0.00 ( 0%) 0.19 ( 7%) 6904k ( 10%) complete unrolling : 0.34 ( 14%) 0.00 ( 0%) 0.34 ( 13%) 8412k ( 13%)
replacing the loop bound 1024 with 'b' moves the two offenders off the radar leaving, with N == 10000 ipa function summary : 12.91 ( 11%) 0.00 ( 0%) 12.92 ( 11%) 9800k ( 0%) loop init : 11.08 ( 9%) 0.12 ( 15%) 11.19 ( 9%) 1677M ( 65%) branch prediction : 81.17 ( 68%) 0.05 ( 6%) 81.12 ( 67%) 51M ( 2%) TOTAL : 119.93 0.78 120.77 2567M where IPA function summary might be because of the artificial testcase construction relying on inlining.
The unroll and ivcanon issues are all from invoking estimate_numbers_of_iterations which for each loop walks the whole loop body (including conditionally executed parts) and on the signed IV increment invokes infer_loop_bounds_from_signedness which performs SCEV analysis. It's highly unlikely that there's useful info from stmts in inner loop given we're looking for an evolution in the outer loop. The code is also calling analyze_scalar_evolution with a loop that's not the loop the stmt is in which might be a correctness issue. idx_infer_loop_bounds seems to do the correct thing here. Possibly all SCEV analysis could be performed once and instantiated in the outermost loop we're interested in. That is, we'd walk the outermost loop and fill in bounds for the whole subnest with a single IL walk. While scalar evolutions are cached, the instantiations are not.
*** Bug 110396 has been marked as a duplicate of this bug. ***
VRP/ranger uses SCEV now so it might even be worse, the testcase from PR 110396 has that behavior too.
On the trunk with the original testcase here we get: tree copy headers : 12.16 ( 19%) 0.01 ( 2%) 21.57 ( 28%) 771k ( 0%) (I suspect the rest is due to not setting release checking ...)