I just tried to compile the package fceu-0.98.25-1.7 with the GNU C++ compiler version 4.4 snapshot 20081212. The compiler appears to get into a long or possibly infinite loop. I tried compiling with flag -O2 and that took 40 seconds or so. With flag -O3, compilation did not finish in over seven minutes on an otherwise idle machine. Preprocessed source code attached. Flag -O3 required.
Created attachment 16904 [details] C source code
On 2.66GHz Cor2 Quad running Fedora 9/x86-64, gcc 4.4 revision 142740 gave: lake:pts/1[9]> time ./xgcc -B./ -O3 /tmp/bug48.c -S ./xgcc -B./ -O3 /tmp/bug48.c -S 94.39s user 0.98s system 99% cpu 1:35.51 total lake:pts/1[10]> time ./xgcc -B./ -O2 /tmp/bug48.c -S ./xgcc -B./ -O2 /tmp/bug48.c -S 11.66s user 0.22s system 99% cpu 11.877 total lake:pts/1[11]> time ./xgcc -B./ -O2 /tmp/bug48.c -S -ftree-vectorize ./xgcc -B./ -O2 /tmp/bug48.c -S -ftree-vectorize 61.16s user 0.73s system 99% cpu 1:01.92 total
(In reply to comment #2) > On 2.66GHz Cor2 Quad running Fedora 9/x86-64, gcc 4.4 revision 142740 gave: Thanks for the extra information. I am concerned that the -O3 takes nine times longer than the -O2. I'd be interested in finding out where that extra time is being spent. I've had a look at "man gcc" but nothing obvious is found.
-ftime-report will report where the extra time is happening. Anyways confirmed. loop unswitching : 308.19 (70%) usr 0.05 ( 4%) sys 308.17 (70%) wall 105 kB ( 0%) ggc
With GCC 4.8 we see loop invariant motion : 16.79 (32%) usr 0.02 ( 3%) sys 16.65 (31%) wall 148 kB ( 0%) ggc loop unswitching : 10.43 (20%) usr 0.02 ( 3%) sys 10.42 (20%) wall 0 kB ( 0%) ggc TOTAL : 52.07 0.67 53.00 363360 kB which is RTL loop optimizers.
The issue here is the DF machinery, or rather the fact that RTL invariant motion calls df_analyze () for each loop, thus void move_loop_invariants (void) { struct loop *loop; if (flag_ira_loop_pressure) { df_analyze (); regstat_init_n_sets_and_refs (); ira_set_pseudo_classes (true, dump_file); calculate_loop_reg_pressure (); regstat_free_n_sets_and_refs (); } df_set_flags (DF_EQ_NOTES + DF_DEFER_INSN_RESCAN); /* Process the loops, innermost first. */ FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) { curr_loop = loop; /* move_single_loop_invariants for very large loops is time consuming and might need a lot of memory. */ if (loop->num_nodes <= (unsigned) LOOP_INVARIANT_MAX_BBS_IN_LOOP) move_single_loop_invariants (loop); } here move_single_loop_invariants -> find_invariants -> find_defs which does static void find_defs (struct loop *loop, basic_block *body) { unsigned i; bitmap blocks = BITMAP_ALLOC (NULL); for (i = 0; i < loop->num_nodes; i++) bitmap_set_bit (blocks, body[i]->index); if (dump_file) { fprintf (dump_file, "*****starting processing of loop %d ******\n", loop->num); } df_remove_problem (df_chain); df_process_deferred_rescans (); df_chain_add_problem (DF_UD_CHAIN); df_set_blocks (blocks); df_set_flags (DF_RD_PRUNE_DEAD_DEFS); df_analyze (); ... which is excessive. It looks like the above DF stuff does not affects the whole function but only the BBs in the loop. Still the fact that we iterate from inner to outer loops still makes the DF analysis time quadratic in the loop depth. The testcase has only loops of depth 1 (but 2164 ones), so it seems that the DF restriction to a set of BBs does not work as expected? From the profile I see that the most time spent in df_analyze is inverted_post_order_compute and post_order_compute (obviosly not region aware!) and then bitmap_set_bit and df_prune_to_subcfg. I wonder if it isn't possible to either update the DF during the transform, deal with partly out-of-date DF info (like process loop siblings without re-calculating DF, and re-compute whole FN DF after such iteration).
Fixing this DF issue get's us down to loop invariant motion : 0.14 ( 0%) usr 0.01 ( 1%) sys 0.14 ( 0%) wall 148 kB ( 0%) ggc loop unswitching : 15.44 (30%) usr 0.02 ( 3%) sys 15.48 (29%) wall 0 kB ( 0%) ggc TOTAL : 52.26 0.73 52.99 456534 kB so finally a RTL unswitching slowness ;)
(In reply to Richard Biener from comment #7) > Fixing this DF issue get's us down to > > loop invariant motion : 0.14 ( 0%) usr 0.01 ( 1%) sys 0.14 ( 0%) > wall 148 kB ( 0%) ggc > loop unswitching : 15.44 (30%) usr 0.02 ( 3%) sys 15.48 (29%) > wall 0 kB ( 0%) ggc > TOTAL : 52.26 0.73 52.99 > 456534 kB > > so finally a RTL unswitching slowness ;) Fixing the same issue in loop-iv.c fixes that, too. loop invariant motion : 0.18 ( 0%) usr 0.00 ( 0%) sys 0.16 ( 0%) wall 148 kB ( 0%) ggc loop unswitching : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc TOTAL : 37.14 0.67 37.80 456534 kB
Mine.
Author: rguenth Date: Fri Jan 17 10:47:59 2014 New Revision: 206702 URL: http://gcc.gnu.org/viewcvs?rev=206702&root=gcc&view=rev Log: 2014-01-17 Richard Biener <rguenther@suse.de> PR rtl-optimization/38518 * df.h (df_analyze_loop): Declare. * df-core.c: Include cfgloop.h. (df_analyze_1): Split out main part of df_analyze. (df_analyze): Adjust. (loop_inverted_post_order_compute): New function. (loop_post_order_compute): Likewise. (df_analyze_loop): New function avoiding whole-function postorder computes. * loop-invariant.c (find_defs): Use df_analyze_loop. (find_invariants): Adjust. * loop-iv.c (iv_analysis_loop_init): Use df_analyze_loop. Modified: trunk/gcc/ChangeLog trunk/gcc/df-core.c trunk/gcc/df.h trunk/gcc/loop-invariant.c trunk/gcc/loop-iv.c
With 4.9.0 -O1: 5s -O2: 7s -O3: 41s the profile at -O3 is mostly flat, worst offender is LRA hard-reg assignment which may also contribute to the high DF time. Execution times (seconds) phase opt and generate : 41.00 (100%) usr 0.68 (94%) sys 41.75 (100%) wall 451991 kB (99%) ggc df reaching defs : 5.87 (14%) usr 0.03 ( 4%) sys 5.82 (14%) wall 0 kB ( 0%) ggc dominance frontiers : 2.52 ( 6%) usr 0.00 ( 0%) sys 2.51 ( 6%) wall 0 kB ( 0%) ggc LRA hard reg assignment : 4.94 (12%) usr 0.04 ( 6%) sys 4.96 (12%) wall 0 kB ( 0%) ggc TOTAL : 41.05 0.72 41.84 456534 kB It's now a RA issue, I'm out of here (though I'd even consider closing this as fixed - I've added the testcase to daily monitoring).
With recent gcc: no -O flag: 2 seconds -O2: 24 seconds -O3: 247 seconds I'll have a look at -ftime-report.
Also look at gcc.opensuse.org/ in the c++bench "random" part. At -O2 it shows df live&initialized regs: 12.12 (31%) usr backwards jump threading: 6.44 (17%) usr the backwards jump threading thing is new. Jeff? At -O3 it runs OOM on vangelis, czerny reports 3.9GB memory use and besides the above: load CSE after reload : 120.58 (70%) usr I try to add most "interesting" compile-time/memory-use testcases to the C++ tester "random" part so we can track time-report / memory usage issues there. I think I've seen the load CSE after reload issue elsewhere, too.
Actually CCing Jeff now -- see last comment, backward jump-threading compile-time hog at -O2.
(In reply to David Binderman from comment #12) > With recent gcc: > > no -O flag: 2 seconds > > -O2: 24 seconds > > -O3: 247 seconds Those numbers for a version of trunk gcc with lots of checking. Without checking, I get gcc trunk with -O3 in 4 mins 56 seconds and release gcc 7.1.1 in 3 minutes 38 seconds, so it still looks like it's going backwards. From -ftime-report on trunk gcc with -O3: $ grep "([0-9][0-9]\%) usr" report.out df live&initialized regs: 45.04 (16%) usr 0.06 ( 2%) sys 45.74 (15%) wall 0 kB ( 0%) ggc load CSE after reload : 174.76 (60%) usr 0.07 ( 2%) sys 176.23 (60%) wall 10 kB ( 0%) ggc $ So I think that confirms what Richard is seeing.
> callgraph functions expansion : 38.69 ( 98%) 0.88 ( 83%) 39.60 ( 97%) 431M ( 93%) I know the above is a catch all though. At least one issue is this for the trunk: machine dep reorg : 7.23 ( 18%) 0.00 ( 0%) 7.23 ( 18%) 0 ( 0%) This is with checking enabled but with -fno-checking. So the x86 maintainer should look into why for that. Note the trunk with -O2 -fno-checking (compiled with --enable-checking=yes) is faster than 7.3 compiled with --enable-checking=release :). Nothing else stands out for the trunk even though -O3 is 39s while -O2 is 8s.