Created attachment 37545 [details]
gzipped C source code

The attached code, when compiled by trunk gcc 6.0.0 with -O2, takes about 17 minutes. gcc 5.3.1 takes about 16 minutes for the same code.

In a private email, Jeffrey Law says:

>I do note that test is spending most of its time (90%) in block
>reordering, so you might want to go ahead and file it
Also takes a lot of memory with GCC 4.9 at least (killed at 2GB). GCC 5 seems to peak at ~1.6GB.

Note that with this kind of generated code I _always_ recommend -O1. -O2 simply has too many quadraticnesses. This case is a load of indirect jumps and loops and thus a very twisted CFG. I'd say the number of BBs hits the BB-reordering slowness.

The first memory peak is for RTL PRE (400MB), then rest_of_handle_ud_dce (800MB), then REE (on trunk 2.1GB!). All basically DF issues. It looks like we leak somewhere as well. Thus first tracking the memory-use regression towards GCC 5 at -O2.

-O1 behaves reasonably (300MB memory use, 30s compile time for GCC 6); the only outlier is

 df live&initialized regs:  11.51 (39%) usr   0.08 ( 9%) sys  11.47 (37%) wall       0 kB ( 0%) ggc

Well, we know DF is slow and memory hungry.
-freorder-blocks-algorithm=simple makes the bb-reorder time go down to an insignificant fraction. Probably worth trying to see if we can determine from the CFG whether the STC algorithm is likely to be too expensive.
Something like this, maybe? It tries to determine whether too large a fraction of blocks are only reachable by computed jumps in a large function (which I'm guessing is what triggers the compile-time issues).

diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
index 8cbde89..10d106a 100644
--- a/gcc/bb-reorder.c
+++ b/gcc/bb-reorder.c
@@ -2435,13 +2435,39 @@ reorder_basic_blocks (void)
 {
   gcc_assert (current_ir_type () == IR_RTL_CFGLAYOUT);
 
-  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1)
+  int n_bbs = n_basic_blocks_for_fn (cfun);
+  if (n_bbs <= NUM_FIXED_BLOCKS + 1)
     return;
 
   set_edge_can_fallthru_flag ();
   mark_dfs_back_edges ();
 
-  switch (flag_reorder_blocks_algorithm)
+  enum reorder_blocks_algorithm alg = flag_reorder_blocks_algorithm;
+
+  /* Try to detect cases where the STC algorithm is too expensive
+     and likely to be ineffective.  */
+  if (alg == REORDER_BLOCKS_ALGORITHM_STC && n_bbs > 2000)
+    {
+      basic_block bb;
+      int abnormals = 0;
+      FOR_EACH_BB_FN (bb, cfun)
+	{
+	  edge e;
+	  edge_iterator ei;
+	  bool all_abnormal = true;
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    if (!(e->flags & EDGE_ABNORMAL))
+	      {
+		all_abnormal = false;
+		break;
+	      }
+	  if (all_abnormal)
+	    abnormals++;
+	}
+      if ((double) abnormals * 16 / n_bbs > 3)
+	alg = REORDER_BLOCKS_ALGORITHM_SIMPLE;
+    }
+
+  switch (alg)
     {
     case REORDER_BLOCKS_ALGORITHM_SIMPLE:
       reorder_basic_blocks_simple ();
Note that while slow and memory hungry, this, AFAICT, isn't a gcc-6 regression: gcc-4.9, gcc-5 and gcc-6 all show comparable behaviour. gcc-6 does try to PRE this code, per a conscious decision to allow PRE to allocate more memory in gcc-6.
So a checking-enabled trunk with -O2 -fno-checking results in

 PRE                     :  25.58 ( 7%) usr   0.53 (33%) sys  26.14 ( 7%) wall     793 kB ( 0%) ggc
 reorder blocks          : 286.65 (80%) usr   0.08 ( 5%) sys 287.01 (79%) wall  432597 kB (58%) ggc
 TOTAL                   : 359.83          1.60         361.82             745954 kB

callgrind points at bb_to_key (called 4.7 million times), accounting for 55% of all samples. Ah, bb_to_key does FOR_EACH_EDGE on the preds, which might explain this...

I suppose either caching the result of this loop or limiting it would fix the slowness. The first looks very desirable. In fact, pre-computing bb_to_key once looks very desirable; it looks like the result won't change over the pass execution? Well, maybe .end_of_trace will. But then adjusting the pre-computed value of all succs when adjusting .end_of_trace might be possible.
Maybe something as simple as

Index: gcc/bb-reorder.c
===================================================================
--- gcc/bb-reorder.c	(revision 233262)
+++ gcc/bb-reorder.c	(working copy)
@@ -889,6 +889,7 @@ bb_to_key (basic_block bb)
 
   /* Prefer blocks whose predecessor is an end of some trace
      or whose predecessor edge is EDGE_DFS_BACK.  */
+  if (EDGE_COUNT (bb->preds) <= 8)
   FOR_EACH_EDGE (e, ei, bb->preds)
     {
       if ((e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)

or precomputing whether a BB has abnormal preds and disabling the trace in that case (or somehow sorting the edge vectors so that abnormal edges come last so we can stop at the first abnormal edge).
(In reply to Richard Biener from comment #6)
> Maybe sth as simple as
>
> Index: gcc/bb-reorder.c
> ===================================================================
> --- gcc/bb-reorder.c	(revision 233262)
> +++ gcc/bb-reorder.c	(working copy)
> @@ -889,6 +889,7 @@ bb_to_key (basic_block bb)
>
>    /* Prefer blocks whose predecessor is an end of some trace
>       or whose predecessor edge is EDGE_DFS_BACK.  */
> +  if (EDGE_COUNT (bb->preds) <= 8)
>    FOR_EACH_EDGE (e, ei, bb->preds)
>      {
>        if ((e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)
>
> or precomputing whether a BB has abnormal preds and disabling the trace in
> that case (or somehow sorting the edge vectors so that abnormal edges
> come last so we can stop at the first abnormal edge).

I tried this patch and it changes the compile time from something more than ten minutes to 1m57s. More than five times faster looks good to me.
Created attachment 37718 [details]
patch

The attached patch implements a simple caching mechanism which reduces the time (checking enabled but with -fno-checking) from

 reorder blocks          : 260.53 (80%) usr   0.37 (27%) sys 261.80 (80%) wall  432597 kB (58%) ggc
 TOTAL                   : 323.98          1.38         328.14             745950 kB

to

 reorder blocks          : 135.46 (71%) usr   0.12 ( 9%) sys 135.70 (71%) wall  432597 kB (58%) ggc
 TOTAL                   : 190.60          1.29         192.10             745950 kB

I'm going to see whether bb_to_key is still the issue (the resetting of the cached key could be made smarter, updating the value from the edge that changed). Still, the above is a reasonable improvement given we have

  FOR_EACH_EDGE (e, ei, bb->succs)
    {
      if (e == best_edge
	  || e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)
	  || bb_visited_trace (e->dest))
	continue;

      key = bb_to_key (e->dest);

where bb_to_key does FOR_EACH_EDGE (e, ei, bb->preds).

Yup. Even with the patch, bb_to_key is still taking 67% of the compile time (according to perf).
Created attachment 37719 [details]
updated patch

OK, this implements max-frequency caching.

 reorder blocks          :   1.03 ( 2%) usr   0.08 ( 7%) sys   1.08 ( 2%) wall  432597 kB (58%) ggc
 TOTAL                   :  55.67          1.19          56.99             745934 kB

hah.
The top offender is now

 df live&initialized regs:  12.06 (21%) usr   0.00 ( 0%) sys  12.10 (21%) wall       0 kB ( 0%) ggc
 PRE                     :  15.53 (27%) usr   0.29 (26%) sys  15.83 (27%) wall     793 kB ( 0%) ggc

with perf reporting

 11.95%  cc1  cc1  [.] df_live_bb_local_compute(unsigned int)
  6.69%  cc1  cc1  [.] compute_transp(rtx_def const*, int, simple_
  4.44%  cc1  cc1  [.] df_worklist_dataflow(dataflow*, bitmap_head
  3.10%  cc1  cc1  [.] df_simulate_one_insn_forwards(basic_block_d
  2.99%  cc1  cc1  [.] bitmap_set_bit(bitmap_head*, int)

I suspect the DF part is PRE as well. PRE already has some limiting on CFG complexity that apparently doesn't trigger here; see gcse_or_cprop_is_too_expensive. OTOH we still call df_analyze before checking that (oops?!).
The PRE/GCSE limits were changed between gcc-5 and gcc-6 to allow it to run on larger CFGs such as this, but without totally blowing up. So I don't consider the PRE/GCSE bump a regression; it was a conscious decision to try harder to optimize functions with these painful CFGs.
Author: rguenth
Date: Wed Feb 17 14:57:58 2016
New Revision: 233498

URL: https://gcc.gnu.org/viewcvs?rev=233498&root=gcc&view=rev
Log:
2016-02-17  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/69609
	* bb-reorder.c (struct bbro_basic_block_data): Add priority member.
	(find_traces_1_round): When ending a trace update cached priority
	of successors.
	(bb_to_key): Use cached priority when available.
	(copy_bb): Initialize cached priority.
	(reorder_basic_blocks_software_trace_cache): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/bb-reorder.c
Fixed on trunk. Keeping open for a backport to GCC 5.
Author: rguenth
Date: Wed Feb 24 12:55:31 2016
New Revision: 233661

URL: https://gcc.gnu.org/viewcvs?rev=233661&root=gcc&view=rev
Log:
2016-02-24  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2016-02-17  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/69609
	* bb-reorder.c (struct bbro_basic_block_data): Add priority member.
	(find_traces_1_round): When ending a trace update cached priority
	of successors.
	(bb_to_key): Use cached priority when available.
	(copy_bb): Initialize cached priority.
	(reorder_basic_blocks_software_trace_cache): Likewise.

	2016-02-08  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/69719
	* tree-vect-data-refs.c (vect_prune_runtime_alias_test_list):
	Properly use absolute of the difference of the two offsets to
	compare or adjust the segment length.

	* gcc.dg/torture/pr69719.c: New testcase.

	2016-02-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/69719
	* tree-vect-data-refs.c (vect_prune_runtime_alias_test_list):
	Adjust previous fix by ensuring that dr_a1 is left of dr_a2.

	2016-02-15  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/69783
	* tree-vect-data-refs.c (vect_prune_runtime_alias_test_list):
	Add trivially correct cases.

	* gcc.dg/vect/pr69783.c: New testcase.

Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.dg/torture/pr69719.c
    branches/gcc-5-branch/gcc/testsuite/gcc.dg/vect/pr69783.c
Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/bb-reorder.c
    branches/gcc-5-branch/gcc/tree-vect-data-refs.c
Fixed for 5.4.
Current HEAD with -O2 -fno-checking (a checking-enabled build) shows

 callgraph ipa passes    :   6.80 ( 14%)   0.19 ( 12%)   7.00 ( 14%)    33M (  5%) tree
 FRE                     :   6.55 ( 13%)   0.05 (  3%)   6.58 ( 13%)  4146k (  1%)
 PRE                     :  13.08 ( 26%)   0.04 (  3%)  13.12 ( 26%)    13k (  0%)
 machine dep reorg       :   4.63 (  9%)   0.00 (  0%)   4.62 (  9%)     0  (  0%)
 TOTAL                   :  49.61          1.55          51.18           706M

I looked at the tree FRE time and improved PHI hashing a bit, but we still spend a lot of time in processing PHIs.