Bug 69609 - block reordering consumes an inordinate amount of time, REE consumes much memory
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 6.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: compile-time-hog, memory-hog
Depends on:
Blocks:
 
Reported: 2016-02-01 21:51 UTC by David Binderman
Modified: 2020-11-13 14:40 UTC (History)
4 users

See Also:
Host:
Target:
Build:
Known to work: 5.4.0, 6.0
Known to fail:
Last reconfirmed: 2016-02-02 00:00:00


Attachments
gzipped C source code (48.93 KB, application/x-gzip)
2016-02-01 21:51 UTC, David Binderman
patch (1.07 KB, patch)
2016-02-17 09:49 UTC, Richard Biener
updated patch (1.34 KB, patch)
2016-02-17 10:05 UTC, Richard Biener

Description David Binderman 2016-02-01 21:51:03 UTC
Created attachment 37545 [details]
gzipped C source code

The attached code, when compiled by trunk gcc 6.0.0,
with flag -O2, seems to take about 17 minutes.

gcc 5.3.1 seems to take about 16 minutes for the same code.

In a private email, Jeffrey Law says:
>I do note that test is spending most of its time (90%) in block
>reordering, so you might want to go ahead and file it
Comment 1 Richard Biener 2016-02-02 09:23:32 UTC
Also takes a lot of memory with GCC 4.9 at least (killed at 2GB).  GCC 5 seems to peak at ~1.6GB.

Note that with this kind of generated code I _always_ recommend -O1.  -O2 simply
has too much quadratic behavior.

The testcase is a load of indirect jumps and loops, and thus has a very twisted CFG.
I'd say the number of BBs triggers the BB-reordering slowness.

First memory peak is for RTL pre (400MB), then rest_of_handle_ud_dce (800MB),
then REE (on trunk 2.1GB!).  All basically DF issues.

It looks like we leak somewhere as well.

Thus first tracking the memory-use regression towards GCC 5 at -O2.

-O1 behaves reasonably (300MB memory use, 30sec compile time for GCC 6); the only
outlier is

 df live&initialized regs:  11.51 (39%) usr   0.08 ( 9%) sys  11.47 (37%) wall       0 kB ( 0%) ggc

well, we know DF is slow and memory hungry.
Comment 2 Bernd Schmidt 2016-02-05 01:45:51 UTC
-freorder-blocks-algorithm=simple makes the bb-reorder time go down to an insignificant fraction. Probably worth trying to see if we can determine from the CFG whether the STC algorithm is likely to be too expensive.
Comment 3 Bernd Schmidt 2016-02-05 18:18:52 UTC
Something like this, maybe?  It tries to determine whether too large a fraction of blocks is reachable only by computed jumps in a large function (which I'm guessing is what triggers the compile-time issues).

diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
index 8cbde89..10d106a 100644
--- a/gcc/bb-reorder.c
+++ b/gcc/bb-reorder.c
@@ -2435,13 +2435,39 @@ reorder_basic_blocks (void)
 {
   gcc_assert (current_ir_type () == IR_RTL_CFGLAYOUT);
 
-  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1)
+  int n_bbs = n_basic_blocks_for_fn (cfun);
+  if (n_bbs <= NUM_FIXED_BLOCKS + 1)
     return;
 
   set_edge_can_fallthru_flag ();
   mark_dfs_back_edges ();
 
-  switch (flag_reorder_blocks_algorithm)
+  enum reorder_blocks_algorithm alg = flag_reorder_blocks_algorithm;
+  /* Try to detect cases where the STC algorithm is too expensive
+     and likely to be ineffective.  */
+  if (alg == REORDER_BLOCKS_ALGORITHM_STC && n_bbs > 2000)
+    {
+      basic_block bb;
+      int abnormals = 0;
+      FOR_EACH_BB_FN (bb, cfun)
+       {
+         edge e;
+         edge_iterator ei;
+         bool all_abnormal = true;
+         FOR_EACH_EDGE (e, ei, bb->preds)
+           if (!(e->flags & EDGE_ABNORMAL))
+             {
+               all_abnormal = false;
+               break;
+             }
+         if (all_abnormal)
+           abnormals++;
+       }
+      if ((double)abnormals * 16 / n_bbs > 3)
+       alg = REORDER_BLOCKS_ALGORITHM_SIMPLE;
+    }
+
+  switch (alg)
     {
     case REORDER_BLOCKS_ALGORITHM_SIMPLE:
       reorder_basic_blocks_simple ();
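The cutoff in the patch above fires when abnormals/n_bbs exceeds 3/16, i.e. when more than ~19% of the blocks in a large function are reachable only through abnormal edges.  A standalone sketch of just that predicate (the helper name is hypothetical, invented for illustration):

```c
#include <assert.h>   /* used by checks exercising this sketch */

/* Hypothetical helper mirroring the patch's cutoff: fall back to the
   simple algorithm when the function is large (> 2000 blocks) and more
   than 3/16 (~19%) of its blocks have only abnormal predecessors.  */
static int
stc_likely_too_expensive (int abnormals, int n_bbs)
{
  return n_bbs > 2000 && (double) abnormals * 16 / n_bbs > 3;
}
```

With these thresholds, a 2048-block function trips the fallback at roughly 385 abnormal-only blocks, while small functions never do regardless of their edge mix.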
Comment 4 Jeffrey A. Law 2016-02-08 18:56:54 UTC
Note that while slow and memory hungry, this, AFAICT, isn't a gcc-6 regression.  gcc-4.9, gcc-5 and gcc-6 all have comparable behaviour.  gcc-6 does try to PRE this code, per a conscious decision to allow PRE to allocate more memory in gcc-6.
Comment 5 Richard Biener 2016-02-10 12:53:44 UTC
So a checking-enabled trunk with -O2 -fno-checking results in

 PRE                     :  25.58 ( 7%) usr   0.53 (33%) sys  26.14 ( 7%) wall     793 kB ( 0%) ggc
 reorder blocks          : 286.65 (80%) usr   0.08 ( 5%) sys 287.01 (79%) wall  432597 kB (58%) ggc
 TOTAL                 : 359.83             1.60           361.82             745954 kB

callgrind points at bb_to_key (called 4.7 million times), accounting for 55%
of all samples.

Ah, bb_to_key does FOR_EACH_EDGE on PREDs which might explain this ...

I suppose either caching the result of this loop or limiting it would fix
the slowness.  The first looks very desirable.  In fact, pre-computing
bb_to_key once looks very desirable; it looks like the result won't change
over the pass's execution?  Well, maybe .end_of_trace will.  But then
adjusting the pre-computed value of all succs when .end_of_trace changes
might be possible.
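The caching idea can be sketched on a toy model (a tiny fixed graph with invented names like `mark_end_of_trace` — this is not the actual bb-reorder.c code): compute a block's key from its predecessors once, cache it, and invalidate only the successors of a block whose end-of-trace status changes.

```c
#include <assert.h>   /* used by checks exercising this sketch */
#include <limits.h>

#define N 4                  /* toy CFG size for this sketch */
#define CACHE_EMPTY INT_MIN

static int end_of_trace[N];  /* per-block flag, as in .end_of_trace */
static int preds[N][N];      /* preds[v][u] != 0  <=>  edge u -> v */
static int cached_key[N];
static int computes;         /* counts full key recomputations */

static void
reset_cache (void)
{
  for (int v = 0; v < N; v++)
    cached_key[v] = CACHE_EMPTY;
}

/* The expensive part: scan all predecessors, in the spirit of
   bb_to_key's FOR_EACH_EDGE over bb->preds.  */
static int
compute_key (int v)
{
  computes++;
  int key = 0;
  for (int u = 0; u < N; u++)
    if (preds[v][u] && end_of_trace[u])
      key++;                 /* prefer blocks after a trace end */
  return key;
}

static int
get_key (int v)
{
  if (cached_key[v] == CACHE_EMPTY)
    cached_key[v] = compute_key (v);
  return cached_key[v];
}

/* Ending a trace at U can only change the keys of U's successors,
   so only those cache entries need invalidating.  */
static void
mark_end_of_trace (int u)
{
  end_of_trace[u] = 1;
  for (int v = 0; v < N; v++)
    if (preds[v][u])
      cached_key[v] = CACHE_EMPTY;
}
```

The point of the invalidation rule is that repeated queries of the same block cost O(1) instead of rescanning the predecessor list each time, which is exactly what blows up on a CFG where blocks have huge numbers of (abnormal) preds.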
Comment 6 Richard Biener 2016-02-10 12:57:24 UTC
Maybe sth as simple as

Index: gcc/bb-reorder.c
===================================================================
--- gcc/bb-reorder.c    (revision 233262)
+++ gcc/bb-reorder.c    (working copy)
@@ -889,6 +889,7 @@ bb_to_key (basic_block bb)
 
   /* Prefer blocks whose predecessor is an end of some trace
      or whose predecessor edge is EDGE_DFS_BACK.  */
+  if (EDGE_COUNT (bb->preds) <= 8)
   FOR_EACH_EDGE (e, ei, bb->preds)
     {
       if ((e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)

or precomputing whether a BB has abnormal preds and disabling the trace in
that case (or somehow sorting the edge vectors so that abnormal edges
come last so we can stop at the first abnormal edge).
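The edge-sorting idea can be sketched with a plain array of flag words (the `EDGE_ABNORMAL` value here is a stand-in bit for this standalone model, not GCC's actual flag layout): partition so abnormal edges sort last, and a later scan can stop at the first abnormal one.

```c
#include <assert.h>   /* used by checks exercising this sketch */
#include <stdlib.h>

#define EDGE_ABNORMAL 1  /* stand-in flag bit for this sketch */

/* qsort comparator: normal edges (flag clear) before abnormal ones.  */
static int
cmp_abnormal_last (const void *a, const void *b)
{
  int fa = *(const int *) a & EDGE_ABNORMAL;
  int fb = *(const int *) b & EDGE_ABNORMAL;
  return fa - fb;
}

/* Count normal preds; once the array is partitioned, the loop can
   break at the first abnormal edge -- that early exit is the point.  */
static int
count_normal_preds (int *flags, int n)
{
  qsort (flags, n, sizeof (int), cmp_abnormal_last);
  int count = 0;
  for (int i = 0; i < n; i++)
    {
      if (flags[i] & EDGE_ABNORMAL)
        break;
      count++;
    }
  return count;
}
```

On a block where nearly all preds are abnormal (the pathological case in this PR), the scan then terminates almost immediately instead of walking the whole edge vector.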
Comment 7 David Binderman 2016-02-17 08:52:30 UTC
(In reply to Richard Biener from comment #6)
> Maybe sth as simple as
> 
> Index: gcc/bb-reorder.c
> ===================================================================
> --- gcc/bb-reorder.c    (revision 233262)
> +++ gcc/bb-reorder.c    (working copy)
> @@ -889,6 +889,7 @@ bb_to_key (basic_block bb)
>  
>    /* Prefer blocks whose predecessor is an end of some trace
>       or whose predecessor edge is EDGE_DFS_BACK.  */
> +  if (EDGE_COUNT (bb->preds) <= 8)
>    FOR_EACH_EDGE (e, ei, bb->preds)
>      {
>        if ((e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)
> 
> or precomputing whether a BB has abnormal preds and disabling the trace in
> that case (or somehow sorting the edge vectors so that abnormal edges
> come last so we can stop at the first abnormal edge).

I tried this patch and it changes compile time from something
more than ten minutes to 1m57s.

More than five times faster looks good to me.
Comment 8 Richard Biener 2016-02-17 09:49:45 UTC
Created attachment 37718 [details]
patch

The attached patch implements a simple caching mechanism which reduces time
(checking enabled but with -fno-checking) from

 reorder blocks          : 260.53 (80%) usr   0.37 (27%) sys 261.80 (80%) wall  432597 kB (58%) ggc
 TOTAL                 : 323.98             1.38           328.14             745950 kB

to

 reorder blocks          : 135.46 (71%) usr   0.12 ( 9%) sys 135.70 (71%) wall  432597 kB (58%) ggc
 TOTAL                 : 190.60             1.29           192.10             745950 kB

I'm going to see whether bb_to_key is still the issue (the resetting of the cached
key could be made smarter, updating the value from the edge that changed).

Still the above is a reasonable improvement given we have

          FOR_EACH_EDGE (e, ei, bb->succs)
            {
              if (e == best_edge
                  || e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)
                  || bb_visited_trace (e->dest))
                continue;

              key = bb_to_key (e->dest);

where bb_to_key does FOR_EACH_EDGE (e, ei, bb->preds).

Yup.  Even with the patch bb_to_key is still taking 67% of the compile time
(according to perf).
Comment 9 Richard Biener 2016-02-17 10:05:42 UTC
Created attachment 37719 [details]
updated patch

Ok, this implements max frequency caching.

 reorder blocks          :   1.03 ( 2%) usr   0.08 ( 7%) sys   1.08 ( 2%) wall  432597 kB (58%) ggc
 TOTAL                 :  55.67             1.19            56.99             745934 kB

hah.
Comment 10 Richard Biener 2016-02-17 10:13:50 UTC
Top offender is now

 df live&initialized regs:  12.06 (21%) usr   0.00 ( 0%) sys  12.10 (21%) wall       0 kB ( 0%) ggc
 PRE                     :  15.53 (27%) usr   0.29 (26%) sys  15.83 (27%) wall     793 kB ( 0%) ggc

with perf reporting

 11.95%  cc1  cc1                [.] df_live_bb_local_compute(unsigned int)
  6.69%  cc1  cc1                [.] compute_transp(rtx_def const*, int, simple_
  4.44%  cc1  cc1                [.] df_worklist_dataflow(dataflow*, bitmap_head
  3.10%  cc1  cc1                [.] df_simulate_one_insn_forwards(basic_block_d
  2.99%  cc1  cc1                [.] bitmap_set_bit(bitmap_head*, int)

I suspect the DF part is PRE as well.  PRE already has some limiting on
CFG complexity that apparently doesn't trigger here; see
gcse_or_cprop_is_too_expensive.  OTOH we still call df_analyze before
checking that (oops?!).
Comment 11 Jeffrey A. Law 2016-02-17 14:11:21 UTC
The PRE/GCSE limits were changed between gcc-5 and gcc-6 to allow it to run on larger CFGs such as this, but without totally blowing up.  So I don't consider the PRE/GCSE bump a regression; it was a conscious decision to try harder to optimize functions with these painful CFGs.
Comment 12 Richard Biener 2016-02-17 14:58:30 UTC
Author: rguenth
Date: Wed Feb 17 14:57:58 2016
New Revision: 233498

URL: https://gcc.gnu.org/viewcvs?rev=233498&root=gcc&view=rev
Log:
2016-02-17  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/69609
	* bb-reorder.c (struct bbro_basic_block_data): Add priority member.
	(find_traces_1_round): When ending a trace update cached priority
	of successors.
	(bb_to_key): Use cached priority when available.
	(copy_bb): Initialize cached priority.
	(reorder_basic_blocks_software_trace_cache): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/bb-reorder.c
Comment 13 Richard Biener 2016-02-17 14:58:43 UTC
Fixed on trunk.  Keeping open for a backport to GCC 5.
Comment 14 Richard Biener 2016-02-24 12:56:03 UTC
Author: rguenth
Date: Wed Feb 24 12:55:31 2016
New Revision: 233661

URL: https://gcc.gnu.org/viewcvs?rev=233661&root=gcc&view=rev
Log:
2016-02-24  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2016-02-17  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/69609
	* bb-reorder.c (struct bbro_basic_block_data): Add priority member.
	(find_traces_1_round): When ending a trace update cached priority
	of successors.
	(bb_to_key): Use cached priority when available.
	(copy_bb): Initialize cached priority.
	(reorder_basic_blocks_software_trace_cache): Likewise.

	2016-02-08  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/69719
	* tree-vect-data-refs.c (vect_prune_runtime_alias_test_list):
	Properly use absolute of the difference of the two offsets to
	compare or adjust the segment length.

	* gcc.dg/torture/pr69719.c: New testcase.

	2016-02-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/69719
	* tree-vect-data-refs.c (vect_prune_runtime_alias_test_list):
	Adjust previous fix by ensuring that dr_a1 is left of dr_a2.

	2016-02-15  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/69783
	* tree-vect-data-refs.c (vect_prune_runtime_alias_test_list):
	Add trivially correct cases.

	* gcc.dg/vect/pr69783.c: New testcase.

Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.dg/torture/pr69719.c
    branches/gcc-5-branch/gcc/testsuite/gcc.dg/vect/pr69783.c
Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/bb-reorder.c
    branches/gcc-5-branch/gcc/tree-vect-data-refs.c
Comment 15 Richard Biener 2016-02-24 12:56:23 UTC
Fixed for 5.4.
Comment 16 Richard Biener 2020-11-13 14:40:24 UTC
current HEAD with -O2 -fno-checking (a checking-enabled build) shows

 callgraph ipa passes               :   6.80 ( 14%)   0.19 ( 12%)   7.00 ( 14%)    33M (  5%)
 tree FRE                           :   6.55 ( 13%)   0.05 (  3%)   6.58 ( 13%)  4146k (  1%)
 PRE                                :  13.08 ( 26%)   0.04 (  3%)  13.12 ( 26%)    13k (  0%)
 machine dep reorg                  :   4.63 (  9%)   0.00 (  0%)   4.62 (  9%)     0  (  0%)
 TOTAL                              :  49.61          1.55         51.18          706M

I looked at the tree FRE time and improved PHI hashing a bit, but we still
spend a lot of time processing PHIs.