Bug 56113

Summary: out of memory when compiling a function with many goto labels (50k > )
Product: gcc Reporter: Kangkook <aixer77>
Component: middle-endAssignee: Richard Biener <rguenth>
Status: ASSIGNED ---    
Severity: normal CC: jsm28, rguenth, steven
Priority: P3 Keywords: alias, compile-time-hog, memory-hog
Version: 4.6.3   
Target Milestone: ---   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2013-01-27 00:00:00
Bug Depends on: 54627    
Bug Blocks: 47344    
Attachments: a script to generates the code that reproduces the bug
patch
kill dominator queries from domwalk

Description Kangkook 2013-01-25 20:36:39 UTC
Out of memory when compiling a function with many goto labels (50k > )

gcc 4.6.3 compiler(running on top of ubuntu 12.04 32-bit) failed to output an
object for a function when it comes with extremely large number of goto
labels.

Attached is a simple script that generates the test source file. Here's how to
reproduce a failure with the script.

~$./goto_gen.py 60000 t.c

~$ time cc -O1 -o t t.c

cc1: out of memory allocating 4064 bytes after a total of 2309808128 bytes

real	11m44.371s
user	11m42.592s
sys	0m1.876s

~$ time cc -O0 -o t t.c

real	22m5.106s
user	22m3.539s
sys	0m1.640s


As you can see from the above, -O1 option trigger the problem while -O0
doesn't.

Could you anyone tell me how to workaround this situation? Or which specific
optimization option causes the issue?

Thanks a lot for your help in advance.

Regards, Kangkook
Comment 1 Kangkook 2013-01-25 20:39:06 UTC
Created attachment 29276 [details]
a script to generates the code that reproduces the bug
Comment 2 Richard Biener 2013-01-26 15:29:30 UTC
Please try GCC 4.8 which has _lots_ of scalability patches to address this
kind of issues.

Note that a 32bit host is not guaranteed to be able to compile arbitrary
large functions.
Comment 3 Kangkook 2013-01-26 15:40:43 UTC
Hi, Richard

Thanks a lot for your advice. I will definitely try gcc 4.8 and let you know about the result.
Btw, I also tested it from the 64-bit env. but it also crashed the compiler after using up 8GB memories.

Thanks!
Comment 4 Steven Bosscher 2013-01-27 00:16:35 UTC
I do not believe this kind of test case can ever be compiled by any
reasonable compiler. The problem is not the labels, but the computed
gotos. This causes an exceptionally dense CFG with N_LABELS**2 edges.
GCC tries to factor these computed gotos early on in the passes
pipeline but expands them near the end of the passes pipeline. The
function is still simply too large to handle. But anyway, let's all
pretend for a moment that this test case is sane...


On powerpc64-unknown-linux-gnu, with GCC 4.8 trunk r195103, compile
time grows nicely linear with the number of computed gotos in the
function, but memory usage appears to grow quadratically:

$ for n in 1000 2000 3000 4000 5000 10000 20000 40000 ; do
>     ./goto_gen.py ${n} t.c
>    ~/bin/compacttime ./cc1 -quiet -O1 t.c
> done
DBG: 1000
29.48user 0.39system 0:29.88elapsed 99%CPU
(0avgtext+0avgdata 172352maxresident)k
DBG: 2000
70.07user 0.56system 1:10.64elapsed 99%CPU
(0avgtext+0avgdata 386496maxresident)k
DBG: 3000
137.31user 1.45system 2:18.80elapsed 99%CPU
(0avgtext+0avgdata 664960maxresident)k
DBG: 4000
190.45user 2.43system 3:12.92elapsed 99%CPU
(0avgtext+0avgdata 1050560maxresident)k
DBG: 5000
318.18user 2.90system 5:21.16elapsed 99%CPU
(0avgtext+0avgdata 629760maxresident)k
DBG: 10000
1374.55user 13.03system 23:07.66elapsed 99%CPU
(0avgtext+0avgdata 1701120maxresident)k
DBG: 20000
3587.06user 39.71system 1:00:27.19elapsed 99%CPU
(0avgtext+0avgdata 6343296maxresident)k
DBG: 40000
(...still running... top says:
 4567 stevenb   20   0 23.4g  23g  17m R 100.1 37.6  26:20.65 cc1
so it's good this machine has 64GB plus some swap -- it might just
make it all the way through to the end, some day... :-)

(Above 5000 computed gotos in the test case, there appears to be a
tipping point where some transformation decides the function is
too big to handle, causing memory usage to actually drop at that
point. I'm guessing that's IRA's conflict table toggle, but I'm
not sure.)


From the point of view of compile time, this is not great but not
unreasonable either. According to -ftime-report, there are no passes
that stand out as not scaling well for compile time of this test
case. Everything looks reasonable and as expected: parsing, gimplify,
expand, combine, and IRA top the GC memory lists (being IR rewriting
passes), and top time consumers for n=20000 are:

 tree PTA                :1093.69 (30%) usr  6250 kB ( 0%) ggc
 tree SSA incremental    : 858.29 (24%) usr     0 kB ( 0%) ggc
 forward prop            : 316.48 ( 9%) usr 64614 kB ( 4%) ggc


The quadratic increase of the memory footprint is a problem, though...

The test case with 20000 computed gotos has 360025 lines of code, which
expand to ~10 instructions per line of code. Memory peaks at ~6.3GB.
That doesn't look reasonable if you do a bit of math:

* There are ~3 gimple_statement_with_memory_ops per input line, with
typically 3 operands. These are 128 bytes, consuming an estimated
memory of 3*360000*128=131MB. Other than that, there are a many small
PHIs (and a few large PHIs for the factored computed goto and for the
return statement) that consume a significant amount of memory.
Let's assume we need about 6 times the 131MB to to account for all stmt
operands and for PHI nodes. That's ~1GB total for the function body.

* Everything else is relatively small. Per computed goto there are
initially 7 basic blocks and 8 edges, and 8 label_decls.  On a 64bit
host, sizeof(struct basic_block_def)=108, sizeof(struct edge_def)=56,
and sizeof(struct tree_label_decl)=128.
With alignment that's 128, 64, and 128.
That's 20000*(7*128+8*64+8*128)=46MB for the CFG and labels.
Nothing else of any significance lives in GC memory early in the
compilation pipeline.

* Function bodies as RTL tend to be much smaller than GIMPLE, accounting
for maybe 200MB or so in this case (1/5th of the size of the GIMPLE body
is typical for pre-processed GCC itself).

* The above estimates are supported by -ftime-report's total allocated
GC memory counter: 1476435kB. That means the other 4.8GB is allocated
in non-GC space, and that non-GC memory dominates the memory footprint.


For n=10000, max. GC=741582kB, and max. resident is 1.7GB.  So here,
too, ~1GB of on-GC memory dominates the memory footprint.  For n=40000
peak memory is ~23.4GB so far. So tabulated:

n      max. mem    max. GC mem    max. non-GC mem
10000  1.7GB        741582kB       ~1GB
20000  6.3GB       1476435kB       ~4.8GB
40000  23.4GB     ~2900000kB       ~20.5GB

A compiler built with --enable-gather-detailed-mem-stats should be able
to tell where and for what that memory is allocated. The peak memory
usage happens pretty early in the compilation process, so I'm guessing
the memory is allocated for PTA.
Comment 5 Steven Bosscher 2013-01-27 11:23:51 UTC
For completeness, n=40000:
compile time 17531.37s
max. GC memory: 3001361
peak memory: 23.4GB

Big spenders:
 forward prop            :1757.74 (10%) usr  135302 kB ( 5%) ggc
 tree SSA incremental    :3972.18 (23%) usr       0 kB ( 0%) ggc
 tree PTA                :6274.46 (36%) usr   12500 kB ( 0%) ggc
Comment 6 Steven Bosscher 2013-01-27 13:27:04 UTC
I've collected the ratios of the --enable-gather-detailed-memory-stats
output for bitmaps, for test cases with n=1000 and n=8000.

Worst memory non-linear behavior is in the following functions:

Bitmap                                          Alloc   Peak    Leak
------------------------------------------------------------------------
tree-ssa-loop-manip.c:248 (compute_live_l       75.49   7.87    0.00
cfganal.c:1167 (compute_idf)                    72.62   7.69    0.00
tree-ssa-structalias.c:2104 (label_visit)       60.73   60.73   60.73
df-problems.c:554 (df_rd_transfer_functio       46.35   6.90    0.00
df-problems.c:233 (df_rd_alloc)                 18.57   18.57   18.57
df-problems.c:525 (df_rd_transfer_functio       12.97   9.55    9.60
df-problems.c:2113 (df_chain_create_bb)         11.62   8.43    0.00
df-problems.c:413 (df_rd_local_compute)         9.91    1.00    0.00
df-problems.c:230 (df_rd_alloc)                 9.61    9.61    9.61
df-problems.c:232 (df_rd_alloc)                 9.61    9.61    9.61


The only one of the above there the amount of memory allocated is really
significant, is tree-ssa-structalias.c:2104 (label_visit):

n= 1000 ->   27055768 bytes allocated (of total  111411120, 24%)
n= 2000 ->  104669648 bytes allocated (of total  293048432. 36%)
n= 4000 ->  415898848 bytes allocated (of total  768746792, 54%)
n= 8000 -> 1642995248 bytes allocated (of total 2556626064, 64%)
n=16000 -> 6549989248 bytes allocated (of total 8780662936, 75%)
Comment 7 Richard Biener 2013-01-28 09:45:06 UTC
label_visit () seems to collect recursively points_to bits over the predecessor
graph, thus using a quadratic amount of memory.  It does so to optimize
variables that point to the same stuff by assigning them the same
pointer_label.

"indirect" nodes seem to be special as far as I can see as they will never
share the same label with a previously visited node.  We should be able
to represent their points_to set by themselves and not miss any equivalences
nor create false ones by that.  Thus:

Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c  (revision 195502)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -2103,6 +2103,17 @@ label_visit (constraint_graph_t graph, s
   if (!graph->points_to[n])
     graph->points_to[n] = BITMAP_ALLOC (&predbitmap_obstack);
 
+  /* Indirect nodes get fresh variables.  Represent those by that
+     single fresh variable.  */
+  if (!bitmap_bit_p (graph->direct_nodes, n))
+    {
+      bitmap_set_bit (graph->points_to[n], FIRST_REF_NODE + n);
+      graph->pointer_label[n] = pointer_equiv_class++;
+      equiv_class_add (pointer_equiv_class_table,
+                      graph->pointer_label[n], graph->points_to[n]);
+      return;
+    }
+
   /* Label and union our incoming edges's points to sets.  */
   EXECUTE_IF_IN_NONNULL_BITMAP (graph->preds[n], 0, i, bi)
     {
@@ -2117,9 +2128,6 @@ label_visit (constraint_graph_t graph, s
       if (graph->points_to[w])
        bitmap_ior_into(graph->points_to[n], graph->points_to[w]);
     }
-  /* Indirect nodes get fresh variables.  */
-  if (!bitmap_bit_p (graph->direct_nodes, n))
-    bitmap_set_bit (graph->points_to[n], FIRST_REF_NODE + n);
 
   if (!bitmap_empty_p (graph->points_to[n]))
     {

not sure if it helps in this case though.  Assigning pointer_labels is
an optimization, and could be completely short-cut if necessary (not
sure what the result would be for this testcase, or how we could up-front
detect if performing this equivalencing is worthwhile or not).
One may also think of using the pointer_labels of the incoming edges
to perform the unioning instead, thus cache by { set of pointer labels },
 { points-to set } instead of by expanded points-to set only.

The algorithm is definitely poor in its memory usage ...
Comment 8 Richard Biener 2013-01-28 10:04:46 UTC
Moving ->points_to to a separate obstack might also help (performing label_visit
in topological order we could then free ->points_to once we have visited
all successors of a node).
Comment 9 Steven Bosscher 2013-01-28 23:34:36 UTC
(In reply to comment #7)

With the patch from comment #7:

n=1000 6.18user 254976k maxresident
n=2000 16.76user 509184k maxresident
n=4000 54.23user 1432576l maxresident
n=8000 201.31user 1343296k maxresident

Without:
n=1000 6.45user 255616k maxresident
n=2000 17.65user 572096k maxresident
n=4000 55.10user 1485184k maxresident
n=8000 199.49user 1456000k maxresident

So there appears to be some small effect on memory footprint but
nothing to get excited about, and no effect on compile time.
Comment 10 rguenther@suse.de 2013-01-29 09:52:12 UTC
On Mon, 28 Jan 2013, steven at gcc dot gnu.org wrote:

> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56113
> 
> --- Comment #9 from Steven Bosscher <steven at gcc dot gnu.org> 2013-01-28 23:34:36 UTC ---
> (In reply to comment #7)
> 
> With the patch from comment #7:
> 
> n=1000 6.18user 254976k maxresident
> n=2000 16.76user 509184k maxresident
> n=4000 54.23user 1432576l maxresident
> n=8000 201.31user 1343296k maxresident
> 
> Without:
> n=1000 6.45user 255616k maxresident
> n=2000 17.65user 572096k maxresident
> n=4000 55.10user 1485184k maxresident
> n=8000 199.49user 1456000k maxresident
> 
> So there appears to be some small effect on memory footprint but
> nothing to get excited about, and no effect on compile time.

Yeah, I sort-of expected that ... the following should help more
(apart from the fact that we do not optimize the visiting order
to minimize the number of life bitmaps).

I was thinking of sth like the following - but this of course
leaves the equiv hashtable pointing to freed bitmaps ...
As finding all equivalences of  U recursive-preds (n)  for all
nodes n is the goal there doesn't seem to be a good way of
avoiding the excessive memory use (we can free those that get
their pointer_label shared - but that should be the minority).

I'm out of ideas - apart from killing this whole unification
on the base that it cannot be very effective.

Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c  (revision 195530)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -507,6 +507,10 @@ struct constraint_graph
   /* Explicit predecessors of each node (Used for variable substitution).  
*/
   bitmap *preds;
 
+  /* Number of nodes this is in their preds (used to track lifetime of
+     points_to).  */
+  unsigned *n_pred_of;
+
   /* Indirect cycle representatives, or -1 if the node has no indirect
      cycles.  */
   int *indirect_cycles;
@@ -2112,10 +2116,14 @@ label_visit (constraint_graph_t graph, s
 
       /* Skip unused edges  */
       if (w == n || graph->pointer_label[w] == 0)
-       continue;
-
-      if (graph->points_to[w])
+       ;
+      else
        bitmap_ior_into(graph->points_to[n], graph->points_to[w]);
+
+      /* If we were the last unvisited successor of w free
+        its points-to set.  */
+      if (--graph->n_pred_of[w] == 0 && w != n)
+       BITMAP_FREE (graph->points_to[w]);
     }
   /* Indirect nodes get fresh variables.  */
   if (!bitmap_bit_p (graph->direct_nodes, n))
@@ -2133,6 +2141,10 @@ label_visit (constraint_graph_t graph, s
        }
       graph->pointer_label[n] = label;
     }
+
+  /* If n has itself as predecessor we delayed freeing points_to.  */
+  if (graph->n_pred_of[n] == 0)
+    BITMAP_FREE (graph->points_to[n]);
 }
 
 /* Perform offline variable substitution, discovering equivalence
@@ -2159,12 +2171,26 @@ perform_var_substitution (constraint_gra
     if (!bitmap_bit_p (si->visited, si->node_mapping[i]))
       condense_visit (graph, si, si->node_mapping[i]);
 
+  /* Count the number of nodes a node is predecessor of.  */
+  graph->n_pred_of = XCNEWVEC (unsigned, graph->size);
+  for (i = 0; i < graph->size; i++)
+    {
+      bitmap_iterator bi;
+      unsigned int j;
+      if (si->node_mapping[i] != i)
+       continue;
+      EXECUTE_IF_IN_NONNULL_BITMAP (graph->preds[i], 0, j, bi)
+       graph->n_pred_of[si->node_mapping[j]]++;
+    }
+
   bitmap_clear (si->visited);
   /* Actually the label the nodes for pointer equivalences  */
   for (i = 0; i < FIRST_REF_NODE; i++)
     if (!bitmap_bit_p (si->visited, si->node_mapping[i]))
       label_visit (graph, si, si->node_mapping[i]);
 
+  free (graph->n_pred_of);
+
   /* Calculate location equivalence labels.  */
   for (i = 0; i < FIRST_REF_NODE; i++)
     {
Comment 11 Richard Biener 2013-01-29 13:06:48 UTC
Created attachment 29300 [details]
patch

n=100    0.97user   83232maxresident
n=1000   9.76user  260800maxresident
n=2000  23.14user  468096maxresident
n=4000  62.08user  890768maxresident
n=8000 189.13user 1722624maxresident

looks reasonably linear (double-checking now with mem-stats, not sure if
that will show reduced memory usage - maybe for the obstack peak use).

release checking with mem-stats numbers, and not -O0 compiler:

unpatched: n=10000 127.82user 7755408maxresident (top shows 1.8GB memory usage very quickly)
patched:   n=10000 126.42user 3400032maxresident (top shows max. 340MB memory usage for a very long time, later peaks at ~800MB)

unpatched:
tree-ssa-structalias.c:2105 (label_visit)   150024      2554542888      2554542888      2554542888
patched:
tree-ssa-structalias.c:2114 (label_visit)   150024      2554542888         1475192         1440608          0          0

The patch is of course lame - it only reduces the amount of memory used
if there are a lot of pointer equivalences.  But at least it's very cheap
to do so.
Comment 12 Richard Biener 2013-01-29 14:22:56 UTC
Author: rguenth
Date: Tue Jan 29 14:22:47 2013
New Revision: 195541

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195541
Log:
2013-01-29  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56113
	* tree-ssa-structalias.c (equiv_class_lookup): Also return
	the bitmap leader.
	(label_visit): Free duplicate bitmaps and record the leader instead.
	(perform_var_substitution): Adjust.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
Comment 13 Richard Biener 2013-01-29 14:24:01 UTC
Author: rguenth
Date: Tue Jan 29 14:23:48 2013
New Revision: 195542

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195542
Log:
2013-01-29  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56113
	* tree-ssa-structalias.c (equiv_class_lookup): Also return
	the bitmap leader.
	(label_visit): Free duplicate bitmaps and record the leader instead.
	(perform_var_substitution): Adjust.

Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/tree-ssa-structalias.c
Comment 14 Richard Biener 2013-01-29 15:38:24 UTC
Run till exit from #0  update_label_decls (scope=0x7f32004f6c60)
    at /space/rguenther/src/svn/trunk/gcc/c/c-decl.c:1023

takes a long time for n = 50000.  Seems to be quadratic (# of scopes
times # of gotos).

Re-writing the function into SSA takes a surprisingly large amount of
memory and compile-time as well.  I suppose using a special memory variable
to factor the goto (one without virtual operands) would make this cheaper.
Basically avoid going into SSA for it (avoid the gigantic PHI).
Similar rewrite_into_loop_closed_ssa is slow.

Next is PTA, but we know that already, the amount of extra memory used
is reasonable now.  Simple compile-time optimizations seem to be possible
(we do a lot of redundant work in find_what_var_points_to by not caching
the result for the representative).

I have a patch to reduce (n = 10000)

 tree PTA                :  25.37 (24%) usr   0.03 ( 3%) sys  25.44 (24%) wall    3125 kB ( 1%) ggc

to

 tree PTA                :   5.06 ( 6%) usr   0.02 ( 2%) sys   5.09 ( 6%) wall     937 kB ( 0%) ggc

leaves

 tree SSA incremental    :  34.14 (39%) usr   0.00 ( 0%) sys  34.22 (39%) wall       0 kB ( 0%) ggc

as the biggest offender.
Comment 15 Richard Biener 2013-01-30 13:50:04 UTC
All of the tree SSA incremental time is spent in computing the IDFs.  With
a patch to cache IDF on def-blocks nothing is gained.

Unpatched, n = 10000:
 tree SSA incremental    :  48.86 (51%) usr
 TOTAL                 :  94.88

I tried replacing the work-stack vector with a sparseset to avoid visiting
a block twice (and thus avoid doing the EXECUTE_IF_AND_COMPL_IN_BITMAP
twice which the 2nd time has no bits).  To no avail - sparseset seems to
have a higher overhead.
 tree SSA incremental    :  52.25 (53%) usr
 TOTAL                 :  98.60

Also the phi-insertion-point update can be done with bitmap_ior_into
(slows things down tremendously):
 tree SSA incremental    :  75.56 (62%) usr
 TOTAL                 : 121.73

combined with using a sparse-set for iteration:
 tree SSA incremental    :  78.28 (63%) usr
 TOTAL                 : 124.72

arranging for the work_stack to be large enough to hold 2*n_basic_blocks
we can use quick_push in the inner loop.
 tree SSA incremental    :  47.05 (51%) usr
 TOTAL                 :  91.60
Comment 16 Richard Biener 2013-01-30 14:38:29 UTC
The following (old!?) idea helps though:

Index: gcc/tree-ssa-loop-manip.c
===================================================================
--- gcc/tree-ssa-loop-manip.c   (revision 195574)
+++ gcc/tree-ssa-loop-manip.c   (working copy)
@@ -536,7 +536,7 @@ rewrite_into_loop_closed_ssa (bitmap cha
 
   /* Fix up all the names found to be used outside their original
      loops.  */
-  update_ssa (TODO_update_ssa);
+  update_ssa (TODO_update_ssa_no_phi);
 }
 
 /* Check invariants of the loop closed ssa form for the USE in BB.  */

why would adding a copy on an edge (thus a new def) require us to insert
new PHI nodes?  With that patch:

 tree SSA incremental    :   0.06 ( 0%) usr
 TOTAL                 :  46.36

Of course I must not remember sth here ...
Comment 17 Richard Biener 2013-01-30 15:40:57 UTC
(In reply to comment #16)
> The following (old!?) idea helps though:
> 
> Index: gcc/tree-ssa-loop-manip.c
> ===================================================================
> --- gcc/tree-ssa-loop-manip.c   (revision 195574)
> +++ gcc/tree-ssa-loop-manip.c   (working copy)
> @@ -536,7 +536,7 @@ rewrite_into_loop_closed_ssa (bitmap cha
> 
>    /* Fix up all the names found to be used outside their original
>       loops.  */
> -  update_ssa (TODO_update_ssa);
> +  update_ssa (TODO_update_ssa_no_phi);
>  }
> 
>  /* Check invariants of the loop closed ssa form for the USE in BB.  */
> 
> why would adding a copy on an edge (thus a new def) require us to insert
> new PHI nodes?  With that patch:
> 
>  tree SSA incremental    :   0.06 ( 0%) usr
>  TOTAL                 :  46.36
> 
> Of course I must not remember sth here ...

Multiple exits.

Btw, it's all virtual loop-closed PHI nodes we insert.  Thus, reverting

2012-08-23  Richard Guenther  <rguenther@suse.de>

        * tree-ssa-loop-manip.c (add_exit_phis_var): Allow virtual operands.
        (find_uses_to_rename_use): Likewise.
        (find_uses_to_rename_bb): Likewise.
        (find_uses_to_rename_stmt): Walk over all operands.

improves compile-time here (until somebody fixes the testcase so there is
a real use on each exit).
Comment 18 Richard Biener 2013-01-31 10:13:27 UTC
Ok, reverted.  With n = 50000 rest_of_handle_split_after_reload blows up
memory usage to > 3.5GB for me ...  With n = 40000 I managed to complete
with about 2GB memory usage (--enable-gather-detailed-mem-stats enabled compiler, release checking):

 parser function body    : 124.43 (14%) usr
 tree PTA                : 118.84 (13%) usr
 tree SSA rewrite        :  43.03 ( 5%) usr
 tree SSA other          :  47.07 ( 5%) usr
 dominator optimization  :  88.98 (10%) usr
 tree linearize phis     :  42.14 ( 5%) usr
 tree loop invariant motion: 169.80 (19%) usr
 tree SSA uncprop        :  41.77 ( 5%) usr
 forward prop            :  59.97 ( 7%) usr
 straight-line strength reduction:  42.30 ( 5%) usr
 TOTAL                 : 892.08
Comment 19 Richard Biener 2013-01-31 16:36:04 UTC
Created attachment 29317 [details]
kill dominator queries from domwalk

This patch kills dominator queries from domwalk, removing a quadratic bottleneck
I introduced there.  Do so by sorting DOM children after DFS completion numbers.

Which hopefully results in equivalent operation ;)

[TOOD: do something similar for CDI_POST_DOMINATORS direction]

Changes

 TOTAL                 : 575.28

to

 TOTAL                 : 353.61

With changes like

 dominator optimization  :  36.46 ( 6%) usr
to
 dominator optimization  :   2.03 ( 1%) usr

and

 tree loop invariant motion:  69.73 (12%) usr
to
 tree loop invariant motion:   1.32 ( 0%) usr

...
Comment 20 Richard Biener 2013-01-31 16:52:28 UTC
Remains:

 parser function body    : 122.90 (35%) usr
 tree PTA                : 120.27 (34%) usr
 TOTAL                 : 353.61

the rest is all <= 2%.
Comment 21 Steven Bosscher 2013-01-31 19:56:33 UTC
(In reply to comment #19)
> This patch kills dominator queries from domwalk, removing a quadratic
> bottleneck
> I introduced there.  Do so by sorting DOM children after DFS completion
> numbers.

You can't do this, several domwalk users modify the CFG during the walk.
True, they should not -- but they do. I tested a patch like yours, and it
broke several ASAN tests.
Comment 22 Steven Bosscher 2013-01-31 20:16:25 UTC
Author: steven
Date: Thu Jan 31 20:16:07 2013
New Revision: 195632

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195632
Log:
	PR middle-end/56113
	* fwprop.c (fwprop_init): Set up loops without CFG modifications.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fwprop.c
Comment 23 Steven Bosscher 2013-01-31 22:51:36 UTC
(In reply to comment #19)
> Created attachment 29317 [details]
> kill dominator queries from domwalk
> 
> This patch kills dominator queries from domwalk, removing a quadratic
> bottleneck
> I introduced there.  Do so by sorting DOM children after DFS completion
> numbers.
> 
> Which hopefully results in equivalent operation ;)

Part of the "equivalent operation" can be achieved by walking the maximum
extended basic block in DFS order first, and the remaining dominator sons
second. It's too late now to try and draft a proof, but I think that with
such a visiting order, you always visit all predecessors of the dominator
sons that are not successor blocks in the CFG. This is trivially true for
diamond regions and loops, and my intuition says it is true in general to
be proven by induction...
Comment 24 Steven Bosscher 2013-01-31 23:22:43 UTC
(In reply to comment #23)
> (In reply to comment #19)
> > Created attachment 29317 [details]
> > kill dominator queries from domwalk
> > 
> > This patch kills dominator queries from domwalk, removing a quadratic
> > bottleneck
> > I introduced there.  Do so by sorting DOM children after DFS completion
> > numbers.
> > 
> > Which hopefully results in equivalent operation ;)

Another alternative would be to set up a vector with the edge counts
at the start of the dominator walk.

When visiting a basic block bb, do

  FOR_EACH_EDGE (e, ei, bb->succs)
    if (e->dest->index < unvisited_preds_count.length () // *
        && (single_pred_p (e->dest) // common case, cheap test
            || dominated_by_p (CDI_DOMINATORS, e->dest, e->src))
      --unvisited_preds_count[e->dest]

and replace the expensive loop:

                FOR_EACH_EDGE (e, ei, bb->preds)
                  { 
                    if (!dominated_by_p (CDI_DOMINATORS, e->src, e->dest)
                        && !bitmap_bit_p (visited, e->src->index))
                      { 
                        found = false;
                        break;
                      }
                  }

with:

                if (e->dest->index < unvisited_preds_count.length () // *
                    && unvisited_preds_count[e->dest->index] > 0)
                  { 
                    found = false;
                    break;
                  }

(*) can go away if CFG modifications are forbidden during a domwalk,
but that's for GCC 4.9 earliest.
Comment 25 rguenther@suse.de 2013-02-01 08:48:32 UTC
On Thu, 31 Jan 2013, steven at gcc dot gnu.org wrote:

> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56113
> 
> --- Comment #24 from Steven Bosscher <steven at gcc dot gnu.org> 2013-01-31 23:22:43 UTC ---
> (In reply to comment #23)
> > (In reply to comment #19)
> > > Created attachment 29317 [details]
> > > kill dominator queries from domwalk
> > > 
> > > This patch kills dominator queries from domwalk, removing a quadratic
> > > bottleneck
> > > I introduced there.  Do so by sorting DOM children after DFS completion
> > > numbers.
> > > 
> > > Which hopefully results in equivalent operation ;)
> 
> Another alternative would be to set up a vector with the edge counts
> at the start of the dominator walk.
> 
> When visiting a basic block bb, do
> 
>   FOR_EACH_EDGE (e, ei, bb->succs)
>     if (e->dest->index < unvisited_preds_count.length () // *
>         && (single_pred_p (e->dest) // common case, cheap test
>             || dominated_by_p (CDI_DOMINATORS, e->dest, e->src))
>       --unvisited_preds_count[e->dest]
> 
> and replace the expensive loop:
> 
>                 FOR_EACH_EDGE (e, ei, bb->preds)
>                   { 
>                     if (!dominated_by_p (CDI_DOMINATORS, e->src, e->dest)
>                         && !bitmap_bit_p (visited, e->src->index))
>                       { 
>                         found = false;
>                         break;
>                       }
>                   }
> 
> with:
> 
>                 if (e->dest->index < unvisited_preds_count.length () // *
>                     && unvisited_preds_count[e->dest->index] > 0)
>                   { 
>                     found = false;
>                     break;
>                   }

Yeah, I thought about such scheme but found the proposed patch
much better ;)

> (*) can go away if CFG modifications are forbidden during a domwalk,
> but that's for GCC 4.9 earliest.

But it seems the patch passed bootstrap & regtest ok ... I wonder
how CFG modifications would survive the current scheme - after all
we're using a visited sbitmap, too.  So it at least can't be allowed
to add basic-blocks during the domwalk.  Changing dominator
relationship with the proposed patch would only make the sorting
bogus (thus, back to "random", as it were before the issue
was introduced to domwalk which was rev. 159100).

So, I'm going to propose the patch nevertheless - did you spot
code that does CFG manipulations?  asan/tsan do not use domwalk
as far as I can see.

Richard.
Comment 26 Richard Biener 2013-02-01 10:10:17 UTC
With another patch to PTA we now are bottle-necked by the C fronted
in update_label_decls ;)

 parser function body    : 125.32 (43%) usr
 alias stmt walking      :   7.49 ( 3%) usr
 tree PTA                :  51.62 (18%) usr
 tree DSE                :   8.53 ( 3%) usr
 TOTAL                 : 289.57

that was everything > 2% at -O1 with n = 40000.
Comment 27 Richard Biener 2013-02-01 12:38:51 UTC
Author: rguenth
Date: Fri Feb  1 12:38:45 2013
New Revision: 195646

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195646
Log:
2013-02-01  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/56113
	* tree-ssa-structalias.c (label_visit): Reduce work for
	single-predecessor nodes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
Comment 28 Richard Biener 2013-02-04 09:30:19 UTC
Author: rguenth
Date: Mon Feb  4 09:30:12 2013
New Revision: 195707

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195707
Log:
2013-02-04  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56113
	* tree-ssa-structalias.c (equiv_class_lookup, equiv_class_add):
	Merge into ...
	(equiv_class_lookup_or_add): ... this.
	(label_visit): Adjust and fix error in previous patch.
	(perform_var_substitution): Adjust.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
Comment 29 Steven Bosscher 2013-03-06 10:53:47 UTC
Now a C front end issue.
Comment 30 Richard Biener 2013-03-18 08:47:02 UTC
Author: rguenth
Date: Mon Mar 18 08:46:44 2013
New Revision: 196769

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=196769
Log:
2013-03-18  Richard Biener  <rguenther@suse.de>

	PR middle-end/56113
	* domwalk.c (bb_postorder): New global static.
	(cmp_bb_postorder): New function.
	(walk_dominator_tree): Replace scheme imposing an order for
	visiting dominator sons by one sorting them at the time they
	are pushed on the stack.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/domwalk.c
Comment 31 Richard Biener 2013-03-18 10:02:08 UTC
At -O2 this shows VRP is slow and uses much memory.

  n        VRP time
10000    5.76 (14%) usr
20000    18.90 (14%) usr
Comment 32 Kangkook 2013-04-04 18:31:19 UTC
Hi, guys

I have a couple of questions regarding the case. 

i) What is the current status of the fix? is this fixed or still open? 

ii) If it is fixed, 
  - how many nodes now it can scale?
  - What version of gcc comes with the patch/fix applied?

Thanks a lot for your help!

/Kangkook
Comment 33 Richard Biener 2013-04-08 08:37:22 UTC
(In reply to comment #32)
> Hi, guys
> 
> I have a couple of questions regarding the case. 
> 
> i) What is the current status of the fix? is this fixed or still open? 

It's still open, there are bits that can be still improved.

> ii) If it is fixed, 
>   - how many nodes now it can scale?

No idea, you have to try.  It depends on the amount of compile-time/memory
you have available.

>   - What version of gcc comes with the patch/fix applied?

GCC 4.8.0 contains some fixes that should help -O1 requirements.  GCC 4.9
will contain further fixes.

> Thanks a lot for your help!
> 
> /Kangkook