Out of memory when compiling a function with many goto labels (> 50k)

gcc 4.6.3 (running on Ubuntu 12.04 32-bit) fails to output an object file for a function with an extremely large number of goto labels. Attached is a simple script that generates the test source file. Here's how to reproduce the failure with the script:

~$ ./goto_gen.py 60000 t.c
~$ time cc -O1 -o t t.c
cc1: out of memory allocating 4064 bytes after a total of 2309808128 bytes

real    11m44.371s
user    11m42.592s
sys     0m1.876s

~$ time cc -O0 -o t t.c

real    22m5.106s
user    22m3.539s
sys     0m1.640s

As you can see from the above, the -O1 option triggers the problem while -O0 doesn't. Could anyone tell me how to work around this situation, or which specific optimization option causes the issue?

Thanks a lot for your help in advance.

Regards,
Kangkook
Created attachment 29276 [details]
a script that generates the code that reproduces the bug
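The attached script is not reproduced here; purely as an illustration (a hypothetical miniature, not the actual goto_gen.py output), the generated function presumably looks something like this, using GNU C's labels-as-values extension:

```c
/* Hypothetical miniature of the generated test case -- NOT the actual
   goto_gen.py output.  Every block carries a label and ends in a
   computed goto through a shared label table, so the CFG conservatively
   gets an edge from each computed goto to each label.  */
int walk_labels (void)
{
  /* GNU C: &&label yields the label's address, usable in a static table.  */
  static void *tab[] = { &&L0, &&L1, &&L2, &&L3, &&done };
  int i = 0, visited = 0;
L0: visited++; goto *tab[++i];
L1: visited++; goto *tab[++i];
L2: visited++; goto *tab[++i];
L3: visited++; goto *tab[++i];
done:
  return visited;  /* visits L0..L3 once each */
}
```

With tens of thousands of such blocks sharing one label table, every computed goto may target every label, which is what produces the dense CFG this report runs into.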
Please try GCC 4.8, which has _lots_ of scalability patches to address this kind of issue. Note that a 32-bit host is not guaranteed to be able to compile arbitrarily large functions.
Hi Richard,

Thanks a lot for your advice. I will definitely try gcc 4.8 and let you know about the result. Btw, I also tested it in a 64-bit environment, but it also crashed the compiler after using up 8GB of memory.

Thanks!
I do not believe this kind of test case can ever be compiled by any reasonable compiler. The problem is not the labels, but the computed gotos, which cause an exceptionally dense CFG with N_LABELS**2 edges. GCC factors these computed gotos early in the pass pipeline but expands them again near the end of it. The function is simply too large to handle.

But anyway, let's all pretend for a moment that this test case is sane...

On powerpc64-unknown-linux-gnu, with GCC 4.8 trunk r195103, compile time grows nicely linearly with the number of computed gotos in the function, but memory usage appears to grow quadratically:

$ for n in 1000 2000 3000 4000 5000 10000 20000 40000 ; do
> ./goto_gen.py ${n} t.c
> ~/bin/compacttime ./cc1 -quiet -O1 t.c
> done
DBG: 1000 29.48user 0.39system 0:29.88elapsed 99%CPU (0avgtext+0avgdata 172352maxresident)k
DBG: 2000 70.07user 0.56system 1:10.64elapsed 99%CPU (0avgtext+0avgdata 386496maxresident)k
DBG: 3000 137.31user 1.45system 2:18.80elapsed 99%CPU (0avgtext+0avgdata 664960maxresident)k
DBG: 4000 190.45user 2.43system 3:12.92elapsed 99%CPU (0avgtext+0avgdata 1050560maxresident)k
DBG: 5000 318.18user 2.90system 5:21.16elapsed 99%CPU (0avgtext+0avgdata 629760maxresident)k
DBG: 10000 1374.55user 13.03system 23:07.66elapsed 99%CPU (0avgtext+0avgdata 1701120maxresident)k
DBG: 20000 3587.06user 39.71system 1:00:27.19elapsed 99%CPU (0avgtext+0avgdata 6343296maxresident)k
DBG: 40000 (...still running... top says:
  4567 stevenb   20   0 23.4g  23g  17m R 100.1 37.6  26:20.65 cc1
so it's good this machine has 64GB plus some swap -- it might just make it all the way through to the end, some day... :-)

(Above 5000 computed gotos in the test case, there appears to be a tipping point where some transformation decides the function is too big to handle, causing memory usage to actually drop at that point. I'm guessing that's IRA's conflict table toggle, but I'm not sure.)
From the point of view of compile time, this is not great but not unreasonable either. According to -ftime-report, no pass stands out as scaling badly for compile time on this test case. Everything looks reasonable and as expected: parsing, gimplify, expand, combine, and IRA top the GC memory lists (being IR-rewriting passes), and the top time consumers for n=20000 are:

 tree PTA             :1093.69 (30%) usr   6250 kB ( 0%) ggc
 tree SSA incremental : 858.29 (24%) usr      0 kB ( 0%) ggc
 forward prop         : 316.48 ( 9%) usr  64614 kB ( 4%) ggc

The quadratic increase of the memory footprint is a problem, though. The test case with 20000 computed gotos has 360025 lines of code, which expand to ~10 instructions per line of code. Memory peaks at ~6.3GB. That doesn't look reasonable if you do a bit of math:

* There are ~3 gimple_statement_with_memory_ops per input line, with typically 3 operands. These are 128 bytes each, consuming an estimated 3*360000*128=131MB. Other than that, there are many small PHIs (and a few large PHIs for the factored computed goto and for the return statement) that consume a significant amount of memory. Let's assume we need about 6 times the 131MB to account for all stmt operands and for PHI nodes. That's ~1GB total for the function body.

* Everything else is relatively small. Per computed goto there are initially 7 basic blocks, 8 edges, and 8 label_decls. On a 64-bit host, sizeof(struct basic_block_def)=108, sizeof(struct edge_def)=56, and sizeof(struct tree_label_decl)=128. With alignment that's 128, 64, and 128. That's 20000*(7*128+8*64+8*128)=46MB for the CFG and labels. Nothing else of any significance lives in GC memory early in the compilation pipeline.

* Function bodies as RTL tend to be much smaller than GIMPLE, accounting for maybe 200MB or so in this case (1/5th of the size of the GIMPLE body is typical for pre-processed GCC itself).
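As a quick sanity check of the arithmetic above (a sketch; the sizes are the ones quoted in this comment, not newly measured):

```c
/* Re-derive the two memory estimates quoted above for n = 20000.  */

/* ~3 gimple_statement_with_memory_ops per input line at 128 bytes each.  */
unsigned long long gimple_stmt_bytes (void)
{
  return 3ULL * 360000 * 128;   /* 138240000 bytes, ~131 MiB */
}

/* Per computed goto: 7 basic blocks, 8 edges, 8 label_decls, with the
   structure sizes rounded up for alignment to 128, 64 and 128 bytes.  */
unsigned long long cfg_and_label_bytes (void)
{
  return 20000ULL * (7 * 128 + 8 * 64 + 8 * 128);  /* 48640000 bytes, ~46 MiB */
}
```

Both figures match the 131MB and 46MB quoted above, confirming that the measured multi-GB footprint cannot come from these structures alone.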
* The above estimates are supported by -ftime-report's total allocated GC memory counter: 1476435kB. That means the other 4.8GB is allocated in non-GC space, and that non-GC memory dominates the memory footprint. For n=10000, max. GC=741582kB and max. resident is 1.7GB, so here, too, ~1GB of non-GC memory dominates the memory footprint. For n=40000 peak memory is ~23.4GB so far. So tabulated:

n      max. mem  max. GC mem  max. non-GC mem
10000  1.7GB     741582kB     ~1GB
20000  6.3GB     1476435kB    ~4.8GB
40000  23.4GB    ~2900000kB   ~20.5GB

A compiler built with --enable-gather-detailed-mem-stats should be able to tell where and for what that memory is allocated. The peak memory usage happens pretty early in the compilation process, so I'm guessing the memory is allocated for PTA.
For completeness, n=40000:

compile time:   17531.37s
max. GC memory: 3001361kB
peak memory:    23.4GB

Big spenders:
 forward prop         :1757.74 (10%) usr  135302 kB ( 5%) ggc
 tree SSA incremental :3972.18 (23%) usr       0 kB ( 0%) ggc
 tree PTA             :6274.46 (36%) usr   12500 kB ( 0%) ggc
I've collected the ratios of the --enable-gather-detailed-mem-stats output for bitmaps, for test cases with n=1000 and n=8000. The worst non-linear memory behavior is in the following functions:

Bitmap                                     Alloc   Peak   Leak
------------------------------------------------------------------------
tree-ssa-loop-manip.c:248 (compute_live_l  75.49   7.87   0.00
cfganal.c:1167 (compute_idf)               72.62   7.69   0.00
tree-ssa-structalias.c:2104 (label_visit)  60.73  60.73  60.73
df-problems.c:554 (df_rd_transfer_functio  46.35   6.90   0.00
df-problems.c:233 (df_rd_alloc)            18.57  18.57  18.57
df-problems.c:525 (df_rd_transfer_functio  12.97   9.55   9.60
df-problems.c:2113 (df_chain_create_bb)    11.62   8.43   0.00
df-problems.c:413 (df_rd_local_compute)     9.91   1.00   0.00
df-problems.c:230 (df_rd_alloc)             9.61   9.61   9.61
df-problems.c:232 (df_rd_alloc)             9.61   9.61   9.61

The only one of the above where the amount of memory allocated is really significant is tree-ssa-structalias.c:2104 (label_visit):

n= 1000 ->   27055768 bytes allocated (of total  111411120, 24%)
n= 2000 ->  104669648 bytes allocated (of total  293048432, 36%)
n= 4000 ->  415898848 bytes allocated (of total  768746792, 54%)
n= 8000 -> 1642995248 bytes allocated (of total 2556626064, 64%)
n=16000 -> 6549989248 bytes allocated (of total 8780662936, 75%)
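The figures themselves show the quadratic trend: every doubling of n multiplies the label_visit allocation by roughly four. A small check over the numbers above (a sketch; the data is just copied from this comment):

```c
/* Bytes allocated at tree-ssa-structalias.c's label_visit for
   n = 1000, 2000, 4000, 8000, 16000 (figures from the comment above).  */
static const unsigned long long label_visit_bytes[] = {
  27055768ULL, 104669648ULL, 415898848ULL, 1642995248ULL, 6549989248ULL
};

/* Smallest growth factor (scaled by 100) between consecutive doublings
   of n.  Quadratic behavior predicts a factor of ~400 (i.e. 4x).  */
unsigned long long min_growth_x100 (void)
{
  unsigned long long min = ~0ULL;
  for (int i = 1; i < 5; i++)
    {
      unsigned long long g =
        label_visit_bytes[i] * 100 / label_visit_bytes[i - 1];
      if (g < min)
        min = g;
    }
  return min;
}
```

Every doubling lands between 3.8x and 4.0x, consistent with O(n^2) allocation.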
label_visit () seems to recursively collect points_to bits over the predecessor graph, thus using a quadratic amount of memory. It does so to optimize variables that point to the same stuff by assigning them the same pointer_label.

"indirect" nodes seem to be special as far as I can see, as they will never share the same label with a previously visited node. We should be able to represent their points_to set by themselves without missing any equivalences or creating false ones. Thus:

Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c	(revision 195502)
+++ gcc/tree-ssa-structalias.c	(working copy)
@@ -2103,6 +2103,17 @@ label_visit (constraint_graph_t graph, s
   if (!graph->points_to[n])
     graph->points_to[n] = BITMAP_ALLOC (&predbitmap_obstack);
 
+  /* Indirect nodes get fresh variables.  Represent those by that
+     single fresh variable.  */
+  if (!bitmap_bit_p (graph->direct_nodes, n))
+    {
+      bitmap_set_bit (graph->points_to[n], FIRST_REF_NODE + n);
+      graph->pointer_label[n] = pointer_equiv_class++;
+      equiv_class_add (pointer_equiv_class_table,
+		       graph->pointer_label[n], graph->points_to[n]);
+      return;
+    }
+
   /* Label and union our incoming edges's points to sets.  */
   EXECUTE_IF_IN_NONNULL_BITMAP (graph->preds[n], 0, i, bi)
     {
@@ -2117,9 +2128,6 @@ label_visit (constraint_graph_t graph, s
       if (graph->points_to[w])
 	bitmap_ior_into(graph->points_to[n], graph->points_to[w]);
     }
-  /* Indirect nodes get fresh variables. */
-  if (!bitmap_bit_p (graph->direct_nodes, n))
-    bitmap_set_bit (graph->points_to[n], FIRST_REF_NODE + n);
 
   if (!bitmap_empty_p (graph->points_to[n]))
     {

Not sure if it helps in this case, though. Assigning pointer_labels is an optimization and could be completely short-cut if necessary (not sure what the result would be for this testcase, or how we could detect up front whether performing this equivalencing is worthwhile or not).
One may also think of using the pointer_labels of the incoming edges to perform the unioning instead, thus caching by { set of pointer labels } -> { points-to set } instead of by expanded points-to set only. The algorithm is definitely poor in its memory usage ...
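The quadratic footprint is easy to model on a toy chain graph (an illustrative sketch, not GCC's data structures): each node's points_to set is the union of its predecessor's set plus one fresh variable, and all sets stay live until the walk finishes, so the total number of live bits is n*(n+1)/2.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of label_visit's memory use on a chain predecessor graph
   0 <- 1 <- ... <- n-1 (NOT the real GCC code).  Node i's set is the
   union of node i-1's set plus one fresh bit, and nothing is freed,
   so the total count of live bits is 1 + 2 + ... + n = n*(n+1)/2.  */
unsigned long chain_live_bits (unsigned n)
{
  unsigned char **sets = calloc (n, sizeof *sets);
  unsigned long total = 0;
  for (unsigned i = 0; i < n; i++)
    {
      sets[i] = calloc (n, 1);            /* one byte per potential bit */
      if (i > 0)
        memcpy (sets[i], sets[i - 1], n); /* union with the predecessor */
      sets[i][i] = 1;                     /* the node's fresh variable */
      for (unsigned j = 0; j < n; j++)
        total += sets[i][j];              /* count bits live in this set */
    }
  for (unsigned i = 0; i < n; i++)
    free (sets[i]);
  free (sets);
  return total;
}
```

Freeing a set as soon as its last consumer has been processed (the refcounting idea) would keep only O(1) sets live on such a chain instead of all n of them.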
Moving ->points_to to a separate obstack might also help (performing label_visit in topological order, we could then free ->points_to once we have visited all successors of a node).
(In reply to comment #7)

With the patch from comment #7:

n=1000   6.18user  254976k maxresident
n=2000  16.76user  509184k maxresident
n=4000  54.23user 1432576k maxresident
n=8000 201.31user 1343296k maxresident

Without:

n=1000   6.45user  255616k maxresident
n=2000  17.65user  572096k maxresident
n=4000  55.10user 1485184k maxresident
n=8000 199.49user 1456000k maxresident

So there appears to be some small effect on memory footprint, but nothing to get excited about, and no effect on compile time.
On Mon, 28 Jan 2013, steven at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56113
>
> --- Comment #9 from Steven Bosscher <steven at gcc dot gnu.org> 2013-01-28 23:34:36 UTC ---
> (In reply to comment #7)
>
> With the patch from comment #7:
>
> n=1000   6.18user  254976k maxresident
> n=2000  16.76user  509184k maxresident
> n=4000  54.23user 1432576k maxresident
> n=8000 201.31user 1343296k maxresident
>
> Without:
>
> n=1000   6.45user  255616k maxresident
> n=2000  17.65user  572096k maxresident
> n=4000  55.10user 1485184k maxresident
> n=8000 199.49user 1456000k maxresident
>
> So there appears to be some small effect on memory footprint but
> nothing to get excited about, and no effect on compile time.

Yeah, I sort-of expected that ... the following should help more (apart from the fact that we do not optimize the visiting order to minimize the number of live bitmaps). I was thinking of sth like the following - but this of course leaves the equiv hashtable pointing to freed bitmaps ... As finding all equivalences of U recursive-preds (n) for all nodes n is the goal, there doesn't seem to be a good way of avoiding the excessive memory use (we can free those that get their pointer_label shared - but that should be the minority).

I'm out of ideas - apart from killing this whole unification on the basis that it cannot be very effective.

Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c	(revision 195530)
+++ gcc/tree-ssa-structalias.c	(working copy)
@@ -507,6 +507,10 @@ struct constraint_graph
   /* Explicit predecessors of each node (Used for variable substitution).  */
   bitmap *preds;
 
+  /* Number of nodes this is in their preds (used to track lifetime of
+     points_to).  */
+  unsigned *n_pred_of;
+
   /* Indirect cycle representatives, or -1 if the node has no indirect
      cycles.  */
   int *indirect_cycles;
@@ -2112,10 +2116,14 @@ label_visit (constraint_graph_t graph, s
       /* Skip unused edges */
       if (w == n || graph->pointer_label[w] == 0)
-	continue;
-
-      if (graph->points_to[w])
+	;
+      else
 	bitmap_ior_into(graph->points_to[n], graph->points_to[w]);
+
+      /* If we were the last unvisited successor of w free
+	 its points-to set.  */
+      if (--graph->n_pred_of[w] == 0 && w != n)
+	BITMAP_FREE (graph->points_to[w]);
     }
   /* Indirect nodes get fresh variables. */
   if (!bitmap_bit_p (graph->direct_nodes, n))
@@ -2133,6 +2141,10 @@ label_visit (constraint_graph_t graph, s
 	}
       graph->pointer_label[n] = label;
     }
+
+  /* If n has itself as predecessor we delayed freeing points_to.  */
+  if (graph->n_pred_of[n] == 0)
+    BITMAP_FREE (graph->points_to[n]);
 }
 
 /* Perform offline variable substitution, discovering equivalence
@@ -2159,12 +2171,26 @@ perform_var_substitution (constraint_gra
     if (!bitmap_bit_p (si->visited, si->node_mapping[i]))
       condense_visit (graph, si, si->node_mapping[i]);
 
+  /* Count the number of nodes a node is predecessor of.  */
+  graph->n_pred_of = XCNEWVEC (unsigned, graph->size);
+  for (i = 0; i < graph->size; i++)
+    {
+      bitmap_iterator bi;
+      unsigned int j;
+      if (si->node_mapping[i] != i)
+	continue;
+      EXECUTE_IF_IN_NONNULL_BITMAP (graph->preds[i], 0, j, bi)
+	graph->n_pred_of[si->node_mapping[j]]++;
+    }
+
   bitmap_clear (si->visited);
   /* Actually the label the nodes for pointer equivalences */
   for (i = 0; i < FIRST_REF_NODE; i++)
     if (!bitmap_bit_p (si->visited, si->node_mapping[i]))
       label_visit (graph, si, si->node_mapping[i]);
 
+  free (graph->n_pred_of);
+
   /* Calculate location equivalence labels.  */
   for (i = 0; i < FIRST_REF_NODE; i++)
     {
Created attachment 29300 [details]
patch

n=100    0.97user   83232maxresident
n=1000   9.76user  260800maxresident
n=2000  23.14user  468096maxresident
n=4000  62.08user  890768maxresident
n=8000 189.13user 1722624maxresident

Looks reasonably linear (double-checking now with mem-stats; not sure if that will show reduced memory usage - maybe for the obstack peak use).

Release checking with mem-stats numbers, and a not -O0 compiler:

unpatched: n=10000 127.82user 7755408maxresident
  (top shows 1.8GB memory usage very quickly)
patched:   n=10000 126.42user 3400032maxresident
  (top shows max. 340MB memory usage for a very long time, later peaks at ~800MB)

unpatched:
tree-ssa-structalias.c:2105 (label_visit)  150024 2554542888 2554542888 2554542888
patched:
tree-ssa-structalias.c:2114 (label_visit)  150024 2554542888    1475192    1440608  0  0

The patch is of course lame - it only reduces the amount of memory used if there are a lot of pointer equivalences. But at least it's very cheap to do so.
Author: rguenth
Date: Tue Jan 29 14:22:47 2013
New Revision: 195541

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195541
Log:
2013-01-29  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56113
	* tree-ssa-structalias.c (equiv_class_lookup): Also return
	the bitmap leader.
	(label_visit): Free duplicate bitmaps and record the leader instead.
	(perform_var_substitution): Adjust.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
Author: rguenth
Date: Tue Jan 29 14:23:48 2013
New Revision: 195542

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195542
Log:
2013-01-29  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56113
	* tree-ssa-structalias.c (equiv_class_lookup): Also return
	the bitmap leader.
	(label_visit): Free duplicate bitmaps and record the leader instead.
	(perform_var_substitution): Adjust.

Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/tree-ssa-structalias.c
Run till exit from #0  update_label_decls (scope=0x7f32004f6c60)
    at /space/rguenther/src/svn/trunk/gcc/c/c-decl.c:1023

takes a long time for n = 50000. Seems to be quadratic (# of scopes times # of gotos).

Rewriting the function into SSA takes a surprisingly large amount of memory and compile time as well. I suppose using a special memory variable to factor the goto (one without virtual operands) would make this cheaper - basically avoid going into SSA for it (avoid the gigantic PHI). Similarly, rewrite_into_loop_closed_ssa is slow.

Next is PTA, but we know that already; the amount of extra memory used is reasonable now. Simple compile-time optimizations seem possible (we do a lot of redundant work in find_what_var_points_to by not caching the result for the representative). I have a patch to reduce (n = 10000)

 tree PTA             : 25.37 (24%) usr  0.03 ( 3%) sys  25.44 (24%) wall  3125 kB ( 1%) ggc

to

 tree PTA             :  5.06 ( 6%) usr  0.02 ( 2%) sys   5.09 ( 6%) wall   937 kB ( 0%) ggc

which leaves

 tree SSA incremental : 34.14 (39%) usr  0.00 ( 0%) sys  34.22 (39%) wall     0 kB ( 0%) ggc

as the biggest offender.
All of the tree SSA incremental time is spent computing the IDFs. With a patch to cache IDFs on def-blocks, nothing is gained. Unpatched, n = 10000:

 tree SSA incremental : 48.86 (51%) usr
 TOTAL                : 94.88

I tried replacing the work-stack vector with a sparseset to avoid visiting a block twice (and thus avoid doing the EXECUTE_IF_AND_COMPL_IN_BITMAP twice, which the 2nd time has no bits). To no avail - sparseset seems to have a higher overhead:

 tree SSA incremental : 52.25 (53%) usr
 TOTAL                : 98.60

Also, the phi-insertion-point update can be done with bitmap_ior_into (slows things down tremendously):

 tree SSA incremental : 75.56 (62%) usr
 TOTAL                : 121.73

combined with using a sparse-set for iteration:

 tree SSA incremental : 78.28 (63%) usr
 TOTAL                : 124.72

Arranging for the work_stack to be large enough to hold 2*n_basic_blocks, we can use quick_push in the inner loop:

 tree SSA incremental : 47.05 (51%) usr
 TOTAL                : 91.60
The following (old!?) idea helps though:

Index: gcc/tree-ssa-loop-manip.c
===================================================================
--- gcc/tree-ssa-loop-manip.c	(revision 195574)
+++ gcc/tree-ssa-loop-manip.c	(working copy)
@@ -536,7 +536,7 @@ rewrite_into_loop_closed_ssa (bitmap cha
 
   /* Fix up all the names found to be used outside their original
      loops.  */
-  update_ssa (TODO_update_ssa);
+  update_ssa (TODO_update_ssa_no_phi);
 }
 
 /* Check invariants of the loop closed ssa form for the USE in BB.  */

Why would adding a copy on an edge (thus a new def) require us to insert new PHI nodes? With that patch:

 tree SSA incremental :  0.06 ( 0%) usr
 TOTAL                : 46.36

Of course I must be forgetting sth here ...
(In reply to comment #16)
> why would adding a copy on an edge (thus a new def) require us to insert
> new PHI nodes?

Multiple exits. Btw, it's all virtual loop-closed PHI nodes we insert. Thus, reverting

2012-08-23  Richard Guenther  <rguenther@suse.de>

	* tree-ssa-loop-manip.c (add_exit_phis_var): Allow virtual operands.
	(find_uses_to_rename_use): Likewise.
	(find_uses_to_rename_bb): Likewise.
	(find_uses_to_rename_stmt): Walk over all operands.

improves compile time here (until somebody fixes the testcase so there is a real use on each exit).
Ok, reverted. With n = 50000, rest_of_handle_split_after_reload blows up memory usage to > 3.5GB for me ... With n = 40000 I managed to complete with about 2GB memory usage (--enable-gather-detailed-mem-stats enabled compiler, release checking):

 parser function body            : 124.43 (14%) usr
 tree PTA                        : 118.84 (13%) usr
 tree SSA rewrite                :  43.03 ( 5%) usr
 tree SSA other                  :  47.07 ( 5%) usr
 dominator optimization          :  88.98 (10%) usr
 tree linearize phis             :  42.14 ( 5%) usr
 tree loop invariant motion      : 169.80 (19%) usr
 tree SSA uncprop                :  41.77 ( 5%) usr
 forward prop                    :  59.97 ( 7%) usr
 straight-line strength reduction:  42.30 ( 5%) usr
 TOTAL                           : 892.08
Created attachment 29317 [details]
kill dominator queries from domwalk

This patch kills dominator queries from domwalk, removing a quadratic bottleneck I introduced there. It does so by sorting DOM children after DFS completion numbers, which hopefully results in equivalent operation ;)

[TODO: do something similar for the CDI_POST_DOMINATORS direction]

Changes

 TOTAL : 575.28

to

 TOTAL : 353.61

with changes like

 dominator optimization     : 36.46 ( 6%) usr
to
 dominator optimization     :  2.03 ( 1%) usr

and

 tree loop invariant motion : 69.73 (12%) usr
to
 tree loop invariant motion :  1.32 ( 0%) usr

...
Remains:

 parser function body : 122.90 (35%) usr
 tree PTA             : 120.27 (34%) usr
 TOTAL                : 353.61

The rest is all <= 2%.
(In reply to comment #19) > This patch kills dominator queries from domwalk, removing a quadratic > bottleneck > I introduced there. Do so by sorting DOM children after DFS completion > numbers. You can't do this, several domwalk users modify the CFG during the walk. True, they should not -- but they do. I tested a patch like yours, and it broke several ASAN tests.
Author: steven
Date: Thu Jan 31 20:16:07 2013
New Revision: 195632

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195632
Log:
	PR middle-end/56113
	* fwprop.c (fwprop_init): Set up loops without CFG modifications.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fwprop.c
(In reply to comment #19)
> Created attachment 29317 [details]
> kill dominator queries from domwalk
>
> This patch kills dominator queries from domwalk, removing a quadratic
> bottleneck I introduced there. Do so by sorting DOM children after DFS
> completion numbers.
>
> Which hopefully results in equivalent operation ;)

Part of the "equivalent operation" can be achieved by walking the maximal extended basic block in DFS order first, and the remaining dominator sons second. It's too late now to try and draft a proof, but I think that with such a visiting order you always visit all predecessors of the dominator sons that are not successor blocks in the CFG. This is trivially true for diamond regions and loops, and my intuition says it is true in general, to be proven by induction...
(In reply to comment #23)
> (In reply to comment #19)
> > This patch kills dominator queries from domwalk, removing a quadratic
> > bottleneck I introduced there. Do so by sorting DOM children after DFS
> > completion numbers.
> >
> > Which hopefully results in equivalent operation ;)

Another alternative would be to set up a vector with the edge counts at the start of the dominator walk. When visiting a basic block bb, do

  FOR_EACH_EDGE (e, ei, bb->succs)
    if (e->dest->index < unvisited_preds_count.length () // *
        && (single_pred_p (e->dest) // common case, cheap test
            || dominated_by_p (CDI_DOMINATORS, e->dest, e->src)))
      --unvisited_preds_count[e->dest->index];

and replace the expensive loop:

  FOR_EACH_EDGE (e, ei, bb->preds)
    {
      if (!dominated_by_p (CDI_DOMINATORS, e->src, e->dest)
          && !bitmap_bit_p (visited, e->src->index))
        {
          found = false;
          break;
        }
    }

with:

  if (e->dest->index < unvisited_preds_count.length () // *
      && unvisited_preds_count[e->dest->index] > 0)
    {
      found = false;
      break;
    }

(*) can go away if CFG modifications are forbidden during a domwalk, but that's for GCC 4.9 earliest.
On Thu, 31 Jan 2013, steven at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56113
>
> --- Comment #24 from Steven Bosscher <steven at gcc dot gnu.org> 2013-01-31 23:22:43 UTC ---
> Another alternative would be to set up a vector with the edge counts
> at the start of the dominator walk.
> [...]
> (*) can go away if CFG modifications are forbidden during a domwalk,
> but that's for GCC 4.9 earliest.

Yeah, I thought about such a scheme but found the proposed patch much better ;)

But it seems the patch passed bootstrap & regtest ok ... I wonder how CFG modifications would survive the current scheme - after all we're using a visited sbitmap, too. So it at least can't be allowed to add basic blocks during the domwalk. Changing the dominator relationship with the proposed patch would only make the sorting bogus (thus, back to "random", as it was before the issue was introduced to domwalk in rev. 159100).
So, I'm going to propose the patch nevertheless - did you spot code that does CFG manipulations? asan/tsan do not use domwalk as far as I can see. Richard.
With another patch to PTA we are now bottlenecked by the C front end in update_label_decls ;)

 parser function body : 125.32 (43%) usr
 alias stmt walking   :   7.49 ( 3%) usr
 tree PTA             :  51.62 (18%) usr
 tree DSE             :   8.53 ( 3%) usr
 TOTAL                : 289.57

That was everything > 2% at -O1 with n = 40000.
Author: rguenth
Date: Fri Feb  1 12:38:45 2013
New Revision: 195646

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195646
Log:
2013-02-01  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56113
	* tree-ssa-structalias.c (label_visit): Reduce work for
	single-predecessor nodes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
Author: rguenth
Date: Mon Feb  4 09:30:12 2013
New Revision: 195707

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195707
Log:
2013-02-04  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56113
	* tree-ssa-structalias.c (equiv_class_lookup, equiv_class_add):
	Merge into ...
	(equiv_class_lookup_or_add): ... this.
	(label_visit): Adjust and fix error in previous patch.
	(perform_var_substitution): Adjust.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
Now a C front end issue.
Author: rguenth
Date: Mon Mar 18 08:46:44 2013
New Revision: 196769

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=196769
Log:
2013-03-18  Richard Biener  <rguenther@suse.de>

	PR middle-end/56113
	* domwalk.c (bb_postorder): New global static.
	(cmp_bb_postorder): New function.
	(walk_dominator_tree): Replace scheme imposing an order for
	visiting dominator sons by one sorting them at the time they
	are pushed on the stack.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/domwalk.c
At -O2 this shows VRP is slow and uses much memory:

n      VRP time
10000   5.76 (14%) usr
20000  18.90 (14%) usr
Hi guys,

I have a couple of questions regarding this case.

i) What is the current status of the fix? Is this fixed or still open?
ii) If it is fixed:
 - To how many nodes does it now scale?
 - Which version of gcc comes with the patch/fix applied?

Thanks a lot for your help!

/Kangkook
(In reply to comment #32)
> i) What is the current status of the fix? is this fixed or still open?

It's still open; there are bits that can still be improved.

> ii) If it is fixed,
>  - how many nodes now it can scale?

No idea, you have to try. It depends on the amount of compile time/memory you have available.

>  - What version of gcc comes with the patch/fix applied?

GCC 4.8.0 contains some fixes that should help -O1 requirements. GCC 4.9 will contain further fixes.
GCC 8 takes 400MB and 11s at -O1 for n=6000, and 3.5GB and 899s for n=60000, where it first runs into the mentioned C parser issue (update_label_decls).

 Time variable                  usr            sys           wall            GGC
 phase parsing          : 733.99 ( 82%)  5.93 ( 60%) 739.94 ( 81%)  273812 kB ( 10%)
 phase opt and generate : 165.08 ( 18%)  3.98 ( 40%) 169.07 ( 19%) 2576743 kB ( 90%)
 parser function body   : 726.43 ( 81%)  5.44 ( 55%) 732.53 ( 81%)  191868 kB (  7%)
 alias stmt walking     :  17.95 (  2%)  1.12 ( 11%)  19.28 (  2%)    1875 kB (  0%)
 tree PTA               :  14.61 (  2%)  0.10 (  1%)  14.73 (  2%)    4219 kB (  0%)
 TOTAL                  : 899.07         9.94        909.03         2851811 kB

899.07user 10.04system 15:09.13elapsed 99%CPU (0avgtext+0avgdata 3579512maxresident)k
0inputs+39984outputs (0major+1276027minor)pagefaults 0swaps

Re-categorizing.
(In reply to Richard Biener from comment #34)
> GCC 8 takes 400MB and 11s at -O1 for 6000 and 3.5GB and 899s for 60000
> where it first runs into the mentioned C parser issue (update_label_decls).
> [...]
> Re-categorizing.

Ian and Joseph are the ones doing the implementation (which is just for a diagnostic, btw).
Different Ian, but I'm not sure which one -- ILT
(In reply to Ian Lance Taylor from comment #36) > Different Ian, but I'm not sure which one -- ILT I'm sure it was you - r148512. Of course the intent of that rev. was to speed things up. I didn't check whether this change regressed the testcase in this PR though.
Ah, sorry, misunderstood. Yes, that work was all for the goal of implementing -Wjump-misses-init, which is a small aspect of -Wc++-compat. That was part of the work of getting GCC to use the common subset of C and C++ before we switched to C++. (It's also used for the error of jumping into the scope of a variably modified type.)
On Fri, 10 May 2019, ian at airs dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56113
>
> --- Comment #38 from Ian Lance Taylor <ian at airs dot com> ---
> Ah, sorry, misunderstood.
>
> Yes, that work was all for the goal of implementing -Wjump-misses-init, which
> is a small aspect of -Wc++-compat. That was part of the work of getting GCC to
> use the common subset of C and C++ before we switched to C++.
>
> (It's also used for the error of jumping into the scope of a variably modified
> type.)

Yes, and it looks like this code is somehow quadratic (or worse) in the number of jumps/locals.
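As a toy cost model of that behavior (an assumption about its shape inferred from the profile, not the actual c-decl.c code): if every scope pop revisits all gotos/labels recorded so far, the total front-end work grows quadratically with the input size.

```c
/* Hypothetical cost model: each pop_scope walks all gotos/labels seen
   so far, so `scopes` pops over `per_scope` new gotos each cost
   per_scope * (1 + 2 + ... + scopes) operations in total -- quadratic
   in the number of scopes.  */
unsigned long long pop_scope_work (unsigned scopes, unsigned per_scope)
{
  unsigned long long pending = 0, work = 0;
  for (unsigned s = 0; s < scopes; s++)
    {
      pending += per_scope;  /* gotos declared inside this scope */
      work += pending;       /* the scope pop revisits all of them */
    }
  return work;
}
```

Under this model, doubling the number of scopes roughly quadruples the work, which is consistent with pop_scope dominating the profile for huge inputs.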
GCC 4.8.5:

 parser function body      : 225.66 (15%) usr  2.48 (54%) sys 228.54 (16%) wall 106264 kB ( 4%) ggc
 dominator optimization    : 162.90 (11%) usr  0.20 ( 4%) sys 163.21 (11%) wall  84375 kB ( 4%) ggc
 tree loop invariant motion: 338.05 (23%) usr  0.01 ( 0%) sys 338.28 (23%) wall      0 kB ( 0%) ggc

1463.11user 4.62system 24:28.53elapsed 99%CPU (0avgtext+0avgdata 3079920maxresident)k
26096inputs+0outputs (12major+631840minor)pagefaults 0swaps

GCC 13.1:

 parser function body      : 721.25 ( 74%)  3.36 ( 54%) 725.07 ( 74%)  203M (  8%)

977.96user 6.37system 16:24.91elapsed 99%CPU (0avgtext+0avgdata 3468252maxresident)k
69024inputs+0outputs (104major+1015878minor)pagefaults 0swaps

So overall things improved, but the issue in the parser is still present. Peak memory use also regressed somewhat (though the GCC binary itself is a lot larger). On trunk it's really only the front end. I believe we might have a duplicate for this particular issue.
Samples: 3M of event 'cycles', Event count (approx.): 4336899960635
Overhead  Samples  Command  Shared Object  Symbol
  73.66%  2793479  cc1      cc1            [.] pop_scope
   5.99%   226673  cc1      cc1            [.] bitmap_ior_into
   4.72%   178543  cc1      cc1            [.] bitmap_set_bit
   1.45%    55762  cc1      cc1            [.] dse_classify_store
   1.33%    50428  cc1      cc1            [.] bitmap_bit_p
   0.36%    13619  cc1      cc1            [.] rev_post_order_and_mark