> /usr/local/gcc49/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc49/bin/gcc
COLLECT_LTO_WRAPPER=/usr/local/gcc49/libexec/gcc/x86_64-apple-darwin13.2.0/4.10.0/lto-wrapper
Target: x86_64-apple-darwin13.2.0
Configured with: ../../cc/gcc/configure --prefix=/usr/local/gcc49 --with-arch=native --with-tune=native --disable-nls --with-gmp=/sw --disable-bootstrap --with-isl=/sw --enable-languages=c,c++,lto,objc,obj-c++ --no-create --no-recursion
Thread model: posix
gcc version 4.10.0 20140615 (experimental) (GCC)

For the attached source (a C translation of a large Brainfuck (BF) program):
- gcc -O0 takes 9 minutes
- gcc -Os does not finish after 40 minutes
Created attachment 32944 [details]
Preprocessed source
Since you are testing the trunk, checking is turned on by default. Can you try with --enable-checking=release instead? This should reduce the compile time.
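For reference, enabling release checking means reconfiguring the build. With the reporter's setup from the gcc -v output above, that would look something along the lines of (the only added flag is --enable-checking=release; everything else is copied from the original configure line):

  ../../cc/gcc/configure --prefix=/usr/local/gcc49 --enable-checking=release \
      --with-arch=native --with-tune=native --disable-nls --with-gmp=/sw \
      --disable-bootstrap --with-isl=/sw --enable-languages=c,c++,lto,objc,obj-c++

followed by a rebuild.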
Without checking, -O0 went from 8 -> 5 minutes. I stopped the -Os compile at 29 minutes.
Also seems to use a lot of memory, even at -O0 with 4.9. Let's see if I can find a machine with enough memory to do some comparisons.
4.8 and 4.9 at -O0 take ~32s and ~2.4GB of RAM (with IRA taking 28% of the time).
4.8 at -O1 takes ~15min and ~5.5GB of RAM (RTL loop invariant motion taking 80% of the time - ugh).
4.9 at -O1 takes ~5min and ~2GB of RAM (30% of the compile time in SSA incremental, 27% in DF).

Now -O[s2] will enable passes that are not really "tamed" and may show quadratic behavior (PRE and VRP, for example, off the top of my head). You're supposed to drop to -O1 whenever you hit this kind of bug ...

Building a release-checking trunk now.
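(A note for anyone reproducing these numbers: per-pass timings like the ones quoted in this thread come from GCC's -ftime-report option, with something like

  gcc -O1 -ftime-report -S testcase.i

where testcase.i stands for the attached preprocessed source.)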
4.9 at -Os takes 5min and ~2.2GB of RAM (points-to takes 20%, DF 33%).
(In reply to Richard Biener from comment #6)
> 4.9 at -Os takes 5min and ~2.2GB of RAM (points-to takes 20%, DF 33%)

trunk at -Os takes 15min and ~2.1GB of RAM (dominator optimization takes 67%)
trunk at -O1 takes 14min and ~2GB of RAM (still DOM at 62%)

So it seems that DOM regressed a lot on trunk. Confirmed as a 4.10 regression.
(In reply to Richard Biener from comment #7)
> trunk at -Os takes 15min and ~2.1GB of RAM (dominator optimization takes 67%)
> trunk at -O1 takes 14min and ~2GB of RAM (still DOM at 62%)
>
> So it seems that DOM regressed a lot on trunk. Confirmed as a 4.10 regression.

With -O1 -fno-tree-dominator-opts behavior is sane:

 df reaching defs           :  86.99 (28%) usr  36.53 (66%) sys 124.04 (34%) wall       0 kB ( 0%) ggc
 tree PTA                   :  10.31 ( 3%) usr   0.15 ( 0%) sys  10.47 ( 3%) wall    1948 kB ( 0%) ggc
 tree SSA incremental       :  98.24 (31%) usr   8.39 (15%) sys 106.63 (29%) wall   13570 kB ( 1%) ggc
 tree loop invariant motion :  14.46 ( 5%) usr   0.10 ( 0%) sys  14.54 ( 4%) wall   14421 kB ( 1%) ggc
 scev constant prop         :  11.14 ( 4%) usr   0.02 ( 0%) sys  11.18 ( 3%) wall   28795 kB ( 2%) ggc
 TOTAL                      : 312.10        55.72        368.92          1523092 kB

312.10user 55.84system 6:09.34elapsed 99%CPU (0avgtext+0avgdata 2073876maxresident)k
19664inputs+13880outputs (130major+38202143minor)pagefaults 0swaps

(well, sane apart from DF and SSA incremental), but with DOM not disabled we get

 df reaching defs           : 108.63 (11%) usr   8.83 (38%) sys 117.08 (12%) wall       0 kB ( 0%) ggc
 tree PTA                   :  10.51 ( 1%) usr   0.09 ( 0%) sys  10.60 ( 1%) wall    1948 kB ( 0%) ggc
 tree SSA incremental       :  99.41 (10%) usr   9.17 (39%) sys 108.93 (11%) wall   13474 kB ( 1%) ggc
 dominator optimization     : 617.08 (63%) usr   0.32 ( 1%) sys 618.97 (61%) wall   56878 kB ( 3%) ggc
 tree loop invariant motion :   2.15 ( 0%) usr   0.11 ( 0%) sys   2.25 ( 0%) wall   12012 kB ( 1%) ggc
 scev constant prop         :  13.31 ( 1%) usr   0.02 ( 0%) sys  13.36 ( 1%) wall   28348 kB ( 2%) ggc
 TOTAL                      : 981.98        23.25       1007.67          1625119 kB

981.98user 23.34system 16:47.79elapsed 99%CPU (0avgtext+0avgdata 2071112maxresident)k
184inputs+66416outputs (5major+18571554minor)pagefaults 0swaps

(yes, this is with release checking)

The testcase has _lots_ of loops (~11000) inside a big outer one; the maximum nesting isn't too big (<10 from what I can see). SSA incremental is likely the loop-closed SSA rewrite - I didn't check. It should be possible to reduce the testcase somewhat if needed.

Eventually the solution for DOM is to disable the new path-based threading (if that turns out to be the issue) for -fno-expensive-optimizations (that is, optimize < 2).
Created attachment 32952 [details]
stripped down testcase

Stripped down, now it starts to fall off a bit:

 tree SSA incremental       :   1.48 ( 7%) usr   0.42 (33%) sys   1.96 ( 8%) wall    1882 kB ( 1%) ggc
 dominator optimization     :  10.57 (50%) usr   0.03 ( 2%) sys  12.44 (48%) wall    6918 kB ( 4%) ggc
 TOTAL                      :  21.12         1.27         25.67           191269 kB

(just cutting in half three times from the bottom and appending } until it compiles again)
All time is spent in invalidate_equivalences:

/* A new value has been assigned to LHS.  If necessary, invalidate any
   equivalences that are no longer valid.  */
static void
invalidate_equivalences (tree lhs, vec<tree> *stack)
{
  for (unsigned int i = 1; i < num_ssa_names; i++)
    if (ssa_name (i) && SSA_NAME_VALUE (ssa_name (i)) == lhs)
      record_temporary_equivalence (ssa_name (i), NULL_TREE, stack);

  if (SSA_NAME_VALUE (lhs))
    record_temporary_equivalence (lhs, NULL_TREE, stack);
}

This is obviously quadratic ... nearly all calls are from

static gimple
record_temporary_equivalences_from_stmts_at_dest (edge e,
                                                  vec<tree> *stack,
                                                  tree (*simplify) (gimple, gimple),
                                                  bool backedge_seen)
{
...
      else if (backedge_seen)
        invalidate_equivalences (gimple_get_lhs (stmt), stack);
    }

  return stmt;
}

I think you should record a bitmap of SSA names whose equivalences need
invalidating, and only at the end do a single scan over all SSA names.
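To make the complexity concrete, here is a small self-contained toy model (not GCC code; all names and sizes are made up for illustration) contrasting the current eager rescan with batching the invalidations into a bitset and flushing once at the end of the block:

  /* Toy model of the invalidation cost, not GCC code: values[i] stands in
     for SSA_NAME_VALUE (name_i); -1 means no equivalence recorded.  */
  #include <stdio.h>
  #include <stdlib.h>

  #define N 20000

  static int values[N];
  static unsigned char to_invalidate[N];
  static long scans;  /* counts inner-loop work */

  /* Quadratic: each invalidated LHS rescans every name, like the
     current invalidate_equivalences.  */
  static void
  invalidate_now (int lhs)
  {
    for (int i = 0; i < N; i++, scans++)
      if (values[i] == lhs)
        values[i] = -1;
  }

  /* Linear: names are only marked as statements are processed; a single
     scan at the end of the block performs all invalidations.  */
  static void
  flush_invalidations (void)
  {
    for (int i = 0; i < N; i++, scans++)
      if (values[i] >= 0 && to_invalidate[values[i]])
        values[i] = -1;
  }

  int
  main (void)
  {
    for (int i = 0; i < N; i++)
      values[i] = rand () % N;
    scans = 0;
    for (int lhs = 0; lhs < N; lhs++)
      invalidate_now (lhs);
    printf ("eager scans:   %ld\n", scans);  /* N*N = 400000000 */

    for (int i = 0; i < N; i++)
      values[i] = rand () % N;
    scans = 0;
    for (int lhs = 0; lhs < N; lhs++)
      to_invalidate[lhs] = 1;
    flush_invalidations ();
    printf ("batched scans: %ld\n", scans);  /* N = 20000 */
    return 0;
  }

With N "statements" and N "names" the eager variant does ~N^2 units of work while the batched one does ~N, which is the behavior difference the timings above show.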
Same code on the 4.9 branch now.
Fundamentally we don't have a way to know what equivalences we need to invalidate. Invalidation is, umm, painful. The question in my mind is why are we getting so many invalidations to start with. That's the first thing to look at. Honestly though, I really wonder if handling backedges is worth the effort, even though it's important for one benchmark.
On Tue, 17 Jun 2014, law at redhat dot com wrote:
> Fundamentally we don't have a way to know what equivalences we need to
> invalidate. Invalidation is, umm, painful. The question in my mind is why
> are we getting so many invalidations to start with. That's the first thing
> to look at.

Well, it's easy to avoid the quadraticness - you can always create testcases
that need a lot of invalidations. But the current algorithm really does not
scale.

> Honestly though, I really wonder if handling backedges is worth the effort,
> even though it's important for one benchmark.

Not sure about that, but trivial improvements to the scalability are possible
here. Walking all SSA names is O(number of stmts) - if you do that O(number
of stmts) times (as you do), that's clearly the bug.
GCC 4.9.1 has been released.
(In reply to rguenther@suse.de from comment #13)
> Not sure about that, but trivial improvements to the scalability are
> possible here. Walking all SSA names is O(number of stmts) - if you do
> that O(number of stmts) times (as you do), that's clearly the bug.

Sth like (untested):

Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 216464)
+++ tree-ssa-threadedge.c	(working copy)
@@ -287,20 +287,6 @@ fold_assignment_stmt (gimple stmt)
     }
 }
 
-/* A new value has been assigned to LHS.  If necessary, invalidate any
-   equivalences that are no longer valid.  */
-static void
-invalidate_equivalences (tree lhs, vec<tree> *stack)
-{
-
-  for (unsigned int i = 1; i < num_ssa_names; i++)
-    if (ssa_name (i) && SSA_NAME_VALUE (ssa_name (i)) == lhs)
-      record_temporary_equivalence (ssa_name (i), NULL_TREE, stack);
-
-  if (SSA_NAME_VALUE (lhs))
-    record_temporary_equivalence (lhs, NULL_TREE, stack);
-}
-
 /* Try to simplify each statement in E->dest, ultimately leading to
    a simplification of the COND_EXPR at the end of E->dest.
@@ -335,6 +321,8 @@ record_temporary_equivalences_from_stmts
    we discover.  Note any equivalences we discover are context
    sensitive (ie, are dependent on traversing E) and must be unwound
    when we're finished processing E.  */
+  sbitmap to_invalidate = sbitmap_alloc (num_ssa_names);
+  bitmap_clear (to_invalidate);
   for (gsi = gsi_start_bb (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
     {
       tree cached_lhs = NULL;
@@ -379,7 +367,7 @@ record_temporary_equivalences_from_stmts
 
 	  if (backedge_seen)
 	    FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_DEF)
-	      invalidate_equivalences (op, stack);
+	      bitmap_set_bit (to_invalidate, SSA_NAME_VERSION (op));
 
 	  continue;
 	}
@@ -419,7 +407,7 @@ record_temporary_equivalences_from_stmts
 	  if (backedge_seen)
 	    {
 	      tree lhs = gimple_get_lhs (stmt);
-	      invalidate_equivalences (lhs, stack);
+	      bitmap_set_bit (to_invalidate, SSA_NAME_VERSION (lhs));
 	    }
 	  continue;
 	}
@@ -497,8 +485,24 @@ record_temporary_equivalences_from_stmts
 	      || is_gimple_min_invariant (cached_lhs)))
 	record_temporary_equivalence (gimple_get_lhs (stmt), cached_lhs, stack);
       else if (backedge_seen)
-	invalidate_equivalences (gimple_get_lhs (stmt), stack);
+	bitmap_set_bit (to_invalidate,
+			SSA_NAME_VERSION (gimple_get_lhs (stmt)));
     }
+
+  /* Now invalidate all equivalences we have to invalidate.  */
+  unsigned i;
+  sbitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (to_invalidate, 0, i, bi)
+    {
+      tree name = ssa_name (i);
+      record_temporary_equivalence (name, NULL_TREE, stack);
+
+      if (name)
+	record_temporary_equivalence (name, NULL_TREE, stack);
+    }
+
+  sbitmap_free (to_invalidate);
+
   return stmt;
 }
Btw, why is it recording the temporary equivalence twice? It might be simpler even with

  /* Now invalidate all equivalences we have to invalidate.  */
  unsigned i;
  sbitmap_iterator bi;
  EXECUTE_IF_SET_IN_BITMAP (to_invalidate, 0, i, bi)
    record_temporary_equivalence (ssa_name (i), NULL_TREE, stack);
Oops, that's not 100% the same. But the following would be:

  /* Now invalidate all equivalences we have to invalidate.  */
  for (unsigned int i = 1; i < num_ssa_names; ++i)
    {
      tree name = ssa_name (i);
      if (!name)
        continue;
      /* SSA_NAME_VALUE may be NULL_TREE, so guard the TREE_CODE check.  */
      tree val = SSA_NAME_VALUE (name);
      if (val
          && TREE_CODE (val) == SSA_NAME
          && bitmap_bit_p (to_invalidate, SSA_NAME_VERSION (val)))
        record_temporary_equivalence (name, NULL_TREE, stack);
    }

  unsigned i;
  sbitmap_iterator bi;
  EXECUTE_IF_SET_IN_BITMAP (to_invalidate, 0, i, bi)
    {
      tree name = ssa_name (i);
      if (SSA_NAME_VALUE (name))
        record_temporary_equivalence (name, NULL_TREE, stack);
    }
Testing that.
Btw, are you sure all temporary equivalences are to SSA names only? ISTR we have memory equivalences as well.
Created attachment 33766 [details]
non-working patch

Of course this still walks all SSA names (but only once per BB), so it isn't
really removing the quadraticness. Moreover the patch (see attachment) I was
testing fails to bootstrap on the 4.9 branch, seemingly miscompiling the
stage2 gcc (it crashes building the stage2 libgcc, in swap_tree_comparison
called from IVOPTs). Eventually record_temporary_equivalences_from_stmts_at_dest
depends on the exact ordering of the record_temporary_equivalence calls (I do
the invalidating ones late). Jeff?

As this is a regression from 4.9.0 appearing in 4.9.1 it's quite bad that this
is still unfixed given we are about to release 4.9.2.

I'm out of here again ;)
So to recap - apart from really fixing the quadraticness - it would be nice if we could just disable threading over backedges at -O1, that is, for !flag_expensive_optimizations. Especially on the 4.9 branch.
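A hypothetical sketch of such a guard (illustration only, not committed code; the exact predicate and its placement in the threader are assumptions):

  /* Hypothetical: with -O1 (-fexpensive-optimizations off), refuse to
     follow loop backedges during jump threading, since traversing a
     backedge is what triggers the costly equivalence invalidation.  */
  if ((e->flags & EDGE_DFS_BACK) && !flag_expensive_optimizations)
    return;

As far as I can tell the threading code already consults EDGE_DFS_BACK when it computes backedge_seen, so the DFS back-edge marking would be available at such a point.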
GCC 4.9.2 has been released.
The piece you're missing is that when we see an assignment to some SSA_NAME,
call it LHS, and we've traversed a backedge, then we have to walk through all
the equivalences, find anything that's been marked as equivalent to LHS, and
invalidate those. Then we also have to invalidate LHS itself:

  for (unsigned int i = 1; i < num_ssa_names; i++)
    if (ssa_name (i) && SSA_NAME_VALUE (ssa_name (i)) == lhs)
      record_temporary_equivalence (ssa_name (i), NULL_TREE, stack);

  if (SSA_NAME_VALUE (lhs))
    record_temporary_equivalence (lhs, NULL_TREE, stack);

The loop handles things equivalent to LHS, the second test handles LHS itself.
Both are required for correctness.

In the past there was a map similar to your implementation, but it's not
sufficient. See pr61289.C and pr61289-2.C. The problem is you may need to
invalidate equivalences that are created *outside* tree-ssa-threadedge.c, so
any attempt to track them inside tree-ssa-threadedge.c is not sufficient for
correctness.

What's still not clear to me here is *why* we're calling the invalidation code
in the first place. It's only supposed to be called when we encounter a
statement which produces an output that we can't handle *and* we've traversed
a backedge. The combination of those two events is supposed to be very rare;
otherwise the invalidation is obviously too expensive. That's why I suggested
in c#12 that we need to look at *why* we're calling the invalidation code at
all.
(In reply to Jeffrey A. Law from comment #23)
> What's still not clear to me here is *why* we're calling the invalidation
> code in the first place. [...] That's why I suggested in c#12 that we need
> to look at *why* we're calling the invalidation code at all.

Well, I don't know the code at all, so I can just try to fix the complexity of
the present algorithm. Please have a look here during stage3.
So I think there's another approach. invalidate_equivalences is passed the
stack of temporary equivalences, which includes those created by jump
threading as well as those created by DOM.

This means that, in theory, we could walk that stack (in reverse order) to
find all the equivalences we need to invalidate. If the stack has fewer
entries than we have SSA_NAMEs, then we win.

For the stripped-down testcase in this PR we go from ~40 seconds to ~1 second
in the relevant part of the compiler with a hack that does exactly what's
described above.

I need to do a bit more review/research, but it looks like a promising
alternative.
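A rough sketch of that idea (my reading of it, not the committed patch; it assumes the unwinding-stack layout used by record_temporary_equivalence, which pushes <previous value, name> pairs with NULL markers between blocks):

  static void
  invalidate_equivalences (tree lhs, vec<tree> *stack)
  {
    /* Walk the unwinding stack instead of all SSA names: it contains
       exactly the context-sensitive equivalences recorded so far.
       Entries may be NULL block markers or saved previous values, so
       only treat SSA_NAMEs whose current value is LHS as hits - at
       worst we invalidate conservatively.  The loop bound is taken
       once up front, because record_temporary_equivalence pushes new
       entries on top as we invalidate.  */
    for (int i = (int) stack->length () - 1; i >= 0; i--)
      {
        tree x = (*stack)[i];
        if (x && TREE_CODE (x) == SSA_NAME && SSA_NAME_VALUE (x) == lhs)
          record_temporary_equivalence (x, NULL_TREE, stack);
      }

    /* And invalidate LHS itself, as before.  */
    if (SSA_NAME_VALUE (lhs))
      record_temporary_equivalence (lhs, NULL_TREE, stack);
  }

The cost is now proportional to the number of live temporary equivalences rather than to num_ssa_names, which is what makes the ~40s -> ~1s difference plausible.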
(In reply to Jeffrey A. Law from comment #25)
> This means that, in theory, we could walk that stack (in reverse order) to
> find all the equivalences we need to invalidate. If the stack has fewer
> entries than we have SSA_NAMEs, then we win.

Huh, I thought I had suggested that somewhere but you said the information was
not available here. But yes, this sounds exactly like what we want to do
complexity-wise. I suppose we would need to amend the stack of temporary
equivalences to record the basic block we are creating the actual set of
equivalences for (to be able to terminate the walk?).
Richi, I thought you had mentioned something along those lines as well, and I scanned for it yesterday but didn't see it. Maybe it's in a different BZ or something. I'll probably come across it as I work through the various jump threading BZs ;-)
Author: law
Date: Fri Nov 7 22:55:00 2014
New Revision: 217239

URL: https://gcc.gnu.org/viewcvs?rev=217239&root=gcc&view=rev

Log:
	PR tree-optimization/61515
	* tree-ssa-threadedge.c (invalidate_equivalences): Walk the unwinding
	stack rather than looking at every SSA_NAME's value.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-threadedge.c
GCC 4.9.3 has been released.
Can this be closed? The reduced testcase currently compiles in slightly over a minute with -O2 and -Os. Not great, but good enough maybe?
(In reply to Bernd Schmidt from comment #30)
> Can this be closed? The reduced testcase currently compiles in slightly over
> a minute with -O2 and -Os. Not great, but good enough maybe?

On the 4.9 branch? I don't think the patch has been backported.
That patch ought to be trivial to backport to 4.9. I just never considered it that much of a priority. I'll do a bootstrap/regression test on 4.9 and commit it.
Author: law
Date: Thu Dec 10 19:40:23 2015
New Revision: 231542

URL: https://gcc.gnu.org/viewcvs?rev=231542&root=gcc&view=rev

Log:
	PR tree-optimization/61515
	PR tree-optimization/46590
	Backport from mainline.
	* tree-ssa-threadedge.c (invalidate_equivalences): Walk the unwinding
	stack rather than looking at every SSA_NAME's value.

Modified:
	branches/gcc-4_9-branch/gcc/ChangeLog
	branches/gcc-4_9-branch/gcc/tree-ssa-threadedge.c
Fixed on branch as well.