In the following example, the virtual calls via ap should be specialized -- there is one dominating call target -- but compiling the program at -O3 with -fprofile-use, it does not happen.

  g++ -O1 -fprofile-generate devirt.cc
  ./a.out
  g++ -fprofile-use -O3 -fdump-tree-optimized devirt.cc

// devirt.cc
class A
{
public:
  virtual int foo () { return 1; }
  int i;
};
class B : public A
{
public:
  virtual int foo () { return 2; }
  int b;
};

int
main ()
{
  int i;
  A *ap = 0;
  for (i = 0; i < 10000; i++)
    {
      if (i % 7 == 0)
	ap = new A ();
      else
	ap = new B ();
      ap->foo ();
      delete ap;
    }
  return 0;
}
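For context, the specialization being asked for here is indirect-call promotion: since the profile says one target dominates, the compiler should test for that target and call it directly. The sketch below hand-applies that transform at source level (the helper name call_foo_speculative and the use of dynamic_cast as the "is it the hot type" test are illustrative assumptions, not what GCC emits internally):

```cpp
#include <cassert>

struct A { virtual int foo () { return 1; } virtual ~A () {} };
struct B : A { int foo () override { return 2; } };

// Hand-written analogue of profile-driven indirect-call promotion:
// check for the dominating target (B, hit 6 times out of 7 in the
// loop below) and call it directly so it can be inlined; fall back
// to the normal virtual dispatch otherwise.
static int
call_foo_speculative (A *ap)
{
  if (B *bp = dynamic_cast<B *> (ap))  // hot path: dominating target
    return bp->B::foo ();              // direct (devirtualized) call
  return ap->foo ();                   // cold path: virtual call
}

static int
run ()
{
  int sum = 0;
  for (int i = 0; i < 10000; i++)
    {
      A *ap = (i % 7 == 0) ? static_cast<A *> (new A ())
			   : static_cast<A *> (new B ());
      sum += call_foo_speculative (ap);
      delete ap;
    }
  return sum;  // 1429 calls return 1, 8571 return 2
}
```

The speculative version computes the same result as plain virtual dispatch; the point is only that the hot call becomes statically bound.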
Jump threading happened, but we did not propagate the call correctly:

  ap = (struct A *) D.2075;
  ap->_vptr$A = &_ZTV1A[2];
  OBJ_TYPE_REF(*ap->_vptr$A;ap->0) (ap);
...
  this = (struct B *) D.2076;
  this->D.2018._vptr$A = &_ZTV1B[2];
  ap.61 = &this->D.2018;
  OBJ_TYPE_REF(*ap.61->_vptr$A;ap.61->0) (ap.61);
A couple of years later we finally devirtualize here:

  <bb 5>:
  ap_8 = operator new (16);
  ap_8->i = 0;
  ap_8->_vptr.A = &MEM[(void *)&_ZTV1A + 16B];
  _19 = foo;
  PROF_26 = [obj_type_ref] OBJ_TYPE_REF(_19;(struct A)ap_8->0);
  if (PROF_26 == foo)
    goto <bb 8>;
  else
    goto <bb 7>;

  <bb 6>:
  ap_13 = operator new (16);
  MEM[(struct B *)ap_13].D.2237.i = 0;
  MEM[(struct B *)ap_13].b = 0;
  MEM[(struct B *)ap_13].D.2237._vptr.A = &MEM[(void *)&_ZTV1B + 16B];
  _1 = foo;
  PROF_30 = [obj_type_ref] OBJ_TYPE_REF(_1;(struct A)ap_13->0);
  if (PROF_30 == foo)
    goto <bb 8>;
  else
    goto <bb 7>;

However, the code ends up super silly after tracer. For some reason we do not manage to fold away the virtual table lookup. Why?
The following patch gets rid of the OBJ_TYPE_REF:

Index: value-prof.c
===================================================================
--- value-prof.c	(revision 206040)
+++ value-prof.c	(working copy)
@@ -1333,9 +1333,15 @@ gimple_ic (gimple icall_stmt, struct cgr
   cond_bb = gimple_bb (icall_stmt);
   gsi = gsi_for_stmt (icall_stmt);
 
-  tmp0 = make_temp_ssa_name (optype, NULL, "PROF");
-  tmp1 = make_temp_ssa_name (optype, NULL, "PROF");
-  tmp = unshare_expr (gimple_call_fn (icall_stmt));
+  tmp0 = make_temp_ssa_name (optype, NULL, "SPEC");
+  tmp1 = make_temp_ssa_name (optype, NULL, "SPEC");
+  tmp = gimple_call_fn (icall_stmt);
+
+  /* Drop the OBJ_TYPE_REF expression; it is ignored by the rest of
+     the optimization queue anyway.  */
+  if (TREE_CODE (tmp) == OBJ_TYPE_REF)
+    tmp = OBJ_TYPE_REF_EXPR (tmp);
+  tmp = unshare_expr (tmp);
   load_stmt = gimple_build_assign (tmp0, tmp);
   gsi_insert_before (&gsi, load_stmt, GSI_SAME_STMT);

but it does not help.  We end up with:

  _1 = foo;
  if (_1 == foo)
    goto <bb 8>;

which is not much better.  This nonsense appears only in the optimized dump; before that we have the following nonsense:

  # ap_9 = PHI <ap_13(6)>
  # prephitmp_18 = PHI <&MEM[(void *)&_ZTV1B + 16B](6)>
  _1 = *prephitmp_18;

that is introduced by duplicating <bb 7>:

  # ap_2 = PHI <ap_8(5), ap_13(6)>
  # prephitmp_14 = PHI <&MEM[(void *)&_ZTV1A + 16B](5), &MEM[(void *)&_ZTV1B + 16B](6)>
  _19 = *prephitmp_14;
  if (_19 == foo)
    goto <bb 9>;
  else
    goto <bb 8>;

The following patch:

Index: passes.def
===================================================================
--- passes.def	(revision 206040)
+++ passes.def	(working copy)
@@ -242,9 +242,9 @@ along with GCC; see the file COPYING3.
 	     only examines PHIs to discover const/copy propagation
 	     opportunities.  */
 	  NEXT_PASS (pass_phi_only_cprop);
+	  NEXT_PASS (pass_tracer);
 	  NEXT_PASS (pass_vrp);
 	  NEXT_PASS (pass_cd_dce);
-	  NEXT_PASS (pass_tracer);
 	  NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_forwprop);
 	  NEXT_PASS (pass_phiopt);

makes VRP propagate the value through, but only when OBJ_TYPE_REF is not around.
I think pushing tracer up is a good idea -- we should have at least one VRP/CCP pass after tracer. Its main purpose is to provide enough context so those forward-propagating passes can do a better job; it seems stupid to run it only after everything else has finished. I also see why tracer would prefer DCE to be done first, but we have only so many passes to spend.
"hubicka at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> wrote:
>http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35545
>
>--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
>Couple years later we finally devirtualize here:
>[... GIMPLE dump quoted from comment #2 ...]
>however the code ends up super sily after tracer.
>for some reason we do not manage to fold away the virtual table lookup.
>Why?

This is probably DOM's fault.  Does it even handle function pointers?  Tracer runs way too late and with no cleanups behind it.
The main issue seems to be that VRP messes up on:

  # ap_2 = PHI <ap_8(5)>
  # prephitmp_14 = PHI <&MEM[(void *)&_ZTV1A + 16B](5)>
  _19 = *prephitmp_14;

here it somehow won't constant-propagate the load.

Index: passes.def
===================================================================
--- passes.def	(revision 206040)
+++ passes.def	(working copy)
@@ -236,6 +236,7 @@ along with GCC; see the file COPYING3.
 	  NEXT_PASS (pass_reassoc);
 	  NEXT_PASS (pass_strength_reduction);
 	  NEXT_PASS (pass_dominator);
+	  NEXT_PASS (pass_tracer);
 	  /* The only const/copy propagation opportunities left after
 	     DOM should be due to degenerate PHI nodes.  So rather than
 	     run the full propagators, run a specialized pass which
@@ -244,7 +245,6 @@ along with GCC; see the file COPYING3.
 	  NEXT_PASS (pass_phi_only_cprop);
 	  NEXT_PASS (pass_vrp);
 	  NEXT_PASS (pass_cd_dce);
-	  NEXT_PASS (pass_tracer);
 	  NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_forwprop);
 	  NEXT_PASS (pass_phiopt);

actually helps, since phi_only_cprop is good at this transform.  I do not quite gather why VRP can't do it itself.  I sent the first patch to http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01517.html
You certainly don't want to put something between DOM and phi-only-cprop. Jump threading will tend to expose lots of degenerate PHIs. phi-only-cprop eliminates those degenerates. We could have used the normal cprop code, but it seemed too heavy-weight for the cleanups we wanted to do. One could argue we want a phi-only-cprop cleanup after VRP since it threads jumps too.
> You certainly don't want to put something between DOM and phi-only-cprop.
> Jump threading will tend to expose lots of degenerate PHIs.  phi-only-cprop
> eliminates those degenerates.  We could have used the normal cprop code, but
> it seemed too heavy-weight for the cleanups we wanted to do.

Well, constants, copies and PHIs are accounted as 0 size and thus are not part of tracer's cost model, so perhaps we do not care about the presence of degenerate PHIs here. Moreover, it is a degenerate PHI produced by tracer that causes our problems. I assume DOM produces degenerate PHIs by code duplication for jump threading; tracer is exactly the same type of transformation, and for the same reasons we may want phi-only-cprop after tracer as we do after DOM.

It seems, however, more like a missed optimization in VRP that it cannot paper over that degenerate PHI produced by tracer in the first place, so I will try to poke at this tomorrow a bit more.
It's not a matter of cost model, but of propagating the values to their uses. I haven't looked closely at the tracer, but wouldn't it benefit from having constants in particular propagated to their uses?

Yes, DOM's duplication/isolation of paths exposes the degenerate PHIs. One of the nice side effects of jump threading is that it isolates paths. So we might have had:

BBn
  x = phi (a, b, c, constant, d, e, f)

Duplication for threading might turn that into:

BBn:
  x = phi (a, b, c, d, e, f)   // the original

and elsewhere:

BBm:
  x' = phi (constant)          // the duplicate

Propagating the constant for x' in BBm and eliminating the degenerate is what the phi-only cprop pass does. If the tracer generates similar things, then running phi-only cprop after it might be useful as well. It *should* be very fast.
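The BBn/BBm transform above can be modeled in a few lines. This toy (the types PhiArg, Phi and the function fold_degenerate_phi are made-up names, not GCC internals) captures the one thing phi-only cprop exploits: a single-argument PHI carrying a constant folds to that constant:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy SSA PHI argument: either a known constant or an unknown variable.
struct PhiArg { bool is_const; int value; std::string var; };
using Phi = std::vector<PhiArg>;

// Miniature phi-only cprop: a degenerate PHI (exactly one argument)
// whose argument is a constant is replaced by that constant; a PHI
// merging several paths cannot be folded this way.
static bool
fold_degenerate_phi (const Phi &phi, int *out)
{
  if (phi.size () == 1 && phi[0].is_const)
    {
      *out = phi[0].value;
      return true;
    }
  return false;
}
```

In the example above, x = phi (a, b, c, d, e, f) stays put, while the isolated duplicate x' = phi (constant) folds and disappears.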
> It's not a matter of cost model, but if propagating the values to their uses.
> I haven't looked closely at the tracer, but wouldn't it benefit by having
> constants in particular propagated to their uses?

Tracer depends on the usual estimate_num_insns limits (it has been 12 years since I wrote it, so this is what I recall). It collects BBs that are interesting starts of traces, takes them in priority order, and duplicates from the seeds until the code-growth budget is met or it runs out of interesting candidates by other criteria. I think it generally tends to starve on candidates, as the definition of a trace is relatively strong, but I am not 100% sure of that. So it is not that much dependent on the bounds given by the code size metric.

If we had unlimited time, it would be better to propagate constants and clean up both before and after tracer. If we can choose whether to run tracer before or after the last pass that is able to propagate and fold constants, I would choose before, for the reason I mentioned at the beginning: the whole point of tail duplication is to simplify the CFG and allow better propagation. I think missed tracing here and there is less painful than missed optimization in the duplicated code.

We may even consider pushing tracer before DOM, since tail duplication may enable DOM to produce more useful threading/propagation, and code after tracer is not too painfully obfuscated. Sure, you can end up with a PHI that has only one constant argument; I can see that DOM may miss an optimization there.

> Propagating the constant for x' in BBm and eliminating the degenerate is what
> the phi-only cprop pass does. If the tracer generates similar things, then
> running phi-only cprop after it might be useful as well. It *should* be very
> fast.

Yes, tracer does similar things.
You can think about it as speculative jump threading -- if one path through the meet points seems more likely than the other based on the profile, tracer will duplicate it in the hope that a later optimization pass will prove some of the conditionals constant over the duplicated path. For that it needs a subsequent propagation pass (CCP, or better VRP) to match. That is why its current place in the pass queue is unlucky. Possible benefits of tail duplication are of course not limited to threading. We can do one extra cleanup pass, too. Tracer is on by default only with -fprofile-use, so an extra phi-only cprop with -ftracer probably is not dangerous to the overall compile-time experience.

Honza
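The payoff of this speculative duplication is visible even at source level. In the hand-done sketch below (the function names before/after are illustrative), the second test in `before` cannot fold because its operand merges two paths; after tail duplication each copy of the test sees a single constant and folds away:

```cpp
#include <cassert>

// Before tracer: flag merges at the bottom of the if, so the second
// test operates on a PHI of both paths and cannot be folded.
static int
before (int i)
{
  int flag;
  if (i % 7 == 0)
    flag = 1;
  else
    flag = 2;
  // merge point: flag = PHI <1, 2>
  if (flag == 1)
    return 10;
  return 20;
}

// After tail duplication: the trailing test is copied into both arms,
// where each copy sees a single constant and disappears entirely.
static int
after (int i)
{
  if (i % 7 == 0)
    return 10;   // was: flag = 1; if (flag == 1) ...
  return 20;     // was: flag = 2; if (flag == 1) ...
}
```

The two versions are behaviorally identical; only the duplicated form gives a propagation pass something to fold, which is why a CCP/VRP run after tracer matters.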
> Tracer depends on the usual estimate_num_insns limits
> (it is 12 years since I wrote it, so what I recall)

Note that one important thing that changed in those 12 years is that I originally carefully tuned tracer in combination with crossjumping. Tracer produced duplicates, and if no pass managed to make use of them, crossjumping cleaned them up pre-reload. Trace formation in bb-reorder re-instantiated the duplicates when that seemed sensible for code layout. This broke, since SSA makes RTL crossjumping quite useless and it is now done only after register allocation. We never really got a working code unification pass on GIMPLE.

http://www.ucw.cz/~hubicka/papers/amd64/node4.html

claims that at that time I got a 1.6% speedup on SPECint with profile feedback for 1.43% code growth. That is not bad, but I wonder what it translates to today.

Honza
Author: hubicka
Date: Tue Dec 17 23:41:41 2013
New Revision: 206073

URL: http://gcc.gnu.org/viewcvs?rev=206073&root=gcc&view=rev
Log:
	PR middle-end/35545
	* tree-vrp.c (extract_range_from_unary_expr_1): Handle OBJ_TYPE_REF.

Modified:
   trunk/gcc/ChangeLog
   trunk/gcc/tree-vrp.c

Author: hubicka
Date: Tue Dec 17 23:43:22 2013
New Revision: 206074

URL: http://gcc.gnu.org/viewcvs?rev=206074&root=gcc&view=rev
Log:
	PR middle-end/35545
	* gimple-fold.c (fold_gimple_assign): Attempt to devirtualize
	OBJ_TYPE_REF.
	(gimple_fold_stmt_to_constant_1): Bypass OBJ_TYPE_REF wrappers.

Modified:
   trunk/gcc/ChangeLog
   trunk/gcc/gimple-fold.c
We still fail to fold here.  After tracer we get:

  # ap_2 = PHI <ap_8(4)>
  # prephitmp_14 = PHI <&MEM[(void *)&_ZTV1A + 16B](4)>
  _19 = *prephitmp_14;
  PROF_24 = [obj_type_ref] OBJ_TYPE_REF(_19;(struct A)ap_2->0);
  if (PROF_24 == foo)
    goto <bb 9>;
  else
    goto <bb 8>;

and the necessary forward propagation & folding to realize that:

  _19 = foo;
  PROF_24 = [obj_type_ref] OBJ_TYPE_REF(_19;(struct A)ap_8->0);
  if (PROF_24 == foo)

happens only in the .optimized dump. I would go with the patch from comment 5 moving tracer earlier -- the tail duplication intends to make code more optimizable, and it is stupid to run it only after all optimizations are done.
Well - ideally you'd want to do some CSE, not only VRP (constant propagation in this case). So I'd move it even earlier, after pass_lower_vector_ssa? I can't see what tracing would break (some odd cases can destroy loop info, but I think I have disabled all those).
This feels like the kind of situation where I've always wanted a pass to be able to say something like "I've done some set of transformations; schedule the appropriate cleanup passes to run."

It's heavyweight, but making the tracer imply a later DOM or VRP pass might make sense.

Jeff
Author: hubicka
Date: Sat Sep 27 00:03:23 2014
New Revision: 215651

URL: https://gcc.gnu.org/viewcvs?rev=215651&root=gcc&view=rev
Log:
	PR middle-end/35545
	* passes.def (pass_tracer): Move before last dominator pass.
	* g++.dg/tree-prof/pr35545.C: New testcase.

Added:
   trunk/gcc/testsuite/g++.dg/tree-prof/pr35545.C
Modified:
   trunk/gcc/ChangeLog
   trunk/gcc/passes.def
   trunk/gcc/testsuite/ChangeLog
I have moved tracer before the late cleanups, which seems to be a rather obvious thing to do. This lets us optimize the testcase (with -O2):

int main() ()
{
  struct A * ap;
  int i;
  int _6;

  <bb 2>:

  <bb 3>:
  # i_29 = PHI <i_22(6), 0(2)>
  _6 = i_29 % 7;
  if (_6 == 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  ap_8 = operator new (16);
  ap_8->i = 0;
  ap_8->_vptr.A = &MEM[(void *)&_ZTV1A + 16B];
  goto <bb 6>;

  <bb 5>:
  ap_13 = operator new (16);
  MEM[(struct B *)ap_13].D.2244.i = 0;
  MEM[(struct B *)ap_13].b = 0;
  MEM[(struct B *)ap_13].D.2244._vptr.A = &MEM[(void *)&_ZTV1B + 16B];

  <bb 6>:
  # ap_4 = PHI <ap_13(5), ap_8(4)>
  operator delete (ap_4);
  i_22 = i_29 + 1;
  if (i_22 != 10000)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 7>:
  return 0;

}

Martin, I do not have a SPEC setup; do you think you could benchmark the attached patch with SPEC and profile feedback, and also non-FDO -O3 -ftracer compared to -O3, please? It would be nice to know the code size impact, too.

Index: passes.def
===================================================================
--- passes.def	(revision 215651)
+++ passes.def	(working copy)
@@ -155,6 +155,7 @@ along with GCC; see the file COPYING3.
 	  NEXT_PASS (pass_dce);
 	  NEXT_PASS (pass_call_cdce);
 	  NEXT_PASS (pass_cselim);
+	  NEXT_PASS (pass_tracer);
 	  NEXT_PASS (pass_copy_prop);
 	  NEXT_PASS (pass_tree_ifcombine);
 	  NEXT_PASS (pass_phiopt);
@@ -252,7 +253,6 @@ along with GCC; see the file COPYING3.
 	  NEXT_PASS (pass_cse_reciprocals);
 	  NEXT_PASS (pass_reassoc);
 	  NEXT_PASS (pass_strength_reduction);
-	  NEXT_PASS (pass_tracer);
 	  NEXT_PASS (pass_dominator);
 	  NEXT_PASS (pass_strlen);
 	  NEXT_PASS (pass_vrp);

Doing it at approximately the same place as loop header copying seems to make most sense to me. It definitely benefits from the early cleanups and DCE, and it should enable more fun with the later scalar passes, which are almost all rerun afterwards.
(In reply to Jan Hubicka from comment #16)
> I have moved tracer before the late cleanups that seems to be rather obbious
> thing to do. This lets us to optimize the testcase (with -O2):
> [...]
> Doing it at same approximately the same place as loop header copying seems
> to make most sense to me.  It benefits from early cleanups and DCE definitly
> and it should enable more fun with the later scalar passes that are almost
> all rerun then.

We can try some internal benchmarks with this change too.

David
> WE can try some internal benchmarks with this change too.

That would be very welcome. Tracer used to be quite a useful pass in the old days, doing 1.6% on -O3+FDO SPECint (for a 1.4% code size cost), which put it very close to the inliner (1.8%), which however needed 12.9% of code size.

http://www.ucw.cz/~hubicka/papers/amd64/node4.html (see the aggressive optimization table)

I do not think it was ever reasonably retuned for GIMPLE. Martin, do you have any current scores? Part of the benefits came from the lack of global optimizers, but I think tail duplication has potential in a modern compiler, too.

Perhaps we should try to do that and, given that we now have a tail merging pass, consider making it useful without FDO, too.

Honza
(In reply to Jan Hubicka from comment #18)
> That would be very welcome. Tracer used to be quite useful pass in old days,
> doing 1.6% on -O3+FDO SPECint (for 1.4% code size cost) that put it very
> close to the inliner (1.8%) that however needed 12.9% of code size.
> [...]

I saw a small improvement across most of the benchmarks with this change on Sandy Bridge machines with FDO. The geomean improvement is about 0.3%. The baseline compiler is the Google gcc-49 compiler, and the test compiler is Google gcc-49 with the patch.
On Fri, 26 Sep 2014, law at redhat dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35545
>
> --- Comment #14 from Jeffrey A. Law <law at redhat dot com> ---
> This feels like the kind of situation where I've always wanted a pass to be
> able to say something like "I've done some set of transformations", schedule
> the appropriate cleanup passes to run".
>
> It's heavyweight, but making the tracer imply a later DOM or VRP pass might
> make sense.

Uh, that idea will lead to very big compile-time increases. One thing I'd like to do at some point is make the SSA propagators and/or DOM work on MEME regions so we can schedule them on a sub-CFG.

Richard.
On Sat, 27 Sep 2014, hubicka at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35545
>
> --- Comment #16 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> I have moved tracer before the late cleanups that seems to be rather obbious
> thing to do. This lets us to optimize the testcase (with -O2):
> [...]
> Doing it at same approximately the same place as loop header copying seems to
> make most sense to me. It benefits from early cleanups and DCE definitly and
> it should enable more fun with the later scalar passes that are almost all
> rerun then.

We need to make sure tracer doesn't mess too much with loops then. Btw, "useless" tracing may be undone again by tail-merging.

Tracer seems to consume only profile information and thus doesn't rely on any other transforms (well, apart from cleanups, which could affect its cost function). Why not schedule it even earlier? Like before pass_build_alias? (The pipeline up to the loop transforms is quite a mess...)
> > Doing it at same approximately the same place as loop header copying seems to
> > make most sense to me. It benefits from early cleanups and DCE definitly and
> > it should enable more fun with the later scalar passes that are almost all
> > rerun then.
>
> We need to make sure tracer doesn't mess too much with loops then.
> Btw, "useless" tracing may be undone again by tail-merging.

Tracer already does:

	  /* We have the tendency to duplicate the loop header
	     of all do { } while loops.  Do not do that - it is
	     not profitable and it might create a loop with multiple
	     entries or at least rotate the loop.  */
	  && bb2->loop_father->header != bb2)

so it won't kill natural loops and peel (I should find time to update the peeling pass). It also has:

      if (bb_seen_p (bb2) || (e->flags & (EDGE_DFS_BACK | EDGE_COMPLEX))
	  || find_best_successor (bb2) != e)
	break;

to not unroll. So I think it is safe for the loop optimizer -- all it can do is expand a loop that has control flow in it, reducing its unrollability later.

> Tracer seems to consume only profile information and thus doesn't
> rely on any other transforms (well, apart from cleanups which could
> affect its cost function). Why not schedule it even earlier?
> Like to before pass_build_alias? (the pipeline up to loop transforms
> is quite a mess...)

It uses profile information and code size estimates. I expect that (especially for C++ code) the early post-inline passes will remove a lot of code and thus improve traceability. This is why I looked for a spot after the first DCE/DSE.

David, can you please be more specific about how you tested? Was it with profile feedback? What about code size metrics?

Honza
(In reply to Jan Hubicka from comment #22)
> David, can you please be more specific about how you tested? Was it with
> profile feedback? What about code size metrics?

It is with profile feedback. There is also a reduction in code size, though a very tiny change.

David
Hello Honza. I've been working on the SPEC numbers; I will send them this evening.
SPEC CPU numbers (--size=train, --iterations=3). The first three columns present run time relative to gcc-O3 (smaller means faster); the last three present binary size relative to gcc-O3:

+------------+----------+------------+---------------+-------------+-----------------+--------------------+
|            | gcc-O3   | gcc-O3-fdo | gcc-O3-tracer | gcc-O3_SIZE | gcc-O3-fdo_SIZE | gcc-O3-tracer_SIZE |
+------------+----------+------------+---------------+-------------+-----------------+--------------------+
| GemsFDTD   | 100.00 % | 96.00 %    | 95.56 %       | 100.00 %    | 101.14 %        | 109.58 %           |
| hmmer      | 100.00 % | 97.44 %    | 97.14 %       | 100.00 %    | 73.94 %         | 105.71 %           |
| sphinx3    | 100.00 % | 95.42 %    | 97.76 %       | 100.00 %    | 88.44 %         | 104.29 %           |
| milc       | 100.00 % | 99.35 %    | 101.13 %      | 100.00 %    | 101.48 %        | 103.98 %           |
| cactusADM  | 100.00 % | 99.97 %    | 99.46 %       | 100.00 %    | 76.84 %         | 106.86 %           |
| tonto      | 100.00 % | 94.72 %    | 98.04 %       | 100.00 %    | 80.27 %         | 107.15 %           |
| bwaves     | 100.00 % | 104.47 %   | 101.77 %      | 100.00 %    | 100.33 %        | 103.21 %           |
| lbm        | 100.00 % | 96.65 %    | 98.49 %       | 100.00 %    | 106.25 %        | 102.04 %           |
| wrf        | 100.00 % | 61.39 %    | 65.49 %       | 100.00 %    | 64.51 %         | 112.40 %           |
| bzip2      | 100.00 % | 93.79 %    | 100.10 %      | 100.00 %    | 115.36 %        | 108.78 %           |
| leslie3d   | 100.00 % | 96.52 %    | 98.73 %       | 100.00 %    | 100.24 %        | 107.03 %           |
| gromacs    | 100.00 % | 97.12 %    | 100.75 %      | 100.00 %    | 77.08 %         | 104.82 %           |
| sjeng      | 100.00 % | 98.77 %    | 99.80 %       | 100.00 %    | 84.01 %         | 101.87 %           |
| calculix   | 100.00 % | 101.55 %   | 100.28 %      | 100.00 %    | 79.58 %         | 104.36 %           |
| astar      | 100.00 % | 101.70 %   | 97.19 %       | 100.00 %    | 102.73 %        | 101.95 %           |
| zeusmp     | 100.00 % | 98.79 %    | 101.00 %      | 100.00 %    | 113.59 %        | 102.36 %           |
| povray     | 100.00 % | 89.17 %    | 100.35 %      | 100.00 %    | 83.61 %         | 105.07 %           |
| omnetpp    | 100.00 % | 88.23 %    | 102.54 %      | 100.00 %    | 90.70 %         | 102.47 %           |
| mcf        | 100.00 % | 95.92 %    | 100.03 %      | 100.00 %    | 124.69 %        | 102.62 %           |
| gcc        | 100.00 % | 97.04 %    | 98.88 %       | 100.00 %    | 108.29 %        | 101.24 %           |
| h264ref    | 100.00 % | 94.61 %    | 100.58 %      | 100.00 %    | 88.25 %         | 105.93 %           |
| perlbench  | 100.00 % | 98.90 %    | 100.46 %      | 100.00 %    | 106.21 %        | 106.23 %           |
| xalancbmk  | 100.00 % | 75.17 %    | 82.56 %       | 100.00 %    | 87.10 %         | 103.41 %           |
| namd       | 100.00 % | 99.48 %    | 100.59 %      | 100.00 %    | 88.25 %         | 100.74 %           |
| gamess     | 100.00 % | 130.31 %   | 131.62 %      | 100.00 %    | 96.00 %         | 101.59 %           |
| libquantum | 100.00 % | 97.04 %    | 98.98 %       | 100.00 %    | 68.74 %         | 104.55 %           |
| soplex     | 100.00 % | 96.15 %    | 100.92 %      | 100.00 %    | 87.04 %         | 100.96 %           |
| gobmk      | 100.00 % | 91.22 %    | 104.56 %      | 100.00 %    | 84.16 %         | 106.75 %           |
+------------+----------+------------+---------------+-------------+-----------------+--------------------+
| AVG        | 100.00 % | 95.96 %    | 99.10 %       | 100.00 %    | 92.10 %         | 104.57 %           |
+------------+----------+------------+---------------+-------------+-----------------+--------------------+

Martin