Bug 35545 - tracer pass is run too late
tracer pass is run too late
Status: ASSIGNED
Product: gcc
Classification: Unclassified
Component: middle-end
4.4.0
: P3 enhancement
: ---
Assigned To: Jan Hubicka
: missed-optimization
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-03-12 05:55 UTC by davidxl
Modified: 2014-10-01 11:29 UTC (History)
8 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2013-12-17 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description davidxl 2008-03-12 05:55:26 UTC
In the following example, virtual calls via ap should be speciallized -- there is one dominatating call target, but compiling the program at -O3 with -fprofile-use, it does not happen.

g++ -O1 -fprofile-generate devirt.cc
./a.out
g++ -fprofile-use -O3 -fdump-tree-optimized devirt.cc


// devirt.cc

class A {
public:
  virtual int foo() {
     return 1;
  }

int i;
};

class B : public A
{
public:
  virtual int foo() {
     return 2;
  }

 int b;
} ;


int main()
{
 int i;

  A* ap = 0;

  for (i = 0; i < 10000; i++)
  {

     if (i%7==0)
     {
        ap = new A();
     }
     else
        ap = new B();

    ap->foo();

    delete ap;

  }

  return 0;

}
Comment 1 Andrew Pinski 2008-03-12 06:07:35 UTC
Jump threading happened by we did not prop the call correctly:
  ap = (struct A *) D.2075;
  ap->_vptr$A = &_ZTV1A[2];
  OBJ_TYPE_REF(*ap->_vptr$A;ap->0) (ap);

...
  this = (struct B *) D.2076;
  this->D.2018._vptr$A = &_ZTV1B[2];
  ap.61 = &this->D.2018;
  OBJ_TYPE_REF(*ap.61->_vptr$A;ap.61->0) (ap.61);

Comment 2 Jan Hubicka 2013-12-17 16:34:45 UTC
Couple years later we finally devirtualize here:
  <bb 5>:
  ap_8 = operator new (16);
  ap_8->i = 0;
  ap_8->_vptr.A = &MEM[(void *)&_ZTV1A + 16B];
  _19 = foo;
  PROF_26 = [obj_type_ref] OBJ_TYPE_REF(_19;(struct A)ap_8->0);
  if (PROF_26 == foo)
    goto <bb 8>;
  else
    goto <bb 7>;

  <bb 6>:
  ap_13 = operator new (16);
  MEM[(struct B *)ap_13].D.2237.i = 0;
  MEM[(struct B *)ap_13].b = 0;
  MEM[(struct B *)ap_13].D.2237._vptr.A = &MEM[(void *)&_ZTV1B + 16B];
  _1 = foo;
  PROF_30 = [obj_type_ref] OBJ_TYPE_REF(_1;(struct A)ap_13->0);
  if (PROF_30 == foo)
    goto <bb 8>;
  else
    goto <bb 7>;

however the code ends up super sily after tracer.
for some reason we do not manage to fold away the virtual table lookup.
Why?
Comment 3 Jan Hubicka 2013-12-17 17:29:46 UTC
Following patch gets rid of OBJ_TYPE_REF
Index: value-prof.c
===================================================================
--- value-prof.c        (revision 206040)
+++ value-prof.c        (working copy)
@@ -1333,9 +1333,15 @@ gimple_ic (gimple icall_stmt, struct cgr
   cond_bb = gimple_bb (icall_stmt);
   gsi = gsi_for_stmt (icall_stmt);
 
-  tmp0 = make_temp_ssa_name (optype, NULL, "PROF");
-  tmp1 = make_temp_ssa_name (optype, NULL, "PROF");
-  tmp = unshare_expr (gimple_call_fn (icall_stmt));
+  tmp0 = make_temp_ssa_name (optype, NULL, "SPEC");
+  tmp1 = make_temp_ssa_name (optype, NULL, "SPEC");
+  tmp = gimple_call_fn (icall_stmt);
+
+  /* Drop OBJ_TYPE_REF expression, it is ignored by rest of
+     optimization queue anyway.  */
+  if (TREE_CODE (tmp) == OBJ_TYPE_REF)
+    tmp = OBJ_TYPE_REF_EXPR (tmp);
+  tmp = unshare_expr (tmp);
   load_stmt = gimple_build_assign (tmp0, tmp);
   gsi_insert_before (&gsi, load_stmt, GSI_SAME_STMT);
 

but it does not help.  We end up:
  _1 = foo;
  if (_1 == foo)
    goto <bb 8>;

not much better.  This nonsense appears only in optimized dump, before we have
the following nonsense:
  # ap_9 = PHI <ap_13(6)>
  # prephitmp_18 = PHI <&MEM[(void *)&_ZTV1B + 16B](6)>
  _1 = *prephitmp_18;

that is introduced by duplicating
  <bb 7>:
  # ap_2 = PHI <ap_8(5), ap_13(6)>
  # prephitmp_14 = PHI <&MEM[(void *)&_ZTV1A + 16B](5), &MEM[(void *)&_ZTV1B + 16B](6)>
  _19 = *prephitmp_14;
  if (_19 == foo)
    goto <bb 9>;
  else
    goto <bb 8>;

The following patch:
Index: passes.def
===================================================================
--- passes.def  (revision 206040)
+++ passes.def  (working copy)
@@ -242,9 +242,9 @@ along with GCC; see the file COPYING3.
         only examines PHIs to discover const/copy propagation
         opportunities.  */
       NEXT_PASS (pass_phi_only_cprop);
+      NEXT_PASS (pass_tracer);
       NEXT_PASS (pass_vrp);
       NEXT_PASS (pass_cd_dce);
-      NEXT_PASS (pass_tracer);
       NEXT_PASS (pass_dse);
       NEXT_PASS (pass_forwprop);
       NEXT_PASS (pass_phiopt);

makes us to VRP value through but only when OBJ_TYPE_REF is nout around.
I think pushing tracer up is good idea - we should have at least one vrp/ccp pass after tracer. Its main purpose is to provide enough context so those forward propagating passes can do better job. it seems stupid to do it only after everything was finished.  I also see why tracer would preffer DCE to be done first, but we have only so many passes to do.
Comment 4 rguenther@suse.de 2013-12-17 18:05:53 UTC
"hubicka at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> wrote:
>http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35545
>
>Jan Hubicka <hubicka at gcc dot gnu.org> changed:
>
>           What    |Removed                     |Added
>----------------------------------------------------------------------------
>             Status|UNCONFIRMED                 |NEW
>   Last reconfirmed|                            |2013-12-17
>            CC|                            |hubicka at gcc dot gnu.org,
>                  |                            |mjambor at suse dot cz,
>                 |                            |rguenther at suse dot de
>     Ever confirmed|0                           |1
>
>--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
>Couple years later we finally devirtualize here:
>  <bb 5>:
>  ap_8 = operator new (16);
>  ap_8->i = 0;
>  ap_8->_vptr.A = &MEM[(void *)&_ZTV1A + 16B];
>  _19 = foo;
>  PROF_26 = [obj_type_ref] OBJ_TYPE_REF(_19;(struct A)ap_8->0);
>  if (PROF_26 == foo)
>    goto <bb 8>;
>  else
>    goto <bb 7>;
>
>  <bb 6>:
>  ap_13 = operator new (16);
>  MEM[(struct B *)ap_13].D.2237.i = 0;
>  MEM[(struct B *)ap_13].b = 0;
>  MEM[(struct B *)ap_13].D.2237._vptr.A = &MEM[(void *)&_ZTV1B + 16B];
>  _1 = foo;
>  PROF_30 = [obj_type_ref] OBJ_TYPE_REF(_1;(struct A)ap_13->0);
>  if (PROF_30 == foo)
>    goto <bb 8>;
>  else
>    goto <bb 7>;
>
>however the code ends up super sily after tracer.
>for some reason we do not manage to fold away the virtual table lookup.
>Why?

This is probably doms fault. Does it even handle function pointers?

Tracer runs way too late and with no cleanups behind it.
Comment 5 Jan Hubicka 2013-12-17 18:09:19 UTC
Main issue seems to be that VRP messes up on:
  # ap_2 = PHI <ap_8(5)>
  # prephitmp_14 = PHI <&MEM[(void *)&_ZTV1A + 16B](5)>
  _19 = *prephitmp_14;

here it somehow won't constant propagate the load. 
Index: passes.def
===================================================================
--- passes.def  (revision 206040)
+++ passes.def  (working copy)
@@ -236,6 +236,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_reassoc);
       NEXT_PASS (pass_strength_reduction);
       NEXT_PASS (pass_dominator);
+      NEXT_PASS (pass_tracer);
       /* The only const/copy propagation opportunities left after
         DOM should be due to degenerate PHI nodes.  So rather than
         run the full propagators, run a specialized pass which
@@ -244,7 +245,6 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_phi_only_cprop);
       NEXT_PASS (pass_vrp);
       NEXT_PASS (pass_cd_dce);
-      NEXT_PASS (pass_tracer);
       NEXT_PASS (pass_dse);
       NEXT_PASS (pass_forwprop);
       NEXT_PASS (pass_phiopt);

actually helps since phi_only_cprop is good on this transform. I do not quite gather why VRP can't do it itself.

I sent first patch to http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01517.html
Comment 6 Jeffrey A. Law 2013-12-17 20:33:36 UTC
You certainly don't want to put something between DOM and phi-only-cprop.  Jump threading will tend to expose lots of degenerate PHIs.  phi-only-cprop eliminates those degenerates.  We could have used the normal cprop code, but it seemed too heavy-weight for the cleanups we wanted to do.

One could argue we want a phi-only-cprop cleanup after VRP since it threads jumps too.
Comment 7 Jan Hubicka 2013-12-17 20:39:53 UTC
> You certainly don't want to put something between DOM and phi-only-cprop.  Jump
> threading will tend to expose lots of degenerate PHIs.  phi-only-cprop
> eliminates those degenerates.  We could have used the normal cprop code, but it
> seemed too heavy-weight for the cleanups we wanted to do.

Well, consttants, copies and PHIs are accounted as 0 size and thus not part of tracer's
cost model, so perhaps we do not care about presence of degnerated PHIs here.
Moreover it is the degnerate PHI produced by tracer that causes are problems. I assume
DOM does degnerate PHIs by code duplication for jump threading and tracer is exactly the
same type of transformation and for same reasons we may want phi-only-cprop after tracer
as we do it after DOM.

It however seems a more like VRP's missed optimization to not be able to paper over that
degenerate PHI produced by tracer at first place, so I will try to poke about this tomorrow
a bit more.
Comment 8 Jeffrey A. Law 2013-12-17 20:53:53 UTC
It's not a matter of cost model, but if propagating the values to their uses.  I haven't looked closely at the tracer, but wouldn't it benefit by having constants in particular propagated to their uses?

Yes, DOM's duplication/isolation of paths exposes the degenerate PHIs.  So we might have had:

One of the nice side effects of jump threading is that it isolates paths.  So we might have had 

BBn
x = phi (a, b, c, constant, d, e, f)

duplication for threading might turn that into

BBn:
x = phi (a, b, c, d, e, f)  // The original


elsewhere

BBm:
x' = phi (constant)  // the duplicate


Propagating the constant for x' in BBm and eliminating the degenerate is what the phi-only cprop pass does.  If the tracer generates similar things, then running phi-only cprop after it might be useful as well.  It *should* be very fast.
Comment 9 Jan Hubicka 2013-12-17 21:15:26 UTC
> It's not a matter of cost model, but if propagating the values to their uses. 
> I haven't looked closely at the tracer, but wouldn't it benefit by having
> constants in particular propagated to their uses?

Tracer depends on the usual estimate_num_insns limits
 (it is 12 years since I wrote it, so what I recall)
It collects BBs that are interesting starts of traces, takes them in priority
order and duplicates from the seeds until code growth is met or it runs out of
interesting candidates by other criteria.

I think it generally tends to starve on candidates as the definition of trace is
relatively strong, but I am not 100% sure on it. So it is not that much dependent
on bounds given by code size metric.

If we had unlimited time, it would  be better to propagate constants and cleanup
both before and after tracer.

If we can chose whether we want to do tracer before last pass that is able
to propagate and fold constants or after, I would chose before for the reason
I mentioned on begginig; the whole point of the tail duplication is to simplify
CFG and allow better propagation.

I think missed tracing here and there is less painful than missed optimizatoin
in duplicated code.

We may even consider pushing tracer before DOM, since tail duplication may enable
DOM to produce more useful threading/propagation and code after tracer is not
too painfuly obstructated. Sure you can end up with PHI that has only one constant
argument.  I can see that DOM may miss optimization here.
> Propagating the constant for x' in BBm and eliminating the degenerate is what
> the phi-only cprop pass does.  If the tracer generates similar things, then
> running phi-only cprop after it might be useful as well.  It *should* be very
> fast.

Yes, tracer does similar things.  You can think about it as about speculative
jump threading - if one path through meet points seems more likely than the
other based on profile, tracer will duplicate it in a hope that later
optimization pass will prove some of conditionals constant over the duplicated
path.  For that it needs subsequent propagation pass (CCP or better VRP) to
match.  That is why its current place in pass queue is unlucky.  Possible
benefits of tail duplications are of course not limited to threading.

We can do one extra cleanup pass, too.  Tracer is on by default only with
-fprofile-use so extra phi-only cprop with -ftracer probably is not dangerous
to overall compile time experience.

Honza
Comment 10 Jan Hubicka 2013-12-17 21:28:05 UTC
> Tracer depends on the usual estimate_num_insns limits
>  (it is 12 years since I wrote it, so what I recall)

note that one impotant thing that changed in those 12 years is that I
originally carefuly tuned tracer in combination with crossjumping. Tracer
produced duplicates and if no pass managed to take use of them, crossjumping
cleaned them up pre-reload. Trace formation in bb-reorder re-instantiated
duplicated when it seemed sensible for code layout.

This broke, since SSA makes RTL cross jumping quite useless and it is now done
after reg-alloc only.  We never really got working code unification pass on
gimple.
http://www.ucw.cz/~hubicka/papers/amd64/node4.html
Claims that at that time I got 1.6% speedup on SPECint with profile feedback
1.43% code growth.  That is not bad, but wonder to what it translates today.

Honza
Comment 11 Tobias Burnus 2013-12-18 10:31:35 UTC
Author: hubicka
Date: Tue Dec 17 23:41:41 2013
New Revision: 206073

URL: http://gcc.gnu.org/viewcvs?rev=206073&root=gcc&view=rev
Log:
	PR middle-end/35545
	* tree-vrp.c (extract_range_from_unary_expr_1): Handle OBJ_TYPE_REF.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vrp.c



Author: hubicka
Date: Tue Dec 17 23:43:22 2013
New Revision: 206074

URL: http://gcc.gnu.org/viewcvs?rev=206074&root=gcc&view=rev
Log:
	PR middle-end/35545
	* gimple-fold.c (fold_gimple_assign): Attempt to devirtualize
	OBJ_TYPE_REF.
	(gimple_fold_stmt_to_constant_1): Bypass OBJ_TYPE_REF wrappers.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-fold.c
Comment 12 Jan Hubicka 2014-09-25 21:29:37 UTC
We still fail to fold here.  After tracer we get:
  # ap_2 = PHI <ap_8(4)>
  # prephitmp_14 = PHI <&MEM[(void *)&_ZTV1A + 16B](4)>
  _19 = *prephitmp_14;
  PROF_24 = [obj_type_ref] OBJ_TYPE_REF(_19;(struct A)ap_2->0);
  if (PROF_24 == foo)
    goto <bb 9>;
  else
    goto <bb 8>;

and the necessary forward propagation & folding to realize that:
  _19 = foo;
  PROF_24 = [obj_type_ref] OBJ_TYPE_REF(_19;(struct A)ap_8->0);
  if (PROF_24 == foo)

happens only in .optimized dump.

I would go with patch from comment 5 moving tracer earlier - the tail duplication intends to make code more optimizable and it is stupid to run it only after all optimizations are done.
Comment 13 Richard Biener 2014-09-26 07:44:06 UTC
Well - ideally you'd want to do some CSE, not only VRP (constant propagation in this case).  So I'd move it even earlier, after pass_lower_vector_ssa?  I
can't see what tracing would break (some odd cases can destroy loop info, but
I think I have disabled all those).
Comment 14 Jeffrey A. Law 2014-09-26 15:52:00 UTC
This feels like the kind of situation where I've always wanted a pass to be able to say something like "I've done some set of transformations", schedule the appropriate cleanup passes to run".

It's heavyweight, but making  the tracer imply a later DOM or VRP pass might make sense.

Jeff
Comment 15 Jan Hubicka 2014-09-27 00:03:55 UTC
Author: hubicka
Date: Sat Sep 27 00:03:23 2014
New Revision: 215651

URL: https://gcc.gnu.org/viewcvs?rev=215651&root=gcc&view=rev
Log:

	PR middle-end/35545
	* passes.def (pass_tracer): Move before last dominator pass.
	* g++.dg/tree-prof/pr35545.C: New testcase.

Added:
    trunk/gcc/testsuite/g++.dg/tree-prof/pr35545.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/passes.def
    trunk/gcc/testsuite/ChangeLog
Comment 16 Jan Hubicka 2014-09-27 00:13:52 UTC
I have moved tracer before the late cleanups that seems to be rather obbious thing to do. This lets us to optimize the testcase (with -O2):
int main() ()
{
  struct A * ap;
  int i;
  int _6;

  <bb 2>:

  <bb 3>:
  # i_29 = PHI <i_22(6), 0(2)>
  _6 = i_29 % 7;
  if (_6 == 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  ap_8 = operator new (16);
  ap_8->i = 0;
  ap_8->_vptr.A = &MEM[(void *)&_ZTV1A + 16B];
  goto <bb 6>;

  <bb 5>:
  ap_13 = operator new (16);
  MEM[(struct B *)ap_13].D.2244.i = 0;
  MEM[(struct B *)ap_13].b = 0;
  MEM[(struct B *)ap_13].D.2244._vptr.A = &MEM[(void *)&_ZTV1B + 16B];

  <bb 6>:
  # ap_4 = PHI <ap_13(5), ap_8(4)>
  operator delete (ap_4);
  i_22 = i_29 + 1;
  if (i_22 != 10000)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 7>:
  return 0;

}

Martin, I do not have SPEC setup, do you think you can benchmark the attached patch with SPEC and profile feedback and also non-FDO -O3 -ftracer compared to -O3, please?
It would be nice to know code size impact, too.
Index: passes.def
===================================================================
--- passes.def  (revision 215651)
+++ passes.def  (working copy)
@@ -155,6 +155,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_dce);
       NEXT_PASS (pass_call_cdce);
       NEXT_PASS (pass_cselim);
+      NEXT_PASS (pass_tracer);
       NEXT_PASS (pass_copy_prop);
       NEXT_PASS (pass_tree_ifcombine);
       NEXT_PASS (pass_phiopt);
@@ -252,7 +253,6 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_reassoc);
       NEXT_PASS (pass_strength_reduction);
-      NEXT_PASS (pass_tracer);
       NEXT_PASS (pass_dominator);
       NEXT_PASS (pass_strlen);
       NEXT_PASS (pass_vrp);

Doing it at same approximately the same place as loop header copying seems to make most sense to me.  It benefits from early cleanups and DCE definitly and it should enable more fun with the later scalar passes that are almost all rerun then.
Comment 17 davidxl 2014-09-27 00:23:12 UTC
(In reply to Jan Hubicka from comment #16)
> I have moved tracer before the late cleanups that seems to be rather obbious
> thing to do. This lets us to optimize the testcase (with -O2):
> int main() ()
> {
>   struct A * ap;
>   int i;
>   int _6;
> 
>   <bb 2>:
> 
>   <bb 3>:
>   # i_29 = PHI <i_22(6), 0(2)>
>   _6 = i_29 % 7;
>   if (_6 == 0)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
> 
>   <bb 4>:
>   ap_8 = operator new (16);
>   ap_8->i = 0;
>   ap_8->_vptr.A = &MEM[(void *)&_ZTV1A + 16B];
>   goto <bb 6>;
> 
>   <bb 5>:
>   ap_13 = operator new (16);
>   MEM[(struct B *)ap_13].D.2244.i = 0;
>   MEM[(struct B *)ap_13].b = 0;
>   MEM[(struct B *)ap_13].D.2244._vptr.A = &MEM[(void *)&_ZTV1B + 16B];
> 
>   <bb 6>:
>   # ap_4 = PHI <ap_13(5), ap_8(4)>
>   operator delete (ap_4);
>   i_22 = i_29 + 1;
>   if (i_22 != 10000)
>     goto <bb 3>;
>   else
>     goto <bb 7>;
> 
>   <bb 7>:
>   return 0;
> 
> }
> 
> Martin, I do not have SPEC setup, do you think you can benchmark the
> attached patch with SPEC and profile feedback and also non-FDO -O3 -ftracer
> compared to -O3, please?
> It would be nice to know code size impact, too.
> Index: passes.def
> ===================================================================
> --- passes.def  (revision 215651)
> +++ passes.def  (working copy)
> @@ -155,6 +155,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_dce);
>        NEXT_PASS (pass_call_cdce);
>        NEXT_PASS (pass_cselim);
> +      NEXT_PASS (pass_tracer);
>        NEXT_PASS (pass_copy_prop);
>        NEXT_PASS (pass_tree_ifcombine);
>        NEXT_PASS (pass_phiopt);
> @@ -252,7 +253,6 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_cse_reciprocals);
>        NEXT_PASS (pass_reassoc);
>        NEXT_PASS (pass_strength_reduction);
> -      NEXT_PASS (pass_tracer);
>        NEXT_PASS (pass_dominator);
>        NEXT_PASS (pass_strlen);
>        NEXT_PASS (pass_vrp);
> 
> Doing it at same approximately the same place as loop header copying seems
> to make most sense to me.  It benefits from early cleanups and DCE definitly
> and it should enable more fun with the later scalar passes that are almost
> all rerun then.


WE can try some internal benchmarks with this change too.

David
Comment 18 Jan Hubicka 2014-09-27 01:03:55 UTC
> WE can try some internal benchmarks with this change too.

That would be very welcome.  Tracer used to be quite useful pass in old days,
doing 1.6% on -O3+FDO SPECint (for 1.4% code size cost) that put it very close
to the inliner (1.8%) that however needed 12.9% of code size.

http://www.ucw.cz/~hubicka/papers/amd64/node4.html (see aggressive optimization table)

I do not think it was ever resonably returned for GIMPLE. Martin, do you have
any current scores? Part of benefits came from lack of global optimizers, but
I think tail duplication has a potential in modern compiler, too.

Perhaps we should try to do that and given that we now have tail merging pass,
consider getting it useful without FDO, too.

Honza
Comment 19 davidxl 2014-09-28 18:21:57 UTC
(In reply to Jan Hubicka from comment #18)
> > WE can try some internal benchmarks with this change too.
> 
> That would be very welcome.  Tracer used to be quite useful pass in old days,
> doing 1.6% on -O3+FDO SPECint (for 1.4% code size cost) that put it very
> close
> to the inliner (1.8%) that however needed 12.9% of code size.
> 
> http://www.ucw.cz/~hubicka/papers/amd64/node4.html (see aggressive
> optimization table)
> 
> I do not think it was ever resonably returned for GIMPLE. Martin, do you have
> any current scores? Part of benefits came from lack of global optimizers, but
> I think tail duplication has a potential in modern compiler, too.
> 
> Perhaps we should try to do that and given that we now have tail merging
> pass,
> consider getting it useful without FDO, too.
> 
> Honza

I saw small improvement across most of the benchmarks with this change on sandybridge machines with FDO. The geomean improvement is about 0.3%. The baseline compiler is Google gcc-49 compiler, and the test compiler is Google gcc-49 with the patch.
Comment 20 rguenther@suse.de 2014-09-29 08:17:04 UTC
On Fri, 26 Sep 2014, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35545
> 
> --- Comment #14 from Jeffrey A. Law <law at redhat dot com> ---
> This feels like the kind of situation where I've always wanted a pass to be
> able to say something like "I've done some set of transformations", schedule
> the appropriate cleanup passes to run".
> 
> It's heavyweight, but making  the tracer imply a later DOM or VRP pass might
> make sense.

Uh that idea will lead to very big compile-time increases.  One
thing I'd liked to do at some point is make the SSA propagators
and/or DOM work on MEME regions so we can schedule them on
a sub-CFG.

Richard.
Comment 21 rguenther@suse.de 2014-09-29 10:22:28 UTC
nOn Sat, 27 Sep 2014, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35545
> 
> Jan Hubicka <hubicka at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |ASSIGNED
>                  CC|                            |mliska at suse dot cz
>            Assignee|unassigned at gcc dot gnu.org      |hubicka at gcc dot gnu.org
> 
> --- Comment #16 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> I have moved tracer before the late cleanups that seems to be rather obbious
> thing to do. This lets us to optimize the testcase (with -O2):
> int main() ()
> {
>   struct A * ap;
>   int i;
>   int _6;
> 
>   <bb 2>:
> 
>   <bb 3>:
>   # i_29 = PHI <i_22(6), 0(2)>
>   _6 = i_29 % 7;
>   if (_6 == 0)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
> 
>   <bb 4>:
>   ap_8 = operator new (16);
>   ap_8->i = 0;
>   ap_8->_vptr.A = &MEM[(void *)&_ZTV1A + 16B];
>   goto <bb 6>;
> 
>   <bb 5>:
>   ap_13 = operator new (16);
>   MEM[(struct B *)ap_13].D.2244.i = 0;
>   MEM[(struct B *)ap_13].b = 0;
>   MEM[(struct B *)ap_13].D.2244._vptr.A = &MEM[(void *)&_ZTV1B + 16B];
> 
>   <bb 6>:
>   # ap_4 = PHI <ap_13(5), ap_8(4)>
>   operator delete (ap_4);
>   i_22 = i_29 + 1;
>   if (i_22 != 10000)
>     goto <bb 3>;
>   else
>     goto <bb 7>;
> 
>   <bb 7>:
>   return 0;
> 
> }
> 
> Martin, I do not have SPEC setup, do you think you can benchmark the attached
> patch with SPEC and profile feedback and also non-FDO -O3 -ftracer compared to
> -O3, please?
> It would be nice to know code size impact, too.
> Index: passes.def
> ===================================================================
> --- passes.def  (revision 215651)
> +++ passes.def  (working copy)
> @@ -155,6 +155,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_dce);
>        NEXT_PASS (pass_call_cdce);
>        NEXT_PASS (pass_cselim);
> +      NEXT_PASS (pass_tracer);
>        NEXT_PASS (pass_copy_prop);
>        NEXT_PASS (pass_tree_ifcombine);
>        NEXT_PASS (pass_phiopt);
> @@ -252,7 +253,6 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_cse_reciprocals);
>        NEXT_PASS (pass_reassoc);
>        NEXT_PASS (pass_strength_reduction);
> -      NEXT_PASS (pass_tracer);
>        NEXT_PASS (pass_dominator);
>        NEXT_PASS (pass_strlen);
>        NEXT_PASS (pass_vrp);
> 
> Doing it at same approximately the same place as loop header copying seems to
> make most sense to me.  It benefits from early cleanups and DCE definitly and
> it should enable more fun with the later scalar passes that are almost all
> rerun then.

We need to make sure tracer doesn't mess too much with loops then.
Btw, "useless" tracing may be undone again by tail-merging.

Tracer seems to consume only profile information and thus doesn't
rely on any other transforms (well, apart from cleanups which could
affect its cost function).  Why not schedule it even earlier?
Like to before pass_build_alias?  (the pipeline up to loop transforms
is quite a mess...)
Comment 22 Jan Hubicka 2014-09-29 16:48:44 UTC
> > Doing it at same approximately the same place as loop header copying seems to
> > make most sense to me.  It benefits from early cleanups and DCE definitly and
> > it should enable more fun with the later scalar passes that are almost all
> > rerun then.
> 
> We need to make sure tracer doesn't mess too much with loops then.
> Btw, "useless" tracing may be undone again by tail-merging.

Tracer already does:

              /* We have the tendency to duplicate the loop header
                 of all do { } while loops.  Do not do that - it is
                 not profitable and it might create a loop with multiple
                 entries or at least rotate the loop.  */
              && bb2->loop_father->header != bb2)

so it won't kill natural loops and peel (I should find time to update the
peeling pass). It also has:

      if (bb_seen_p (bb2) || (e->flags & (EDGE_DFS_BACK | EDGE_COMPLEX))
          || find_best_successor (bb2) != e)
        break;

to not unroll.  So i think it is safe for loop optimizer - all it can do
is expanding a loop that has control flow in it reducing its unrollability
later.  
> 
> Tracer seems to consume only profile information and thus doesn't
> rely on any other transforms (well, apart from cleanups which could
> affect its cost function).  Why not schedule it even earlier?
> Like to before pass_build_alias?  (the pipeline up to loop transforms
> is quite a mess...)

It uses profile information and code size estiamtes.  I expect that (especially
for C++ stuff) the early post inline passes will remove a lot of code and thus
improve traceability.  This is why I looked for spot after first DCE/DSE.

David, can you please be more specific about how you tested? Was it with
profile feedback?  What about code size metrics?

Honza
Comment 23 davidxl 2014-09-29 19:30:24 UTC
(In reply to Jan Hubicka from comment #22)
> > > Doing it at same approximately the same place as loop header copying seems to
> > > make most sense to me.  It benefits from early cleanups and DCE definitly and
> > > it should enable more fun with the later scalar passes that are almost all
> > > rerun then.
> > 
> > We need to make sure tracer doesn't mess too much with loops then.
> > Btw, "useless" tracing may be undone again by tail-merging.
> 
> Tracer already does:
> 
>               /* We have the tendency to duplicate the loop header
>                  of all do { } while loops.  Do not do that - it is
>                  not profitable and it might create a loop with multiple
>                  entries or at least rotate the loop.  */
>               && bb2->loop_father->header != bb2)
> 
> so it won't kill natural loops and peel (I should find time to update the
> peeling pass). It also has:
> 
>       if (bb_seen_p (bb2) || (e->flags & (EDGE_DFS_BACK | EDGE_COMPLEX))
>           || find_best_successor (bb2) != e)
>         break;
> 
> to not unroll.  So i think it is safe for loop optimizer - all it can do
> is expanding a loop that has control flow in it reducing its unrollability
> later.  
> > 
> > Tracer seems to consume only profile information and thus doesn't
> > rely on any other transforms (well, apart from cleanups which could
> > affect its cost function).  Why not schedule it even earlier?
> > Like to before pass_build_alias?  (the pipeline up to loop transforms
> > is quite a mess...)
> 
> It uses profile information and code size estiamtes.  I expect that
> (especially
> for C++ stuff) the early post inline passes will remove a lot of code and
> thus
> improve traceability.  This is why I looked for spot after first DCE/DSE.
> 
> David, can you please be more specific about how you tested? Was it with
> profile feedback?  What about code size metrics?

It is with profile feedback. There is also reduction in code size, though very tiny change.

David

> 
> Honza
Comment 24 Martin Liška 2014-09-30 11:11:24 UTC
Hello Honza. I've been working on SPEC numbers, I will send it this evening.
Comment 25 Martin Liška 2014-10-01 11:29:45 UTC
SPEC CPU numbers (--size=train, --iterations=3):

First 3 columns present time (where smaller means faster) and for binary size the same.

+------------+----------+------------+---------------+-------------+-----------------+--------------------+
|            | gcc-O3   | gcc-O3-fdo | gcc-O3-tracer | gcc-O3_SIZE | gcc-O3-fdo_SIZE | gcc-O3-tracer_SIZE |
+------------+----------+------------+---------------+-------------+-----------------+--------------------+
| GemsFDTD   | 100.00 % |    96.00 % |       95.56 % |    100.00 % |        101.14 % |           109.58 % |
| hmmer      | 100.00 % |    97.44 % |       97.14 % |    100.00 % |         73.94 % |           105.71 % |
| sphinx3    | 100.00 % |    95.42 % |       97.76 % |    100.00 % |         88.44 % |           104.29 % |
| milc       | 100.00 % |    99.35 % |      101.13 % |    100.00 % |        101.48 % |           103.98 % |
| cactusADM  | 100.00 % |    99.97 % |       99.46 % |    100.00 % |         76.84 % |           106.86 % |
| tonto      | 100.00 % |    94.72 % |       98.04 % |    100.00 % |         80.27 % |           107.15 % |
| bwaves     | 100.00 % |   104.47 % |      101.77 % |    100.00 % |        100.33 % |           103.21 % |
| lbm        | 100.00 % |    96.65 % |       98.49 % |    100.00 % |        106.25 % |           102.04 % |
| wrf        | 100.00 % |    61.39 % |       65.49 % |    100.00 % |         64.51 % |           112.40 % |
| bzip2      | 100.00 % |    93.79 % |      100.10 % |    100.00 % |        115.36 % |           108.78 % |
| leslie3d   | 100.00 % |    96.52 % |       98.73 % |    100.00 % |        100.24 % |           107.03 % |
| gromacs    | 100.00 % |    97.12 % |      100.75 % |    100.00 % |         77.08 % |           104.82 % |
| sjeng      | 100.00 % |    98.77 % |       99.80 % |    100.00 % |         84.01 % |           101.87 % |
| calculix   | 100.00 % |   101.55 % |      100.28 % |    100.00 % |         79.58 % |           104.36 % |
| astar      | 100.00 % |   101.70 % |       97.19 % |    100.00 % |        102.73 % |           101.95 % |
| zeusmp     | 100.00 % |    98.79 % |      101.00 % |    100.00 % |        113.59 % |           102.36 % |
| povray     | 100.00 % |    89.17 % |      100.35 % |    100.00 % |         83.61 % |           105.07 % |
| omnetpp    | 100.00 % |    88.23 % |      102.54 % |    100.00 % |         90.70 % |           102.47 % |
| mcf        | 100.00 % |    95.92 % |      100.03 % |    100.00 % |        124.69 % |           102.62 % |
| gcc        | 100.00 % |    97.04 % |       98.88 % |    100.00 % |        108.29 % |           101.24 % |
| h264ref    | 100.00 % |    94.61 % |      100.58 % |    100.00 % |         88.25 % |           105.93 % |
| perlbench  | 100.00 % |    98.90 % |      100.46 % |    100.00 % |        106.21 % |           106.23 % |
| xalancbmk  | 100.00 % |    75.17 % |       82.56 % |    100.00 % |         87.10 % |           103.41 % |
| namd       | 100.00 % |    99.48 % |      100.59 % |    100.00 % |         88.25 % |           100.74 % |
| gamess     | 100.00 % |   130.31 % |      131.62 % |    100.00 % |         96.00 % |           101.59 % |
| libquantum | 100.00 % |    97.04 % |       98.98 % |    100.00 % |         68.74 % |           104.55 % |
| soplex     | 100.00 % |    96.15 % |      100.92 % |    100.00 % |         87.04 % |           100.96 % |
| gobmk      | 100.00 % |    91.22 % |      104.56 % |    100.00 % |         84.16 % |           106.75 % |
+------------+----------+------------+---------------+-------------+-----------------+--------------------+
| AVG        | 100.00 % |    95.96 % |       99.10 % |    100.00 % |         92.10 % |           104.57 % |
+------------+----------+------------+---------------+-------------+-----------------+--------------------+

Martin