Bug 15524

Summary:	[4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
Product:	gcc	Reporter:	Andrew Pinski <pinskia>
Component:	tree-optimization	Assignee:	Not yet assigned to anyone <unassigned>
Status:	RESOLVED FIXED
Severity:	normal	CC:	berndtrog, bje, dnovillo, gcc-bugs, giovannibajo, jeffreyalaw, kazu
Priority:	P2	Keywords:	compile-time-hog
Version:	4.0.0
Target Milestone:	4.0.0
Host:		Target:
Build:		Known to work:
Known to fail:		Last reconfirmed:	2004-12-23 13:31:01
Bug Depends on:	10588, 17895
Bug Blocks:
Attachments:	An experimental patch for this bug ZZZ ZZZ patch ZZZ ZZZ PPP PPP PPP PPP PPP PPP

Description Andrew Pinski 2004-05-18 19:43:00 UTC

From PR 7344 (which was about the same compile time regression but was for rtl):
#define CL0(a) case a: f(); goto c;
 #define CL1(a) CL0(a##0) CL0(a##1) CL0(a##2) CL0(a##3) CL0(a##4) CL0(a##5) \
 CL0(a##6) CL0(a##7) CL0(a##8) CL0(a##9)
 #define CL2(a) CL1(a##0) CL1(a##1) CL1(a##2) CL1(a##3) CL1(a##4) CL1(a##5) \
 CL1(a##6) CL1(a##7) CL1(a##8) CL1(a##9)
 #define CL3(a) CL2(a##0) CL2(a##1) CL2(a##2) CL2(a##3) CL2(a##4) CL2(a##5) \
 CL2(a##6) CL2(a##7) CL2(a##8) CL2(a##9)
 #define CL4(a) CL3(a##0) CL3(a##1) CL3(a##2) CL3(a##3) CL3(a##4) CL3(a##5) \
 CL3(a##6) CL3(a##7) CL3(a##8) CL3(a##9)

 void f();

 void a() {
     int b;
  c: switch (b) {
         CL4(1)
     }
 }

Comment 1 Andrew Pinski 2004-05-18 19:46:33 UTC

The top 5 functions in a profile:
36.5% label_to_block  <-- this function is small and could be inlined.
24.9% simple_cst_equal
12.1% find_taken_edge
5.5% cached_make_edge
3.4% dom_opt_initialize_block

11% remaining.

Comment 2 Andrew Pinski 2004-05-18 23:32:40 UTC

The patch here should help: <http://gcc.gnu.org/ml/gcc-patches/2004-05/msg01150.html>.

Comment 3 Andrew Pinski 2004-05-24 20:28:20 UTC

Confirmed.

Comment 4 Andrew Pinski 2004-05-24 20:31:54 UTC

I had forgot to say all of time spent in simple_cst_equal comes from when find_taken_edge calls it.

Comment 5 Steven Bosscher 2004-05-26 10:43:38 UTC

Got a patch.

Comment 6 Steven Bosscher 2004-05-26 11:26:19 UTC

With my case grouping patch, the top offender according to oprofile
is dom_opt_initialize_block:
samples  %        app name    symbol name
21264210 36.4196  cc1         dom_opt_initialize_block

which is consistent with:
 dominator optimization:  79.10 (64%) usr   0.00 ( 0%) sys  79.11 (63%) wall

The other ones Andrew mentioned are blown away from the top of the profile:
309485    0.5301  cc1         label_to_block
2361      0.0040  cc1         find_taken_edge
...and the tree_int_cst_{lt,compare} functions way below these.

Comment 7 Andrew Pinski 2004-05-26 15:56:47 UTC

I should note that at -O0, 3.5.0 is already faster than 3.4.0, about a 4x improvement (even with 
checking on 3.5.0 and not on 3.4.0).

Comment 8 Andrew Pinski 2004-05-31 21:01:49 UTC

Now only label_to_block is really a problem on powerpc-apple-darwin.  I will note that on powerpc-
apple-darwin for some reason same_case_target_p is not being inlined into 
expand_end_case_case_type.  Also label_rtx could be inline into expand_end_case_case_type which 
should help the -O0 case too.

Comment 9 Steven Bosscher 2004-05-31 21:21:48 UTC

Subject: Re:  [3.5 Regression] jump threading on trees is slow with switch statements with large # of cases

On Monday 31 May 2004 23:01, pinskia at gcc dot gnu dot org wrote:
> Now only label_to_block is really a problem on powerpc-apple-darwin.

No.  label_to_block is fine as it is.  The real problem is something
of quadratic-like behavior:

1) walk every dom successor
2) walk a big part of the case label vector

which is (N_dom_children * TREE_VEC_LENGTH (label_vector)), and
since both typically are of the same order of magnitude, this is
quadratic behavior.

Walking the case label vector once is possible if the information
extracted in the dominated block is instead precomputed in the 
dominating block (and probably attach the unique case label to the
edge's or the block's aux field, or something like that).

Comment 10 Steven Bosscher 2004-07-13 17:03:54 UTC

This not forgotten, but DOM is about to go away entirely thanks to Diego.  
When this bug still exists after DOM is gone, Ben Elliston's work to make 
edges a vector instead of a list will make it much much easier to fix this 
bug. 
 
This bug is the reason for a lot of slowness in various machine generated 
files, including insn-attrtab for example.

Comment 11 Steven Bosscher 2004-07-27 18:04:39 UTC

Jeff is rewriting the jump threader.  I'm adding him to 
the CC: list so he's aware of this problem.  Hopefully 
the new threader will fix this problem.

Comment 12 Steven Bosscher 2004-09-17 23:50:10 UTC

Created attachment 7163 [details]
An experimental patch for this bug

I'm pondering on something like the attached patch.
The patch makes us walk the case label vector only
once, but we do walk all successor edges of blocks
ending in a switch statement twice now.  This could
probably be reduced to once - I'm not sure about it
but it's worth trying.

In any case, this part of the algorithm should not be
quadratic anymore with the patch.

This patch was bootstrapped on amd64.  I haven't for
any numbers for compile time improvement, but the gut
feeling is that, unfortunately, even with this patch
DOM is still unbelievably slow.

Comment 13 Steven Bosscher 2004-09-18 00:23:50 UTC

For the test case in this PR, this patch definitely helps a lot. 
 
For more realistic code, another bottleneck in DOM means there 
really isn't a significant difference (though there is of course 
still a good speedup, but it's insignificant compared to the 
total compile time).

Comment 14 Jeffrey A. Law 2004-09-24 06:12:49 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Fri, 2004-09-17 at 17:50, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org  2004-09-17 23:50 -------
> Created an attachment (id=7163)
>  --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=7163&action=view)
> An experimental patch for this bug
> 
> I'm pondering on something like the attached patch.
> The patch makes us walk the case label vector only
> once, but we do walk all successor edges of blocks
> ending in a switch statement twice now.  This could
> probably be reduced to once - I'm not sure about it
> but it's worth trying.
> 
> In any case, this part of the algorithm should not be
> quadratic anymore with the patch.
> 
> This patch was bootstrapped on amd64.  I haven't for
> any numbers for compile time improvement, but the gut
> feeling is that, unfortunately, even with this patch
> DOM is still unbelievably slow.
There may be a better approach for attacking this problem.

After we build the CFG, but before we go into SSA form we can insert
assignments on outgoing edges from the switch.    ie, if we had

  switch (x)
    {
    case 42:
      whatever
    }

[ Assuming there's no other path to the code for "whatever" ]

We can turn that into

  switch (x)
    {
      case 42:
        x = 42;
        whatever
    }

Once that's done standard copy propagation will take care of
everything for us and we no longer have to try and associate
a value with an outgoing edge from the switch in the optimizers.

That method dovetails nicely with some suggestions Kenny made
at the GCC summit.  In fact, you can do that with some conditionals
as well which would obliterate the need for parts of 
get_eq_expr_value.

The open questions in my mind for these kinds of things are:

  1. Is it really necessary to mark these assignments as undeletable
     until late in the SSA path (that's something Kenny suggested
     as well).

     Not deleting them means they're around through most/all the
     optimization passes, which is good as we may not see the 
     opportunity to use those equivalences until later in the
     optimizers.

     But not deleting them can also mean we generate and keep a
     bunch of "useless" PHI nodes hanging around, which uses time
     and space and might even interfere with some optimizations.

  2. Is it worth the effort to be able to add these assignments in
     later?  Say for example that instead we originally had

     switch (x)
        {
          case 42:
          case -1:
        }

And we prove that case 0xabcdef can never occur (say because we
know that X has the range 0..maxint via VRP).

In that case does it make sense to be able to insert a copy
x = 42 into the appropriate location once we make the "-1"
case label go away.  [ And if it makes sense, can we do so
efficiently?. ]

Anyway, that might ultimately be a better way to handle these
kinds of equivalences created by traversing edges out of
COND_EXPRs and SWITCH_EXPRs.

A similar scheme may be usable to record equivalences created
by stuff like 

if (x < y)
  goto BB2
else
  goto BB4

Into:

t0 = (x < y)
t1 = (x > y)
t2 = (x >= y)
t3 = (x == y)
if (t0)
  t0 = 1;
  t1 = 0;
  t2 = 0;
  t3 = 0;
  goto BB2
else
  t0 = 0;
  t2 = 1;
  goto BB4

We'd use formal temporaries so that all instances of x < y would
get mapped to the same formal temporary.

I believe that would zap all of get_eq_expr_value, record_cond,
record_dominating_conditions and the like.    It might have some
other unpleasant interactions.  But it might be worth investigating.

Thoughts?

Jeff

Comment 15 Diego Novillo 2004-09-24 10:29:04 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Fri, 2004-09-24 at 02:13, law at redhat dot com wrote:

> We can turn that into
> 
>   switch (x)
>     {
>       case 42:
>         x = 42;
>         whatever
>     }
> 
This is what I'm working with as part of VRP.  The idea is to insert
ASSERT_EXPR that act as special assignments.  I have a WIP patch if you
are interested, but I at the moment I'm paying more attention to fixes
for 4.0.

>   1. Is it really necessary to mark these assignments as undeletable
>      until late in the SSA path (that's something Kenny suggested
>      as well).
> 
The ASSERT_EXPRs are naturally undeletable if they are used.  When we
find:

if (x_3 == y_9)
  {
    x_4 = ASSERT_EXPR (x_3 == y_9)
    if (x_4 > 4)
       ...
  }

If you end up knowing something about y_9, you may be able to fold the
inner conditional into a constant.  If you never use x_4, then the
ASSERT_EXPR is naturally dead and goes away.

ASSERT_EXPRs are inserted before we go into SSA and are treated as
regular IL elements.  VRP would propagate and fold them using the SSA
propagator.  This part is still to be done.  Initially I had fused this
with CCP, and that didn't work too well.

Diego.

Comment 16 Jeffrey A. Law 2004-09-24 16:04:27 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Fri, 2004-09-24 at 04:29, dnovillo at redhat dot com wrote:
> ------- Additional Comments From dnovillo at redhat dot com  2004-09-24 10:29 -------
> Subject: Re:  [4.0 Regression] jump threading
> 	on trees is slow with switch statements with large # of cases
> 
> On Fri, 2004-09-24 at 02:13, law at redhat dot com wrote:
> 
> > We can turn that into
> > 
> >   switch (x)
> >     {
> >       case 42:
> >         x = 42;
> >         whatever
> >     }
> > 
> This is what I'm working with as part of VRP.  The idea is to insert
> ASSERT_EXPR that act as special assignments.  I have a WIP patch if you
> are interested, but I at the moment I'm paying more attention to fixes
> for 4.0.
But for the case of a simple x == 42 or a case statement such as
I've shown above, an ASSERT_EXPR is, well, silly.  Just issue
x = 42 in the appropriate place and let's get on with real work :-)

Of course given a conditional such as

if (x == 42)
  goto BB2
else
  goto BB4

Using ASSERT_EXPR I would think you'd really want

if (x == 42)
  x = 42;
  goto BB2
else
  x = ASSERT_EXPR (x != 42)

ie, you use the ASSERT_EXPR to assert something about the value of
an expression, not the value of objects within the expression.  And
if that is the case, then the LHS of these assert expressions ought
to be a new formal temporary.  And if I've understood all this
correctly, you can get precisely the same behavior without using
ASSERT_EXPR.

t = (x == 42)
if (t)
  t = 1;
  goto BB2;
else
  t = 0;
  goto BB4;

You set up an equivalency between t and (x == 42) when you hit the
assignment, then you add the constant true to the equivalency list
in the true arm (removing it of course when you're done with the
true arm).  Then add the constant false to the equivalancy list
in the false arm.

This is a perfect example of why I want to have an equivalency list,
much like the tree-vn code provides.

So before you go too much further with ASSERT_EXPR I think we need
to reach an agreement that ASSERT_EXPR is actually useful.  I believe
you can do the same thing without ASSERT_EXPR using facilities that
already exist today.

Jeff

Comment 17 James A. Morrison 2004-10-03 21:01:03 UTC

 PR14859 is not directly related, but whatever fixes this bug probably shouldn't
stop case labels being combined.

Comment 18 Jeffrey A. Law 2004-10-06 19:58:35 UTC

So after playing with the idea of inserting copies/constant initializations
on outgoing edges from COND_EXPRs and SWITCH_EXPRs and killing all the
special code in DOM to propagate which handles edge equivalences, I'm pretty
sure that't he wrong approach.

The fundamental problem is we would need to insert new copies anytime we
change the condition in a COND_EXPR or SWITCH_EXPR.  Consider:

  D.15513 = insn->code;
  D.15514 = (unsigned int) D.15513;
  switch (D.15514)
    {
      case 5: goto <L17>;
      case 6: goto <L2>;
      case 7: goto <L3>;
      case 8: goto <L22>;
      case 9 ... 10: goto <L0>;
      default : goto <L23>;
    }
  # SUCC: 22 1 21 3 2 16
[ ... ]
 # BLOCK 3
  # PRED: 0
<L3>:;
  _rtx = insn;
  D.15521 = _rtx->code;
  if (D.15521 != 7) goto <L4>; else goto <L5>;
  # SUCC: 5 (false) 4 (true)

Now we all can see that the conditional at the end of block #7 is always
false.   Let's see what happens with the copy insertion approach.

After copy insertion and rewriting into SSA we would have:
  D.15513_11 = insn_10->code;
  D.15514_12 = (unsigned int) D.15513_11;
  switch (D.15514_12)
    {
      case 5: goto <L17>;
      case 6: goto <L2>;
      case 7: goto <L3>;
      case 8: goto <L22>;
      case 9 ... 10: goto <L0>;
      default : goto <L23>;
    }
  # SUCC: 22 1 21 3 2 16
[ ... ]

  # BLOCK 3
  # PRED: 0
<L3>:;
  D.15514_28 = 7;
  _rtx_29 = insn_10;
  D.15521_30 = _rtx_29->code;
  if (D.15521_30 != 7) goto <L4>; else goto <L5>;
  # SUCC: 5 (false) 4 (true)


Which is exactly what I would expect with this copying method.  We know that
D.15514 has the value 7 at the start of block 3, so we put in a suitable
assignment/copy.  Note carefully that the condition tests a different
variable underlying variable.

After DOM we have:
  D.15513_11 = insn_10->code;
  D.15514_12 = (unsigned int) D.15513_11;
  switch (D.15513_11)
    {
      case 5: goto <L17>;
      case 6: goto <L2>;
      case 7: goto <L3>;
      case 8: goto <L22>;
      case 9 ... 10: goto <L0>;
      default : goto <L23>;
    }
  # SUCC: 22 1 21 3 2 16
[ ... ]

  # BLOCK 3
  # PRED: 0
<L3>:;
  _rtx_29 = insn_10;
  D.15521_30 = D.15513_11;
  if (D.15513_11 != 7) goto <L4>; else goto <L5>;
  # SUCC: 5 (false) 4 (true)


Ugh.  The copy we inserted turned out to be totally useless for determining
that the condition at the end of BB5 is always false.  Argggh.

To make this scheme work we'd have to do copy insertions anytime the
COND_EXPR_COND or SWITCH_EXPR changed.  Worse yet, we'd need to call
rewrite_ssa_into_ssa to keep things in SSA form.  Double Ugh.

The more I think about this, the more I'm convinced that copy insertions
aren't the solution.  Too bad since this scheme took DOM totally off the
radar for this testcase.  I'm pondering other approaches based on some of
the ideas in the patch.

Note, we may see similar problems with the ASSERT_EXPR scheme that has been
discussed as well.

Comment 19 Andrew Pinski 2004-10-08 21:36:24 UTC

Note the CFG cleanup problem is filed as PR 17895.

Comment 20 Jeffrey A. Law 2004-10-30 00:06:14 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

This patch drastically reduces the worst cast cost for finding 
equivalences associated with outgoing edges from switch statements.

Previously upon entry into a block we would search for any equivalences
associated with the edge we took into the block.  If that edge was
an outgoing edge from a switch statement we had to walk the entire
switch vector to determine if traversing the edge created an
equivalence.  Thus searching for these equivalences was effectively
O(N^2) where N is the number of outgoing edges from a switch statement.

With this patch we reduce that to O(N) by scanning the switch vector
once and recording any/all equivalences during that single pass over
the switch vector.  We can then query the recorded information as
we enter blocks during the dominator walk.

To give you some idea of how this affects worst-case behavior, the
dominator optimizer takes 41.76 seconds (19% of total compilation time)
for the testcase in PR 15524.  After this patch the dominator optimizer
takes 3.86 seconds (2% of total compilation time).

I've seen no measurable impact on more normal codes from this version
of the patch.

It turns out that it's fairly simple to record equivalences from
COND_EXPRs in the same data structure.  This has the nice side
effect of reducing the number of calls to invert_truthvalue and
thus reducing the amount of GC garbage we create.  I don't have
data for that handy though.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Comment 21 Jeffrey A. Law 2004-10-30 00:06:21 UTC

Created attachment 7434 [details]
ZZZ

Comment 22 Jeffrey A. Law 2004-11-01 03:24:37 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases


More work to speed up 15524.  I was rather puzzled that "expand" was
being charged with 30% of the compilation time.  I originally thought
there was some lameness in the switch statement expanders.  However,
it turned out I was totally wrong.

The culprit is actually the code to record the number of iterations
of each loop.  We never pushed a timevar for that pass and thus the
time was charged to whatever was on the top of the timevar stack
(which happens to be TV_EXPAND).

The code to record the number of loop iterations was being rather
dumb.  It basically did something like:

foreach loop
  foreach block
     if block is in loop
        code


Clearly the problem is second loop -- it's walking *all* the blocks
and testing for loop membership.  Ouch.

It's far more efficient to use get_loop_body (which does a DFS walk 
starting at the beginning of the loop to the end of the loop).  To
give you some idea here's some data:

The original time without my patch is:

 expand                :  54.55 (30%) usr   0.03 ( 4%) sys  55.21 (30%)
 TOTAL                 : 181.76             0.85           185.17

With my patch:

 expand                :   6.78 ( 5%) usr   0.05 ( 6%) sys   7.20 ( 5%)
 TOTAL                 : 134.20             0.83           137.19

Not bad for 10 minutes of work.

I took the liberty to add a new timevar for the loop bounds recording
pass and to kill an unused field within the loop structure.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Comment 23 Jeffrey A. Law 2004-11-01 03:24:39 UTC

Created attachment 7450 [details]
ZZZ

Comment 24 Daniel Berlin 2004-11-01 03:51:48 UTC

Subject: Re:  [4.0 Regression] jump threading
 on trees is slow with switch statements with large # of cases



On Sun, 31 Oct 2004, Jeffrey A Law wrote:

>
> More work to speed up 15524.  I was rather puzzled that "expand" was
> being charged with 30% of the compilation time.  I originally thought
> there was some lameness in the switch statement expanders.  However,
> it turned out I was totally wrong.
>
> The culprit is actually the code to record the number of iterations
> of each loop.  We never pushed a timevar for that pass and thus the
> time was charged to whatever was on the top of the timevar stack
> (which happens to be TV_EXPAND).
>
> The code to record the number of loop iterations was being rather
> dumb.  It basically did something like:
>
> foreach loop
>  foreach block
>     if block is in loop
>        code
>
>
> Clearly the problem is second loop -- it's walking *all* the blocks
> and testing for loop membership.  Ouch.

I actually had fixed this as part of my last patch, but was waiting till 
monday to commit it.

Comment 25 Giovanni Bajo 2004-11-01 03:52:49 UTC

Jeff, what is the compile-time situation in mainline now, compared to 3.4 or 
even older versions (2.95)?

Comment 26 Andrew Pinski 2004-11-01 05:16:07 UTC

We are still way behind 3.3, it takes 15 seconds on my 1.5GHz PPC 7400 with 3.3 but with 4.0, well for 
4.0 time just look at Jeff's data and see that we are way behind still.

Comment 27 Jeffrey A. Law 2004-11-01 14:09:15 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Mon, 2004-11-01 at 05:16 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-11-01 05:16 -------
> We are still way behind 3.3, it takes 15 seconds on my 1.5GHz PPC 7400 with 3.3 but with 4.0, well for 
> 4.0 time just look at Jeff's data and see that we are way behind still.
Well, I haven't looked at 3.3, but I can make a reasonable guess that
we're still way way way behind due to the way we update case labels
when forwarding edges and splitting critical edges.  Some early 
experiments I've done with that indicate there's another 30-35%
improvement that can be made by fixing that problem.  Then there's
*another* 30% or so we're burning in the RTL branch prediction code
(I haven't looked to see if there's anything we can do with that code
yet).

I doubt it makes much sense to look closely at 3.3 vs 4.0 for this
testcase and similar code until we fix those glaring problems.

It's also not clear to me how much of an improvement those changes
will make in real-world code.  ie, we could easily run into a case
where we drastically improve that testcase without improving any
real code, much like what happened recently with my changes to 
improve how we find/record equivalences for edges.

Jeff

Comment 28 s.bosscher 2004-11-01 15:03:08 UTC

Subject: RE:  [4.0 Regression] jump threading
	 on trees is slow with switch statements with large # of cases

I created a special function for the split_critical_edges pass
to split edges out of SWITCH_EXPR blocks in linear time wrt. the
number of outgoing edges E and the number of case labels C.  The
function makes the edge splitting take O(E+2C) time, instead of
E*C which is effectively quadratic.

The effect for the test case of this PR is huge.

The effect on everything else (for example insn-attrtab) is not
measurable.

Dunno if this is worth the trouble.  The patch I have is a hack
and the real solution is using a smarter representation for the
case label targets.  I'll try some other code and perhaps post
the patch.  We might want to just do it as a hack on the gcc 4.0
release branch...

Comment 29 Giovanni Bajo 2004-11-01 16:55:55 UTC

> The effect for the test case of this PR is huge.
> The effect on everything else (for example insn-attrtab) is not
> measurable.

We have had many bug reports from users trying to compile interpret-like code 
which is machine generated and is effectively made by giant switch statements. 
Improving this PR will definitely make their code go faster.

In fact, if we are going to have a performance testsuite for GCC (as I asked to 
the SC lately), I would surely propose to put such a testcase in there.

Comment 30 s.bosscher 2004-11-01 19:06:17 UTC

Subject: RE:  [4.0 Regression] jump threading
	 on trees is slow with switch statements with large # of cases


OK, then can you see if this hack helps...?

I'll be the first to admit that it's gross, but it
bootstraps, and if it helps (which I haven't tested)
then, well, maybe, you know, eh, ...

Comment 31 s.bosscher 2004-11-01 19:06:23 UTC

Created attachment 7453 [details]
patch

Comment 32 Kazu Hirata 2004-11-01 19:34:53 UTC

Subject: Re:  [4.0 Regression] jump threading
 on trees is slow with switch statements with large # of cases

Hi Steven,

> OK, then can you see if this hack helps...?

+   /* Step 4: Update the case label vector.  */
+   TREE_VEC_LENGTH (label_vec) = n;
+   for (i = 0; i < n; ++i)
+     {
+       tree elt = TREE_VEC_ELT (label_vec, i);
+       e = case_to_edge[i];
+       CASE_LABEL (elt) = tree_block_label (e->dest);
+     }

Wow, this is barely safe.  You are assuming that the edge redirection
is done by changing e->dest and that cfg.c:redirect_edge_succ_nodup
does not remove e, which is actually the case because there is no
existing edge from the block containing SWITCH_EXPR to the new basic
block created by split_edge.

In any case, I agree this should work.  I'll be the second one to
admit this is gross. :-)

Kazu Hirata

Comment 33 Jeffrey A. Law 2004-11-01 20:04:19 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Mon, 2004-11-01 at 19:35 +0000, kazu at cs dot umass dot edu wrote:
> ------- Additional Comments From kazu at cs dot umass dot edu  2004-11-01 19:34 -------
> Subject: Re:  [4.0 Regression] jump threading
>  on trees is slow with switch statements with large # of cases
> 
> Hi Steven,
> 
> > OK, then can you see if this hack helps...?
> 
> +   /* Step 4: Update the case label vector.  */
> +   TREE_VEC_LENGTH (label_vec) = n;
> +   for (i = 0; i < n; ++i)
> +     {
> +       tree elt = TREE_VEC_ELT (label_vec, i);
> +       e = case_to_edge[i];
> +       CASE_LABEL (elt) = tree_block_label (e->dest);
> +     }
> 
> Wow, this is barely safe.  You are assuming that the edge redirection
> is done by changing e->dest and that cfg.c:redirect_edge_succ_nodup
> does not remove e, which is actually the case because there is no
> existing edge from the block containing SWITCH_EXPR to the new basic
> block created by split_edge.
> 
> In any case, I agree this should work.  I'll be the second one to
> admit this is gross. :-)
I think it'll ultimately be cleaner to simply drop the labels after
we've built the CFG and associate an edge with each of the cases.

We'd still have the assumption that edge redirection happens by
changing e->dest, but I think that approach is otherwise reasonably
clean.

Probably the ugliest part of that approach is the need to have edges
in the switch statement.

The nice thing is this approach matches pretty well with what we've
done with simple gotos.

Jeff

Comment 34 Steven Bosscher 2004-11-01 20:17:22 UTC

Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

> I think it'll ultimately be cleaner to simply drop the labels after
> we've built the CFG and associate an edge with each of the cases.

Yes, this is what I would really want to do, but I don't know if such
a patch would be safe for GCC 4.0.  I suppose we could introduce a fake
'x' class tree with is really just a VEC(edge) plus a tree_common, and
stuff that into SWITCH_LABELS (which would at that point not longer be
an appropriate name of course ;-).  The trouble is that we still need
to recreate the labels for expand, that's the ugly part.

Comment 35 Jeffrey A. Law 2004-11-01 21:18:33 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Mon, 2004-11-01 at 16:56 +0000, giovannibajo at libero dot it wrote:
> ------- Additional Comments From giovannibajo at libero dot it  2004-11-01 16:55 -------
> > The effect for the test case of this PR is huge.
> > The effect on everything else (for example insn-attrtab) is not
> > measurable.
> 
> We have had many bug reports from users trying to compile interpret-like code 
> which is machine generated and is effectively made by giant switch statements. 
> Improving this PR will definitely make their code go faster.
> 
> In fact, if we are going to have a performance testsuite for GCC (as I asked to 
> the SC lately), I would surely propose to put such a testcase in there.
But how often are those codes going to trigger jump threading and
edge splitting.  Just having large switches isn't enough to cause us
to trip over the representational issues that are giving us so many
compile-time problems.

jeff

Comment 36 Jeffrey A. Law 2004-11-01 21:18:33 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Mon, 2004-11-01 at 20:17 +0000, stevenb at suse dot de wrote:
> ------- Additional Comments From stevenb at suse dot de  2004-11-01 20:17 -------
> Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
> 
> > I think it'll ultimately be cleaner to simply drop the labels after
> > we've built the CFG and associate an edge with each of the cases.
> 
> Yes, this is what I would really want to do, but I don't know if such
> a patch would be safe for GCC 4.0.
I think this can be done in a reasonably localized way.  We convert
to using edges just after building the CFG and we return to using
labels after we're done with the SSA path.  I don't expect we'll have
much (if any) code that has to handle both representations.


>   I suppose we could introduce a fake
> 'x' class tree with is really just a VEC(edge) plus a tree_common, and
> stuff that into SWITCH_LABELS (which would at that point not longer be
> an appropriate name of course ;-). 
Or break out case labels in a manner similar to what we do with other
nodes that have "interesting" fields.



>  The trouble is that we still need
> to recreate the labels for expand, that's the ugly part.
Na.  That's trivial.

jeff

Comment 37 Steven Bosscher 2004-11-01 21:22:30 UTC

Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases


I don't know about jump threading, but edge splitting in such code
is very likely because we pre-split all critical edges for GVN-PRE
and I can imagine that in such interpreter-like code you'd have lots
of critical edges.  But maybe not...

Comment 38 Jeffrey A. Law 2004-11-03 15:03:34 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases


With loop bounds recording no longer charged to the expander it's time
to deal with the inefficiencies which are in the switch expander.

Basically we have code which does

for each case
  for each case
    see if the label in the outer loop matches any of the cases in the
    inner loop (up to the current case in the outer loop).

Needless to say this can get rather expensive.

This patch avoids the inner loop completely by keeping track of the
target labels we've seen in a bitmap and only incrementing the unique
target count when we encounter a target which hasn't been seen.

Before this patch we got the following times for expand in pr15524.


expand          :   6.53 ( 5%) usr   0.03 ( 4%) sys   6.56 ( 5%)

After this patch we see:

expand          :   0.71 ( 1%) usr   0.02 ( 3%) sys   0.75 ( 1%)


Not bad.  Unfortunately, no measurable difference on insn-attrtab.i
and friends.  Oh well.

Bootstrapped and regression tested on i686-pc-linux-gnu

Comment 39 Jeffrey A. Law 2004-11-03 15:03:44 UTC

Created attachment 7466 [details]
ZZZ

Comment 40 Andrew Pinski 2004-11-03 15:08:46 UTC

Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases


On Nov 3, 2004, at 10:03 AM, Jeffrey A Law wrote:

>
> With loop bounds recording no longer charged to the expander it's time
> to deal with the inefficiencies which are in the switch expander.
>
> Basically we have code which does
>
> for each case
>   for each case
>     see if the label in the outer loop matches any of the cases in the
>     inner loop (up to the current case in the outer loop).

We don't really need some of this code anyways because the gimplifier
now reorders the case statements so that you don't have to do the loop
at all.

Thanks,
Andrew Pinski

Comment 41 Jeffrey A. Law 2004-11-03 15:43:17 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Wed, 2004-11-03 at 15:08 +0000, pinskia at physics dot uc dot edu
wrote:
> ------- Additional Comments From pinskia at physics dot uc dot edu  2004-11-03 15:08 -------
> Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
> 
> 
> On Nov 3, 2004, at 10:03 AM, Jeffrey A Law wrote:
> 
> >
> > With loop bounds recording no longer charged to the expander it's time
> > to deal with the inefficiencies which are in the switch expander.
> >
> > Basically we have code which does
> >
> > for each case
> >   for each case
> >     see if the label in the outer loop matches any of the cases in the
> >     inner loop (up to the current case in the outer loop).
> 
> We don't really need some of this code anyways because the gimplifier
> now reorders the case statements so that you don't have to do the loop
> at all.
Note carefully we are looking at the destination labels for each
case within the switch.  Thus to be able to avoid this loop we would
have to know that even after the optimizers run that all cases which
transfer control to the same destination are grouped together.
jef

Comment 42 richard.guenther@gmail.com 2004-11-03 15:44:17 UTC

Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Wed, 3 Nov 2004 10:07:58 -0500, Andrew Pinski <pinskia@physics.uc.edu> wrote:
> 
> On Nov 3, 2004, at 10:03 AM, Jeffrey A Law wrote:
> 
> >
> > With loop bounds recording no longer charged to the expander it's time
> > to deal with the inefficiencies which are in the switch expander.
> >
> > Basically we have code which does
> >
> > for each case
> >   for each case
> >     see if the label in the outer loop matches any of the cases in the
> >     inner loop (up to the current case in the outer loop).
> 
> We don't really need some of this code anyways because the gimplifier
> now reorders the case statements so that you don't have to do the loop
> at all.

I believe I have seen patches from Kazu that address this.

Richard.

Comment 43 Kazu Hirata 2004-11-03 16:33:01 UTC

Richard,

My patches to expand_case does not change its asymptotic behavior.

Jeff,

You might want to use SBITMAP instead of BITMAP in your patch to
expand_case because the bitmap you construct will be dense, and you know
its size in advance.

Comment 44 Jeffrey A. Law 2004-11-03 16:46:40 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Wed, 2004-11-03 at 16:33 +0000, kazu at cs dot umass dot edu wrote:
> ------- Additional Comments From kazu at cs dot umass dot edu  2004-11-03 16:33 -------
> Richard,
> 
> My patches to expand_case does not change its asymptotic behavior.
> 
> Jeff,
> 
> You might want to use SBITMAP instead of BITMAP in your patch to
> expand_case because the bitmap you construct will be dense, and you know
> its size in advance.
No you don't know it's size or density in advance.  We're not looking
at case values, but at the label where each case statement will jump to.

Jeff

Comment 45 Kazu Hirata 2004-11-03 16:51:39 UTC

Jeff,

Oops, you're right!

Kazu Hirata

Comment 46 s.bosscher 2004-11-03 16:52:33 UTC

Subject: RE:  [4.0 Regression] jump threading
	 on trees is slow with switch statements with large # of cases

> No you don't know it's size or density in advance.  We're not looking
> at case values, but at the label where each case statement will jump to.

You could move the expand code for SWITCH_EXPRs out of
expand_expr, and call it from cfgexpand.c, passing the
number of outgoing edges.

Comment 47 Jeffrey A. Law 2004-11-04 00:22:43 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases


We've known for a little while that the prediction code compile time
behavior for code like pr15524 is, err, poor at best.


The first problem is the prediction code, much like the code
to record loop bounds, was walking all the blocks in the function and
performing a member test on them.

To give you an idea of how expensive this style of coding is, here's
the timevar info for pr15524 before my patch:

branch prediction     :  44.17 (35%) usr   0.01 ( 1%) sys  44.28 (32%)

OUCH.  Over 1/3 of our total compilation time is spent in the branch
prediction code.

I changed the code to record the blocks to visit in a bitmap and
we use EXECUTE_IF_SET_IN_BITMAP.  After that change we get:


branch prediction     :  11.23 (12%) usr   0.00 ( 0%) sys  11.25 (12%)

Which is a nice improvement.  However, there's clearly an algorithmic
problem in this code.  If I'm reading the code correctly it has a
fundamental problem in that the nodes in the inner loops are processed
multiple times (once per loop nest).  It would seem to me we could
get a nice speedup by aggregating and recording the information from
inner loops, then not walking them again and instead using the
cached information.

Anyway, it's a pretty straightforward change and gives us a nice
speedup on the 15524 code.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Comment 48 Jeffrey A. Law 2004-11-04 00:22:54 UTC

Created attachment 7469 [details]
ZZZ

Comment 49 Steven Bosscher 2004-11-04 00:28:56 UTC

Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

> However, there's clearly an algorithmic problem in this code.

There is.  The loop predictors are quadratic in the loop nest depth.
Honza already suggested predicting the loops from the innermost one to
outer, and just not predict the outer loops with such an expensive
algorithm.

Comment 50 Steven Bosscher 2004-11-04 00:29:52 UTC

Comment 51 Steven Bosscher 2004-11-07 22:13:35 UTC

It seems that connect_infinite_loops_to_exit is the next 
bottleneck for this test case.  Probably not important 
enough to care about for GCC 4.0 (would *any* real-world 
code ever consist of 30000-odd infinite loops in a single 
switch statement? ;-)

Comment 52 Jeffrey A. Law 2004-11-09 04:28:34 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

Ho hum.  Another 12% improvement. 

There's two closely related changes in this patch.

First, we can drastically speed up find_edge for code with large
switch statements.  Right now we call find_edge with a src/dst
edge pair.  find_edge walks src->succs to find a successor
with a matching destination block.

That is usually a pretty efficient scheme as most blocks have one
or two successors and thus the search terminates quickly.

However, when handling large switch statements that scheme can 
perform poorly.  Instead it is sometimes better to walk dst->preds
searching for the right src/dst pair.

Previously we had no good way to determine which list to search
through.  However, with the recent work on our pred/succ list
representations we can trivially check which of the two
important lists is shorter.  We (of course) search the shorter list :-)

redirect_edge_succ_nodup actually implements find_edge inline and thus
it does not perform well on blocks with lots of successors.  Clearly
this code can (and does) benefit from calling the significantly 
improved find_edge.

Key timevars before the change:




 cfg cleanup           :   2.80 ( 3%) usr   0.00 ( 0%) sys   2.78 ( 3%)
 tree CFG cleanup      :  34.23 (36%) usr   0.00 ( 0%) sys  34.26 (35%)
 dominator optimization:   5.11 ( 5%) usr   0.02 ( 2%) sys   5.13 ( 5%)
 tree split crit edges :  17.36 (18%) usr   0.00 ( 0%) sys  17.36 (18%)
 TOTAL                 :  96.21             0.82            97.53


Key timevars after the change:

 cfg cleanup           :   0.62 ( 1%) usr   0.00 ( 0%) sys   0.62 ( 1%)
 tree CFG cleanup      :  28.27 (33%) usr   0.00 ( 0%) sys  28.33 (32%)
 dominator optimization:   3.77 ( 4%) usr   0.01 ( 1%) sys   3.80 ( 4%)
 tree split crit edges :  15.58 (18%) usr   0.00 ( 0%) sys  15.57 (18%)
 TOTAL                 :  84.81             0.81            87.92

Bootstrapped and regression tested on i686-pc-linux-gnu.

Comment 53 Jeffrey A. Law 2004-11-09 04:28:39 UTC

Created attachment 7495 [details]
PPP

Comment 54 Jeffrey A. Law 2004-11-10 05:04:34 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases


Whee, almost another 30% improvement...

In this installment we're attacking the insane amount of time we're
taking to insert fake edges in the CFG to connect infinite loops
to the exit block (which we do 3 times during the course of compiling
this testcase).

The testcase in pr15524 is somewhat unique in that it has a huge 
number of infinite loops.  In fact, that's about all the testcase
is once jumps are threaded.

I agree with Steven that it is very unlikely that any normal looking
code will have similar numbers of unreachable loops and such a large
enough CFG to trigger the insane compile-time behavior like we see
with pr15524.

However, I do think it is worth fixing this problem.  Particularly 
when we realize that we're doing unnecessary work anytime we encounter
a function with an infinite loop and that it's pretty trivial to
avoid the unnecessary work.

The fundamental problem is after DFS walking the reverse CFG we start
searching blocks from EXIT to ENTRY looking for the first one which
has not been visited.

We perform that search each time the reverse CFG DFS walk stops
(ie, each time we find an infinite loop), each time we start the search
at the EXIT block.

For any function with an infinite loop we're wasting compilation
time searching blocks multiple times.  In 15524 we happen to have
enough blocks and enough infinite loops to make the lameness of
our implementation obvious.

Luckily fixing this is quite trivial.  All we have to do is pass
in the unvisited block from the previous search and use that
as the starting point for the loop instead of EXIT_BLOCK.

This basically takes PRE and the branch predictors off the radar for
this testcase.

Before the fix:
 tree PRE              :  12.69 (15%) usr   0.01 ( 1%) sys  12.73 (15%) 
 branch prediction     :  11.36 (14%) usr   0.00 ( 0%) sys  11.37 (13%)
 TOTAL                 :  83.16             0.79            84.45


After the fix:

 tree PRE              :   0.12 ( 0%) usr   0.02 ( 3%) sys   0.14 ( 0%)
 branch prediction     :   0.41 ( 1%) usr   0.01 ( 1%) sys   0.41 ( 1%)
 TOTAL                 :  59.54             0.80            61.29

[ At this point our SWITCH_EXPR issues account for roughly 70% of
  the remaining compilation time.  Yup, that's right, I'd expect
  a reduction from ~60 seconds to around 15-25 seconds depending
  on how well we solve our SWITCH_EXPR issues. ]


Bootstrapped and regression tested on i686-pc-linux-gnu.

Comment 55 Jeffrey A. Law 2004-11-10 05:04:43 UTC

Created attachment 7514 [details]
PPP

Comment 56 Jeffrey A. Law 2004-11-11 21:42:31 UTC

GRRR.  I thought I had settled on a scheme that would address the SWITCH_EXPR
representation issues with 15524.   It was a combination of a hash table to
get us from an edge to a CASE_LABEL_EXPR and the "CASE_LEADER" concept from
Kazu.

It works amazingly well on 15524 -- cuts compile-time by just over 70% and
should have been suitable for more general code.

The problem is we don't have a way to get notification of edge deletions and
there are too many places where we can get into remove_edge without first
going through ssa_remove_edge, which is were I had put my code to handle
edge removals from the hash table.

It looks like we might be stuck building the hash table lazily in the CFG
cleanup code and when splitting critical edges.  Ugh.

I may go forward with the CASE_LEADER changes separately -- they should give
a measurable benefit, and can work with or without the hash table to map
from edge to CASE_LABEL_EXPRs.

Comment 57 Jeffrey A. Law 2004-11-13 04:18:35 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases


And the saga continues....

This patch introduces the concept of a case leader to CASE_LABEL_EXPR.

As Kazu outlined, the case leader is the "representative value" for
the partition of cases which reach the same destination block.   The
case leader is the only CASE_LABEL_EXPR which contains a LABEL_DECL;
other members of the partition point to the case leader.

We convert to the case leader representation as we create edges for
each SWITCH_EXPR.

In this iteration of the change we still have to do a search of the
case vector to find the case leader when we redirect an edge out of
the SWITCH_EXPR.  A future version is expected to use a hash table,
but some kinks in that code still need to be worked out.

Even though we're still doing a search of the case vector, this
patch represents a big improvement for 15524.  Instead of having
to search the entire case vector when we redirect a jump, we 
(on average) half the case vector to find the leader.


To give you an idea of how this affects compile time, we have the
critical timevars before this patch:
 tree CFG cleanup      :  28.20 (49%) usr   0.06 ( 9%) sys  29.81 (48%) 
 tree split crit edges :  14.55 (25%) usr   0.02 ( 3%) sys  14.83 (24%)
 TOTAL                 :  57.76             0.70            62.18


After this patch:
 tree CFG cleanup      :  12.29 (36%) usr   0.01 ( 2%) sys  12.35 (34%)
 tree split crit edges :   5.91 (17%) usr   0.00 ( 0%) sys   5.92 (16%)
 TOTAL                 :  33.89             0.57            36.65

That's a pretty good improvement.

FWIW, a version which uses a persistent hash table for lookups of
edge to case leaders brings the CFG cleanup and critical edge
splitting phases down into the noise with a total time of < 18
seconds.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Comment 58 Jeffrey A. Law 2004-11-13 04:18:46 UTC

Created attachment 7538 [details]
PPP

Comment 59 Jeffrey A. Law 2004-11-17 21:06:37 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

OK.  This ought to fix 18478, 18521 and any other breakages caused
by the introduction of CASE_LEADER _and_ speed up 15524 considerably.

As Kazu determined, sharing nodes within the case vector has some
rather unpleasant effects and IMHO fixing them was going to create
an unmaintainable mess in the tree copying code.

This patch kills the concept of a CASE_LEADER; however, it introduces
an edge to cases hash table which allows us to map from an edge to
the CASE_LABEL_EXPRs which use that edge.

Because we do not have all the hooks in place to be notified of
CFG changes, the hash table is not persistent.  Instead we create
on-demand in response to requests to redirect jumps out of SWITCH_EXPRs.

The code is designed so that it will work with or without the hash
table.  That way we don't have to worry about correctness issues
with passes that may redirect edges without creating the hash table.
[ Such passes will simply run slower if they redirect a lot of
  edges out of SWITCH_EXPRs.  ]

The effect on 15524 is dramatic.  Before this change:

tree CFG cleanup      :  15.14 (38%) usr   0.00 ( 0%) sys  15.16 (37%)
tree split crit edges :   7.50 (19%) usr   0.00 ( 0%) sys   7.50 (18%)
TOTAL                 :  40.27             0.80            41.09


After this change:


tree CFG cleanup      :   0.46 ( 3%) usr   0.00 ( 0%) sys   0.47 ( 2%)
tree split crit edges :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%)
TOTAL                 :  18.35             0.81            19.18


FWIW, if we could have a persistent hash table which was built during
CFG construction, the time for CFG cleanup would be effectively zero.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Comment 60 Jeffrey A. Law 2004-11-17 21:06:52 UTC

Created attachment 7563 [details]
PPP

Comment 61 Andrew Pinski 2004-11-18 12:59:50 UTC

*** Bug 17895 has been marked as a duplicate of this bug. ***

Comment 62 Andrew Pinski 2004-11-18 13:00:23 UTC

I will do some timings today but I think this is fixed now.

Comment 63 Andrew Pinski 2004-11-18 14:31:38 UTC

I think this fixed except I have not checked with checking turned off.
With checking still on I get the following passes as hot:
 dominator optimization:  31.03 (42%) usr   0.02 ( 1%) sys  31.26 (40%) wall
 tree SSA verifier     :   5.64 ( 8%) usr   0.00 ( 0%) sys   5.68 ( 7%) wall
 rename registers      :   7.37 (10%) usr   0.02 ( 1%) sys   7.46 (10%) wall

But cfg cleanup went down a huge amount:
 tree CFG cleanup      :   1.23 ( 2%) usr   0.00 ( 0%) sys   1.25 ( 2%) wall

Before (20041113) on a slightly faster machine and different OS:
 tree CFG cleanup      :  33.36 (43%) usr   0.01 ( 1%) sys  33.41 (42%) wall
 tree split crit edges :  16.83 (22%) usr   0.01 ( 1%) sys  16.87 (21%) wall
 dominator optimization:   7.94 (10%) usr   0.03 ( 2%) sys   7.97 (10%) wall
 tree SSA verifier     :   2.57 ( 3%) usr   0.01 ( 1%) sys   2.53 ( 3%) wall
 rename registers      :   5.57 ( 7%) usr   0.03 ( 2%) sys   5.61 ( 7%) wall

Hmm, either DOM was just slowed down (because of checking) or something is really different between 
these two OS's.

But as you can see that tree CFG cleanup and tree split crit edges have sped up a lot.
I will try to get a new compiler built without checking turned on later today to get final numbers.

Comment 64 Jeffrey A. Law 2004-11-18 15:56:58 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Thu, 2004-11-18 at 13:00 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-11-18 13:00 -------
> I will do some timings today but I think this is fixed now.
Please don't mark it as fixed.  THere's still some significant
lameness.  I suspect there's another 10-20% improvement waiting
to be realized without doing anything terribly difficult.

jeff

Comment 65 Andrew Pinski 2004-11-18 18:23:34 UTC

Here are the current timings for the mainline and 3.3:
mainline:
[zhivago:~/src/localgccPRs] pinskia% time ~/local3/bin/gcc -S pr15524.c -O1
27.680u 1.240s 0:32.98 87.6%    0+0k 0+6io 0pf+0w

3.3:
[zhivago:~/src/localgccPRs] pinskia% time gcc -S pr15524.c -O1
14.260u 0.560s 0:16.30 90.9%    0+0k 0+4io 0pf+0w

At -O0:
mainline:
[zhivago:~/src/localgccPRs] pinskia% time ~/local3/bin/gcc -S pr15524.c
2.790u 0.350s 0:03.52 89.2%     0+0k 0+3io 0pf+0w

3.3:
[zhivago:~/src/localgccPRs] pinskia% time gcc -S pr15524.c
1.930u 0.280s 0:02.28 96.9%     0+0k 1+3io 0pf+0w

Note that is this on powerpc-darwin and with Apple's 3.3 (build 1650 or so).


So we are still twice as slow as 3.3 but faster than before.
Using -ftime-report I get:
 dominator optimization:  11.69 (42%) usr   0.13 ( 4%) sys  12.28 (38%) wall
 rename registers      :   3.32 (12%) usr   0.12 ( 4%) sys   3.62 (11%) wall

So only DOM needs to be sped up.

For DOM the problem comes from thread_across_edge when we go and update the profile of the BB.

For renane registers, most of a profile of compiling this source comes from regrename.c the loop which 
is around line 1761.

Comment 66 Jeffrey A. Law 2004-11-19 03:14:32 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Thu, 2004-11-18 at 18:23 +0000, pinskia at gcc dot gnu dot org wrote:

> So we are still twice as slow as 3.3 but faster than before.
> Using -ftime-report I get:
>  dominator optimization:  11.69 (42%) usr   0.13 ( 4%) sys  12.28 (38%) wall
>  rename registers      :   3.32 (12%) usr   0.12 ( 4%) sys   3.62 (11%) wall
> 
> So only DOM needs to be sped up.
> 
> For DOM the problem comes from thread_across_edge when we go and update the profile of the BB.
That's part of the problem, but there's definitely more stuff going
on than just that annoying useless divide operation in the code to
update the profile.  There's some issues in tree-ssa-threadupdate as
well.


> For renane registers, most of a profile of compiling this source comes from regrename.c the loop which 
> is around line 1761. 
Yes.  I know.

Jeff

Comment 67 Jeffrey A. Law 2004-11-20 20:17:44 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases



This gets regrename off the radar.

Look at this code from regrename.c and tell me if you can find the part
that isn't particularly intelligent:



  FOR_EACH_BB (bb)
    {
      /* If a block has a single predecessor, that we've already
         processed, begin with the value data that was live at
         the end of the predecessor block.  */
      /* ??? Ought to use more intelligent queuing of blocks.  */
      if (EDGE_COUNT (bb->preds) > 0)
        for (bbp = bb; bbp && bbp != EDGE_PRED (bb, 0)->src; bbp = bbp-
prev_bb);
      if (EDGE_COUNT (bb->preds) == 1
          && ! (EDGE_PRED (bb, 0)->flags & (EDGE_ABNORMAL_CALL |
EDGE_EH))
          && EDGE_PRED (bb, 0)->src != ENTRY_BLOCK_PTR
          && bbp)
        all_vd[bb->index] = all_vd[EDGE_PRED (bb, 0)->src->index];
      else
        init_value_data (all_vd + bb->index);

      if (copyprop_hardreg_forward_1 (bb, all_vd + bb->index))
        need_refresh = true;
    }


OK.  A hint.  bbp isn't used/set anywhere outside this loop...

OK.  Another hint.  If BB has more than one predecessor is the loop
walking through the prev_bb blocks to set bbp even necessary?

That's right.  We do that silly loop setting bbp even in cases where
it's totally unnecessary.  

Before:

rename registers      :  3.27 (18%) 
TOTAL                 : 17.77


After:
rename registers      :  0.19 ( 1%) 
TOTAL                 : 14.53


Yup, that's nearly a net 20% improvement for pr15524.  I would expect
a small improvement for more normal codes.

[ Yes, I got the message regarding the bootstrap comparison failure
  on darwin.  I'm going to try and reproduce it on a relatively old
  PPC box here.  It'll take quite some time though.  I do plan on
  flushing out one more patch while my PPC bootstrap is running. ]

Comment 68 Jeffrey A. Law 2004-11-20 20:17:48 UTC

Created attachment 7572 [details]
PPP

Comment 69 Eric Botcazou 2004-11-20 20:32:38 UTC

Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

> [ Yes, I got the message regarding the bootstrap comparison failure
>   on darwin.  I'm going to try and reproduce it on a relatively old
>   PPC box here.  It'll take quite some time though.  I do plan on
>   flushing out one more patch while my PPC bootstrap is running. ]

There are problems on x86_64, SPARC, PowerPC and i686, see the PR.

Comment 70 Jeffrey A. Law 2004-11-20 20:39:25 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Sat, 2004-11-20 at 21:33 +0100, Eric Botcazou wrote:
> > [ Yes, I got the message regarding the bootstrap comparison failure
> >   on darwin.  I'm going to try and reproduce it on a relatively old
> >   PPC box here.  It'll take quite some time though.  I do plan on
> >   flushing out one more patch while my PPC bootstrap is running. ]
> 
> There are problems on x86_64, SPARC, PowerPC and i686, see the PR.
I haven't seen the comparison failure on i686 -- if I had I never
would have checked in the patch.  PPC is the only other kind of box
I have here locally.

There's some other approaches I can (and will) try while the PPC
box is doing its thing.  
jeff

Comment 71 Eric Botcazou 2004-11-21 07:12:58 UTC

Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

> I haven't seen the comparison failure on i686 -- if I had I never
> would have checked in the patch.  PPC is the only other kind of box
> I have here locally.

Right, the problem seems to be very sensitive to the environment.  H.J. saw it 
on AMD64 while I didn't.

Comment 72 Jeffrey A. Law 2004-11-21 20:42:57 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

And nearly another 20% improvement.

As mentioned in a previous message, the most expensive code right now
for pr15524 is the code to rescale edge probabilities.  This patch has
two changes to improve this situation.

First, if the scale factor is 1, then the probabilities will not change
and thus there is no work to do.  This is the common case for pr15524.

In the case where the scale factor is != 1, we compute the scale factor
outside the loop.  This saves us a divide every iteration of the loop.

Before:

dominator optimization:   3.50 (24%)
TOTAL                 :  14.53

After:

dominator optimization:   0.61 ( 5%)
TOTAL                 :  11.84

Bootstrapped and regression tested on i686-pc-linux-gnu.

Comment 73 Jeffrey A. Law 2004-11-21 20:42:59 UTC

Created attachment 7578 [details]
PPP

Comment 74 Andrew Pinski 2004-11-21 21:55:37 UTC

This is now back to 3.3 timings so closing, I filed PR 18601 for another case where tree cfg cleanup is 
slow, it is a variant of this testcase but I have seen the same problem in another testcase but I cannot 
remember which one.

Comment 75 Jeffrey A. Law 2004-11-21 22:18:31 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Sun, 2004-11-21 at 21:55 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-11-21 21:55 -------
> This is now back to 3.3 timings so closing, I filed PR 18601 for another case where tree cfg cleanup is 
> slow, it is a variant of this testcase but I have seen the same problem in another testcase but I cannot 
> remember which one.
Don't close yet.  There's still some fat to be trimmed.
jeff
>

Comment 76 Giovanni Bajo 2004-11-21 22:54:27 UTC

I am sure that Jeff can use Bugzilla himself, when he wants this bug closed.

Comment 77 Jeffrey A. Law 2004-11-22 17:15:12 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases


It's clearly getting more difficult to find compile-time improvements
for 15524.  But that doesn't mean we're done :-)

As it turns out we've got more code which is just inlined versions
of find_edge, but without the ability to select which list (src->succ
or dest->pred) to walk depending on their lengths.

This patch replaces those inlined instances of find_edge with calls
to find_edge.  The big winner is tree CFG construction which goes
from .40 seconds to around .04 seconds.  There are other wins
scattered throughout the various passes.  In all we go from 11.63
seconds of total time to 10.97 seconds of total time -- a net
improvement of around 5%.

[ Please don't close this PR.  I still think there's another 4-5%
  that we'll be able to get without major surgery. ]



Bootstrapped and regression tested on i686-pc-linux-gnu.


	* cfg.c (cached_make_edge): Use find_edge rather than an inlined
	variant.
	* cfgbuild.c (make_edges): Likewise.
	* cfghooks.c (can_duplicate_block_p): Likewise.
	* cfgloop.c (loop_latch_edge): Likewise.
	* cfgloopmanip.c (force_single_succ_latches): Likewise.
	* cfgrtl.c (rtl_flow_call_edges_add): Likewise.
	* predict.c (predict_loops, propagate_freq): Likewise. 
	* tracer.c (tail_duplicate): Likewise.
	* tree-cfg.c (disband_implicit_edges): Likewise.
	(tree_forwarder_block_p, tree_flow_call_edges_add): Likewise.

Index: cfg.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cfg.c,v
retrieving revision 1.76
diff -c -p -r1.76 cfg.c
*** cfg.c	22 Nov 2004 12:23:43 -0000	1.76
--- cfg.c	22 Nov 2004 16:57:55 -0000
*************** cached_make_edge (sbitmap *edge_cache, b
*** 288,294 ****
  {
    int use_edge_cache;
    edge e;
-   edge_iterator ei;
  
    /* Don't bother with edge cache for ENTRY or EXIT, if there aren't that
       many edges to them, or we didn't allocate memory for it.  */
--- 288,293 ----
*************** cached_make_edge (sbitmap *edge_cache, b
*** 309,320 ****
  
        /* Fall through.  */
      case 0:
!       FOR_EACH_EDGE (e, ei, src->succs)
! 	if (e->dest == dst)
! 	  {
! 	    e->flags |= flags;
! 	    return NULL;
! 	  }
        break;
      }
  
--- 308,319 ----
  
        /* Fall through.  */
      case 0:
!       e = find_edge (src, dst);
!       if (e)
! 	{
! 	  e->flags |= flags;
! 	  return NULL;
! 	}
        break;
      }
  
Index: cfgbuild.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cfgbuild.c,v
retrieving revision 1.55
diff -c -p -r1.55 cfgbuild.c
*** cfgbuild.c	28 Sep 2004 07:59:42 -0000	1.55
--- cfgbuild.c	22 Nov 2004 16:57:55 -0000
*************** make_edges (basic_block min, basic_block
*** 271,277 ****
        enum rtx_code code;
        int force_fallthru = 0;
        edge e;
-       edge_iterator ei;
  
        if (LABEL_P (BB_HEAD (bb))
  	  && LABEL_ALT_ENTRY_P (BB_HEAD (bb)))
--- 271,276 ----
*************** make_edges (basic_block min, basic_block
*** 390,401 ****
  
        /* Find out if we can drop through to the next block.  */
        insn = NEXT_INSN (insn);
!       FOR_EACH_EDGE (e, ei, bb->succs)
! 	if (e->dest == EXIT_BLOCK_PTR && e->flags & EDGE_FALLTHRU)
! 	  {
! 	    insn = 0;
! 	    break;
! 	  }
        while (insn
  	     && NOTE_P (insn)
  	     && NOTE_LINE_NUMBER (insn) != NOTE_INSN_BASIC_BLOCK)
--- 389,398 ----
  
        /* Find out if we can drop through to the next block.  */
        insn = NEXT_INSN (insn);
!       e = find_edge (bb, EXIT_BLOCK_PTR);
!       if (e && e->flags & EDGE_FALLTHRU)
! 	insn = NULL;
! 
        while (insn
  	     && NOTE_P (insn)
  	     && NOTE_LINE_NUMBER (insn) != NOTE_INSN_BASIC_BLOCK)
Index: cfghooks.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cfghooks.c,v
retrieving revision 1.19
diff -c -p -r1.19 cfghooks.c
*** cfghooks.c	20 Nov 2004 05:02:28 -0000	1.19
--- cfghooks.c	22 Nov 2004 16:57:55 -0000
*************** bool
*** 675,681 ****
  can_duplicate_block_p (basic_block bb)
  {
    edge e;
-   edge_iterator ei;
  
    if (!cfg_hooks->can_duplicate_block_p)
      internal_error ("%s does not support can_duplicate_block_p.",
--- 675,680 ----
*************** can_duplicate_block_p (basic_block bb)
*** 686,694 ****
  
    /* Duplicating fallthru block to exit would require adding a jump
       and splitting the real last BB.  */
!   FOR_EACH_EDGE (e, ei, bb->succs)
!     if (e->dest == EXIT_BLOCK_PTR && e->flags & EDGE_FALLTHRU)
!        return false;
  
    return cfg_hooks->can_duplicate_block_p (bb);
  }
--- 685,693 ----
  
    /* Duplicating fallthru block to exit would require adding a jump
       and splitting the real last BB.  */
!   e = find_edge (bb, EXIT_BLOCK_PTR);
!   if (e && (e->flags & EDGE_FALLTHRU))
!     return false;
  
    return cfg_hooks->can_duplicate_block_p (bb);
  }
Index: cfgloop.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cfgloop.c,v
retrieving revision 1.43
diff -c -p -r1.43 cfgloop.c
*** cfgloop.c	22 Nov 2004 12:23:44 -0000	1.43
--- cfgloop.c	22 Nov 2004 16:57:56 -0000
*************** verify_loop_structure (struct loops *loo
*** 1499,1512 ****
  edge
  loop_latch_edge (const struct loop *loop)
  {
!   edge e;
!   edge_iterator ei;
! 
!   FOR_EACH_EDGE (e, ei, loop->header->preds)
!     if (e->src == loop->latch)
!       break;
! 
!   return e;
  }
  
  /* Returns preheader edge of LOOP.  */
--- 1499,1505 ----
  edge
  loop_latch_edge (const struct loop *loop)
  {
!   return find_edge (loop->latch, loop->header);
  }
  
  /* Returns preheader edge of LOOP.  */
Index: cfgloopmanip.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cfgloopmanip.c,v
retrieving revision 1.37
diff -c -p -r1.37 cfgloopmanip.c
*** cfgloopmanip.c	22 Nov 2004 12:23:45 -0000	1.37
--- cfgloopmanip.c	22 Nov 2004 16:57:56 -0000
*************** force_single_succ_latches (struct loops 
*** 1221,1234 ****
  
    for (i = 1; i < loops->num; i++)
      {
-       edge_iterator ei;
        loop = loops->parray[i];
        if (loop->latch != loop->header && EDGE_COUNT (loop->latch->succs) == 1)
  	continue;
  
!       FOR_EACH_EDGE (e, ei, loop->header->preds)
! 	if (e->src == loop->latch)
! 	  break;
  
        loop_split_edge_with (e, NULL_RTX);
      }
--- 1221,1231 ----
  
    for (i = 1; i < loops->num; i++)
      {
        loop = loops->parray[i];
        if (loop->latch != loop->header && EDGE_COUNT (loop->latch->succs) == 1)
  	continue;
  
!       e = find_edge (loop->latch, loop->header);
  
        loop_split_edge_with (e, NULL_RTX);
      }
Index: cfgrtl.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cfgrtl.c,v
retrieving revision 1.147
diff -c -p -r1.147 cfgrtl.c
*** cfgrtl.c	22 Nov 2004 12:23:45 -0000	1.147
--- cfgrtl.c	22 Nov 2004 16:57:57 -0000
*************** rtl_flow_call_edges_add (sbitmap blocks)
*** 2972,2986 ****
        if (need_fake_edge_p (insn))
  	{
  	  edge e;
- 	  edge_iterator ei;
  
! 	  FOR_EACH_EDGE (e, ei, bb->succs)
! 	    if (e->dest == EXIT_BLOCK_PTR)
! 	      {
! 		insert_insn_on_edge (gen_rtx_USE (VOIDmode, const0_rtx), e);
! 		commit_edge_insertions ();
! 		break;
! 	      }
  	}
      }
  
--- 2972,2984 ----
        if (need_fake_edge_p (insn))
  	{
  	  edge e;
  
! 	  e = find_edge (bb, EXIT_BLOCK_PTR);
! 	  if (e)
! 	    {
! 	      insert_insn_on_edge (gen_rtx_USE (VOIDmode, const0_rtx), e);
! 	      commit_edge_insertions ();
! 	    }
  	}
      }
  
*************** rtl_flow_call_edges_add (sbitmap blocks)
*** 3023,3031 ****
  #ifdef ENABLE_CHECKING
  	      if (split_at_insn == BB_END (bb))
  		{
! 		  edge_iterator ei;
! 		  FOR_EACH_EDGE (e, ei, bb->succs)
! 		    gcc_assert (e->dest != EXIT_BLOCK_PTR);
  		}
  #endif
  
--- 3021,3028 ----
  #ifdef ENABLE_CHECKING
  	      if (split_at_insn == BB_END (bb))
  		{
! 		  e = find_edge (bb, EXIT_BLOCK_PTR);
! 		  gcc_assert (e == NULL);
  		}
  #endif
  
Index: predict.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/predict.c,v
retrieving revision 1.134
diff -c -p -r1.134 predict.c
*** predict.c	19 Nov 2004 00:03:14 -0000	1.134
--- predict.c	22 Nov 2004 16:57:58 -0000
*************** predict_loops (struct loops *loops_info,
*** 669,681 ****
  
  	  /* Loop branch heuristics - predict an edge back to a
  	     loop's head as taken.  */
! 	  FOR_EACH_EDGE (e, ei, bb->succs)
! 	    if (e->dest == loop->header
! 		&& e->src == loop->latch)
! 	      {
! 		header_found = 1;
! 		predict_edge_def (e, PRED_LOOP_BRANCH, TAKEN);
! 	      }
  
  	  /* Loop exit heuristics - predict an edge exiting the loop if the
  	     conditional has no loop header successors as not taken.  */
--- 669,683 ----
  
  	  /* Loop branch heuristics - predict an edge back to a
  	     loop's head as taken.  */
! 	  if (bb == loop->latch)
! 	    {
! 	      e = find_edge (loop->latch, loop->header);
! 	      if (e)
! 		{
! 		  header_found = 1;
! 		  predict_edge_def (e, PRED_LOOP_BRANCH, TAKEN);
! 		}
! 	    }
  
  	  /* Loop exit heuristics - predict an edge exiting the loop if the
  	     conditional has no loop header successors as not taken.  */
*************** propagate_freq (struct loop *loop, bitma
*** 1660,1680 ****
  
        bitmap_clear_bit (tovisit, bb->index);
  
!       /* Compute back edge frequencies.  */
!       FOR_EACH_EDGE (e, ei, bb->succs)
! 	if (e->dest == head)
! 	  {
! 	    sreal tmp;
  	    
! 	    /* EDGE_INFO (e)->back_edge_prob
! 	       = ((e->probability * BLOCK_INFO (bb)->frequency)
! 	       / REG_BR_PROB_BASE); */
  	    
! 	    sreal_init (&tmp, e->probability, 0);
! 	    sreal_mul (&tmp, &tmp, &BLOCK_INFO (bb)->frequency);
! 	    sreal_mul (&EDGE_INFO (e)->back_edge_prob,
! 		       &tmp, &real_inv_br_prob_base);
! 	  }
  
        /* Propagate to successor blocks.  */
        FOR_EACH_EDGE (e, ei, bb->succs)
--- 1662,1681 ----
  
        bitmap_clear_bit (tovisit, bb->index);
  
!       e = find_edge (bb, head);
!       if (e)
! 	{
! 	  sreal tmp;
  	    
! 	  /* EDGE_INFO (e)->back_edge_prob
! 	     = ((e->probability * BLOCK_INFO (bb)->frequency)
! 	     / REG_BR_PROB_BASE); */
  	    
! 	  sreal_init (&tmp, e->probability, 0);
! 	  sreal_mul (&tmp, &tmp, &BLOCK_INFO (bb)->frequency);
! 	  sreal_mul (&EDGE_INFO (e)->back_edge_prob,
! 		     &tmp, &real_inv_br_prob_base);
! 	}
  
        /* Propagate to successor blocks.  */
        FOR_EACH_EDGE (e, ei, bb->succs)
Index: tracer.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tracer.c,v
retrieving revision 1.22
diff -c -p -r1.22 tracer.c
*** tracer.c	28 Sep 2004 07:59:50 -0000	1.22
--- tracer.c	22 Nov 2004 16:57:58 -0000
*************** tail_duplicate (void)
*** 275,286 ****
  	      && can_duplicate_block_p (bb2))
  	    {
  	      edge e;
- 	      edge_iterator ei;
  	      basic_block old = bb2;
  
! 	      FOR_EACH_EDGE (e, ei, bb2->preds)
! 		if (e->src == bb)
! 		  break;
  
  	      nduplicated += counts [bb2->index];
  	      bb2 = duplicate_block (bb2, e);
--- 275,283 ----
  	      && can_duplicate_block_p (bb2))
  	    {
  	      edge e;
  	      basic_block old = bb2;
  
! 	      e = find_edge (bb, bb2);
  
  	      nduplicated += counts [bb2->index];
  	      bb2 = duplicate_block (bb2, e);
Index: tree-cfg.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-cfg.c,v
retrieving revision 2.112
diff -c -p -r2.112 tree-cfg.c
*** tree-cfg.c	19 Nov 2004 22:14:35 -0000	2.112
--- tree-cfg.c	22 Nov 2004 16:57:59 -0000
*************** disband_implicit_edges (void)
*** 2649,2659 ****
  	     from cfg_remove_useless_stmts here since it violates the
  	     invariants for tree--cfg correspondence and thus fits better
  	     here where we do it anyway.  */
! 	  FOR_EACH_EDGE (e, ei, bb->succs)
  	    {
- 	      if (e->dest != bb->next_bb)
- 		continue;
- 
  	      if (e->flags & EDGE_TRUE_VALUE)
  		COND_EXPR_THEN (stmt) = build_empty_stmt ();
  	      else if (e->flags & EDGE_FALSE_VALUE)
--- 2649,2657 ----
  	     from cfg_remove_useless_stmts here since it violates the
  	     invariants for tree--cfg correspondence and thus fits better
  	     here where we do it anyway.  */
! 	  e = find_edge (bb, bb->next_bb);
! 	  if (e)
  	    {
  	      if (e->flags & EDGE_TRUE_VALUE)
  		COND_EXPR_THEN (stmt) = build_empty_stmt ();
  	      else if (e->flags & EDGE_FALSE_VALUE)
*************** static bool
*** 3892,3899 ****
  tree_forwarder_block_p (basic_block bb)
  {
    block_stmt_iterator bsi;
-   edge e;
-   edge_iterator ei;
  
    /* BB must have a single outgoing edge.  */
    if (EDGE_COUNT (bb->succs) != 1
--- 3890,3895 ----
*************** tree_forwarder_block_p (basic_block bb)
*** 3911,3920 ****
    gcc_assert (bb != ENTRY_BLOCK_PTR);
  #endif
  
!   /* Successors of the entry block are not forwarders.  */
!   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR->succs)
!     if (e->dest == bb)
!       return false;
  
    /* Now walk through the statements.  We can ignore labels, anything else
       means this is not a forwarder block.  */
--- 3907,3914 ----
    gcc_assert (bb != ENTRY_BLOCK_PTR);
  #endif
  
!   if (find_edge (ENTRY_BLOCK_PTR, bb))
!     return false;
  
    /* Now walk through the statements.  We can ignore labels, anything else
       means this is not a forwarder block.  */
*************** tree_flow_call_edges_add (sbitmap blocks
*** 5206,5212 ****
       Handle this by adding a dummy instruction in a new last basic block.  */
    if (check_last_block)
      {
-       edge_iterator ei;
        basic_block bb = EXIT_BLOCK_PTR->prev_bb;
        block_stmt_iterator bsi = bsi_last (bb);
        tree t = NULL_TREE;
--- 5200,5205 ----
*************** tree_flow_call_edges_add (sbitmap blocks
*** 5217,5229 ****
  	{
  	  edge e;
  
! 	  FOR_EACH_EDGE (e, ei, bb->succs)
! 	    if (e->dest == EXIT_BLOCK_PTR)
! 	      {
! 		bsi_insert_on_edge (e, build_empty_stmt ());
! 		bsi_commit_edge_inserts ();
! 		break;
! 	      }
  	}
      }
  
--- 5210,5221 ----
  	{
  	  edge e;
  
! 	  e = find_edge (bb, EXIT_BLOCK_PTR);
! 	  if (e)
! 	    {
! 	      bsi_insert_on_edge (e, build_empty_stmt ());
! 	      bsi_commit_edge_inserts ();
! 	    }
  	}
      }
  
*************** tree_flow_call_edges_add (sbitmap blocks
*** 5260,5268 ****
  #ifdef ENABLE_CHECKING
  		  if (stmt == last_stmt)
  		    {
! 		      edge_iterator ei;
! 		      FOR_EACH_EDGE (e, ei, bb->succs)
! 			gcc_assert (e->dest != EXIT_BLOCK_PTR);
  		    }
  #endif
  
--- 5252,5259 ----
  #ifdef ENABLE_CHECKING
  		  if (stmt == last_stmt)
  		    {
! 		      e = find_edge (bb, EXIT_BLOCK_PTR);
! 		      gcc_assert (e == NULL);
  		    }
  #endif

Comment 78 Andrew Pinski 2004-12-04 15:35:24 UTC

Here are the new numbers.
option     mainline    3.3.2
-O0         2.790      2.680
-O1         12.51      11.44
-O2         18.26      19.84
-O3         18.64      19.53

So we are faster or just as fast except for at -O2 where we are 9% slower.  We are no where near the 
numbers we were at before.

Comment 79 Steven Bosscher 2004-12-20 00:51:22 UTC

I suggest we close this one, 9% is just comparable to the overall 
slowdown we see in pretty much all code.  This is just noise in 
the list of regressions.  If nobody objects, I'll close this in a 
couple of days from now...

Comment 80 Jeffrey A. Law 2004-12-21 06:23:28 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Mon, 2004-12-20 at 00:51 +0000, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org  2004-12-20 00:51 -------
> I suggest we close this one, 9% is just comparable to the overall 
> slowdown we see in pretty much all code.  This is just noise in 
> the list of regressions.  If nobody objects, I'll close this in a 
> couple of days from now... 
I wouldn't close it.  There's still some significant representational 
issues that need addressing (jump vectors in RTL).  It exposes some
major lamness in invariant code motion (trying to hoist a few thousand
instances of GLOBAL_VAR), and some lameness in our CFG code as well
(the edge cache).

The fact that we're spending so much time redirecting edges at the RTL
level is also an indication that we may not be passing off optimal
trees to the expanders.

In all, there may be another 10% gain for pr15524 waiting for us to
nail down.

These are all issues I want to see examined, but they're lower priority
than fixing the extreme lameness in our aliasing code which is causing
even bigger compile-time regressions in other testcodes.

Jeff

Comment 81 Steven Bosscher 2004-12-22 19:18:03 UTC

That may all be true, but the fact is that you're now talking about  
missed-optimizations, and not a real compile time hog.  Yes we can 
win.  No, this is not a hog anymore, it's just not as good as it 
could be.  You're also talking about RTL, not tree jump threading. 
 
In fact I think it's plain wrong to keep this bug open.  It's just 
keeping the number of real regressions artificially high.  We have 
compile time slowdowns, and that's a regression.  But we don't need 
to keep bugs open when the issue of that bug was been fixed. 
 
I suggest someone updates the summary and keywords to reflect the 
actual status of the bug, or close this one and open a new one for 
the basically unrelated issues Jeff mentioned

Comment 82 Jeffrey A. Law 2004-12-22 19:28:15 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Wed, 2004-12-22 at 19:18 +0000, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org  2004-12-22 19:18 -------
> That may all be true, but the fact is that you're now talking about  
> missed-optimizations, and not a real compile time hog.  Yes we can 
> win.  No, this is not a hog anymore, it's just not as good as it 
> could be.  You're also talking about RTL, not tree jump threading. 
I'm talking strictly about compile-time hogs.  It's insanely dumb that
we're burning 5% of compilation time just trying to hoist the
fake GLOBAL_VAR variable out of empty loops.  It's insanely dumb that
we're burning 5% of compilation time threading jumps at the RTL
level when they all should have been threaded at the tree level.

>  
> In fact I think it's plain wrong to keep this bug open.  It's just 
> keeping the number of real regressions artificially high.  We have 
> compile time slowdowns, and that's a regression.  But we don't need 
> to keep bugs open when the issue of that bug was been fixed. 
I stongly disagree.  Please do not close this bug report.

jeff

Comment 83 Steven Bosscher 2004-12-23 13:31:00 UTC

Jeff, as I said, you're attacking problems not related to the original
problem reported in this PR.  Maybe you don't use bugzilla often enough
to see how annoying it is when the PR summary and the problem described
in the bug report don't match. But oh well, whatever, if you can win 
10% here, why am I complaining ;-)

I'm very curious about your tree alias rewrite.  Got a plan/overview of
it somewhere?  Is it GCC 4.0 material?

Relation to this PR: Last time I looked, ~40% of the RTL jump threads
were related to more accurate alias info (components!) on the RTL level.

Comment 84 Jeffrey A. Law 2004-12-23 17:22:15 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Thu, 2004-12-23 at 13:31 +0000, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org  2004-12-23 13:31 -------
> Jeff, as I said, you're attacking problems not related to the original
> problem reported in this PR.  Maybe you don't use bugzilla often enough
> to see how annoying it is when the PR summary and the problem described
> in the bug report don't match. But oh well, whatever, if you can win 
> 10% here, why am I complaining ;-)
The bug report really should have been called  "compilation speed 
regression" rather than calling out "jump threading" -- a goodly number
of the changes necessary to get 15524 to a point of almost acceptable
were not jump threading related.

How about this, open a new report with the same testcase and a generic
description about compile-time regression, then close 15524.  Seems like
a make-work project, but hey, if it makes you happy....

> I'm very curious about your tree alias rewrite.  Got a plan/overview of
> it somewhere? 
In simplest terms, we lose the braindamaged type memory tag nonsense and
only do type based alias analysis on objects that have not been
disambiguated by other means.  This as the effect of greatly reducing
number of type based aliasing checks -- which accounts for the vast 
majority of the compilation time for 15855.


>  Is it GCC 4.0 material?
Unsure, mostly because I will not have much time to work on it for the
next 3 weeks.

Jeff

Comment 85 Andrew Pinski 2005-01-28 14:27:55 UTC

The timings for 3.3.2 and 4.0.0 are the same are faster for 4.0.0 so closing as fixed.

Comment 86 Diego Novillo 2005-04-08 21:49:12 UTC

(In reply to comment #18)

> Ugh.  The copy we inserted turned out to be totally useless for determining
> that the condition at the end of BB5 is always false.  Argggh.
> 
> To make this scheme work we'd have to do copy insertions anytime the
> COND_EXPR_COND or SWITCH_EXPR changed.  Worse yet, we'd need to call
> rewrite_ssa_into_ssa to keep things in SSA form.  Double Ugh.
> 
It's a matter of scheduling the passes mostly.  The reason you're having a hard
time here is mostly because some of the scalar cleanups haven't run.  Suppose
that we do this after FRE and copy propagation.  We end up with:

  D.15513_11 = insn_10->code;
  switch (D.15514_11)
    {
      case 5: goto <L17>;
      case 6: goto <L2>;
      case 7: goto <L3>;
      case 8: goto <L22>;
      case 9 ... 10: goto <L0>;
      default : goto <L23>;
    }
  # SUCC: 22 1 21 3 2 16
[ ... ]

  # BLOCK 3
  # PRED: 0
<L3>:;
  if (D.15521_11 != 7) goto <L4>; else goto <L5>;
  # SUCC: 5 (false) 4 (true)

Now, if you insert an assertion of any kind at the start of block #3, you'll get
the elimination you're looking for.

Currently, VRP will not handle this case because of the switch.  That is easily
fixable.  In TCB, the code I get after VRP is this:

  # BLOCK 0
  # PRED: ENTRY (fallthru,exec)
  #   VUSE <TMT.0_7>;
  D.1144_2 = insn_1->code;
  insn_3 = insn_1;
  switch (D.1144_2)
    {
      case 5: goto L17;
      case 6: goto L17;
      case 7: goto L3;
      default : goto L17;
    }
  # SUCC: 1 (exec) 3 (exec)

  # BLOCK 1
  # PRED: 0 (exec)
L3:;
  D.1145_6 = D.1144_2;
  if (D.1145_6 != 7) goto <L5>; else goto L17;
  # SUCC: 2 (true,exec) 3 (false,exec)

[ ... ]

So, it would be a matter of choosing the ordering and arranging for VRP to
handle SWITCH_EXPRs.

Now, if what you want is a design where all these actions are done in one
super-pass, well, that's a different can of worms.


Diego.

Comment 87 Jeffrey A. Law 2005-04-12 16:55:16 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Fri, 2005-04-08 at 21:49 +0000, dnovillo at gcc dot gnu dot org
wrote:
> ------- Additional Comments From dnovillo at gcc dot gnu dot org  2005-04-08 21:49 -------
> (In reply to comment #18)
> 
> > Ugh.  The copy we inserted turned out to be totally useless for determining
> > that the condition at the end of BB5 is always false.  Argggh.
> > 
> > To make this scheme work we'd have to do copy insertions anytime the
> > COND_EXPR_COND or SWITCH_EXPR changed.  Worse yet, we'd need to call
> > rewrite_ssa_into_ssa to keep things in SSA form.  Double Ugh.
> > 
> It's a matter of scheduling the passes mostly.  The reason you're having a hard
> time here is mostly because some of the scalar cleanups haven't run.  Suppose
> that we do this after FRE and copy propagation.  We end up with:
As we discussed on the phone, it can be seen in that way.  For VRP you
have the nice property that it's self-contained.  ie, the other
optimizers have already run, you insert your ASSERT_EXPRs, then run the
VPR algorithm, then destroy the ASSERT_EXPRs.

That mental model doesn't work right now with the way DOM & the jump
threader because they are too tightly intertwined.

As I mentioned on the phone, what I want to do long term is have the
jump threader which maintains its temporary equivalences and queries
the value handles created by PRE/FRE or whatever.

The iteration step we see today would turn into a worklist.  ie, after
we thread jumps, we walk through the IL for PHIs which represent a copy
that can be propagated.  The uses of the PHI result are put onto a
worklist of statements which need their value handles updated and
which may now be redundant.  If the state is proven redundant, then
we have a new copy to propagate and the uses of the LHS of the
statement are put on the worklist.

What's nice about this is the bulk of DOM as we know it today disappears
along with the expensive iteration in the case when jumps are threaded.
We just need the right APIs for copy propagation and value handles.

We could still potentially use ASSERT_EXPRs to encode information about
edge equivalences, though we probably would still want the ASSERT_EXPR
to appear as statement rather than the RHS of a MODIFY_EXPR.

Jeff

Comment 88 Diego Novillo 2005-04-13 13:03:55 UTC

Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Tue, Apr 12, 2005 at 04:55:20PM -0000, law at redhat dot com wrote:

> That mental model doesn't work right now with the way DOM & the
> jump threader because they are too tightly intertwined.
> 
The link that you have still not shown me is why doesn't this
mental model work for the jump threader.  That's why I said that
I need to run the threader myself and see why it needs all these
additional crutches.

If the code has been cleaned up of redundancies, copies
propagated, names have known values and ranges are set, then I
don't see why we would need all the extra baggage.

> The iteration step we see today would turn into a worklist.
> ie, after we thread jumps, we walk through the IL for PHIs
> which represent a copy that can be propagated.
>
Ah, here's something specific I don't follow.  Why would you have
these PHIs anymore?  If all the arguments were ultimately
equivalent, then the various propagators are bound to have
removed them.  If not, that sounds like a bug in them.

> What's nice about this is the bulk of DOM as we know it today
> disappears along with the expensive iteration in the case when
> jumps are threaded. We just need the right APIs for copy
> propagation and value handles.
> 
Again, why?  The propagators are supposed to have done this
already.  They are all supposed to work in conditional regions.

> We could still potentially use ASSERT_EXPRs to encode
> information about edge equivalences, though we probably would
> still want the ASSERT_EXPR to appear as statement rather than
> the RHS of a MODIFY_EXPR.
> 
Odd, the nice thing about assertions being on the RHS is that you
pin their result to a specific SSA name that you then get to use
at every place naturally dominated by the assertion.

If you could give me a concrete test case that I could sink my
teeth in, that'd be great.


Thanks.  Diego.

Comment 89 Jeffrey A. Law 2005-04-13 17:11:30 UTC

Subject: Re:  [4.0 Regression] jump threading
	on trees is slow with switch statements with large # of cases

On Wed, 2005-04-13 at 13:04 +0000, dnovillo at redhat dot com wrote:
> ------- Additional Comments From dnovillo at redhat dot com  2005-04-13 13:03 -------
> Subject: Re:  [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
> 
> On Tue, Apr 12, 2005 at 04:55:20PM -0000, law at redhat dot com wrote:
> 
> > That mental model doesn't work right now with the way DOM & the
> > jump threader because they are too tightly intertwined.
> > 
> The link that you have still not shown me is why doesn't this
> mental model work for the jump threader.  That's why I said that
> I need to run the threader myself and see why it needs all these
> additional crutches.
That the model we wont to go to -- but it's certainly not where we
are today.  Why?  Because the jump threader exposes more optimization
opportunities for DOM (including constant and copy propagations) and
DOM exposes more opportunities for the threader.  They are inherently
tied together with their current structure.

> 
> If the code has been cleaned up of redundancies, copies
> propagated, names have known values and ranges are set, then I
> don't see why we would need all the extra baggage.
Because threading exposes new opportunities to perform constant
and copy propagation and redundancy elimination.

> 
> > The iteration step we see today would turn into a worklist.
> > ie, after we thread jumps, we walk through the IL for PHIs
> > which represent a copy that can be propagated.
> >
> Ah, here's something specific I don't follow.  Why would you have
> these PHIs anymore?  If all the arguments were ultimately
> equivalent, then the various propagators are bound to have
> removed them.  If not, that sounds like a bug in them.
They are exposed as a result of jump threading.  For example we might
have something like:

BB 3:
x3 = PHI (x2 (0), x1 (1), x1 (2));
[ ... ]

if (cond) goto X, else goto Y

If jump threading manages to determine where the conditional will go
when BB3 is reached via BB0, then we'll be able to remove the PHI
argument associated with the edge from BB0 to BB3.  That in turn exposes
a copy propagation opportunity (x3 = x1).

Or when we thread a jump we might expose a new redundancy because the
dominator graph changed.  Whenever we expose a redundancy we also expose
a copy propagation opportunity.



> 
> > What's nice about this is the bulk of DOM as we know it today
> > disappears along with the expensive iteration in the case when
> > jumps are threaded. We just need the right APIs for copy
> > propagation and value handles.
> > 
> Again, why?  The propagators are supposed to have done this
> already.  They are all supposed to work in conditional regions.
You really don't seem to understand what the threader does and how its
actions can expose new optimization opportunities.  I might suggest you
look very closely at what happens for a test like 20030714-2.c which is
derived from real code.  As we thread jumps, we expose new redundancy
elimination opportunities, which in turn expose new copy propagation
opportunities.

> > We could still potentially use ASSERT_EXPRs to encode
> > information about edge equivalences, though we probably would
> > still want the ASSERT_EXPR to appear as statement rather than
> > the RHS of a MODIFY_EXPR.
> > 
> Odd, the nice thing about assertions being on the RHS is that you
> pin their result to a specific SSA name that you then get to use
> at every place naturally dominated by the assertion.
> 
> If you could give me a concrete test case that I could sink my
> teeth in, that'd be great.
Go back the PR I've referenced twice.

jeff