| Summary: | [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Andrew Pinski <pinskia> |
| Component: | tree-optimization | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | berndtrog, bje, dnovillo, gcc-bugs, giovannibajo, jeffreyalaw, kazu |
| Priority: | P2 | Keywords: | compile-time-hog |
| Version: | 4.0.0 | | |
| Target Milestone: | 4.0.0 | | |
| Host: | | Target: | |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2004-12-23 13:31:01 |
| Bug Depends on: | 10588, 17895 | | |
| Bug Blocks: | | | |
| Attachments: | An experimental patch for this bug; patch | | |
Description
Andrew Pinski
2004-05-18 19:43:00 UTC
The top 5 functions in a profile:

36.5% label_to_block <-- this function is small and could be inlined.
24.9% simple_cst_equal
12.1% find_taken_edge
5.5% cached_make_edge
3.4% dom_opt_initialize_block
11% remaining.

The patch here should help: <http://gcc.gnu.org/ml/gcc-patches/2004-05/msg01150.html>.

Confirmed.

I forgot to say that all of the time spent in simple_cst_equal comes from when find_taken_edge calls it.

Got a patch.

With my case grouping patch, the top offender according to oprofile is dom_opt_initialize_block:

samples   %        app name  symbol name
21264210  36.4196  cc1       dom_opt_initialize_block

which is consistent with:

dominator optimization: 79.10 (64%) usr 0.00 ( 0%) sys 79.11 (63%) wall

The other ones Andrew mentioned are blown away from the top of the profile:

309485  0.5301  cc1  label_to_block
2361    0.0040  cc1  find_taken_edge

...and the tree_int_cst_{lt,compare} functions are way below these.

I should note that at -O0, 3.5.0 is already faster than 3.4.0, about a 4x improvement (even with checking enabled on 3.5.0 and not on 3.4.0).

Now only label_to_block is really a problem on powerpc-apple-darwin. I will note that on powerpc-apple-darwin, for some reason, same_case_target_p is not being inlined into expand_end_case_type. Also, label_rtx could be inlined into expand_end_case_type, which should help the -O0 case too.

Subject: Re: [3.5 Regression] jump threading on trees is slow with switch statements with large # of cases
On Monday 31 May 2004 23:01, pinskia at gcc dot gnu dot org wrote:
> Now only label_to_block is really a problem on powerpc-apple-darwin.
No. label_to_block is fine as it is. The real problem is something
of quadratic-like behavior:
1) walk every dom successor
2) walk a big part of the case label vector
which is (N_dom_children * TREE_VEC_LENGTH (label_vector)), and
since both typically are of the same order of magnitude, this is
quadratic behavior.
Walking the case label vector once is possible if the information
extracted in the dominated block is instead precomputed in the
dominating block (and probably attach the unique case label to the
edge's or the block's aux field, or something like that).
This is not forgotten, but DOM is about to go away entirely thanks to Diego. If this bug still exists after DOM is gone, Ben Elliston's work to make edges a vector instead of a list will make it much, much easier to fix.

This bug is the reason for a lot of slowness in various machine-generated files, including insn-attrtab for example.

Jeff is rewriting the jump threader. I'm adding him to the CC: list so he's aware of this problem. Hopefully the new threader will fix it.

Created attachment 7163 [details]
An experimental patch for this bug
I'm pondering on something like the attached patch.
The patch makes us walk the case label vector only
once, but we do walk all successor edges of blocks
ending in a switch statement twice now. This could
probably be reduced to once - I'm not sure about it
but it's worth trying.
In any case, this part of the algorithm should not be
quadratic anymore with the patch.
This patch was bootstrapped on amd64. I haven't got
any numbers for the compile-time improvement, but the gut
feeling is that, unfortunately, even with this patch
DOM is still unbelievably slow.
For the test case in this PR, this patch definitely helps a lot. For more realistic code, another bottleneck in DOM means there really isn't a significant difference (there is of course still a good speedup, but it's insignificant compared to the total compile time).

Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Fri, 2004-09-17 at 17:50, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org 2004-09-17 23:50 -------
> Created an attachment (id=7163)
> --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=7163&action=view)
> An experimental patch for this bug
>
> I'm pondering on something like the attached patch.
> The patch makes us walk the case label vector only
> once, but we do walk all successor edges of blocks
> ending in a switch statement twice now. This could
> probably be reduced to once - I'm not sure about it
> but it's worth trying.
>
> In any case, this part of the algorithm should not be
> quadratic anymore with the patch.
>
> This patch was bootstrapped on amd64. I haven't got
> any numbers for the compile-time improvement, but the gut
> feeling is that, unfortunately, even with this patch
> DOM is still unbelievably slow.
There may be a better approach for attacking this problem.
After we build the CFG, but before we go into SSA form we can insert
assignments on outgoing edges from the switch. ie, if we had
switch (x)
{
case 42:
whatever
}
[ Assuming there's no other path to the code for "whatever" ]
We can turn that into
switch (x)
{
case 42:
x = 42;
whatever
}
Once that's done standard copy propagation will take care of
everything for us and we no longer have to try and associate
a value with an outgoing edge from the switch in the optimizers.
That method dovetails nicely with some suggestions Kenny made
at the GCC summit. In fact, you can do that with some conditionals
as well which would obliterate the need for parts of
get_eq_expr_value.
The open questions in my mind for these kinds of things are:
1. Is it really necessary to mark these assignments as undeletable
until late in the SSA path? (That's something Kenny suggested
as well.)
Not deleting them means they're around through most/all the
optimization passes, which is good as we may not see the
opportunity to use those equivalences until later in the
optimizers.
But not deleting them can also mean we generate and keep a
bunch of "useless" PHI nodes hanging around, which uses time
and space and might even interfere with some optimizations.
2. Is it worth the effort to be able to add these assignments in
later? Say for example that instead we originally had
switch (x)
{
case 42:
case -1:
}
And we prove that the -1 case can never occur (say because we
know that X has the range 0..maxint via VRP).
In that case, does it make sense to be able to insert a copy
x = 42 into the appropriate location once we make the "-1"
case label go away? [ And if it makes sense, can we do so
efficiently? ]
Anyway, that might ultimately be a better way to handle these
kinds of equivalences created by traversing edges out of
COND_EXPRs and SWITCH_EXPRs.
A similar scheme may be usable to record equivalences created
by stuff like
if (x < y)
goto BB2
else
goto BB4
Into:
t0 = (x < y)
t1 = (x > y)
t2 = (x >= y)
t3 = (x == y)
if (t0)
t0 = 1;
t1 = 0;
t2 = 0;
t3 = 0;
goto BB2
else
t0 = 0;
t2 = 1;
goto BB4
We'd use formal temporaries so that all instances of x < y would
get mapped to the same formal temporary.
I believe that would zap all of get_eq_expr_value, record_cond,
record_dominating_conditions and the like. It might have some
other unpleasant interactions. But it might be worth investigating.
Thoughts?
Jeff
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Fri, 2004-09-24 at 02:13, law at redhat dot com wrote:
> We can turn that into
>
> switch (x)
> {
> case 42:
> x = 42;
> whatever
> }
>
This is what I'm working with as part of VRP. The idea is to insert
ASSERT_EXPRs that act as special assignments. I have a WIP patch if you
are interested, but at the moment I'm paying more attention to fixes
for 4.0.

> 1. Is it really necessary to mark these assignments as undeletable
> until late in the SSA path (that's something Kenny suggested
> as well).
>
The ASSERT_EXPRs are naturally undeletable if they are used. When we
find:

if (x_3 == y_9)
  {
    x_4 = ASSERT_EXPR (x_3 == y_9)
    if (x_4 > 4)
      ...
  }

If you end up knowing something about y_9, you may be able to fold the
inner conditional into a constant. If you never use x_4, then the
ASSERT_EXPR is naturally dead and goes away.

ASSERT_EXPRs are inserted before we go into SSA and are treated as
regular IL elements. VRP would propagate and fold them using the SSA
propagator. This part is still to be done. Initially I had fused this
with CCP, and that didn't work too well.

Diego.

Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Fri, 2004-09-24 at 04:29, dnovillo at redhat dot com wrote:
> ------- Additional Comments From dnovillo at redhat dot com 2004-09-24 10:29 -------
> Subject: Re: [4.0 Regression] jump threading
> on trees is slow with switch statements with large # of cases
>
> On Fri, 2004-09-24 at 02:13, law at redhat dot com wrote:
>
> > We can turn that into
> >
> > switch (x)
> > {
> > case 42:
> > x = 42;
> > whatever
> > }
> >
> This is what I'm working with as part of VRP. The idea is to insert
> ASSERT_EXPR that act as special assignments. I have a WIP patch if you
> are interested, but I at the moment I'm paying more attention to fixes
> for 4.0.
But for the case of a simple x == 42 or a case statement such as
I've shown above, an ASSERT_EXPR is, well, silly. Just issue
x = 42 in the appropriate place and let's get on with real work :-)
Of course given a conditional such as
if (x == 42)
goto BB2
else
goto BB4
Using ASSERT_EXPR I would think you'd really want
if (x == 42)
x = 42;
goto BB2
else
x = ASSERT_EXPR (x != 42)
ie, you use the ASSERT_EXPR to assert something about the value of
an expression, not the value of objects within the expression. And
if that is the case, then the LHS of these assert expressions ought
to be a new formal temporary. And if I've understood all this
correctly, you can get precisely the same behavior without using
ASSERT_EXPR.
t = (x == 42)
if (t)
t = 1;
goto BB2;
else
t = 0;
goto BB4;
You set up an equivalency between t and (x == 42) when you hit the
assignment, then you add the constant true to the equivalency list
in the true arm (removing it of course when you're done with the
true arm). Then add the constant false to the equivalency list
in the false arm.
This is a perfect example of why I want to have an equivalency list,
much like the tree-vn code provides.
So before you go too much further with ASSERT_EXPR I think we need
to reach an agreement that ASSERT_EXPR is actually useful. I believe
you can do the same thing without ASSERT_EXPR using facilities that
already exist today.
Jeff
PR14859 is not directly related, but whatever fixes this bug probably shouldn't stop case labels from being combined.

So, after playing with the idea of inserting copies/constant initializations on outgoing edges from COND_EXPRs and SWITCH_EXPRs and killing all the special code in DOM which handles edge equivalences, I'm pretty sure that's the wrong approach. The fundamental problem is we would need to insert new copies anytime we change the condition in a COND_EXPR or SWITCH_EXPR. Consider:

  D.15513 = insn->code;
  D.15514 = (unsigned int) D.15513;
  switch (D.15514)
    {
      case 5: goto <L17>;
      case 6: goto <L2>;
      case 7: goto <L3>;
      case 8: goto <L22>;
      case 9 ... 10: goto <L0>;
      default : goto <L23>;
    }
  # SUCC: 22 1 21 3 2 16

  [ ... ]

  # BLOCK 3
  # PRED: 0
  <L3>:;
  _rtx = insn;
  D.15521 = _rtx->code;
  if (D.15521 != 7) goto <L4>; else goto <L5>;
  # SUCC: 5 (false) 4 (true)

Now we all can see that the conditional at the end of block #3 is always false. Let's see what happens with the copy insertion approach. After copy insertion and rewriting into SSA we would have:

  D.15513_11 = insn_10->code;
  D.15514_12 = (unsigned int) D.15513_11;
  switch (D.15514_12)
    {
      case 5: goto <L17>;
      case 6: goto <L2>;
      case 7: goto <L3>;
      case 8: goto <L22>;
      case 9 ... 10: goto <L0>;
      default : goto <L23>;
    }
  # SUCC: 22 1 21 3 2 16

  [ ... ]

  # BLOCK 3
  # PRED: 0
  <L3>:;
  D.15514_28 = 7;
  _rtx_29 = insn_10;
  D.15521_30 = _rtx_29->code;
  if (D.15521_30 != 7) goto <L4>; else goto <L5>;
  # SUCC: 5 (false) 4 (true)

Which is exactly what I would expect with this copying method. We know that D.15514 has the value 7 at the start of block 3, so we put in a suitable assignment/copy. Note carefully that the condition tests a different underlying variable.

After DOM we have:

  D.15513_11 = insn_10->code;
  D.15514_12 = (unsigned int) D.15513_11;
  switch (D.15513_11)
    {
      case 5: goto <L17>;
      case 6: goto <L2>;
      case 7: goto <L3>;
      case 8: goto <L22>;
      case 9 ... 10: goto <L0>;
      default : goto <L23>;
    }
  # SUCC: 22 1 21 3 2 16

  [ ... ]

  # BLOCK 3
  # PRED: 0
  <L3>:;
  _rtx_29 = insn_10;
  D.15521_30 = D.15513_11;
  if (D.15513_11 != 7) goto <L4>; else goto <L5>;
  # SUCC: 5 (false) 4 (true)

Ugh. The copy we inserted turned out to be totally useless for determining that the condition at the end of block #3 is always false. Argggh.

To make this scheme work we'd have to do copy insertions anytime the COND_EXPR_COND or the SWITCH_EXPR's condition changed. Worse yet, we'd need to call rewrite_ssa_into_ssa to keep things in SSA form. Double Ugh. The more I think about this, the more I'm convinced that copy insertions aren't the solution. Too bad, since this scheme took DOM totally off the radar for this testcase.

I'm pondering other approaches based on some of the ideas in the patch. Note that we may see similar problems with the ASSERT_EXPR scheme that has been discussed as well.

Note the CFG cleanup problem is filed as PR 17895.

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

This patch drastically reduces the worst case cost for finding equivalences associated with outgoing edges from switch statements.

Previously, upon entry into a block we would search for any equivalences associated with the edge we took into the block. If that edge was an outgoing edge from a switch statement, we had to walk the entire switch vector to determine if traversing the edge created an equivalence. Thus searching for these equivalences was effectively O(N^2), where N is the number of outgoing edges from a switch statement.

With this patch we reduce that to O(N) by scanning the switch vector once and recording any/all equivalences during that single pass over the switch vector. We can then query the recorded information as we enter blocks during the dominator walk.

To give you some idea of how this affects worst-case behavior, the dominator optimizer takes 41.76 seconds (19% of total compilation time) for the testcase in PR 15524. After this patch the dominator optimizer takes 3.86 seconds (2% of total compilation time). I've seen no measurable impact on more normal code from this version of the patch.

It turns out that it's fairly simple to record equivalences from COND_EXPRs in the same data structure. This has the nice side effect of reducing the number of calls to invert_truthvalue and thus reducing the amount of GC garbage we create. I don't have data for that handy, though.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7434 [details]
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

More work to speed up 15524. I was rather puzzled that "expand" was being charged with 30% of the compilation time. I originally thought there was some lameness in the switch statement expanders. However, it turned out I was totally wrong.

The culprit is actually the code to record the number of iterations of each loop. We never pushed a timevar for that pass, and thus the time was charged to whatever was on the top of the timevar stack (which happens to be TV_EXPAND).

The code to record the number of loop iterations was being rather dumb. It basically did something like:

foreach loop
  foreach block
    if block is in loop
      code

Clearly the problem is the second loop -- it's walking *all* the blocks and testing for loop membership. Ouch. It's far more efficient to use get_loop_body (which does a DFS walk starting at the beginning of the loop to the end of the loop).

To give you some idea, here's some data. The original time without my patch:

expand : 54.55 (30%) usr 0.03 ( 4%) sys 55.21 (30%)
TOTAL  : 181.76       0.85           185.17

With my patch:

expand : 6.78 ( 5%) usr 0.05 ( 6%) sys 7.20 ( 5%)
TOTAL  : 134.20      0.83           137.19

Not bad for 10 minutes of work.

I took the liberty of adding a new timevar for the loop bounds recording pass and killing an unused field within the loop structure.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7450 [details]
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Sun, 31 Oct 2004, Jeffrey A Law wrote:
> More work to speed up 15524. I was rather puzzled that "expand" was
> being charged with 30% of the compilation time. I originally thought
> there was some lameness in the switch statement expanders. However,
> it turned out I was totally wrong.
>
> The culprit is actually the code to record the number of iterations
> of each loop. We never pushed a timevar for that pass and thus the
> time was charged to whatever was on the top of the timevar stack
> (which happens to be TV_EXPAND).
>
> The code to record the number of loop iterations was being rather
> dumb. It basically did something like:
>
> foreach loop
>   foreach block
>     if block is in loop
>       code
>
> Clearly the problem is the second loop -- it's walking *all* the blocks
> and testing for loop membership. Ouch.

I actually had fixed this as part of my last patch, but was waiting
till Monday to commit it.

Jeff, what is the compile-time situation on mainline now, compared to
3.4 or even older versions (2.95)?

We are still way behind 3.3: the testcase takes 15 seconds on my 1.5GHz
PPC 7400 with 3.3, but for the 4.0 time just look at Jeff's data and
see that we are way behind still.

Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Mon, 2004-11-01 at 05:16 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org 2004-11-01 05:16 -------
> We are still way behind 3.3, it takes 15 seconds on my 1.5GHz PPC 7400 with 3.3 but with 4.0, well for
> 4.0 time just look at Jeff's data and see that we are way behind still.
Well, I haven't looked at 3.3, but I can make a reasonable guess that
we're still way way way behind due to the way we update case labels
when forwarding edges and splitting critical edges. Some early
experiments I've done with that indicate there's another 30-35%
improvement that can be made by fixing that problem. Then there's
*another* 30% or so we're burning in the RTL branch prediction code
(I haven't looked to see if there's anything we can do with that code
yet).
I doubt it makes much sense to look closely at 3.3 vs 4.0 for this
testcase and similar code until we fix those glaring problems.
It's also not clear to me how much of an improvement those changes
will make in real-world code. ie, we could easily run into a case
where we drastically improve that testcase without improving any
real code, much like what happened recently with my changes to
improve how we find/record equivalences for edges.
Jeff
Subject: RE: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

I created a special function for the split_critical_edges pass to split
edges out of SWITCH_EXPR blocks in linear time wrt. the number of
outgoing edges E and the number of case labels C. The function makes
the edge splitting take O(E+2C) time, instead of E*C, which is
effectively quadratic.

The effect for the test case of this PR is huge. The effect on
everything else (for example insn-attrtab) is not measurable.

Dunno if this is worth the trouble. The patch I have is a hack, and the
real solution is using a smarter representation for the case label
targets. I'll try some other code and perhaps post the patch. We might
want to just do it as a hack on the gcc 4.0 release branch...

> The effect for the test case of this PR is huge.
> The effect on everything else (for example insn-attrtab) is not
> measurable.
We have had many bug reports from users trying to compile interpreter-like code
which is machine-generated and effectively made of giant switch statements.
Improving this PR will definitely make their code compile faster.
In fact, if we are going to have a performance testsuite for GCC (as I asked
the SC recently), I would surely propose putting such a testcase in it.
Subject: RE: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

OK, then can you see if this hack helps...? I'll be the first to admit
that it's gross, but it bootstraps, and if it helps (which I haven't
tested) then, well, maybe, you know, eh, ...

Created attachment 7453 [details]
patch
Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
Hi Steven,
> OK, then can you see if this hack helps...?
+ /* Step 4: Update the case label vector. */
+ TREE_VEC_LENGTH (label_vec) = n;
+ for (i = 0; i < n; ++i)
+ {
+ tree elt = TREE_VEC_ELT (label_vec, i);
+ e = case_to_edge[i];
+ CASE_LABEL (elt) = tree_block_label (e->dest);
+ }
Wow, this is barely safe. You are assuming that the edge redirection
is done by changing e->dest and that cfg.c:redirect_edge_succ_nodup
does not remove e, which is actually the case because there is no
existing edge from the block containing SWITCH_EXPR to the new basic
block created by split_edge.
In any case, I agree this should work. I'll be the second one to
admit this is gross. :-)
Kazu Hirata
Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Mon, 2004-11-01 at 19:35 +0000, kazu at cs dot umass dot edu wrote:
> ------- Additional Comments From kazu at cs dot umass dot edu 2004-11-01 19:34 -------
> Subject: Re: [4.0 Regression] jump threading
> on trees is slow with switch statements with large # of cases
>
> Hi Steven,
>
> > OK, then can you see if this hack helps...?
>
> + /* Step 4: Update the case label vector. */
> + TREE_VEC_LENGTH (label_vec) = n;
> + for (i = 0; i < n; ++i)
> + {
> + tree elt = TREE_VEC_ELT (label_vec, i);
> + e = case_to_edge[i];
> + CASE_LABEL (elt) = tree_block_label (e->dest);
> + }
>
> Wow, this is barely safe. You are assuming that the edge redirection
> is done by changing e->dest and that cfg.c:redirect_edge_succ_nodup
> does not remove e, which is actually the case because there is no
> existing edge from the block containing SWITCH_EXPR to the new basic
> block created by split_edge.
>
> In any case, I agree this should work. I'll be the second one to
> admit this is gross. :-)
I think it'll ultimately be cleaner to simply drop the labels after
we've built the CFG and associate an edge with each of the cases.
We'd still have the assumption that edge redirection happens by
changing e->dest, but I think that approach is otherwise reasonably
clean.
Probably the ugliest part of that approach is the need to have edges
in the switch statement.
The nice thing is this approach matches pretty well with what we've
done with simple gotos.
Jeff
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
> I think it'll ultimately be cleaner to simply drop the labels after
> we've built the CFG and associate an edge with each of the cases.
Yes, this is what I would really want to do, but I don't know if such
a patch would be safe for GCC 4.0. I suppose we could introduce a fake
'x' class tree which is really just a VEC(edge) plus a tree_common, and
stuff that into SWITCH_LABELS (which would at that point no longer be
an appropriate name of course ;-). The trouble is that we still need
to recreate the labels for expand, that's the ugly part.
Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Mon, 2004-11-01 at 16:56 +0000, giovannibajo at libero dot it wrote:
> ------- Additional Comments From giovannibajo at libero dot it 2004-11-01 16:55 -------
> > The effect for the test case of this PR is huge.
> > The effect on everything else (for example insn-attrtab) is not
> > measurable.
>
> We have had many bug reports from users trying to compile interpret-like code
> which is machine generated and is effectively made by giant switch statements.
> Improving this PR will definitely make their code go faster.
>
> In fact, if we are going to have a performance testsuite for GCC (as I asked to
> the SC lately), I would surely propose to put such a testcase in there.
But how often are those codes going to trigger jump threading and
edge splitting? Just having large switches isn't enough to cause us
to trip over the representational issues that are giving us so many
compile-time problems.
jeff
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Mon, 2004-11-01 at 20:17 +0000, stevenb at suse dot de wrote:
> ------- Additional Comments From stevenb at suse dot de 2004-11-01 20:17 -------
> Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
>
> > I think it'll ultimately be cleaner to simply drop the labels after
> > we've built the CFG and associate an edge with each of the cases.
>
> Yes, this is what I would really want to do, but I don't know if such
> a patch would be safe for GCC 4.0.

I think this can be done in a reasonably localized way. We convert to
using edges just after building the CFG, and we return to using labels
after we're done with the SSA path. I don't expect we'll have much (if
any) code that has to handle both representations.

> I suppose we could introduce a fake
> 'x' class tree which is really just a VEC(edge) plus a tree_common, and
> stuff that into SWITCH_LABELS (which would at that point no longer be
> an appropriate name of course ;-).

Or break out case labels in a manner similar to what we do with other
nodes that have "interesting" fields.

> The trouble is that we still need
> to recreate the labels for expand, that's the ugly part.

Na. That's trivial.

jeff

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

I don't know about jump threading, but edge splitting in such code is very likely, because we pre-split all critical edges for GVN-PRE and I can imagine that in such interpreter-like code you'd have lots of critical edges. But maybe not...

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

With loop bounds recording no longer charged to the expander, it's time to deal with the inefficiencies in the switch expander.
Basically we have code which does:

for each case
  for each case
    see if the label in the outer loop matches any of the cases in the
    inner loop (up to the current case in the outer loop)

Needless to say, this can get rather expensive. This patch avoids the inner loop completely by keeping track of the target labels we've seen in a bitmap and only incrementing the unique target count when we encounter a target which hasn't been seen.

Before this patch we got the following times for expand in pr15524:

expand : 6.53 ( 5%) usr 0.03 ( 4%) sys 6.56 ( 5%)

After this patch we see:

expand : 0.71 ( 1%) usr 0.02 ( 3%) sys 0.75 ( 1%)

Not bad. Unfortunately, no measurable difference on insn-attrtab.i and friends. Oh well.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7466 [details]
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Nov 3, 2004, at 10:03 AM, Jeffrey A Law wrote:
> With loop bounds recording no longer charged to the expander it's time
> to deal with the inefficiencies which are in the switch expander.
>
> Basically we have code which does
>
> for each case
>   for each case
>     see if the label in the outer loop matches any of the cases in the
>     inner loop (up to the current case in the outer loop).

We don't really need some of this code anyways, because the gimplifier
now reorders the case statements so that you don't have to do the loop
at all.

Thanks,
Andrew Pinski

Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Wed, 2004-11-03 at 15:08 +0000, pinskia at physics dot uc dot edu
wrote:
> ------- Additional Comments From pinskia at physics dot uc dot edu 2004-11-03 15:08 -------
> Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
>
>
> On Nov 3, 2004, at 10:03 AM, Jeffrey A Law wrote:
>
> >
> > With loop bounds recording no longer charged to the expander it's time
> > to deal with the inefficiencies which are in the switch expander.
> >
> > Basically we have code which does
> >
> > for each case
> > for each case
> > see if the label in the outer loop matches any of the cases in the
> > inner loop (up to the current case in the outer loop).
>
> We don't really need some of this code anyways because the gimplifier
> now reorders the case statements so that you don't have to do the loop
> at all.
Note carefully we are looking at the destination labels for each
case within the switch. Thus to be able to avoid this loop we would
have to know that, even after the optimizers run, all cases which
transfer control to the same destination are grouped together.
jeff
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Wed, 3 Nov 2004 10:07:58 -0500, Andrew Pinski <pinskia@physics.uc.edu> wrote:
> On Nov 3, 2004, at 10:03 AM, Jeffrey A Law wrote:
> > With loop bounds recording no longer charged to the expander it's time
> > to deal with the inefficiencies which are in the switch expander.
> >
> > Basically we have code which does
> >
> > for each case
> >   for each case
> >     see if the label in the outer loop matches any of the cases in the
> >     inner loop (up to the current case in the outer loop).
>
> We don't really need some of this code anyways because the gimplifier
> now reorders the case statements so that you don't have to do the loop
> at all.

I believe I have seen patches from Kazu that address this.

Richard.

Richard,

My patches to expand_case does not change its asymptotic behavior.

Jeff,

You might want to use SBITMAP instead of BITMAP in your patch to
expand_case because the bitmap you construct will be dense, and you
know its size in advance.

Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Wed, 2004-11-03 at 16:33 +0000, kazu at cs dot umass dot edu wrote:
> ------- Additional Comments From kazu at cs dot umass dot edu 2004-11-03 16:33 -------
> Richard,
>
> My patches to expand_case does not change its asymptotic behavior.
>
> Jeff,
>
> You might want to use SBITMAP instead of BITMAP in your patch to
> expand_case because the bitmap you construct will be dense, and you know
> its size in advance.
No, you don't know its size or density in advance. We're not looking
at case values, but at the label where each case statement will jump to.
Jeff
Jeff,

Oops, you're right!

Kazu Hirata

Subject: RE: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
> No you don't know it's size or density in advance. We're not looking
> at case values, but at the label where each case statement will jump to.
You could move the expand code for SWITCH_EXPRs out of
expand_expr, and call it from cfgexpand.c, passing the
number of outgoing edges.
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
We've known for a little while that the compile-time behavior of the prediction code on code like pr15524 is, err, poor at best.

The first problem is that the prediction code, much like the code to record loop bounds, was walking all the blocks in the function and performing a member test on them. To give you an idea of how expensive this style of coding is, here's the timevar info for pr15524 before my patch:

 branch prediction     :  44.17 (35%) usr   0.01 ( 1%) sys  44.28 (32%)

OUCH. Over 1/3 of our total compilation time is spent in the branch prediction code. I changed the code to record the blocks to visit in a bitmap and to iterate over them with EXECUTE_IF_SET_IN_BITMAP. After that change we get:

 branch prediction     :  11.23 (12%) usr   0.00 ( 0%) sys  11.25 (12%)

Which is a nice improvement.

However, there's clearly an algorithmic problem in this code. If I'm reading the code correctly, it has a fundamental problem in that the nodes in the inner loops are processed multiple times (once per loop nest). It would seem to me we could get a nice speedup by aggregating and recording the information from inner loops, then not walking them again and instead using the cached information.

Anyway, it's a pretty straightforward change and gives us a nice speedup on the 15524 code. Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7469 [details]
ZZZ
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
> However, there's clearly an algorithmic problem in this code.
There is. The loop predictors are quadratic in the loop nest depth.
Honza already suggested predicting the loops from the innermost one
outward, and just not predicting the outer loops with such an expensive
algorithm.
It seems that connect_infinite_loops_to_exit is the next bottleneck for this test case. Probably not important enough to care about for GCC 4.0 (would *any* real-world code ever consist of 30000-odd infinite loops in a single switch statement? ;-)

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
Ho hum. Another 12% improvement. There are two closely related changes in this patch.

First, we can drastically speed up find_edge for code with large switch statements. Right now we call find_edge with a src/dst edge pair. find_edge walks src->succs to find a successor with a matching destination block. That is usually a pretty efficient scheme, as most blocks have one or two successors and thus the search terminates quickly. However, when handling large switch statements that scheme can perform poorly. Instead it is sometimes better to walk dst->preds searching for the right src/dst pair. Previously we had no good way to determine which list to search through. However, with the recent work on our pred/succ list representations we can trivially check which of the two lists is shorter. We (of course) search the shorter list :-)

Second, redirect_edge_succ_nodup actually implements find_edge inline and thus it does not perform well on blocks with lots of successors. Clearly this code can (and does) benefit from calling the significantly improved find_edge.
Key timevars before the change:

 cfg cleanup           :   2.80 ( 3%) usr   0.00 ( 0%) sys   2.78 ( 3%)
 tree CFG cleanup      :  34.23 (36%) usr   0.00 ( 0%) sys  34.26 (35%)
 dominator optimization:   5.11 ( 5%) usr   0.02 ( 2%) sys   5.13 ( 5%)
 tree split crit edges :  17.36 (18%) usr   0.00 ( 0%) sys  17.36 (18%)
 TOTAL                 :  96.21          0.82            97.53

Key timevars after the change:

 cfg cleanup           :   0.62 ( 1%) usr   0.00 ( 0%) sys   0.62 ( 1%)
 tree CFG cleanup      :  28.27 (33%) usr   0.00 ( 0%) sys  28.33 (32%)
 dominator optimization:   3.77 ( 4%) usr   0.01 ( 1%) sys   3.80 ( 4%)
 tree split crit edges :  15.58 (18%) usr   0.00 ( 0%) sys  15.57 (18%)
 TOTAL                 :  84.81          0.81            87.92

Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7495 [details]
PPP
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
Whee, almost another 30% improvement...

In this installment we're attacking the insane amount of time we're taking to insert fake edges in the CFG to connect infinite loops to the exit block (which we do 3 times during the course of compiling this testcase). The testcase in pr15524 is somewhat unusual in that it has a huge number of infinite loops. In fact, that's about all the testcase is once jumps are threaded.

I agree with Steven that it is very unlikely that any normal-looking code will have similar numbers of unreachable loops and a large enough CFG to trigger the insane compile-time behavior we see with pr15524. However, I do think it is worth fixing this problem, particularly when we realize that we're doing unnecessary work any time we encounter a function with an infinite loop, and that it's pretty trivial to avoid the unnecessary work.

The fundamental problem is that after DFS walking the reverse CFG we start searching blocks from EXIT to ENTRY looking for the first one which has not been visited. We perform that search each time the reverse CFG DFS walk stops (ie, each time we find an infinite loop), and each time we start the search at the EXIT block. For any function with an infinite loop we're wasting compilation time searching blocks multiple times. In 15524 we happen to have enough blocks and enough infinite loops to make the lameness of our implementation obvious.

Luckily fixing this is quite trivial. All we have to do is pass in the unvisited block from the previous search and use that as the starting point for the loop instead of EXIT_BLOCK. This basically takes PRE and the branch predictors off the radar for this testcase.
Before the fix:

 tree PRE              :  12.69 (15%) usr   0.01 ( 1%) sys  12.73 (15%)
 branch prediction     :  11.36 (14%) usr   0.00 ( 0%) sys  11.37 (13%)
 TOTAL                 :  83.16          0.79            84.45

After the fix:

 tree PRE              :   0.12 ( 0%) usr   0.02 ( 3%) sys   0.14 ( 0%)
 branch prediction     :   0.41 ( 1%) usr   0.01 ( 1%) sys   0.41 ( 1%)
 TOTAL                 :  59.54          0.80            61.29

[ At this point our SWITCH_EXPR issues account for roughly 70% of the remaining compilation time. Yup, that's right, I'd expect a reduction from ~60 seconds to around 15-25 seconds depending on how well we solve our SWITCH_EXPR issues. ]

Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7514 [details]
PPP
GRRR. I thought I had settled on a scheme that would address the SWITCH_EXPR representation issues with 15524. It was a combination of a hash table to get us from an edge to a CASE_LABEL_EXPR and the "CASE_LEADER" concept from Kazu. It works amazingly well on 15524 -- cuts compile-time by just over 70% -- and should have been suitable for more general code.

The problem is we don't have a way to get notification of edge deletions, and there are too many places where we can get into remove_edge without first going through ssa_remove_edge, which is where I had put my code to handle edge removals from the hash table. It looks like we might be stuck building the hash table lazily in the CFG cleanup code and when splitting critical edges. Ugh.

I may go forward with the CASE_LEADER changes separately -- they should give a measurable benefit, and can work with or without the hash table to map from edge to CASE_LABEL_EXPRs.

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
And the saga continues....

This patch introduces the concept of a case leader to CASE_LABEL_EXPR. As Kazu outlined, the case leader is the "representative value" for the partition of cases which reach the same destination block. The case leader is the only CASE_LABEL_EXPR which contains a LABEL_DECL; other members of the partition point to the case leader. We convert to the case leader representation as we create edges for each SWITCH_EXPR.

In this iteration of the change we still have to do a search of the case vector to find the case leader when we redirect an edge out of the SWITCH_EXPR. A future version is expected to use a hash table, but some kinks in that code still need to be worked out. Even though we're still doing a search of the case vector, this patch represents a big improvement for 15524. Instead of having to search the entire case vector when we redirect a jump, we search (on average) half the case vector to find the leader.
To give you an idea of how this affects compile time, we have the critical timevars before this patch:

 tree CFG cleanup      :  28.20 (49%) usr   0.06 ( 9%) sys  29.81 (48%)
 tree split crit edges :  14.55 (25%) usr   0.02 ( 3%) sys  14.83 (24%)
 TOTAL                 :  57.76          0.70            62.18

After this patch:

 tree CFG cleanup      :  12.29 (36%) usr   0.01 ( 2%) sys  12.35 (34%)
 tree split crit edges :   5.91 (17%) usr   0.00 ( 0%) sys   5.92 (16%)
 TOTAL                 :  33.89          0.57            36.65

That's a pretty good improvement. FWIW, a version which uses a persistent hash table for lookups of edge to case leaders brings the CFG cleanup and critical edge splitting phases down into the noise with a total time of < 18 seconds.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7538 [details]
PPP
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
OK. This ought to fix 18478, 18521 and any other breakages caused by the introduction of CASE_LEADER _and_ speed up 15524 considerably.

As Kazu determined, sharing nodes within the case vector has some rather unpleasant effects, and IMHO fixing them was going to create an unmaintainable mess in the tree copying code. This patch kills the concept of a CASE_LEADER; however, it introduces an edge-to-cases hash table which allows us to map from an edge to the CASE_LABEL_EXPRs which use that edge.

Because we do not have all the hooks in place to be notified of CFG changes, the hash table is not persistent. Instead we create it on demand in response to requests to redirect jumps out of SWITCH_EXPRs. The code is designed so that it will work with or without the hash table. That way we don't have to worry about correctness issues with passes that may redirect edges without creating the hash table. [ Such passes will simply run slower if they redirect a lot of edges out of SWITCH_EXPRs. ]

The effect on 15524 is dramatic. Before this change:

 tree CFG cleanup      :  15.14 (38%) usr   0.00 ( 0%) sys  15.16 (37%)
 tree split crit edges :   7.50 (19%) usr   0.00 ( 0%) sys   7.50 (18%)
 TOTAL                 :  40.27          0.80            41.09

After this change:

 tree CFG cleanup      :   0.46 ( 3%) usr   0.00 ( 0%) sys   0.47 ( 2%)
 tree split crit edges :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%)
 TOTAL                 :  18.35          0.81            19.18

FWIW, if we could have a persistent hash table which was built during CFG construction, the time for CFG cleanup would be effectively zero.

Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7563 [details]
PPP
*** Bug 17895 has been marked as a duplicate of this bug. ***

I will do some timings today but I think this is fixed now.

I think this is fixed, except I have not checked with checking turned off. With checking still on I get the following passes as hot:

 dominator optimization:  31.03 (42%) usr   0.02 ( 1%) sys  31.26 (40%) wall
 tree SSA verifier     :   5.64 ( 8%) usr   0.00 ( 0%) sys   5.68 ( 7%) wall
 rename registers      :   7.37 (10%) usr   0.02 ( 1%) sys   7.46 (10%) wall

But cfg cleanup went down a huge amount:

 tree CFG cleanup      :   1.23 ( 2%) usr   0.00 ( 0%) sys   1.25 ( 2%) wall

Before (20041113) on a slightly faster machine and different OS:

 tree CFG cleanup      :  33.36 (43%) usr   0.01 ( 1%) sys  33.41 (42%) wall
 tree split crit edges :  16.83 (22%) usr   0.01 ( 1%) sys  16.87 (21%) wall
 dominator optimization:   7.94 (10%) usr   0.03 ( 2%) sys   7.97 (10%) wall
 tree SSA verifier     :   2.57 ( 3%) usr   0.01 ( 1%) sys   2.53 ( 3%) wall
 rename registers      :   5.57 ( 7%) usr   0.03 ( 2%) sys   5.61 ( 7%) wall

Hmm, either DOM was just slowed down (because of checking) or something is really different between these two OSes. But as you can see, tree CFG cleanup and tree split crit edges have sped up a lot. I will try to get a new compiler built without checking turned on later today to get final numbers.

Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Thu, 2004-11-18 at 13:00 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org 2004-11-18 13:00 -------
> I will do some timings today but I think this is fixed now.
Please don't mark it as fixed. There's still some significant
lameness. I suspect there's another 10-20% improvement waiting
to be realized without doing anything terribly difficult.
jeff
Here are the current timings for the mainline and 3.3:

mainline:
[zhivago:~/src/localgccPRs] pinskia% time ~/local3/bin/gcc -S pr15524.c -O1
27.680u 1.240s 0:32.98 87.6% 0+0k 0+6io 0pf+0w

3.3:
[zhivago:~/src/localgccPRs] pinskia% time gcc -S pr15524.c -O1
14.260u 0.560s 0:16.30 90.9% 0+0k 0+4io 0pf+0w

At -O0:

mainline:
[zhivago:~/src/localgccPRs] pinskia% time ~/local3/bin/gcc -S pr15524.c
2.790u 0.350s 0:03.52 89.2% 0+0k 0+3io 0pf+0w

3.3:
[zhivago:~/src/localgccPRs] pinskia% time gcc -S pr15524.c
1.930u 0.280s 0:02.28 96.9% 0+0k 1+3io 0pf+0w

Note that this is on powerpc-darwin and with Apple's 3.3 (build 1650 or so). So we are still twice as slow as 3.3, but faster than before. Using -ftime-report I get:

 dominator optimization:  11.69 (42%) usr   0.13 ( 4%) sys  12.28 (38%) wall
 rename registers      :   3.32 (12%) usr   0.12 ( 4%) sys   3.62 (11%) wall

So only DOM needs to be sped up.

For DOM the problem comes from thread_across_edge when we go and update the profile of the BB. For rename registers, most of a profile of compiling this source comes from the loop in regrename.c around line 1761.

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
On Thu, 2004-11-18 at 18:23 +0000, pinskia at gcc dot gnu dot org wrote:
> So we are still twice as slow as 3.3 but faster than before.
> Using -ftime-report I get:
> dominator optimization: 11.69 (42%) usr 0.13 ( 4%) sys 12.28 (38%) wall
> rename registers : 3.32 (12%) usr 0.12 ( 4%) sys 3.62 (11%) wall
>
> So only DOM needs to be sped up.
>
> For DOM the problem comes from thread_across_edge when we go and update the profile of the BB.
That's part of the problem, but there's definitely more stuff going on than just that annoying useless divide operation in the code to update the profile. There are some issues in tree-ssa-threadupdate as well.

> For rename registers, most of a profile of compiling this source comes from the loop in regrename.c
> around line 1761.
Yes.
I know.
Jeff

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
This gets regrename off the radar. Look at this code from regrename.c and tell me if you can find the part that isn't particularly intelligent:

  FOR_EACH_BB (bb)
    {
      /* If a block has a single predecessor, that we've already processed,
         begin with the value data that was live at the end of the
         predecessor block.  */
      /* ??? Ought to use more intelligent queuing of blocks.  */
      if (EDGE_COUNT (bb->preds) > 0)
        for (bbp = bb; bbp && bbp != EDGE_PRED (bb, 0)->src; bbp = bbp->prev_bb)
          ;

      if (EDGE_COUNT (bb->preds) == 1
          && ! (EDGE_PRED (bb, 0)->flags & (EDGE_ABNORMAL_CALL | EDGE_EH))
          && EDGE_PRED (bb, 0)->src != ENTRY_BLOCK_PTR
          && bbp)
        all_vd[bb->index] = all_vd[EDGE_PRED (bb, 0)->src->index];
      else
        init_value_data (all_vd + bb->index);

      if (copyprop_hardreg_forward_1 (bb, all_vd + bb->index))
        need_refresh = true;
    }

OK. A hint. bbp isn't used/set anywhere outside this loop...

OK. Another hint. If BB has more than one predecessor, is the loop walking through the prev_bb blocks to set bbp even necessary?

That's right. We do that silly loop setting bbp even in cases where it's totally unnecessary.

Before:
 rename registers      :   3.27 (18%)
 TOTAL                 :  17.77

After:
 rename registers      :   0.19 ( 1%)
 TOTAL                 :  14.53

Yup, that's nearly a net 20% improvement for pr15524. I would expect a small improvement for more normal code.

[ Yes, I got the message regarding the bootstrap comparison failure on darwin. I'm going to try and reproduce it on a relatively old PPC box here. It'll take quite some time though. I do plan on flushing out one more patch while my PPC bootstrap is running. ]

Created attachment 7572 [details]
PPP
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
> [ Yes, I got the message regarding the bootstrap comparison failure
> on darwin. I'm going to try and reproduce it on a relatively old
> PPC box here. It'll take quite some time though. I do plan on
> flushing out one more patch while my PPC bootstrap is running. ]
There are problems on x86_64, SPARC, PowerPC and i686, see the PR.
Subject: Re: [4.0 Regression] jump threading
on trees is slow with switch statements with large # of cases
On Sat, 2004-11-20 at 21:33 +0100, Eric Botcazou wrote:
> > [ Yes, I got the message regarding the bootstrap comparison failure
> > on darwin. I'm going to try and reproduce it on a relatively old
> > PPC box here. It'll take quite some time though. I do plan on
> > flushing out one more patch while my PPC bootstrap is running. ]
>
> There are problems on x86_64, SPARC, PowerPC and i686, see the PR.
I haven't seen the comparison failure on i686 -- if I had I never
would have checked in the patch. PPC is the only other kind of box
I have here locally.
There's some other approaches I can (and will) try while the PPC
box is doing its thing.
jeff
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
> I haven't seen the comparison failure on i686 -- if I had I never
> would have checked in the patch. PPC is the only other kind of box
> I have here locally.
Right, the problem seems to be very sensitive to the environment. H.J. saw it
on AMD64 while I didn't.
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
And nearly another 20% improvement.

As mentioned in a previous message, the most expensive code right now for pr15524 is the code to rescale edge probabilities. This patch has two changes to improve this situation. First, if the scale factor is 1, then the probabilities will not change and thus there is no work to do. This is the common case for pr15524. Second, in the case where the scale factor is != 1, we compute the scale factor outside the loop, which saves us a divide on every iteration of the loop.

Before:
 dominator optimization:   3.50 (24%)
 TOTAL                 :  14.53

After:
 dominator optimization:   0.61 ( 5%)
 TOTAL                 :  11.84

Bootstrapped and regression tested on i686-pc-linux-gnu.

Created attachment 7578 [details]
PPP
This is now back to 3.3 timings, so closing. I filed PR 18601 for another case where tree cfg cleanup is slow; it is a variant of this testcase, but I have seen the same problem in another testcase that I cannot remember which one.

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
On Sun, 2004-11-21 at 21:55 +0000, pinskia at gcc dot gnu dot org wrote:
> ------- Additional Comments From pinskia at gcc dot gnu dot org 2004-11-21 21:55 -------
> This is now back to 3.3 timings so closing, I filed PR 18601 for another case where tree cfg cleanup is
> slow, it is a variant of this testcase but I have seen the same problem in another testcase but I cannot
> remember which one.
Don't close yet. There's still some fat to be trimmed.
jeff

> I am sure that Jeff can use Bugzilla himself, when he wants this bug closed.

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
It's clearly getting more difficult to find compile-time improvements for 15524. But that doesn't mean we're done :-)

As it turns out, we've got more code which is just inlined versions of find_edge, but without the ability to select which list (src->succ or dest->pred) to walk depending on their lengths. This patch replaces those inlined instances of find_edge with calls to find_edge. The big winner is tree CFG construction, which goes from .40 seconds to around .04 seconds. There are other wins scattered throughout the various passes. In all we go from 11.63 seconds of total time to 10.97 seconds of total time -- a net improvement of around 5%.

[ Please don't close this PR. I still think there's another 4-5% that we'll be able to get without major surgery. ]

Bootstrapped and regression tested on i686-pc-linux-gnu.

	* cfg.c (cached_make_edge): Use find_edge rather than
	an inlined variant.
	* cfgbuild.c (make_edges): Likewise.
	* cfghooks.c (can_duplicate_block_p): Likewise.
* cfgloop.c (loop_latch_edge): Likewise. * cfgloopmanip.c (force_single_succ_latches): Likewise. * cfgrtl.c (rtl_flow_call_edges_add): Likewise. * predict.c (predict_loops, propagate_freq): Likewise. * tracer.c (tail_duplicate): Likewise. * tree-cfg.c (disband_implicit_edges): Likewise. (tree_forwarder_block_p, tree_flow_call_edges_add): Likewise. Index: cfg.c =================================================================== RCS file: /cvs/gcc/gcc/gcc/cfg.c,v retrieving revision 1.76 diff -c -p -r1.76 cfg.c *** cfg.c 22 Nov 2004 12:23:43 -0000 1.76 --- cfg.c 22 Nov 2004 16:57:55 -0000 *************** cached_make_edge (sbitmap *edge_cache, b *** 288,294 **** { int use_edge_cache; edge e; - edge_iterator ei; /* Don't bother with edge cache for ENTRY or EXIT, if there aren't that many edges to them, or we didn't allocate memory for it. */ --- 288,293 ---- *************** cached_make_edge (sbitmap *edge_cache, b *** 309,320 **** /* Fall through. */ case 0: ! FOR_EACH_EDGE (e, ei, src->succs) ! if (e->dest == dst) ! { ! e->flags |= flags; ! return NULL; ! } break; } --- 308,319 ---- /* Fall through. */ case 0: ! e = find_edge (src, dst); ! if (e) ! { ! e->flags |= flags; ! return NULL; ! } break; } Index: cfgbuild.c =================================================================== RCS file: /cvs/gcc/gcc/gcc/cfgbuild.c,v retrieving revision 1.55 diff -c -p -r1.55 cfgbuild.c *** cfgbuild.c 28 Sep 2004 07:59:42 -0000 1.55 --- cfgbuild.c 22 Nov 2004 16:57:55 -0000 *************** make_edges (basic_block min, basic_block *** 271,277 **** enum rtx_code code; int force_fallthru = 0; edge e; - edge_iterator ei; if (LABEL_P (BB_HEAD (bb)) && LABEL_ALT_ENTRY_P (BB_HEAD (bb))) --- 271,276 ---- *************** make_edges (basic_block min, basic_block *** 390,401 **** /* Find out if we can drop through to the next block. */ insn = NEXT_INSN (insn); ! FOR_EACH_EDGE (e, ei, bb->succs) ! if (e->dest == EXIT_BLOCK_PTR && e->flags & EDGE_FALLTHRU) ! { ! insn = 0; ! break; ! 
} while (insn && NOTE_P (insn) && NOTE_LINE_NUMBER (insn) != NOTE_INSN_BASIC_BLOCK) --- 389,398 ---- /* Find out if we can drop through to the next block. */ insn = NEXT_INSN (insn); ! e = find_edge (bb, EXIT_BLOCK_PTR); ! if (e && e->flags & EDGE_FALLTHRU) ! insn = NULL; ! while (insn && NOTE_P (insn) && NOTE_LINE_NUMBER (insn) != NOTE_INSN_BASIC_BLOCK) Index: cfghooks.c =================================================================== RCS file: /cvs/gcc/gcc/gcc/cfghooks.c,v retrieving revision 1.19 diff -c -p -r1.19 cfghooks.c *** cfghooks.c 20 Nov 2004 05:02:28 -0000 1.19 --- cfghooks.c 22 Nov 2004 16:57:55 -0000 *************** bool *** 675,681 **** can_duplicate_block_p (basic_block bb) { edge e; - edge_iterator ei; if (!cfg_hooks->can_duplicate_block_p) internal_error ("%s does not support can_duplicate_block_p.", --- 675,680 ---- *************** can_duplicate_block_p (basic_block bb) *** 686,694 **** /* Duplicating fallthru block to exit would require adding a jump and splitting the real last BB. */ ! FOR_EACH_EDGE (e, ei, bb->succs) ! if (e->dest == EXIT_BLOCK_PTR && e->flags & EDGE_FALLTHRU) ! return false; return cfg_hooks->can_duplicate_block_p (bb); } --- 685,693 ---- /* Duplicating fallthru block to exit would require adding a jump and splitting the real last BB. */ ! e = find_edge (bb, EXIT_BLOCK_PTR); ! if (e && (e->flags & EDGE_FALLTHRU)) ! return false; return cfg_hooks->can_duplicate_block_p (bb); } Index: cfgloop.c =================================================================== RCS file: /cvs/gcc/gcc/gcc/cfgloop.c,v retrieving revision 1.43 diff -c -p -r1.43 cfgloop.c *** cfgloop.c 22 Nov 2004 12:23:44 -0000 1.43 --- cfgloop.c 22 Nov 2004 16:57:56 -0000 *************** verify_loop_structure (struct loops *loo *** 1499,1512 **** edge loop_latch_edge (const struct loop *loop) { ! edge e; ! edge_iterator ei; ! ! FOR_EACH_EDGE (e, ei, loop->header->preds) ! if (e->src == loop->latch) ! break; ! ! return e; } /* Returns preheader edge of LOOP. 
*/ --- 1499,1505 ---- edge loop_latch_edge (const struct loop *loop) { ! return find_edge (loop->latch, loop->header); } /* Returns preheader edge of LOOP. */ Index: cfgloopmanip.c =================================================================== RCS file: /cvs/gcc/gcc/gcc/cfgloopmanip.c,v retrieving revision 1.37 diff -c -p -r1.37 cfgloopmanip.c *** cfgloopmanip.c 22 Nov 2004 12:23:45 -0000 1.37 --- cfgloopmanip.c 22 Nov 2004 16:57:56 -0000 *************** force_single_succ_latches (struct loops *** 1221,1234 **** for (i = 1; i < loops->num; i++) { - edge_iterator ei; loop = loops->parray[i]; if (loop->latch != loop->header && EDGE_COUNT (loop->latch->succs) == 1) continue; ! FOR_EACH_EDGE (e, ei, loop->header->preds) ! if (e->src == loop->latch) ! break; loop_split_edge_with (e, NULL_RTX); } --- 1221,1231 ---- for (i = 1; i < loops->num; i++) { loop = loops->parray[i]; if (loop->latch != loop->header && EDGE_COUNT (loop->latch->succs) == 1) continue; ! e = find_edge (loop->latch, loop->header); loop_split_edge_with (e, NULL_RTX); } Index: cfgrtl.c =================================================================== RCS file: /cvs/gcc/gcc/gcc/cfgrtl.c,v retrieving revision 1.147 diff -c -p -r1.147 cfgrtl.c *** cfgrtl.c 22 Nov 2004 12:23:45 -0000 1.147 --- cfgrtl.c 22 Nov 2004 16:57:57 -0000 *************** rtl_flow_call_edges_add (sbitmap blocks) *** 2972,2986 **** if (need_fake_edge_p (insn)) { edge e; - edge_iterator ei; ! FOR_EACH_EDGE (e, ei, bb->succs) ! if (e->dest == EXIT_BLOCK_PTR) ! { ! insert_insn_on_edge (gen_rtx_USE (VOIDmode, const0_rtx), e); ! commit_edge_insertions (); ! break; ! } } } --- 2972,2984 ---- if (need_fake_edge_p (insn)) { edge e; ! e = find_edge (bb, EXIT_BLOCK_PTR); ! if (e) ! { ! insert_insn_on_edge (gen_rtx_USE (VOIDmode, const0_rtx), e); ! commit_edge_insertions (); ! } } } *************** rtl_flow_call_edges_add (sbitmap blocks) *** 3023,3031 **** #ifdef ENABLE_CHECKING if (split_at_insn == BB_END (bb)) { ! edge_iterator ei; ! 
FOR_EACH_EDGE (e, ei, bb->succs) ! gcc_assert (e->dest != EXIT_BLOCK_PTR); } #endif --- 3021,3028 ---- #ifdef ENABLE_CHECKING if (split_at_insn == BB_END (bb)) { ! e = find_edge (bb, EXIT_BLOCK_PTR); ! gcc_assert (e == NULL); } #endif Index: predict.c =================================================================== RCS file: /cvs/gcc/gcc/gcc/predict.c,v retrieving revision 1.134 diff -c -p -r1.134 predict.c *** predict.c 19 Nov 2004 00:03:14 -0000 1.134 --- predict.c 22 Nov 2004 16:57:58 -0000 *************** predict_loops (struct loops *loops_info, *** 669,681 **** /* Loop branch heuristics - predict an edge back to a loop's head as taken. */ ! FOR_EACH_EDGE (e, ei, bb->succs) ! if (e->dest == loop->header ! && e->src == loop->latch) ! { ! header_found = 1; ! predict_edge_def (e, PRED_LOOP_BRANCH, TAKEN); ! } /* Loop exit heuristics - predict an edge exiting the loop if the conditional has no loop header successors as not taken. */ --- 669,683 ---- /* Loop branch heuristics - predict an edge back to a loop's head as taken. */ ! if (bb == loop->latch) ! { ! e = find_edge (loop->latch, loop->header); ! if (e) ! { ! header_found = 1; ! predict_edge_def (e, PRED_LOOP_BRANCH, TAKEN); ! } ! } /* Loop exit heuristics - predict an edge exiting the loop if the conditional has no loop header successors as not taken. */ *************** propagate_freq (struct loop *loop, bitma *** 1660,1680 **** bitmap_clear_bit (tovisit, bb->index); ! /* Compute back edge frequencies. */ ! FOR_EACH_EDGE (e, ei, bb->succs) ! if (e->dest == head) ! { ! sreal tmp; ! /* EDGE_INFO (e)->back_edge_prob ! = ((e->probability * BLOCK_INFO (bb)->frequency) ! / REG_BR_PROB_BASE); */ ! sreal_init (&tmp, e->probability, 0); ! sreal_mul (&tmp, &tmp, &BLOCK_INFO (bb)->frequency); ! sreal_mul (&EDGE_INFO (e)->back_edge_prob, ! &tmp, &real_inv_br_prob_base); ! } /* Propagate to successor blocks. */ FOR_EACH_EDGE (e, ei, bb->succs) --- 1662,1681 ---- bitmap_clear_bit (tovisit, bb->index); ! 
e = find_edge (bb, head);
!	  if (e)
!	    {
!	      sreal tmp;
!	      /* EDGE_INFO (e)->back_edge_prob
!		   = ((e->probability * BLOCK_INFO (bb)->frequency)
!		      / REG_BR_PROB_BASE); */
!	      sreal_init (&tmp, e->probability, 0);
!	      sreal_mul (&tmp, &tmp, &BLOCK_INFO (bb)->frequency);
!	      sreal_mul (&EDGE_INFO (e)->back_edge_prob,
!			 &tmp, &real_inv_br_prob_base);
!	    }

      /* Propagate to successor blocks.  */
      FOR_EACH_EDGE (e, ei, bb->succs)
Index: tracer.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tracer.c,v
retrieving revision 1.22
diff -c -p -r1.22 tracer.c
*** tracer.c	28 Sep 2004 07:59:50 -0000	1.22
--- tracer.c	22 Nov 2004 16:57:58 -0000
*************** tail_duplicate (void)
*** 275,286 ****
  	  && can_duplicate_block_p (bb2))
  	    {
  	      edge e;
- 	      edge_iterator ei;
  	      basic_block old = bb2;

! 	      FOR_EACH_EDGE (e, ei, bb2->preds)
! 	        if (e->src == bb)
! 	          break;

  	      nduplicated += counts [bb2->index];
  	      bb2 = duplicate_block (bb2, e);
--- 275,283 ----
  	  && can_duplicate_block_p (bb2))
  	    {
  	      edge e;
  	      basic_block old = bb2;

! 	      e = find_edge (bb, bb2);

  	      nduplicated += counts [bb2->index];
  	      bb2 = duplicate_block (bb2, e);
Index: tree-cfg.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-cfg.c,v
retrieving revision 2.112
diff -c -p -r2.112 tree-cfg.c
*** tree-cfg.c	19 Nov 2004 22:14:35 -0000	2.112
--- tree-cfg.c	22 Nov 2004 16:57:59 -0000
*************** disband_implicit_edges (void)
*** 2649,2659 ****
       from cfg_remove_useless_stmts here since it violates the
       invariants for tree--cfg correspondence and thus fits better
       here where we do it anyway.  */
!   FOR_EACH_EDGE (e, ei, bb->succs)
      {
-       if (e->dest != bb->next_bb)
-         continue;
-
        if (e->flags & EDGE_TRUE_VALUE)
          COND_EXPR_THEN (stmt) = build_empty_stmt ();
        else if (e->flags & EDGE_FALSE_VALUE)
--- 2649,2657 ----
       from cfg_remove_useless_stmts here since it violates the
       invariants for tree--cfg correspondence and thus fits better
       here where we do it anyway.  */
!   e = find_edge (bb, bb->next_bb);
!   if (e)
      {
        if (e->flags & EDGE_TRUE_VALUE)
          COND_EXPR_THEN (stmt) = build_empty_stmt ();
        else if (e->flags & EDGE_FALSE_VALUE)
*************** static bool
*** 3892,3899 ****
  tree_forwarder_block_p (basic_block bb)
  {
    block_stmt_iterator bsi;
-   edge e;
-   edge_iterator ei;

    /* BB must have a single outgoing edge.  */
    if (EDGE_COUNT (bb->succs) != 1
--- 3890,3895 ----
*************** tree_forwarder_block_p (basic_block bb)
*** 3911,3920 ****
    gcc_assert (bb != ENTRY_BLOCK_PTR);
  #endif

!   /* Successors of the entry block are not forwarders.  */
!   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR->succs)
!     if (e->dest == bb)
!       return false;

    /* Now walk through the statements.  We can ignore labels,
       anything else means this is not a forwarder block.  */
--- 3907,3914 ----
    gcc_assert (bb != ENTRY_BLOCK_PTR);
  #endif

!   if (find_edge (ENTRY_BLOCK_PTR, bb))
!     return false;

    /* Now walk through the statements.  We can ignore labels,
       anything else means this is not a forwarder block.  */
*************** tree_flow_call_edges_add (sbitmap blocks
*** 5206,5212 ****
       Handle this by adding a dummy instruction in a new last basic block.  */
    if (check_last_block)
      {
-       edge_iterator ei;
        basic_block bb = EXIT_BLOCK_PTR->prev_bb;
        block_stmt_iterator bsi = bsi_last (bb);
        tree t = NULL_TREE;
--- 5200,5205 ----
*************** tree_flow_call_edges_add (sbitmap blocks
*** 5217,5229 ****
        {
          edge e;

!         FOR_EACH_EDGE (e, ei, bb->succs)
!           if (e->dest == EXIT_BLOCK_PTR)
!             {
!               bsi_insert_on_edge (e, build_empty_stmt ());
!               bsi_commit_edge_inserts ();
!               break;
!             }
        }
    }
--- 5210,5221 ----
        {
          edge e;

!         e = find_edge (bb, EXIT_BLOCK_PTR);
!         if (e)
!           {
!             bsi_insert_on_edge (e, build_empty_stmt ());
!             bsi_commit_edge_inserts ();
!           }
        }
    }
*************** tree_flow_call_edges_add (sbitmap blocks
*** 5260,5268 ****
  #ifdef ENABLE_CHECKING
        if (stmt == last_stmt)
          {
!           edge_iterator ei;
!           FOR_EACH_EDGE (e, ei, bb->succs)
!             gcc_assert (e->dest != EXIT_BLOCK_PTR);
          }
  #endif
--- 5252,5259 ----
  #ifdef ENABLE_CHECKING
        if (stmt == last_stmt)
          {
!           e = find_edge (bb, EXIT_BLOCK_PTR);
!           gcc_assert (e == NULL);
          }
  #endif

Here are the new numbers.

option   mainline   3.3.2
-O0      2.790      2.680
-O1      12.51      11.44
-O2      18.26      19.84
-O3      18.64      19.53

So we are faster or just as fast except at -O2, where we are 9% slower. We are nowhere near the numbers we were at before.

I suggest we close this one; 9% is just comparable to the overall slowdown we see in pretty much all code. This is just noise in the list of regressions. If nobody objects, I'll close this in a couple of days from now...

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Mon, 2004-12-20 at 00:51 +0000, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org 2004-12-20 00:51 -------
> I suggest we close this one; 9% is just comparable to the overall
> slowdown we see in pretty much all code. This is just noise in
> the list of regressions. If nobody objects, I'll close this in a
> couple of days from now...

I wouldn't close it. There are still some significant representational issues that need addressing (jump vectors in RTL). It exposes some major lameness in invariant code motion (trying to hoist a few thousand instances of GLOBAL_VAR), and some lameness in our CFG code as well (the edge cache).

The fact that we're spending so much time redirecting edges at the RTL level is also an indication that we may not be passing off optimal trees to the expanders. In all, there may be another 10% gain for pr15524 waiting for us to nail down.

These are all issues I want to see examined, but they're lower priority than fixing the extreme lameness in our aliasing code, which is causing even bigger compile-time regressions in other test cases.
Jeff

That may all be true, but the fact is that you're now talking about missed optimizations, and not a real compile-time hog. Yes, we can win. No, this is not a hog anymore; it's just not as good as it could be. You're also talking about RTL, not tree jump threading.

In fact I think it's plain wrong to keep this bug open. It's just keeping the number of real regressions artificially high. We have compile-time slowdowns, and that's a regression. But we don't need to keep bugs open when the issue of that bug has been fixed.

I suggest someone updates the summary and keywords to reflect the actual status of the bug, or closes this one and opens a new one for the basically unrelated issues Jeff mentioned.

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Wed, 2004-12-22 at 19:18 +0000, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org 2004-12-22 19:18 -------
> That may all be true, but the fact is that you're now talking about
> missed optimizations, and not a real compile-time hog. Yes, we can
> win. No, this is not a hog anymore; it's just not as good as it
> could be. You're also talking about RTL, not tree jump threading.

I'm talking strictly about compile-time hogs. It's insanely dumb that we're burning 5% of compilation time just trying to hoist the fake GLOBAL_VAR variable out of empty loops. It's insanely dumb that we're burning 5% of compilation time threading jumps at the RTL level when they all should have been threaded at the tree level.

> In fact I think it's plain wrong to keep this bug open. It's just
> keeping the number of real regressions artificially high. We have
> compile-time slowdowns, and that's a regression. But we don't need
> to keep bugs open when the issue of that bug has been fixed.

I strongly disagree. Please do not close this bug report.
jeff

Jeff, as I said, you're attacking problems not related to the original problem reported in this PR. Maybe you don't use bugzilla often enough to see how annoying it is when the PR summary and the problem described in the bug report don't match. But oh well, whatever, if you can win 10% here, why am I complaining ;-)

I'm very curious about your tree alias rewrite. Got a plan/overview of it somewhere? Is it GCC 4.0 material? Relation to this PR: Last time I looked, ~40% of the RTL jump threads were related to more accurate alias info (components!) on the RTL level.

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Thu, 2004-12-23 at 13:31 +0000, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org 2004-12-23 13:31 -------
> Jeff, as I said, you're attacking problems not related to the original
> problem reported in this PR. Maybe you don't use bugzilla often enough
> to see how annoying it is when the PR summary and the problem described
> in the bug report don't match. But oh well, whatever, if you can win
> 10% here, why am I complaining ;-)

The bug report really should have been called "compilation speed regression" rather than calling out "jump threading" -- a goodly number of the changes necessary to get 15524 to an almost acceptable point were not jump threading related. How about this: open a new report with the same testcase and a generic description of the compile-time regression, then close 15524. Seems like a make-work project, but hey, if it makes you happy....

> I'm very curious about your tree alias rewrite. Got a plan/overview of
> it somewhere?

In simplest terms, we lose the braindamaged type memory tag nonsense and only do type-based alias analysis on objects that have not been disambiguated by other means.
This has the effect of greatly reducing the number of type-based aliasing checks -- which account for the vast majority of the compilation time for 15855.

> Is it GCC 4.0 material?

Unsure, mostly because I will not have much time to work on it for the next 3 weeks.

Jeff

The timings for 3.3.2 and 4.0.0 are the same or faster for 4.0.0, so closing as fixed.

(In reply to comment #18)
> Ugh. The copy we inserted turned out to be totally useless for determining
> that the condition at the end of BB5 is always false. Argggh.
>
> To make this scheme work we'd have to do copy insertions anytime the
> COND_EXPR_COND or SWITCH_EXPR changed. Worse yet, we'd need to call
> rewrite_ssa_into_ssa to keep things in SSA form. Double Ugh.

It's a matter of scheduling the passes, mostly. The reason you're having a hard time here is mostly because some of the scalar cleanups haven't run. Suppose that we do this after FRE and copy propagation. We end up with:

D.15513_11 = insn_10->code;
switch (D.15514_11)
  {
    case 5: goto <L17>;
    case 6: goto <L2>;
    case 7: goto <L3>;
    case 8: goto <L22>;
    case 9 ... 10: goto <L0>;
    default : goto <L23>;
  }
# SUCC: 22 1 21 3 2 16

[ ... ]

# BLOCK 3
# PRED: 0
<L3>:;
if (D.15521_11 != 7) goto <L4>; else goto <L5>;
# SUCC: 5 (false) 4 (true)

Now, if you insert an assertion of any kind at the start of block #3, you'll get the elimination you're looking for. Currently, VRP will not handle this case because of the switch. That is easily fixable. In TCB, the code I get after VRP is this:

# BLOCK 0
# PRED: ENTRY (fallthru,exec)
# VUSE <TMT.0_7>;
D.1144_2 = insn_1->code;
insn_3 = insn_1;
switch (D.1144_2)
  {
    case 5: goto L17;
    case 6: goto L17;
    case 7: goto L3;
    default : goto L17;
  }
# SUCC: 1 (exec) 3 (exec)

# BLOCK 1
# PRED: 0 (exec)
L3:;
D.1145_6 = D.1144_2;
if (D.1145_6 != 7) goto <L5>; else goto L17;
# SUCC: 2 (true,exec) 3 (false,exec)

[ ... ]

So, it would be a matter of choosing the ordering and arranging for VRP to handle SWITCH_EXPRs.
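The folding Diego is after -- the edge taken for "case 7:" pins the switched value to 7, so the later D.1145_6 != 7 test in block 1 is statically false -- can be sketched in miniature. This is a hypothetical illustration with invented names, not GCC's VRP code:

```c
/* Sketch of the fact a switch case edge asserts about the switched
   operand, and how a later comparison folds against it.  Invented
   names for illustration only.  */

struct case_assertion {
  int known_value;   /* the value implied by the taken case edge */
};

/* On the edge for "case N:", the switched operand must equal N.  */
struct case_assertion
assert_from_case_edge (int case_label)
{
  struct case_assertion a = { case_label };
  return a;
}

/* Fold "x != c" given the edge assertion.
   Returns 1 if the test is always true, 0 if always false.  */
int
fold_ne (struct case_assertion a, int c)
{
  return a.known_value != c;
}
```

With the assertion derived from the "case 7:" edge, fold_ne (a, 7) is 0, which is exactly the elimination VRP would get by handling SWITCH_EXPRs.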
Now, if what you want is a design where all these actions are done in one super-pass, well, that's a different can of worms.

Diego.

Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
On Fri, 2005-04-08 at 21:49 +0000, dnovillo at gcc dot gnu dot org
wrote:
> ------- Additional Comments From dnovillo at gcc dot gnu dot org 2005-04-08 21:49 -------
> (In reply to comment #18)
>
> > Ugh. The copy we inserted turned out to be totally useless for determining
> > that the condition at the end of BB5 is always false. Argggh.
> >
> > To make this scheme work we'd have to do copy insertions anytime the
> > COND_EXPR_COND or SWITCH_EXPR changed. Worse yet, we'd need to call
> > rewrite_ssa_into_ssa to keep things in SSA form. Double Ugh.
> >
> It's a matter of scheduling the passes mostly. The reason you're having a hard
> time here is mostly because some of the scalar cleanups haven't run. Suppose
> that we do this after FRE and copy propagation. We end up with:
As we discussed on the phone, it can be seen in that way. For VRP you
have the nice property that it's self-contained. ie, the other
optimizers have already run, you insert your ASSERT_EXPRs, then run the
VRP algorithm, then destroy the ASSERT_EXPRs.
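The self-contained discipline described here -- derive a fact from a branch edge, use it, throw it away -- can be caricatured with a tiny range lattice. A hedged sketch with invented names; GCC's real ASSERT_EXPR and range machinery is far richer:

```c
/* Caricature of the VRP model: an ASSERT_EXPR on the true edge of
   "if (x > bound)" would encode that x is in [bound+1, INT_MAX].
   Invented names for illustration only.  */
#include <limits.h>

struct range {
  int min, max;
};

/* The range fact for x on the true edge of "if (x > bound)".
   (Assumes bound < INT_MAX, so bound + 1 does not overflow.)  */
struct range
range_on_true_edge_gt (int bound)
{
  struct range r = { bound + 1, INT_MAX };
  return r;
}

/* Fold "x == c" against a known range for x.
   Returns 1 = always true, 0 = always false, -1 = cannot decide.  */
int
fold_eq_with_range (struct range r, int c)
{
  if (c < r.min || c > r.max)
    return 0;            /* c lies outside the range: always false */
  if (r.min == r.max)
    return 1;            /* the range is the single value c */
  return -1;
}
```

Once the pass is done, the recorded ranges (the stand-in for the ASSERT_EXPRs) are simply discarded.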
That mental model doesn't work right now with the way DOM & the jump
threader interact, because they are too tightly intertwined.
As I mentioned on the phone, what I want to do long term is have the
jump threader maintain its temporary equivalences and query the
value handles created by PRE/FRE or whatever.
The iteration step we see today would turn into a worklist. ie, after
we thread jumps, we walk through the IL for PHIs which represent a copy
that can be propagated. The uses of the PHI result are put onto a
worklist of statements which need their value handles updated and
which may now be redundant. If the statement is proven redundant, then
we have a new copy to propagate and the uses of the LHS of the
statement are put on the worklist.
What's nice about this is the bulk of DOM as we know it today disappears
along with the expensive iteration in the case when jumps are threaded.
We just need the right APIs for copy propagation and value handles.
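The propagation half of the worklist scheme described above can be approximated with a copy-of table over small integer "SSA name" numbers; the worklist bookkeeping (re-examining each use of a newly discovered copy) is elided, with transitive resolution through the table standing in for it. Everything here is invented for illustration:

```c
/* A copy-of table: copy_of[n] == n means name n is not (known to be)
   a copy of anything else.  Invented representation, not GCC's.  */

#define NNAMES 16

static int copy_of[NNAMES];

/* Initially no name is a copy.  */
void
init_names (void)
{
  int i;
  for (i = 0; i < NNAMES; i++)
    copy_of[i] = i;
}

/* See through chains of copies: x5 -> x3 -> x1 resolves to x1.  */
int
resolve (int name)
{
  while (copy_of[name] != name)
    name = copy_of[name];
  return name;
}

/* Record that LHS is a copy of RHS, as when jump threading makes a
   PHI degenerate.  */
void
record_copy (int lhs, int rhs)
{
  copy_of[lhs] = resolve (rhs);
}

/* Demo: threading makes x3 a copy of x1, then x5 a copy of x3;
   every use of x5 should now see x1.  */
int
demo_copy_chain (void)
{
  init_names ();
  record_copy (3, 1);
  record_copy (5, 3);
  return resolve (5);
}
```

In the real scheme, record_copy would also push the uses of LHS onto the worklist so their value handles get updated and newly redundant statements are found.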
We could still potentially use ASSERT_EXPRs to encode information about
edge equivalences, though we probably would still want the ASSERT_EXPR
to appear as a statement rather than the RHS of a MODIFY_EXPR.
Jeff
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Tue, Apr 12, 2005 at 04:55:20PM -0000, law at redhat dot com wrote:

> That mental model doesn't work right now with the way DOM & the
> jump threader interact, because they are too tightly intertwined.

The link that you have still not shown me is why this mental model doesn't work for the jump threader. That's why I said that I need to run the threader myself and see why it needs all these additional crutches.

If the code has been cleaned up of redundancies, copies propagated, names have known values and ranges are set, then I don't see why we would need all the extra baggage.

> The iteration step we see today would turn into a worklist.
> ie, after we thread jumps, we walk through the IL for PHIs
> which represent a copy that can be propagated.

Ah, here's something specific I don't follow. Why would you have these PHIs anymore? If all the arguments were ultimately equivalent, then the various propagators are bound to have removed them. If not, that sounds like a bug in them.

> What's nice about this is the bulk of DOM as we know it today
> disappears along with the expensive iteration in the case when
> jumps are threaded. We just need the right APIs for copy
> propagation and value handles.

Again, why? The propagators are supposed to have done this already. They are all supposed to work in conditional regions.

> We could still potentially use ASSERT_EXPRs to encode
> information about edge equivalences, though we probably would
> still want the ASSERT_EXPR to appear as a statement rather than
> the RHS of a MODIFY_EXPR.

Odd, the nice thing about assertions being on the RHS is that you pin their result to a specific SSA name that you then get to use at every place naturally dominated by the assertion.

If you could give me a concrete test case that I could sink my teeth in, that'd be great. Thanks.

Diego.
Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases

On Wed, 2005-04-13 at 13:04 +0000, dnovillo at redhat dot com wrote:
> ------- Additional Comments From dnovillo at redhat dot com 2005-04-13 13:03 -------
> Subject: Re: [4.0 Regression] jump threading on trees is slow with switch statements with large # of cases
>
> On Tue, Apr 12, 2005 at 04:55:20PM -0000, law at redhat dot com wrote:
>
> > That mental model doesn't work right now with the way DOM & the
> > jump threader interact, because they are too tightly intertwined.
>
> The link that you have still not shown me is why this mental
> model doesn't work for the jump threader. That's why I said that
> I need to run the threader myself and see why it needs all these
> additional crutches.

That's the model we want to go to -- but it's certainly not where we are today. Why? Because the jump threader exposes more optimization opportunities for DOM (including constant and copy propagations) and DOM exposes more opportunities for the threader. They are inherently tied together with their current structure.

> If the code has been cleaned up of redundancies, copies
> propagated, names have known values and ranges are set, then I
> don't see why we would need all the extra baggage.

Because threading exposes new opportunities to perform constant and copy propagation and redundancy elimination.

> > The iteration step we see today would turn into a worklist.
> > ie, after we thread jumps, we walk through the IL for PHIs
> > which represent a copy that can be propagated.
>
> Ah, here's something specific I don't follow. Why would you have
> these PHIs anymore? If all the arguments were ultimately
> equivalent, then the various propagators are bound to have
> removed them. If not, that sounds like a bug in them.

They are exposed as a result of jump threading. For example we might have something like:

BB 3:
  x3 = PHI (x2 (0), x1 (1), x1 (2));
  [ ... ]
  if (cond) goto X; else goto Y

If jump threading manages to determine where the conditional will go when BB3 is reached via BB0, then we'll be able to remove the PHI argument associated with the edge from BB0 to BB3. That in turn exposes a copy propagation opportunity (x3 = x1).

Or when we thread a jump we might expose a new redundancy because the dominator graph changed. Whenever we expose a redundancy we also expose a copy propagation opportunity.

> > What's nice about this is the bulk of DOM as we know it today
> > disappears along with the expensive iteration in the case when
> > jumps are threaded. We just need the right APIs for copy
> > propagation and value handles.
>
> Again, why? The propagators are supposed to have done this
> already. They are all supposed to work in conditional regions.

You really don't seem to understand what the threader does and how its actions can expose new optimization opportunities. I might suggest you look very closely at what happens for a test like 20030714-2.c, which is derived from real code. As we thread jumps, we expose new redundancy elimination opportunities, which in turn expose new copy propagation opportunities.

> We could still potentially use ASSERT_EXPRs to encode
> information about edge equivalences, though we probably would
> still want the ASSERT_EXPR to appear as a statement rather than
> the RHS of a MODIFY_EXPR.
>
> Odd, the nice thing about assertions being on the RHS is that you
> pin their result to a specific SSA name that you then get to use
> at every place naturally dominated by the assertion.
>
> If you could give me a concrete test case that I could sink my
> teeth in, that'd be great.

Go back to the PR I've referenced twice.

jeff
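Jeff's example -- x3 = PHI (x2 (0), x1 (1), x1 (2)) losing its edge-0 argument and collapsing into a copy of x1 -- can be sketched as follows. The representation is invented for illustration, not GCC's PHI node interface:

```c
/* Minimal stand-in for a PHI node: one SSA name number per incoming
   edge.  Invented representation, not GCC's.  */

#define MAX_ARGS 8

struct phi {
  int args[MAX_ARGS];   /* SSA name number per incoming edge */
  int nargs;
};

/* Drop the argument for an incoming edge that jump threading
   redirected away from this block.  */
void
phi_remove_arg (struct phi *p, int edge)
{
  int i;
  for (i = edge; i + 1 < p->nargs; i++)
    p->args[i] = p->args[i + 1];
  p->nargs--;
}

/* If all remaining arguments are the same name, the PHI is really a
   copy of that name; return it, else -1.  */
int
phi_as_copy (const struct phi *p)
{
  int i;
  for (i = 1; i < p->nargs; i++)
    if (p->args[i] != p->args[0])
      return -1;
  return p->args[0];
}

/* x3 = PHI (x2, x1, x1): not a copy until the edge-0 argument (x2)
   is removed, after which it degenerates into a copy of x1.  */
int
demo_threaded_phi (void)
{
  struct phi p = { { 2, 1, 1 }, 3 };
  if (phi_as_copy (&p) != -1)
    return -2;               /* still a real PHI while edge 0 remains */
  phi_remove_arg (&p, 0);
  return phi_as_copy (&p);   /* now a copy of x1 */
}
```

The copy exposed here (x3 = x1) is exactly the new copy propagation opportunity Jeff says threading feeds back into DOM.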