[PATCH] Eliminates phi on branch for CMP result

Wed May 8 12:20:00 GMT 2019

On Wed, 8 May 2019, Jiufu Guo wrote:

> Hi,
> 
> Thanks Richard, Segher, Andrew and all.
> 
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> 
> > Let me try to answer some of this...
> >
> > On Tue, May 07, 2019 at 03:31:27PM +0200, Richard Biener wrote:
> >> On Mon, 6 May 2019, Jiufu Guo wrote:
> >> > This patch implements the optimization in PR77820.  The optimization
> >> > eliminates phi and phi's basic block, if the phi is used only by
> >> > condition branch, and the phi's incoming value in the result of a
> >> > CMP result.
> >
> >> I'm not sure I like a new pass here ;)  The transform is basically
> >> tail-duplicating the PHI block because the exit conditional can
> >> be "simplfied" - that's something jump threading generally does
> >> though it relies on "simplified" being a little bit more simplified
> >> than above.
> >
> Adding a new pass is just because it may be relatively easy to extend
> and maintain. 
> 
> Adding this micor-optimization into other passes is also a good
> sugguestion. 
> 
> - Similar with jump threading, this new transform is replacing jump
> old destination with new destination. While for this case, the new
> destination can not be determined at compile time. 
> 
> - And as Andrew said: forwprop/backprop are similar, but this case is in
> opposite: it is spread common code(phi+gcond) into different preds.
> 
> - And there is phiopt pass which focus on making 'phi' stmt better. 
> it also do some similar thing:
>      bb0:
>        if (a != b) goto bb2; else goto bb1;
>      bb1:
>      bb2:
>        x = PHI <a (bb1), b (bb0), ...>;
> 
>    tranform to:
> 
>      bb0:
>      bb2:
>        x = PHI <b (bb0), ...>;
> 
> If I avoid to add new pass, any suggustions about which exiting pass
> (jumpthreading/forwprop/phiopt/...) would be more suitable to extend to
> support this new transform? 

The main thing the transform does is tail-duplicate the PHI block,
if we'd just do that followup transforms would do the rest.

So my suggestion would be to look at passes doing exactly that
which would be tracer and path splitting (but that's just run for
loops).  If you want to fit it into jump threading then it would
be the backwards threader since the heuristics would look at
opportunities to simplify the conditional, see it fed by a PHI
and there from (two) conditional(s).

You do not put any constraints on the block relation of the
conditional generators, for example path splitting looks
for diamonds, duplicating the merge block.

It might be that extending the backwards threader with the
pattern matching is easiest (this one also runs quite a few times).

> > Right, but where in the pipeline does this fit in?
> >
> >> I suspect this transform was implemented because of some benchmark?
> >
> > Something in SPEC...  2006 iirc...  Will need to dig it up, I forgot
> > the details.
> >
> >> I suspect the performance benefit is because of better branch
> >> prediction by not mangling both conditional branches into one?
> >
> > No, it is that previously a condition was moved to a GPR, and then compared
> > again.  See PR77820.  This is expensive, and serial, too.
> >
> This transform focusing on eliminating additional GPR saving and
> additional compare, then it would helps a lot if the comparasion in in
> hot path.
> >> The transform is also somewhat similar to tail-duplication done
> >> in path splitting or tracer.
> >
> > Yes.
> >
> >> The pass itself does quite strict pattern-matching but I wonder
> >> if more cases should be handled this way.
> >
> > Maybe.  Probably.  But which?
> Currently this pass handles the case if there is only one PHI and the
> phi is used by gcond. If there are other suitable cases, we can just
> extend this tranform.

I was just thinking of a small amount of unrelated stmts in those
blocks or the case where just one PHI argument is fed by a conditional
thus only one path would simplify (but maybe the hot one if you look
at edge probabilities).

> >
> >> Any specific reason for the pass placement between PRE and sinking?
> >> tracer and path splitting run much later, jump threading runs
> >> all over the place.
> >
> > Dunno.  Jiufu, does the pass placement matter much?
> This pass would just need be inserted after ssa, and before the last dce
> and forwprop which could help to eliminate temp conversion from bool to
> int:
>  _1 = a_8(D) < b_9(D);
>  t_14 = (int) _1;
>  if (_1 != 0)
> 
> Putting it after PRE is just because some case is better be handled by PREx,
> like below case:  if (x) t = a < b; else t = a < b; if (t) xx.
> While actually, this pass is ok to insert at other placement.

OK, so all of the suggestions above would work since we have a late
forwprop and DCE pass to clean up.

Btw, I wonder if on RTL basic-block reordering (which also does
some tail duplication) could be a place to do such transform?
Or is it too late to do the desired cleanups after that?
Possibly since we're after RA.

Richard.