[PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

Fri Sep 11 15:50:00 GMT 2015

On Fri, 2015-09-11 at 16:28 +0100, Richard Sandiford wrote:
> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
> > On Fri, 2015-09-11 at 15:29 +0100, Richard Sandiford wrote:
> >> Ramana Radhakrishnan <ramana.gcc@googlemail.com> writes:
> >> > On Fri, Sep 11, 2015 at 2:19 PM, Bill Schmidt
> >> > <wschmidt@linux.vnet.ibm.com> wrote:
> >> >> Hi Alan,
> >> >>
> >> >> I probably wasn't clear enough.  The implementation in the vectorizer is
> >> >> fine and I'm not asking that to change per target.  What I'm objecting
> >> >> to is the equivalence between a REDUC_MAX_EXPR and a cost associated
> >> >> with vec_to_scalar.  This assumes that the back end will implement a
> >> >> REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
> >> >> But those back ends should be free to model the cost of the
> >> >> REDUC_MAX_EXPR appropriately.  Therefore I am asking for a new
> >> >> vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR.  For
> >> >> ARM, this cost will be the same as a vec_to_scalar.  For others, it may
> >> >> not be; for powerpc, it certainly will not be.
> >> >>
> >> >> We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
> >> >> expansion, and therefore it is not correct for us to explode this in
> >> >> tree-vect-generic.  This would expand the code size without providing
> >> >> any significant optimization opportunity, and could reduce the ability
> >> >> to, for instance, common REDUC_MAX_EXPRs.  It would also slow down the
> >> >> gimple vectorizers.
> >> >>
> >> >> I apologize if my loose use of language confused the issue.  It isn't
> >> >> the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
> >> >> that are used by it.
> >> >>
> >> >> (The costs in powerpc won't be enormous, but they are definitely
> >> >> mode-dependent in a way that vec_to_scalar is not.  We'll need 2*log(n)
> >> >> instructions, where n is the number of elements in the mode being
> >> >> vectorized.)
> >> >
> >> > IIUC, on AArch64 a reduc_max_expr matches with a single reduction
> >> > operation but on AArch32 Neon a reduc_smax gets implemented as a
> >> > sequence of vpmax instructions which sounds similar to the PowerPC
> >> > example as well. Thus mapping a reduc_smax expression to the cost of a
> >> > vec_to_scalar is probably not right in this particular situation.
> >> 
> >> But AIUI vec_to_scalar exists to represent reduction operations.
> >> (I see it was also used for strided stores.)  So for better or worse,
> >> I think the interface that Alan's patch uses is the defined interface
> >> for measuring the cost of a reduction.
> >>
> >> If a backend implemented reduc_umax_scal_optab in current sources,
> >> without Alan's patch, then that optab would be used for a "natural"
> >> unsigned max reduction (i.e. a reduction of a MAX_EXPR with unsigned
> >> inputs).  vec_to_scalar would be used to weigh the cost of the epilogue
> >> reduction statement in that case.
> >> 
> >> So if defining a new Power pattern might cause Alan's patch to trigger
> >> in cases where the transformation is actually too expensive, I would
> >> expect the same to be true for a natural umax without Alan's patch.
> >> The two cases ought to underestimate the true cost by the same degree.
> >> 
> >> In other words, whether the cost interface is flexible enough is
> >> definitely interesting but seems orthogonal to this patch.
> >
> > That's a reasonable argument, but is this not a good opportunity to fix
> > an incorrect assumption in the vectorizer cost model?  I would prefer
> > for this issue not to get lost on a technicality.
> 
> I think it's more than technicality though.  I don't think it should be
> Alan's responsibility to extend the cost model when (a) his patch uses the
> current model in the way that it was intended to be used (at least AIUI) and
> (b) in this case, the motivating example for the new model is a pattern
> that hasn't been written yet. :-)

Agreed.  However, the original patch description said in essence, this
is good for everybody, powerpc and x86 should go and implement their
patterns and use it.  It turns out not to be so simple, unfortunately.

> 
> So...
> 
> > The vectorizer cost model has many small flaws, and we all need to be
> > mindful of trying to improve it at every opportunity, rather than
> > allowing it to continue to degrade.  We just had a big discussion about
> > improving cost models at the last Cauldron, and my request is consistent
> > with that direction.
> >
> > Saying that all reductions have equivalent performance is unlikely to be
> > true for many platforms.  On PowerPC, for example, a PLUS reduction has
> > very different cost from a MAX reduction.  If the model isn't
> > fine-grained enough, let's please be aggressive about fixing it.  I'm
> > fine if it's a separate patch, but in my mind this shouldn't be allowed
> > to languish.
> 
> ...I agree that the general vectoriser cost model could probably be
> improved, but it seems fairer for that improvement to be done by whoever
> adds the patterns that need it.

All right.  But in response to Ramana's comment, are all relevant
reductions of similar cost on each ARM platform?  Obviously they don't
have the same cost on different platforms, but the question is whether a
reduc_plus, reduc_max, etc., has identical cost on each individual
platform.  If not, ARM may have a concern as well.  I don't know the
situation for x86 either.

I assume that the IBM folks that originally created a lot of this stuff
are responsible for how reductions are treated.  It looks like there is
some commentary that a reduction will only be used if it is a single
instruction (not sure this is true).  In any case, by not implementing
the smax patterns for some reason, this kept us from seeing a problem.

So I'll put this on my list of windmills to tilt at in a few weeks.

Thanks for the discussion,
Bill

> 
> Thanks,
> Richard
>