[PATCH, rs6000] Add expansions for min/max vector reductions
Thu Sep 17 14:28:00 GMT 2015
On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> On Wed, 16 Sep 2015, Alan Lawrence wrote:
> > On 16/09/15 17:10, Bill Schmidt wrote:
> > >
> > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> > > > On 16/09/15 15:28, Bill Schmidt wrote:
> > > > > 2015-09-16 Bill Schmidt <firstname.lastname@example.org>
> > > > >
> > > > > * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX,
> > > > > UNSPEC_REDUC_SMIN,
> > > > > UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> > > > > UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> > > > > UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> > > > > (reduc_smax_v2di): New define_expand.
> > > > > (reduc_smax_scal_v2di): Likewise.
> > > > > (reduc_smin_v2di): Likewise.
> > > > > (reduc_smin_scal_v2di): Likewise.
> > > > > (reduc_umax_v2di): Likewise.
> > > > > (reduc_umax_scal_v2di): Likewise.
> > > > > (reduc_umin_v2di): Likewise.
> > > > > (reduc_umin_scal_v2di): Likewise.
> > > > > (reduc_smax_v4si): Likewise.
> > > > > (reduc_smin_v4si): Likewise.
> > > > > (reduc_umax_v4si): Likewise.
> > > > > (reduc_umin_v4si): Likewise.
> > > > > (reduc_smax_v8hi): Likewise.
> > > > > (reduc_smin_v8hi): Likewise.
> > > > > (reduc_umax_v8hi): Likewise.
> > > > > (reduc_umin_v8hi): Likewise.
> > > > > (reduc_smax_v16qi): Likewise.
> > > > > (reduc_smin_v16qi): Likewise.
> > > > > (reduc_umax_v16qi): Likewise.
> > > > > (reduc_umin_v16qi): Likewise.
> > > > > (reduc_smax_scal_<mode>): Likewise.
> > > > > (reduc_smin_scal_<mode>): Likewise.
> > > > > (reduc_umax_scal_<mode>): Likewise.
> > > > > (reduc_umin_scal_<mode>): Likewise.
> > > >
> > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be
> > > > used if the _scal are present. The non-_scal's were previously defined
> > > > as producing a vector with one element holding the result and the other
> > > > elements all zero, and this was only ever used with a vec_extract
> > > > immediately after; the _scal pattern now includes the vec_extract as
> > > > well. Hence the non-_scal patterns are deprecated / considered legacy,
> > > > as per md.texi.
> > >
> > > Thanks -- I had misread the description of the non-scalar versions,
> > > missing the part where the other elements are zero. What I really
> > > want/need is an optab defined as computing the maximum value in all
> > > elements of the vector.
> > Yes, indeed. It seems reasonable to me that this would coexist with an optab
> > which computes only a single value (i.e. a scalar).
> So just to clarify - you need to reduce the vector with max to a scalar
> but want the (same) result in all vector elements?
Yes. Alan Hayward's cond-reduction patch is set up to perform a
reduction to scalar, followed by a scalar broadcast to get the value
into all positions. It happens that our most efficient expansion to
reduce to scalar will naturally produce the value in all positions.
Using the reduc_smax_scal_<mode> expander, we are required to extract
this into a scalar and then perform the broadcast. Those two operations
are unnecessary. We can clean up after the fact, but the cost model
will still reflect the unnecessary activity.
> > At that point it might be appropriate to change the cond-reduction code to
> > generate the reduce-to-vector in all cases, and optabs.c expand it to
> > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with
> > the (parallel) changes to cost model already proposed, does that cover all the
> > cases? It does add a new tree code, yes, but I'm feeling that could be
> > justified if we go down this route.
> I'd rather have combine recognize an insn that does both (if it
> exists). As I understand powerpc doesn't have reduction to scalar
> (I think no ISA actually has this, but all ISAs have reduce to
> one vector element plus a cheap way of extraction (effectively a subreg))
> but its reduction already has all result vector elements set to the
> same value (as opposed to having some undefined or zero or whatever).
The extraction is not necessarily so cheap. For floating-point values
in a register, we can indeed use a subreg. But in this case we are
talking about integers, and an extract must move them from the vector
registers to the GPRs. With POWER8, we can do this with a 5-cycle move
instruction. Prior to that, we can only do this with a very expensive
path through memory, as previous processors had no direct path between
the vector and integer registers.
> > However, another point that springs to mind: if you reduce a loop containing
> > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these
> > using shifts, and I think will also use shifts for platforms not possessing a
> > reduc_plus/min/max. If shifts could be changed for rotates, the code there
> > would do your reduce-to-a-vector-of-identical-elements in the midend...can we
> > (sensibly!) bring all of these together?
If you think it's reasonable, I will take a look at whether this can
work. For targets like PowerPC and AArch32, perhaps we are better off
not defining an expander at all, but letting the vector epilogue code
expose the sequence of shifts/max-mins at the GIMPLE level. For this
case the vectorizer would then not generate the unnecessary extract and
broadcast. However, I'd want to let Alan Hayward finish revising and
committing his patch first.
> > > Perhaps the practical thing is to have the vectorizer also do an
> > > add_stmt_cost with some new token that indicates the cost model should
> > > make an adjustment if the back end doesn't need the extract/broadcast.
> > > Targets like PowerPC and AArch32 could then subtract the unnecessary
> > > cost, and remove the unnecessary code in simplify-rtx.
> > I think it'd be good if we could do it before simplify-rtx, really, although
> > I'm not sure I have a strong argument as to why, as long as we can cost it
> > appropriately.
> > > In any case, I will remove implementing the deprecated optabs, and I'll
> > > also try to look at Alan L's patch shortly.
> > That'd be great, thanks :)
> As for the cost modeling the simple solution is of course to have
> these as "high-level" costs (a new enum entry), but the sustainable
> solution is to have the target see the vectorized code and decide
> for itself. Iff we'd go down the simple route then the target
> may as well create the vectorized code for that special high-level
> operation itself.