[PATCH, rs6000] Add expansions for min/max vector reductions

Bill Schmidt wschmidt@linux.vnet.ibm.com
Thu Sep 17 16:32:00 GMT 2015


On Thu, 2015-09-17 at 09:18 -0500, Bill Schmidt wrote:
> On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > On Wed, 16 Sep 2015, Alan Lawrence wrote:
> > 
> > > On 16/09/15 17:10, Bill Schmidt wrote:
> > > > 
> > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> > > > > On 16/09/15 15:28, Bill Schmidt wrote:
> > > > > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> > > > > > 
> > > > > >           * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX,
> > > > > > UNSPEC_REDUC_SMIN,
> > > > > >           UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> > > > > >           UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> > > > > >           UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> > > > > >           (reduc_smax_v2di): New define_expand.
> > > > > >           (reduc_smax_scal_v2di): Likewise.
> > > > > >           (reduc_smin_v2di): Likewise.
> > > > > >           (reduc_smin_scal_v2di): Likewise.
> > > > > >           (reduc_umax_v2di): Likewise.
> > > > > >           (reduc_umax_scal_v2di): Likewise.
> > > > > >           (reduc_umin_v2di): Likewise.
> > > > > >           (reduc_umin_scal_v2di): Likewise.
> > > > > >           (reduc_smax_v4si): Likewise.
> > > > > >           (reduc_smin_v4si): Likewise.
> > > > > >           (reduc_umax_v4si): Likewise.
> > > > > >           (reduc_umin_v4si): Likewise.
> > > > > >           (reduc_smax_v8hi): Likewise.
> > > > > >           (reduc_smin_v8hi): Likewise.
> > > > > >           (reduc_umax_v8hi): Likewise.
> > > > > >           (reduc_umin_v8hi): Likewise.
> > > > > >           (reduc_smax_v16qi): Likewise.
> > > > > >           (reduc_smin_v16qi): Likewise.
> > > > > >           (reduc_umax_v16qi): Likewise.
> > > > > >           (reduc_umin_v16qi): Likewise.
> > > > > >           (reduc_smax_scal_<mode>): Likewise.
> > > > > >           (reduc_smin_scal_<mode>): Likewise.
> > > > > >           (reduc_umax_scal_<mode>): Likewise.
> > > > > >           (reduc_umin_scal_<mode>): Likewise.
> > > > > 
> > > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't
> > > > > be used if the _scal are present. The non-_scal's were previously
> > > > > defined as producing a vector with one element holding the result
> > > > > and the other elements all zero, and this was only ever used with a
> > > > > vec_extract immediately after; the _scal pattern now includes the
> > > > > vec_extract as well. Hence the non-_scal patterns are deprecated /
> > > > > considered legacy, as per md.texi.
> > > > 
> > > > Thanks -- I had misread the description of the non-scalar versions,
> > > > missing the part where the other elements are zero.  What I really
> > > > want/need is an optab defined as computing the maximum value in all
> > > > elements of the vector.
> > > 
> > > Yes, indeed. It seems reasonable to me that this would coexist with an optab
> > > which computes only a single value (i.e. a scalar).
> > 
> > So just to clarify - you need to reduce the vector with max to a scalar
> > but want the (same) result in all vector elements?
> 
> Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> reduction to scalar, followed by a scalar broadcast to get the value
> into all positions.  It happens that our most efficient expansion to
> reduce to scalar will naturally produce the value in all positions.
> Using the reduc_smax_scal_<mode> expander, we are required to extract
> this into a scalar and then perform the broadcast.  Those two operations
> are unnecessary.  We can clean up after the fact, but the cost model
> will still reflect the unnecessary activity.
> 
> > 
> > > At that point it might be appropriate to change the cond-reduction code to
> > > generate the reduce-to-vector in all cases, and optabs.c expand it to
> > > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with
> > > the (parallel) changes to cost model already proposed, does that cover all the
> > > cases? It does add a new tree code, yes, but I'm feeling that could be
> > > justified if we go down this route.
> > 
> > I'd rather have combine recognize an insn that does both (if it
> > exists).  As I understand powerpc doesn't have reduction to scalar
> > (I think no ISA actually has this, but all ISAs have reduce to
> > one vector element plus a cheap way of extraction (effectively a subreg))
> > but its reduction already has all result vector elements set to the
> > same value (as opposed to having some undefined or zero or whatever).
> 
> The extraction is not necessarily so cheap.  For floating-point values
> in a register, we can indeed use a subreg.  But in this case we are
> talking about integers, and an extract must move them from the vector
> registers to the GPRs.  With POWER8, we can do this with a 5-cycle move
> instruction.  Prior to that, we can only do this with a very expensive
> path through memory, as previous processors had no direct path between
> the vector and integer registers.
> 
> > 
> > > However, another point that springs to mind: if you reduce a loop
> > > containing OR or MUL expressions, the vect_create_epilog_for_reduction
> > > reduces these using shifts, and I think will also use shifts for
> > > platforms not possessing a reduc_plus/min/max. If shifts could be
> > > changed for rotates, the code there would do your
> > > reduce-to-a-vector-of-identical-elements in the midend... can we
> > > (sensibly!) bring all of these together?
> 
> If you think it's reasonable, I will take a look at whether this can
> work.  For targets like PowerPC and AArch32, perhaps we are better off
> not defining an expander at all, but letting the vector epilogue code
> expose the sequence of shifts/max-mins at the GIMPLE level.  For this
> case the vectorizer would then not generate the unnecessary extract and
> broadcast.

Well, I don't see an easy way to do this either.  We really need either
a whole-vector rotate or a double-vector shift to do the expansion
early, neither of which has a defined standard pattern name.  So either
way we would need to introduce a new GIMPLE code and a new optab entry.
I'm still leaning towards exploring the reduce-to-vector approach.

Thanks,
Bill

> 
> However, I'd want to let Alan Hayward finish revising and committing his
> patch first.
> 
> Regards,
> Bill
> 
> > >
> > > > Perhaps the practical thing is to have the vectorizer also do an
> > > > add_stmt_cost with some new token that indicates the cost model should
> > > > make an adjustment if the back end doesn't need the extract/broadcast.
> > > > Targets like PowerPC and AArch32 could then subtract the unnecessary
> > > > cost, and remove the unnecessary code in simplify-rtx.
> > > 
> > > I think it'd be good if we could do it before simplify-rtx, really, although
> > > I'm not sure I have a strong argument as to why, as long as we can cost it
> > > appropriately.
> > > 
> > > > In any case, I will remove implementing the deprecated optabs, and I'll
> > > > also try to look at Alan L's patch shortly.
> > > 
> > > That'd be great, thanks :)
> > 
> > As for the cost modeling the simple solution is of course to have
> > these as "high-level" costs (a new enum entry), but the sustainable
> > solution is to have the target see the vectorized code and decide
> > for itself.  Iff we'd go down the simple route then the target
> > may as well create the vectorized code for that special high-level
> > operation itself.
> > 
> > Richard.
> > 
> 



