[PATCH, rs6000] Add expansions for min/max vector reductions

Richard Biener rguenther@suse.de
Fri Sep 18 08:36:00 GMT 2015


On Thu, 17 Sep 2015, Bill Schmidt wrote:

> On Thu, 2015-09-17 at 09:18 -0500, Bill Schmidt wrote:
> > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > > On Wed, 16 Sep 2015, Alan Lawrence wrote:
> > > 
> > > > On 16/09/15 17:10, Bill Schmidt wrote:
> > > > > 
> > > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> > > > > > On 16/09/15 15:28, Bill Schmidt wrote:
> > > > > > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> > > > > > > 
> > > > > > >           * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX,
> > > > > > > UNSPEC_REDUC_SMIN,
> > > > > > >           UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> > > > > > >           UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> > > > > > >           UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> > > > > > >           (reduc_smax_v2di): New define_expand.
> > > > > > >           (reduc_smax_scal_v2di): Likewise.
> > > > > > >           (reduc_smin_v2di): Likewise.
> > > > > > >           (reduc_smin_scal_v2di): Likewise.
> > > > > > >           (reduc_umax_v2di): Likewise.
> > > > > > >           (reduc_umax_scal_v2di): Likewise.
> > > > > > >           (reduc_umin_v2di): Likewise.
> > > > > > >           (reduc_umin_scal_v2di): Likewise.
> > > > > > >           (reduc_smax_v4si): Likewise.
> > > > > > >           (reduc_smin_v4si): Likewise.
> > > > > > >           (reduc_umax_v4si): Likewise.
> > > > > > >           (reduc_umin_v4si): Likewise.
> > > > > > >           (reduc_smax_v8hi): Likewise.
> > > > > > >           (reduc_smin_v8hi): Likewise.
> > > > > > >           (reduc_umax_v8hi): Likewise.
> > > > > > >           (reduc_umin_v8hi): Likewise.
> > > > > > >           (reduc_smax_v16qi): Likewise.
> > > > > > >           (reduc_smin_v16qi): Likewise.
> > > > > > >           (reduc_umax_v16qi): Likewise.
> > > > > > >           (reduc_umin_v16qi): Likewise.
> > > > > > >           (reduc_smax_scal_<mode>): Likewise.
> > > > > > >           (reduc_smin_scal_<mode>): Likewise.
> > > > > > >           (reduc_umax_scal_<mode>): Likewise.
> > > > > > >           (reduc_umin_scal_<mode>): Likewise.
> > > > > > 
> > > > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be
> > > > > > used if
> > > > > > the _scal are present. The non-_scal's were previously defined as
> > > > > > producing a
> > > > > > vector with one element holding the result and the other elements all
> > > > > > zero, and
> > > > > > this was only ever used with a vec_extract immediately after; the _scal
> > > > > > pattern
> > > > > > now includes the vec_extract as well. Hence the non-_scal patterns are
> > > > > > deprecated / considered legacy, as per md.texi.
> > > > > 
> > > > > Thanks -- I had misread the description of the non-scalar versions,
> > > > > missing the part where the other elements are zero.  What I really
> > > > > want/need is an optab defined as computing the maximum value in all
> > > > > elements of the vector.
> > > > 
> > > > Yes, indeed. It seems reasonable to me that this would coexist with an optab
> > > > which computes only a single value (i.e. a scalar).
> > > 
> > > So just to clarify - you need to reduce the vector with max to a scalar
> > > but want the (same) result in all vector elements?
> > 
> > Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> > reduction to scalar, followed by a scalar broadcast to get the value
> > into all positions.  It happens that our most efficient expansion to
> > reduce to scalar will naturally produce the value in all positions.
> > Using the reduc_smax_scal_<mode> expander, we are required to extract
> > this into a scalar and then perform the broadcast.  Those two operations
> > are unnecessary.  We can clean up after the fact, but the cost model
> > will still reflect the unnecessary activity.
> > 
> > > 
> > > > At that point it might be appropriate to change the cond-reduction code to
> > > > generate the reduce-to-vector in all cases, and optabs.c expand it to
> > > > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with
> > > > the (parallel) changes to cost model already proposed, does that cover all the
> > > > cases? It does add a new tree code, yes, but I'm feeling that could be
> > > > justified if we go down this route.
> > > 
> > > I'd rather have combine recognize an insn that does both (if it
> > > exists).  As I understand powerpc doesn't have reduction to scalar
> > > (I think no ISA actually has this, but all ISAs have reduce to
> > > one vector element plus a cheap way of extraction (effectively a subreg))
> > > but it's reduction already has all result vector elements set to the
> > > same value (as opposed to having some undefined or zero or whatever).
> > 
> > The extraction is not necessarily so cheap.  For floating-point values
> > in a register, we can indeed use a subreg.  But in this case we are
> > talking about integers, and an extract must move them from the vector
> > registers to the GPRs.  With POWER8, we can do this with a 5-cycle move
> > instruction.  Prior to that, we can only do this with a very expensive
> > path through memory, as previous processors had no direct path between
> > the vector and integer registers.
> > 
> > > 
> > > > However, another point that springs to mind: if you reduce a loop containing
> > > > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these
> > > > using shifts, and I think will also use shifts for platforms not possessing a
> > > > reduc_plus/min/max. If shifts could be changed for rotates, the code there
> > > > would do your reduce-to-a-vector-of-identical-elements in the midend...can we
> > > > (sensibly!) bring all of these together?
> > 
> > If you think it's reasonable, I will take a look at whether this can
> > work.  For targets like PowerPC and AArch32, perhaps we are better off
> > not defining an expander at all, but letting the vector epilogue code
> > expose the sequence of shifts/max-mins at the GIMPLE level.  For this
> > case the vectorizer would then not generate the unnecessary extract and
> > broadcast.
> 
> Well, I don't see an easy way to do this either.  We really need either
> a whole-vector rotate or a double-vector shift to do the expansion
> early, neither of which has a defined standard pattern name.  So either
> way we would need to introduce a new GIMPLE code and a new optab entry.
> I'm still leaning towards interest in exploring the reduce-to-vector
> approach.

Btw, we ditched the original reduce-to-vector variant due to its
endianess issues (it only had _one_ element of the vector contain
the reduction result).  Re-introducing reduce-to-vector but with
the reduction result in all elements wouldn't have any issue like
that.

Of course it would introduce a third optab variant similar to the
old (now legacy) version but with different semantics...

Richard.

> Thanks,
> Bill
> 
> > 
> > However, I'd want to let Alan Hayward finish revising and committing his
> > patch first.
> > 
> > Regards,
> > Bill
> > 
> > > >
> > > > > Perhaps the practical thing is to have the vectorizer also do an
> > > > > add_stmt_cost with some new token that indicates the cost model should
> > > > > make an adjustment if the back end doesn't need the extract/broadcast.
> > > > > Targets like PowerPC and AArch32 could then subtract the unnecessary
> > > > > cost, and remove the unnecessary code in simplify-rtx.
> > > > 
> > > > I think it'd be good if we could do it before simplify-rtx, really, although
> > > > I'm not sure I have a strong argument as to why, as long as we can cost it
> > > > appropriately.
> > > > 
> > > > > In any case, I will remove implementing the deprecated optabs, and I'll
> > > > > also try to look at Alan L's patch shortly.
> > > > 
> > > > That'd be great, thanks :)
> > > 
> > > As for the cost modeling the simple solution is of course to have
> > > these as "high-level" costs (a new enum entry), but the sustainable
> > > solution is to have the target see the vectorized code and decide
> > > for itself.  Iff we'd go down the simple route then the target
> > > may as well create the vectorized code for that special high-level
> > > operation itself.
> > > 
> > > Richard.
> > > 
> > 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)



More information about the Gcc-patches mailing list