[patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP

Ira Rosen IRAR@il.ibm.com
Mon Aug 25 12:55:00 GMT 2008



Dorit Nuzman/Haifa/IBM wrote on 23/08/2008 09:52:47:

> Ira Rosen/Haifa/IBM wrote on 21/08/2008 14:21:41:
>
> > Dorit Nuzman/Haifa/IBM wrote on 08/08/2008 18:06:46:
> >
> > > I have a problem with the fact that this specific permutation is so
> > > hard-coded into the analysis. It's ok to support only one
> > > permutation as a start, but the analysis itself should be general.
> > > Hopefully this could be rewritten to identify more general patterns
> > > during the analysis, represent the identified permutation somehow
> > > (e.g. [3,2,1,0]), and then decide if we can proceed to vectorize
> > it or not.
> >
> > I changed the analysis part, so now during the SLP tree construction
> > we only store the permutation, and check if the permutation is
> > supported afterwards. I am attaching the updated (not fully tested)
> > analysis part of the patch.
> >
>
> great, thanks! (when you ci this patch, maybe add a couple testcases
> for permutations that are not yet supported).

Such testcases already exist in the original patch.
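
To make the split between "store the permutation" and "check if it is
supported" concrete, here is a minimal sketch in plain C. It is not the
code from the patch: the helper names perm_is_valid() and
perm_is_rotation() are hypothetical, and treating rotations as the
"supported" class is only an assumption made for illustration (the gbr
and brg orders from the rgb example are rotations of rgb, while
[3,2,1,0] is not).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if perm[0..n-1] uses every index in [0, n)
   exactly once, i.e. it really is a permutation of the group.  */
bool perm_is_valid(const unsigned *perm, size_t n)
{
    bool seen[64] = { false };

    assert(n <= 64);  /* sketch-only limit on the group size */
    for (size_t i = 0; i < n; i++) {
        if (perm[i] >= n || seen[perm[i]])
            return false;
        seen[perm[i]] = true;
    }
    return true;
}

/* Hypothetical support check, run after the permutation was recorded:
   accept only rotations such as [1,2,0] or [2,0,1] (and the identity).  */
bool perm_is_rotation(const unsigned *perm, size_t n)
{
    if (n == 0 || !perm_is_valid(perm, n))
        return false;
    for (size_t i = 0; i < n; i++)
        if (perm[i] != (perm[0] + i) % n)
            return false;
    return true;
}
```

With this shape, the tree construction only needs to fill in an index
array per load node, and any new supported pattern becomes one more
predicate over that array.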

>
> (small question/request: can you please document what's the
> difference between vect_supported_slp_permutation_p and
> vect_supported_load_permutation_p?)

Sure.

>
> ...
> > > I also have a problem with the transformation: it assumes a very
> > > specific form of permute at the gimple level - a permute that takes
> > > two vectors as input and a byte mask. I don't think this is a
> > > general enough representation (I don't think that the SSE shuffles
> > > take a byte mask for example?).
> > > We need to think of a more general
> > > way to represent a permute at this level, and maybe have a target
> > > specific builtin expand it using byte mask when appropriate.
> >
> > AFAIK, SSE5 permute does take two vectors as input and a byte mask.
> > But the mask is not similar to altivec/spu mask.
>
> Intel's SSE/AVX shuffle/permute insns (e.g. pshuf*, vpermil2*) have
> 8-bit control fields per element (rather than per byte), and some of
> these insns shuffle/permute elements only from a single input vector
> (rather than two) ...
>
> > Maybe I can create
> > an element mask at the tree level and leave the correct mask
> > creation to the target (builtin)?
> >
>
> yes, I think the mask creation should be done on a target specific
> basis; the vectorizer could create a control mask given as a vector
> of indices per element.

Actually, the original patch already creates element mask and then calls
vect_get_mask_element() to convert the mask according to its type (received
from builtin_vec_perm).
I must change the variable names (like mask_bytes, first_byte, etc.), but
otherwise the mask creation is already target specific...
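
As an illustration of the element-mask to target-mask step, the sketch
below expands a vector of element indices into a byte mask of the kind
a two-input byte permute expects. This is not the body of
vect_get_mask_element(); the function name element_mask_to_byte_mask()
is hypothetical, and it assumes that byte k of element e in the
concatenated inputs has index e * elt_size + k.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expand nelts element indices into
   nelts * elt_size byte indices.  Element indices may address the
   concatenation of two input vectors, so they can range over
   [0, 2 * nelts).  byte_mask must have room for nelts * elt_size
   entries.  */
void element_mask_to_byte_mask(const unsigned char *elt_mask, size_t nelts,
                               size_t elt_size, unsigned char *byte_mask)
{
    assert(elt_size > 0);
    for (size_t i = 0; i < nelts; i++)
        for (size_t b = 0; b < elt_size; b++)
            byte_mask[i * elt_size + b] =
                (unsigned char)(elt_mask[i] * elt_size + b);
}
```

For four 4-byte elements, the element mask [3,2,1,0] expands to the
byte mask [12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3]. A target whose
permute takes per-element control fields would skip this expansion and
use the element indices directly.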

>
> I guess we can start by introducing a 2-operand permute (this is
> what the vectorizer would currently know how to use), but it may be
> useful to consider a single operand permute (+ control mask) later on.
>
> > >
> > > Actually, I think the particular testcase you are targeting could be
> > > vectorized by preparing an appropriate vector of constants instead
> > > of working so hard on permuting the loads. Maybe we can try
> > > something like that for now (and potentially defer the decision on a
> > > representation of permute to a separate patch (and testcase)?)
> >
> > I don't think this will work. If we only permute the constants, we
> > can't get the multiples in the correct order and we will have to
> > permute them anyway:
> > yi = M00 * ri + M01 * gi + M02 * bi
> > ui = M11 * gi + M12 * bi + M10 * ri
> > vi = M22 * bi + M20 * ri + M21 * gi
> > (we have gbr and brg in the second and third columns instead of rgb).
> >
> > In case that the number of the grouped statements is smaller than
> > the vector size (as in the rgb conversion), we need to unroll the
> > loop, and then such permutation will be done across several vectors
> > and will be as painful as the load permutation.
> >
>
> ok. In a separate followup patch we could look into optimizing cases
> in which the group size is equal to the vector size (like rgba).

OK.
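
For reference, the rgb conversion quoted above corresponds to a scalar
loop of roughly the following shape. This is only a sketch: the
function name rgb_to_yuv(), the int element type and the interleaved
array layout are assumptions, not taken from the testcase. The operand
order in each statement follows the quoted equations, so the second
and third statements read the loads in gbr and brg order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar form of the conversion: in and out hold n
   interleaved 3-element pixels, M is the 3x3 coefficient matrix.  */
void rgb_to_yuv(const int *in, int *out, size_t n, const int M[3][3])
{
    for (size_t i = 0; i < n; i++) {
        int r = in[3 * i];
        int g = in[3 * i + 1];
        int b = in[3 * i + 2];

        out[3 * i]     = M[0][0] * r + M[0][1] * g + M[0][2] * b; /* rgb */
        out[3 * i + 1] = M[1][1] * g + M[1][2] * b + M[1][0] * r; /* gbr */
        out[3 * i + 2] = M[2][2] * b + M[2][0] * r + M[2][1] * g; /* brg */
    }
}
```

The group size here is 3, which is smaller than a 4-element vector, so
the loop has to be unrolled before the group fills whole vectors; this
is why the permutation ends up spanning several vectors.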

Thanks,
Ira