
Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP


Ira Rosen/Haifa/IBM wrote on 25/08/2008 15:22:28:

> Dorit Nuzman/Haifa/IBM wrote on 23/08/2008 09:52:47:
>
> > Ira Rosen/Haifa/IBM wrote on 21/08/2008 14:21:41:
> >
> > > Dorit Nuzman/Haifa/IBM wrote on 08/08/2008 18:06:46:
> > >
> > > > I have a problem with the fact that this specific permutation is so
> > > > hard-coded into the analysis. It's ok to support only one
> > > > permutation as a start, but the analysis itself should be general.
> > > > Hopefully this could be rewritten to identify more general patterns
> > > > during the analysis, represent the identified permutation somehow
> > > > (e.g. [3,2,1,0]), and then decide if we can proceed to vectorize
> > > > it or not.
> > >
> > > I changed the analysis part, so now during the SLP tree construction
> > > we only store the permutation, and check if the permutation is
> > > supported afterwards. I am attaching the updated (not fully tested)
> > > analysis part of the patch.
> > >
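
(Purely as an illustration of the scheme described above, here is a rough
plain-C sketch. The names slp_sketch_node, load_permutation and
sketch_permutation_supported_p are invented for the example and are not
the patch or GCC internals.)

/* Record the load order as a plain index vector during SLP tree
   construction, and only check afterwards whether it is supported.  */

#define GROUP_SIZE 3

struct slp_sketch_node
{
  /* load_permutation[i] gives the position within the interleaved
     group that operand i reads from; {0, 1, 2} means no permutation,
     {2, 1, 0} a reversal, and so on.  */
  int load_permutation[GROUP_SIZE];
};

/* Called after the whole SLP tree is built: decide whether the
   recorded permutation is one we can generate code for.  */
static int
sketch_permutation_supported_p (const struct slp_sketch_node *node)
{
  int i;

  for (i = 0; i < GROUP_SIZE; i++)
    if (node->load_permutation[i] < 0
        || node->load_permutation[i] >= GROUP_SIZE)
      return 0;

  /* A real implementation would also match the pattern against what
     the target's permute support can actually do.  */
  return 1;
}
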
> >
> > great, thanks! (when you check in this patch, maybe add a couple of
> > testcases for permutations that are not yet supported).
>
> Such testcases already exist in the original patch.
>
> >
> > (small question/request: can you please document the difference
> > between vect_supported_slp_permutation_p and
> > vect_supported_load_permutation_p?)
>
> Sure.
>
> >
> > ...
> > > > I also have a problem with the transformation: it assumes a very
> > > > specific form of permute at the gimple level - a permute that takes
> > > > two vectors as input and a byte mask. I don't think this is a
> > > > general enough representation (I don't think that the SSE shuffles
> > > > take a byte mask for example?).
> > > > We need to think of a more general
> > > > way to represent a permute at this level, and maybe have a target
> > > > specific builtin expand it using byte mask when appropriate.
> > >
> > > AFAIK, SSE5 permute does take two vectors as input and a byte mask.
> > > But the mask is not similar to the altivec/SPU mask.
> >
> > Intel's SSE/AVX shuffle/permute insns (e.g. pshuf*, vpermil2*) have
> > 8-bit control fields per element (rather than per byte), and some of
> > these insns shuffle/permute elements only from a single input vector
> > (rather than two) ...
> >
> > > Maybe I can create
> > > an element mask at the tree level and leave the correct mask
> > > creation to the target (builtin)?
> > >
> >
> > yes, I think the mask creation should be done on a target specific
> > basis; the vectorizer could create a control mask given as a vector
> > of indices per element.
>
> Actually, the original patch already creates an element mask and then
> calls vect_get_mask_element() to convert the mask according to its
> type (received from builtin_vec_perm).
> I must change the variable names (like mask_bytes, first_byte,
> etc.), but otherwise the mask creation is already target specific...
>

great.
ok with these changes,

thanks,
dorit
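
(As an aside, here is a minimal plain-C sketch of the conversion discussed
above: the vectorizer produces a per-element index mask, and a target-specific
step expands it into an altivec/SPU-style per-byte mask. It assumes 4 elements
of 4 bytes each; the helper name element_mask_to_byte_mask is made up and is
not the vect_get_mask_element code from the patch.)

#define NELTS     4
#define ELT_BYTES 4

/* Expand a per-element selector (indices into the concatenation of the
   two input vectors, so 0 .. 2*NELTS-1) into a per-byte selector of the
   kind used by altivec/SPU vperm-style instructions.  */
static void
element_mask_to_byte_mask (const unsigned char elt_sel[NELTS],
                           unsigned char byte_sel[NELTS * ELT_BYTES])
{
  unsigned int i, j;

  for (i = 0; i < NELTS; i++)
    for (j = 0; j < ELT_BYTES; j++)
      /* Element k starts at byte k * ELT_BYTES of the concatenated
         inputs.  */
      byte_sel[i * ELT_BYTES + j] = elt_sel[i] * ELT_BYTES + j;
}

/* For example, the element mask {3, 2, 1, 0} (reverse one vector)
   expands to the byte mask {12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3}.  */
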

> >
> > I guess we can start by introducing a 2-operand permute (this is
> > what the vectorizer would currently know how to use), but it may be
> > useful to consider a single operand permute (+ control mask) later on.
> >
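
(To make the two flavours concrete, a small plain-C sketch using element
indices only, not target byte masks; perm2 and perm1 are invented names.
A two-operand permute selects from the concatenation of both inputs, and a
single-operand permute is just the special case where both inputs are the
same vector.)

#define NELTS 4

/* Two-operand permute: sel[i] in [0, 2*NELTS) picks element sel[i] of
   the concatenation {in0, in1}.  */
static void
perm2 (const int in0[NELTS], const int in1[NELTS],
       const unsigned char sel[NELTS], int out[NELTS])
{
  unsigned int i;

  for (i = 0; i < NELTS; i++)
    out[i] = sel[i] < NELTS ? in0[sel[i]] : in1[sel[i] - NELTS];
}

/* Single-operand permute: both inputs are the same vector.  */
static void
perm1 (const int in[NELTS], const unsigned char sel[NELTS],
       int out[NELTS])
{
  perm2 (in, in, sel, out);
}
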
> > > >
> > > > Actually, I think the particular testcase you are targeting could be
> > > > vectorized by preparing an appropriate vector of constants instead
> > > > of working so hard on permuting the loads. Maybe we can try
> > > > something like that for now (and potentially defer the decision on a
> > > > representation of permute to a separate patch (and testcase)?)
> > >
> > > I don't think this will work. If we only permute the constants, we
> > > can't get the products in the correct order and we will have to
> > > permute them anyway:
> > > yi = M00 * ri + M01 * gi + M02 * bi
> > > ui = M11 * gi + M12 * bi + M10 * ri
> > > vi = M22 * bi + M20 * ri + M21 * gi
> > > (we have gbr and brg in the second and third columns instead of rgb).
> > >
> > > If the number of grouped statements is smaller than the vector
> > > size (as in the rgb conversion), we need to unroll the loop, and
> > > then such a permutation will be done across several vectors and
> > > will be as painful as the load permutation.
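
(For reference, a rough C version of the kind of rgb-to-yuv loop being
discussed; it is illustrative only, the actual testcase in the patch may
differ, and the M* coefficients are placeholders. Each iteration loads r, g
and b once but uses them in a different order in the three statements, which
is exactly where the load permutation comes in.)

/* Placeholder conversion coefficients.  */
#define M00 1
#define M01 2
#define M02 3
#define M10 4
#define M11 5
#define M12 6
#define M20 7
#define M21 8
#define M22 9

void
rgb_to_yuv (const int *r, const int *g, const int *b,
            int *y, int *u, int *v, int n)
{
  int i;

  for (i = 0; i < n; i++)
    {
      y[i] = M00 * r[i] + M01 * g[i] + M02 * b[i];
      u[i] = M11 * g[i] + M12 * b[i] + M10 * r[i];
      v[i] = M22 * b[i] + M20 * r[i] + M21 * g[i];
    }
}
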
> > >
> >
> > ok. In a separate followup patch we could look into optimizing cases
> > in which the group size is equal to the vector size (like rgba).
>
> OK.
>
> Thanks,
> Ira

