This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP
- From: Dorit Nuzman <DORIT at il dot ibm dot com>
- To: Ira Rosen <IRAR at il dot ibm dot com>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Sat, 23 Aug 2008 09:52:47 +0300
- Subject: Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP
Ira Rosen/Haifa/IBM wrote on 21/08/2008 14:21:41:
> Dorit Nuzman/Haifa/IBM wrote on 08/08/2008 18:06:46:
>
> > I have a problem with the fact that this specific permutation is so
> > hard-coded into the analysis. It's ok to support only one
> > permutation as a start, but the analysis itself should be general.
> > Hopefully this could be rewritten to identify more general patterns
> > during the analysis, represent the identified permutation somehow
> > (e.g. [3,2,1,0]), and then decide if we can proceed to vectorize
> > it or not.
>
> I changed the analysis part, so now during the SLP tree construction
> we only store the permutation, and check if the permutation is
> supported afterwards. I am attaching the updated (not fully tested)
> analysis part of the patch.
>
Great, thanks! (When you check in this patch, maybe add a couple of
testcases for permutations that are not yet supported.)
(Small question/request: can you please document the difference
between vect_supported_slp_permutation_p and
vect_supported_load_permutation_p?)
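
A minimal sketch of the "store the permutation first, decide later" split
discussed above. This is not the patch's code: the function names are
hypothetical, and the rule that only rotations are supported is an assumption
chosen purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from the patch): true if PERM, of length N,
   contains each index 0..N-1 exactly once.  Assumes N <= 64. */
bool perm_is_valid(const unsigned *perm, size_t n)
{
    bool seen[64] = { false };

    if (n > 64)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (perm[i] >= n || seen[perm[i]])
            return false;
        seen[perm[i]] = true;
    }
    return true;
}

/* Hypothetical support check, run only after the permutation has been
   recorded.  ASSUMPTION for illustration: only rotations such as
   [1,2,0] or [2,0,1] are accepted; anything else is rejected. */
bool perm_is_supported(const unsigned *perm, size_t n)
{
    if (!perm_is_valid(perm, n))
        return false;
    for (size_t i = 0; i < n; i++) {
        if (perm[i] != (perm[0] + i) % n)
            return false;
    }
    return true;
}
```

With this split, [3,2,1,0] is recorded as a valid permutation and is only
rejected by the second step, which is the kind of not-yet-supported case a
testcase could cover.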
...
> > I also have a problem with the transformation: it assumes a very
> > specific form of permute at the gimple level - a permute that takes
> > two vectors as input and a byte mask. I don't think this is a
> > general enough representation (I don't think that the SSE shuffles
> > take a byte mask for example?).
> > We need to think of a more general
> > way to represent a permute at this level, and maybe have a target
> > specific builtin expand it using byte mask when appropriate.
>
> AFAIK, SSE5 permute does take two vectors as input and a byte mask.
> But the mask is not similar to altivec/spu mask.
Intel's SSE/AVX shuffle/permute insns (e.g. pshuf*, vpermil2*) have 8-bit
control fields per element (rather than per byte), and some of these insns
shuffle/permute elements only from a single input vector (rather than
two) ...
> Maybe I can create
> an element mask at the tree level and leave the correct mask
> creation to the target (builtin)?
>
Yes, I think the mask creation should be done on a target-specific basis;
the vectorizer could create a control mask expressed as a vector of element
indices.
I guess we can start by introducing a 2-operand permute (this is what the
vectorizer would currently know how to use), but it may be useful to
consider a single-operand permute (+ control mask) later on.
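
To make the division of labour concrete, here is a rough sketch of how a
target could expand an element-index control mask into a byte mask for a
2-operand byte permute. The function name and signature are hypothetical, and
it assumes byte numbering follows element numbering across the concatenation
of the two inputs (as with a big-endian altivec-style byte permute); other
targets would need a different expansion.

```c
#include <stddef.h>

/* Hypothetical target-side expansion (not an existing GCC interface).
   SEL[i], in the range [0, 2*NELTS), picks one element from the
   concatenation of the two input vectors.  ELT_SIZE is the element
   size in bytes.  BYTE_SEL receives NELTS * ELT_SIZE byte indices.
   ASSUMPTION: byte k of element e sits at byte offset e*ELT_SIZE + k. */
void element_sel_to_byte_sel(const unsigned char *sel, size_t nelts,
                             size_t elt_size, unsigned char *byte_sel)
{
    for (size_t i = 0; i < nelts; i++) {
        for (size_t b = 0; b < elt_size; b++) {
            byte_sel[i * elt_size + b] =
                (unsigned char)(sel[i] * elt_size + b);
        }
    }
}
```

For example, the element mask [3,2,1,0] on four 4-byte elements expands to
the byte mask 12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3.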
> >
> > Actually, I think the particular testcase you are targeting could be
> > vectorized by preparing an appropriate vector of constants instead
> > of working so hard on permuting the loads. Maybe we can try
> > something like that for now (and potentially defer the decision on a
> > representation of permute to a separate patch (and testcase)?)
>
> I don't think this will work. If we only permute the constants, we
> can't get the multiples in the correct order and we will have to
> permute them anyway:
> yi = M00 * ri + M01 * gi + M02 * bi
> ui = M11 * gi + M12 * bi + M10 * ri
> vi = M22 * bi + M20 * ri + M21 * gi
> (we have gbr and brg in the second and third columns instead of rgb).
>
> In case that the number of the grouped statements is smaller than
> the vector size (as in the rgb conversion), we need to unroll the
> loop, and then such permutation will be done across several vectors
> and will be as painful as the load permutation.
>
ok. In a separate followup patch we could look into optimizing cases in
which the group size is equal to the vector size (like rgba).
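
For reference, a scalar sketch of the kind of loop being discussed, written
with the operand order from the quoted equations. The function name, the
interleaved int layout and the matrix type are assumptions made for
illustration; the actual testcase may differ.

```c
#include <stddef.h>

/* Hypothetical scalar form of the rgb conversion (names and types
   are assumptions).  IN holds interleaved r,g,b values and OUT
   receives interleaved y,u,v values.  Reading the operands column by
   column, the loads appear as (r,g,b), (g,b,r) and (b,r,g), i.e.
   load permutations [0,1,2], [1,2,0] and [2,0,1]. */
void rgb_to_yuv(const int *in, int *out, size_t npixels,
                const int M[3][3])
{
    for (size_t i = 0; i < npixels; i++) {
        int r = in[3 * i];
        int g = in[3 * i + 1];
        int b = in[3 * i + 2];

        out[3 * i]     = M[0][0] * r + M[0][1] * g + M[0][2] * b;
        out[3 * i + 1] = M[1][1] * g + M[1][2] * b + M[1][0] * r;
        out[3 * i + 2] = M[2][2] * b + M[2][0] * r + M[2][1] * g;
    }
}
```

The group size here is 3, which is smaller than a 4-element vector, so this is
the case that needs unrolling; an rgba variant would have a group size equal
to the vector size.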
thanks,
dorit
> Thanks,
> Ira
>
> [attachment "slp-perm-updated.txt" deleted by Dorit Nuzman/Haifa/IBM]