[patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP

Thu Aug 21 11:45:00 GMT 2008

Dorit Nuzman/Haifa/IBM wrote on 08/08/2008 18:06:46:

> I have a problem with the fact that this specific permutation is so
> hard-coded into the analysis. It's ok to support only one
> permutation as a start, but the analysis itself should be general.
> Hopefully this could be rewritten to identify more general patterns
> during the analysis, represent the identified permutation somehow
> (e.g. [3,2,1,0]), and then decide if we can proceed to vectorize it or
> not.

I changed the analysis part: during the SLP tree construction we now only
record the permutation, and we check whether that permutation is supported
afterwards. I am attaching the updated (not fully tested) analysis part of
the patch.
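
To illustrate the split between recording and checking, here is a minimal
sketch. It is not the code from the attached patch: the struct, its fields
and the function name are hypothetical, and the only pattern it accepts is
the rgb-style one discussed in this thread, where every load of node N reads
element N of the group of adjacent accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for what the SLP tree construction records:
   perm[n * group_size + j] is the index, within the group of adjacent
   loads, that the j-th load of the n-th load node accesses.  */
struct load_perm
{
  const unsigned *perm;
  size_t group_size;   /* number of statements in each SLP node */
  size_t num_nodes;    /* number of SLP nodes that contain loads */
};

/* Separate check, run once the tree is built.  Accept only the pattern
   in which all loads of node N read element N of the group, e.g.
   [0,0,0, 1,1,1, 2,2,2] for the rgb conversion.  Other permutations are
   still represented, they are just rejected here.  */
static bool
load_perm_supported_p (const struct load_perm *lp)
{
  assert (lp != NULL && lp->perm != NULL);

  if (lp->num_nodes != lp->group_size)
    return false;

  for (size_t n = 0; n < lp->num_nodes; n++)
    for (size_t j = 0; j < lp->group_size; j++)
      if (lp->perm[n * lp->group_size + j] != n)
        return false;

  return true;
}
```

With this layout the rgb case is stored as [0,0,0,1,1,1,2,2,2] and passes
the check, while a reversal such as [3,2,1,0] can be stored just as well
but is rejected until the check learns about it.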

>
> > to YUV conversion, that can be viewed as {y, u, v} = M * {r, g, b},
> > where M is a matrix of constant coefficients, and the calculation is
> > performed in a single-nested loop:
> > for i
> >   yi = M00 * ri + M01 * gi + M02 * bi
> >   ui = M10 * ri + M11 * gi + M12 * bi
> >   vi = M20 * ri + M21 * gi + M22 * bi
> > The required permutation of loads is to transform rgb stream into
> > {r,r,r}, {g,g,g} and {b,b,b} vectors (ignoring vector size for
> > simplicity).
>
> > The SLP analysis detects such cases: all the loads in the same SLP node
> > must access the same memory location, and all the SLP nodes that
> > contain loads must form a group of adjacent memory accesses. The
> > transformation phase generates vector permutations of the input vectors
> > with compiler generated masks, depending on the data type,
> > vectorization factor and size of SLP nodes.

> I also have a problem with the transformation: it assumes a very
> specific form of permute at the gimple level - a permute that takes
> two vectors as input and a byte mask. I don't think this is a
> general enough representation (I don't think that the SSE shuffles
> take a byte mask for example?).
> We need to think of a more general
> way to represent a permute at this level, and maybe have a target
> specific builtin expand it using byte mask when appropriate.

AFAIK, the SSE5 permute does take two vectors and a byte mask as input, but
its mask format differs from the AltiVec/SPU one. Maybe I can create an
element mask at the tree level and leave the creation of the actual mask to
the target (builtin)?
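
As a rough sketch of what the target side of that could look like for a
byte-mask permute: the function below expands an element mask into a byte
mask. The function name and signature are hypothetical, and it assumes the
bytes of an element are contiguous and that byte indices count through the
concatenation of the two input vectors in memory order. A target with a
different mask encoding would need its own expansion.

```c
#include <assert.h>
#include <stddef.h>

/* Expand an element-level mask into a byte-level mask.
   elt_mask[i] selects an element from the concatenation of the two
   input vectors; byte_mask must have room for nelts * elt_size bytes.
   Assumption: byte K of selected element E is byte E * elt_size + K.  */
static void
expand_element_mask (const unsigned char *elt_mask, size_t nelts,
                     size_t elt_size, unsigned char *byte_mask)
{
  assert (elt_mask != NULL && byte_mask != NULL && elt_size > 0);

  for (size_t i = 0; i < nelts; i++)
    for (size_t b = 0; b < elt_size; b++)
      byte_mask[i * elt_size + b]
        = (unsigned char) (elt_mask[i] * elt_size + b);
}
```

For example, the element mask [3,2,1,0] on four 4-byte elements expands to
the byte mask 12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3.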

>
> Actually, I think the particular testcase you are targeting could be
> vectorized by preparing an appropriate vector of constants instead
> of working so hard on permuting the loads. Maybe we can try
> something like that for now (and potentially defer the decision on a
> representation of permute to a separate patch (and testcase)?)

I don't think this will work. If we only permute the constants, the products
do not come out in the correct order, and we will have to permute them
anyway:
yi = M00 * ri + M01 * gi + M02 * bi
ui = M11 * gi + M12 * bi + M10 * ri
vi = M22 * bi + M20 * ri + M21 * gi
(we have gbr and brg in the second and third columns instead of rgb).

When the number of grouped statements is smaller than the vector size (as
in the rgb conversion), we need to unroll the loop, and then this
permutation has to be done across several vectors and will be as painful as
the load permutation.
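
To make the "across several vectors" point concrete for the load side, here
is a small sketch that computes which element of the interleaved stream
feeds each lane of the permuted vectors once the loop is unrolled. The
function name is hypothetical, and the lane layout (each scalar iteration
contributes group_size consecutive lanes, all reading the same component) is
my simplifying assumption, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* For the load node that reads component COMP (0 = r, 1 = g, 2 = b) of
   an interleaved stream with GROUP_SIZE components per iteration, return
   the stream index that feeds lane LANE of the concatenated permuted
   vectors.  Lanes are numbered across all vectors of the unrolled loop.  */
static size_t
perm_source_index (size_t lane, size_t group_size, size_t comp)
{
  assert (group_size > 0 && comp < group_size);

  /* lane / group_size is the scalar iteration the lane belongs to.  */
  return (lane / group_size) * group_size + comp;
}
```

With group_size 3 and 4-element vectors, unrolling by 4 gives 12 lanes for
the r node with source indices 0,0,0,3 | 3,3,6,6 | 6,9,9,9. The second and
third output vectors each need elements from two different input vectors,
which is why a two-input permute is required.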

Thanks,
Ira

(See attached file: slp-perm-updated.txt)

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: slp-perm-updated.txt
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20080821/3db9f44b/attachment.txt>