This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: targetm.vectorize.builtin_vec_perm
...
>
> >
> > I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of
> > the form:
> >
> > VEC_PERMUTE_EXPR (vlow, vhigh, vperm)
> >
> > which would be exactly equal to
> >
> > (vec_select
> > (vec_concat vlow vhigh)
> > vperm)
> >
> > at the rtl level. I.e. vperm is an integral vector of the same number
> > of elements as vlow.
> >
> > Truly variable permutation is something that's only supported by ppc
and
> > spu.
>
> Also Altivec and SPU support byte permutation (and not only element
> permutation), however, the vectorizer does not make use of this at
present.
>
Yes. I was trying to think if it would be useful to express
byte-permutations instead of element-permutations, but the only two useful
cases that came to mind are things we have covered by other, probably more
appropriate, idioms.
[One is realignment (for which we use the builtin_mask_for_load +
REALIGN_LOAD). The other is the VEC_PACK_TRUNC idiom (where the number of
elements in 'vperm' would be twice the number of elements as 'vlow'), but
other VEC_PACK variants are a little more than just a special case of
permute.]
So (unless we want VEC_PERMUTE to cover these cases, which I think we
don't), an element-wise permutations should suffice, so sounds like a good
suggestion to me.
> > Intel AVX has a limited variable permutation -- 64-bit or 32-bit
> > elements can be rearranged but only within a 128-bit subvector.
> > So if you're working with 128-bit vectors, it's fully variable, but if
> > you're working with 256-bit vectors, it's like doing 2 128-bit permute
> > operations in parallel. Intel before AVX has no variable permute.
> >
> > HOWEVER! Most of the useful permutations that I can think of for the
> > optimizers to generate are actually constant. And these can be
> > implemented everywhere (with varying degrees of efficiency).
> >
That's true for the moment, but there are cases where a variable permute
would be useful for vectorization. E.g. where vectors are used as a lookup
table. One example I know of is for finding delimiters (e.g. for XML
processing) - a lookup table of 256 bits holds one bit per ASCII character
to indicates if a character is a delimiter or not, and the scalar code
looks something like this:
table[256]={1,0,0,....};
for (i...)
if (table[data[i]] == 1)
{found delimiter}
...and this is vectorized with 2 vector registers that hold the lookup
table and a shift on the input data vector to create the permutation mask
to access the table. I think there should be other examples for lookup
tables like that used for vectorization. I also saw variable permutes used
for sorting (
http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf).
Indeed there are some serious challenges to overcome in order to do all
that automatically in the compiler... but some pattern-matching based
vectorization approach could conceptually do this.
Also, if one day someone was to introduce platform-independent vector
intrinsics, then such a generic permute would allow programmers to take
advantage of it, even for the cases that would be otherwise too complicated
for the compiler to auto-vectorize.
So I think it would be nice to allow the more general form, but since it
will probably take a while before we actually make use of it, it's probably
not critical for the short term...
> > Anyway, I'm thinking that it might be better to add such a general
> > operation instead of continuing to add things like
> >
> > VEC_EXTRACT_EVEN_EXPR,
> > VEC_EXTRACT_ODD_EXPR,
> > VEC_INTERLEAVE_HIGH_EXPR,
> > VEC_INTERLEAVE_LOW_EXPR,
> >
> > and other obvious patterns like broadcast, duplicate even to odd,
> > duplicate odd to even, etc.
>
agreed
> If the back end will be able to identify specific masks, e.g., {0,2,4,6}
as
> extract even operation, then we can certainly remove those codes.
>
agreed
dorit
> >
> > I can imagine having some sort of target hook that computed a cost
> > metric for a given constant permutation pattern. For instance, I'd
> > imagine that the interleave patterns are half as expensive as a full
> > permute for altivec, due to not having to load a mask. This hook would
> > be fairly complicated for x86, given all of the permuting insns that
> > were incrementally added in various ISA revisions, but such is life.
> >
> > In any case, would a VEC_PERMUTE_EXPR, as described above, work for the
> > uses of builtin_vec_perm within the vectorizer at present?
>
> Yes.
>
> Ira
>
> >
> >
> > r~
>