This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: targetm.vectorize.builtin_vec_perm

From: Dorit Nuzman <DORIT at il dot ibm dot com>
To: Ira Rosen <IRAR at il dot ibm dot com>
Cc: gcc at gcc dot gnu dot org, Richard Henderson <rth at redhat dot com>
Date: Tue, 17 Nov 2009 14:53:37 +0200
Subject: Re: targetm.vectorize.builtin_vec_perm
References: <4B01FEDE.8000406@redhat.com> <OF0FC5895C.056A1233-ONC2257671.0028281D-C2257671.00312D6F@il.ibm.com>

...
>
> >
> > I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of
> > the form:
> >
> >    VEC_PERMUTE_EXPR (vlow, vhigh, vperm)
> >
> > which would be exactly equal to
> >
> >    (vec_select
> >      (vec_concat vlow vhigh)
> >      vperm)
> >
> > at the rtl level.  I.e. vperm is an integral vector of the same number
> > of elements as vlow.
> >
> > Truly variable permutation is something that's only supported by ppc
> > and spu.
>
> Also Altivec and SPU support byte permutation (and not only element
> permutation), however, the vectorizer does not make use of this at
> present.
>

Yes. I was trying to think if it would be useful to express
byte-permutations instead of element-permutations, but the only two useful
cases that came to mind are things we have covered by other, probably more
appropriate, idioms.

[One is realignment (for which we use builtin_mask_for_load +
REALIGN_LOAD). The other is the VEC_PACK_TRUNC idiom (where 'vperm' would
have twice as many elements as 'vlow'), but the other VEC_PACK variants are
a little more than just a special case of permute.]

So (unless we want VEC_PERMUTE to cover these cases, which I think we
don't), element-wise permutation should suffice, so this sounds like a good
suggestion to me.
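
To make the element-wise semantics concrete, here is a minimal scalar sketch
of what VEC_PERMUTE_EXPR (vlow, vhigh, vperm) would compute, following the
vec_select/vec_concat description quoted above. The function name, the
32-bit element type and the treatment of out-of-range indices are my
assumptions for illustration only; the proposal does not specify them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference model (hypothetical helper, not a GCC API):
   out[i] = concat(vlow, vhigh)[vperm[i]].
   All four arrays hold n elements.  */
void vec_permute_ref(const int32_t *vlow, const int32_t *vhigh,
                     const uint32_t *vperm, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t idx = vperm[i];
        /* Assumption: indices must lie in [0, 2n); the proposal does not
           say what happens otherwise.  */
        assert(idx < 2 * n);
        out[i] = (idx < n) ? vlow[idx] : vhigh[idx - n];
    }
}
```

With vlow = {10,11,12,13}, vhigh = {20,21,22,23} and vperm = {0,2,4,6},
this model yields {10,12,20,22}, i.e. the extract-even pattern.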

> > Intel AVX has a limited variable permutation -- 64-bit or 32-bit
> > elements can be rearranged but only within a 128-bit subvector.
> > So if you're working with 128-bit vectors, it's fully variable, but if
> > you're working with 256-bit vectors, it's like doing 2 128-bit permute
> > operations in parallel.  Intel before AVX has no variable permute.
> >
> > HOWEVER!  Most of the useful permutations that I can think of for the
> > optimizers to generate are actually constant.  And these can be
> > implemented everywhere (with varying degrees of efficiency).
> >

That's true for the moment, but there are cases where a variable permute
would be useful for vectorization, e.g. where vectors are used as a lookup
table. One example I know of is finding delimiters (e.g. for XML
processing) - a 256-bit lookup table holds one bit per ASCII character
to indicate whether that character is a delimiter, and the scalar code
looks something like this:

   table[256]={1,0,0,....};
   for (i...)
      if (table[data[i]] == 1)
        {found delimiter}

...and this is vectorized using 2 vector registers that hold the lookup
table, with a shift on the input data vector creating the permutation mask
that indexes the table. I think there should be other examples of lookup
tables like that being used for vectorization. I also saw variable permutes
used for sorting (
http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf).
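
A rough scalar emulation of one such vector step, just to show where the
variable permute comes in. This is only a sketch: the bit layout of the
table (bit c & 7 of byte c >> 3), the 16-byte register width and all
function names are assumptions of mine, not taken from any existing
implementation.

```c
#include <assert.h>
#include <stdint.h>

enum { VEC_BYTES = 16 };   /* assumed 128-bit vector registers */

/* Mark byte value c as a delimiter in a 256-bit table stored as 32 bytes. */
void table_set(uint8_t table[32], uint8_t c)
{
    table[c >> 3] |= (uint8_t)(1u << (c & 7));
}

/* Emulate one vector step over 16 input bytes: flags[i] becomes 1 when
   data[i] is a delimiter, 0 otherwise.  */
void find_delims16(const uint8_t table[32], const uint8_t data[VEC_BYTES],
                   uint8_t flags[VEC_BYTES])
{
    const uint8_t *tlow = table;               /* first table "register"  */
    const uint8_t *thigh = table + VEC_BYTES;  /* second table "register" */

    for (int i = 0; i < VEC_BYTES; i++) {
        /* The shift on the input data produces the permute index.  */
        uint8_t sel = data[i] >> 3;
        assert(sel < 2 * VEC_BYTES);
        /* Variable byte permute over the two table registers.  */
        uint8_t byte = (sel < VEC_BYTES) ? tlow[sel] : thigh[sel - VEC_BYTES];
        /* Pick the bit for this character out of the selected byte.  */
        flags[i] = (uint8_t)((byte >> (data[i] & 7)) & 1u);
    }
}
```

On real hardware the inner loop would be a single permute over the two
table registers plus a few shifts and masks, rather than 16 scalar lookups.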

Indeed there are some serious challenges to overcome in order to do all
that automatically in the compiler... but some pattern-matching based
vectorization approach could conceptually do this.

Also, if one day someone were to introduce platform-independent vector
intrinsics, then such a generic permute would allow programmers to take
advantage of it, even in cases that would otherwise be too complicated
for the compiler to auto-vectorize.

So I think it would be nice to allow the more general form, but since it
will probably take a while before we actually make use of it, it's probably
not critical for the short term...

> > Anyway, I'm thinking that it might be better to add such a general
> > operation instead of continuing to add things like
> >
> >    VEC_EXTRACT_EVEN_EXPR,
> >    VEC_EXTRACT_ODD_EXPR,
> >    VEC_INTERLEAVE_HIGH_EXPR,
> >    VEC_INTERLEAVE_LOW_EXPR,
> >
> > and other obvious patterns like broadcast, duplicate even to odd,
> > duplicate odd to even, etc.
>

agreed

> If the back end will be able to identify specific masks, e.g., {0,2,4,6}
> as extract even operation, then we can certainly remove those codes.
>

agreed
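
For what it's worth, recognizing such masks looks cheap. A minimal sketch,
assuming the constant mask is available as a plain array of element
indices into concat(vlow, vhigh); the function names are hypothetical and
this is not how any back end currently does it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when mask is {0, 2, 4, ...}: selects the even elements of
   concat(vlow, vhigh).  */
bool is_extract_even_mask(const unsigned *mask, size_t n)
{
    assert(mask != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        if (mask[i] != 2 * i)
            return false;
    return true;
}

/* True when mask is {1, 3, 5, ...}: selects the odd elements.  */
bool is_extract_odd_mask(const unsigned *mask, size_t n)
{
    assert(mask != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        if (mask[i] != 2 * i + 1)
            return false;
    return true;
}
```

The interleave and broadcast patterns could presumably be matched the same
way, with one small predicate per pattern.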

dorit

> >
> > I can imagine having some sort of target hook that computed a cost
> > metric for a given constant permutation pattern.  For instance, I'd
> > imagine that the interleave patterns are half as expensive as a full
> > permute for altivec, due to not having to load a mask.  This hook would
> > be fairly complicated for x86, given all of the permuting insns that
> > were incrementally added in various ISA revisions, but such is life.
> >
> > In any case, would a VEC_PERMUTE_EXPR, as described above, work for the
> > uses of builtin_vec_perm within the vectorizer at present?
>
> Yes.
>
> Ira
>
> >
> >
> > r~
>

References:
- targetm.vectorize.builtin_vec_perm
  - From: Richard Henderson
- Re: targetm.vectorize.builtin_vec_perm
  - From: Ira Rosen
