[patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP

Fri Aug 8 15:11:00 GMT 2008

> Hi,

Hi Ira,

> The current loop-aware SLP scheme starts from a group of adjacent stores
> and follows use-def chains until it reaches a group of loads. The loads
> must be adjacent and their order must match the order of the stores, i.e.,
> no permutations are currently allowed.

> This patch adds support for a specific type of load permutation, along
> with general support for load permutations in SLP. It aims to vectorize RGB

I have a problem with the fact that this specific permutation is so
hard-coded into the analysis. It's ok to support only one permutation as a
start, but the analysis itself should be general. Hopefully this could be
rewritten to identify more general patterns during the analysis, represent
the identified permutation somehow (e.g. [3,2,1,0]), and then decide if we
can proceed to vectorize it or not.
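
To illustrate what I mean, here is a rough sketch in C. It is not code from
the patch: the helper names are hypothetical, and the representation (one
group-relative element index per load, in SLP tree order) is only one
possible choice. The analysis would first record the permutation, and a
separate predicate would then decide whether we know how to vectorize it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* perm[i] is the index, within the group of adjacent loads, of the
   element read by the i-th load (loads listed in SLP tree order).
   All names here are hypothetical and used only for illustration.  */

/* No permutation: the i-th load reads the i-th element of the group.  */
static bool
perm_is_identity (const unsigned *perm, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (perm[i] != i)
      return false;
  return true;
}

/* The RGB-like pattern: the loads split into GROUP_SIZE consecutive
   nodes of equal size, and every load in node K reads element K,
   e.g. [0,0,0,1,1,1,2,2,2] for GROUP_SIZE == 3.  */
static bool
perm_is_same_element_per_node (const unsigned *perm, size_t n,
                               size_t group_size)
{
  assert (group_size > 0);
  if (n % group_size != 0)
    return false;
  size_t node_size = n / group_size;
  for (size_t i = 0; i < n; i++)
    if (perm[i] != i / node_size)
      return false;
  return true;
}

/* Decide whether a recorded permutation can be vectorized.  More
   patterns (e.g. the reversal [3,2,1,0]) could be added here later
   without touching the code that records the permutation.  */
static bool
perm_is_supported (const unsigned *perm, size_t n, size_t group_size)
{
  return perm_is_identity (perm, n)
         || perm_is_same_element_per_node (perm, n, group_size);
}
```

With this split, [3,2,1,0] would still be identified and represented by the
analysis, and simply rejected by perm_is_supported until someone adds
support for it.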

> to YUV conversion, that can be viewed as {y, u, v} = M * {r, g, b}, where M
> is a matrix of constant coefficients, and the calculation is performed in a
> single-nested loop:
> for i
>   yi = M00 * ri + M01 * gi + M02 * bi
>   ui = M10 * ri + M11 * gi + M12 * bi
>   vi = M20 * ri + M21 * gi + M22 * bi
> The required load permutation transforms the rgb stream into {r,r,r},
> {g,g,g} and {b,b,b} vectors (ignoring vector size for simplicity).
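
For reference, the scalar loop described above could be written out in C
roughly as follows. This is only a sketch under assumptions that the
original message does not state: the int element type, the interleaved
r0 g0 b0 r1 g1 b1 ... layout of both streams, and the name rgb_to_yuv are
chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Convert N interleaved rgb triples in IN to interleaved yuv triples in
   OUT using the constant 3x3 matrix M.  Hypothetical helper; the element
   type and the layout are assumptions.  */
static void
rgb_to_yuv (const int m[3][3], const int *in, int *out, size_t n)
{
  assert (in != out);
  for (size_t i = 0; i < n; i++)
    {
      /* Three adjacent loads; each one feeds all three stores below.  */
      int r = in[3 * i + 0];
      int g = in[3 * i + 1];
      int b = in[3 * i + 2];

      /* Three adjacent stores: the group the SLP analysis starts from.  */
      out[3 * i + 0] = m[0][0] * r + m[0][1] * g + m[0][2] * b;
      out[3 * i + 1] = m[1][0] * r + m[1][1] * g + m[1][2] * b;
      out[3 * i + 2] = m[2][0] * r + m[2][1] * g + m[2][2] * b;
    }
}
```

Following use-def chains from the three stores, the first multiplication
operand is r in all three statements, the second is g and the third is b,
which is where the {r,r,r}, {g,g,g} and {b,b,b} vectors come from.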

> The SLP analysis detects such cases: all the loads in the same SLP node
> must access the same memory location, and all the SLP nodes that contain
> loads must form a group of adjacent memory accesses. The transformation
> phase generates vector permutations of the input vectors with
> compiler-generated masks, depending on the data type, vectorization factor
> and size of SLP nodes.

I also have a problem with the transformation: it assumes a very specific
form of permute at the gimple level - a permute that takes two vectors and
a byte mask as input. I don't think this representation is general enough
(for example, I don't think the SSE shuffles take a byte mask). We need to
think of a more general way to represent a permute at this level, and maybe
have a target-specific builtin expand it using a byte mask when
appropriate.
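
As a sketch of that split (hypothetical names, not an existing interface):
the permute could be described at the gimple level by element indices into
the concatenation of the two input vectors, and only a target that wants a
byte mask would expand those indices into one. The expansion below assumes
that bytes are numbered in memory order across the two inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Expand an element-level selector into a byte mask.  SEL[i] selects an
   element from the concatenation of two input vectors of NELTS elements
   each, so it must be below 2 * NELTS.  ELT_SIZE is the element size in
   bytes; MASK must have room for NELTS * ELT_SIZE bytes.  Hypothetical
   helper, for illustration only.  */
static void
expand_to_byte_mask (const unsigned *sel, size_t nelts, size_t elt_size,
                     unsigned char *mask)
{
  for (size_t i = 0; i < nelts; i++)
    {
      assert (sel[i] < 2 * nelts);
      /* Byte B of output element I comes from byte B of the selected
         input element.  */
      for (size_t b = 0; b < elt_size; b++)
        mask[i * elt_size + b] = (unsigned char) (sel[i] * elt_size + b);
    }
}
```

For four 4-byte elements, the selector {0, 0, 0, 1} (three copies of the
first element followed by the second) expands to the bytes 0 1 2 3, 0 1 2 3,
0 1 2 3, 4 5 6 7. A target without byte-granular permutes could consume the
element-level selector directly.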

Actually, I think the particular testcase you are targeting could be
vectorized by preparing an appropriate vector of constants instead of
working so hard on permuting the loads. Maybe we can try something like
that for now, and potentially defer the decision on a representation of
permute to a separate patch (and testcase)?

thanks,
dorit

> Bootstrapped with vectorization enabled on ppc-linux and tested on Cell SPU
> and ppc-linux.
> O.K. for mainline?

> Thanks,
> Ira

> ChangeLog:

> * target.h (struct vectorize): Add new target builtin.
> * tree-vectorizer.h (enum slp_load_perm_type): New.
> (struct _slp_tree): Add new field loads_perm_type.
> (struct _slp_instance): Add new field same_perm_nodes.
> (SLP_INSTANCE_SAME_PERM_NODES): New.
> (SLP_TREE_LOADS_PERM_TYPE, TARG_VEC_PERMUTE_COST): New.
> (vectorizable_load): Add argument.
> (vect_transform_slp_perm_load): New.
> * tree-vect-analyze.c (vect_analyze_operations): Add an argument to
> vectorizable_load.
> (vect_build_slp_tree): Add new argument. Allow load permutations for the
> case when all the loads in the same SLP node access the same memory
> location.
> (vect_analyze_slp_instance): In case of same location loads check that
> the loads from different nodes form an interleaving chain. Sort the nodes
> according to the chain.
> * target-def.h (TARGET_VECTORIZE_BUILTIN_VEC_PERM): New.
> * tree-vect-transform.c (vect_transform_stmt): Add new argument.
> (vectorizable_store): Allow number of created vectors to be greater than
> the size of an interleaving group. Don't go along the interleaving chain
> for SLP.
> (vect_create_mask_and_perm): New function.
> (vect_get_mask_element, vect_transform_slp_perm_load): Likewise.
> (vectorizable_load): Allocate DR_CHAIN according to the number of
> generated vectors. Don't keep the created vector statements in the node
> if permutation is required. Call vect_transform_slp_perm_load to generate
> the permutation.
> (vect_transform_stmt): Add new argument. Call vectorizable_load with
> additional argument. Don't wait for other stores in case of SLP.
> (vect_schedule_slp_instance): Add new argument. Calculate the number of
> vector statements. In case of loads from the same location, allocate
> vectorized statements structure for all the related SLP nodes. Call
> vect_transform_stmt with additional argument.
> (vect_schedule_slp): Remove one argument. Move number of vector
> statements calculation to vect_schedule_slp_instance.
> (vect_transform_loop): Call vect_transform_stmt and vect_schedule_slp
> with correct arguments.
> * config/spu/spu.c (spu_builtin_vec_perm): New.
> (TARGET_VECTORIZE_BUILTIN_VEC_PERM): Redefine.
> * config/spu/spu.h (TARG_VEC_PERMUTE_COST): Define.
> * config/rs6000/rs6000.c (rs6000_builtin_vec_perm): New.
> (TARGET_VECTORIZE_BUILTIN_VEC_PERM): Redefine.
>
> testsuite/ChangeLog:

> * lib/target-supports.exp (check_effective_target_vect_perm): New.
> * gcc.dg/vect/slp-perm-1.c: New testcase.
> * gcc.dg/vect/slp-perm-2.c: Likewise.
> * gcc.dg/vect/slp-perm-3.c: Likewise.
> * gcc.dg/vect/slp-perm-4.c: Likewise.
> * gcc.dg/vect/slp-perm-5.c: Likewise.
> * gcc.dg/vect/slp-perm-6.c: Likewise.
> * gcc.dg/vect/slp-perm-7.c: Likewise.
> * gcc.dg/vect/slp-perm-8.c: Likewise.
> * gcc.dg/vect/slp-perm-9.c: Likewise.
>
> (See attached file: slp-perm.txt)(See attached file: tests.txt)
