This is the mail archive of the mailing list for the GCC project.


Re: [PATCH, PR52252] Alternative way of vectorization for load groups of size 2 and 3.

On 06/17/2014 05:33 AM, Evgeny Stupachenko wrote:
> +   1st vec:   0  1  2  3  4  5  6  7
> +   2nd vec:   8  9 10 11 12 13 14 15
> +   3rd vec:  16 17 18 19 20 21 22 23
> +
> +   The output sequence should be:
> +
> +   1st vec:  0 3 6  9 12 15 18 21
> +   2nd vec:  1 4 7 10 13 16 19 22
> +   3rd vec:  2 5 8 11 14 17 20 23
> +
> +   We use 3 shuffle instructions and 3 * 3 - 1 shifts to create such output.

Why not 3 * 2 blends followed by 3 shuffles?  When the group length is prime,
as here, we know that no blend will ever overlap elements.  So:

1st step

  A1 = blend V1 V2 =  0  9  2  3 12  5  6 15
  A2 = blend V1 V2 =  8  1 10 11  4 13 14  7
  A3 = blend V1 V3 = 16 17  2 19 20  5 22 23

2nd step

  B1 = blend A1 V3 =  0  9 18  3 12 21  6 15
  B2 = blend A2 V3 = 16  1 10 19  4 13 22  7
  B3 = blend A3 V2 =  8 17  2 11 20  5 14 23

3rd step

  C1 = perm B1     =  0  3  6  9 12 15 18 21
  C2 = perm B2     =  1  4  7 10 13 16 19 22
  C3 = perm B3     =  2  5  8 11 14 17 20 23
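
The three steps above can be checked with a small simulation. This is a
minimal Python sketch, not GCC code: `blend`, `perm`, `lanes` and
`deinterleave3` are hypothetical helpers that model a per-lane select and a
single-source permute. The masks and permute indices are derived under the
assumption that lane i of the m-th input vector holds element 8*m + i of the
interleaved stream, which is what the rows above show.

```python
N = 8  # lanes per vector
G = 3  # load group size


def lanes(m, k):
    """Mask of the lanes in input vector m (0-based) whose element index
    N*m + i belongs to output k, i.e. (N*m + i) % G == k."""
    return [(N * m + i) % G == k for i in range(N)]


def blend(a, b, mask):
    """Per-lane select: take lane i from b where mask[i] is set, else from a."""
    return [y if take else x for x, y, take in zip(a, b, mask)]


def perm(v, idx):
    """Single-source permute: output lane j reads input lane idx[j]."""
    return [v[i] for i in idx]


def deinterleave3(v1, v2, v3):
    """Return the (A, B, C) stages of the blend/blend/perm scheme."""
    # 1st step: each A row merges the wanted lanes of two inputs.
    a = [
        blend(v1, v2, lanes(1, 0)),
        blend(v2, v1, lanes(0, 1)),
        blend(v3, v1, lanes(0, 2)),
    ]
    # 2nd step: pull the remaining wanted lanes from the third input.
    b = [
        blend(a[0], v3, lanes(2, 0)),
        blend(a[1], v3, lanes(2, 1)),
        blend(a[2], v2, lanes(1, 2)),
    ]
    # 3rd step: element G*j + k sits in lane (G*j + k) % N of B[k].
    c = [perm(b[k], [(G * j + k) % N for j in range(N)]) for k in range(G)]
    return a, b, c


if __name__ == "__main__":
    v1, v2, v3 = (list(range(N * m, N * (m + 1))) for m in range(3))
    for stage in deinterleave3(v1, v2, v3):
        for row in stage:
            print(" ".join(f"{e:2d}" for e in row))
        print()
```

Because the masks depend only on lane positions, the same calls work on
arbitrary data, not just on the index-labelled vectors used in the example.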

The final permute here isn't trivial (it crosses lanes on AVX2, for one thing),
but the initial permute you use is similar.
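
To make the lane-crossing remark concrete, the sketch below lists, for each
final permute, the output positions that read from the other half of the
vector. It is illustrative Python only; `perm_indices` and
`crossing_positions` are hypothetical names, and it assumes 8 x 32-bit
elements in a 256-bit register, so each 128-bit half holds 4 elements.

```python
N = 8          # elements per vector
G = 3          # load group size
HALF = N // 2  # assumed: 4 elements per 128-bit half


def perm_indices(k):
    """Source lane for each output lane of the k-th final permute."""
    return [(G * j + k) % N for j in range(N)]


def crossing_positions(idx):
    """Output lanes whose source lane lies in the other 128-bit half."""
    return [j for j, src in enumerate(idx) if j // HALF != src // HALF]


if __name__ == "__main__":
    for k in range(G):
        idx = perm_indices(k)
        print(k, idx, crossing_positions(idx))
```

Every one of the three permutes has at least two crossing positions under
these assumptions, so none of them can be done with a purely in-lane shuffle.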

