This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH, PR52252] Alternative way of vectorization for load groups of size 2 and 3.


While developing the patch I also tried the following scheme:

The first step is 3 shuffles (as in the original patch):

A1 = (0 3 6) (1 4 7) (2 5)
A2 = (8 11 14) (9 12 15) (10 13)
A3 = (16 19 22) (17 20 23) (18 21)

R1 = blend [blend [A1, A2], A3] = (0 3 6) (9 12 15) (18 21)
  B2 = blend [A1, A2] = (0 3 6) (1 4 7) (10 13)
R2 = shift 3, B2 ... (1 4 7) (10 13) + A3 (16 19 22) ... = (1 4 7) (10 13) (16 19 22)
  B3 = blend [A2, A3] = (8 11 14) (17 20 23) (18 21)
R3 = shift 6, A1 ... (2 5) + B3 (8 11 14) (17 20 23) ... = (2 5) (8 11 14) (17 20 23)

But it was slower than the scheme in the patch, because blend costs more
than shift (palignr).
For AVX2 this scheme is not suitable, as it has many more dependencies
than the current code (in vect_permute_load_chain).
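
The blend/shift scheme above can be checked with a small lane-level model.
This is only a sketch in Python, not intrinsics or GCC code: the helpers
shuffle, blend and shift_concat are illustrative names, and the blend lane
sets are inferred from the A/B/R values listed above. It models which
element lands in which lane and says nothing about instruction cost.

```python
# Lane-level model of the blend/shift scheme (8 lanes per vector).
# shuffle, blend and shift_concat are hypothetical helpers, not GCC functions.

def shuffle(v, sel):
    """Single-input permute: result[i] = v[sel[i]]."""
    return [v[i] for i in sel]

def blend(a, b, take_b):
    """Per-lane select: lane i comes from b if i is in take_b, else from a."""
    return [b[i] if i in take_b else a[i] for i in range(len(a))]

def shift_concat(lo, hi, n):
    """palignr-like shift: drop the first n lanes of lo, refill from hi."""
    return (lo + hi)[n:n + len(lo)]

V1, V2, V3 = list(range(0, 8)), list(range(8, 16)), list(range(16, 24))

# First step: the same selector on each input groups lanes by index mod 3.
SEL = [0, 3, 6, 1, 4, 7, 2, 5]
A1, A2, A3 = shuffle(V1, SEL), shuffle(V2, SEL), shuffle(V3, SEL)

# Second step: blends and shifts, following the R1/R2/R3 lines above.
R1 = blend(blend(A1, A2, {3, 4, 5}), A3, {6, 7})
B2 = blend(A1, A2, {6, 7})
R2 = shift_concat(B2, A3, 3)
B3 = blend(A2, A3, {3, 4, 5, 6, 7})
R3 = shift_concat(A1, B3, 6)

for name, vec in (("R1", R1), ("R2", R2), ("R3", R3)):
    print(name, vec)
```

In this model the sequence uses 3 shuffles, 4 blends and 2 shifts.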

Evgeny

On Tue, Jun 17, 2014 at 7:41 PM, Richard Henderson <rth@redhat.com> wrote:
> On 06/17/2014 05:33 AM, Evgeny Stupachenko wrote:
>> +   1st vec:   0  1  2  3  4  5  6  7
>> +   2nd vec:   8  9 10 11 12 13 14 15
>> +   3rd vec:  16 17 18 19 20 21 22 23
>> +
>> +   The output sequence should be:
>> +
>> +   1st vec:  0 3 6  9 12 15 18 21
>> +   2nd vec:  1 4 7 10 13 16 19 22
>> +   3rd vec:  2 5 8 11 14 17 20 23
>> +
>> +   We use 3 shuffle instructions and 3 * 3 - 1 shifts to create such output.
>
> Why not 3 * 2 blend followed by 3 shuffle?  When length is prime, as here, we
> know that no blend will ever overlap elements.  So:
>
> 1st step
>
>   A1 = blend V1 V2 =  0  9  2  3 12  5  6 15
>   A2 = blend V1 V2 =  8  1 10 11  4 13 14  7
>   A3 = blend V1 V3 = 16 17  2 19 20  5 22 23
>
> 2nd step
>
>   B1 = blend A1 V3 =  0  9 18  3 12 21  6 15
>   B2 = blend A2 V3 = 16  1 10 19  4 13 22  7
>   B3 = blend A3 V2 =  8 17  2 11 20  5 14 23
>
> 3rd step
>
>   C1 = perm B1     =  0  3  6  9 12 15 18 21
>   C2 = perm B2     =  1  4  7 10 13 16 19 22
>   C3 = perm B3     =  2  5  8 11 14 17 20 23
>
> The final permute here isn't trivial, crossing lanes for avx2 and all, but the
> initial permute you use is similar.
>
>
> r~
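
The quoted blend-then-permute scheme can be modelled the same way. Again a
sketch in Python under stated assumptions: blend and permute are
illustrative helpers, the blend lane sets are read off the A/B values in
the quoted message, and the final selectors SEL1..SEL3 are derived here
from the B vectors (they do not appear in the message). Lane crossing on
AVX2 is not modelled.

```python
# Lane-level model of the quoted "3 * 2 blends, then 3 permutes" scheme.
# blend and permute are hypothetical helpers, not GCC functions.

def blend(a, b, take_b):
    """Per-lane select: lane i comes from b if i is in take_b, else from a."""
    return [b[i] if i in take_b else a[i] for i in range(len(a))]

def permute(v, sel):
    """Single-input permute: result[i] = v[sel[i]]."""
    return [v[i] for i in sel]

V1, V2, V3 = list(range(0, 8)), list(range(8, 16)), list(range(16, 24))

# 1st step: blends of two input vectors.
A1 = blend(V1, V2, {1, 4, 7})   #  0  9  2  3 12  5  6 15
A2 = blend(V2, V1, {1, 4, 7})   #  8  1 10 11  4 13 14  7
A3 = blend(V3, V1, {2, 5})      # 16 17  2 19 20  5 22 23

# 2nd step: blend in the remaining input vector.
B1 = blend(A1, V3, {2, 5})      #  0  9 18  3 12 21  6 15
B2 = blend(A2, V3, {0, 3, 6})   # 16  1 10 19  4 13 22  7
B3 = blend(A3, V2, {0, 3, 6})   #  8 17  2 11 20  5 14 23

# 3rd step: one permute per vector puts the elements in ascending order.
# These selectors are derived for this sketch, not taken from the message.
SEL1 = [0, 3, 6, 1, 4, 7, 2, 5]
SEL2 = [1, 4, 7, 2, 5, 0, 3, 6]
SEL3 = [2, 5, 0, 3, 6, 1, 4, 7]
C1, C2, C3 = permute(B1, SEL1), permute(B2, SEL2), permute(B3, SEL3)

for name, vec in (("C1", C1), ("C2", C2), ("C3", C3)):
    print(name, vec)
```

Each B vector already holds exactly the eight elements of one output
vector, so the last step needs only a single-input permute per vector.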

