This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH, PR52252] Alternative way of vectorization for load groups of size 2 and 3.
- From: Evgeny Stupachenko <evstupac at gmail dot com>
- To: Richard Henderson <rth at redhat dot com>
- Cc: Richard Biener <richard dot guenther at gmail dot com>, hubicka at ucw dot cz, Ramana Radhakrishnan <ramana dot radhakrishnan at arm dot com>, Richard Biener <rguenther at suse dot de>, Uros Bizjak <ubizjak at gmail dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, Jakub Jelinek <jakub at redhat dot com>
- Date: Tue, 17 Jun 2014 21:29:02 +0400
While developing the patch I tried the following scheme:
The first step is 3 shuffles (as initially):
A1 = (0 3 6) (1 4 7) (2 5)
A2 = (8 11 14) (9 12 15) (10 13)
A3 = (16 19 22) (17 20 23) (18 21)
R1 = blend [ blend [A1 A2], A3] = (0 3 6) (9 12 15) (18 21)
B2 = blend [A1, A2] = (0 3 6) (1 4 7) (10 13)
R2 = shift 3, B2 ... (1 4 7) (10 13) + A3 (16 19 22) ... = (1 4 7) (10 13) (16 19 22)
B3 = blend [ A2, A3] = (8 11 14) (17 20 23) (18 21)
R3 = shift 6, A1 ... (2 5) + B3 (8 11 14) (17 20 23) ... = (2 5) (8 11 14) (17 20 23)
But it was slower than the scheme in the patch, as blend costs more than
shift (palign).
For AVX2 the scheme is not OK, as it has many more dependencies than the
current one (in vect_permute_load_chain).
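
The data movement of the scheme above can be checked with a small model. This is only a sketch: vectors are plain Python lists of element indices, and shuffle, blend and shift are hypothetical helpers written for this illustration (shift is assumed to behave like a palign-style two-vector shift). It says nothing about instruction cost.

```python
# Illustrative model only: each vector is a list of 8 element indices.
# shuffle/blend/shift are hypothetical helpers, not GCC or intrinsic APIs.

def shuffle(v, sel):
    """Permute one vector: result[i] = v[sel[i]]."""
    return [v[i] for i in sel]

def blend(a, b, take_b):
    """Per-position select: b[i] where take_b[i] is set, else a[i]."""
    return [y if t else x for x, y, t in zip(a, b, take_b)]

def shift(lo, hi, n):
    """Two-vector shift: drop the first n elements of lo, append the first n of hi."""
    return (lo + hi)[n:n + len(lo)]

V1, V2, V3 = list(range(0, 8)), list(range(8, 16)), list(range(16, 24))

# First step: the same shuffle applied to each input vector.
SEL = [0, 3, 6, 1, 4, 7, 2, 5]
A1, A2, A3 = shuffle(V1, SEL), shuffle(V2, SEL), shuffle(V3, SEL)

# R1 needs two blends: middle group from A2, last group from A3.
R1 = blend(blend(A1, A2, [0, 0, 0, 1, 1, 1, 0, 0]), A3, [0, 0, 0, 0, 0, 0, 1, 1])

# R2: blend the last group of A2 into A1, then shift by 3 with A3.
B2 = blend(A1, A2, [0, 0, 0, 0, 0, 0, 1, 1])
R2 = shift(B2, A3, 3)

# R3: blend the tail of A3 into A2, then shift A1 by 6 with the result.
B3 = blend(A2, A3, [0, 0, 0, 1, 1, 1, 1, 1])
R3 = shift(A1, B3, 6)

print(R1)
print(R2)
print(R3)
```

In this model the scheme uses 3 shuffles, 4 blends and 2 shifts to produce the three output vectors.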
Evgeny
On Tue, Jun 17, 2014 at 7:41 PM, Richard Henderson <rth@redhat.com> wrote:
> On 06/17/2014 05:33 AM, Evgeny Stupachenko wrote:
>> + 1st vec: 0 1 2 3 4 5 6 7
>> + 2nd vec: 8 9 10 11 12 13 14 15
>> + 3rd vec: 16 17 18 19 20 21 22 23
>> +
>> + The output sequence should be:
>> +
>> + 1st vec: 0 3 6 9 12 15 18 21
>> + 2nd vec: 1 4 7 10 13 16 19 22
>> + 3rd vec: 2 5 8 11 14 17 20 23
>> +
>> + We use 3 shuffle instructions and 3 * 3 - 1 shifts to create such output.
>
> Why not 3 * 2 blends followed by 3 shuffles? When length is prime, as here, we
> know that no blend will ever overlap elements. So:
>
> 1st step
>
> A1 = blend V1 V2 = 0 9 2 3 12 5 6 15
> A2 = blend V1 V2 = 8 1 10 11 4 13 14 7
> A3 = blend V1 V3 = 16 17 2 19 20 5 22 23
>
> 2nd step
>
> B1 = blend A1 V3 = 0 9 18 3 12 21 6 15
> B2 = blend A2 V3 = 16 1 10 19 4 13 22 7
> B3 = blend A3 V2 = 8 17 2 11 20 5 14 23
>
> 3rd step
>
> C1 = perm B1 = 0 3 6 9 12 15 18 21
> C2 = perm B2 = 1 4 7 10 13 16 19 22
> C3 = perm B3 = 2 5 8 11 14 17 20 23
>
> The final permute here isn't trivial, crossing lanes for avx2 and all, but the
> initial permute you use is similar.
>
>
> r~
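
The quoted blend-then-permute scheme can be checked the same way. The sketch below models vectors as Python lists of element indices; blend and perm are hypothetical helpers written for this illustration, and the blend masks are inferred from the element values listed in the quoted steps.

```python
# Illustrative model only: each vector is a list of 8 element indices.
# blend/perm are hypothetical helpers, not GCC or intrinsic APIs.

def blend(a, b, take_b):
    """Per-position select: b[i] where take_b[i] is set, else a[i]."""
    return [y if t else x for x, y, t in zip(a, b, take_b)]

def perm(v, sel):
    """Single-vector permute: result[i] = v[sel[i]]."""
    return [v[i] for i in sel]

V1, V2, V3 = list(range(0, 8)), list(range(8, 16)), list(range(16, 24))

# Masks selecting positions 0,3,6 / 1,4,7 / 2,5 of an 8-element vector.
M0 = [i % 3 == 0 for i in range(8)]
M1 = [i % 3 == 1 for i in range(8)]
M2 = [i % 3 == 2 for i in range(8)]

# 1st step: three blends of pairs of input vectors.
A1 = blend(V1, V2, M1)   # 0 9 2 3 12 5 6 15
A2 = blend(V2, V1, M1)   # 8 1 10 11 4 13 14 7
A3 = blend(V3, V1, M2)   # 16 17 2 19 20 5 22 23

# 2nd step: blend in the remaining input vector.
B1 = blend(A1, V3, M2)   # 0 9 18 3 12 21 6 15
B2 = blend(A2, V3, M0)   # 16 1 10 19 4 13 22 7
B3 = blend(A3, V2, M0)   # 8 17 2 11 20 5 14 23

# 3rd step: blends never move an element, so element e still sits at
# position e % 8; output k gathers elements k, k+3, k+6, ... in order.
C1, C2, C3 = (perm(b, [(k + 3 * j) % 8 for j in range(8)])
              for k, b in enumerate((B1, B2, B3)))

print(C1)
print(C2)
print(C3)
```

In this model no blend overwrites a needed element because the eight values k + 3j fall on distinct positions modulo 8, so each output vector can be assembled in place before the final permute.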