This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH, PR52252] Alternative way of vectorization for load groups of size 2 and 3.
- From: Richard Henderson <rth at redhat dot com>
- To: Evgeny Stupachenko <evstupac at gmail dot com>, Richard Biener <richard dot guenther at gmail dot com>, hubicka at ucw dot cz
- Cc: Ramana Radhakrishnan <ramana dot radhakrishnan at arm dot com>, Richard Biener <rguenther at suse dot de>, Uros Bizjak <ubizjak at gmail dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, Jakub Jelinek <jakub at redhat dot com>
- Date: Tue, 17 Jun 2014 08:41:14 -0700
On 06/17/2014 05:33 AM, Evgeny Stupachenko wrote:
> + 1st vec: 0 1 2 3 4 5 6 7
> + 2nd vec: 8 9 10 11 12 13 14 15
> + 3rd vec: 16 17 18 19 20 21 22 23
> +
> + The output sequence should be:
> +
> + 1st vec: 0 3 6 9 12 15 18 21
> + 2nd vec: 1 4 7 10 13 16 19 22
> + 3rd vec: 2 5 8 11 14 17 20 23
> +
> + We use 3 shuffle instructions and 3 * 3 - 1 shifts to create such output.
Why not 3 * 2 blends followed by 3 shuffles? When the group length is prime,
as here (3), we know that no blend will ever overlap elements. So:
1st step
A1 = blend V1 V2 = 0 9 2 3 12 5 6 15
A2 = blend V1 V2 = 8 1 10 11 4 13 14 7
A3 = blend V1 V3 = 16 17 2 19 20 5 22 23
2nd step
B1 = blend A1 V3 = 0 9 18 3 12 21 6 15
B2 = blend A2 V3 = 16 1 10 19 4 13 22 7
B3 = blend A3 V2 = 8 17 2 11 20 5 14 23
3rd step
C1 = perm B1 = 0 3 6 9 12 15 18 21
C2 = perm B2 = 1 4 7 10 13 16 19 22
C3 = perm B3 = 2 5 8 11 14 17 20 23
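The three steps above can be checked with a small Python sketch. This is only a model of the data movement, not the vectorizer code from the patch: `blend`, `perm`, `lanes` and `sort_perm` are hypothetical helpers, and the masks are derived from the element values, which works here only because each value equals its index in the original interleaved stream. The operand order of some blends is swapped relative to the listing, which is equivalent to inverting the mask.

```python
def blend(a, b, take_b):
    """Lane-wise select: lane i comes from b if i is in take_b, else from a."""
    return [b[i] if i in take_b else a[i] for i in range(len(a))]

def perm(v, idx):
    """Single-input permute: output lane i reads input lane idx[i]."""
    return [v[i] for i in idx]

def lanes(v, k):
    """Lanes of v whose element belongs to output vector k (index % 3 == k)."""
    return {i for i, x in enumerate(v) if x % 3 == k}

def sort_perm(v):
    """Permutation indices that put the lanes of v in ascending order."""
    return sorted(range(len(v)), key=lambda i: v[i])

V1 = list(range(0, 8))
V2 = list(range(8, 16))
V3 = list(range(16, 24))

# 1st step: one blend per output vector
A1 = blend(V1, V2, lanes(V2, 0))
A2 = blend(V2, V1, lanes(V1, 1))
A3 = blend(V3, V1, lanes(V1, 2))

# 2nd step: blend in the remaining input vector
B1 = blend(A1, V3, lanes(V3, 0))
B2 = blend(A2, V3, lanes(V3, 1))
B3 = blend(A3, V2, lanes(V2, 2))

# 3rd step: one permute per output vector
C1 = perm(B1, sort_perm(B1))
C2 = perm(B2, sort_perm(B2))
C3 = perm(B3, sort_perm(B3))

for name, vec in (("C1", C1), ("C2", C2), ("C3", C3)):
    print(name, "=", *vec)
```

In a real implementation the blend masks and permute indices would be compile-time constants; computing them from the data is just a shortcut for this example.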
The final permute here isn't trivial, crossing lanes for AVX2 and all, but the
initial permute you use is similar.
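To make the lane-crossing remark concrete, here is a rough Python sketch that lists which destination lanes of the final permute read from the other half of the register. It assumes 8 elements of 32 bits in a 256-bit register, so each 128-bit half holds 4 elements; the element width is not stated in the thread, and `perm_indices` and `crossing` are hypothetical helpers.

```python
HALF = 4  # assumption: 8 x 32-bit elements, so each 128-bit half holds 4

# The blended vectors from the 2nd step, copied from the listing above.
B = [
    [0, 9, 18, 3, 12, 21, 6, 15],
    [16, 1, 10, 19, 4, 13, 22, 7],
    [8, 17, 2, 11, 20, 5, 14, 23],
]

def perm_indices(v):
    """Indices such that output lane i reads input lane idx[i] (ascending order)."""
    return sorted(range(len(v)), key=lambda i: v[i])

def crossing(idx, half=HALF):
    """(dst, src) pairs where the source lane sits in the other 128-bit half."""
    return [(dst, src) for dst, src in enumerate(idx) if dst // half != src // half]

for n, b in enumerate(B, start=1):
    idx = perm_indices(b)
    print(f"perm B{n}: indices {idx}, crossing {crossing(idx)}")
```

Under that assumption all three permutes move at least two elements across the 128-bit boundary, so none of them can be done with a purely in-lane shuffle.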
r~