This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [PATCH, PR52252] Alternative way of vectorization for load groups of size 2 and 3.

From: Richard Henderson <rth at redhat dot com>
To: Evgeny Stupachenko <evstupac at gmail dot com>, Richard Biener <richard dot guenther at gmail dot com>, hubicka at ucw dot cz
Cc: Ramana Radhakrishnan <ramana dot radhakrishnan at arm dot com>, Richard Biener <rguenther at suse dot de>, Uros Bizjak <ubizjak at gmail dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, Jakub Jelinek <jakub at redhat dot com>
Date: Tue, 17 Jun 2014 08:41:14 -0700
Subject: Re: [PATCH, PR52252] Alternative way of vectorization for load groups of size 2 and 3.

On 06/17/2014 05:33 AM, Evgeny Stupachenko wrote:
> +   1st vec:   0  1  2  3  4  5  6  7
> +   2nd vec:   8  9 10 11 12 13 14 15
> +   3rd vec:  16 17 18 19 20 21 22 23
> +
> +   The output sequence should be:
> +
> +   1st vec:  0 3 6  9 12 15 18 21
> +   2nd vec:  1 4 7 10 13 16 19 22
> +   3rd vec:  2 5 8 11 14 17 20 23
> +
> +   We use 3 shuffle instructions and 3 * 3 - 1 shifts to create such output.

Why not 3 * 2 blends followed by 3 shuffles?  When the group length is an odd
prime, as here, it is coprime to the power-of-two vector length, so we know
that no blend will ever overlap elements.  So:

1st step

  A1 = blend V1 V2 =  0  9  2  3 12  5  6 15
  A2 = blend V1 V2 =  8  1 10 11  4 13 14  7
  A3 = blend V1 V3 = 16 17  2 19 20  5 22 23

2nd step

  B1 = blend A1 V3 =  0  9 18  3 12 21  6 15
  B2 = blend A2 V3 = 16  1 10 19  4 13 22  7
  B3 = blend A3 V2 =  8 17  2 11 20  5 14 23

3rd step

  C1 = perm B1     =  0  3  6  9 12 15 18 21
  C2 = perm B2     =  1  4  7 10 13 16 19 22
  C3 = perm B3     =  2  5  8 11 14 17 20 23
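
The three steps above can be checked with a small scalar simulation.  This is
only a sketch in Python, not the vectorizer's code: `blend`, `perm` and
`deinterleave3` are hypothetical helpers that model a per-element select and a
single-source permute, and the masks are derived from the element index modulo
3 rather than taken from the patch.  It always blends V1 with V2 first and then
V3, so the intermediate A vectors can differ from the ones listed above, while
the B and C vectors come out the same.

```python
VLEN = 8    # elements per vector, as in the example above
GROUP = 3   # load group size

def blend(a, b, take_b):
    """Per-element select: position i comes from b if take_b[i], else from a."""
    return [y if t else x for x, y, t in zip(a, b, take_b)]

def perm(v, idx):
    """Single-source permute: result position j is v[idx[j]]."""
    return [v[i] for i in idx]

def deinterleave3(v1, v2, v3):
    """Two blends and one permute per output vector (hypothetical helper)."""
    out = []
    for r in range(GROUP):
        # Element n lives in input vector n // VLEN at position n % VLEN.
        # owner[p] records which input vector supplies position p.
        owner = [None] * VLEN
        for k in range(VLEN):
            n = r + GROUP * k
            pos = n % VLEN
            # Holds because GROUP and VLEN are coprime: no two wanted
            # elements share a position, so the blends never overlap.
            assert owner[pos] is None
            owner[pos] = n // VLEN
        a = blend(v1, v2, [o == 1 for o in owner])   # 1st step
        b = blend(a, v3, [o == 2 for o in owner])    # 2nd step
        idx = [(r + GROUP * k) % VLEN for k in range(VLEN)]
        out.append(perm(b, idx))                     # 3rd step
    return out

v1 = list(range(0, 8))
v2 = list(range(8, 16))
v3 = list(range(16, 24))
c1, c2, c3 = deinterleave3(v1, v2, v3)
print(c1)
print(c2)
print(c3)
```

The permute indices depend only on the residue r, so under these assumptions
each output needs one fixed permute mask and two fixed blend masks.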

The final permute here isn't trivial, crossing lanes for avx2 and all, but the
initial permute you use is similar.
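
To make the lane-crossing remark concrete, the sketch below assumes 8 elements
per 256-bit vector, so that each 128-bit half holds 4 elements.  The element
width is an assumption; the message does not state it.  The helper names are
hypothetical.  For each final permute it lists the source position of every
output element and the output positions whose source sits in the other half.

```python
VLEN = 8    # elements per vector, as in the example above
GROUP = 3   # load group size
HALF = 4    # assumed: elements per 128-bit half of a 256-bit vector

def final_perm_indices(r):
    """Source position in the blended vector for each output position."""
    return [(r + GROUP * k) % VLEN for k in range(VLEN)]

def crossing_positions(idx):
    """Output positions whose source lies in the other 128-bit half."""
    return [j for j, s in enumerate(idx) if j // HALF != s // HALF]

perms = [final_perm_indices(r) for r in range(GROUP)]
crossings = [crossing_positions(p) for p in perms]
for p, c in zip(perms, crossings):
    print(p, "crosses at", c)
```

Under these assumptions every one of the three permutes moves at least two
elements across the 128-bit boundary, so none of them can be done with a
purely in-lane shuffle.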


r~

Follow-Ups:
- Re: [PATCH, PR52252] Alternative way of vectorization for load groups of size 2 and 3.
  - From: Evgeny Stupachenko
