This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [RFC] Combine vectorized loops with its scalar remainder.
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Yuri Rumyantsev <ysrumyan at gmail dot com>
- Cc: gcc-patches <gcc-patches at gcc dot gnu dot org>, Jeff Law <law at redhat dot com>, Igor Zamyatin <izamyatin at gmail dot com>, Ilya Enkovich <enkovich dot gnu at gmail dot com>
- Date: Tue, 3 Nov 2015 12:47:41 +0100
- Subject: Re: [RFC] Combine vectorized loops with its scalar remainder.
- Authentication-results: sourceware.org; auth=none
- References: <CAEoMCqSmMRW1C2LniYShbfdA+JfSS6kzfrPYCcdd-rdVXa4mzg at mail dot gmail dot com>
On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi All,
>
> Here is a preliminary patch to combine a vectorized loop with its scalar
> remainder, a draft of which was proposed by Kirill Yukhin a month ago:
> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
> It was tested with the '-mavx2' option on a Haswell processor.
> The main goal is to improve the performance of vectorized loops for AVX512.
> Note that only loads/stores and simple reductions with binary operations are
> converted to masked form, e.g. load --> masked load and a reduction like
> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed by
> creating a new vector induction variable initialized with the consecutive
> values 0 .. VF-1, a new constant vector upper bound that contains the number
> of iterations, and the result of comparing the two, which serves as the
> mask vector.
> This implementation has several restrictions:
>
> 1. Multiple types are not supported.
> 2. SLP is not supported.
> 3. Gathers/scatters are also not supported.
> 4. Vectorization of loops with a low trip count is not implemented yet since
> it requires additional design and tuning.
>
> We are planning to eliminate all these restrictions in GCC 7.
>
> This patch will be extended to include a cost model to reject unprofitable
> transformations, e.g. the new vector body cost will be evaluated through a
> new target hook which estimates the cost of masking different vector
> statements. A new threshold parameter will be introduced which determines
> the permissible cost increase; it will be tuned on an AVX512 machine.
> This patch is not in sync with Ilya Enkovich's changes for AVX512 masked
> load/store support since only part of them is in the trunk compiler.
>
> Any comments will be appreciated.
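
For reference, the in-loop masking scheme described in the quoted text can be
sketched as a scalar emulation in C. This is only an illustration, not code
from the patch: the name masked_sum, the value VF = 8 and the choice of "+" as
the reduction operator are assumptions, and each inner lane loop stands in for
a single vector instruction.

```c
#include <stddef.h>

#define VF 8  /* assumed vectorization factor (hypothetical value) */

/* Scalar emulation of a masked vector loop computing r += a[i].
   Every iteration of the inner lane loop models one vector lane. */
int masked_sum(const int *a, size_t n)
{
    size_t iv[VF];  /* vector induction variable, starts at 0 .. VF-1 */
    int acc[VF];    /* vector accumulator for the reduction */

    for (size_t l = 0; l < VF; l++) {
        iv[l] = l;
        acc[l] = 0;
    }

    for (size_t i = 0; i < n; i += VF) {
        for (size_t l = 0; l < VF; l++) {
            int m = iv[l] < n;         /* mask = (IV < upper bound)       */
            int f = m ? a[iv[l]] : 0;  /* masked load: inactive lanes = 0 */
            int t = f + acc[l];        /* t  = f <op> r2                  */
            acc[l] = m ? t : acc[l];   /* r1 = m ? t : r2                 */
            iv[l] += VF;               /* advance the vector IV           */
        }
    }

    /* Final horizontal reduction of the vector accumulator. */
    int r = 0;
    for (size_t l = 0; l < VF; l++)
        r += acc[l];
    return r;
}
```

Note that the mask comparison is evaluated in every iteration here, including
all the full ones where every lane is active.
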
As stated in the previous discussion, I don't think the extra mask IV is a
good idea; we should instead have a masked final iteration for the epilogue
(yes, that's not really "combined" then). This is because in the end we'd
want not only AVX512 to benefit from this work but also other ISAs that can do
unaligned or masked operations (we can overlap the epilogue work with the
vectorized work or use masked loads/stores available with AVX). Note that the
same applies to the alignment prologue if present; I can't see how you can
handle that with the in-loop approach.
Richard.
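
The two epilogue strategies mentioned in the reply (a single masked final
iteration, or overlapping the epilogue with already vectorized work using
unaligned accesses) might look roughly like the scalar emulation below. This
is a hedged sketch, not GCC output: the function names, VF = 8 and the
"dst[i] = src[i] + 1" loop body are all hypothetical, and the overlap variant
assumes dst and src do not alias so that recomputing elements is harmless.

```c
#include <stddef.h>

#define VF 8  /* assumed vectorization factor (hypothetical value) */

/* Unmasked main loop followed by one masked vector iteration. */
void add_one_masked_epilogue(int *dst, const int *src, size_t n)
{
    size_t i = 0;

    /* Main vector loop: full vectors only, no mask required. */
    for (; i + VF <= n; i += VF)
        for (size_t l = 0; l < VF; l++)
            dst[i + l] = src[i + l] + 1;

    /* Epilogue: a single masked iteration covers the n - i < VF leftovers. */
    if (i < n)
        for (size_t l = 0; l < VF; l++) {
            int m = i + l < n;                /* lane is active?           */
            if (m)
                dst[i + l] = src[i + l] + 1;  /* masked load, masked store */
        }
}

/* Unmasked main loop followed by one overlapping full vector at n - VF.
   Assumes dst and src do not alias. */
void add_one_overlap_epilogue(int *dst, const int *src, size_t n)
{
    size_t i = 0;

    for (; i + VF <= n; i += VF)
        for (size_t l = 0; l < VF; l++)
            dst[i + l] = src[i + l] + 1;

    if (i < n) {
        if (n >= VF) {
            /* Redo the last VF elements with one unaligned full vector. */
            size_t start = n - VF;
            for (size_t l = 0; l < VF; l++)
                dst[start + l] = src[start + l] + 1;
        } else {
            /* Fewer than VF elements in total: plain scalar fallback. */
            for (; i < n; i++)
                dst[i] = src[i] + 1;
        }
    }
}
```

In both variants the main loop body carries no mask computation; the extra
work is confined to at most one iteration after the loop.
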