This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [RFC] Combine vectorized loops with its scalar remainder.
- From: Yuri Rumyantsev <ysrumyan at gmail dot com>
- To: Richard Biener <richard dot guenther at gmail dot com>
- Cc: gcc-patches <gcc-patches at gcc dot gnu dot org>, Jeff Law <law at redhat dot com>, Igor Zamyatin <izamyatin at gmail dot com>, ÐÐÑÑ ÐÐÐÐÐÐÑ <enkovich dot gnu at gmail dot com>
- Date: Tue, 3 Nov 2015 15:08:30 +0300
- Subject: Re: [RFC] Combine vectorized loops with its scalar remainder.
- Authentication-results: sourceware.org; auth=none
- References: <CAEoMCqSmMRW1C2LniYShbfdA+JfSS6kzfrPYCcdd-rdVXa4mzg at mail dot gmail dot com> <CAFiYyc2badGgiQDyAuW6N5CnD6qMGCNCHD3fFvqK=un5V5BmWg at mail dot gmail dot com>
Richard,
It looks like misunderstanding - we assume that for GCCv6 the simple
scheme of remainder will be used through introducing new IV :
https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
Is it true or we missed something?
Now we are testing vectorization of loops with small non-constant trip count.
Yuri.
2015-11-03 14:47 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> Hi All,
>>
>> Here is a preliminary patch to combine vectorized loop with its scalar
>> remainder, draft of which was proposed by Kirill Yukhin month ago:
>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>> It was tested wwith '-mavx2' option to run on Haswell processor.
>> The main goal of it is to improve performance of vectorized loops for AVX512.
>> Note that only loads/stores and simple reductions with binary operations are
>> converted to masked form, e.g. load --> masked load and reduction like
>> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed through
>> creation of a new vector induction variable initialized with consequent values
>> from 0.. VF-1, new const vector upper bound which contains number of iterations
>> and the result of comparison which is considered as mask vector.
>> This implementation has several restrictions:
>>
>> 1. Multiple types are not supported.
>> 2. SLP is not supported.
>> 3. Gather/Scatter's are also not supported.
>> 4. Vectorization of the loops with low trip count is not implemented yet since
>> it requires additional design and tuning.
>>
>> We are planning to eleminate all these restrictions in GCCv7.
>>
>> This patch will be extended to include cost model to reject unprofutable
>> transformations, e.g. new vector body cost will be evaluated through new
>> target hook which estimates cast of masking different vector statements. New
>> threshold parameter will be introduced which determines permissible cost
>> increasing which will be tuned on an AVX512 machine.
>> This patch is not in sync with changes of Ilya Enkovich for AVX512 masked
>> load/store support since only part of them is in trunk compiler.
>>
>> Any comments will be appreciated.
>
> As stated in the previous discussion I don't think the extra mask IV
> is a good idea
> and we instead should have a masked final iteration for the epilogue
> (yes, that's
> not really "combined" then). This is because in the end we'd not only
> want AVX512
> to benefit from this work but also other ISAs that can do unaligned or masked
> operations (we can overlap the epilogue work with the vectorized work or use
> masked loads/stores available with AVX). Note that the same applies to
> the alignment prologue if present, I can't see how you can handle that with the
> in-loop approach.
>
> Richard.