[RFC] Combine vectorized loops with its scalar remainder.

Tue Nov 10 12:30:00 GMT 2015

On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Richard,
>
> It looks like a misunderstanding - we assumed that for GCCv6 the simple
> scheme for the remainder would be used, by introducing a new IV:
> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>
> Is that true, or did we miss something?

<quote>
> > Do you have an idea how "masking" should best be organized to be usable
> > for both 4b and 4c?
>
> Do 2a ...
Okay.
</quote>

Richard.

> Now we are testing vectorization of loops with a small non-constant trip count.
> Yuri.
>
> 2015-11-03 14:47 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> Hi All,
>>>
>>> Here is a preliminary patch to combine a vectorized loop with its scalar
>>> remainder, a draft of which was proposed by Kirill Yukhin a month ago:
>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>> It was tested with the '-mavx2' option on a Haswell processor.
>>> Its main goal is to improve the performance of vectorized loops for AVX512.
>>> Note that only loads/stores and simple reductions with binary operations are
>>> converted to masked form, e.g. load --> masked load, and a reduction like
>>> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed by
>>> creating a new vector induction variable initialized with the consecutive
>>> values 0 .. VF-1, a new constant vector upper bound that holds the number of
>>> iterations, and a comparison of the two whose result is used as the mask vector.
>>> This implementation has several restrictions:
>>>
>>> 1. Multiple types are not supported.
>>> 2. SLP is not supported.
>>> 3. Gathers/scatters are also not supported.
>>> 4. Vectorization of loops with a low trip count is not implemented yet since
>>>    it requires additional design and tuning.
>>>
>>> We are planning to eliminate all these restrictions in GCCv7.
>>>
>>> This patch will be extended to include a cost model to reject unprofitable
>>> transformations, e.g. the new vector body cost will be evaluated through a
>>> new target hook which estimates the cost of masking different vector
>>> statements. A new threshold parameter will be introduced which determines
>>> the permissible cost increase; it will be tuned on an AVX512 machine.
>>> This patch is not in sync with Ilya Enkovich's changes for AVX512 masked
>>> load/store support, since only part of them is in the trunk compiler.
>>>
>>> Any comments will be appreciated.
>>
>> As stated in the previous discussion I don't think the extra mask IV is a
>> good idea and we instead should have a masked final iteration for the
>> epilogue (yes, that's not really "combined" then).  This is because in the
>> end we'd not only want AVX512 to benefit from this work but also other ISAs
>> that can do unaligned or masked operations (we can overlap the epilogue work
>> with the vectorized work or use masked loads/stores available with AVX).
>> Note that the same applies to the alignment prologue if present, I can't see
>> how you can handle that with the in-loop approach.
>>
>> Richard.
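
For illustration only, the two schemes discussed in this thread can be
sketched in scalar C that emulates the vector lanes. This is not code from
the patch: the lane count VF = 8, the sum reduction, and all function names
are assumptions made for the example. sum_mask_iv follows the in-loop
scheme from the patch description (a vector IV starting at 0 .. VF-1 is
compared against the iteration count to form the mask, and the reduction
becomes t = f <op> r2; r1 = m ? t : r2). sum_masked_epilogue follows the
alternative suggested in the reply: an unmasked main loop plus a single
masked final iteration.

```c
#include <stddef.h>

#define VF 8 /* assumed vectorization factor: lanes per emulated vector */

/* Scalar reference loop. */
static long sum_scalar(const int *a, size_t n)
{
    long r = 0;
    for (size_t i = 0; i < n; i++)
        r += a[i];
    return r;
}

/* Add up the per-lane partial sums after the loop. */
static long reduce_lanes(const long acc[VF])
{
    long r = 0;
    for (size_t l = 0; l < VF; l++)
        r += acc[l];
    return r;
}

/* In-loop scheme: every vector iteration is masked by "IV < bound". */
static long sum_mask_iv(const int *a, size_t n)
{
    size_t iv[VF];      /* vector induction variable */
    long acc[VF] = {0}; /* vector accumulator (r2/r1 in the description) */

    for (size_t l = 0; l < VF; l++)
        iv[l] = l;      /* initialized with 0 .. VF-1 */

    for (size_t base = 0; base < n; base += VF) {
        for (size_t l = 0; l < VF; l++) {
            int m = iv[l] < n;         /* mask lane: IV compared to bound */
            long f = m ? a[iv[l]] : 0; /* masked load: inactive lanes read nothing */
            long t = f + acc[l];       /* t = f <op> r2 */
            acc[l] = m ? t : acc[l];   /* r1 = m ? t : r2 */
            iv[l] += VF;               /* step the vector IV */
        }
    }
    return reduce_lanes(acc);
}

/* Alternative: unmasked main loop, then one masked final iteration. */
static long sum_masked_epilogue(const int *a, size_t n)
{
    long acc[VF] = {0};
    size_t i = 0;

    /* Main loop handles full vectors only and needs no mask. */
    for (; i + VF <= n; i += VF)
        for (size_t l = 0; l < VF; l++)
            acc[l] += a[i + l];

    /* Epilogue: at most one masked vector iteration for the remainder. */
    if (i < n) {
        for (size_t l = 0; l < VF; l++) {
            int m = (i + l) < n;
            long f = m ? a[i + l] : 0;
            long t = f + acc[l];
            acc[l] = m ? t : acc[l];
        }
    }
    return reduce_lanes(acc);
}
```

Both variants return the same value as sum_scalar for any n. The difference
is where the masking cost lands: the first pays for the mask computation and
the select on every iteration, while the second pays only once, in the
epilogue.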