This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [RFC] Masking vectorized loops with bound not aligned to VF.

From: Richard Biener <rguenther at suse dot de>
To: Kirill Yukhin <kirill dot yukhin at gmail dot com>
Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>, Ilya Enkovich <enkovich dot gnu at gmail dot com>, Yuri Rumyantsev <ysrumyan at gmail dot com>, Zamyatin Igor <igor dot zamyatin at intel dot com>
Date: Wed, 16 Sep 2015 14:30:30 +0200 (CEST)
Subject: Re: [RFC] Masking vectorized loops with bound not aligned to VF.
References: <20150914201415 dot GA47817 at msticlxl57 dot ims dot intel dot com>

On Mon, 14 Sep 2015, Kirill Yukhin wrote:

> Hello,
> I'd like to initiate discussion on vectorization of loops which 
> boundaries are not aligned to VF. Main target for this optimization 
> right now is x86's AVX-512, which features per-element embedded masking 
> for all instructions. The main goal for this mail is to agree on overall 
> design of the feature.
> 
> This approach was presented @ GNU Cauldron 2015 by Ilya Enkovich [1].
>  
> Here's a sketch of the algorithm:
>   1. Add check on basic stmts for masking: possibility to introduce index vector and
>      corresponding mask
>   2. At the check if statements are vectorizable we additionally check if stmts 
>      need and can be masked and compute masking cost. Result is stored in `stmt_vinfo`.
>      We are going  to mask only mem. accesses, reductions and modify mask for already 
>      masked stmts (mask load, mask store and vect. condition)

I think you also need to mask divisions (to avoid integer divide by zero) and
want to mask FP ops which may result in NaNs or denormals (because those
generally slow down execution a lot in my experience).

Why not simply mask all stmts?
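
To illustrate the division point, here is a rough sketch in plain C that emulates four-lane vectors with scalar loops.  VF, masked_div_v4 and div_masked_loop are names made up for this illustration, not vectorizer code.  It assumes one mask bit per lane and that inactive lanes are neither read nor computed.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* assumed vectorization factor: four int lanes */

/* Emulates one masked vector division.  Lanes whose mask bit is clear
   are skipped entirely, so whatever sits in an inactive divisor lane
   (possibly zero) is never divided by. */
void masked_div_v4(int *dst, const int *num, const int *den, unsigned mask)
{
    assert(mask <= 0xFu);  /* only VF mask bits are meaningful */
    for (int lane = 0; lane < VF; ++lane)
        if (mask & (1u << lane))
            dst[lane] = num[lane] / den[lane];
}

/* c[i] = a[i] / b[i] for i in [0, n), with the whole loop masked:
   the last iteration runs with a partial mask instead of falling
   back to a scalar remainder loop. */
void div_masked_loop(int *c, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i += VF) {
        size_t left = n - i;
        unsigned mask = left >= VF ? 0xFu : (1u << left) - 1u;
        masked_div_v4(c + i, a + i, b + i, mask);
    }
}
```

With n = 6 the second iteration runs with mask 0x3, so lanes 2 and 3 are never touched.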

>   3. Make a decision about masking: take computed costs and est. iterations count
>      into consideration
>   4. Modify prologue/epilogue generation according to the decision made at analysis. Three
>      options available:
>     a. Use scalar remainder
>     b. Use masked remainder. Won't be supported in first version
>     c. Mask main loop
>   5. Support vectorized loop masking: 
>     - Create stmts for mask generation
>     - Support generation of masked vector code (create generic vector code then
>       patch it w/ masks)
>       -  Mask loads/stores/vconds/reductions only
>
>  In first version (targeted v6) we're not going to support 4.b and loop 
> mask pack/unpack. No `pack/unpack` means that masking will be supported 
> only for types w/ the same size as index variable

This means that if ncopies for any stmt is > 1 masking won't be supported,
right?  (you'd need two or more different masks)
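
A small sketch of what "two or more different masks" could mean, assuming a loop mask with one bit per scalar iteration, VF = 8, and a stmt that needs two 4-lane vectors (ncopies == 2).  split_mask8 is a hypothetical helper for illustration only; it is not the proposed pack/unpack implementation.

```c
#include <assert.h>

/* Hypothetical helper: splits one 8-lane loop mask (one bit per
   iteration) into two 4-lane masks, one for each vector copy of a
   stmt whose ncopies is 2. */
void split_mask8(unsigned mask8, unsigned *lo4, unsigned *hi4)
{
    assert(mask8 <= 0xFFu);        /* only eight mask bits are meaningful */
    *lo4 = mask8 & 0xFu;           /* iterations 0..3 -> first copy  */
    *hi4 = (mask8 >> 4) & 0xFu;    /* iterations 4..7 -> second copy */
}
```

Without such a split, only stmts whose vectors have exactly VF lanes can use the loop mask directly.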

>  
> [1] - https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Vectorization+for+Intel+AVX-512.pdf
> 
> What do you think?

There was the idea some time ago to use single-iteration vector
variants for prologues/epilogues by simply overlapping them with
the vector loop (and either making sure to mask out the overlap
area or making sure the result stays the same).  This is kind of
similar to 4b and thus IMHO it's better to get 4b implemented
rather than trying 4c.  So for example

 int a[];
 for (i=0; i < 13; ++i)
   a[i] = i;

would be vectorized (with v4si) as

 for (i=0; i < 13 / 4; ++i)
   ((v4si *)a)[i] = { ... };
 *(v4si *)(&a[9]) = { ... };

where the epilogue store of course would be unaligned.  The masked
variant can avoid the data pointer adjustment and instead use a masked
store.
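
The overlapping scheme can be sketched in plain C as follows, emulating v4si stores with scalar loops.  store_iota_v4 and fill_iota_overlap are hypothetical names, and the sketch assumes the trip count is at least one full vector.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* assumed vectorization factor (v4si) */

/* Emulates one v4si store of { base, base+1, base+2, base+3 }. */
void store_iota_v4(int *dst, int base)
{
    for (int lane = 0; lane < VF; ++lane)
        dst[lane] = base + lane;
}

/* a[i] = i for i in [0, n), using full vectors only.  The remainder
   is handled by one extra vector store that overlaps the tail of the
   main loop and recomputes the same values there. */
void fill_iota_overlap(int *a, size_t n)
{
    assert(n >= VF);  /* the overlapping store needs one full vector */
    size_t i;
    for (i = 0; i + VF <= n; i += VF)
        store_iota_v4(a + i, (int)i);
    if (i < n)        /* epilogue: unaligned store ending at a[n-1] */
        store_iota_v4(a + (n - VF), (int)(n - VF));
}
```

For n = 13 the epilogue store starts at a[9], matching the example above; it is safe here only because rewriting a[9..11] produces the values that are already there.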

OTOH it might be that the unaligned scheme is as efficient as the
masked version.  But the masked version is more trivially correct:
data dependences can make the above idea not work without masking
out stores, as in

 for (i=0; i < 13; ++i)
   a[i] = a[i+1];

Obviously the overlapping iterations in the epilogue would
compute bogus values here, because they re-read elements that the
main loop has already overwritten.  To avoid this we can merge the
result with the previously stored values (using properly computed
masks) before storing it.
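
A sketch of that merge in plain C, again emulating four-lane vectors with scalar loops.  shift_left_overlap is a hypothetical name; the sketch assumes a[] holds n + 1 elements and n is at least VF.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* assumed vectorization factor */

/* a[i] = a[i+1] for i in [0, n); a[] must hold n + 1 elements. */
void shift_left_overlap(int *a, size_t n)
{
    assert(n >= VF);
    size_t i;
    for (i = 0; i + VF <= n; i += VF) {
        int v[VF];
        for (int lane = 0; lane < VF; ++lane)   /* vector load  */
            v[lane] = a[i + lane + 1];
        for (int lane = 0; lane < VF; ++lane)   /* vector store */
            a[i + lane] = v[lane];
    }
    if (i < n) {
        size_t start = n - VF;     /* overlapping epilogue vector     */
        size_t done = i - start;   /* lanes the main loop already did */
        int v[VF], old[VF], merged[VF];
        for (int lane = 0; lane < VF; ++lane) {
            v[lane] = a[start + lane + 1];  /* bogus in the first lanes */
            old[lane] = a[start + lane];    /* previously stored values */
        }
        /* Merge: keep old values in overlapped lanes, new ones elsewhere. */
        for (int lane = 0; lane < VF; ++lane)
            merged[lane] = (size_t)lane >= done ? v[lane] : old[lane];
        for (int lane = 0; lane < VF; ++lane)   /* full vector store */
            a[start + lane] = merged[lane];
    }
}
```

For n = 13 the epilogue vector starts at a[9] and only its last lane (a[12]) takes the newly computed value.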

Basically both 4b and the above idea need to peel a vector
iteration and "modify" it.  The same trick can be applied to
prologue loops of course.

Any chance you can try working on 4b instead?  It also feels
like it would need fewer hacks throughout the vectorizer
(basically post-processing the generated vector loop).

If 4b is implemented I don't think 4c is worth doing.

Thanks,
Richard.

Follow-Ups:
- Re: [RFC] Masking vectorized loops with bound not aligned to VF.
  - From: Ilya Enkovich

References:
- [RFC] Masking vectorized loops with bound not aligned to VF.
  - From: Kirill Yukhin
