[RFC][tree-vect]PR 88915: Further vectorize second loop when versioning

Fri Jul 19 12:23:00 GMT 2019

On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:

> 
> 
> On 15/07/2019 11:54, Richard Biener wrote:
> > On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:
> > 
> > > 
> > > 
> > > On 12/07/2019 11:19, Richard Biener wrote:
> > > > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:
> > > 
> > > 
> > > I have code that can split the condition and alias checks in
> > > 'vect_loop_versioning'. For this approach I am considering keeping that
> > > bit of
> > > code and seeing if I can patch up the checks after vectorizing the
> > > epilogue
> > > further. I think initially I will just go with a "hacked up" way of
> > > passing
> > > down the bb with the iteration check and split the false edge every time
> > > we
> > > vectorize it further. Will keep you posted on progress. If you have any
> > > pointers/tips they are most welc ome!
> > 
> > I thought to somehow force the idea that we have a prologue loop
> > to the vectorizer so it creates the number-of-vectorized iterations
> > check and branch around the main (highest VF) vectorized loop.
> > 
> 
> Hmm I think I may have skimmed over this earlier. I am reading it now and am
> not entirely sure what you mean by this. Specifically the "number of
> vectorized iterations" or how a prologue loop plays a role in this.

When there is no prologue then the versioning condition currently
ensures we always enter the vector loop.  Only if there is a prologue
is there a check and branch eventually skipping right to the epilogue.
If the versioning condition (now using a lower VF) doesn't guarantee
this we have to add this jump-around.

> 
> > > > 
> > > > The advantage is that this would just use the epilogue vectorization
> > > > code and it would avoid excessive code growth if you have many
> > > > VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and
> > > > 64 byte vectors...).  The disadvantage is of course that a small
> > > > number of loops will not enter the vector code at all - namely
> > > > those that would pass the alias check for lowest_VF but not the
> > > > one for highest_VF.  I'm sure this isn't a common situation and
> > > > in quite a number of cases we formulate the alias check in a way
> > > > that it isn't dependent on the VF anyways.
> > > 
> > > The code growth is indeed a factor and I can see the argument for choosing
> > > this approach over the other. Cases of such specific overlaps are most
> > > likely
> > > oddities rather than the common situation.
> > 
> > Yeah, it also looks simplest to me (and a motivation to enable
> > epilogue vectorization by default).
> > 
> > > > There's also possibly
> > > > an extra branch for the case the highest_VF loop isn't entered
> > > > (unless there already was a prologue loop).
> > > I don't understand this one, can you elaborate?
> > 
> > The branch around the main vectorized loop I talked about above.
> > So I'd fool the versioning condition to use the lowest VF for
> > the iteration count checking and use the code that handles
> > zero-trip iteration count for the vector loop unconditionally.
> 
> I guess this sheds some light on the comment above. And it definitely implies
> we would need to know the lowest VF when creating this condition. Which is
> tricky.

We can simply use the smallest vector size supported by the target to
derive it from the actual VF, no?

> > 
> > In some way this makes checking the niter condition on the version
> > check pointless (at least if we have a really low lowest VF like
> > on x64 where it will likely be 2), so we may want to elide that
> > completely?  For the check to be "correct" we'd also need to
> > compute the lowest VF a vectorized epilogue is still profitable
> > (on x86 those will run once or never, but we can also end up
> > with say main AVX512 vectorization, and a single vectorized
> > epilogue with SSE2 if we somehow figure AVX256 vectorization
> > isn't profitable for it - we can also end up with non-vectorizable
> > epilogue).  So with the current setup how we vectorize epilogues
> > we maybe want to have a location of the version niter check we
> > can "patch up" later after (not) vectorizing the epilogue(s).
> 
> I think you come to the same conclusion here as I mentioned above. Somehow I
> wish I had understood this better when I first read it ... but eh such is life
> :)
> 
> I went on and continued hacking around the approach of splitting the niter and
> alias check I had earlier. I got it to work with a single loop. However, when
> dealing with nested loops I run into the problem that I'd need to sink the
> niter checks. Otherwise you could end up with an alias check and niter checks
> outside the outer loop. Where the 2nd and consequent VF niter checks point to
> the corresponding epilogues in the inner loop.  However, once those are done
> and it iterates over the outer-loop, it will go through the higher VF's first,
> leading to wrong behavior.
> 
> To illustrate what I mean, here is a very simplistic illustration of what is
> happening:
> 
> BB1: Alias check
> BB2: niter check VF 32
> BB3: niter check VF 16
> BB4: Vectorized loop VF32
> BB5: Vectorized loop VF16
> BB6: Remaining epilogue scalar loop
> BB7: Outer loop iteration (updates IV's and DRs of inner loop)
> BB8: Scalar inner&outer loop
> 
> With edges:
> BB1 -T-> BB2
> BB1 -F-> BB8
> BB2 -T-> BB4
> BB2 -F-> BB3
> BB3 -T-> BB5
> BB3 -F-> BB8
> BB4 -> BB5
> BB5 -> BB6
> BB6 -> BB7
> BB7 -> BB4
> 
> Where -T-> is a True edge and -F-> is a False edge
> 
> So my first thought to solve this is to sink BB2 and BB3 into the loop for
> which BB7 is the latch.
> 
> I.e. make BB7 -> BB2
> 
> But then I would argue, it would be good to introduce a BB9:
> BB1 -T-> BB9
> BB9 -T-> BB2
> BB9 -F-> BB8
> 
> Where BB9 checks that niter is at least the lowest VF.
> 
> Sorry if I am repeating what you were telling me to do all along :')

I'm not sure I understand - why would you have any check not inside
the outer loop?  Yes, we now eventually hoist versioning checks
but the VF checks for the individual variants should be around
the vectorized loop itself (so not really part of the versioning check).

> Cheers,
> Andre
> 
> PS: I often find myself having to patch the DOMINATOR information, sometimes
> its easy to, but sometimes it can get pretty complicated. I wonder whether it
> would make sense to write something that traverses a loop and corrects this,
> if it doesn't exist already.

There's iterate-fix-dominators, but unless you create new edges/blocks
manually rather than doing split-block/redirect-edge which should do
dominator updating for you.

Richard.

> 
> > 
> > Richard.
> > 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix ImendÃ¶rffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG NÃ¼rnberg)