[RFC] Using main loop's updated IV as base_address for epilogue vectorization

Richard Biener rguenther@suse.de
Mon Jun 14 10:57:51 GMT 2021


On Mon, 14 Jun 2021, Richard Biener wrote:

> On Mon, 14 Jun 2021, Andre Vieira (lists) wrote:
> 
> > Hi,
> > 
> > 
> > On 20/05/2021 11:22, Richard Biener wrote:
> > > On Mon, 17 May 2021, Andre Vieira (lists) wrote:
> > >
> > >> Hi,
> > >>
> > >> So this is my second attempt at finding a way to improve how we generate
> > >> the
> > >> vector IV's and teach the vectorizer to share them between main loop and
> > >> epilogues. On IRC we discussed my idea to use the loop's control_iv, but
> > >> that
> > >> was a terrible idea and I quickly threw it in the bin. The main problem,
> > >> that
> > >> for some reason I failed to see, was that the control_iv increases by 's'
> > >> and
> > >> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
> > >> amount of elements we handle per iteration. That means the epilogue loops
> > >> would have to start from the last loop's IV * the last loop's NELEMENT's
> > >> and
> > >> that would just cause a mess.
> > >>
> > >> Instead I started to think about creating IV's for the datarefs and what I
> > >> thought worked best was to create these in scalar before peeling. That way
> > >> the
> > >> peeling mechanisms takes care of the duplication of these for the vector
> > >> and
> > >> scalar epilogues and it also takes care of adding phi-nodes for the
> > >> skip_vector paths.
> > > How does this work for if-converted loops where we use the
> > > non-if-converted scalar loop for (scalar) peeling but the
> > > if-converted scalar loop for vectorized epilogues?  I suppose
> > > you're only adjusting the if-converted copy.
> > True hadn't thought about this :(
> > >
> > >> These new IV's have two functions:
> > >> 1) 'vect_create_data_ref_ptr' can use them to:
> > >>  a) if it's the main loop: replace the values of the 'initial' value of the
> > >> main loop's IV and the initial values in the skip_vector phi-nodes
> > >>  b) Update the the skip_vector phi-nodes argument for the non-skip path
> > >> with
> > >> the updated vector ptr.
> > > b) means the prologue IV will not be dead there so we actually need
> > > to compute it?  I suppose IVOPTs could be teached to replace an
> > > IV with its final value (based on some other IV) when it's unused?
> > > Or does it already magically do good?
> > It does not and ...
> > >
> > >> 2) They are used for the scalar epilogue ensuring they share the same
> > >> datareference ptr.
> > >>
> > >> There are still a variety of 'hacky' elements here and a lot of testing to
> > >> be
> > >> done, but I hope to be able to clean them away. One of the main issues I
> > >> had
> > >> was that I had to skip a couple of checks and things for the added
> > >> phi-nodes
> > >> and update statements as these do not have stmt_vec_info representation.
> > >> Though I'm not sure adding this representation at their creation was much
> > >> cleaner... It is something I could play around with but I thought this was
> > >> a
> > >> good moment to ask you for input. For instance, maybe we could do this
> > >> transformation before analysis?
> > >>
> > >> Also be aware that because I create a IV for each dataref this leads to
> > >> regressions with SVE codegen for instance. NEON is able to use the
> > >> post-index
> > >> addressing mode to increase each dr IV at access time, but SVE can't do
> > >> this.
> > >> For this I don't know if maybe we could try to be smart and create shared
> > >> IV's. So rather than make them based on the actual vector ptr, use a shared
> > >> sizetype IV that can be shared among dr IV's with the same step. Or maybe
> > >> this
> > >> is something for IVOPTs?
> > > Certainly IVOPTs could decide to use the newly created IVs in the
> > > scalar loops for the DRs therein as well.  But since IVOPTs only
> > > considers a single loop at a time it will probably not pay too
> > > much attention and is only influenced by the out-of-loop uses of
> > > the final values of the IVs.
> > >
> > > My gut feeling tells me that whatever we do we'll have to look
> > > into improving IVOPTs to consider multiple loops.
> > 
> > So I redid the IV-sharing and it's looking a lot simpler and neater, however
> > it only shares IVs between vectorized loops and not scalar pro- or epilogues.
> > I am not certain IVOPTs will be able to deal with these, as it has no
> > knowledge of the number of iterations of each different loop. So take for
> > instance a prologue peeling for alignment loop and a first main vectorization
> > loop. To be able to reuse the IV's from the prologue in the main vectorization
> > loop it would need to know that the initial start adress + PEELING_NITERS ==
> > base address for main vectorization loop.
> 
> Yes, indeed.  I think what we may want to do is to make this CSE/PRE
> "easy" when IVOPTs manages to choose compatible IVs.  Since we generally
> have bypass jumps around the various loops this boils down to PRE
> since the redundancies only appear on the loop-to-loop fallthru edges
> which means we won't see those opportunities on GIMPLE (unless we
> realize the PRE opportunities in the actual vectorizer generated code).
> But eventually RTL PRE does all the magic.
> 
> Iff we'd manage to present IVOPTs with out-of-loop IV uses that are
> actual initial values of IVs in other loops that might be a key trick
> to bias its cost model.  But then in the end I think it's much more
> important to choose the best IV for the loop bodies rather than
> paying attention to IV base setup costs for adjacent loops
> (in the vectorizer case we of course know the vectorized epilogue
> loops usually do not iterate and the final scalar epilogue iterates
> only a few times...).
> 
> > I'll starting testing this approach for correctness if there are no major
> > concerns. Though I suspect we will only want to turn this into a patch once we
> > have the IVOPTs work done to a point where it at least doesn't regress codegen
> > because of shared IVs and eventually we can look at how to solve the sharing
> > between vectorized and scalar loops.
> 
> Indeed.  For example a simple
> 
> int a[1024], b[1024], c[1024];
> 
> void foo(int n)
> {
>   for (int i = 0; i < n; ++i)
>     a[i+1] += c[i+i] ? b[i+1] : 0;
> }
> 
> should usually see peeling for alignment (though on x86 you need
> exotic -march= since cost models generally have equal aligned and
> unaligned access costs).  For example with -mavx2 -mtune=atom
> we'll see an alignment peeling prologue, a AVX2 vector loop,
> a SSE2 vectorized epilogue and a scalar epilogue.  It also
> shows the original scalar loop being used in the scalar prologue
> and epilogue.
> 
> We're not even trying to make the counting IV easily used
> across loops (we're not counting scalar iterations in the
> vector loops).

Specifically we see

<bb 33> [local count: 94607391]:
niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
_67 = niters_vector_mult_vf.10_62 + 7;
_64 = (int) niters_vector_mult_vf.10_62;
tmp.11_63 = i_43 + _64;
if (niters.8_45 == niters_vector_mult_vf.10_62)
  goto <bb 37>; [12.50%]
else
  goto <bb 36>; [87.50%]

after the maini vect loop, recomputing the original IV (i) rather
than using the inserted canonical IV.  And then the vectorized
epilogue header check doing

<bb 36> [local count: 93293400]:
# i_59 = PHI <tmp.11_63(33), 0(18)>
# _66 = PHI <_67(33), 0(18)>
_96 = (unsigned int) n_10(D);
niters.26_95 = _96 - _66;
_108 = (unsigned int) n_10(D);
_109 = _108 - _66;
_110 = _109 + 4294967295;
if (_110 <= 3)
  goto <bb 47>; [10.00%]
else
  goto <bb 40>; [90.00%]

re-computing everything from scratch again (also notice how
the main vect loop guard jumps around the alignment prologue
as well and lands here - and the vectorized epilogue using
unaligned accesses - good!).

That is, I'd expect _much_ easier jobs if we'd manage to
track the number of performed scalar iterations (or the
number of scalar iterations remaining) using the canonical
IV we add to all loops across all of the involved loops.

Richard.


More information about the Gcc-patches mailing list