[RFC] Using main loop's updated IV as base_address for epilogue vectorization

Richard Biener rguenther@suse.de
Wed Jun 16 11:13:55 GMT 2021


On Wed, 16 Jun 2021, Andre Vieira (lists) wrote:

> 
> On 14/06/2021 11:57, Richard Biener wrote:
> > On Mon, 14 Jun 2021, Richard Biener wrote:
> >
> >> Indeed. For example a simple
> >> int a[1024], b[1024], c[1024];
> >>
> >> void foo(int n)
> >> {
> >>    for (int i = 0; i < n; ++i)
> >>      a[i+1] += c[i+i] ? b[i+1] : 0;
> >> }
> >>
> >> should usually see peeling for alignment (though on x86 you need
> >> exotic -march= since cost models generally have equal aligned and
> >> unaligned access costs).  For example with -mavx2 -mtune=atom
> >> we'll see an alignment peeling prologue, an AVX2 vector loop,
> >> an SSE2 vectorized epilogue and a scalar epilogue.  It also
> >> shows the original scalar loop being used in the scalar prologue
> >> and epilogue.
> >>
> >> We're not even trying to make the counting IV easily used
> >> across loops (we're not counting scalar iterations in the
> >> vector loops).
> > Specifically we see
> >
> > <bb 33> [local count: 94607391]:
> > niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> > _67 = niters_vector_mult_vf.10_62 + 7;
> > _64 = (int) niters_vector_mult_vf.10_62;
> > tmp.11_63 = i_43 + _64;
> > if (niters.8_45 == niters_vector_mult_vf.10_62)
> >    goto <bb 37>; [12.50%]
> > else
> >    goto <bb 36>; [87.50%]
> >
> > after the main vect loop, recomputing the original IV (i) rather
> > than using the inserted canonical IV.  And then the vectorized
> > epilogue header check doing
> >
> > <bb 36> [local count: 93293400]:
> > # i_59 = PHI <tmp.11_63(33), 0(18)>
> > # _66 = PHI <_67(33), 0(18)>
> > _96 = (unsigned int) n_10(D);
> > niters.26_95 = _96 - _66;
> > _108 = (unsigned int) n_10(D);
> > _109 = _108 - _66;
> > _110 = _109 + 4294967295;
> > if (_110 <= 3)
> >    goto <bb 47>; [10.00%]
> > else
> >    goto <bb 40>; [90.00%]
> >
> > re-computing everything from scratch again (also notice how
> > the main vect loop guard jumps around the alignment prologue
> > as well and lands here - and the vectorized epilogue using
> > unaligned accesses - good!).
> >
> > That is, I'd expect _much_ easier jobs if we'd manage to
> > track the number of performed scalar iterations (or the
> > number of scalar iterations remaining) using the canonical
> > IV we add to all loops across all of the involved loops.
> >
> > Richard.
> 
> 
> So I am now looking at using an IV that counts scalar iterations rather than
> vector iterations, and at reusing that IV through all loops (prologue, main
> loop, vect_epilogue and scalar epilogue). The first is easy, since that's what
> we already do for partial vectors or non-constant VFs. The latter requires some
> plumbing and removing a lot of the code in there that creates new IVs going
> from [0, niters - previous iterations]. I don't yet have a clear-cut view of
> how to do this. I first thought of keeping track of the 'control' IV in the
> loop_vinfo, but the prologue and scalar epilogues won't have one. 'loop' keeps
> a control_ivs struct, but that is used for overflow detection and only keeps
> track of what looks like a constant 'base' and 'step'. I'm not quite sure how
> all that works, but intuitively it doesn't seem like the right thing to reuse.

Maybe it's enough to maintain this [remaining] scalar iterations counter
between loops, thus after the vector loop do

  remain_scalar_iter -= vector_iters * vf;

and so on.  This should make it possible to do some first-order cleanups,
avoiding some repeated computations.  It does of course involve placing
additional PHIs for this remain_scalar_iter variable (I'd be hesitant
to rely on the SSA renamer for this because of its expense).

I think that for all later jump-around tests tracking remaining
scalar iters is more convenient than tracking performed scalar iters.
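
As a rough source-level sketch of that idea (plain C standing in for
what the vectorizer would generate, not what GCC emits today): the
function name add_arrays, the peel parameter, the vec_body helper and
the VF values 8 and 4 are all made up for illustration.  The point is
only that every jump-around test reads the one remain_scalar_iter
counter instead of recomputing niters from n.

```c
#include <assert.h>
#include <stddef.h>

enum { VF_MAIN = 8, VF_EPILOGUE = 4 };  /* assumed vectorization factors */

/* Stand-in for one vector iteration: handles vf consecutive elements.  */
static void vec_body(int *dst, const int *src, size_t i, unsigned vf)
{
  for (unsigned k = 0; k < vf; ++k)
    dst[i + k] += src[i + k];
}

/* Adds src[0..n) into dst[0..n).  peel is the (hypothetical) number of
   scalar iterations the alignment prologue wants to run.  Returns the
   scalar iterations left over at the end, which should be 0.  */
unsigned add_arrays(int *dst, const int *src, unsigned n, unsigned peel)
{
  unsigned remain_scalar_iter = n;
  size_t i = 0;

  /* Alignment prologue: at most peel scalar iterations.  */
  unsigned prologue_iters = peel < remain_scalar_iter ? peel : remain_scalar_iter;
  for (unsigned k = 0; k < prologue_iters; ++k, ++i)
    dst[i] += src[i];
  remain_scalar_iter -= prologue_iters;

  /* Main vector loop.  */
  unsigned vector_iters = remain_scalar_iter / VF_MAIN;
  for (unsigned k = 0; k < vector_iters; ++k, i += VF_MAIN)
    vec_body(dst, src, i, VF_MAIN);
  remain_scalar_iter -= vector_iters * VF_MAIN;

  /* Vectorized epilogue: runs at most once here, since fewer than
     VF_MAIN iterations remain.  The guard only looks at the counter.  */
  if (remain_scalar_iter >= VF_EPILOGUE)
    {
      vec_body(dst, src, i, VF_EPILOGUE);
      i += VF_EPILOGUE;
      remain_scalar_iter -= VF_EPILOGUE;
    }

  /* Scalar epilogue: whatever is still left.  */
  for (; remain_scalar_iter > 0; --remain_scalar_iter, ++i)
    dst[i] += src[i];

  return remain_scalar_iter;
}
```

In GIMPLE terms each of the "remain_scalar_iter -= ..." updates would
correspond to a new SSA name, with a PHI at every loop entry that can
also be reached by a jump-around edge.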

> I'll go hack around and keep you posted on progress.

Thanks - it's an iffy area ...
Richard.

