[PATCH][RFC] Add versioning for constant strides for vectorization

Richard Guenther rguenther@suse.de
Tue Jan 27 12:55:00 GMT 2009


On Tue, 27 Jan 2009, Dorit Nuzman wrote:

> >
> > This patch adds the capability to the vectorizer to perform versioning
> > for the case of a constant (suitable) stride.  For example for
> >
> 
> awesome!
> 
> Another very useful application of the ability to work with unknown strides
> is in the context of outer-loop vectorization: if you have an unknown
> stride in the inner-loop and stride one in the outer-loop, you can
> vectorize the outer-loop (in which case you don't care about the inner-loop
> stride as long as it is invariant). So for this purpose there's no need for
> versioning - just to relax some assumptions here and there in the
> vectorizer about the inner-loop stride having to be a known constant, which
> is what part of your patch already starts to do (I know this is not the
> purpose of your patch, but it provides initial infrastructure towards
> that). I think this is quite a common case, and it appears in several PRs -
> e.g. PR33113 (with a bit more detailed discussion) and PR37021 (which has a
> (i,j) instead of a(j,i) in the example you show below); Outer-loop
> vectorization would be better cause the reduction would be done more
> accurately and more efficiently than with inner-loop vectorization, and it
> could be done for any inner-loop stride (not only 1).
> 
> [off topic comment:  At least a couple tweaks that are currently required
> to allow outer-loop vectorization for examples like in the above mentioned
> PRs:
> 1) somehow don't let gcc create guard code around the innermost loop to
> check that it executes more than zero iterations. This creates a
> complicated control flow structure within the outer-loop. For now you have
> to have  constant number of iterations for the inner-loop because of that,
> or insert a statement like "if (LB<=0) return;" before the loop...
> 2) use -fno-tree-sink cause otherwise it moves the loop iv increment to the
> latch block and the vectorizer likes to have the latch block empty...
> I really hope we could do something about these... maybe use a COND
> expression for the LB to solve the first problem, and clean up non-empty
> latch blocks as part of vectorization preprocessing for the second?
> ]
> 
> I'm sorry for digressing... I know this is a bit off topic, but besides
> this patch adding a very good feature to have as is, it's also a beginning
> of a direction that can really help outer-loop vectorization in really
> common cases - so this is very good news!
> 
> > subroutine to_product_of(self,a,b,a1,a2)
> >   complex(kind=8) :: self (:)
> >   complex(kind=8), intent(in) :: a(:,:)
> >   complex(kind=8), intent(in) :: b(:)
> >   integer a1,a2
> >   do i = 1,a1
> >     do j = 1,a2
> >       self(i) = self(i) + a(j,i)*b(j)
> >     end do
> >   end do
> > end subroutine
> >
> > we can only apply vectorization if the strides of the fastest dimension
> > of self, a and b are one (they are loaded from the passed array
> > descriptors and thus appear as (loop invariant) variables).
> >
> > During the implementation of this I noticed that peeling for
> > number of iterations (we have to unroll the above loop twice, and so
> > for an odd number of iterations have a epilogue loop for the remaining
> > iteration(s)) does not play well with versioning and we end up
> > vectorizing the wrong loop.  So I just disabled versioning if we
> > apply peeling with an epilogue loop
> 
> yes, this is an old limitation since the introduction of loop versioning to
> the vectorizer - it's either peeling or versioning... (not for any good
> reason):)
> 
> >  and instead attach the versioning
> > condition to the pre-condition of the main loop that skips directly
> > to the epilogue if the number of iterations is too small.  We obviously
> > can use the epilogue loop as the non-vectorized version.
> >
> 
> maybe we should consider doing that also when versioning for
> aliasing/alignment...

Yes, the patch does this.

> > This patch also inserts an extra copyprop and dce pass before the
> > vectorizer so it can recognize the reduction in the above testcase
> > (LIM has made that reduction non-obvious).
> 
> cool. This was also repeatedly reported problem... I wonder if it doesn't
> solve a few missed-optimization PRs on the way...
>
> > So I noticed that
> > copyprop does not preserve loop-closed SSA form and fixed that as well.
> >
> > Some earlier version bootstrapped and tested ok on
> > x86_64-unknown-linux-gnu, a final attempt is still running.
> >
> > I didn't yet performance test this extensively, but it might need
> > cost-model adjustments and/or need to wait until we have profile
> > feedback to properly seed vectorizer analysis here.  A micro-benchmark
> > based on the above loop shows around 15% improvement on AMD K10.
> >
> > Feedback (and ppc testing) is still welcome of course.
> >
> 
> I didn't review the code very closely, but one tiny thing that would be
> nice to add is a printout to the dump file reporting that versioining for
> stride is done.

Indeed.

> about:
> 
> > +   /* ???  Delay this change until after versioning or
> > +      preserve the original step somewhere.  */

Ah, this comment is a left-over.  The change here is definitely needed.

> When we take this forward to support outer-loop vectorization with unknown
> inner-loop strides, we'd indeed want to
> delay this change.

I guess outer-loop vectorization should simply deal with non-constant
DR_STEP then, even when the versioning for non-constant strides is not
done?  So, the data-ref change should already support this?

I have SPEC2006 tested the patch and it doesn't show any improvements
or regressions.

Richard.



More information about the Gcc-patches mailing list