[PATCH][RFC] Add versioning for constant strides for vectorization
Tue Jan 27 12:55:00 GMT 2009
On Tue, 27 Jan 2009, Dorit Nuzman wrote:
> > This patch adds the capability to the vectorizer to perform versioning
> > for the case of a constant (suitable) stride. For example for
> Another very useful application of the ability to work with unknown strides
> is in the context of outer-loop vectorization: if you have an unknown
> stride in the inner-loop and stride one in the outer-loop, you can
> vectorize the outer-loop (in which case you don't care about the inner-loop
> stride as long as it is invariant). So for this purpose there's no need for
> versioning - just to relax some assumptions here and there in the
> vectorizer about the inner-loop stride having to be a known constant, which
> is what part of your patch already starts to do (I know this is not the
> purpose of your patch, but it provides initial infrastructure towards
> that). I think this is quite a common case, and it appears in several PRs -
> e.g. PR33113 (with a bit more detailed discussion) and PR37021 (which has a
> (i,j) instead of a(j,i) in the example you show below); Outer-loop
> vectorization would be better cause the reduction would be done more
> accurately and more efficiently than with inner-loop vectorization, and it
> could be done for any inner-loop stride (not only 1).
> [off topic comment: At least a couple tweaks that are currently required
> to allow outer-loop vectorization for examples like in the above mentioned
> 1) somehow don't let gcc create guard code around the innermost loop to
> check that it executes more than zero iterations. This creates a
> complicated control flow structure within the outer-loop. For now you have
> to have constant number of iterations for the inner-loop because of that,
> or insert a statement like "if (LB<=0) return;" before the loop...
> 2) use -fno-tree-sink cause otherwise it moves the loop iv increment to the
> latch block and the vectorizer likes to have the latch block empty...
> I really hope we could do something about these... maybe use a COND
> expression for the LB to solve the first problem, and clean up non-empty
> latch blocks as part of vectorization preprocessing for the second?
> I'm sorry for digressing... I know this is a bit off topic, but besides
> this patch adding a very good feature to have as is, it's also a beginning
> of a direction that can really help outer-loop vectorization in really
> common cases - so this is very good news!
> > subroutine to_product_of(self,a,b,a1,a2)
> > complex(kind=8) :: self (:)
> > complex(kind=8), intent(in) :: a(:,:)
> > complex(kind=8), intent(in) :: b(:)
> > integer a1,a2
> > do i = 1,a1
> > do j = 1,a2
> > self(i) = self(i) + a(j,i)*b(j)
> > end do
> > end do
> > end subroutine
> > we can only apply vectorization if the strides of the fastest dimension
> > of self, a and b are one (they are loaded from the passed array
> > descriptors and thus appear as (loop invariant) variables).
> > During the implementation of this I noticed that peeling for
> > number of iterations (we have to unroll the above loop twice, and so
> > for an odd number of iterations have a epilogue loop for the remaining
> > iteration(s)) does not play well with versioning and we end up
> > vectorizing the wrong loop. So I just disabled versioning if we
> > apply peeling with an epilogue loop
> yes, this is an old limitation since the introduction of loop versioning to
> the vectorizer - it's either peeling or versioning... (not for any good
> > and instead attach the versioning
> > condition to the pre-condition of the main loop that skips directly
> > to the epilogue if the number of iterations is too small. We obviously
> > can use the epilogue loop as the non-vectorized version.
> maybe we should consider doing that also when versioning for
Yes, the patch does this.
> > This patch also inserts an extra copyprop and dce pass before the
> > vectorizer so it can recognize the reduction in the above testcase
> > (LIM has made that reduction non-obvious).
> cool. This was also repeatedly reported problem... I wonder if it doesn't
> solve a few missed-optimization PRs on the way...
> > So I noticed that
> > copyprop does not preserve loop-closed SSA form and fixed that as well.
> > Some earlier version bootstrapped and tested ok on
> > x86_64-unknown-linux-gnu, a final attempt is still running.
> > I didn't yet performance test this extensively, but it might need
> > cost-model adjustments and/or need to wait until we have profile
> > feedback to properly seed vectorizer analysis here. A micro-benchmark
> > based on the above loop shows around 15% improvement on AMD K10.
> > Feedback (and ppc testing) is still welcome of course.
> I didn't review the code very closely, but one tiny thing that would be
> nice to add is a printout to the dump file reporting that versioining for
> stride is done.
> > + /* ??? Delay this change until after versioning or
> > + preserve the original step somewhere. */
Ah, this comment is a left-over. The change here is definitely needed.
> When we take this forward to support outer-loop vectorization with unknown
> inner-loop strides, we'd indeed want to
> delay this change.
I guess outer-loop vectorization should simply deal with non-constant
DR_STEP then, even when the versioning for non-constant strides is not
done? So, the data-ref change should already support this?
I have SPEC2006 tested the patch and it doesn't show any improvements
More information about the Gcc-patches