[PATCH][RFC] Add versioning for constant strides for vectorization

Sun Jan 25 14:08:00 GMT 2009

gcc-patches-owner@gcc.gnu.org wrote on 23/01/2009 18:08:43:

>
> This patch adds the capability to the vectorizer to perform versioning
> for the case of a constant (suitable) stride.  For example for
>
> subroutine to_product_of(self,a,b,a1,a2)
>   complex(kind=8) :: self (:)
>   complex(kind=8), intent(in) :: a(:,:)
>   complex(kind=8), intent(in) :: b(:)
>   integer a1,a2
>   do i = 1,a1
>     do j = 1,a2
>       self(i) = self(i) + a(j,i)*b(j)
>     end do
>   end do
> end subroutine
>
> we can only apply vectorization if the strides of the fastest-varying
> dimension of self, a and b are one (they are loaded from the passed
> array descriptors and thus appear as loop-invariant variables rather
> than constants).
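For readers less familiar with Fortran array descriptors, here is a minimal C sketch of what versioning on the strides amounts to. It assumes each array is passed as a base pointer plus runtime element strides. The function and parameter names (s_self, s_a1, s_a2, s_b) are hypothetical and come from neither the patch nor libgfortran's actual descriptor layout. The unit-stride copy is the one the vectorizer can handle, and the general copy is the fallback.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical C model of to_product_of: each array comes with runtime
   element strides, as a Fortran array descriptor would supply them.
   a(j,i) is column-major: element (j,i) lives at a[j*s_a1 + i*s_a2].
   Indices are 0-based.  An illustration only, not GCC output.  */
void to_product_of(double complex *self, ptrdiff_t s_self,
                   const double complex *a, ptrdiff_t s_a1, ptrdiff_t s_a2,
                   const double complex *b, ptrdiff_t s_b,
                   int a1, int a2)
{
  if (s_self == 1 && s_a1 == 1 && s_b == 1)
    {
      /* Versioned copy: the fastest-varying strides are known to be one,
         so the inner loop walks contiguous memory and is a candidate
         for vectorization.  */
      for (int i = 0; i < a1; i++)
        for (int j = 0; j < a2; j++)
          self[i] += a[j + i * s_a2] * b[j];
    }
  else
    {
      /* Fallback copy: the original loop with arbitrary strides.  */
      for (int i = 0; i < a1; i++)
        for (int j = 0; j < a2; j++)
          self[i * s_self] += a[j * s_a1 + i * s_a2] * b[j * s_b];
    }
}
```

Both copies compute the same result. Only the first can be vectorized, which is why the runtime check is worth its cost when strides are usually one.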
>
> While implementing this I noticed that peeling for the number of
> iterations (we have to unroll the above loop twice, so for an odd
> number of iterations we need an epilogue loop for the remaining
> iteration(s)) does not play well with versioning, and we end up
> vectorizing the wrong loop.  So I simply disabled versioning when we
> apply peeling with an epilogue loop, and instead attach the versioning
> condition to the pre-condition of the main loop, which skips directly
> to the epilogue if the number of iterations is too small.  We can
> obviously use the epilogue loop as the non-vectorized version.
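A minimal sketch of the resulting control flow follows. It assumes a vectorization factor of 2 (matching the "unroll twice" above) and models the two vector lanes with scalar partial sums. The function name, the VF macro and the dot-product shape are illustrative assumptions, not the code GCC emits. The stride check is folded into the guard that already tests for enough iterations, so when it fails the scalar epilogue runs every iteration and serves as the non-vectorized version.

```c
#include <stdbool.h>

#define VF 2 /* hypothetical vectorization factor: "unroll twice" */

/* Strided dot product sum(a(j)*b(j)), showing where the versioning
   condition goes when the vector loop needs a scalar epilogue.  */
double dot_strided(const double *a, long sa, const double *b, long sb,
                   long n)
{
  double sum = 0.0;
  long i = 0;
  bool unit_strides = (sa == 1 && sb == 1);

  /* Pre-condition of the vector loop: enough iterations AND the
     versioning check.  No separate versioned loop copy is created.  */
  if (n >= VF && unit_strides)
    {
      /* Stand-in for the vector loop: two lanes, VF elements per trip.
         Splitting the sum into lanes reassociates the additions, which
         is only acceptable under -ffast-math style semantics.  */
      double lane0 = 0.0, lane1 = 0.0;
      for (; i + VF <= n; i += VF)
        {
          lane0 += a[i] * b[i];
          lane1 += a[i + 1] * b[i + 1];
        }
      sum = lane0 + lane1;
    }

  /* Scalar epilogue: the remaining n % VF iterations after the vector
     loop, or all n iterations (with the real strides) when the guard
     failed.  */
  for (; i < n; i++)
    sum += a[i * sa] * b[i * sb];

  return sum;
}
```

The design choice is that no extra loop copy is needed. The epilogue already exists for the odd-trip-count case, and it is correct for any stride.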
>
> This patch also inserts extra copyprop and dce passes before the
> vectorizer so that it can recognize the reduction in the above testcase
> (LIM has made that reduction non-obvious).  While doing so I noticed
> that copyprop does not preserve loop-closed SSA form, and fixed that
> as well.
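A hypothetical C rendering of roughly what the inner loop amounts to once LIM has moved the load and store of self(i) out of it. This is an illustration, not a GCC dump. The running value lives in a scalar, so the loop is a plain sum reduction. Presumably it is the copies and dead statements left behind by LIM that hide this shape, and the added copyprop and dce passes are there to clean them up. The use of acc after the loop is the kind of value that loop-closed SSA routes through a PHI node at the loop exit, which copyprop has to respect.

```c
#include <complex.h>

/* Hypothetical sketch: inner j-loop of the testcase after load/store
   motion of self(i).  a_col points at column i of a (unit stride);
   names are illustrative only.  */
void inner_after_lim(double complex *self_i, const double complex *a_col,
                     const double complex *b, int a2)
{
  double complex acc = *self_i;   /* load of self(i) hoisted before the loop */
  for (int j = 0; j < a2; j++)
    acc += a_col[j] * b[j];       /* reduction: acc = acc + a(j,i)*b(j) */
  *self_i = acc;                  /* store of self(i) sunk after the loop */
}
```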
>
> An earlier version bootstrapped and tested OK on
> x86_64-unknown-linux-gnu; a final run is still in progress.
>
> I haven't performance-tested this extensively yet; it might need
> cost-model adjustments and/or have to wait until we have profile
> feedback to properly seed the vectorizer analysis here.  A
> micro-benchmark based on the above loop shows around 15% improvement
> on AMD K10.
>
> Feedback (and ppc testing) is still welcome of course.

Regtested on powerpc64-suse-linux.

>
> 2009-01-23  Richard Guenther  <rguenther@suse.de>
>
>
>    * gcc.dg/vect/fast-math-vect-complex-5.c: New testcase.
>    * gfortran.dg/vect/fast-math-vect-complex-1.f90: Likewise.
>    * gfortran.dg/vect/fast-math-vect-stride-1.f90: Likewise.

The Fortran testcases require vect_double as well.
The testcases get vectorized on powerpc64-suse-linux if the type is
_Complex float/complex(kind=4).

Thanks,
Ira