RFC: GCC Aarch64 SIMD vectorization question involving libmvec
Fri Jun 28 08:56:00 GMT 2019
Steve Ellcey <firstname.lastname@example.org> writes:
> I am testing the latest GCC with not-yet-submitted glibc changes that
> implement libmvec on AArch64.
> While trying to run SPEC 2017 (specifically 521.wrf_r) I ran into a
> case where GCC was generating a call to _ZGVnN2vv_powf, that is a
> vectorized powf call for 2 (not 4) elements. This was a problem
> because I only implemented a 4 element 32 bit vectorized powf function
> for libmvec and not a 2 element version.
> I think this is due to aarch64_simd_clone_compute_vecsize_and_simdlen
> which allows for (element count * element size) to be either 64
> or 128.
> I would like some thoughts on what we should do about this. Should
> we require glibc/libmvec to provide 2 element 32 bit floating point
> vector functions (as well as the 4 element ones), or should we change
> aarch64_simd_clone_compute_vecsize_and_simdlen to only allow 4
> element (128 total bit size) vectors and not 2 element (64 total bit
> size) ones?
> This is obviously a question for the pre-SVE vector instructions;
> I am not sure how this would be handled in SVE.
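For reference, the variant name in the report can be read piece by
piece: after the "_ZGV" prefix, "n" selects Advanced SIMD, "N" means
unmasked, "2" is the lane count, and "vv" marks two vector parameters.
The C sketch below only illustrates that arithmetic and naming; the
helper names simd_lanes and simd_variant_name are hypothetical and are
not GCC or glibc functions.

```c
#include <stdio.h>

/* Number of lanes in a vector of vector_bits bits whose elements are
   elem_bits bits wide (hypothetical helper, for illustration only). */
static unsigned simd_lanes(unsigned vector_bits, unsigned elem_bits)
{
    return vector_bits / elem_bits;
}

/* Format the name of an unmasked Advanced SIMD vector variant:
   "_ZGV" + 'n' (Advanced SIMD) + 'N' (unmasked) + lane count
   + parameter letters + "_" + scalar function name.
   Returns whatever snprintf returns. */
static int simd_variant_name(char *buf, size_t size, unsigned lanes,
                             const char *params, const char *scalar_name)
{
    return snprintf(buf, size, "_ZGVnN%u%s_%s", lanes, params, scalar_name);
}
```

With 32-bit floats, a 64-bit vector gives 2 lanes and a 128-bit vector
gives 4, which is why both a 2-lane and a 4-lane powf variant can be
requested.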
The vector ABI says that "#pragma omp declare simd" without a simdlen
declares both 64-bit and 128-bit functions, so I think the GCC code is
doing the right thing. If glibc only implements 128-bit functions
for powf then its declaration should use simdlen(4).
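As a rough sketch of what such a declaration looks like in C, the
fragment below restricts a function to a 4-lane variant. The function
scale_sq is a hypothetical stand-in, not a glibc routine, and the
pragma only takes effect when OpenMP SIMD support is enabled (for
example with -fopenmp-simd); otherwise it is ignored and the code is
plain scalar C.

```c
/* Hypothetical scalar routine standing in for a libm function.
   simdlen(4) advertises only a 4-lane (128-bit for float) variant;
   notinbranch says no masked variant is needed. */
#pragma omp declare simd simdlen(4) notinbranch
float scale_sq(float x, float b);

float scale_sq(float x, float b)
{
    return x * x * b;
}

/* A loop whose calls the vectoriser could map onto the 4-lane variant. */
void apply(float *restrict out, const float *restrict in, float b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = scale_sq(in[i], b);
}
```

Dropping the simdlen(4) clause would, per the ABI statement above,
advertise the 64-bit (2-lane) variant as well.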
It would be nice to support simdlen(2) as well though. Low-trip-count
loops like the one below would be one use case. Another would be SLP.
And hopefully at some point in the future we'll be able to turn
vect-epilogues-nomask on by default, in which case we would also have
64-bit vectorisation in the tail of a loop vectorised at 128 bits.
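To make the trip-count argument concrete, here is a deliberately
simplified C model of how iterations of a float loop could be split
between 4-lane calls, 2-lane calls and scalar calls. It is only an
illustration: plan_calls and struct call_plan are hypothetical names,
and the model ignores GCC's cost model entirely.

```c
/* How many calls of each kind cover trip_count float iterations. */
struct call_plan {
    unsigned wide;    /* 4-lane (128-bit) vector calls */
    unsigned narrow;  /* 2-lane (64-bit) vector calls  */
    unsigned scalar;  /* remaining scalar calls        */
};

/* Greedy split: use 4-lane calls first, then 2-lane calls for the
   tail, then scalar calls. Either vector width can be disabled. */
static struct call_plan plan_calls(unsigned trip_count,
                                   int allow_wide, int allow_narrow)
{
    struct call_plan p = {0, 0, 0};
    unsigned left = trip_count;

    if (allow_wide) {
        p.wide = left / 4;
        left %= 4;
    }
    if (allow_narrow) {
        p.narrow = left / 2;
        left %= 2;
    }
    p.scalar = left;
    return p;
}
```

Under this model a trip count of 3 gives one 2-lane call plus one
scalar call, and without a 2-lane variant all three iterations fall
back to scalar calls.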
> Steve Ellcey
> P.S. Here is a test case in Fortran that generated the 2 element
> vector call. GCC unrolled the loop into one vector call
> of 2 elements and one scalar call.
>       SUBROUTINE FOO(B,W,P)
>       REAL, DIMENSION (3) :: W, P
>       DO 10 I = 1, 3
>         P(I) = W(I) ** B
>  10   CONTINUE
>       END SUBROUTINE FOO