RFC: GCC Aarch64 SIMD vectorization question involving libmvec

Richard Sandiford richard.sandiford@arm.com
Fri Jun 28 08:56:00 GMT 2019


Steve Ellcey <sellcey@marvell.com> writes:
> I am testing the latest GCC with not-yet-submitted GLIBC changes that
> implement libmvec on AArch64.
>
> While trying to run SPEC 2017 (specifically 521.wrf_r) I ran into a
> case where GCC was generating a call to _ZGVnN2vv_powf, that is a
> vectorized powf call for 2 (not 4) elements.  This was a problem
> because I only implemented a 4 element 32 bit vectorized powf function
> for libmvec and not a 2 element version.
>
> I think this is due to aarch64_simd_clone_compute_vecsize_and_simdlen
> which allows for (element count * element size) to be either 64
> or 128.
>
> I would like some thoughts on what we should do about this.  Should
> we require glibc/libmvec to provide 2 element 32 bit floating point
> vector functions (as well as the 4 element ones), or should we change
> aarch64_simd_clone_compute_vecsize_and_simdlen to only allow 4
> element (128 total bit size) vectors and not 2 element (64 total bit
> size) ones?
>
> This is obviously a question for the pre-SVE vector instructions,
> I am not sure how this would be handled in SVE.

The vector ABI says that "#pragma omp declare simd" without a simdlen
declares both 64-bit and 128-bit functions, so I think the GCC code is
doing the right thing.  If glibc only implements 128-bit functions
for powf then its declaration of powf should specify simdlen(4).
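
As a rough illustration of what restricting a declaration to simdlen(4)
looks like, here is a minimal sketch.  The names scale_add and
apply_scale_add are hypothetical stand-ins, not glibc functions, and the
body is deliberately trivial so that the example does not depend on libm.
With -fopenmp-simd the compiler may generate and call only a 4-lane
clone of scale_add.  By analogy with the _ZGVnN2vv_powf name quoted
above, the 4-lane AArch64 variant of powf would be _ZGVnN4vv_powf.
Without -fopenmp-simd the pragmas are ignored and the code runs as
plain scalar C.

```c
#include <assert.h>

/* Hypothetical element-wise routine standing in for a libmvec function.
   simdlen(4) asks for a 4-lane vector variant only, so no 2-lane
   (64-bit) variant is advertised to callers.  */
#pragma omp declare simd simdlen(4) notinbranch
float scale_add(float x, float y)
{
    return 2.0f * x + y;
}

/* Hypothetical caller: the loop may be vectorised using the 4-lane
   variant; any leftover iterations use the scalar function.  */
void apply_scale_add(const float *w, const float *b, float *p, int n)
{
    assert(n >= 0);
#pragma omp simd
    for (int i = 0; i < n; i++)
        p[i] = scale_add(w[i], b[i]);
}
```

The results are the same whether or not the loop is vectorised; only
the set of vector variants that the compiler is allowed to call changes.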

It would be nice to support simdlen(2) as well though.  Low-trip-count
loops like the one below would be one use case.  Another would be SLP.
And hopefully at some point in the future we'll be able to turn
vect-epilogues-nomask on by default, in which case we would also have
64-bit vectorisation in the tail of a loop vectorised at 128 bits.
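
To make the epilogue case concrete, here is a simplified model of how a
trip count of 32-bit floats could be covered by 128-bit calls, then a
64-bit call, then a scalar call.  It is only an arithmetic sketch under
the assumption that the compiler greedily uses the widest variant first;
it is not GCC's actual cost model, and the names lane_split and
split_trip_count are hypothetical.

```c
#include <assert.h>

/* Hypothetical summary of how many calls of each width cover n floats.  */
struct lane_split {
    int calls_128;     /* 4 x 32-bit float per call */
    int calls_64;      /* 2 x 32-bit float per call */
    int calls_scalar;  /* 1 float per call */
};

/* Greedy split: widest vectors first, then a 64-bit epilogue,
   then a scalar tail.  */
struct lane_split split_trip_count(int n)
{
    struct lane_split s;

    assert(n >= 0);
    s.calls_128 = n / 4;
    n %= 4;
    s.calls_64 = n / 2;
    s.calls_scalar = n % 2;
    return s;
}
```

For the three-element loop in the quoted test case this model gives no
128-bit call, one 64-bit call and one scalar call, which matches what
Steve observed.  For a trip count of 7 it gives one call of each kind.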

Thanks,
Richard

>
> Steve Ellcey
> sellcey@marvell.com
>
> P.S.  Here is a test case in Fortran that generated the 2 element
>       vector call.  GCC unrolled the loop into one vector call
>       of 2 elements and one scalar call.
>
>       SUBROUTINE FOO(B,W,P)
>       REAL, DIMENSION (3) :: W, P
>       DO 10 I = 1, 3
>       P(I) = W(I) ** B
> 10    CONTINUE
>       END SUBROUTINE FOO


