[Bug tree-optimization/88713] Vectorized code slow vs. flang

Tue Jan 22 05:40:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #20 from Chris Elrod <elrodc at gmail dot com> ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
Julia. Without adding a Newton step, the answers are wrong beyond the first
couple of significant digits.
With the Newton step, the answers are correct.
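
For reference, the Newton step in question is one Newton-Raphson iteration on
f(y) = 1/y^2 - x, which works out to y_next = y * (1.5 - 0.5 * x * y * y).
A minimal scalar C++ sketch, under stated assumptions: rsqrt_estimate is a
hypothetical stand-in that emulates a roughly 14-bit estimate by truncating
the mantissa of the exact result. It is not the vrsqrt14ps instruction itself,
and all function names here are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a hardware estimate such as vrsqrt14ps:
// compute 1/sqrt(x) and clear the low 10 of the 23 stored mantissa bits,
// leaving roughly 14 significant bits.
inline float rsqrt_estimate(float x) {
    assert(x > 0.0f);
    float y = 1.0f / std::sqrt(x);
    std::uint32_t bits;
    std::memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFFFC00u;  // drop the low 10 mantissa bits
    std::memcpy(&y, &bits, sizeof bits);
    return y;
}

// Estimate followed by one Newton-Raphson step for f(y) = 1/y^2 - x:
//   y_next = y * (1.5 - 0.5 * x * y * y)
inline float rsqrt_newton(float x) {
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}

// Relative error against a double-precision reference value.
inline double rel_err(float approx, float x) {
    double exact = 1.0 / std::sqrt(static_cast<double>(x));
    return std::fabs(approx - exact) / exact;
}
```

Because the step roughly squares the relative error, an estimate that is off
on the order of 1e-5 should come out accurate to about single precision after
one iteration, e.g. when comparing rel_err(rsqrt_estimate(2.0f), 2.0f) with
rel_err(rsqrt_newton(2.0f), 2.0f).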

My point is that the LLVM-based compilers (Clang/Flang/ispc) are definitely
adding the Newton step: they get the correct answer.

That leaves the masked "vrsqrt14ps" that gcc is using (g++ does this too) as
my best guess for the cause of the performance difference:

        vcmpps  $4, %zmm0, %zmm5, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea?
Could I edit the asm to remove the vcmpps and the mask, compile the asm with
gcc, and benchmark it?

Okay, I just tried playing around with flags and looking at asm.
I compiled with:

g++ -O3 -ffinite-math-only -fexcess-precision=fast -fno-math-errno
-fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math
-fno-rounding-math -fno-signaling-nans -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
libgppvectorization_test.so  vectorization_test.cpp

which is basically every flag implied by "-ffast-math", except
"-funsafe-math-optimizations". It does include the flags that
"-funsafe-math-optimizations" itself implies, just not that flag.

Of those math flags, only "-fno-math-errno" is needed, so the list can be
simplified to:

g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512
-fno-semantic-interposition -o libgppvectorization_test.so 
vectorization_test.cpp

or

gfortran -O3 -fno-math-errno -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
libgfortvectorization_test.so  vectorization_test.f90

This results in the following:

        vsqrtps (%r8,%rax), %zmm0
        vdivps  %zmm0, %zmm7, %zmm0

i.e., vsqrtps and a division, rather than the masked reciprocal square root.

With N = 2827, that speeds up gfortran and g++ from about 4.3 microseconds to
3.5 microseconds.
For comparison, Clang takes about 2 microseconds, and Flang, ispc, and some
awful-looking unsafe Rust take 2.3-2.4 microseconds, using vrsqrt14ps (without
a mask) and a Newton step instead of vsqrtps followed by a division.

So, "-funsafe-math-optimizations" results in a regression here.
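
To try the flag combinations above in isolation, a small kernel plus timing
loop along the following lines may help. This is only a sketch: the actual
vectorization_test.cpp is not shown in this comment, so the kernel inv_sqrt,
the helper time_inv_sqrt_us, and the input values are all hypothetical, and a
serious benchmark would need more care (warm-up, repeated samples, checking
the generated asm).

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical kernel: elementwise reciprocal square root. Whether a given
// compiler turns this into vsqrtps + vdivps or into an rsqrt estimate plus
// a Newton step depends on the flags; inspect the asm to find out.
void inv_sqrt(float* out, const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = 1.0f / std::sqrt(in[i]);
}

// Average wall-clock time of one kernel call over reps calls, in microseconds.
double time_inv_sqrt_us(std::size_t n, int reps) {
    assert(n > 0 && reps > 0);
    std::vector<float> in(n), out(n);
    for (std::size_t i = 0; i < n; ++i)
        in[i] = 1.0f + static_cast<float>(i);

    volatile float sink = 0.0f;  // discourages the compiler from dropping the work
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        inv_sqrt(out.data(), in.data(), n);
        sink = out[static_cast<std::size_t>(r) % n];
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / reps;
}
```

Calling time_inv_sqrt_us(2827, 100000) from a small driver, once per set of
flags, gives a number comparable in spirit to the per-call microsecond figures
quoted above, though not necessarily the same values.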