[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon Feb 5 09:58:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855
--- Comment #55 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think the original report is about x87 math vs. SSE math. It's a bit hard to
benchmark this through the releases given changes in tuning and vector feature
sets (-march=native is out of the question).
So I use -O3 -ffast-math -DREPS=100000 -m32 as base and see (numbers are
the reported MFLOPS):
ISA                     4.3.6   4.6.4   4.8.5   7.2
-mno-sse                 1855    6930    4618   5623
-msse2 -mfpmath=sse      1967    6945    4744   6472
-m64                     2977    6917    4935   6205
Note I edited the benchmark and put noinline,noclone attributes on the
gemm_atlas function. I benchmarked on a Broadwell system with minimal
CPU frequency boosting, but varying REPS still changes the reported MFLOPS
_a lot_ (individual runs are somewhat stable, though: for the last reported
number, 6205, I can also get 6331 or 6186). I used the attached
benchmark; the cited URL doesn't work anymore.
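For illustration, here is a minimal sketch of the kind of harness described above. It is not the attached benchmark: gemm_kernel, measure_mflops, the row-major square-matrix layout, the default REPS value and the 2*n^3 flop count are all assumptions made for this example. Only the noinline,noclone attributes and the -DREPS=... override come from the comment itself.

```c
#include <stddef.h>
#include <time.h>

#ifndef REPS
#define REPS 1000   /* assumed default; overridden with -DREPS=... */
#endif

/* Hypothetical stand-in for the benchmark kernel: C += A * B for square
   n x n row-major matrices.  noinline and noclone keep GCC from inlining
   the kernel into the timing loop or specializing a clone of it for the
   constant arguments at the call site. */
__attribute__((noinline, noclone))
void gemm_kernel(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* Time REPS kernel calls and convert to MFLOPS, assuming 2*n^3 floating
   point operations per call.  Returns 0 if the run was too short for
   clock() to resolve. */
double measure_mflops(size_t n, const double *a, const double *b, double *c)
{
    clock_t start = clock();
    for (long r = 0; r < REPS; r++)
        gemm_kernel(n, a, b, c);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    double flops = 2.0 * (double)n * (double)n * (double)n * (double)REPS;
    return secs > 0.0 ? flops / secs / 1e6 : 0.0;
}
```

Because the result divides a fixed amount of work by a measured time, short runs (small REPS) are dominated by timer resolution and warm-up, which is one plausible reason the reported figure moves with REPS.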
So there's still an apparent regression. The trunk numbers aren't very
different from the 7.2 results, the 4.6.4 variant is still fastest, and we
had already recovered to current levels with 4.9.4 (just checked -m64
across all releases).
With -march=native I get to new heights, obviously, because we use things like
FMA, AVX, etc. If I add just -mavx to 4.6.4 it's not faster than without,
but 7.2 improves to 6628, for example (4.6.4 doesn't know AVX2 and -mfma
results in bogus assembler being generated...).
If I look at the generated code for -m64 (with just SSE) we no longer
spill a lot in the inner loop (only once) and we don't vectorize. 4.6.4
manages to avoid any spilling in the computation (even in the outer loop).
So the original analysis (RA sucks) still holds.
Note the original report used -O and Aldy used -O2, but we are talking
about a benchmark, and when you use -ffast-math you also use -O3.
Note the biggest regression we still see is with x87 math - I think we
can reasonably disregard that now.
The benchmark is somewhat badly written (manually "optimized") so our
vectorization attempts fail.
Overall, I'm unsure whether it's worth pursuing this bug further.
There is a register pressure issue left, but the testcase is maybe not
real-world enough? That is, I would usually recommend first un-obfuscating
the manually optimized code.
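As a rough illustration of what "un-obfuscating" can mean, the sketch below contrasts a hand-unrolled reduction with the plain loop it was derived from. Both functions are hypothetical and are not taken from the benchmark; they only show the general pattern. The plain form leaves unrolling and vectorization decisions to the compiler. Vectorizing a floating-point reduction like this needs reassociation to be allowed (for example via -ffast-math), and whether a given GCC release actually vectorizes either form is not something this sketch demonstrates.

```c
#include <stddef.h>

/* Hypothetical "manually optimized" style: four explicit accumulators,
   unrolled by hand, plus a scalar tail loop for leftover elements. */
double dot_unrolled(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}

/* Un-obfuscated form: one accumulator, one loop.  restrict tells the
   compiler that x and y do not alias. */
double dot_plain(const double *restrict x, const double *restrict y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}
```

The two functions sum in a different order, so on general inputs their results can differ in the last bits; that is the same reassociation freedom -ffast-math grants the compiler.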