[Bug target/27855] [6/7/8 regression] reassociation causes the RA to be confused

Mon Feb 5 09:58:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855

--- Comment #55 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think the original report is about x87 math vs. SSE math.  It's a bit hard to
benchmark this through the releases given changes in tuning and vector feature
sets (-march=native is out of the question).

So I use -O3 -ffast-math -DREPS=100000 -m32 as base and see (reported MFLOPS)

ISA                   4.3.6  4.6.4  4.8.5    7.2
-mno-sse              1855   6930   4618   5623
-msse2 -mfpmath=sse   1967   6945   4744   6472
-m64                  2977   6917   4935   6205

Note that I edited the benchmark and put noinline,noclone attributes on the
gemm_atlas function.  I benchmarked on a Broadwell system with minimal
CPU frequency boosting, but varying REPS still changes the reported MFLOPS
_a lot_ (individual runs are somewhat stable, though: for the last reported
number, 6205, I can also get 6331 or 6186).  I used the attached
benchmark; the cited URL doesn't work anymore.
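
For readers without the attachment, here is a minimal sketch of what such
an annotation can look like.  The kernel name gemm_sketch, its signature and
the mflops helper are hypothetical; the real gemm_atlas in the attached
benchmark may look quite different, and the 2*n^3 flop count is only the
conventional one for a square matrix multiply, not necessarily what the
benchmark reports.

```c
#include <assert.h>
#include <stddef.h>

/* noclone is a GCC attribute; other compilers get an empty macro. */
#if defined(__GNUC__) && !defined(__clang__)
#define KEEP_OUT_OF_LINE __attribute__((noinline, noclone))
#else
#define KEEP_OUT_OF_LINE
#endif

/* Hypothetical stand-in for the benchmarked kernel: c += a * b for
   square n x n row-major matrices.  The attributes keep the compiler
   from inlining the kernel into the timing loop or specializing it for
   constant arguments, so the timed code is the function itself. */
KEEP_OUT_OF_LINE
void gemm_sketch(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] += sum;
        }
}

/* Assumed MFLOPS formula: 2*n^3 floating-point operations per call,
   times the number of repetitions, divided by elapsed seconds. */
double mflops(size_t n, long reps, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)n * (double)n * (double)n * (double)reps
           / seconds / 1e6;
}
```

A timing loop would call gemm_sketch REPS times between two clock readings
and pass the elapsed time to mflops.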

So there's still an apparent regression: the trunk numbers aren't very
different from the 7.2 results, the 4.6.4 variant is still the fastest, and
we had already recovered to current levels with 4.9.4 (just checked -m64
across all releases).

With -march=native I get to new heights, obviously because we use things like
FMA, AVX, etc.  If I add just -mavx, 4.6.4 is not faster than without it,
but 7.2 improves to 6628, for example (4.6.4 doesn't know AVX2 and -mfma
results in bogus assembler being generated...).

If I look at the generated code for -m64 (with just SSE) we no longer
spill a lot in the inner loop (only once) and we don't vectorize.  4.6.4
manages to avoid any spilling in the computation (even in the outer loop).
So the original analysis (RA sucks) still holds.

Note the original report used -O and Aldy used -O2, but we are talking
about a benchmark, and when you use -ffast-math you also use -O3.

Note the biggest regression we still see is with x87 math - I think we
can reasonably disregard that now.

The benchmark is somewhat badly written (manually "optimized") so our
vectorization attempts fail.

Overall, I'm unsure whether it's worth pursuing this bug further.
There is a register pressure issue left, but the testcase is maybe not
real-world enough.  That is, I would usually recommend first un-obfuscating
the manually optimized code.
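
To illustrate what un-obfuscating can mean, here is a small sketch that is
not taken from the benchmark.  Both function names are hypothetical, and a
dot product stands in for the real kernel.  The first variant is unrolled by
hand with several accumulators, which keeps more values live at once; the
second is the plain loop, which leaves unrolling and vectorization choices
to the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-"optimized" form: unrolled by four with separate accumulators.
   Requires n to be a multiple of four. */
double dot_unrolled(const double *x, const double *y, size_t n)
{
    assert(n % 4 == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

/* Un-obfuscated form: one accumulator, one simple loop. */
double dot_plain(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

The two variants add the products in a different order, so results can
differ in the last bits for general inputs; -ffast-math permits the compiler
to make that kind of reordering itself.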