[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Oct 9 13:01:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |matz at gcc dot gnu.org
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so re-running perf gives me a more reasonable result (-march=native on
Haswell):
Overhead  Samples  Command          Shared Object                   Symbol
  15.59%   754868  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] forms_
  15.55%   749452  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] forms_
  10.77%   496796  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] twotff_
   7.58%   377894  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] dirfck_
   7.57%   375587  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] dirfck_
   7.01%   328685  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] twotff_
   4.98%   243101  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] xyzint_
   4.03%   197815  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] xyzint_
with the already noticed loop where there are apparently not enough
iterations to warrant the vectorization, and the cost model check gets in
the way. xyzint_ looks similar.
Note that
      DO 30 MK=1,NOC
      DO 30 ML=1,MK
        MKL = MKL+1
        XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *       VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
        XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *       VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30 CONTINUE
shows that the inner loop will first iterate once, then twice, and so on.
That makes hoisting the cost model check impossible, and it also means the
alias check is not invariant in the outer loop. That would mean that if we
code-generated the iteration cost-model check, loop splitting might get
the idea of splitting the outer loop ... (but loop splitting runs before
vectorization of course).
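
To make the triangular shape concrete, here is a minimal C sketch that only
models the trip counts of the nest above. The helper names and the
min_profitable threshold are hypothetical and are not taken from GCC; the
point is merely that the inner trip count changes on every outer iteration,
so a runtime "enough iterations?" check has to be re-evaluated each time the
inner loop is entered.

```c
/* Inner ML loop runs ML = 1..MK, so its trip count equals MK. */
static long inner_trip_count(long mk)
{
    return mk;
}

/* Total number of inner iterations for MK = 1..noc (NOC*(NOC+1)/2). */
long total_inner_iterations(long noc)
{
    long total = 0;
    for (long mk = 1; mk <= noc; mk++)
        total += inner_trip_count(mk);
    return total;
}

/* Number of inner-loop entries whose trip count is below a hypothetical
   profitability threshold, i.e. entries that would pay for the runtime
   check and then fall back to the scalar loop. */
long entries_below_threshold(long noc, long min_profitable)
{
    long below = 0;
    for (long mk = 1; mk <= noc; mk++)
        if (inner_trip_count(mk) < min_profitable)
            below++;
    return below;
}
```

With noc = 10 and a threshold of 4, for example, the first three entries of
the inner loop fall below the threshold and the remaining seven do not.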
So in this very case if we analyze the scalar evolution of the niter
of the loop we want to vectorize we get back {0, +, 1}_5 -- that's
certainly something we could factor in when computing the vectorization
cost. It would increase the prologue/epilogue cost but it wouldn't
make vectorization never profitable (we know nothing about the upper bound
of the number of iterations).
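
As a rough illustration of why such an evolution raises the overhead without
making vectorization unprofitable in general, the following C sketch sums a
toy cost over the outer iterations. It assumes niter counts latch
executions, so the trip count is niter + 1. The struct, the function names
and all cost numbers are made up for illustration; this is not GCC's cost
model.

```c
/* All fields are hypothetical cost units, not values used by GCC. */
struct costs {
    long vf;             /* vectorization factor */
    long scalar_iter;    /* cost of one scalar iteration */
    long vector_iter;    /* cost of one vector iteration (vf scalar ones) */
    long entry_overhead; /* per-entry runtime checks, prologue/epilogue setup */
};

/* Cost of the whole nest when the inner loop stays scalar. */
long scalar_nest_cost(long outer, const struct costs *c)
{
    long total = 0;
    for (long k = 0; k < outer; k++) {
        long trips = k + 1; /* niter evolves as {0, +, 1} */
        total += trips * c->scalar_iter;
    }
    return total;
}

/* Cost of the whole nest when the inner loop is vectorized and the
   leftover iterations run in a scalar epilogue. */
long vector_nest_cost(long outer, const struct costs *c)
{
    long total = 0;
    for (long k = 0; k < outer; k++) {
        long trips = k + 1;
        total += c->entry_overhead
               + (trips / c->vf) * c->vector_iter
               + (trips % c->vf) * c->scalar_iter;
    }
    return total;
}
```

With the made-up values vf = 4, scalar_iter = 4, vector_iter = 6 and
entry_overhead = 10, four outer iterations cost 40 scalar versus 70
vectorized, while 64 outer iterations cost 8320 scalar versus 4000
vectorized. The break-even point depends entirely on the unknown upper
bound of the outer loop.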