[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

rguenth at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Oct 9 13:01:00 GMT 2018


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
                 CC|                            |matz at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so re-running perf gives me a more reasonable result (-march=native on

Overhead       Samples  Command          Shared Object                   Symbol
  15.59%        754868  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.]
  15.55%        749452  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.]
  10.77%        496796  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.]
   7.58%        377894  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.]
   7.57%        375587  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.]
   7.01%        328685  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.]
   4.98%        243101  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.]
   4.03%        197815  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.]

with the already noticed loop, where there are apparently not enough
iterations to warrant vectorization and the cost model check gets in the way.

xyzint_ looks similar.

Note that

            DO 30 MK=1,NOC
            DO 30 ML=1,MK
               MKL = MKL+1
               XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *               VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
               XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *               VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30       CONTINUE

shows the inner loop will first iterate once, then twice, then ...  That
makes hoisting the cost model check impossible, and it also makes the
alias check not invariant in the outer loop.  That would mean that if we
code-generated the iteration cost-model check, loop splitting might get
the idea of splitting the outer loop ... (but loop splitting runs before
vectorization, of course).
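
To make the trip-count pattern concrete, here is a minimal C sketch of the
triangular nest above. It only models the iteration counts, not the array
updates. The function names and the min_profitable_iters threshold are
hypothetical and are not taken from GCC.

```c
#include <stddef.h>

/* Trip count of the inner ML loop for a given outer MK (1-based, as in
   the Fortran source): ML runs from 1 to MK, so it executes MK times.  */
size_t inner_trip_count(size_t mk)
{
    return mk;
}

/* Count how many entries into the inner loop would fail a runtime check
   of the form "niters >= min_profitable_iters".  The threshold is a
   hypothetical parameter for illustration, not a value GCC computes.  */
size_t entries_below_threshold(size_t noc, size_t min_profitable_iters)
{
    size_t below = 0;
    for (size_t mk = 1; mk <= noc; mk++)
        if (inner_trip_count(mk) < min_profitable_iters)
            below++;
    return below;
}

/* Total inner-loop iterations over the whole nest: 1 + 2 + ... + noc.  */
size_t total_inner_iterations(size_t noc)
{
    size_t total = 0;
    for (size_t mk = 1; mk <= noc; mk++)
        total += inner_trip_count(mk);
    return total;
}
```

With noc = 10 and an assumed threshold of 4, the first three entries fall
below the threshold and the rest do not, so the outcome of the check changes
from one outer iteration to the next and cannot be decided once outside the
nest.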

So in this very case if we analyze the scalar evolution of the niter
of the loop we want to vectorize we get back {0, +, 1}_5 -- that's
certainly something we could factor in when computing the vectorization
cost.  It would increase the prologue/epilogue cost but it wouldn't
make vectorization never profitable (we know nothing about the upper bound
of the number of iterations).
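
As a rough illustration of factoring that evolution into the cost, the C
sketch below evaluates an affine evolution {base, +, step} per outer
iteration and splits the inner-loop work into vector-body and
scalar-epilogue iterations. It assumes niter counts latch executions (so
the trip count is niter + 1, matching "iterate once" for niter 0), a
hypothetical vectorization factor vf, and no prologue peeling. It is not
GCC's actual cost model.

```c
/* Value of the affine evolution {base, +, step} on outer iteration i,
   counting i from 0.  */
long chrec_eval(long base, long step, long i)
{
    return base + step * i;
}

/* Iterations executed by the vectorized body, summed over outer_iters
   outer iterations.  Assumes trip count = niter + 1 and that each vector
   iteration covers vf scalar iterations (vf is hypothetical).  */
long vector_body_iters(long base, long step, long outer_iters, long vf)
{
    long total = 0;
    for (long i = 0; i < outer_iters; i++) {
        long trip = chrec_eval(base, step, i) + 1;
        total += trip / vf;
    }
    return total;
}

/* Leftover scalar iterations handled by the epilogue, summed the same
   way.  */
long scalar_epilogue_iters(long base, long step, long outer_iters, long vf)
{
    long total = 0;
    for (long i = 0; i < outer_iters; i++) {
        long trip = chrec_eval(base, step, i) + 1;
        total += trip % vf;
    }
    return total;
}
```

For {0, +, 1} with 8 outer iterations and vf = 4, this gives 6 vector
iterations (24 scalar iterations' worth of work) and 12 epilogue iterations
out of 36 in total. The epilogue share is large for short trip counts but
shrinks as the outer loop progresses, which is why such an evolution would
raise the prologue/epilogue cost without ruling out vectorization.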
