[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

rguenth at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Oct 9 11:35:00 GMT 2018


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so on haswell I see (- is bad, + is good):

-0x2342ca0 _40 + _45 1 times scalar_stmt costs 12 in body
+0x2342ca0 _40 + _45 1 times scalar_stmt costs 4 in body

so a simple add changes cost from 4 to 12 with the patch.  Ah, so that
goes through

      switch (subcode)
        {
        case PLUS_EXPR:
        case POINTER_PLUS_EXPR:
        case MINUS_EXPR:
          if (kind == scalar_stmt)
            {
              if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
                stmt_cost = ix86_cost->addss;
              else if (X87_FLOAT_MODE_P (mode))
                stmt_cost = ix86_cost->fadd;
              else
                stmt_cost = ix86_cost->add;
            }

where with kind == scalar_stmt we now run into the SSE_FLOAT_MODE_P case
(previously mode was something like V2DFmode) and thus use
ix86_cost->addss instead of ix86_cost->add.  That's more correct.
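
To make the effect of the mode change concrete, here is a minimal
standalone sketch of the same three-way dispatch.  It is not GCC code:
the enum, the struct and the function name are hypothetical stand-ins,
the add and addss values are simply the 4 and 12 from the dump above,
and the fadd value is a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the machine mode classes that matter here.
   MODE_VECTOR plays the role of something like V2DFmode.  */
enum mode_class { MODE_INT, MODE_SSE_FLOAT, MODE_X87_FLOAT, MODE_VECTOR };

/* Hypothetical stand-in for the relevant fields of the cost table.  */
struct costs { int add; int addss; int fadd; };

/* add and addss match the 4 and 12 seen in the dump; fadd is a
   placeholder value chosen for illustration only.  */
static const struct costs example_cost = { 4, 12, 12 };

/* Mirrors the PLUS_EXPR/MINUS_EXPR scalar_stmt dispatch quoted above.  */
static int
scalar_add_cost (enum mode_class mode, bool sse_math, const struct costs *c)
{
  assert (c != NULL);
  if (mode == MODE_SSE_FLOAT && sse_math)
    return c->addss;   /* scalar float mode: new behaviour */
  else if (mode == MODE_X87_FLOAT)
    return c->fadd;
  else
    return c->add;     /* a vector mode falls through to here */
}
```

With these assumed values, passing the vector mode yields 4 and passing
the scalar SSE float mode yields 12, which is the change in the dump.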

That causes us to (for example) now vectorize mccas.fppized.f:3160 where
we previously figured vectorization is never profitable.  The loop looks
like

            DO 10 MK=1,NOC
            DO 10 ML=1,MK
               MKL = MKL+1
               XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *               VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
               XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *               VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   10       CONTINUE

and requires versioning for aliasing and strided loads and strided
stores.  It seems we are too trigger-happy about doing that.  Also the
vector version isn't entered at all at runtime.
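
For readers less used to Fortran, a rough C transcription of the loop
nest shows where the strides come from.  Assumptions that are mine and
not in the source: 0-based indices, explicit leading dimensions ldx and
ldc, MKL starting at zero before the nest, and all function names.  The
overlap test is only a simplified stand-in for the kind of runtime check
that versioning for aliasing relies on, not the check GCC emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of element (row, col) in a column-major array with leading
   dimension ld.  Walking along col with row fixed is a stride-ld access.  */
static size_t
idx (size_t row, size_t col, size_t ld)
{
  return row + col * ld;
}

/* Conservative test: do [a, a+span) and [b, b+span) overlap in memory?  */
static bool
ranges_overlap (const double *a, const double *b, size_t span)
{
  uintptr_t pa = (uintptr_t) a, pb = (uintptr_t) b;
  uintptr_t bytes = span * sizeof (double);
  return pa < pb + bytes && pb < pa + bytes;
}

/* Scalar form of the triangular nest: two strided read-modify-write
   streams into x and four strided loads from co per iteration.  */
static void
update_xpqkl (double *x, size_t ldx, const double *co, size_t ldc,
              size_t noc, size_t mpq, size_t mrs,
              size_t mp, size_t mq, size_t mr, size_t ms,
              double val1, double val3)
{
  size_t mkl = 0;   /* assumed starting value */
  for (size_t mk = 0; mk < noc; mk++)
    for (size_t ml = 0; ml <= mk; ml++)
      {
        x[idx (mpq, mkl, ldx)]
          += val1 * (co[idx (ms, mk, ldc)] * co[idx (mr, ml, ldc)]
                     + co[idx (ms, ml, ldc)] * co[idx (mr, mk, ldc)]);
        x[idx (mrs, mkl, ldx)]
          += val3 * (co[idx (mq, mk, ldc)] * co[idx (mp, ml, ldc)]
                     + co[idx (mq, ml, ldc)] * co[idx (mp, mk, ldc)]);
        mkl++;
      }
}
```

Note that the two store streams walk different rows of the same array,
so their address ranges interleave and a range-based test like the one
sketched here reports an overlap even when the rows differ.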

But that's not the 10%.  And the big offenders from looking at perf output
do not have any vectorization decision changes...  very strange.

