[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Oct 9 11:35:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so on haswell I see (- is bad, + is good):
-0x2342ca0 _40 + _45 1 times scalar_stmt costs 12 in body
+0x2342ca0 _40 + _45 1 times scalar_stmt costs 4 in body
so a simple add changes cost from 4 to 12 with the patch. Ah, so that
goes
  switch (subcode)
    {
    case PLUS_EXPR:
    case POINTER_PLUS_EXPR:
    case MINUS_EXPR:
      if (kind == scalar_stmt)
        {
          if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
            stmt_cost = ix86_cost->addss;
          else if (X87_FLOAT_MODE_P (mode))
            stmt_cost = ix86_cost->fadd;
          else
            stmt_cost = ix86_cost->add;
        }
where with kind == scalar_stmt we now run into the SSE_FLOAT_MODE_P case
(previously mode was something like V2DFmode) and thus use ix86_cost->addss
instead of ix86_cost->add. That's more correct.
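To make the effect of the mode change concrete, here is a minimal standalone sketch of the selection logic quoted above. It is not GCC code: the enum mode_kind, the struct cost_table and the function scalar_add_cost are hypothetical stand-ins for the machine mode, the processor cost table and the scalar_stmt branch of the switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for GCC machine modes (not the real enum). */
enum mode_kind { MODE_INT, MODE_SSE_FLOAT, MODE_X87_FLOAT, MODE_VECTOR };

/* Hypothetical subset of a per-processor cost table. */
struct cost_table {
    int add;    /* integer add */
    int addss;  /* scalar SSE floating-point add */
    int fadd;   /* x87 floating-point add */
};

/* Mirrors the scalar_stmt branch for PLUS/MINUS: the mode alone decides
   which table entry is charged for the statement. */
int scalar_add_cost(const struct cost_table *c, enum mode_kind mode,
                    int sse_math)
{
    assert(c != NULL);
    if (mode == MODE_SSE_FLOAT && sse_math)
        return c->addss;
    if (mode == MODE_X87_FLOAT)
        return c->fadd;
    return c->add;  /* anything else, including a vector mode */
}
```

With add = 4 and addss = 12 (the two values from the dump lines above), passing a vector mode falls through to the add entry and yields 4, while passing the scalar SSE float mode yields 12, which matches the change seen in the cost dump.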
That causes us to (for example) now vectorize mccas.fppized.f:3160 where
we previously figured vectorization is never profitable. The loop looks
like
      DO 10 MK=1,NOC
      DO 10 ML=1,MK
      MKL = MKL+1
      XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *      VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
      XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *      VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   10 CONTINUE
and requires versioning for aliasing as well as strided loads and strided
stores. It seems we are too trigger-happy about doing that. Also, the
vector version isn't entered at all at runtime.
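For readers unfamiliar with the term, versioning for aliasing roughly means emitting two copies of the loop behind a runtime overlap check. The following is a hand-written C sketch of that shape for a single strided update. It is not what the vectorizer emits for the loop above; the names ranges_disjoint and strided_update are made up for illustration, and the return value only reports which copy ran.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-element ranges starting at a and b do not overlap. */
static int ranges_disjoint(const double *a, const double *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    size_t bytes = n * sizeof(double);
    return pa + bytes <= pb || pb + bytes <= pa;
}

/* x[i*stride] += val * co[i*stride] for i in [0, n).
   Returns 1 if the no-alias copy ran, 0 if the fallback copy ran. */
int strided_update(double *x, const double *co, size_t n, size_t stride,
                   double val)
{
    /* Number of elements spanned by n strided accesses. */
    size_t span = n ? (n - 1) * stride + 1 : 0;

    if (ranges_disjoint(x, co, span)) {
        /* No-alias copy: the one a compiler could vectorize, using
           strided loads and strided stores. */
        for (size_t i = 0; i < n; i++)
            x[i * stride] += val * co[i * stride];
        return 1;
    }

    /* Fallback copy: the original scalar loop, safe under overlap. */
    for (size_t i = 0; i < n; i++)
        x[i * stride] += val * co[i * stride];
    return 0;
}
```

If the check fails on every call, only the fallback copy ever runs, so the extra check and code size are paid without any benefit from the vector copy.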
But that's not the 10%. And the big offenders from looking at perf output
do not have any vectorization decision changes... very strange.