400.perlbench compiled at -O2 (and generic march/mtune) with both PGO and LTO is slower when built with master (26b3e568a60) than when built with GCC 9: by 13% on Zen2 and by 7% on Zen1. Performance is comparable on an Intel Cascade Lake server CPU.

I attempted bisecting the problem on the Zen2 CPU but was only partially successful because much of the slowdown seems to have happened gradually. The first bigger slowdown - almost 4% - came with:

562d1e9556777988ae46c5d1357af2636bc272ea is the first bad commit
commit 562d1e9556777988ae46c5d1357af2636bc272ea
Author: Jan Hubicka <hubicka@gcc.gnu.org>
Date:   Wed Oct 2 16:01:47 2019 +0000

    cif-code.def (MAX_INLINE_INSNS_SINGLE_O2_LIMIT, [...]): New.

    * cif-code.def (MAX_INLINE_INSNS_SINGLE_O2_LIMIT,
    MAX_INLINE_INSNS_AUTO_O2_LIMIT): New.
    ...

    From-SVN: r276469

About the same performance loss was then introduced by:

commit 2925cad2151842daa387950e62d989090e47c91d
Author: Jan Hubicka <hubicka@ucw.cz>
Date:   Thu Oct 3 17:08:21 2019 +0200

    params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, [...]): New.

    * params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT,
    PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2): New.
    * doc/invoke.texi (inline-heuristics-hint-percent,
    inline-heuristics-hint-percent-O2): Document.
    * tree-inline.c (inline_insns_single, inline_insns_auto): Add new
    hint attribute.
    (can_inline_edge_by_limits_p): Use it.

And finally, throughout March the benchmark was quite jumpy but in the end it was again about 5% slower than at the beginning of the month.
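For reference, the build mode being measured above is roughly the following (a sketch only; file names, the training input, and the exact flags are placeholders - in practice the SPEC harness drives this through its config files):

```shell
# Hypothetical sketch of an -O2 PGO+LTO build as discussed above.
CFLAGS="-O2 -flto"

# 1) Instrumented build
gcc $CFLAGS -fprofile-generate -o perlbench.inst src/*.c

# 2) Training run producing .gcda profile data
./perlbench.inst train-input.pl

# 3) Optimized rebuild consuming the profile
gcc $CFLAGS -fprofile-use -o perlbench src/*.c
```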
This looks like an important issue to me. Maybe P2?
Martin, can you try to change the limits, maybe that is just a limit for inline expansions that is not right?
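Changing the limits for an experiment would be something along these lines (parameter names as in the GCC 10 development tree; the values here are arbitrary guesses for illustration, not recommendations):

```shell
# Hypothetical experiment: bump the -O2+FDO inline limits towards
# -O3-style values via --param; numbers chosen only as an example.
gcc -O2 -fprofile-use -flto \
    --param max-inline-insns-single=200 \
    --param max-inline-insns-auto=30 \
    --param inline-heuristics-hint-percent=1600 \
    -o perlbench src/*.c
```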
My benchmarking setup is currently gone so unfortunately no, not easily. I'll be re-measuring everything on a different computer with a slightly different CPU model soon, so after that I guess I could. But it is most likely the limits, yes.
(In reply to Martin Jambor from comment #3) > My benchmarking setup is currently gone so unfortunately no, not easily. > I'll be re-measuring everything on a different computer with a slightly > different CPU model soon, so after that I guess I could. But it is most > likely the limits, yes. Yeah, easy to fix, but it takes some time. But this is not more important than your life. Shall I raise this to P1 so it prevents gcc-10 release?
No, we can't block GCC 10 release indefinitely, we are already behind the usual schedule. We need to resolve the C++ ABI issues and get the release out.
(In reply to Jakub Jelinek from comment #5) > No, we can't block GCC 10 release indefinitely, we are already behind the > usual schedule. We need to resolve the C++ ABI issues and get the release > out. Sorry, have you heard of the corona pandemic out there? This is not unlike the 2020 Olympic Games, which have been cancelled. I am just saying I would delay GCC 10 right now, before it is too late; this performance regression will make the damage worse.
(In reply to Bernd Edlinger from comment #4) > (In reply to Martin Jambor from comment #3) > > My benchmarking setup is currently gone so unfortunately no, not easily. > > I'll be re-measuring everything on a different computer with a slightly > > different CPU model soon, so after that I guess I could. But it is most > > likely the limits, yes. > > Yeah, easy to fix, but it takes some time. > But this is not more important than your life. Note that tuning inliner parameters is hard and takes a lot of time. If we adjust things to make 400.perlbench happy, which is btw. from SPEC 2006(!), we're going to regress things elsewhere. It's going to be a whack-a-mole game and definitely not suitable at this stage (inliner re-tuning is also prone to trigger latent GCC issues in previously fine-compiling apps). > Shall I raise this to P1 so it prevents gcc-10 release? Definitely not. Setting the priority is the release managers' job, and btw. bug priority is meaningless for non-regression bug reports.
Oh, and bugfixing requires first understanding the bug. Especially for performance-related issues, understanding what goes wrong is important. I see no analysis performed to date.
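A typical starting point for such an analysis on Linux would be comparing cycle profiles of the two builds (binary and input names below are placeholders):

```shell
# Record cycle profiles of a GCC 9 build and a trunk build on the
# same input, then compare which symbols gained or lost time share.
perf record -o gcc9.data  ./perlbench.gcc9  ref-input.pl
perf record -o trunk.data ./perlbench.trunk ref-input.pl
perf diff gcc9.data trunk.data
```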
> --- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> --- > Oh, and bugfixing requires to first understand the bug. Especially for > performance related issues understanding what goes wrong is important. > I see no analysis being performed to date.

The problem here is that -O2 -fprofile-use now uses the -O2 inliner limits, while previously it used the -O3 inliner limits (because -fprofile-use enables -finline-functions). I can see this on SPEC GCC, perl, Firefox, real GCC and clang. We now have a performance difference between -O2+FDO and -O3+FDO.

It is something I kind of missed in my testing, because I was testing -O2 and -O3 + FDO but not -O2+FDO. I realize that -O2+FDO is kind of important because we use it in our bootstrap. So I was collecting data over the weekend for Clang, GCC and Firefox.

It is a question how aggressive we want to be at -O2+FDO, but the observation is that in all these programs the code size growth for -O3-style limits is quite small (below 2%), simply because training coverage is quite small in all those programs (sub 10%), and thus the code size growth from inlining hot calls is acceptable. So I think the current defaults are really suboptimal.

I think there are a few ways to proceed:
 1) make the inline limits with FDO the -O3 ones
 2) invent yet another set of parameters for FDO
 3) increase the importance of the known_hot hint, which is set on calls that are known to be hot (either by inlining or by the hot attribute).

1 is easiest but a bit non-systematic. I am not really keen on 2 because of parameter explosion. However, 3 looks like a good alternative, so I am running benchmarks with a few settings of it, but they take some time.

Honza
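A sweep for option 3 could look roughly like this (a hypothetical sketch; the parameter values and file names are made up, and runtime would be measured separately by the benchmark harness):

```shell
# Hypothetical sweep over the hint percentage at -O2+FDO, tracking
# code size alongside the separately-measured runtime.
for pct in 200 400 800 1600; do
  gcc -O2 -fprofile-use \
      --param inline-heuristics-hint-percent=$pct \
      -o perlbench.$pct src/*.c
  size perlbench.$pct
done
```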
(In reply to Richard Biener from comment #7) > > > Shall I raise this to P1 so it prevents gcc-10 release? > > Definitely not. Setting priority is the release managers job, and btw. > bug priority is meaningless for non-regression bugreports. Okay, Richard, is this P2 or P3 then? I just wanted you to think about it. ;-) Thanks, Bernd.