Bug 107409 - Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
Summary: Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 13.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
  Show dependency treegraph
 
Reported: 2022-10-26 09:05 UTC by Rama Malladi
Modified: 2023-07-12 21:04 UTC (History)
6 users (show)

See Also:
Host: aarch64
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-12-01 00:00:00


Attachments
Input and source files. (deleted)
2022-10-26 09:05 UTC, Rama Malladi
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Rama Malladi 2022-10-26 09:05:04 UTC
Created attachment 53773 [details]
Input and source files.

Below is some perf data executing the 519.lbm_r benchmark on aarch64 architecture (Graviton 3 processor). I have comparison of the baseline perf (mainline commit ID: f56d48b2471c388401174029324e1f4c4b84fcdb) vs. a fix for the same (revert the code change in commit ID: a9a4edf0e71bbac9f1b5dcecdcf9250111d16889).

Steps to compile:
$ gcc -std=c99 -mabi=lp64 -g -Ofast -mcpu=native lbm.i main.i -lm -flto -o 519_lbm_r_base

$ time ./519_lbm_r_base 3000 reference.dat 0 0 100_100_130_ldc.of
real    2m50.946s

Reverting the code changes in commit ID: a9a4edf0e71bbac9f1b5dcecdcf9250111d16889
$ time ./519_lbm_r_fix 3000 reference.dat 0 0 100_100_130_ldc.of
real    2m42.091s

The code change reverted was in the following file:
* tree-cfg.c (execute_fixup_cfg): Update also max_bb_count when scaling happen.

Author: Jan Hubicka <hubicka@ucw.cz>
Date:   Sat Nov 30 22:25:24 2019 +0100

Please find attached the files to reproduce this issue and the fix.
Comment 1 Rama Malladi 2022-10-26 09:20:04 UTC
$ /home/ubuntu/gccfixissue1/bin/gcc  -v
Using built-in specs.
COLLECT_GCC=/home/ubuntu/gccfixissue1/bin/gcc
COLLECT_LTO_WRAPPER=/home/ubuntu/gccfixissue1/libexec/gcc/aarch64-unknown-linux-gnu/13.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../configure --prefix=/home/ubuntu/gccfixissue1 --enable-languages=c,fortran
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 13.0.0 20221021 (experimental) (GCC)
Comment 2 Martin Liška 2022-10-27 08:08:56 UTC
(In reply to Rama Malladi from comment #0)
> Created attachment 53773 [details]
> Input and source files.
> 

Note you should not share publicly source files of the entire SPEC benchmark. That very likely violates their license rules!
Comment 3 Martin Liška 2022-10-27 08:13:03 UTC
Can you please share perf-profile before and after the revision?

Note I can't see it for Altra aarch64 CPU:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot.1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605.477.0&plot.6=759.477.0&plot.7=584.477.0&

However, there are huge changes in between GCC 6/7 and a newer releases. Note the benchmark is pretty small and very sensitive to instruction caches.
Comment 4 Rama Malladi 2022-10-27 08:15:56 UTC
Hi Martin,
Thanks for the guidance. Can we delete the attachment from this bug report?

Regards,
Rama
Comment 5 Martin Liška 2022-10-27 08:28:15 UTC
Please try writing here: overseers@sourceware.org
Comment 6 Rama Malladi 2022-10-27 12:08:10 UTC
(In reply to Martin Liška from comment #5)
> Please try writing here: overseers@sourceware.org

I have asked for deletion. Thanks
Comment 7 Mark Wielaard 2022-10-27 12:18:10 UTC
The content of attachment 53773 [details] has been deleted for the following reason:

https://sourceware.org/pipermail/overseers/2022q4/019048.html
Comment 8 Rama Malladi 2022-10-27 12:21:12 UTC
(In reply to Mark Wielaard from comment #7)
> The content of attachment 53773 [details] has been deleted for the following
> reason:
> 
> https://sourceware.org/pipermail/overseers/2022q4/019048.html

Thank you.
Comment 9 Rama Malladi 2022-12-01 06:54:55 UTC
(In reply to Martin Liška from comment #3)
> Can you please share perf-profile before and after the revision?
> 
> Note I can't see it for Altra aarch64 CPU:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot.
> 1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605.
> 477.0&plot.6=759.477.0&plot.7=584.477.0&
> 
> However, there are huge changes in between GCC 6/7 and a newer releases.
> Note the benchmark is pretty small and very sensitive to instruction caches.

Hi, I got IPC data for baseline version of compiler and with this patch reverted.

This is on Graviton3 processor machine, executing 1-copy rate run of 519.lbm_r.

Baseline: Compiler commit ID: f896c13489d22b30d01257bc8316ab97b3359d1c
Cycles:            148,489,372,938
Instructions:      382,748,379,257
IPC:               2.58

Baseline with code change in a9a4edf0e71bbac9f1b5dcecdcf9250111d16889 reverted.

$ git diff gcc/tree-cfg.cc
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index d982988048f..736432713fe 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -9984,7 +9984,7 @@ execute_fixup_cfg (void)
     }
   if (scale)
     {
-      update_max_bb_count ();
+//      update_max_bb_count ();
       compute_function_frequency ();
     }

Cycles:            140,937,228,769
Instructions:      368,881,714,982
IPC:               2.62

From the above, I do see the instructions executed are higher for the baseline compiler code-gen vs. the one with patch reverted. Can you please look into the code-gen and let me know if you find some perf opportunity with this patch revert? Thank you.
Comment 10 Martin Liška 2022-12-01 10:09:59 UTC
@Honza ?
Comment 11 Rama Malladi 2022-12-08 10:32:11 UTC
(In reply to Martin Liška from comment #10)
> @Honza ?

Just checking if this can be fixed/ implemented. Thanks.
Comment 12 Rama Malladi 2022-12-09 09:48:41 UTC
I found difference in dumps at various stages of the compilation for the mainline GCC and with update_max_bb_count() commented. Here are the details:

Mainline: Commit ID: 63a42ffc0833553fbcb84b50cf0fd2d867b8a92f

There was difference in the dumps for these 2 stages:
"einline" and "earlydebug"

Since we use LTO for this build of 519.lbm_r build, I found these differences in these stages of the link-time optimizer:
"vect", "slp1", "ivopts", "earlydebug", "debug"

Also, this perf drop of 5%-6% with update_max_bb_count() code was observed only on ARM64 instances (Graviton3) and not on x86_64 instances (Intel Xeon).

I ran the other SPEC cpu2017_fprate benchmarks on ARM64 with this code commented on GCC mainline and I haven't observed any perf regression. So, maybe worth a fix.

Thank you.
Comment 13 Martin Liška 2022-12-09 10:05:18 UTC
Note the mentioned revision is a fix and yes, sometimes these revisions can end up with a regression as profile estimation is a complex guess.
Comment 14 Rama Malladi 2022-12-12 09:48:59 UTC
(In reply to Martin Liška from comment #13)
> Note the mentioned revision is a fix and yes, sometimes these revisions can
> end up with a regression as profile estimation is a complex guess.

Yes, possibly. So, from my understanding, the update_max_bb_count() tracks the max basic block count and takes a decision to inline or not in this case/ run. That is likely why we see a larger instruction count w this function executed/ enabled.
Comment 15 Rama Malladi 2023-01-09 04:38:17 UTC
Hi, Can we review this issue and suggest next steps/ action please? Thanks.
Comment 16 Martin Liška 2023-01-09 08:41:04 UTC
@Honza: ???
Comment 17 Martin Jambor 2023-01-30 18:15:34 UTC
(In reply to Martin Liška from comment #3)
> Note I can't see it for Altra aarch64 CPU:

I think LNT can see it very well and it has appeared around the reported time:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=759.477.0&plot.1=584.477.0&

But I don't think the commit given in the bug summary is the culprit, this clearly happened in GCC 13 development cycle.  Rama, what is your reasoning to suggest  reverting this particular commit?  If you bisected the issue, can you double check you arrived at the correct one?
Comment 18 Sebastian Pop 2023-02-02 21:35:47 UTC
A new 5% regression happened in gcc-trunk more recently and may be due to another patch.

Rama was bisecting a 15% perf regression on lbm when updating gcc-7 to gcc-10.
The regression can be seen on the LNT graph link from comment#3 

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot.1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605.477.0&plot.6=759.477.0&plot.7=584.477.0

gcc-6 has execution time of 213 seconds
gcc-7 is at 215 seconds
gcc-8 is at 266
gcc-9 at 259
gcc-10 at 260

Honza's patch seems to be unrelated as it was committed to trunk before gcc-10 release on May 7, 2020:

commit a9a4edf0e71bbac9f1b5dcecdcf9250111d16889
Author: Jan Hubicka <hubicka@ucw.cz>
Date:   Sat Nov 30 22:25:24 2019 +0100

    Update max_bb_count in execute_fixup_cfg


We need to git-bisect between gcc-7 and gcc-8.
Comment 19 Rama Malladi 2023-02-03 02:00:56 UTC
Thanks @Sebastian and @Martin J. I will get another bisect between GCC 7-and-8.
Comment 20 Rama Malladi 2023-02-20 03:57:15 UTC
@Martin J and @Sebastian P, Let me walk you through the perf data and my triage.

First, my triage has been on Graviton 3 (neoverse-v1) processor based instances. Next, I was looking for perf delta going from gcc-7 to gcc-10. I found 2 issues: One was reported in 107413 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413) and fixed (the perf delta between gcc-7 and gcc-8 -- 215s vs. 266s); Another one is the issue reported in here.

I did another triage and landed at the same commit that I reported earlier.

# first bad commit: [a9a4edf0e71bbac9f1b5dcecdcf9250111d16889] Update max_bb_count in execute_fixup_cfg

Please let me know any further info/ studies you would like to see on this report.

Thank you.
Comment 21 Rama Malladi 2023-02-24 10:26:52 UTC
I did another triage for perf loss on Graviton 2 processor (neoverse-n1) based instance and found this commit: `a9a4edf0e71bbac9f1b5dcecdcf9250111d16889` to be the reason. As I had indicated in my earlier reply, I was doing a triage of perf loss going from gcc-7 to gcc-10.

The perf of 519.libm_r 1-copy run improved 1.08x with the revert of commit: `a9a4edf0e71bbac9f1b5dcecdcf9250111d16889` on gcc-mainline ( `2f1691be517fcdcabae9cd671ab511eb0e08b1d5`).

I am guessing that we don't see it on LNT/ Altra CPUs.

So, please look into this issue fix. Let me know if you have any queries. Thanks.
Comment 22 Rama Malladi 2023-03-30 04:56:21 UTC
I will close this issue as we were unable to reproduce the perf drop going from gcc-7 to gcc-8 on a Graviton2 based instance. The performance of 519.lbm_r built with gcc-7.4 was same as that with gcc-8.5.
Comment 23 Rama Malladi 2023-03-30 04:58:40 UTC
(In reply to Rama Malladi from comment #22)
> I will close this issue as we were unable to reproduce the perf drop going
> from gcc-7 to gcc-8 on a Graviton2 based instance. The performance of
> 519.lbm_r built with gcc-7.4 was same as that with gcc-8.5.

Can someone from the GCC dev/ regression team close this issue as I am unable to find an option for the same? Thanks
Comment 24 Martin Liška 2023-03-30 07:52:23 UTC
Sure, let's close it.