Created attachment 53773 [details] Input and source files. Below is some perf data executing the 519.lbm_r benchmark on aarch64 architecture (Graviton 3 processor). I have comparison of the baseline perf (mainline commit ID: f56d48b2471c388401174029324e1f4c4b84fcdb) vs. a fix for the same (revert the code change in commit ID: a9a4edf0e71bbac9f1b5dcecdcf9250111d16889). Steps to compile: $ gcc -std=c99 -mabi=lp64 -g -Ofast -mcpu=native lbm.i main.i -lm -flto -o 519_lbm_r_base $ time ./519_lbm_r_base 3000 reference.dat 0 0 100_100_130_ldc.of real 2m50.946s Reverting the code changes in commit ID: a9a4edf0e71bbac9f1b5dcecdcf9250111d16889 $ time ./519_lbm_r_fix 3000 reference.dat 0 0 100_100_130_ldc.of real 2m42.091s The code change reverted was in the following file: * tree-cfg.c (execute_fixup_cfg): Update also max_bb_count when scaling happen. Author: Jan Hubicka <hubicka@ucw.cz> Date: Sat Nov 30 22:25:24 2019 +0100 Please find attached the files to reproduce this issue and the fix.
$ /home/ubuntu/gccfixissue1/bin/gcc -v Using built-in specs. COLLECT_GCC=/home/ubuntu/gccfixissue1/bin/gcc COLLECT_LTO_WRAPPER=/home/ubuntu/gccfixissue1/libexec/gcc/aarch64-unknown-linux-gnu/13.0.0/lto-wrapper Target: aarch64-unknown-linux-gnu Configured with: ../configure --prefix=/home/ubuntu/gccfixissue1 --enable-languages=c,fortran Thread model: posix Supported LTO compression algorithms: zlib gcc version 13.0.0 20221021 (experimental) (GCC)
(In reply to Rama Malladi from comment #0) > Created attachment 53773 [details] > Input and source files. > Note you should not share publicly source files of the entire SPEC benchmark. That very likely violates their license rules!
Can you please share perf-profile before and after the revision? Note I can't see it for Altra aarch64 CPU: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot.1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605.477.0&plot.6=759.477.0&plot.7=584.477.0& However, there are huge changes in between GCC 6/7 and a newer releases. Note the benchmark is pretty small and very sensitive to instruction caches.
Hi Martin, Thanks for the guidance. Can we delete the attachment from this bug report? Regards, Rama
Please try writing here: overseers@sourceware.org
(In reply to Martin Liška from comment #5) > Please try writing here: overseers@sourceware.org I have asked for deletion. Thanks
The content of attachment 53773 [details] has been deleted for the following reason: https://sourceware.org/pipermail/overseers/2022q4/019048.html
(In reply to Mark Wielaard from comment #7) > The content of attachment 53773 [details] has been deleted for the following > reason: > > https://sourceware.org/pipermail/overseers/2022q4/019048.html Thank you.
(In reply to Martin Liška from comment #3) > Can you please share perf-profile before and after the revision? > > Note I can't see it for Altra aarch64 CPU: > https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot. > 1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605. > 477.0&plot.6=759.477.0&plot.7=584.477.0& > > However, there are huge changes in between GCC 6/7 and a newer releases. > Note the benchmark is pretty small and very sensitive to instruction caches. Hi, I got IPC data for baseline version of compiler and with this patch reverted. This is on Graviton3 processor machine, executing 1-copy rate run of 519.lbm_r. Baseline: Compiler commit ID: f896c13489d22b30d01257bc8316ab97b3359d1c Cycles: 148,489,372,938 Instructions: 382,748,379,257 IPC: 2.58 Baseline with code change in a9a4edf0e71bbac9f1b5dcecdcf9250111d16889 reverted. $ git diff gcc/tree-cfg.cc diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index d982988048f..736432713fe 100644 --- a/gcc/tree-cfg.cc +++ b/gcc/tree-cfg.cc @@ -9984,7 +9984,7 @@ execute_fixup_cfg (void) } if (scale) { - update_max_bb_count (); +// update_max_bb_count (); compute_function_frequency (); } Cycles: 140,937,228,769 Instructions: 368,881,714,982 IPC: 2.62 From the above, I do see the instructions executed are higher for the baseline compiler code-gen vs. the one with patch reverted. Can you please look into the code-gen and let me know if you find some perf opportunity with this patch revert? Thank you.
@Honza ?
(In reply to Martin Liška from comment #10) > @Honza ? Just checking if this can be fixed/ implemented. Thanks.
I found difference in dumps at various stages of the compilation for the mainline GCC and with update_max_bb_count() commented. Here are the details: Mainline: Commit ID: 63a42ffc0833553fbcb84b50cf0fd2d867b8a92f There was difference in the dumps for these 2 stages: "einline" and "earlydebug" Since we use LTO for this build of 519.lbm_r build, I found these differences in these stages of the link-time optimizer: "vect", "slp1", "ivopts", "earlydebug", "debug" Also, this perf drop of 5%-6% with update_max_bb_count() code was observed only on ARM64 instances (Graviton3) and not on x86_64 instances (Intel Xeon). I ran the other SPEC cpu2017_fprate benchmarks on ARM64 with this code commented on GCC mainline and I haven't observed any perf regression. So, maybe worth a fix. Thank you.
Note the mentioned revision is a fix and yes, sometimes these revisions can end up with a regression as profile estimation is a complex guess.
(In reply to Martin Liška from comment #13) > Note the mentioned revision is a fix and yes, sometimes these revisions can > end up with a regression as profile estimation is a complex guess. Yes, possibly. So, from my understanding, the update_max_bb_count() tracks the max basic block count and takes a decision to inline or not in this case/ run. That is likely why we see a larger instruction count w this function executed/ enabled.
Hi, Can we review this issue and suggest next steps/ action please? Thanks.
@Honza: ???
(In reply to Martin Liška from comment #3) > Note I can't see it for Altra aarch64 CPU: I think LNT can see it very well and it has appeared around the reported time: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=759.477.0&plot.1=584.477.0& But I don't think the commit given in the bug summary is the culprit, this clearly happened in GCC 13 development cycle. Rama, what is your reasoning to suggest reverting this particular commit? If you bisected the issue, can you double check you arrived at the correct one?
A new 5% regression happened in gcc-trunk more recently and may be due to another patch. Rama was bisecting a 15% perf regression on lbm when updating gcc-7 to gcc-10. The regression can be seen on the LNT graph link from comment#3 https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot.1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605.477.0&plot.6=759.477.0&plot.7=584.477.0 gcc-6 has execution time of 213 seconds gcc-7 is at 215 seconds gcc-8 is at 266 gcc-9 at 259 gcc-10 at 260 Honza's patch seems to be unrelated as it was committed to trunk before gcc-10 release on May 7, 2020: commit a9a4edf0e71bbac9f1b5dcecdcf9250111d16889 Author: Jan Hubicka <hubicka@ucw.cz> Date: Sat Nov 30 22:25:24 2019 +0100 Update max_bb_count in execute_fixup_cfg We need to git-bisect between gcc-7 and gcc-8.
Thanks @Sebastian and @Martin J. I will get another bisect between GCC 7-and-8.
@Martin J and @Sebastian P, Let me walk you through the perf data and my triage. First, my triage has been on Graviton 3 (neoverse-v1) processor based instances. Next, I was looking for perf delta going from gcc-7 to gcc-10. I found 2 issues: One was reported in 107413 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413) and fixed (the perf delta between gcc-7 and gcc-8 -- 215s vs. 266s); Another one is the issue reported in here. I did another triage and landed at the same commit that I reported earlier. # first bad commit: [a9a4edf0e71bbac9f1b5dcecdcf9250111d16889] Update max_bb_count in execute_fixup_cfg Please let me know any further info/ studies you would like to see on this report. Thank you.
I did another triage for perf loss on Graviton 2 processor (neoverse-n1) based instance and found this commit: `a9a4edf0e71bbac9f1b5dcecdcf9250111d16889` to be the reason. As I had indicated in my earlier reply, I was doing a triage of perf loss going from gcc-7 to gcc-10. The perf of 519.libm_r 1-copy run improved 1.08x with the revert of commit: `a9a4edf0e71bbac9f1b5dcecdcf9250111d16889` on gcc-mainline ( `2f1691be517fcdcabae9cd671ab511eb0e08b1d5`). I am guessing that we don't see it on LNT/ Altra CPUs. So, please look into this issue fix. Let me know if you have any queries. Thanks.
I will close this issue as we were unable to reproduce the perf drop going from gcc-7 to gcc-8 on a Graviton2 based instance. The performance of 519.lbm_r built with gcc-7.4 was same as that with gcc-8.5.
(In reply to Rama Malladi from comment #22) > I will close this issue as we were unable to reproduce the perf drop going > from gcc-7 to gcc-8 on a Graviton2 based instance. The performance of > 519.lbm_r built with gcc-7.4 was same as that with gcc-8.5. Can someone from the GCC dev/ regression team close this issue as I am unable to find an option for the same? Thanks
Sure, let's close it.