We tested a LoongArch64 CPU (Loongson 3A6000, "LA664" architecture) running the Linux distribution AOSC OS 11.4.0 (default gcc version is 13.2.0). We found that the 548.exchange2_r benchmark from the SPEC 2017 INTrate suite suffered significant regressions, from 14% to 28%, with various compile options.

The rate-1 results follow. Each entry gives the score, with the run time in parentheses.

After snapshot 20240317, score 14.3-19.3% lower with parameters "-g -Ofast -march=native":
13.2.0:   11.7 (223s) [gcc 13.2.0, system default]
20240317: 11.0 (237s) [gcc 14 snapshot 20240317]
20240324: 8.88 (295s) [gcc 14 snapshot 20240324]
20240430: 9.03 (290s) [gcc 14 snapshot 20240430, 14.1.0-RC]
14.1.0:   9.43 (278s) [gcc 14.1.0 release]

After snapshot 20240317, score 16.5-20.8% lower with parameters "-g -Ofast -march=native -flto":
13.2.0:   12.0 (218s)
20240317: 10.6 (248s)
20240324: 8.40 (312s)
20240430: 8.48 (309s)
14.1.0:   8.85 (296s)

After snapshot 20240317, score 18-23.1% lower with parameters "-g -Ofast -march=la664":
13.2.0:   "-march=la664" flag is not supported
20240317: 11.5 (227s)
20240324: 8.84 (296s)
20240430: 9.43 (278s)
14.1.0:   9.42 (278s)

After snapshot 20240317, score 20.3-21.2% lower with parameters "-g -Ofast -march=la664 -flto":
13.2.0:   "-march=la664" flag is not supported
20240317: 11.1 (236s)
20240324: 8.75 (299s)
20240430: 8.85 (296s)
14.1.0:   8.85 (296s)

After snapshot 20240317, score 26.3-26.6% lower with parameters "-g -Ofast -march=la464":
13.2.0:   8.76 (299s)
20240317: 12.8 (205s)
20240324: 9.39 (279s)
20240430: 9.43 (278s)
14.1.0:   9.43 (278s)

After snapshot 20240317, score 26.6-28% lower with parameters "-g -Ofast -march=la464 -flto":
13.2.0:   8.52 (307s)
20240317: 12.8 (204s)
20240324: 9.22 (284s)
20240430: 9.37 (280s)
14.1.0:   9.40 (279s)

The gcc 14 snapshots and gcc 14.1.0 were configured with the following parameters:
--enable-shared --enable-threads=posix --with-system-zlib --enable-gnu-indirect-function --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib --disable-werror --enable-pie --enable-checking=release --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new --enable-default-pie --enable-default-ssp --enable-bootstrap --enable-languages=c,c++,fortran,lto --with-abi=lp64d --with-arch=loongarch64 --with-tune=la664 --build=loongarch64-aosc-linux-gnu

The regression may appear on other types of CPUs as well. We did a quick test on an AMD Zen4 CPU (R9 7940HS) and found a similar but smaller regression.

The rate-1 results on x86_64 (AMD R9 7940HS) with Debian 12:

After snapshot 20240317, score 8.6-9.6% lower with parameters "-m64 -g -Ofast -march=znver3":
12.2.0:   30.1 (87.0s) [gcc 12.2.0, system default]
13.2.0:   30.6 (85.7s) [gcc 13.2 release]
20240317: 31.4 (83.3s) [gcc 14 snapshot]
20240324: 28.7 (91.2s) [gcc 14 snapshot]
20240430: 28.4 (92.2s) [gcc 14 snapshot, 14.1.0-RC]

After snapshot 20240317, score 10% lower with parameters "-m64 -g -Ofast -march=znver3 -flto":
12.2.0:   29.0 (90.3s)
13.2.0:   30.9 (84.9s)
20240317: 32.0 (81.8s)
20240324: 28.8 (90.9s)
20240430: 28.8 (91.1s)

gcc13 and gcc14 were configured with the following parameters:
--enable-shared --enable-threads=posix --with-system-zlib --enable-gnu-indirect-function --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib --disable-werror --enable-pie --enable-checking=release --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new --enable-default-pie --enable-default-ssp --enable-bootstrap --enable-languages=c,c++,fortran,lto --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
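The percentage ranges quoted in the report can be re-derived from the scores themselves. The following is a minimal sketch, not part of the original report: the function names are illustrative, and only two of the LoongArch tables are copied in, with snapshot 20240317 taken as the baseline as in the report.

```python
# Scores copied from two of the LoongArch rate-1 tables in the report.
# Keys are compiler options; values map GCC version/snapshot -> SPEC score.
scores = {
    "-g -Ofast -march=native": {
        "20240317": 11.0, "20240324": 8.88, "20240430": 9.03, "14.1.0": 9.43,
    },
    "-g -Ofast -march=la464": {
        "20240317": 12.8, "20240324": 9.39, "20240430": 9.43, "14.1.0": 9.43,
    },
}


def drop_percent(baseline, score):
    """Return how much lower `score` is than `baseline`, in percent."""
    return (baseline - score) / baseline * 100.0


def drop_range(per_version, baseline_key="20240317"):
    """Smallest and largest drop of the later builds relative to the baseline."""
    baseline = per_version[baseline_key]
    drops = [
        drop_percent(baseline, score)
        for version, score in per_version.items()
        if version != baseline_key
    ]
    return min(drops), max(drops)


for flags, per_version in scores.items():
    low, high = drop_range(per_version)
    print(f"{flags}: {low:.1f}% to {high:.1f}% lower than snapshot 20240317")
```

Run as written, this reproduces the 14.3-19.3% and 26.3-26.6% ranges stated for those two option sets.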
Neither of the two LoongArch maintainers seems to have an account on BZ (?), so CCing xry111. Apologies if I missed their accounts.
I don't have access to SPEC, so I cannot confirm or refute the issue.
Changed component to target for now. I'm suspicious about the 10% regression on x86_64. IIRC there are already multiple bug reports complaining about SPEC regressions of around 5% on x86_64, so I'd be really surprised if there were a genuine 10% regression on x86_64 that had not already been reported.
s/suspicious/skeptical/
I will verify it on multiple machines to see if the problem can be reproduced.
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=471.407.0

shows a recent improvement that then regressed again, so maybe your choice of snapshots produces a similar artifact. Try a snapshot from February for comparison, for example.
(In reply to Richard Biener from comment #6)
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=471.407.0
>
> shows a recent improvement that then regressed again, maybe you have a
> similar artifact with the choosing of the snapshots. Try a snapshot from
> february for comparison for example.

Thanks for the nice graph! We ran more tests on the AMD Zen4 CPU R9 7940HS (system: Debian 12) with parameters similar to yours. The results follow.

With parameters "-m64 -g -Ofast -march=native":
13.2.0:   30.0 (87.2s) [gcc 13.2 release]
20240121: 29.8 (88.0s) [gcc 14 snapshot]
20240218: 29.8 (88.0s) [gcc 14 snapshot]
20240303: 29.2 (89.8s) [gcc 14 snapshot]
20240310: 31.7 (82.6s) [gcc 14 snapshot]
20240317: 31.7 (82.7s) [gcc 14 snapshot]
20240324: 28.3 (92.5s) [gcc 14 snapshot]
20240430: 28.4 (92.3s) [gcc 14 snapshot, 14.1.0-RC]
14.1.0:   28.4 (92.4s) [gcc 14.1 release]

With parameters "-m64 -g -Ofast -march=native -flto":
13.2.0:   30.5 (85.8s) [gcc 13.2 release]
20240121: 30.5 (85.9s) [gcc 14 snapshot]
20240218: 29.5 (88.7s) [gcc 14 snapshot]
20240303: 30.5 (86.0s) [gcc 14 snapshot]
20240310: 31.6 (82.8s) [gcc 14 snapshot]
20240317: 31.7 (82.7s) [gcc 14 snapshot]
20240324: 28.6 (91.6s) [gcc 14 snapshot]
20240430: 29.1 (89.9s) [gcc 14 snapshot, 14.1.0-RC]
14.1.0:   29.1 (90.1s) [gcc 14.1 release]

The scores with the gcc 14.1 release are 8.2-10.4% lower than those with gcc 14 snapshot 20240317, and 4.6-5.3% lower than those with the gcc 13.2 release.
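Because a rate-1 SPEC score is a reference time divided by the measured run time, the product of score and seconds should be nearly constant across rows of the same benchmark. The sketch below is a sanity check on the "-m64 -g -Ofast -march=native" table above and is useful mainly for spotting transcription errors; the variable names are illustrative, and the 1% tolerance mentioned afterwards is an arbitrary assumption.

```python
# (score, seconds) pairs copied from the "-m64 -g -Ofast -march=native" table.
runs = {
    "13.2.0": (30.0, 87.2),
    "20240121": (29.8, 88.0),
    "20240218": (29.8, 88.0),
    "20240303": (29.2, 89.8),
    "20240310": (31.7, 82.6),
    "20240317": (31.7, 82.7),
    "20240324": (28.3, 92.5),
    "20240430": (28.4, 92.3),
    "14.1.0": (28.4, 92.4),
}

# score = reference_time / run_time, so score * run_time estimates the
# (constant) reference time, up to rounding of the published figures.
products = {version: score * seconds for version, (score, seconds) in runs.items()}

# Relative gap between the largest and smallest product.
spread = max(products.values()) / min(products.values()) - 1.0

for version, product in products.items():
    print(f"{version}: score x time = {product:.0f}")
print(f"relative spread: {spread:.2%}")
```

A spread well under 1% suggests the scores and times in the table are mutually consistent, so the score differences reflect run-time differences rather than typos.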
(In reply to Chen Chen from comment #0)
> We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> significant regressions from 14% to 28% with various compiling options.
>
> The rate-1 results are following:
>
/* snip */
>
> after snapshot 20240317 score 18-23.1% lower with parameters "-g -Ofast
> -march=la664":
> 13.2.0: "-march=la664" flag is not supported
> 20240317: 11.5 (227s)
> 20240324: 8.84 (296s)
> 20240430: 9.43 (278s)
> 14.1.0: 9.42 (278s)
>
/* snip */
>
>
> after snapshot 20240317 score 26.3-26.6% lower with parameters "-g -Ofast
> -march=la464":
> 13.2.0: 8.76 (299s)
> 20240317: 12.8 (205s)
> 20240324: 9.39 (279s)
> 20240430: 9.43 (278s)
> 14.1.0: 9.43 (278s)
>
>
> 20240317: 11.5 (227s) -march=la664
> 20240317: 12.8 (205s) -march=la464

I looked for the reason for the gap between the above two results.

The performance regression is caused by r14-6814. With the following modification, the scores for -march=la664 and -march=la464 are the same.

diff --git a/gcc/config/loongarch/loongarch-def.cc b/gcc/config/loongarch/loongarch-def.cc
index e8c129ce643..f27284cb20a 100644
--- a/gcc/config/loongarch/loongarch-def.cc
+++ b/gcc/config/loongarch/loongarch-def.cc
@@ -111,11 +111,7 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
    tune targets (i.e. -mtune=native while PRID does not correspond to
    any known "-mtune" type).  */
 array_tune<loongarch_rtx_cost_data> loongarch_cpu_rtx_cost_data =
-  array_tune<loongarch_rtx_cost_data> ()
-    .set (CPU_LA664,
-         loongarch_rtx_cost_data ()
-           .movcf2gr_ (COSTS_N_INSNS (1))
-           .movgr2cf_ (COSTS_N_INSNS (1)));
+  array_tune<loongarch_rtx_cost_data> ();
(In reply to chenglulu from comment #8)

> diff --git a/gcc/config/loongarch/loongarch-def.cc
> b/gcc/config/loongarch/loongarch-def.cc
> index e8c129ce643..f27284cb20a 100644
> --- a/gcc/config/loongarch/loongarch-def.cc
> +++ b/gcc/config/loongarch/loongarch-def.cc
> @@ -111,11 +111,7 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
>     tune targets (i.e. -mtune=native while PRID does not correspond to
>     any known "-mtune" type).  */
>  array_tune<loongarch_rtx_cost_data> loongarch_cpu_rtx_cost_data =
> -  array_tune<loongarch_rtx_cost_data> ()
> -    .set (CPU_LA664,
> -         loongarch_rtx_cost_data ()
> -           .movcf2gr_ (COSTS_N_INSNS (1))
> -           .movgr2cf_ (COSTS_N_INSNS (1)));
> +  array_tune<loongarch_rtx_cost_data> ();

But why? Aren't movcf2gr and movgr2cf one-cycle on LA664?
(In reply to Xi Ruoyao from comment #9)
> (In reply to chenglulu from comment #8)
>
> > diff --git a/gcc/config/loongarch/loongarch-def.cc
> > b/gcc/config/loongarch/loongarch-def.cc
> > index e8c129ce643..f27284cb20a 100644
> > --- a/gcc/config/loongarch/loongarch-def.cc
> > +++ b/gcc/config/loongarch/loongarch-def.cc
> > @@ -111,11 +111,7 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
> >     tune targets (i.e. -mtune=native while PRID does not correspond to
> >     any known "-mtune" type).  */
> >  array_tune<loongarch_rtx_cost_data> loongarch_cpu_rtx_cost_data =
> > -  array_tune<loongarch_rtx_cost_data> ()
> > -    .set (CPU_LA664,
> > -         loongarch_rtx_cost_data ()
> > -           .movcf2gr_ (COSTS_N_INSNS (1))
> > -           .movgr2cf_ (COSTS_N_INSNS (1)));
> > +  array_tune<loongarch_rtx_cost_data> ();
>
> But why? Isn't movcf2gr and movgr2cf one-cycle on LA664?

I think this is weird too. I'm still testing other situations, and I'll look for the reason once that testing is complete.
(In reply to Chen Chen from comment #0)
> We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> significant regressions from 14% to 28% with various compiling options.
>
> The rate-1 results are following:
>
> after snapshot 20240317 score 14.3-19.3% lower with parameters "-g -Ofast
> -march=native":
> 13.2.0: 11.7 (223s) [gcc 13.2.0, system default]

Hi:

I can't reproduce the score for GCC 13.2. Have you made any modifications there?
(In reply to chenglulu from comment #11)
> (In reply to Chen Chen from comment #0)
> > We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> > operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> > found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> > significant regressions from 14% to 28% with various compiling options.
> >
> > The rate-1 results are following:
> >
> > after snapshot 20240317 score 14.3-19.3% lower with parameters "-g -Ofast
> > -march=native":
> > 13.2.0: 11.7 (223s) [gcc 13.2.0, system default]
> Hi:
>
> I can't reproduce the score of r13.2. Have you made any modifications there?

No. I used the system default gcc. How big is the difference? A little fluctuation is normal. In previous tests with parameters "-g -Ofast -march=native" I also got scores of 11.3 (232s) and 11.5 (227s). To keep the runs comparable, before each test presented above I freed the page cache with the command "echo 3 > /proc/sys/vm/drop_caches" and then ran the test.
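To put the "a little fluctuation is normal" remark in numbers, the three scores mentioned for the AOSC default gcc can be compared against the smallest regression claimed in the original report. This is only a rough sketch with three data points; the helper name is illustrative, and the spread measure (best-to-worst gap relative to the best score) is one arbitrary choice among several.

```python
# Scores reported in this thread for the AOSC default gcc with
# "-g -Ofast -march=native": the original result plus two earlier runs.
repeat_scores = [11.7, 11.3, 11.5]

# Smallest regression claimed in the original report, in percent.
smallest_reported_drop = 14.3


def relative_spread_percent(values):
    """Gap between best and worst value, as a percentage of the best."""
    return (max(values) - min(values)) / max(values) * 100.0


noise = relative_spread_percent(repeat_scores)
print(f"run-to-run spread: {noise:.1f}%")
print(f"smallest reported drop: {smallest_reported_drop:.1f}%")
print("drop exceeds observed noise:", smallest_reported_drop > noise)
```

With so few repeats this does not establish a confidence interval; it only shows that the observed run-to-run spread is several times smaller than the reported drops.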
(In reply to Chen Chen from comment #12)

> No. I used system default gcc.

AOSC backports *many* changes not in upstream GCC 13.2 to their "13.2":
https://github.com/AOSC-Dev/aosc-os-abbs/tree/stable/core-devel/gcc/01-runtime/patches

So the default GCC is simply not GCC 13.2.
(In reply to Xi Ruoyao from comment #13)
> (In reply to Chen Chen from comment #12)
>
> > No. I used system default gcc.
>
> AOSC backports *many* changes not in upstream GCC 13.2 to their "13.2":
> https://github.com/AOSC-Dev/aosc-os-abbs/tree/stable/core-devel/gcc/01-
> runtime/patches
>
> So the default GCC is simply not GCC 13.2.

You are correct. The 13.2 results above should be labeled "AOSC system default gcc 13.2". On the AOSC system I rebuilt the official gcc 13.2 source with the same configure parameters, except that "--with-tune=la664" was changed to "--with-tune=la464" because gcc 13.2 does not support the "LA664" architecture. The test results with the official gcc 13.2 follow:

-g -Ofast -march=native      : 6.54 (400s)
-g -Ofast -march=native -flto: 6.57 (399s)
-g -Ofast -march=la464       : 6.46 (405s)
-g -Ofast -march=la464 -flto : 6.57 (399s)
(In reply to Chen Chen from comment #14)
> (In reply to Xi Ruoyao from comment #13)
> > (In reply to Chen Chen from comment #12)
> >
> > > No. I used system default gcc.
> >
> > AOSC backports *many* changes not in upstream GCC 13.2 to their "13.2":
> > https://github.com/AOSC-Dev/aosc-os-abbs/tree/stable/core-devel/gcc/01-
> > runtime/patches
> >
> > So the default GCC is simply not GCC 13.2.
>
> You are correct. The above 13.2 results should be "AOSC system default gcc
> 13.2" results. Under AOSC system I recompiled official gcc 13.2 source with
> the same parameters except for "--with-tune=la664" (changed to
> "--with-tune=la464" since gcc 13.2 does not support "LA664" architecture).
> The test results from official gcc 13.2 are following:
>
> -g -Ofast -march=native : 6.54 (400s)
> -g -Ofast -march=native -flto: 6.57 (399s)
> -g -Ofast -march=la464 : 6.46 (405s)
> -g -Ofast -march=la464 -flto : 6.57 (399s)

The GCC 13.2 data I measured is similar to this. I am currently testing gcc with the AOSC patches applied.
The performance degradation on LoongArch is caused by one commit:

commit e0e9499aeffdaca88f0f29334384aa5f710a81a4 (HEAD -> trunk)
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Mar 19 12:24:08 2024 +0100

    tree-optimization/114151 - revert PR114074 fix

    The following reverts the chrec_fold_multiply fix and only keeps
    handling of constant overflow which keeps the original testcase
    fixed.  A better solution might involve ranger improvements or
    tracking of assumptions during SCEV analysis similar to what niter
    analysis does.

            PR tree-optimization/114151
            PR tree-optimization/114269
            PR tree-optimization/114322
            PR tree-optimization/114074
            * tree-chrec.cc (chrec_fold_multiply): Restrict the use of
            unsigned arithmetic when actual overflow on constant operands
            is observed.

            * gcc.dg/pr68317.c: Revert last change.

The scores before and after this patch (-g -Ofast -march=la464) are:
r14-9539: 12.3
r14-9540: 9.26
Strangely, PR114074 is a wrong-code bug (not a missed-optimization), and reverting its fix seems to improve performance on other targets...
(In reply to Xi Ruoyao from comment #17)
> Strangely PR114074 is a wrong-code (instead of missed-optimization) and
> reverting its fix seems improving performance for other targets...

This is very strange. I tried turning off reg_reg addressing on top of r14-9540, and the performance was not much different from r14-9539. Unfortunately, I still don't know why.
diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
index e7835ae34ae..6a808cb0a5c 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -2383,7 +2383,7 @@ loongarch_address_insns (rtx x, machine_mode mode, bool might_split_p)
       return factor;
 
     case ADDRESS_REG_REG:
-      return factor;
+      return factor * 3;
 
     case ADDRESS_CONST_INT:
       return lsx_p ? 0 : factor;

With this patch, -march=la464 has a score of 11.9.
However, the specific revision plan has not yet been decided.
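As a rough illustration of why tripling the cost of one addressing form can change generated code, here is a toy cost model. Everything in it is an assumption made for illustration: the names `address_cost` and `pick_address`, the "reg+offset" alternative, and the setup overheads are hypothetical and do not correspond to GCC internals. It only mirrors the shape of the patch, where the value returned for a reg+reg address goes from `factor` to `factor * 3`.

```python
# Toy model only: names and numbers are illustrative assumptions,
# not GCC internals.


def address_cost(kind, factor=1, reg_reg_multiplier=1):
    """Return a made-up instruction-count cost for an addressing form."""
    if kind == "reg+offset":
        return factor
    if kind == "reg+reg":
        return factor * reg_reg_multiplier
    raise ValueError(f"unknown address kind: {kind}")


def pick_address(extra_setup, reg_reg_multiplier):
    """Pick the cheapest form; `extra_setup` maps form -> added overhead."""
    candidates = {
        kind: address_cost(kind, reg_reg_multiplier=reg_reg_multiplier) + setup
        for kind, setup in extra_setup.items()
    }
    return min(candidates, key=candidates.get)


# Hypothetical situation: the reg+offset form carries one unit of extra
# bookkeeping overhead, while the reg+reg form carries none.
setup = {"reg+offset": 1, "reg+reg": 0}

print(pick_address(setup, reg_reg_multiplier=1))  # reg+reg
print(pick_address(setup, reg_reg_multiplier=3))  # reg+offset
```

In this toy setting the multiplier flips the choice away from reg+reg addressing. Whether the real effect in GCC comes from instruction latency or from register pressure is exactly the open question in this thread, and the sketch does not settle it.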
(In reply to chenglulu from comment #19)
> diff --git a/gcc/config/loongarch/loongarch.cc
> b/gcc/config/loongarch/loongarch.cc
> index e7835ae34ae..6a808cb0a5c 100644
> --- a/gcc/config/loongarch/loongarch.cc
> +++ b/gcc/config/loongarch/loongarch.cc
> @@ -2383,7 +2383,7 @@ loongarch_address_insns (rtx x, machine_mode mode,
> bool might_split_p)
>        return factor;
>
>      case ADDRESS_REG_REG:
> -      return factor;
> +      return factor * 3;
>
>      case ADDRESS_CONST_INT:
>        return lsx_p ? 0 : factor;
>
> With this patch, -march=la464 has a score of 11.9.
> However, the specific revision plan has not yet been decided.

That score is for r14-9540.
(In reply to chenglulu from comment #19)
> diff --git a/gcc/config/loongarch/loongarch.cc
> b/gcc/config/loongarch/loongarch.cc
> index e7835ae34ae..6a808cb0a5c 100644
> --- a/gcc/config/loongarch/loongarch.cc
> +++ b/gcc/config/loongarch/loongarch.cc
> @@ -2383,7 +2383,7 @@ loongarch_address_insns (rtx x, machine_mode mode,
> bool might_split_p)
>        return factor;
>
>      case ADDRESS_REG_REG:
> -      return factor;
> +      return factor * 3;
>
>      case ADDRESS_CONST_INT:
>        return lsx_p ? 0 : factor;
>
> With this patch, -march=la464 has a score of 11.9.
> However, the specific revision plan has not yet been decided.

Hmm, are ldx and stx really that slow?
(In reply to Xi Ruoyao from comment #21)
> (In reply to chenglulu from comment #19)
> > diff --git a/gcc/config/loongarch/loongarch.cc
> > b/gcc/config/loongarch/loongarch.cc
> > index e7835ae34ae..6a808cb0a5c 100644
> > --- a/gcc/config/loongarch/loongarch.cc
> > +++ b/gcc/config/loongarch/loongarch.cc
> > @@ -2383,7 +2383,7 @@ loongarch_address_insns (rtx x, machine_mode mode,
> > bool might_split_p)
> >        return factor;
> >
> >      case ADDRESS_REG_REG:
> > -      return factor;
> > +      return factor * 3;
> >
> >      case ADDRESS_CONST_INT:
> >        return lsx_p ? 0 : factor;
> >
> > With this patch, -march=la464 has a score of 11.9.
> > However, the specific revision plan has not yet been decided.
>
> Hmm are ldx and stx really so slow?

I think it is more likely because LDX/STX use an extra register.
GCC 14.2 is being released, retargeting bugs to GCC 14.3.