Bug 114978 - [14/15 regression] 548.exchange2_r 14%-28% regressions on Loongarch64 after gcc 14 snapshot 20240317
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 14.0
Importance: P3 normal
Target Milestone: 14.3
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
Reported: 2024-05-07 17:58 UTC by Chen Chen
Modified: 2024-08-01 09:40 UTC (History)

See Also:
Host:
Target: loongarch64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Chen Chen 2024-05-07 17:58:23 UTC
We tested a Loongson 3A6000 CPU (LoongArch64, "LA664" architecture) running the Linux distribution AOSC OS 11.4.0 (default gcc version 13.2.0), and found that the 548.exchange2_r benchmark from the SPEC 2017 INTrate suite regresses significantly, by 14% to 28%, under various compile options.

The rate-1 results are as follows:

Scores after snapshot 20240317 are 14.3-19.3% lower with "-g -Ofast -march=native":
13.2.0:    11.7 (223s) [gcc 13.2.0, system default]
20240317:  11.0 (237s) [gcc 14 snapshot 20240317]
20240324:  8.88 (295s) [gcc 14 snapshot 20240324]
20240430:  9.03 (290s) [gcc 14 snapshot 20240430, 14.1.0-RC]
14.1.0:    9.43 (278s) [gcc 14.1.0 release]

Scores after snapshot 20240317 are 16.5-20.8% lower with "-g -Ofast -march=native -flto":
13.2.0:    12.0 (218s)
20240317:  10.6 (248s)
20240324:  8.40 (312s)
20240430:  8.48 (309s)
14.1.0:    8.85 (296s)


Scores after snapshot 20240317 are 18-23.1% lower with "-g -Ofast -march=la664":
13.2.0:    "-march=la664" flag is not supported
20240317:  11.5 (227s)
20240324:  8.84 (296s)
20240430:  9.43 (278s)
14.1.0:    9.42 (278s)


Scores after snapshot 20240317 are 20.3-21.2% lower with "-g -Ofast -march=la664 -flto":
13.2.0:    "-march=la664" flag is not supported
20240317:  11.1 (236s)
20240324:  8.75 (299s)
20240430:  8.85 (296s)
14.1.0:    8.85 (296s)


Scores after snapshot 20240317 are 26.3-26.6% lower with "-g -Ofast -march=la464":
13.2.0:    8.76 (299s)
20240317:  12.8 (205s)
20240324:  9.39 (279s)
20240430:  9.43 (278s)
14.1.0:    9.43 (278s)


Scores after snapshot 20240317 are 26.6-28% lower with "-g -Ofast -march=la464 -flto":
13.2.0:    8.52 (307s)
20240317:  12.8 (204s)
20240324:  9.22 (284s)
20240430:  9.37 (280s)
14.1.0:    9.40 (279s)
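
For reference, each row above pairs a SPEC score with the run time in seconds. At rate-1 the score is the benchmark's reference time divided by the measured run time. A minimal sketch, assuming a reference time of 2620 s for 548.exchange2_r (this number is an assumption that is consistent with the score/time pairs listed above; it is not stated in this report), with `spec_ratio` as a hypothetical helper name:

```cpp
#include <cassert>

// ASSUMPTION: reference time for 548.exchange2_r, in seconds.
// Inferred from the score/time pairs above (2620 / 223 is about 11.7).
constexpr double kExchange2RefSeconds = 2620.0;

// Rate-1 score: reference time divided by the measured run time.
inline double spec_ratio(double run_seconds,
                         double ref_seconds = kExchange2RefSeconds) {
  assert(run_seconds > 0.0);
  return ref_seconds / run_seconds;
}
```

Under that assumption, `spec_ratio(295.0)` comes out at roughly 8.88, in line with the 20240324 row for "-g -Ofast -march=native". Small mismatches elsewhere are expected because the listed times are rounded.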


The gcc 14 snapshots and gcc 14.1.0 were built with the following configure options:

--enable-shared --enable-threads=posix --with-system-zlib --enable-gnu-indirect-function --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib --disable-werror --enable-pie --enable-checking=release --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new --enable-default-pie --enable-default-ssp --enable-bootstrap --enable-languages=c,c++,fortran,lto --with-abi=lp64d --with-arch=loongarch64 --with-tune=la664 --build=loongarch64-aosc-linux-gnu


The regression may exist on other types of CPUs as well. We did a quick test on an AMD Zen4 CPU (R9 7940HS) and found a similar but smaller regression:

The rate-1 results on x86_64 (AMD R9 7940HS) with operating system Debian 12:

Scores after snapshot 20240317 are 8.6-9.6% lower with "-m64 -g -Ofast -march=znver3":
12.2.0:    30.1 (87.0s) [gcc 12.2.0, system default]
13.2.0:    30.6 (85.7s) [gcc 13.2 release]
20240317:  31.4 (83.3s) [gcc 14 snapshot]
20240324:  28.7 (91.2s) [gcc 14 snapshot]
20240430:  28.4 (92.2s) [gcc 14 snapshot, 14.1.0-RC]

Scores after snapshot 20240317 are 10% lower with "-m64 -g -Ofast -march=znver3 -flto":
12.2.0:    29.0 (90.3s) 
13.2.0:    30.9 (84.9s) 
20240317:  32.0 (81.8s) 
20240324:  28.8 (90.9s) 
20240430:  28.8 (91.1s)

gcc 13 and gcc 14 were built with the following configure options:

--enable-shared --enable-threads=posix --with-system-zlib --enable-gnu-indirect-function --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib --disable-werror --enable-pie --enable-checking=release --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new --enable-default-pie --enable-default-ssp --enable-bootstrap --enable-languages=c,c++,fortran,lto  --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Comment 1 Sam James 2024-05-07 18:12:25 UTC
Neither of the two LoongArch maintainers seems to have an account on BZ (?), so CCing xry111. Apologies if I missed their accounts.
Comment 2 Xi Ruoyao 2024-05-07 18:13:57 UTC
I don't have access to SPEC, so I cannot confirm or refute the issue.
Comment 3 Xi Ruoyao 2024-05-07 18:17:18 UTC
Changed component to target for now.

I'm suspicious about the 10% regression on x86_64.  IIRC there are already multiple bug reports complaining about SPEC regressions of around 5% on x86_64, so I'd be really surprised if there were a real 10% regression on x86_64 that had not already been reported.
Comment 4 Xi Ruoyao 2024-05-07 18:18:00 UTC
s/suspicious/skeptical/
Comment 5 chenglulu 2024-05-08 01:41:09 UTC
I will verify it on multiple machines to see if the problem can be reproduced.
Comment 6 Richard Biener 2024-05-08 08:20:03 UTC
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=471.407.0

shows a recent improvement that then regressed again; maybe you have a similar artifact from your choice of snapshots.  Try a snapshot from February for comparison, for example.
Comment 7 Chen Chen 2024-05-08 14:41:17 UTC
(In reply to Richard Biener from comment #6)
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=471.407.0
> 
> shows a recent improvement that then regressed again, maybe you have a
> similar artifact with the choosing of the snapshots.  Try a snapshot from
> february for comparison for example.

Thanks for the nice graph!

We did more tests on the AMD Zen4 CPU R9 7940HS (system: Debian 12) and used parameters similar to yours. The results are as follows:

with parameters "-m64 -g -Ofast -march=native":
13.2.0:    30.0 (87.2s) [gcc 13.2 release]
20240121:  29.8 (88.0s) [gcc 14 snapshot]
20240218:  29.8 (88.0s) [gcc 14 snapshot]
20240303:  29.2 (89.8s) [gcc 14 snapshot]
20240310:  31.7 (82.6s) [gcc 14 snapshot]
20240317:  31.7 (82.7s) [gcc 14 snapshot]
20240324:  28.3 (92.5s) [gcc 14 snapshot]
20240430:  28.4 (92.3s) [gcc 14 snapshot, 14.1.0-RC]
14.1.0:    28.4 (92.4s) [gcc 14.1 release]

with parameters "-m64 -g -Ofast -march=native -flto":
13.2.0:    30.5 (85.8s) [gcc 13.2 release]
20240121:  30.5 (85.9s) [gcc 14 snapshot]
20240218:  29.5 (88.7s) [gcc 14 snapshot]
20240303:  30.5 (86.0s) [gcc 14 snapshot]
20240310:  31.6 (82.8s) [gcc 14 snapshot]
20240317:  31.7 (82.7s) [gcc 14 snapshot]
20240324:  28.6 (91.6s) [gcc 14 snapshot]
20240430:  29.1 (89.9s) [gcc 14 snapshot, 14.1.0-RC]
14.1.0:    29.1 (90.1s) [gcc 14.1 release]

The scores with gcc 14.1 release are 8.2-10.4% lower than those with gcc 14 snapshot 20240317, and 4.6-5.3% lower than those with gcc 13.2 release.
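
The percentages quoted above follow directly from the listed scores. A minimal sketch of the arithmetic, using a hypothetical helper named `percent_lower`:

```cpp
#include <cassert>

// Percentage by which `score` falls below `baseline`.
// A positive result means `score` is lower (slower) than `baseline`.
inline double percent_lower(double baseline, double score) {
  assert(baseline > 0.0);
  return (baseline - score) / baseline * 100.0;
}
```

For example, `percent_lower(31.7, 28.4)` is about 10.4 and `percent_lower(30.5, 29.1)` is about 4.6, which are the ends of the two ranges given above.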
Comment 8 chenglulu 2024-05-09 02:44:40 UTC
(In reply to Chen Chen from comment #0)
> We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> significant regressions from 14% to 28% with various compiling options.
> 
> The rate-1 results are following:
> 
/* snip */
> 
> after snapshot 20240317 score 18-23.1% lower with parameters "-g -Ofast
> -march=la664":       
> 13.2.0:    "-march=la664" flag is not supported
> 20240317:  11.5 (227s)
> 20240324:  8.84 (296s)
> 20240430:  9.43 (278s)
> 14.1.0:    9.42 (278s)
> 
/* snip */
> 
> 
> after snapshot 20240317 score 26.3-26.6% lower with parameters "-g -Ofast
> -march=la464":       
> 13.2.0:    8.76 (299s)
> 20240317:  12.8 (205s)
> 20240324:  9.39 (279s)
> 20240430:  9.43 (278s)
> 14.1.0:    9.43 (278s)
> 
> 

> 20240317:  11.5 (227s) -march=la664
> 20240317:  12.8 (205s) -march=la464
I looked into the reason for the gap between the two results above. The performance regression is caused by r14-6814. With the following modification, the scores of -march=la664 and -march=la464 are the same.

diff --git a/gcc/config/loongarch/loongarch-def.cc b/gcc/config/loongarch/loongarch-def.cc
index e8c129ce643..f27284cb20a 100644
--- a/gcc/config/loongarch/loongarch-def.cc
+++ b/gcc/config/loongarch/loongarch-def.cc
@@ -111,11 +111,7 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
  tune targets (i.e. -mtune=native while PRID does not correspond to
  any known "-mtune" type).  */
 array_tune<loongarch_rtx_cost_data> loongarch_cpu_rtx_cost_data =
-  array_tune<loongarch_rtx_cost_data> ()
-    .set (CPU_LA664,
-         loongarch_rtx_cost_data ()
-           .movcf2gr_ (COSTS_N_INSNS (1))
-           .movgr2cf_ (COSTS_N_INSNS (1)));
+  array_tune<loongarch_rtx_cost_data> ();
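
For readers unfamiliar with the pattern in this diff, here is a simplified, self-contained sketch of chained setters that override the cost entries for one CPU. Every type, enumerator, and default value below is a stand-in chosen for illustration and is not GCC's actual definition; the only part taken from GCC is that `COSTS_N_INSNS (N)` expands to `N * 4`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Same arithmetic as GCC's COSTS_N_INSNS macro: N * 4.
constexpr int costs_n_insns(int n) { return n * 4; }

// Hypothetical CPU indices, for illustration only.
enum Cpu : std::size_t { CPU_GENERIC, CPU_LA464, CPU_LA664, N_CPUS };

struct CostData {
  // Placeholder defaults; NOT taken from GCC.
  int movcf2gr = costs_n_insns(7);
  int movgr2cf = costs_n_insns(15);

  // Each setter returns a modified copy so that calls can be chained.
  CostData movcf2gr_(int cost) const {
    CostData r = *this;
    r.movcf2gr = cost;
    return r;
  }
  CostData movgr2cf_(int cost) const {
    CostData r = *this;
    r.movgr2cf = cost;
    return r;
  }
};

struct TuneArray {
  std::array<CostData, N_CPUS> data{};

  // Returns a copy of the table with the entry for `cpu` replaced.
  TuneArray set(Cpu cpu, const CostData& value) const {
    assert(cpu < N_CPUS);
    TuneArray r = *this;
    r.data[cpu] = value;
    return r;
  }
};

// Shape of the table before the experiment: LA664 gets one-instruction costs.
inline TuneArray la664_overridden() {
  return TuneArray().set(CPU_LA664, CostData()
                                        .movcf2gr_(costs_n_insns(1))
                                        .movgr2cf_(costs_n_insns(1)));
}

// Shape of the table after the experiment: every CPU keeps the defaults.
inline TuneArray defaults_only() { return TuneArray(); }
```

In this model the experimental change amounts to switching from `la664_overridden()` to `defaults_only()`, so the LA664 entry falls back to the same costs as every other CPU.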
Comment 9 Xi Ruoyao 2024-05-09 03:36:59 UTC
(In reply to chenglulu from comment #8)

> diff --git a/gcc/config/loongarch/loongarch-def.cc
> b/gcc/config/loongarch/loongarch-def.cc
> index e8c129ce643..f27284cb20a 100644
> --- a/gcc/config/loongarch/loongarch-def.cc
> +++ b/gcc/config/loongarch/loongarch-def.cc
> @@ -111,11 +111,7 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
>   tune targets (i.e. -mtune=native while PRID does not correspond to
>   any known "-mtune" type).  */
>  array_tune<loongarch_rtx_cost_data> loongarch_cpu_rtx_cost_data =
> -  array_tune<loongarch_rtx_cost_data> ()
> -    .set (CPU_LA664,
> -         loongarch_rtx_cost_data ()
> -           .movcf2gr_ (COSTS_N_INSNS (1))
> -           .movgr2cf_ (COSTS_N_INSNS (1)));
> +  array_tune<loongarch_rtx_cost_data> ();

But why?  Isn't movcf2gr and movgr2cf one-cycle on LA664?
Comment 10 chenglulu 2024-05-09 03:40:49 UTC
(In reply to Xi Ruoyao from comment #9)
> (In reply to chenglulu from comment #8)
> 
> > diff --git a/gcc/config/loongarch/loongarch-def.cc
> > b/gcc/config/loongarch/loongarch-def.cc
> > index e8c129ce643..f27284cb20a 100644
> > --- a/gcc/config/loongarch/loongarch-def.cc
> > +++ b/gcc/config/loongarch/loongarch-def.cc
> > @@ -111,11 +111,7 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
> >   tune targets (i.e. -mtune=native while PRID does not correspond to
> >   any known "-mtune" type).  */
> >  array_tune<loongarch_rtx_cost_data> loongarch_cpu_rtx_cost_data =
> > -  array_tune<loongarch_rtx_cost_data> ()
> > -    .set (CPU_LA664,
> > -         loongarch_rtx_cost_data ()
> > -           .movcf2gr_ (COSTS_N_INSNS (1))
> > -           .movgr2cf_ (COSTS_N_INSNS (1)));
> > +  array_tune<loongarch_rtx_cost_data> ();
> 
> But why?  Isn't movcf2gr and movgr2cf one-cycle on LA664?

I think this is weird too. I'm still testing other situations, and I'll find out the reason after the testing is completed.
Comment 11 chenglulu 2024-05-09 11:09:28 UTC
(In reply to Chen Chen from comment #0)
> We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> significant regressions from 14% to 28% with various compiling options.
> 
> The rate-1 results are following:
> 
> after snapshot 20240317 score 14.3-19.3% lower with parameters "-g -Ofast
> -march=native":
> 13.2.0:    11.7 (223s) [gcc 13.2.0, system default]
Hi:

 I can't reproduce the score of r13.2. Have you made any modifications there?
Comment 12 Chen Chen 2024-05-09 12:57:16 UTC
(In reply to chenglulu from comment #11)
> (In reply to Chen Chen from comment #0)
> > We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> > operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> > found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> > significant regressions from 14% to 28% with various compiling options.
> > 
> > The rate-1 results are following:
> > 
> > after snapshot 20240317 score 14.3-19.3% lower with parameters "-g -Ofast
> > -march=native":
> > 13.2.0:    11.7 (223s) [gcc 13.2.0, system default]
> Hi:
> 
>  I can't reproduce the score of r13.2. Have you made any modifications there?

No, I used the system default gcc. How big is the difference? A little fluctuation is normal: in previous tests I also got scores of 11.3 (232s) and 11.5 (227s) with "-g -Ofast -march=native". For fairness, before each test presented above I free the page cache with "echo 3 > /proc/sys/vm/drop_caches" and then run the test.
Comment 13 Xi Ruoyao 2024-05-09 13:09:43 UTC
(In reply to Chen Chen from comment #12)

> No. I used system default gcc.

AOSC backports *many* changes not in upstream GCC 13.2 to their "13.2": https://github.com/AOSC-Dev/aosc-os-abbs/tree/stable/core-devel/gcc/01-runtime/patches

So the default GCC is simply not GCC 13.2.
Comment 14 Chen Chen 2024-05-10 03:12:43 UTC
(In reply to Xi Ruoyao from comment #13)
> (In reply to Chen Chen from comment #12)
> 
> > No. I used system default gcc.
> 
> AOSC backports *many* changes not in upstream GCC 13.2 to their "13.2":
> https://github.com/AOSC-Dev/aosc-os-abbs/tree/stable/core-devel/gcc/01-
> runtime/patches
> 
> So the default GCC is simply not GCC 13.2.

You are correct. The 13.2 results above should be labeled "AOSC system default gcc 13.2". On AOSC OS I rebuilt the official gcc 13.2 source with the same configure options, except that "--with-tune=la664" was changed to "--with-tune=la464" because gcc 13.2 does not support the "LA664" architecture. The results with official gcc 13.2 are as follows:

-g -Ofast -march=native      : 6.54 (400s)
-g -Ofast -march=native -flto: 6.57 (399s)
-g -Ofast -march=la464       : 6.46 (405s)
-g -Ofast -march=la464 -flto : 6.57 (399s)
Comment 15 chenglulu 2024-05-10 03:15:04 UTC
(In reply to Chen Chen from comment #14)
> (In reply to Xi Ruoyao from comment #13)
> > (In reply to Chen Chen from comment #12)
> > 
> > > No. I used system default gcc.
> > 
> > AOSC backports *many* changes not in upstream GCC 13.2 to their "13.2":
> > https://github.com/AOSC-Dev/aosc-os-abbs/tree/stable/core-devel/gcc/01-
> > runtime/patches
> > 
> > So the default GCC is simply not GCC 13.2.
> 
> You are correct. The above 13.2 results should be "AOSC system default gcc
> 13.2" results. Under AOSC system I recompiled official gcc 13.2 source with
> the same parameters except for "--with-tune=la664" (changed to
> "--with-tune=la464" since gcc 13.2 does not support "LA664" architecture).
> The test results from official gcc 13.2 are following:
> 
> -g -Ofast -march=native      : 6.54 (400s)
> -g -Ofast -march=native -flto: 6.57 (399s)
> -g -Ofast -march=la464       : 6.46 (405s)
> -g -Ofast -march=la464 -flto : 6.57 (399s)

The results I got with 13.2 are similar to these. I am currently testing gcc with the AOSC patches.
Comment 16 chenglulu 2024-05-15 01:24:35 UTC
The performance degradation on LoongArch is caused by one commit:

commit e0e9499aeffdaca88f0f29334384aa5f710a81a4 (HEAD -> trunk)
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Mar 19 12:24:08 2024 +0100

    tree-optimization/114151 - revert PR114074 fix
    
    The following reverts the chrec_fold_multiply fix and only keeps
    handling of constant overflow which keeps the original testcase
    fixed.  A better solution might involve ranger improvements or
    tracking of assumptions during SCEV analysis similar to what niter
    analysis does.
    
            PR tree-optimization/114151
            PR tree-optimization/114269
            PR tree-optimization/114322
            PR tree-optimization/114074
            * tree-chrec.cc (chrec_fold_multiply): Restrict the use of
            unsigned arithmetic when actual overflow on constant operands
            is observed.
    
            * gcc.dg/pr68317.c: Revert last change.
The scores before and after this patch are:
(-g -Ofast -march=la464)
r14-9539: 12.3
r14-9540: 9.26
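
The commit message above says that unsigned arithmetic is now used only when overflow is actually observed on constant operands. As a loose, standalone illustration of that idea (this is not GCC's `chrec_fold_multiply`; the struct and function names are hypothetical), the sketch below multiplies two 32-bit constants in signed arithmetic when that is safe and falls back to wrapping unsigned arithmetic only when the signed product would overflow. It relies on `__builtin_mul_overflow`, which GCC and Clang provide.

```cpp
#include <cassert>
#include <cstdint>

struct FoldResult {
  bool used_unsigned;   // true if the unsigned fallback was needed
  std::int32_t value;   // product, wrapped modulo 2^32 on overflow
};

// Hypothetical helper: fold a * b for two constants.
inline FoldResult fold_multiply_constants(std::int32_t a, std::int32_t b) {
  std::int32_t signed_product;
  // The builtin returns false when the signed product fits.
  if (!__builtin_mul_overflow(a, b, &signed_product))
    return {false, signed_product};

  // Overflow observed: redo the multiply in unsigned arithmetic, where
  // wrap-around is well defined. The conversion back to a signed type is
  // modular since C++20, and GCC behaves that way in earlier modes too.
  std::uint32_t wrapped =
      static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
  return {true, static_cast<std::int32_t>(wrapped)};
}
```

The point of the restriction is that the common, non-overflowing case stays in signed arithmetic, so later analyses can keep relying on the no-overflow assumption there.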
Comment 17 Xi Ruoyao 2024-05-15 03:41:12 UTC
Strangely, PR114074 is a wrong-code bug (not a missed-optimization), and reverting its fix seems to improve performance on other targets...
Comment 18 chenglulu 2024-05-15 03:57:27 UTC
(In reply to Xi Ruoyao from comment #17)
> Strangely PR114074 is a wrong-code (instead of missed-optimization) and
> reverting its fix seems improving performance for other targets...

This is very strange. I tried turning off reg_reg addressing on top of r14-9540, and the performance was not much different from r14-9539. Unfortunately, I still don't know why.
Comment 19 chenglulu 2024-05-21 12:46:41 UTC
diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
index e7835ae34ae..6a808cb0a5c 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -2383,7 +2383,7 @@ loongarch_address_insns (rtx x, machine_mode mode, bool might_split_p)
        return factor;
 
       case ADDRESS_REG_REG:
-       return factor;
+       return factor * 3;
 
       case ADDRESS_CONST_INT:
        return lsx_p ? 0 : factor;

With this patch, -march=la464 has a score of 11.9.
However, the final form of the fix has not been decided yet.
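
A rough way to see what this one-line change does: the function reports how many instructions an address form is assumed to need, and cost-driven passes compare such numbers when choosing between forms. The sketch below is a toy model and not GCC's code. The enum, the `BaseOffset` name (standing in for the case whose label is not visible in the diff), and the `reg_reg_scale` parameter are all hypothetical.

```cpp
#include <cassert>

enum class AddressKind { BaseOffset, RegReg, ConstInt };

// Toy counterpart of the switch in the diff. `reg_reg_scale` is 1 for the
// original code and 3 for the experimental patch.
inline int address_insns(AddressKind kind, int factor, bool lsx_p,
                         int reg_reg_scale) {
  assert(factor >= 0 && reg_reg_scale >= 1);
  switch (kind) {
    case AddressKind::BaseOffset:
      return factor;
    case AddressKind::RegReg:
      return factor * reg_reg_scale;
    case AddressKind::ConstInt:
      return lsx_p ? 0 : factor;
  }
  return 0;
}

// True if reg+reg addressing looks strictly more expensive than
// base+offset addressing under this model.
inline bool reg_reg_penalized(int factor, int reg_reg_scale) {
  return address_insns(AddressKind::RegReg, factor, false, reg_reg_scale) >
         address_insns(AddressKind::BaseOffset, factor, false, reg_reg_scale);
}
```

With a scale of 1 the two forms tie; with a scale of 3, reg+reg looks more expensive, which would tend to steer cost-based decisions away from it. That is in line with the observation in comment 18 that turning off reg_reg addressing recovered the performance.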
Comment 20 chenglulu 2024-05-21 12:47:51 UTC
(In reply to chenglulu from comment #19)
> diff --git a/gcc/config/loongarch/loongarch.cc
> b/gcc/config/loongarch/loongarch.cc
> index e7835ae34ae..6a808cb0a5c 100644
> --- a/gcc/config/loongarch/loongarch.cc
> +++ b/gcc/config/loongarch/loongarch.cc
> @@ -2383,7 +2383,7 @@ loongarch_address_insns (rtx x, machine_mode mode,
> bool might_split_p)
>         return factor;
>  
>        case ADDRESS_REG_REG:
> -       return factor;
> +       return factor * 3;
>  
>        case ADDRESS_CONST_INT:
>         return lsx_p ? 0 : factor;
> 
> With this patch, -march=la464 has a score of 11.9.
> However, the specific revision plan has not yet been decided.

This is the score with the patch applied on r14-9540.
Comment 21 Xi Ruoyao 2024-05-21 13:01:16 UTC
(In reply to chenglulu from comment #19)
> diff --git a/gcc/config/loongarch/loongarch.cc
> b/gcc/config/loongarch/loongarch.cc
> index e7835ae34ae..6a808cb0a5c 100644
> --- a/gcc/config/loongarch/loongarch.cc
> +++ b/gcc/config/loongarch/loongarch.cc
> @@ -2383,7 +2383,7 @@ loongarch_address_insns (rtx x, machine_mode mode,
> bool might_split_p)
>         return factor;
>  
>        case ADDRESS_REG_REG:
> -       return factor;
> +       return factor * 3;
>  
>        case ADDRESS_CONST_INT:
>         return lsx_p ? 0 : factor;
> 
> With this patch, -march=la464 has a score of 11.9.
> However, the specific revision plan has not yet been decided.

Hmm, are ldx and stx really that slow?
Comment 22 chenglulu 2024-05-21 13:04:17 UTC
(In reply to Xi Ruoyao from comment #21)
> (In reply to chenglulu from comment #19)
> > diff --git a/gcc/config/loongarch/loongarch.cc
> > b/gcc/config/loongarch/loongarch.cc
> > index e7835ae34ae..6a808cb0a5c 100644
> > --- a/gcc/config/loongarch/loongarch.cc
> > +++ b/gcc/config/loongarch/loongarch.cc
> > @@ -2383,7 +2383,7 @@ loongarch_address_insns (rtx x, machine_mode mode,
> > bool might_split_p)
> >         return factor;
> >  
> >        case ADDRESS_REG_REG:
> > -       return factor;
> > +       return factor * 3;
> >  
> >        case ADDRESS_CONST_INT:
> >         return lsx_p ? 0 : factor;
> > 
> > With this patch, -march=la464 has a score of 11.9.
> > However, the specific revision plan has not yet been decided.
> 
> Hmm are ldx and stx really so slow?

I think it is more likely because LDX/STX use an extra register.
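
To illustrate the register-pressure point in source terms: a reg+reg access such as ldx/stx needs both a base register and an index register to be live, while a base-plus-immediate access needs only the base. The two hypothetical functions below compute the same sum, the first written in an indexed style and the second in a pointer-increment style. This only shows the two addressing styles; an optimizing compiler is free to turn either form into the other, so it is not a prediction of the generated code.

```cpp
#include <cassert>
#include <cstddef>

// Indexed style: each access is "base + i", which maps naturally onto
// reg+reg addressing and keeps both `base` and `i` live in the loop.
inline long sum_indexed(const int* base, std::size_t n) {
  assert(base != nullptr || n == 0);
  long total = 0;
  for (std::size_t i = 0; i < n; ++i)
    total += base[i];
  return total;
}

// Pointer-increment style: each access is "*p", which maps naturally
// onto base-plus-immediate addressing with a single address register.
inline long sum_pointer(const int* base, std::size_t n) {
  assert(base != nullptr || n == 0);
  long total = 0;
  for (const int *p = base, *end = base + n; p != end; ++p)
    total += *p;
  return total;
}
```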
Comment 23 Jakub Jelinek 2024-08-01 09:40:21 UTC
GCC 14.2 is being released, retargeting bugs to GCC 14.3.