Bug 116760 - [15 Regression] 6-11% slowdown of 416.gamess on AMD Zen3 and Zen4 since r15-3509-gd34cda72098867
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 15.0
Importance: P3 normal
Target Milestone: 15.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
Reported: 2024-09-18 12:19 UTC by Filip Kastl
Modified: 2024-11-25 14:55 UTC

See Also:
Host: x86_64-pc-linux-gnu
Target: x86_64-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-11-25 00:00:00


Description Filip Kastl 2024-09-18 12:19:27 UTC
As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=467.50.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.50.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=992.50.0

there was a 6-11% exec time slowdown of the 416.gamess SPEC 2006 benchmark between commits

r15-3465-gde3ca363811a39
r15-3518-g2c4438d3915649

when run with -Ofast -march=native (optionally -flto) on AMD Zen3/4 machines.
Comment 1 Filip Kastl 2024-09-23 13:52:35 UTC
Bisected to r15-3509-gd34cda72098867.  Cc-ing richi.
Comment 2 Richard Biener 2024-09-23 14:08:23 UTC
I will eventually investigate.
Comment 3 Richard Biener 2024-11-25 11:58:09 UTC
Re-confirmed (comparing 14.2 against trunk on Zen4 with -Ofast -flto -march=native).

Samples: 1M of event 'cycles:Pu', Event count (approx.): 2401109021645                                                    
Overhead       Samples  Command          Shared Object                   Symbol                                           
  12.03%        230087  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] twotff_
  11.79%        224014  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] forms_
  11.66%        222528  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] forms_
   8.44%        160676  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] dirfck_
   8.09%        153197  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] dirfck_
   6.27%        119537  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] twotff_
   5.89%        111667  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] xyzint_
   5.21%         99376  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] xyzint_
   3.02%         57506  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] genral_
   2.36%         44702  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] genral_
   1.62%         30954  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] zqout_
   1.56%         29806  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] twoei_.constprop.2
   1.53%         29092  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] twoei_.constprop.2
   1.40%         26663  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] zqout_
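
To make the profile easier to read, here is a small sketch that pairs the peak and base sample counts per symbol and computes their ratio. The numbers are copied from the perf output above; the report does not state which binary was built with which compiler, so the sketch only compares the two binaries and makes no claim about that mapping.

```python
# (peak samples, base samples) per symbol, copied from the perf report above.
samples = {
    "twotff_": (230087, 119537),
    "forms_": (222528, 224014),
    "dirfck_": (160676, 153197),
    "xyzint_": (99376, 111667),
    "genral_": (57506, 44702),
    "zqout_": (30954, 26663),
    "twoei_.constprop.2": (29806, 29092),
}

# Ratio of peak to base samples; values near 1.0 mean no real difference.
ratios = {sym: peak / base for sym, (peak, base) in samples.items()}

# Symbol with the largest peak/base ratio.
worst = max(ratios, key=ratios.get)

for sym, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{sym:20s} peak/base = {ratio:.2f}")
```

twotff_ is the clear outlier at roughly 1.9x, while forms_ and twoei_.constprop.2 are within a few percent of 1.0.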

so the main thing is the usual suspect, the "triangular" loop

            MKL=0
            DO 10 MK=1,NOC
            DO 10 ML=1,MK
               MKL = MKL+1
               XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *               VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
               XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *               VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   10       CONTINUE     

where I previously adjusted the costing so that the loop is _not_
vectorized, but that no longer seems to work.
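
A minimal sketch of what "triangular" means for this loop nest, written in Python for illustration only. It replays the MK/ML/MKL index sequence and shows that walking the second index of a column-major (Fortran) array steps by the leading dimension. The values noc = 4, row 3 and leading dimension 10 are hypothetical and not taken from 416.gamess.

```python
def triangular_pairs(noc):
    """Yield (mk, ml, mkl) in the order the Fortran loop nest visits them (1-based)."""
    mkl = 0
    for mk in range(1, noc + 1):
        for ml in range(1, mk + 1):  # inner trip count equals mk
            mkl += 1
            yield mk, ml, mkl


def column_major_offset(i, j, ld):
    """0-based linear offset of element (i, j) in a Fortran array with leading dimension ld."""
    return (i - 1) + (j - 1) * ld


noc = 4  # hypothetical value
pairs = list(triangular_pairs(noc))
total_iterations = len(pairs)  # noc * (noc + 1) // 2

# Offsets touched by CO(MR, ML) for ML = 1..4, with hypothetical MR = 3 and ld = 10.
offsets = [column_major_offset(3, ml, 10) for ml in range(1, 5)]
print(total_iterations, offsets)
```

The inner loop runs only MK times, so early outer iterations have very short trip counts, and accesses such as CO(MR,ML) and XPQKL(MPQ,MKL) move along the second array dimension, so consecutive elements are one leading dimension apart rather than adjacent in memory.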
Comment 4 Richard Biener 2024-11-25 12:30:49 UTC
So on x86 the cost model difference 14.2 vs trunk is

-(*co_271(D))[_95] 1 times vec_construct costs 792 in body
+(*co_271(D))[_95] 1 times vec_construct costs 88 in body

and similar for

-_103 1 times vec_to_scalar costs 72 in body
+_103 1 times vec_to_scalar costs 8 in body

r15-5565-gdbc38dd9e96a99 doesn't seem to fix this yet.  The reason is
that the cost hook special-cases VMAT_ELEMENTWISE with variable stride
for non-SLP, but not VMAT_STRIDED_SLP with SLP.  With SLP we don't
get all the info we would like (how we use lvectype/ltype vs. vectype).

For GCC 15 I'm going to emulate GCC 14 behavior here by special-casing
single-lane SLP.  For the future we want to let the backend know how many
and what kind of loads we do for VMAT_STRIDED_SLP, that's something the
cost hook doesn't get us yet.
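
As a quick check of the dump lines above, the following sketch computes the ratio between the 14.2 and trunk costs. The four numbers come from the dump; the reading that the factor corresponds to "number of vector lanes plus one" for an 8-lane vector is an assumption that merely fits the arithmetic and is not stated in this report.

```python
# (GCC 14.2 cost, trunk cost) from the vectorizer cost dump above.
costs = {
    "vec_construct": (792, 88),
    "vec_to_scalar": (72, 8),
}

# How much larger the 14.2 cost is than the trunk cost.
factors = {kind: old / new for kind, (old, new) in costs.items()}

# Assumption for illustration: an 8-lane vector scaled by (lanes + 1).
assumed_lanes = 8
assumed_factor = assumed_lanes + 1

print(factors, assumed_factor)
```

Both entries differ by exactly the same factor of 9, which suggests that trunk lost a single scaling step rather than changing the base costs.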
Comment 5 GCC Commits 2024-11-25 14:54:20 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:cd8db107b9bef73fd822ffb420f96ed2bc622a19

commit r15-5651-gcd8db107b9bef73fd822ffb420f96ed2bc622a19
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Nov 25 13:32:15 2024 +0100

    target/116760 - 416.gamess slowdown with SLP
    
    For the TWOTFF loop vectorization the backend scales constructor
    and vector extract cost to make higher VFs less profitable.  This
    heuristic currently fails to consider VMAT_STRIDED_SLP which we
    now get with single-lane SLP, causing a huge regression in SPEC 2k6
    416.gamess for the respective loop nest.
    
    The following fixes this, matching behavior to that of GCC 14 by
    treating single-lane VMAT_STRIDED_SLP the same as VMAT_ELEMENTWISE.
    
            PR target/116760
            * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
            Scale vec_construct for single-lane VMAT_STRIDED_SLP the
            same as VMAT_ELEMENTWISE.
            * tree-vect-stmts.cc (vectorizable_store): Pass SLP node
            down to costing for vec_to_scalar for VMAT_STRIDED_SLP.
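
The effect described in the commit message can be illustrated with a toy cost function. This is a sketch under stated assumptions, not the code in config/i386/i386.cc: the function name, its parameters, the (nunits + 1) scale factor and the variable-stride condition on the SLP case are all hypothetical, chosen so that the numbers line up with the 88 and 792 costs quoted in comment 4.

```python
from enum import Enum, auto


class AccessType(Enum):
    # Names echo the VMAT_* kinds mentioned in the commit message.
    ELEMENTWISE = auto()
    STRIDED_SLP = auto()
    CONTIGUOUS = auto()


def toy_stmt_cost(base_cost, nunits, access, variable_stride,
                  slp_lanes=None, with_fix=True):
    """Toy model only; hypothetical, not GCC's add_stmt_cost."""
    # Pre-existing case: elementwise access with a variable stride.
    penalised = access is AccessType.ELEMENTWISE and variable_stride
    # Case added by the fix (assumed shape): single-lane strided SLP.
    if (with_fix and access is AccessType.STRIDED_SLP
            and slp_lanes == 1 and variable_stride):
        penalised = True
    # Assumed scale factor: number of vector lanes plus one.
    return base_cost * (nunits + 1) if penalised else base_cost


before = toy_stmt_cost(88, 8, AccessType.STRIDED_SLP, True,
                       slp_lanes=1, with_fix=False)
after = toy_stmt_cost(88, 8, AccessType.STRIDED_SLP, True,
                      slp_lanes=1, with_fix=True)
multi_lane = toy_stmt_cost(88, 8, AccessType.STRIDED_SLP, True,
                           slp_lanes=2, with_fix=True)
print(before, after, multi_lane)
```

In this model only the single-lane strided SLP case regains the scaled cost; multi-lane SLP is left unscaled, mirroring the commit's wording about single-lane VMAT_STRIDED_SLP.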
Comment 6 Richard Biener 2024-11-25 14:55:06 UTC
Fixed.