Bug 102750 - [12 Regression] 433.milc regressed by 10% on AMD zen2 at -Ofast -march=native -flto after r12-3893-g6390c5047adb75
Summary: [12 Regression] 433.milc regressed by 10% on AMD zen2 at -Ofast -march=native...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 12.0
Importance: P3 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
 
Reported: 2021-10-14 16:50 UTC by Martin Jambor
Modified: 2022-03-23 08:51 UTC (History)
4 users

See Also:
Host: x86_64-linux
Target: x86_64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-10-15 00:00:00


Attachments
testcase (925 bytes, text/plain)
2021-10-15 12:49 UTC, Richard Biener

Description Martin Jambor 2021-10-14 16:50:25 UTC
I have bisected an AMD zen2 10% performance regression of SPEC 2006 FP
433.milc benchmark when compiled with -Ofast -march=native -flto to
commit r12-3893-g6390c5047adb75.  See also:

  https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=412.70.0&plot.1=289.70.0&

As Richi asked, I am filing this bug even though I cannot reproduce the
regression on either an AMD zen3 machine or on Intel CascadeLake, both
because of the history of the benchmark's performance and because I know
milc can be sensitive to conditions outside our control.  OTOH, the
regression reproduces reliably for me.

Some relevant perf data:

BEFORE:
# Samples: 585K of event 'cycles:u'
# Event count (approx.): 472738682838
#
# Overhead       Samples  Command          Shared Object           Symbol
# ........  ............  ...............  ......................  .........................................
# 
    24.59%        140397  milc_peak.mine-  milc_peak.mine-lto-nat  [.] u_shift_fermion
    18.47%        105497  milc_peak.mine-  milc_peak.mine-lto-nat  [.] add_force_to_mom
    15.97%         96343  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_na
    15.29%         90027  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_nn
     5.55%         35114  milc_peak.mine-  milc_peak.mine-lto-nat  [.] path_product
     4.75%         27693  milc_peak.mine-  milc_peak.mine-lto-nat  [.] compute_gen_staple
     2.76%         16109  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_an
     2.42%         14255  milc_peak.mine-  milc_peak.mine-lto-nat  [.] imp_gauge_force.constprop.0
     2.02%         11561  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_adj_su3_mat_4vec

AFTER:
# Samples: 634K of event 'cycles:u'
# Event count (approx.): 513635733685
#
# Overhead       Samples  Command          Shared Object           Symbol                                   
# ........  ............  ...............  ......................  .........................................
#
    24.04%        149010  milc_peak.mine-  milc_peak.mine-lto-nat  [.] add_force_to_mom
    23.76%        147370  milc_peak.mine-  milc_peak.mine-lto-nat  [.] u_shift_fermion
    14.19%         90929  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_nn
    14.14%         92912  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_na
     4.90%         33846  milc_peak.mine-  milc_peak.mine-lto-nat  [.] path_product
     3.89%         24621  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_an
     3.62%         22831  milc_peak.mine-  milc_peak.mine-lto-nat  [.] compute_gen_staple
     2.05%         13215  milc_peak.mine-  milc_peak.mine-lto-nat  [.] imp_gauge_force.constprop.0
Comment 1 Richard Biener 2021-10-15 12:12:47 UTC
So for some unknown reason the more vectorized version of the function is slower.

Note that all BB vectorization in this function happens when triggered from
the loop vectorizer on the if-converted loop body after loop vectorization failed.  One difference is

-make_ahmat.c:37:13: missed:   desired vector type conflicts with earlier one for _334 = _5->c[0].real;
-make_ahmat.c:37:13: note:  removing SLP instance operations starting from: MEM <struct site> [(struct anti_hermitmat *)s_339].mom[dir_8].m00im = _45;
+make_ahmat.c:37:13: note:   vect_compute_data_ref_alignment:
+make_ahmat.c:37:13: note:   can't force alignment of ref: MEM <struct site> [(struct anti_hermitmat *)s_339].mom[dir_8].m00im

and the extra vectorization has live lanes that we think we cannot reliably
place:

+make_ahmat.c:37:13: missed:   Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed:   Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed:   Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed:   Cannot determine insertion place for lane extract

that will cause the scalar definitions to be retained (but they will then
not be costed as removed), possibly causing redundant computations and
resource competition.

+su3_proj.c:44:24: note: Cost model analysis for part in loop 1:
+  Vector cost: 776
+  Scalar cost: 836

and before the change for a much smaller portion of the block:

make_ahmat.c:40:60: note: Cost model analysis for part in loop 1:
  Vector cost: 224
  Scalar cost: 328

the scalar/vector ratio was better there.
Comment 2 Richard Biener 2021-10-15 12:49:52 UTC
Created attachment 51609 [details]
testcase

This testcase reproduces the vectorization difference with REV and REV^ when
using -Ofast -march=znver2 [-fno-early-inlining].  Trunk still shows similar
behavior.

The issue is likely code like

  vect__325.61_64 = MEM <vector(2) double> [(double *)_80];
  vect__325.62_60 = MEM <vector(2) double> [(double *)_80 + 16B];
  vect__325.64_53 = VEC_PERM_EXPR <vect__325.61_64, vect__325.62_60, { 0, 2 }>;
  _52 = BIT_FIELD_REF <vect__325.62_60, 64, 0>;

where it would have been better to emit two scalar loads and combine them
with a CTOR (or in asm use movsd + movhpd).  But we end up with

.L5:
        vmovupd (%rcx), %xmm3
        vmovupd 16(%rcx), %xmm2
...
        addq    $48, %rcx
...
        vunpckhpd       %xmm2, %xmm3, %xmm7
...
        vfnmadd132sd    -48(%rcx), %xmm8, %xmm15

also scattered around (but that's not GIMPLE's fault).
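To make the suggested alternative concrete, here is a minimal standalone sketch using SSE2 intrinsics; the function names and the { re, im, re, im } layout are purely illustrative and not taken from milc or the bug, but the scalar-load form maps to exactly the movsd + movhpd pair mentioned above:

```c
#include <emmintrin.h>  /* SSE2 */
#include <assert.h>

/* Hypothetical illustration: collect p[0] and p[2], e.g. the .real
   fields of two adjacent complex doubles.  */

/* Permute form: two 16-byte loads plus a shuffle, roughly what the
   vectorizer emits (vmovupd x2 + vunpck*pd).  */
static __m128d gather_permute(const double *p)
{
    __m128d lo = _mm_loadu_pd(p);      /* { p[0], p[1] } */
    __m128d hi = _mm_loadu_pd(p + 2);  /* { p[2], p[3] } */
    return _mm_unpacklo_pd(lo, hi);    /* { p[0], p[2] } */
}

/* Scalar-load form: movsd + movhpd, as suggested above.  */
static __m128d gather_scalar(const double *p)
{
    __m128d v = _mm_load_sd(p);        /* movsd:  { p[0], 0.0  } */
    return _mm_loadh_pd(v, p + 2);     /* movhpd: { p[0], p[2] } */
}
```

Both forms produce the same vector; the difference is only in how many loads and shuffles the CPU has to execute per iteration.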

The vectorizer generates

  vectp.60_65 = &_6->c[0].real;
  vect__325.61_64 = MEM <vector(2) double> [(double *)vectp.60_65];
  vectp.60_61 = vectp.60_65 + 16;
  vect__325.62_60 = MEM <vector(2) double> [(double *)vectp.60_61];
  vectp.60_57 = vectp.60_65 + 32;
  vect__325.64_53 = VEC_PERM_EXPR <vect__325.61_64, vect__325.62_60, { 0, 2 }>;
  _52 = BIT_FIELD_REF <vect__325.64_53, 64, 64>;

which is of course entirely reasonable in some sense (that's 12 + 12 + 4 + 4
cost; two scalar loads plus a CTOR would cost 12 + 12 + 8, but we'd likely
still generate and cost the scalar extract).  Ultimately it would be better
to pattern-match the permute and demote the vector loads to scalar ones...

We also end up with

  vect__259.20_179 = .FNMA (_182, vect__326.17_185, _196);
  vect__368.19_180 = .FMA (_182, vect__326.17_185, _196);
  _178 = VEC_PERM_EXPR <vect__368.19_180, vect__259.20_179, { 0, 5, 2, 7 }>;

and unhandled

add_force_to_mom.c:72:60: note:   node 0x3ed37e0 (max_nunits=4, refcnt=2)
add_force_to_mom.c:72:60: note:   op template: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 0 _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 1 _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 2 _399 = _6->c[2].real;
add_force_to_mom.c:72:60: note:         stmt 3 _399 = _6->c[2].real;
add_force_to_mom.c:72:60: note:         load permutation { 2 2 4 4 }
....
add_force_to_mom.c:72:60: note:   ==> examining statement: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: missed:   BB vectorization with gaps at the end of a load is not supported
add_force_to_mom.c:53:16: missed:   not vectorized: relevant stmt not supported: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:   Building vector operands of 0x3ed37e0 from scalars instead
Comment 3 Richard Biener 2022-01-19 08:45:49 UTC
So when we have BB vectorization and costs like

+su3_proj.c:44:24: note: Cost model analysis for part in loop 1:
+  Vector cost: 776
+  Scalar cost: 836

we decide it's worthwhile to vectorize.  But that does not account for the
fact that the scalar code represents multiple vector lanes possibly
executing in parallel, while the vector code, by combining the lanes, makes
them data dependent.  So, considering a VF of at least 2, the scalar code
could in the ideal case run with a latency of 836/2 while the vector code
would have a latency of 776.  Of course the cost number we compute doesn't
really represent overall latency; I will see if we can improve that for GCC 13.

What I'm after is that it's maybe not a good idea to compare scalar vs.
vector cost for BB vectorization the way we do.  Without any info on the
dependences between the individually costed stmts it's hard to do better,
of course, though the target could try simple adjustments based on
available issue width and execution resources: for example, assume that
two scalar ops can always issue & execute in parallel and thus divide the
scalar cost by two (for BB vectorization only, or for scalar stmts
participating in SLP for loop vect, something not readily available
though).  There's always the possibility to special-case more constrained
resources (integer/fp dividers, stores or shifts), but that already gets a
bit complicated because we cost scalar stmts independently and thus the
code doesn't know how many of them will be combined into a single lane.
The alternative is to do the biasing on the vector side, but there you do
not know the original scalar operation involved (consider patterns).

So it might not be easy to adjust the current heuristics in a good way.
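The issue-width adjustment described above could look roughly like the following. This is not GCC code, just a hedged sketch of dividing the scalar cost by an assumed issue width before the comparison; all names are made up:

```c
#include <stdbool.h>
#include <assert.h>

/* Illustrative only: decide whether BB (SLP) vectorization looks
   profitable when we assume `issue_width` scalar stmts can issue and
   execute in parallel.  Rounds up so a lone scalar stmt is never
   counted as free.  */
static bool slp_profitable_p(unsigned vector_cost, unsigned scalar_cost,
                             unsigned issue_width)
{
    unsigned adjusted_scalar = (scalar_cost + issue_width - 1) / issue_width;
    return vector_cost < adjusted_scalar;
}
```

With the numbers from this comment, the plain comparison (width 1) says vectorize, since 776 < 836, while an assumed issue width of 2 compares 776 against 836/2 = 418 and rejects the vectorization.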
Comment 4 Richard Biener 2022-03-23 08:51:06 UTC
It's back to good performance.