Bug 112697 - [14 Regression] 30-40% exec time regression of 433.milc on zen2 since r14-4972-g8aa47713701b1f
Summary: [14 Regression] 30-40% exec time regression of 433.milc on zen2 since r14-497...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 14.0
: P2 normal
Target Milestone: 14.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
  Show dependency treegraph
 
Reported: 2023-11-24 09:03 UTC by Filip Kastl
Modified: 2024-03-22 12:47 UTC (History)
5 users (show)

See Also:
Host: x86_64-linux
Target: x86_64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Perf annotate of milc built with r14-4971-g0beb1611754742 (27.83 KB, text/plain)
2023-11-29 13:52 UTC, Martin Jambor
Details
Perf annotate of milc built with r14-4972-g8aa47713701b1f (28.06 KB, text/plain)
2023-11-29 13:53 UTC, Martin Jambor
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Filip Kastl 2023-11-24 09:03:48 UTC
Between commits

g:0c305f3dec9a992dd775a3b9607b7b1e8c051859
g:7f974c5fd4f59a9d8dd20c49a0e2909cb290f4b4

there's a 30-40% slowdown of the 433.milc 2006SPEC benchmark as seen here:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=292.70.0

It's 40% here and I also measured the slowdown on another zen2 machine where the difference was 30%.

Compile options:

-g -O2 -flto=128
Comment 1 Filip Kastl 2023-11-24 09:07:48 UTC
Around the same time there was also a 6% slowdown with
-Ofast -march=native -g and PGO
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=299.70.0

and an 11% slowdown with
-O2 -g -flto=128 and PGO
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=685.70.0
Comment 2 Richard Biener 2023-11-24 09:51:50 UTC
This big jump has been seen in the past, I wonder if it's one of those micro-arch hazards regarding alignment.  The only "generic" change with possibly
ripple-down effects is r14-4965-ga5e69e94591ae2.
Comment 3 Martin Jambor 2023-11-24 17:45:19 UTC
I can reliably bisect this to r14-4972-g8aa47713701b1f (Vladimir's [RA]: Add cost calculation for reg equivalence invariants) on a similar zen2 machine.  But it seems zen2 specific, I did not see any performance difference (this is generic march/tuning) on znver4, for example.  So it may be quite hard to analyze and fix, even though the regression is big :-/
Comment 4 Sam James 2023-11-24 17:52:18 UTC
I can probably find a znver2 machine for someone to work on if it's needed, but that's obviously not going to be the hardest part here...
Comment 5 Alexander Monakov 2023-11-27 11:58:53 UTC
Martin, if you still have the binaries, would you mind sharing perf profiles?
You can produce plain-text reports with 'perf report --stdio' and 'perf annotate --stdio'.
Comment 6 Martin Jambor 2023-11-29 13:52:37 UTC
Created attachment 56719 [details]
Perf annotate of milc built with r14-4971-g0beb1611754742

commit r14-4971-g0beb1611754742:

$ perf stat taskset -c 0 specinvoke

 Performance counter stats for 'taskset -c 0 specinvoke':

         216908.59 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
            889694      page-faults:u                    #    4.102 K/sec                     
      697007650237      cycles:u                         #    3.213 GHz                         (83.33%)
       31999772966      stalled-cycles-frontend:u        #    4.59% frontend cycles idle        (83.33%)
      540485725923      stalled-cycles-backend:u         #   77.54% backend cycles idle         (83.33%)
     1061256162815      instructions:u                   #    1.52  insn per cycle            
                                                  #    0.51  stalled cycles per insn     (83.33%)
       58760648879      branches:u                       #  270.901 M/sec                       (83.34%)
          11890202      branch-misses:u                  #    0.02% of all branches             (83.33%)

     216.935387643 seconds time elapsed

     211.436079000 seconds user
       5.472459000 seconds sys

$ perf record taskset -c 0 specinvoke
[ perf record: Woken up 132 times to write data ]
[ perf record: Captured and wrote 32.901 MB perf.data (862286 samples) ]


$ perf report -n --percent-limit=1 --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 862K of event 'cycles:Pu'
# Event count (approx.): 695776598661
#
# Overhead       Samples  Command          Shared Object           Symbol                                
# ........  ............  ...............  ......................  ......................................
#
    22.68%        197003  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_su3_na
    20.99%        177912  milc_base.mine-  milc_base.mine-lto-gen  [.] u_shift_fermion
    19.04%        163787  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_su3_nn
     6.85%         58509  milc_base.mine-  milc_base.mine-lto-gen  [.] scalar_mult_add_su3_matrix
     5.51%         50953  milc_base.mine-  milc_base.mine-lto-gen  [.] path_product
     5.40%         46083  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_su3_an
     4.22%         35853  milc_base.mine-  milc_base.mine-lto-gen  [.] add_force_to_mom
     3.77%         32446  milc_base.mine-  milc_base.mine-lto-gen  [.] imp_gauge_force.constprop.0
     1.98%         16848  milc_base.mine-  milc_base.mine-lto-gen  [.] compute_gen_staple
     1.94%         16462  milc_base.mine-  milc_base.mine-lto-gen  [.] make_anti_hermitian
     1.73%         14655  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_su3_mat_vec_sum_4dir
     1.35%         11472  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_adj_su3_mat_4vec
     1.27%         10801  milc_base.mine-  libc.so.6               [.] __memset_avx2_unaligned_erms


$ perf annotate -n --percent-limit=1 > ~/tmp/milc-perf-annotate-0beb1611754  (gzipeped and attached)
Comment 7 Martin Jambor 2023-11-29 13:53:59 UTC
Created attachment 56720 [details]
Perf annotate of milc built with r14-4972-g8aa47713701b1f

commit r14-4972-g8aa47713701b1f:

$ perf stat taskset -c 0 specinvoke

 Performance counter stats for 'taskset -c 0 specinvoke':

         272931.43 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
            472353      page-faults:u                    #    1.731 K/sec                     
      886165387570      cycles:u                         #    3.247 GHz                         (83.33%)
       31546898034      stalled-cycles-frontend:u        #    3.56% frontend cycles idle        (83.33%)
      729878095777      stalled-cycles-backend:u         #   82.36% backend cycles idle         (83.33%)
     1061779557370      instructions:u                   #    1.20  insn per cycle            
                                                  #    0.69  stalled cycles per insn     (83.33%)
       58797121078      branches:u                       #  215.428 M/sec                       (83.33%)
           6960852      branch-misses:u                  #    0.01% of all branches             (83.33%)

     272.967381843 seconds time elapsed

     268.718335000 seconds user
       4.212584000 seconds sys

$ perf record taskset -c 0 specinvoke
[ perf record: Woken up 167 times to write data ]
[ perf record: Captured and wrote 41.549 MB perf.data (1088982 samples) ]

$ perf report -n --percent-limit=1 --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:Pu'
# Event count (approx.): 883903400858
#
# Overhead       Samples  Command          Shared Object           Symbol                                
# ........  ............  ...............  ......................  ......................................
#
    24.34%        260907  milc_base.mine-  milc_base.mine-lto-gen  [.] add_force_to_mom
    18.01%        198287  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_su3_na
    17.45%        187529  milc_base.mine-  milc_base.mine-lto-gen  [.] u_shift_fermion
    14.22%        155596  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_su3_nn
     5.61%         60601  milc_base.mine-  milc_base.mine-lto-gen  [.] scalar_mult_add_su3_matrix
     4.35%         51034  milc_base.mine-  milc_base.mine-lto-gen  [.] path_product
     4.24%         46032  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_su3_an
     2.99%         32624  milc_base.mine-  milc_base.mine-lto-gen  [.] imp_gauge_force.constprop.0
     1.50%         16242  milc_base.mine-  milc_base.mine-lto-gen  [.] compute_gen_staple
     1.35%         14580  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_su3_mat_vec_sum_4dir
     1.21%         12922  milc_base.mine-  milc_base.mine-lto-gen  [.] make_anti_hermitian
     1.06%         11469  milc_base.mine-  milc_base.mine-lto-gen  [.] mult_adj_su3_mat_4vec
     1.03%         11111  milc_base.mine-  libc.so.6               [.] __memset_avx2_unaligned_erms


$ perf annotate -n --percent-limit=1 > ~/tmp/milc-perf-annotate-8aa47713701  (gzipeped and attached)
Comment 8 Alexander Monakov 2023-12-01 15:37:03 UTC
Thanks, I can reproduce it. It is pretty tricky though. For instance, just swapping the mov and the compare is enough to make it fast:

--- d.out.ltrans0.ltrans.slow.s 2023-12-01 18:32:54.255841611 +0300
+++ d.out.ltrans0.ltrans.fast.s 2023-12-01 18:32:20.318668991 +0300
@@ -743,8 +743,8 @@ add_force_to_mom:
        .p2align 4,,10
        .p2align 3
 .L58:
-       cmpb    $1, -680(%r11,%r12)
        movapd  %xmm5, %xmm7
+       cmpb    $1, -680(%r11,%r12)
        jne     .L54
        xorpd   %xmm6, %xmm7
 .L54:
Comment 9 Alexander Monakov 2023-12-01 16:00:46 UTC
... as does inserting a nop before the compare ¯\_(ツ)_/¯


--- d.out.ltrans0.ltrans.slow.s 2023-12-01 18:32:54.255841611 +0300
+++ d.out.ltrans0.ltrans.s      2023-12-01 18:53:04.909438690 +0300
@@ -743,6 +743,7 @@ add_force_to_mom:
        .p2align 4,,10
        .p2align 3
 .L58:
+       nop
        cmpb    $1, -680(%r11,%r12)
        movapd  %xmm5, %xmm7
        jne     .L54
Comment 10 Filip Kastl 2024-03-21 09:45:55 UTC
I see that the benchmark's exec time has returned to its original value. If there are no objections, I'll mark this bug as fixed.
Comment 11 Jeffrey A. Law 2024-03-22 12:47:39 UTC
Per c#10.