Between commits g:0c305f3dec9a992dd775a3b9607b7b1e8c051859 g:7f974c5fd4f59a9d8dd20c49a0e2909cb290f4b4 there's a 30-40% slowdown of the 433.milc 2006SPEC benchmark as seen here: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=292.70.0 It's 40% here and I also measured the slowdown on another zen2 machine where the difference was 30%. Compile options: -g -O2 -flto=128
Around the same time there was also a 6% slowdown with -Ofast -march=native -g and PGO https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=299.70.0 and an 11% slowdown with -O2 -g -flto=128 and PGO https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=685.70.0
This big jump has been seen in the past, I wonder if it's one of those micro-arch hazards regarding alignment. The only "generic" change with possibly ripple-down effects is r14-4965-ga5e69e94591ae2.
I can reliably bisect this to r14-4972-g8aa47713701b1f (Vladimir's [RA]: Add cost calculation for reg equivalence invariants) on a similar zen2 machine. But it seems zen2 specific, I did not see any performance difference (this is generic march/tuning) on znver4, for example. So it may be quite hard to analyze and fix, even though the regression is big :-/
I can probably find a znver2 machine for someone to work on if it's needed, but that's obviously not going to be the hardest part here...
Martin, if you still have the binaries, would you mind sharing perf profiles? You can produce plain-text reports with 'perf report --stdio' and 'perf annotate --stdio'.
Created attachment 56719 [details] Perf annotate of milc built with r14-4971-g0beb1611754742 commit r14-4971-g0beb1611754742: $ perf stat taskset -c 0 specinvoke Performance counter stats for 'taskset -c 0 specinvoke': 216908.59 msec task-clock:u # 1.000 CPUs utilized 0 context-switches:u # 0.000 /sec 0 cpu-migrations:u # 0.000 /sec 889694 page-faults:u # 4.102 K/sec 697007650237 cycles:u # 3.213 GHz (83.33%) 31999772966 stalled-cycles-frontend:u # 4.59% frontend cycles idle (83.33%) 540485725923 stalled-cycles-backend:u # 77.54% backend cycles idle (83.33%) 1061256162815 instructions:u # 1.52 insn per cycle # 0.51 stalled cycles per insn (83.33%) 58760648879 branches:u # 270.901 M/sec (83.34%) 11890202 branch-misses:u # 0.02% of all branches (83.33%) 216.935387643 seconds time elapsed 211.436079000 seconds user 5.472459000 seconds sys $ perf record taskset -c 0 specinvoke [ perf record: Woken up 132 times to write data ] [ perf record: Captured and wrote 32.901 MB perf.data (862286 samples) ] $ perf report -n --percent-limit=1 --stdio # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 862K of event 'cycles:Pu' # Event count (approx.): 695776598661 # # Overhead Samples Command Shared Object Symbol # ........ ............ ............... ...................... ...................................... # 22.68% 197003 milc_base.mine- milc_base.mine-lto-gen [.] mult_su3_na 20.99% 177912 milc_base.mine- milc_base.mine-lto-gen [.] u_shift_fermion 19.04% 163787 milc_base.mine- milc_base.mine-lto-gen [.] mult_su3_nn 6.85% 58509 milc_base.mine- milc_base.mine-lto-gen [.] scalar_mult_add_su3_matrix 5.51% 50953 milc_base.mine- milc_base.mine-lto-gen [.] path_product 5.40% 46083 milc_base.mine- milc_base.mine-lto-gen [.] mult_su3_an 4.22% 35853 milc_base.mine- milc_base.mine-lto-gen [.] add_force_to_mom 3.77% 32446 milc_base.mine- milc_base.mine-lto-gen [.] imp_gauge_force.constprop.0 1.98% 16848 milc_base.mine- milc_base.mine-lto-gen [.] compute_gen_staple 1.94% 16462 milc_base.mine- milc_base.mine-lto-gen [.] make_anti_hermitian 1.73% 14655 milc_base.mine- milc_base.mine-lto-gen [.] mult_su3_mat_vec_sum_4dir 1.35% 11472 milc_base.mine- milc_base.mine-lto-gen [.] mult_adj_su3_mat_4vec 1.27% 10801 milc_base.mine- libc.so.6 [.] __memset_avx2_unaligned_erms $ perf annotate -n --percent-limit=1 > ~/tmp/milc-perf-annotate-0beb1611754 (gzipeped and attached)
Created attachment 56720 [details] Perf annotate of milc built with r14-4972-g8aa47713701b1f commit r14-4972-g8aa47713701b1f: $ perf stat taskset -c 0 specinvoke Performance counter stats for 'taskset -c 0 specinvoke': 272931.43 msec task-clock:u # 1.000 CPUs utilized 0 context-switches:u # 0.000 /sec 0 cpu-migrations:u # 0.000 /sec 472353 page-faults:u # 1.731 K/sec 886165387570 cycles:u # 3.247 GHz (83.33%) 31546898034 stalled-cycles-frontend:u # 3.56% frontend cycles idle (83.33%) 729878095777 stalled-cycles-backend:u # 82.36% backend cycles idle (83.33%) 1061779557370 instructions:u # 1.20 insn per cycle # 0.69 stalled cycles per insn (83.33%) 58797121078 branches:u # 215.428 M/sec (83.33%) 6960852 branch-misses:u # 0.01% of all branches (83.33%) 272.967381843 seconds time elapsed 268.718335000 seconds user 4.212584000 seconds sys $ perf record taskset -c 0 specinvoke [ perf record: Woken up 167 times to write data ] [ perf record: Captured and wrote 41.549 MB perf.data (1088982 samples) ] $ perf report -n --percent-limit=1 --stdio # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 1M of event 'cycles:Pu' # Event count (approx.): 883903400858 # # Overhead Samples Command Shared Object Symbol # ........ ............ ............... ...................... ...................................... # 24.34% 260907 milc_base.mine- milc_base.mine-lto-gen [.] add_force_to_mom 18.01% 198287 milc_base.mine- milc_base.mine-lto-gen [.] mult_su3_na 17.45% 187529 milc_base.mine- milc_base.mine-lto-gen [.] u_shift_fermion 14.22% 155596 milc_base.mine- milc_base.mine-lto-gen [.] mult_su3_nn 5.61% 60601 milc_base.mine- milc_base.mine-lto-gen [.] scalar_mult_add_su3_matrix 4.35% 51034 milc_base.mine- milc_base.mine-lto-gen [.] path_product 4.24% 46032 milc_base.mine- milc_base.mine-lto-gen [.] mult_su3_an 2.99% 32624 milc_base.mine- milc_base.mine-lto-gen [.] imp_gauge_force.constprop.0 1.50% 16242 milc_base.mine- milc_base.mine-lto-gen [.] compute_gen_staple 1.35% 14580 milc_base.mine- milc_base.mine-lto-gen [.] mult_su3_mat_vec_sum_4dir 1.21% 12922 milc_base.mine- milc_base.mine-lto-gen [.] make_anti_hermitian 1.06% 11469 milc_base.mine- milc_base.mine-lto-gen [.] mult_adj_su3_mat_4vec 1.03% 11111 milc_base.mine- libc.so.6 [.] __memset_avx2_unaligned_erms $ perf annotate -n --percent-limit=1 > ~/tmp/milc-perf-annotate-8aa47713701 (gzipeped and attached)
Thanks, I can reproduce it. It is pretty tricky though. For instance, just swapping the mov and the compare is enough to make it fast: --- d.out.ltrans0.ltrans.slow.s 2023-12-01 18:32:54.255841611 +0300 +++ d.out.ltrans0.ltrans.fast.s 2023-12-01 18:32:20.318668991 +0300 @@ -743,8 +743,8 @@ add_force_to_mom: .p2align 4,,10 .p2align 3 .L58: - cmpb $1, -680(%r11,%r12) movapd %xmm5, %xmm7 + cmpb $1, -680(%r11,%r12) jne .L54 xorpd %xmm6, %xmm7 .L54:
... as does inserting a nop before the compare ¯\_(ツ)_/¯ --- d.out.ltrans0.ltrans.slow.s 2023-12-01 18:32:54.255841611 +0300 +++ d.out.ltrans0.ltrans.s 2023-12-01 18:53:04.909438690 +0300 @@ -743,6 +743,7 @@ add_force_to_mom: .p2align 4,,10 .p2align 3 .L58: + nop cmpb $1, -680(%r11,%r12) movapd %xmm5, %xmm7 jne .L54
I see that the benchmark's exec time has returned to its original value. If there are no objections, I'll mark this bug as fixed.
Per c#10.