In my experiments on an EPYC CPU and GCC trunk r270364, 503.bwaves_r is over 6% slower at -Ofast when I supply -march=native -mtune=native than when I compile for generic x86_64. LNT also sees a 3.55% regression:
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/tuning

perf stat and report of the generic (fast) binary run:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      240411.714022      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              35189      page-faults:u              #    0.146 K/sec
       757727387955      cycles:u                   #    3.152 GHz                      (83.32%)
        40175950077      stalled-cycles-frontend:u  #    5.30% frontend cycles idle     (83.31%)
        91872393105      stalled-cycles-backend:u   #   12.12% backend cycles idle      (83.37%)
      2177387522561      instructions:u             #    2.87  insn per cycle
                                                    #    0.04  stalled cycles per insn  (83.32%)
        98299602685      branches:u                 #  408.880 M/sec                    (83.32%)
          131591436      branch-misses:u            #    0.13% of all branches          (83.36%)

      240.668052943 seconds time elapsed

# Samples: 960K of event 'cycles'
# Event count (approx.): 755626377551
#
# Overhead  Samples  Command   Shared Object      Symbol
# ........  .......  ........  .................  ........................
#   62.10%   595840  bwaves_r  bwaves_r_peak-gen  mat_times_vec_
#   13.91%   133958  bwaves_r  bwaves_r_peak-gen  shell_
#   12.40%   119012  bwaves_r  bwaves_r_peak-gen  bi_cgstab_block_
#    7.81%    75246  bwaves_r  bwaves_r_peak-gen  jacobian_
#    2.11%    20290  bwaves_r  bwaves_r_peak-gen  flux_
#    1.27%    12217  bwaves_r  libc-2.29.so       __memset_avx2_unaligned

perf stat and report of the native (slow) binary run:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      255695.249393      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              35604      page-faults:u              #    0.139 K/sec
       800619530480      cycles:u                   #    3.131 GHz                      (83.32%)
        77320365388      stalled-cycles-frontend:u  #    9.66% frontend cycles idle     (83.34%)
        93389410778      stalled-cycles-backend:u   #   11.66% backend cycles idle      (83.33%)
      1821704428841      instructions:u             #    2.28  insn per cycle
                                                    #    0.05  stalled cycles per insn  (83.32%)
        99885762475      branches:u                 #  390.644 M/sec                    (83.34%)
          130710907      branch-misses:u            #    0.13% of all branches          (83.34%)

      255.958363704 seconds time elapsed

# Samples: 1M of event 'cycles'
# Event count (approx.): 804011318580
#
# Overhead  Samples  Command   Shared Object      Symbol
# ........  .......  ........  .................  ........................
#   64.87%   662574  bwaves_r  bwaves_r_peak-nat  mat_times_vec_
#   12.75%   130756  bwaves_r  bwaves_r_peak-nat  shell_
#   11.48%   117266  bwaves_r  bwaves_r_peak-nat  bi_cgstab_block_
#    7.45%    76415  bwaves_r  bwaves_r_peak-nat  jacobian_
#    1.92%    19701  bwaves_r  bwaves_r_peak-nat  flux_
#    1.34%    13662  bwaves_r  libc-2.29.so       __memset_avx2_unaligned

Examining the slow mat_times_vec_ further, perf claims that the following loop is the most sample-heavy:

  0.01 |6c0:+->vmulpd (%r8,%rax,1),%xmm9,%xmm0
  4.34 |    |  vandnp (%r10,%rax,1),%xmm2,%xmm1
  0.83 |    |  vfmadd (%r15,%rax,1),%xmm11,%xmm1
  1.35 |    |  vfmadd (%r14,%rax,1),%xmm10,%xmm0
  5.85 |    |  vaddpd %xmm1,%xmm0,%xmm1
  7.41 |    |  vmulpd (%rsi,%rax,1),%xmm7,%xmm0
  2.19 |    |  vfmadd (%rdi,%rax,1),%xmm8,%xmm0
  3.97 |    |  vmovap %xmm0,%xmm12
  0.07 |    |  vmulpd (%r11,%rax,1),%xmm5,%xmm0
  0.93 |    |  vfmadd (%rcx,%rax,1),%xmm6,%xmm0
  1.92 |    |  vaddpd %xmm12,%xmm0,%xmm0
  6.34 |    |  vaddpd %xmm1,%xmm0,%xmm0
  9.58 |    |  vmovup %xmm0,(%r10,%rax,1)
  0.49 |    |  add    $0x10,%rax
  0.05 |    |  cmp    %rax,0x38(%rsp)
  0.02 |    +--jne    6c0

Objdump perhaps gives a better idea about exactly which instructions these are:

  4011c0:  c4 c1 31 59 04 00     vmulpd (%r8,%rax,1),%xmm9,%xmm0
  4011c6:  c4 c1 68 55 0c 02     vandnps (%r10,%rax,1),%xmm2,%xmm1
  4011cc:  c4 c2 a1 b8 0c 07     vfmadd231pd (%r15,%rax,1),%xmm11,%xmm1
  4011d2:  c4 c2 a9 b8 04 06     vfmadd231pd (%r14,%rax,1),%xmm10,%xmm0
  4011d8:  c5 f9 58 c9           vaddpd %xmm1,%xmm0,%xmm1
  4011dc:  c5 c1 59 04 06        vmulpd (%rsi,%rax,1),%xmm7,%xmm0
  4011e1:  c4 e2 b9 b8 04 07     vfmadd231pd (%rdi,%rax,1),%xmm8,%xmm0
  4011e7:  c5 78 28 e0           vmovaps %xmm0,%xmm12
  4011eb:  c4 c1 51 59 04 03     vmulpd (%r11,%rax,1),%xmm5,%xmm0
  4011f1:  c4 e2 c9 b8 04 01     vfmadd231pd (%rcx,%rax,1),%xmm6,%xmm0
  4011f7:  c4 c1 79 58 c4        vaddpd %xmm12,%xmm0,%xmm0
  4011fc:  c5 f9 58 c1           vaddpd %xmm1,%xmm0,%xmm0
  401200:  c4 c1 78 11 04 02     vmovups %xmm0,(%r10,%rax,1)
  401206:  48 83 c0 10           add    $0x10,%rax
  40120a:  48 39 44 24 38        cmp    %rax,0x38(%rsp)
  40120f:  75 af                 jne    4011c0 <mat_times_vec_+0x6c0>

I did a quick experiment with completely disabling FMA generation, but it did not help.
I can still see this issue on a Zen1 machine as of trunk revision abe13e1847f (Feb 17 2020) but not on Zen2 machines (in both cases targeting native ISAs).
I spoke too soon: I can see this in May GCC 10.1 data on a Zen1 machine and also in current master (6e1e0decc9e) on a Zen2 machine, still about 6% in both cases. (GCC 9 does not have this problem on Zen2 but does on Zen1, so it looks a bit fragile.)
I do not have data from Zen1 on this, but on Zen2 this is fixed on the current trunk (and on Zen3 too, where GCC 10 was also slower with native than generic tuning). I don't seem to be able to force LNT to show me the respective graph, but my guess would be that running lim after loop interchange did it. Anyway, fixed.