In my experiments on an EPYC CPU and GCC trunk r270364, 503.bwaves_r is over 6% slower at -Ofast when I supply -march=native -mtune=native than when I compile for generic x86_64. LNT also sees a 3.55% regression:
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/tuning

perf stat and report of the generic (fast) binary run:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      240411.714022      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              35189      page-faults:u              #    0.146 K/sec
       757727387955      cycles:u                   #    3.152 GHz                      (83.32%)
        40175950077      stalled-cycles-frontend:u  #    5.30% frontend cycles idle     (83.31%)
        91872393105      stalled-cycles-backend:u   #   12.12% backend cycles idle      (83.37%)
      2177387522561      instructions:u             #    2.87  insn per cycle
                                                    #    0.04  stalled cycles per insn  (83.32%)
        98299602685      branches:u                 #  408.880 M/sec                    (83.32%)
          131591436      branch-misses:u            #    0.13% of all branches          (83.36%)

      240.668052943 seconds time elapsed

# Samples: 960K of event 'cycles'
# Event count (approx.): 755626377551
#
# Overhead  Samples  Command   Shared Object      Symbol
# ........  .......  ........  .................  ........................
#   62.10%   595840  bwaves_r  bwaves_r_peak-gen  mat_times_vec_
#   13.91%   133958  bwaves_r  bwaves_r_peak-gen  shell_
#   12.40%   119012  bwaves_r  bwaves_r_peak-gen  bi_cgstab_block_
#    7.81%    75246  bwaves_r  bwaves_r_peak-gen  jacobian_
#    2.11%    20290  bwaves_r  bwaves_r_peak-gen  flux_
#    1.27%    12217  bwaves_r  libc-2.29.so       __memset_avx2_unaligned

perf stat and report of the native (slow) binary run:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      255695.249393      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              35604      page-faults:u              #    0.139 K/sec
       800619530480      cycles:u                   #    3.131 GHz                      (83.32%)
        77320365388      stalled-cycles-frontend:u  #    9.66% frontend cycles idle     (83.34%)
        93389410778      stalled-cycles-backend:u   #   11.66% backend cycles idle      (83.33%)
      1821704428841      instructions:u             #    2.28  insn per cycle
                                                    #    0.05  stalled cycles per insn  (83.32%)
        99885762475      branches:u                 #  390.644 M/sec                    (83.34%)
          130710907      branch-misses:u            #    0.13% of all branches          (83.34%)

      255.958363704 seconds time elapsed

# Samples: 1M of event 'cycles'
# Event count (approx.): 804011318580
#
# Overhead  Samples  Command   Shared Object      Symbol
# ........  .......  ........  .................  ........................
#   64.87%   662574  bwaves_r  bwaves_r_peak-nat  mat_times_vec_
#   12.75%   130756  bwaves_r  bwaves_r_peak-nat  shell_
#   11.48%   117266  bwaves_r  bwaves_r_peak-nat  bi_cgstab_block_
#    7.45%    76415  bwaves_r  bwaves_r_peak-nat  jacobian_
#    1.92%    19701  bwaves_r  bwaves_r_peak-nat  flux_
#    1.34%    13662  bwaves_r  libc-2.29.so       __memset_avx2_unaligned

Examining the slow mat_times_vec_ further, perf claims that the following loop is the most sample-heavy:

  0.01 |6c0:+->vmulpd (%r8,%rax,1),%xmm9,%xmm0
  4.34 |    |  vandnp (%r10,%rax,1),%xmm2,%xmm1
  0.83 |    |  vfmadd (%r15,%rax,1),%xmm11,%xmm1
  1.35 |    |  vfmadd (%r14,%rax,1),%xmm10,%xmm0
  5.85 |    |  vaddpd %xmm1,%xmm0,%xmm1
  7.41 |    |  vmulpd (%rsi,%rax,1),%xmm7,%xmm0
  2.19 |    |  vfmadd (%rdi,%rax,1),%xmm8,%xmm0
  3.97 |    |  vmovap %xmm0,%xmm12
  0.07 |    |  vmulpd (%r11,%rax,1),%xmm5,%xmm0
  0.93 |    |  vfmadd (%rcx,%rax,1),%xmm6,%xmm0
  1.92 |    |  vaddpd %xmm12,%xmm0,%xmm0
  6.34 |    |  vaddpd %xmm1,%xmm0,%xmm0
  9.58 |    |  vmovup %xmm0,(%r10,%rax,1)
  0.49 |    |  add    $0x10,%rax
  0.05 |    |  cmp    %rax,0x38(%rsp)
  0.02 |    +--jne    6c0

Objdump perhaps gives a better idea about exactly which instructions these are:

  4011c0:  c4 c1 31 59 04 00     vmulpd (%r8,%rax,1),%xmm9,%xmm0
  4011c6:  c4 c1 68 55 0c 02     vandnps (%r10,%rax,1),%xmm2,%xmm1
  4011cc:  c4 c2 a1 b8 0c 07     vfmadd231pd (%r15,%rax,1),%xmm11,%xmm1
  4011d2:  c4 c2 a9 b8 04 06     vfmadd231pd (%r14,%rax,1),%xmm10,%xmm0
  4011d8:  c5 f9 58 c9           vaddpd %xmm1,%xmm0,%xmm1
  4011dc:  c5 c1 59 04 06        vmulpd (%rsi,%rax,1),%xmm7,%xmm0
  4011e1:  c4 e2 b9 b8 04 07     vfmadd231pd (%rdi,%rax,1),%xmm8,%xmm0
  4011e7:  c5 78 28 e0           vmovaps %xmm0,%xmm12
  4011eb:  c4 c1 51 59 04 03     vmulpd (%r11,%rax,1),%xmm5,%xmm0
  4011f1:  c4 e2 c9 b8 04 01     vfmadd231pd (%rcx,%rax,1),%xmm6,%xmm0
  4011f7:  c4 c1 79 58 c4        vaddpd %xmm12,%xmm0,%xmm0
  4011fc:  c5 f9 58 c1           vaddpd %xmm1,%xmm0,%xmm0
  401200:  c4 c1 78 11 04 02     vmovups %xmm0,(%r10,%rax,1)
  401206:  48 83 c0 10           add    $0x10,%rax
  40120a:  48 39 44 24 38        cmp    %rax,0x38(%rsp)
  40120f:  75 af                 jne    4011c0 <mat_times_vec_+0x6c0>

I did a quick experiment with completely disabling FMA generation, but it did not help.
I can still see this issue on a Zen1 machine as of trunk revision abe13e1847f (Feb 17 2020) but not on Zen2 machines (in both cases targeting native ISAs).
I spoke too soon: I can see this in May GCC 10.1 data on a Zen1 machine and also in current master (6e1e0decc9e) on a Zen2 machine, still about 6% in both cases. (GCC 9 does not have this problem on Zen2 but does on Zen1, so it looks a bit fragile.)
I do not have data from Zen1 on this, but on Zen2 this is fixed on the current trunk (and on Zen3 too, where GCC 10 was also slower with native than generic tuning). I don't seem to be able to force LNT to show me the respective graph, but my guess would be that running lim after loop interchange did it. Anyway, fixed.