Bug 89853 - Regression of 525.x264_r at -O2 (and generic tuning) on AMD EPYC
Summary: Regression of 525.x264_r at -O2 (and generic tuning) on AMD EPYC
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: spec
Reported: 2019-03-27 15:42 UTC by Martin Jambor
Modified: 2019-03-28 17:26 UTC

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Martin Jambor 2019-03-27 15:42:30 UTC
I have detected a 7% regression of 525.x264_r from SPEC INTrate 2017
at -O2 and generic march/tuning on AMD EPYC (znver1) CPUs (I have not seen
it on an Intel CPU), compared to the gcc-8-branch.

I have bisected it to r264897.

With revision 264896 I get:

  perf stat:

    Performance counter stats for 'numactl -C 0 -l specinvoke':
   
        495413.105450      task-clock:u (msec)       #    0.999 CPUs utilized          
                    0      context-switches:u        #    0.000 K/sec                  
                    0      cpu-migrations:u          #    0.000 K/sec                  
                80572      page-faults:u             #    0.163 K/sec                  
        1573525941814      cycles:u                  #    3.176 GHz                      (83.33%)
          56730573392      stalled-cycles-frontend:u #    3.61% frontend cycles idle     (83.33%)
         397644125819      stalled-cycles-backend:u  #   25.27% backend cycles idle      (83.33%)
        5157395976259      instructions:u            #    3.28  insn per cycle         
                                                     #    0.08  stalled cycles per insn  (83.33%)
         421019689027      branches:u                #  849.836 M/sec                    (83.33%)
          10705813341      branch-misses:u           #    2.54% of all branches          (83.33%)
   
        495.869208013 seconds time elapsed


  perf report -n --percent-limit 2

   # Event count (approx.): 1576108148398
   #
   # Overhead    Samples  Command      Shared Object   Symbol                                           
   # ........  .........  ...........  ..............  ............................
   #
       14.20%     282290  x264_r_base  x264_r_base.mi  [.] x264_pixel_satd_8x4
       11.19%     222403  x264_r_base  x264_r_base.mi  [.] get_ref
       10.82%     215061  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_x4_16x16
        7.00%     139082  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_16x16
        6.11%     121470  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_x3_16x16
        5.89%     116939  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_x4_8x8
        5.09%     101266  x264_r_base  x264_r_base.mi  [.] quant_4x4
        4.10%      81471  x264_r_base  x264_r_base.mi  [.] mc_chroma
        2.47%      49122  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_x3_8x8
        2.21%      43928  x264_r_base  x264_r_base.mi  [.] sub4x4_dct
        2.14%      42598  x264_r_base  x264_r_base.mi  [.] pixel_hadamard_ac
   
With revision 264897 I get:

  perf stat

    Performance counter stats for 'numactl -C 0 -l specinvoke':
   
        495413.105450      task-clock:u (msec)       #    0.999 CPUs utilized          
                    0      context-switches:u        #    0.000 K/sec                  
                    0      cpu-migrations:u          #    0.000 K/sec                  
                80572      page-faults:u             #    0.163 K/sec                  
        1573525941814      cycles:u                  #    3.176 GHz                      (83.33%)
          56730573392      stalled-cycles-frontend:u #    3.61% frontend cycles idle     (83.33%)
         397644125819      stalled-cycles-backend:u  #   25.27% backend cycles idle      (83.33%)
        5157395976259      instructions:u            #    3.28  insn per cycle         
                                                     #    0.08  stalled cycles per insn  (83.33%)
         421019689027      branches:u                #  849.836 M/sec                    (83.33%)
          10705813341      branch-misses:u           #    2.54% of all branches          (83.33%)
   
        495.869208013 seconds time elapsed


  perf report -n --percent-limit 2

   # Event count (approx.): 1576108148398
   #
   # Overhead       Samples  Command          Shared Object                 Symbol                                           
   # ........  ............  ...............  ............................  .................................................
   #
       14.20%        282290  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_satd_8x4
       11.19%        222403  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] get_ref
       10.82%        215061  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_16x16
        7.00%        139082  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_16x16
        6.11%        121470  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_16x16
        5.89%        116939  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_8x8
        5.09%        101266  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] quant_4x4
        4.10%         81471  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] mc_chroma
        2.47%         49122  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_8x8
        2.21%         43928  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] sub4x4_dct
        2.14%         42598  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] pixel_hadamard_ac
Comment 1 Peter Bergner 2019-03-27 17:28:55 UTC
Cut and paste error?  The two data sets look the same to me...or am I missing something?
Comment 2 Martin Jambor 2019-03-27 17:53:34 UTC
Doh, yes, copy-paste error, sorry.  The data should have been:

FAST:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     495413.105450      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             80572      page-faults:u             #    0.163 K/sec                  
     1573525941814      cycles:u                  #    3.176 GHz                      (83.33%)
       56730573392      stalled-cycles-frontend:u #    3.61% frontend cycles idle     (83.33%)
      397644125819      stalled-cycles-backend:u  #   25.27% backend cycles idle      (83.33%)
     5157395976259      instructions:u            #    3.28  insn per cycle         
                                                  #    0.08  stalled cycles per insn  (83.33%)
      421019689027      branches:u                #  849.836 M/sec                    (83.33%)
       10705813341      branch-misses:u           #    2.54% of all branches          (83.33%)

     495.869208013 seconds time elapsed

# Event count (approx.): 1576108148398
#
# Overhead       Samples  Command          Shared Object                 Symbol                                           
# ........  ............  ...............  ............................  .................................................
#
    14.20%        282290  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_satd_8x4
    11.19%        222403  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] get_ref
    10.82%        215061  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_16x16
     7.00%        139082  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_16x16
     6.11%        121470  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_16x16
     5.89%        116939  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_8x8
     5.09%        101266  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] quant_4x4
     4.10%         81471  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] mc_chroma
     2.47%         49122  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_8x8
     2.21%         43928  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] sub4x4_dct
     2.14%         42598  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] pixel_hadamard_ac



SLOW:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     526858.531112      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             81064      page-faults:u             #    0.154 K/sec                  
     1673634535742      cycles:u                  #    3.177 GHz                      (83.33%)
       64458929239      stalled-cycles-frontend:u #    3.85% frontend cycles idle     (83.33%)
      397586117982      stalled-cycles-backend:u  #   23.76% backend cycles idle      (83.33%)
     5157346862311      instructions:u            #    3.08  insn per cycle         
                                                  #    0.08  stalled cycles per insn  (83.33%)
      421082988475      branches:u                #  799.234 M/sec                    (83.33%)
       14226205709      branch-misses:u           #    3.38% of all branches          (83.33%)

     527.353829377 seconds time elapsed


 # Samples: 2M of event 'cycles'
 # Event count (approx.): 1675655436335
 #
 # Overhead       Samples  Command          Shared Object                 Symbol                                           
 # ........  ............  ...............  ............................  .................................................
 #
    14.13%        298519  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_16x16
    13.43%        283793  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_satd_8x4
    11.56%        244196  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] get_ref
     7.17%        151589  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_16x16
     6.29%        132936  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_16x16
     5.28%        111517  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_8x8
     4.84%        102317  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] quant_4x4
     3.86%         81563  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] mc_chroma
     2.57%         54233  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_8x8
     2.08%         43964  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] sub4x4_dct
     2.01%         42520  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] pixel_hadamard_ac
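
For reference, the relative changes between the FAST and SLOW runs above can be worked out with a short script. This is only a sketch: the numbers are copied from the perf output above, while the dictionary and function names are invented for illustration. Comparing raw sample counts assumes both profiles were taken at the same sampling rate, which the output does not state.

```python
# Counter values copied from the FAST and SLOW perf stat output above.
fast_counters = {
    "seconds elapsed": 495.869208013,
    "cycles:u": 1573525941814,
    "instructions:u": 5157395976259,
    "stalled-cycles-frontend:u": 56730573392,
    "branches:u": 421019689027,
    "branch-misses:u": 10705813341,
}
slow_counters = {
    "seconds elapsed": 527.353829377,
    "cycles:u": 1673634535742,
    "instructions:u": 5157346862311,
    "stalled-cycles-frontend:u": 64458929239,
    "branches:u": 421082988475,
    "branch-misses:u": 14226205709,
}

# Sample counts of the hottest symbols, copied from the two perf reports above.
fast_samples = {
    "x264_pixel_satd_8x4": 282290,
    "get_ref": 222403,
    "x264_pixel_sad_x4_16x16": 215061,
    "x264_pixel_sad_16x16": 139082,
    "x264_pixel_sad_x3_16x16": 121470,
    "x264_pixel_sad_x4_8x8": 116939,
}
slow_samples = {
    "x264_pixel_satd_8x4": 283793,
    "get_ref": 244196,
    "x264_pixel_sad_x4_16x16": 298519,
    "x264_pixel_sad_16x16": 132936,
    "x264_pixel_sad_x3_16x16": 151589,
    "x264_pixel_sad_x4_8x8": 111517,
}


def relative_change(before, after):
    """Fractional change from `before` to `after` (0.05 means +5%)."""
    return (after - before) / before


counter_changes = {
    name: relative_change(fast_counters[name], slow_counters[name])
    for name in fast_counters
}
sample_changes = {
    name: relative_change(fast_samples[name], slow_samples[name])
    for name in fast_samples
}

for name, change in counter_changes.items():
    print(f"{name:28s} {change:+.2%}")

# List symbols from the largest growth in samples to the largest drop.
for name, change in sorted(sample_changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {change:+.2%}")
```

On these numbers the instruction count is essentially unchanged, while cycles rise by roughly 6.4% and branch misses by roughly a third. Among the listed symbols, x264_pixel_sad_x4_16x16 shows the largest growth in samples.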
Comment 3 Peter Bergner 2019-03-27 21:18:39 UTC
I don't have access to that type of machine and honestly don't know the ISA well enough to know the differences between what runs well and what doesn't just by looking at the code.  Can you point out some code/function where the assembler code is worse?

The patch you bisected to only removes unneeded conflicts in the conflict graph, which gives the allocators more freedom, which in general is a good thing.  That said, since these are all heuristics built on top of heuristics, it's not impossible that giving more freedom could lead to worse code.

My guess, though, is that we're probably tickling an AMD-specific hardware pipeline feature, since you said you don't see the same thing on Intel.
Comment 4 Martin Liška 2019-03-28 07:52:43 UTC
Just for the record, my Ryzen machine periodic tester probably improved due to the revision:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=158.377.0&plot.1=41.377.0&plot.2=70.377.0&plot.3=31.377.0

As seen, it's now about 5% faster than GCC8 branch.
Comment 5 Peter Bergner 2019-03-28 17:07:45 UTC
(In reply to Martin Liška from comment #4)
> Just for the record, my Ryzen machine periodic tester probably improved due
> to the revision:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=158.377.0&plot.1=41.377.0&plot.2=70.377.0&plot.3=31.377.0
> 
> As seen, it's now about 5% faster than GCC8 branch.

Very interesting, thanks for that!  Since the two of you both used -O2 and generic tuning (i.e., the same code), that would tend to agree with my speculation that this is an AMD EPYC-specific pipeline issue/hazard/... that we're unluckily hitting.  Agreed?  If so, I'm not sure we can really blame my patch, but if someone could narrow down exactly what is causing the slowdown, maybe we can mitigate it somehow.
Comment 6 Martin Jambor 2019-03-28 17:17:40 UTC
Hi, the assembly of the most affected function does not change at all, just its offset (it is 0x10 bytes bigger).  Aligning the loops in the function a bit more avoids most of the slowdown but not quite all of it.  In any event, this is a microarchitectural problem that we probably cannot do anything about.  Sorry for the noise, I will check for this the next time before I report a problem.
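
To illustrate why an identical loop can behave differently after a 0x10-byte shift, here is a minimal sketch that counts how many aligned blocks a code range touches before and after the shift. The loop address, the loop size and the 64-byte block size are all hypothetical values picked for the example. The report does not say which microarchitectural structure is actually responsible.

```python
def blocks_spanned(start, size, block=64):
    """Number of aligned `block`-byte blocks touched by [start, start + size)."""
    first = start // block
    last = (start + size - 1) // block
    return last - first + 1


# Hypothetical loop body: 0x30 bytes long, starting at 0x401010.
LOOP_SIZE = 0x30
start_before = 0x401010
start_after = start_before + 0x10  # same code, moved by 0x10 bytes

spanned_before = blocks_spanned(start_before, LOOP_SIZE)
spanned_after = blocks_spanned(start_after, LOOP_SIZE)

print(f"before: {start_before:#x} spans {spanned_before} block(s)")
print(f"after:  {start_after:#x} spans {spanned_after} block(s)")
```

With these made-up numbers the loop fits in a single 64-byte block before the shift and straddles two blocks after it, even though the instructions themselves are unchanged.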
Comment 7 Peter Bergner 2019-03-28 17:26:53 UTC
(In reply to Martin Jambor from comment #6)
> Hi, the assembly of the most affected function does not change at all, just
> its offset (it is 0x10 bytes bigger).  Aligning the loops in the function a bit
> more avoids most of the slowdown but not quite all of it.  In any event,
> this is a microarchitectural problem that we probably cannot do anything
> about.  Sorry for the noise, I will check for this the next time before I
> report a problem.

We've seen similar issues on POWER, where a particular revision slightly changes the size of one function, which shifts the offset of some later function, and that shift alone causes a performance change.  Unfortunately, just increasing function alignment to eliminate that has its own unintended performance costs.

Thanks for isolating the issue.
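
The layout effect described in this comment can be sketched as follows, assuming functions are placed one after another and each start address is rounded up to a fixed alignment. The function names, sizes and the 16-byte alignment are hypothetical and chosen only to reproduce a 0x10-byte shift like the one reported in comment 6.

```python
def align_up(addr, alignment):
    """Round `addr` up to the next multiple of `alignment`."""
    return (addr + alignment - 1) // alignment * alignment


def layout(functions, base=0, alignment=16):
    """Place (name, size) pairs sequentially; return each function's start offset."""
    offsets = {}
    addr = base
    for name, size in functions:
        addr = align_up(addr, alignment)
        offsets[name] = addr
        addr += size
    return offsets


# Hypothetical sizes: only f_early changes; f_hot's code is byte-for-byte identical.
old = layout([("f_early", 0x5C), ("f_hot", 0x200)])
new = layout([("f_early", 0x62), ("f_hot", 0x200)])  # f_early grew by 6 bytes

shift = new["f_hot"] - old["f_hot"]
print(f"f_hot moved from {old['f_hot']:#x} to {new['f_hot']:#x} (shift {shift:#x})")
```

In this example a 6-byte growth in the earlier function pushes the unchanged later function to the next 16-byte boundary, so it moves by 0x10 bytes.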