Bug 113600 - [14/15 regression] 525.x264_r run-time regresses by 8% with PGO -Ofast -march=znver4
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 14.0
Importance: P2 normal
Target Milestone: 15.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Duplicates: 112879
Depends on:
Blocks: spec vectorizer
 
Reported: 2024-01-25 14:29 UTC by Martin Jambor
Modified: 2024-11-27 09:23 UTC (History)
CC List: 6 users

See Also:
Host: x86_64-linux-gnu
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Martin Jambor 2024-01-25 14:29:27 UTC
With profile-feedback, -Ofast and -march=native on an AMD Zen 4, there is a recent 8% regression:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=979.377.0&plot.1=966.377.0&

With both PGO and LTO, the situation is similar (6%):
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=977.377.0&plot.1=958.377.0&

On a Zen3 machine, there is a 2% bump around the same time: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=900.377.0&plot.1=473.377.0&

I have bisected the (non-LTO) Zen 4 case to commit r14-5603-g2b59e2b4dff421:

2b59e2b4dff42118fe3a505f07b9a6aa4cf53bdf is the first bad commit
commit 2b59e2b4dff42118fe3a505f07b9a6aa4cf53bdf
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 16 18:38:39 2023 +0800

    Support reduc_{plus,xor,and,ior}_scal_m for vector integer mode.

    BB vectorizer relies on the backend support of
    .REDUC_{PLUS,IOR,XOR,AND} to vectorize reduction.

    gcc/ChangeLog:
            
            PR target/112325
            * config/i386/sse.md (reduc_<code>_scal_<mode>): New expander.
            (REDUC_ANY_LOGIC_MODE): New iterator.
            (REDUC_PLUS_MODE): Extend to VxHI/SI/DImode.
            (REDUC_SSE_PLUS_MODE): Ditto.

    gcc/testsuite/ChangeLog:
            
            * gcc.target/i386/pr112325-1.c: New test.
            * gcc.target/i386/pr112325-2.c: New test.
Comment 1 Hongtao Liu 2024-01-26 01:02:23 UTC
I guess it's the same issue as PR112879?
Comment 2 Hongtao Liu 2024-01-26 02:14:39 UTC
A patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640276.html

Could you give it a try to see if it fixes the regression? I don't currently have a znver4 machine for testing.
Comment 3 Richard Biener 2024-01-26 07:48:12 UTC
I'll note that two-lane reductions (or two-lane BB vectorization in general) are hardly profitable on modern x86 uarchs unless the vectorized code is interleaved with other non-vectorized code that can execute at the same time.  Vectorizing two lanes only makes them dependent on each other, while, when not vectorized, modern uarchs have no difficulty executing them in parallel (and without the tied dependences).  BB vectorization is only going to be profitable when there is sufficient benefit, that is, more lanes, approaching the issue width or the number of available ports for the ops, or when the whole SLP graph mostly consists of loads/stores.  Note the cost model only ever looks at the stmts participating in the vectorization, not the "surrounding" code, and it would be difficult to include that since the schedule on GIMPLE isn't even close to what we get later.  The reduction op is of course also a serialization point on the scalar side; whether that means BB reductions with two lanes are better candidates than grouped BB stores with two lanes is another question.

The BB reduction op itself is costed properly.

So the 525.x264_r case might be loop vectorization; OTOH the epilogue cost is hardly ever the knob that decides whether a vectorization is profitable.

I think we need to figure out what exactly gets slower (and hope it's not scattered all over the place)
Comment 4 Martin Jambor 2024-01-26 18:27:35 UTC
(In reply to Hongtao Liu from comment #2)
> A patch is posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640276.html
> 
> Could you give it a try to see if it fixes the regression? I don't
> currently have a znver4 machine for testing.

Unfortunately it does not.

(In reply to Richard Biener from comment #3)
> I think we need to figure out what exactly gets slower (and hope it's not
> scattered all over the place)

I have collected some profiles:

r14-5602-ge6269bb69c0734

# Samples: 516K of event 'cycles:u'
# Event count (approx.): 468008188417
# Overhead       Samples  Command          Shared Object                          Symbol                                           
# ........  ............  ...............  .....................................  .................................................
#
    13.55%         69886  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] mc_chroma
    11.05%         57017  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_16x16
     9.24%         47693  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_8x8
     8.67%         44733  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] get_ref
     4.84%         24984  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] sub16x16_dct
     4.16%         21484  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_me_search_ref
     3.30%         17033  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_hadamard_ac_16x16
     2.28%         11770  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_4x4
     2.10%         10824  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] quant_trellis_cabac
     2.07%         10694  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] hpel_filter
     2.05%         10616  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] sub8x8_dct
     1.86%          9593  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] refine_subpel
     1.70%          8788  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] quant_4x4
     1.57%          8077  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_sad_16x16
     1.16%          6324  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] frame_init_lowres_core
     1.14%          5867  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_sa8d_8x8
     1.11%          5738  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_cabac_encode_decision_c
     1.08%          5736  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_var_16x16



r14-5603-g2b59e2b4dff421

# Samples: 550K of event 'cycles:u'
# Event count (approx.): 498834737657
# Overhead       Samples  Command          Shared Object                          Symbol                                           
# ........  ............  ...............  .....................................  .................................................
#
    18.21%        100151  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_16x16
    12.37%         68006  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] mc_chroma
     8.51%         46815  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_8x8
     7.56%         41560  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] get_ref
     4.53%         24901  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] sub16x16_dct
     3.92%         21561  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_me_search_ref
     3.08%         16963  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_hadamard_ac_16x16
     2.41%         13239  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_4x4
     1.99%         10931  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] quant_trellis_cabac
     1.96%         10801  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] hpel_filter
     1.95%         10764  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] sub8x8_dct
     1.56%          8587  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] quant_4x4
     1.49%          8166  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] refine_subpel
     1.48%          8124  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_sad_16x16
     1.09%          6328  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] frame_init_lowres_core
     1.07%          5901  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_sa8d_8x8
     1.04%          5703  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_cabac_encode_decision_c
Comment 5 Hongtao Liu 2024-01-30 09:29:36 UTC
It looks like x264_pixel_satd_16x16 consumes more time after my commit.  An extracted case is below.  Note there is no attribute((always_inline)) in the original x264_pixel_satd_8x4; it is added here to force inlining (under PGO the function is hot and will be inlined).

typedef unsigned char uint8_t;
typedef unsigned uint32_t;
typedef unsigned short uint16_t;

static inline uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

int
__attribute__((always_inline))
x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
  uint32_t tmp[4][4];
  uint32_t a0, a1, a2, a3;
  int sum = 0;
  for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
      { int t0 = a0 + a1; int t1 = a0 - a1;
        int t2 = a2 + a3; int t3 = a2 - a3;
        tmp[i][0] = t0 + t2; tmp[i][2] = t0 - t2;
        tmp[i][1] = t1 + t3; tmp[i][3] = t1 - t3; };
    }
  for( int i = 0; i < 4; i++ )
    {
      { int t0 = tmp[0][i] + tmp[1][i]; int t1 = tmp[0][i] - tmp[1][i];
        int t2 = tmp[2][i] + tmp[3][i]; int t3 = tmp[2][i] - tmp[3][i];
        a0 = t0 + t2; a2 = t0 - t2;
        a1 = t1 + t3; a3 = t1 - t3; };
      sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}

int x264_pixel_satd_16x16( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
  int sum = x264_pixel_satd_8x4( pix1, i_pix1, pix2, i_pix2 )
    + x264_pixel_satd_8x4( pix1+4*i_pix1, i_pix1, pix2+4*i_pix2, i_pix2 );
  sum+= x264_pixel_satd_8x4( pix1+8, i_pix1, pix2+8, i_pix2 )
    + x264_pixel_satd_8x4( pix1+8+4*i_pix1, i_pix1, pix2+8+4*i_pix2, i_pix2 );
  sum+= x264_pixel_satd_8x4( pix1+8*i_pix1, i_pix1, pix2+8*i_pix2, i_pix2 )
    + x264_pixel_satd_8x4( pix1+12*i_pix1, i_pix1, pix2+12*i_pix2, i_pix2 );
  sum+= x264_pixel_satd_8x4( pix1+8+8*i_pix1, i_pix1, pix2+8+8*i_pix2, i_pix2 )
    + x264_pixel_satd_8x4( pix1+8+12*i_pix1, i_pix1, pix2+8+12*i_pix2, i_pix2 );
  return sum;
}


After the commit, SLP fails to split the group of size 16 (vector(16) int) into smaller 4 + 12 pieces and misses vectorization for the case below.

  vect_t2_2445.784_8503 = VIEW_CONVERT_EXPR<vector(4) int>(_8502);
  vect__2457.786_8505 = vect_t0_2441.783_8501 - vect_t2_2445.784_8503;
  vect__2448.785_8504 = vect_t0_2441.783_8501 + vect_t2_2445.784_8503;
  _8506 = VEC_PERM_EXPR <vect__2448.785_8504, vect__2457.786_8505, { 0, 1, 6, 7 }>;
  vect__2449.787_8507 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(_8506);
  t3_2447 = (int) _2446;
  _2448 = t0_2441 + t2_2445;
  _2449 = (unsigned int) _2448;
  _2451 = t0_2441 - t2_2445;
  _2452 = (unsigned int) _2451;
  _2454 = t1_2443 + t3_2447;
  _2455 = (unsigned int) _2454;
  _2457 = t1_2443 - t3_2447;
  _2458 = (unsigned int) _2457;
  MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B] = vect__2449.787_8507;


The vector store would be optimized away together with the later vector load, so in the bad case (scalar stores followed by a vector load) there are STLF (store-to-load forwarding) issues.
Comment 6 Hongtao Liu 2024-01-30 09:31:30 UTC
I guess the explicit .REDUC_PLUS, instead of the original VEC_PERM_EXPR, somehow impacts the store split decision.
Comment 7 Filip Kastl 2024-02-13 10:13:28 UTC
*** Bug 112879 has been marked as a duplicate of this bug. ***
Comment 8 Richard Biener 2024-05-07 07:44:25 UTC
GCC 14.1 is being released, retargeting bugs to GCC 14.2.
Comment 9 Jakub Jelinek 2024-08-01 09:37:42 UTC
GCC 14.2 is being released, retargeting bugs to GCC 14.3.
Comment 10 Hongtao Liu 2024-08-15 05:43:56 UTC
I think it should be fixed by r15-2820-gab18785840d7b8
Comment 11 Hongtao Liu 2024-11-26 07:01:31 UTC
Should be fixed in GCC 15.0.