With profile-feedback, -Ofast and -march=native on an AMD Zen 4, there is a recent 8% regression:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=979.377.0&plot.1=966.377.0&

With both PGO and LTO, the situation is similar (6%):
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=977.377.0&plot.1=958.377.0&

On a Zen 3 machine, there is a 2% bump around the same time:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=900.377.0&plot.1=473.377.0&

I have bisected the (non-LTO) Zen 4 case to commit r14-5603-g2b59e2b4dff421:

2b59e2b4dff42118fe3a505f07b9a6aa4cf53bdf is the first bad commit
commit 2b59e2b4dff42118fe3a505f07b9a6aa4cf53bdf
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 16 18:38:39 2023 +0800

    Support reduc_{plus,xor,and,ior}_scal_m for vector integer mode.

    BB vectorizer relies on the backend support of
    .REDUC_{PLUS,IOR,XOR,AND} to vectorize reduction.

    gcc/ChangeLog:

            PR target/112325
            * config/i386/sse.md (reduc_<code>_scal_<mode>): New expander.
            (REDUC_ANY_LOGIC_MODE): New iterator.
            (REDUC_PLUS_MODE): Extend to VxHI/SI/DImode.
            (REDUC_SSE_PLUS_MODE): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr112325-1.c: New test.
            * gcc.target/i386/pr112325-2.c: New test.
Guess it's the same issue as PR112879?
A patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640276.html

Could you give it a try to see if it fixes the regression? I don't currently have a znver4 machine for testing.
I'll note that two-lane reductions in particular (and two-lane BB vectorization in general) are hardly profitable on modern x86 uarchs unless the vectorized code is interleaved with other non-vectorized code that can execute at the same time. Vectorizing two lanes only makes them dependent on each other, while when not vectorized modern uarchs have no difficulty executing them in parallel (but without the tied dependences). BB vectorization only becomes profitable when there is sufficient benefit: more lanes, approaching the issue width or the number of available ports for the ops, or the whole SLP instance mostly consisting of loads/stores.

Note the cost model only ever looks at the stmts participating in the vectorization, not the "surrounding" code, and it would be difficult to include that since the schedule on GIMPLE isn't even close to what we get later. The reduction op is also a serialization point on the scalar side, of course; whether that makes BB reductions with two lanes better candidates than grouped BB stores with two lanes is another question. The BB reduction op itself is costed properly.

So the 525.x264_r case might be loop vectorization; OTOH the epilogue cost is hardly ever the knob that decides whether a vectorization is profitable.

I think we need to figure out what exactly gets slower (and hope it's not scattered all over the place).
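To make the two-lane case concrete, here is a minimal made-up sketch (mine, not taken from x264; the function name is hypothetical) of the shape being discussed:

/* Two independent scalar adds feeding one reduction add.  As scalar code
   the two adds can issue in parallel on a wide core; as a two-lane vector
   they are tied together and still end in a horizontal reduction.  */
long
sum2 (const long *a)
{
  long s0 = a[0] + a[2];   /* lane 0 */
  long s1 = a[1] + a[3];   /* lane 1 */
  return s0 + s1;          /* the reduction op, a serialization point on
                              the scalar side as well */
}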
(In reply to Hongtao Liu from comment #2)
> A patch is posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640276.html
>
> Would you give a try to see if it fixes the regression, I don't currently
> have a znver4 machine for testing.

Unfortunately it does not.

(In reply to Richard Biener from comment #3)
> I think we need to figure out what exactly gets slower (and hope it's not
> scattered all over the place)

I have collected some profiles:

r14-5602-ge6269bb69c0734

# Samples: 516K of event 'cycles:u'
# Event count (approx.): 468008188417
#
# Overhead   Samples  Command          Shared Object                          Symbol
# ........  ........  ...............  .....................................  .............................
#
    13.55%     69886  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] mc_chroma
    11.05%     57017  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_16x16
     9.24%     47693  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_8x8
     8.67%     44733  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] get_ref
     4.84%     24984  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] sub16x16_dct
     4.16%     21484  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_me_search_ref
     3.30%     17033  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_hadamard_ac_16x16
     2.28%     11770  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_4x4
     2.10%     10824  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] quant_trellis_cabac
     2.07%     10694  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] hpel_filter
     2.05%     10616  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] sub8x8_dct
     1.86%      9593  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] refine_subpel
     1.70%      8788  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] quant_4x4
     1.57%      8077  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_sad_16x16
     1.16%      6324  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] frame_init_lowres_core
     1.14%      5867  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_sa8d_8x8
     1.11%      5738  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_cabac_encode_decision_c
     1.08%      5736  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_var_16x16

r14-5603-g2b59e2b4dff421

# Samples: 550K of event 'cycles:u'
# Event count (approx.): 498834737657
#
# Overhead   Samples  Command          Shared Object                          Symbol
# ........  ........  ...............  .....................................  .............................
#
    18.21%    100151  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_16x16
    12.37%     68006  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] mc_chroma
     8.51%     46815  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_8x8
     7.56%     41560  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] get_ref
     4.53%     24901  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] sub16x16_dct
     3.92%     21561  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_me_search_ref
     3.08%     16963  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_hadamard_ac_16x16
     2.41%     13239  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_satd_4x4
     1.99%     10931  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] quant_trellis_cabac
     1.96%     10801  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] hpel_filter
     1.95%     10764  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] sub8x8_dct
     1.56%      8587  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] quant_4x4
     1.49%      8166  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] refine_subpel
     1.48%      8124  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_sad_16x16
     1.09%      6328  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] frame_init_lowres_core
     1.07%      5901  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_pixel_sa8d_8x8
     1.04%      5703  x264_r_peak.min  x264_r_peak.mine-pgo-Ofast-native-m64  [.] x264_cabac_encode_decision_c
It looks like x264_pixel_satd_16x16 consumes more time after my commit. An extracted case is below; note there is no attribute((always_inline)) on x264_pixel_satd_8x4 in the original source, it is added here to force inlining (under PGO the function is hot and will be inlined).

typedef unsigned char uint8_t;
typedef unsigned uint32_t;
typedef unsigned short uint16_t;

static inline uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

int __attribute__((always_inline))
x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        { int t0 = a0 + a1; int t1 = a0 - a1;
          int t2 = a2 + a3; int t3 = a2 - a3;
          tmp[i][0] = t0 + t2; tmp[i][2] = t0 - t2;
          tmp[i][1] = t1 + t3; tmp[i][3] = t1 - t3; };
    }
    for( int i = 0; i < 4; i++ )
    {
        { int t0 = tmp[0][i] + tmp[1][i]; int t1 = tmp[0][i] - tmp[1][i];
          int t2 = tmp[2][i] + tmp[3][i]; int t3 = tmp[2][i] - tmp[3][i];
          a0 = t0 + t2; a2 = t0 - t2;
          a1 = t1 + t3; a3 = t1 - t3; };
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}

int x264_pixel_satd_16x16( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    int sum = x264_pixel_satd_8x4( pix1, i_pix1, pix2, i_pix2 )
            + x264_pixel_satd_8x4( pix1+4*i_pix1, i_pix1, pix2+4*i_pix2, i_pix2 );
    sum += x264_pixel_satd_8x4( pix1+8, i_pix1, pix2+8, i_pix2 )
         + x264_pixel_satd_8x4( pix1+8+4*i_pix1, i_pix1, pix2+8+4*i_pix2, i_pix2 );
    sum += x264_pixel_satd_8x4( pix1+8*i_pix1, i_pix1, pix2+8*i_pix2, i_pix2 )
         + x264_pixel_satd_8x4( pix1+12*i_pix1, i_pix1, pix2+12*i_pix2, i_pix2 );
    sum += x264_pixel_satd_8x4( pix1+8+8*i_pix1, i_pix1, pix2+8+8*i_pix2, i_pix2 )
         + x264_pixel_satd_8x4( pix1+8+12*i_pix1, i_pix1, pix2+8+12*i_pix2, i_pix2 );
    return sum;
}

After the commit, SLP fails to split the group of size 16 (vector(16) int) into smaller 4 + 12 groups and misses vectorization for the cases below:

  vect_t2_2445.784_8503 = VIEW_CONVERT_EXPR<vector(4) int>(_8502);
  vect__2457.786_8505 = vect_t0_2441.783_8501 - vect_t2_2445.784_8503;
  vect__2448.785_8504 = vect_t0_2441.783_8501 + vect_t2_2445.784_8503;
  _8506 = VEC_PERM_EXPR <vect__2448.785_8504, vect__2457.786_8505, { 0, 1, 6, 7 }>;
  vect__2449.787_8507 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(_8506);
  t3_2447 = (int) _2446;
  _2448 = t0_2441 + t2_2445;
  _2449 = (unsigned int) _2448;
  _2451 = t0_2441 - t2_2445;
  _2452 = (unsigned int) _2451;
  _2454 = t1_2443 + t3_2447;
  _2455 = (unsigned int) _2454;
  _2457 = t1_2443 - t3_2447;
  _2458 = (unsigned int) _2457;
  MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B] = vect__2449.787_8507;

The vector store would be optimized away together with a later vector load, so for the bad case there are STLF (store-to-load forwarding) issues.
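To make the STLF concern concrete, here is a hand-written sketch (mine, not the code GCC actually generates; the function name is made up) of the access pattern in x264_pixel_satd_8x4: rows of tmp written with 16-byte vector stores, then read back column-wise with 4-byte scalar loads:

#include <stdint.h>

typedef uint32_t v4su __attribute__ ((vector_size (16)));

/* Each row of tmp is written with one 16-byte vector store; the loop then
   reloads individual 4-byte elements of those stores (column-wise access).
   Depending on the microarchitecture, such narrow reloads of a
   still-in-flight wide store can defeat store-to-load forwarding or add
   latency.  */
uint32_t
column_sum (v4su row0, v4su row1, v4su row2, v4su row3)
{
  uint32_t tmp[4][4];
  *(v4su *) tmp[0] = row0;
  *(v4su *) tmp[1] = row1;
  *(v4su *) tmp[2] = row2;
  *(v4su *) tmp[3] = row3;

  uint32_t sum = 0;
  for (int i = 0; i < 4; i++)
    sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
  return sum;
}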
I guess the explicit .REDUC_PLUS instead of the original VEC_PERM_EXPR somehow impacts the store split decision.
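For context, a rough illustration (my own, written with GNU vector extensions; not the code the vectorizer emits) of the two shapes the final horizontal sum can take, a VEC_PERM_EXPR-style shuffle/add sequence versus a single reduction:

typedef unsigned v4su __attribute__ ((vector_size (16)));

/* Shuffle/add form: two permutes, two vector adds and a lane extract.
   With a reduc_plus_scal_<mode> expander available, the vectorizer can
   instead emit one .REDUC_PLUS internal call for the same computation.  */
unsigned
hsum_perm (v4su v)
{
  v4su t = v + __builtin_shuffle (v, (v4su) { 2, 3, 2, 3 });  /* fold in high half */
  t = t + __builtin_shuffle (t, (v4su) { 1, 1, 1, 1 });       /* fold in odd lane  */
  return t[0];
}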
*** Bug 112879 has been marked as a duplicate of this bug. ***
GCC 14.1 is being released, retargeting bugs to GCC 14.2.
GCC 14.2 is being released, retargeting bugs to GCC 14.3.
I think it should be fixed by r15-2820-gab18785840d7b8
Should be fixed in GCC 15.0.