This seems to be at least partly caused by the fact that ipa-cp does not clone functions with no hot calls. This is wrong. Since the function itself may spend a lot of time, we do not want to give up just because it is called only a few times. The cost model should consider the expected speedup after cloning, i.e. time_benefit multiplied by the sum of counts of call edges.
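To make the proposed metric concrete, here is a minimal sketch in C. It is not GCC code: `struct call_edge` and `clone_benefit` are hypothetical names made up for illustration, and the real ipa-cp cost model also weighs code size growth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a call edge; this is NOT GCC's cgraph_edge.  */
struct call_edge {
    uint64_t count;   /* profile count: how often this edge executed */
};

/* Expected total saving from a clone: the per-call time benefit multiplied
   by the sum of the counts of the call edges redirected to the clone.  */
static double
clone_benefit(double time_benefit, const struct call_edge *edges, size_t n)
{
    assert(edges != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += edges[i].count;
    return time_benefit * (double) sum;
}
```

With this metric a function reached through two cold edges (counts 3 and 2) but with a per-call benefit of 1000 scores 5000, which is more than a function called 1000 times with a per-call benefit of 2. That is the case a "no hot calls, do not clone" rule misses.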
There is a surprising group of issues with the ipa-cp cost model and speculation that I have WIP patches for. However, there is also a problem with vectorization of mc_chroma. We vectorize it w/o profile feedback but give up with profile feedback, since the expected number of iterations is lower than the min profitability threshold. With a cascaded epilogue it seems that vectorization is a win, so the cost model should consider vectorization with a smaller VF...
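A rough sketch of what "consider a smaller VF" could mean, written as a toy model rather than vectorizer code. `struct vf_candidate`, `pick_vf` and the threshold numbers used with it are all hypothetical; the real decision is made in vect_analyze_loop_costing with target-specific costs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-VF cost summary; not a GCC data structure.  */
struct vf_candidate {
    unsigned vf;                 /* vectorization factor */
    long min_profitable_iters;   /* fewest iterations at which this VF wins */
};

/* Instead of rejecting the loop when the preferred VF is unprofitable for
   the estimated trip count, return the largest VF that is still profitable.
   Returns 0 if no candidate qualifies.  An estimate of -1 means "unknown"
   and does not rule out any candidate.  */
static unsigned
pick_vf(const struct vf_candidate *cands, size_t n, long estimated_niter)
{
    assert(cands != NULL || n == 0);
    unsigned best = 0;
    for (size_t i = 0; i < n; i++) {
        int profitable = estimated_niter == -1
                         || estimated_niter >= cands[i].min_profitable_iters;
        if (profitable && cands[i].vf > best)
            best = cands[i].vf;
    }
    return best;
}
```

For example, with made-up candidates {VF 32, threshold 40}, {VF 16, threshold 20} and {VF 8, threshold 8}, an estimated trip count of 12 would select VF 8 rather than leaving the loop scalar.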
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:57dbbdd8e34b80926e06b352b6c442c555b303ed

commit r15-8041-g57dbbdd8e34b80926e06b352b6c442c555b303ed
Author: Jan Hubicka <hubicka@ucw.cz>
Date:   Thu Mar 13 20:11:02 2025 +0100

    Fix speculation_useful_p

    This patch fixes issue with speculation and x264.  With profile feedback
    we first introduce speculative calls to mc_chroma which is called indirectly.
    Then we propagate constants across these calls (which is useful transform) but
    then speculation_useful_p decides that these speculations are not useful and
    we end up calling unspecialized version.

    This patch updates speculation_useful_p to consider edges redirected earlier
    to clones as useful, since we can expect that ipa-cp knows what it is doing
    (originally it only looked for inlined calls).  I also noticed that we want
    to keep edges even if they are not hot.

    Finally I noticed a typo in computing target in code which intends to keep
    devirtualized calls to functions where we propagated pureness/constness.
    Newly we also track ipa-modref summaries as they also may be useful.

    gcc/ChangeLog:

            PR ipa/119147
            * ipa-inline.cc: Include ipa-modref-tree.h and ipa-modref.h.
            (speculation_useful_p): If target is a clone, speculation is useful;
            fix mixup of caller and callee; speculate also calls not considered
            hot; consider modref summary also possibly useful for optimization.
            * ipa-profile.cc (ipa_profile): Keep non-hot speculations.
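As a reading aid, the conditions listed in the commit message can be modelled roughly as below. This is a toy model only: `struct spec_target` and `speculation_useful` are hypothetical, and the real speculation_useful_p in ipa-inline.cc works on cgraph edges and performs further checks not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the target of a speculative call.  */
struct spec_target {
    bool inlined;             /* the speculative call was inlined */
    bool is_clone;            /* edge was redirected to a clone, e.g. by ipa-cp */
    bool pure_or_const;       /* pureness/constness was propagated to the target */
    bool has_modref_summary;  /* an ipa-modref summary exists for the target */
};

/* Keep the speculation if any property suggests that knowing the callee
   helps later optimization.  Hotness of the edge is deliberately not part
   of the test, matching "speculate also calls not considered hot".  */
static bool
speculation_useful(const struct spec_target *t)
{
    assert(t != NULL);
    return t->inlined || t->is_clone || t->pure_or_const
           || t->has_modref_summary;
}
```

In this model the x264 case corresponds to `is_clone` being set for the speculative call to the mc_chroma clone, which the previous logic (looking only at inlined calls) did not treat as useful.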
With speculation_useful_p fixed, we are now able to constant propagate stride into mc_chroma with PGO, but it does not help runtime. https://gcc.gnu.org/pipermail/gcc-patches/2025-April/680055.html solves the costing issue. The vectorizer still disables itself when the loop trip count is small, even when vectorizing with a cascaded epilogue would still be a good idea.
Re-benchmarked current trunk with -flto -Ofast -march=native (base) and -flto -Ofast -march=native + PGO (peak) on znver3:

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Copies  Run Time     Rate       Copies  Run Time     Rate
--------------- -------  ---------  ---------   -------  ---------  ---------
525.x264_r            1       87.1       20.1  *       1        101       17.3

The -flto -Ofast profile is:

   7.67%  x264_r_base.tru  [.] x264_pixel_satd_8x4.lto_priv.0
   4.80%  x264_r_base.tru  [.] get_ref.lto_priv.0
   4.08%  x264_r_base.tru  [.] mc_chroma.lto_priv.0
   1.58%  x264_r_base.tru  [.] x264_me_search_ref
   1.41%  x264_r_base.tru  [.] pixel_hadamard_ac
   1.31%  x264_r_base.tru  [.] x264_pixel_satd_4x4.lto_priv.0
   1.17%  x264_r_base.tru  [.] sub4x4_dct.lto_priv.0
   1.11%  x264_r_base.tru  [.] refine_subpel.lto_priv.0
   1.10%  x264_r_base.tru  [.] quant_4x4.lto_priv.0
   0.98%  x264_r_base.tru  [.] quant_trellis_cabac.lto_priv.0
   0.77%  x264_r_base.tru  [.] hpel_filter.lto_priv.0
   0.68%  x264_r_base.tru  [.] x264_pixel_sad_x4_8x8.lto_priv.0
   0.56%  x264_r_base.tru  [.] frame_init_lowres_core.lto_priv.0
   0.55%  x264_r_base.tru  [.] x264_pixel_sad_x4_16x16.lto_priv.0
   0.54%  x264_r_base.tru  [.] x264_pixel_sad_16x16.lto_priv.0

While with PGO:

   5.04%  x264_r_peak.tru  [.] refine_subpel.lto_priv.0
   4.42%  x264_r_peak.tru  [.] x264_pixel_satd_8x8.constprop.1
   3.66%  x264_r_peak.tru  [.] mc_chroma.constprop.1
   3.45%  x264_r_peak.tru  [.] x264_pixel_satd_16x16.lto_priv.0
   2.78%  x264_r_peak.tru  [.] x264_me_search_ref
   2.13%  x264_r_peak.tru  [.] x264_mb_analyse_intra.lto_priv.0
   2.06%  x264_r_peak.tru  [.] x264_macroblock_encode
   1.43%  x264_r_peak.tru  [.] x264_slicetype_mb_cost
   1.38%  x264_r_peak.tru  [.] mc_chroma.lto_priv.0
   1.22%  x264_r_peak.tru  [.] x264_pixel_hadamard_ac_16x16.constprop.0
   0.99%  x264_r_peak.tru  [.] x264_mb_encode_8x8_chroma
   0.96%  x264_r_peak.tru  [.] quant_trellis_cabac.lto_priv.0
   0.92%  x264_r_peak.tru  [.] x264_pixel_sad_x4_8x8.lto_priv.0
   0.77%  x264_r_peak.tru  [.] hpel_filter.lto_priv.0
   0.77%  x264_r_peak.tru  [.] x264_mb_mc_0xywh
   0.73%  x264_r_peak.tru  [.] x264_pixel_satd_4x4.constprop.1

We speculatively inline get_ref into refine_subpel (which is called indirectly, but the pointer is always the same). Similarly we constant propagate stride to mc_chroma. This seems good, but the total time spent in the mc_chroma clones goes up. Inlining decisions on pixel_satd differ but seem fine.

The next problem is that the vectorizer turns itself off when the trip count is low. The following hack:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 9413dcef702..8882a5dea11 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2483,14 +2483,16 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
       if (estimated_niter == -1)
         estimated_niter = likely_max_stmt_executions_int (loop);
     }
-  if (estimated_niter != -1
+  if (estimated_niter != -1 && 0
       && ((unsigned HOST_WIDE_INT) estimated_niter
           < MAX (th, (unsigned) min_profitable_estimate)))
     {
       if (dump_enabled_p ())
         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "not vectorized: estimated iteration count too "
-                         "small.\n");
+                         "not vectorized: estimated iteration count %li smaller "
+                         "than threshold %li.\n",
+                         (long) estimated_niter,
+                         (long) MAX (th, (unsigned) min_profitable_estimate));
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
                          "not vectorized: estimated iteration count smaller "

improves the PGO score to 18.1 (96.6 runtime). This speeds up mc_chroma.constprop.1 by about 50%.
Unvectorized:

       │     for( int x = 0; x < i_width; x++ )
       │         dst[x] = ( cA*src[x] + cB*src[x+1] + cC*srcp[x]
  0.00 │a0:┌─ movzbl (%rcx,%rax,1),%edx
  1.69 │   │  movzbl 0x1(%rcx,%rax,1),%r14d
  0.15 │   │  imul   %ebx,%edx
  2.57 │   │  imul   %r10d,%r14d
  1.95 │   │  add    %r14d,%edx
 24.93 │   │  movzbl (%rsi,%rax,1),%r14d
  0.65 │   │  imul   %r9d,%r14d
  0.12 │   │  add    %r14d,%edx
  7.48 │   │  movzbl 0x1(%rsi,%rax,1),%r14d
  1.60 │   │  imul   %r11d,%r14d
  0.03 │   │  lea    0x20(%rdx,%r14,1),%edx
 16.81 │   │  sar    $0x6,%edx
 34.78 │   │  mov    %dl,(%rdi,%rax,1)
       │   │for( int x = 0; x < i_width; x++ )
  0.01 │   │  inc    %rax
  0.01 │   ├──cmp    %rax,%r8
  0.02 │   └──jne    a0

But we still don't get the same speed as w/o PGO...
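For readers without the x264 sources at hand, the annotated loop corresponds roughly to the C below. The source line in the perf output is cut off, so the `cD*srcp[x+1]` term and the `+ 32 ) >> 6` rounding are inferred from the assembly (the fourth imul, `lea 0x20` and `sar $0x6`), and the function name `mc_chroma_row` is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* One row of the chroma interpolation: a weighted sum of four neighbouring
   pixels (two from the current row, two from the next row), rounded by
   adding 32 and shifted right by 6.  Reconstructed from the annotated
   assembly; this is not a verbatim copy of the x264 source.  */
static void
mc_chroma_row(uint8_t *dst, const uint8_t *src, const uint8_t *srcp,
              int i_width, int cA, int cB, int cC, int cD)
{
    assert(i_width >= 0);
    for (int x = 0; x < i_width; x++)
        dst[x] = (uint8_t) ((cA * src[x] + cB * src[x + 1]
                             + cC * srcp[x] + cD * srcp[x + 1] + 32) >> 6);
}
```

Each iteration reads x and x+1 from both rows, so the input rows need i_width + 1 valid pixels. The loop has no cross-iteration dependence, which is why it vectorizes once the trip-count check stops rejecting it.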