Bug 119147 - 525.x264_r is approx. 10% slower with LTO+PGO than without (at -Ofast -march=native)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa
Version: 15.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
Reported: 2025-03-06 17:51 UTC by Jan Hubicka
Modified: 2025-04-10 01:13 UTC

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2025-03-13 00:00:00


Description Jan Hubicka 2025-03-06 17:51:04 UTC
This seems to be at least partly caused by the fact that ipa-cp does not clone functions with no hot calls. This is wrong: since the function itself may account for a lot of time, we do not want to give up just because it is called only a few times.

The cost model should consider the expected speedup after cloning, i.e. time_benefit multiplied by the sum of the counts of the call edges.
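
A minimal sketch of that proposed metric, assuming a simplified edge record. The struct and function names below are hypothetical and are not GCC's cgraph or ipa-cp data structures; only time_benefit and the idea of summing call edge counts come from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a call edge; NOT GCC's cgraph_edge. */
struct call_edge_sketch {
    uint64_t count;  /* profile execution count of this call site */
};

/* Sum the profile counts of all call edges that would use the clone. */
static uint64_t sum_edge_counts(const struct call_edge_sketch *edges, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += edges[i].count;
    return sum;
}

/* Expected speedup after cloning: the per-invocation time benefit
   multiplied by how often the clone would be invoked.  */
static uint64_t expected_speedup(uint64_t time_benefit,
                                 const struct call_edge_sketch *edges,
                                 size_t n)
{
    return time_benefit * sum_edge_counts(edges, n);
}
```

Under this metric a function called only five times with a time benefit of 1000 per call scores the same as one called 5000 times with a benefit of 1, which is the case a "hot calls only" filter would miss.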
Comment 1 Jan Hubicka 2025-03-13 18:25:02 UTC
There is a surprising group of issues with the ipa-cp cost model and speculation, for which I have WIP patches.

However, there is also a problem with vectorization of mc_chroma. We vectorize without profile feedback but give up with profile feedback, since the expected number of iterations is lower than the minimum profitability threshold.

With a cascaded epilogue it seems that vectorization is a win, so the cost model should consider vectorization with a smaller VF...
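
For illustration, the rejection test has roughly the following shape. This is a simplified standalone model whose parameter names mirror the vect_analyze_loop_costing excerpt quoted later in this report; the function name and the plain C types are assumptions, not GCC code.

```c
#include <stdbool.h>

/* Simplified model of the "estimated iteration count too small" test.
   estimated_niter == -1 means no estimate is available.  The function
   name and types are assumptions made for this sketch.  */
static bool rejected_for_low_trip_count(long estimated_niter,
                                        unsigned long th,
                                        unsigned long min_profitable_estimate)
{
    /* The loop must reach the larger of the two thresholds.  */
    unsigned long threshold =
        th > min_profitable_estimate ? th : min_profitable_estimate;

    return estimated_niter != -1
           && (unsigned long) estimated_niter < threshold;
}
```

Without profile feedback the estimate is often unavailable or large, so the test does not fire; with feedback a small measured trip count can fall below the threshold and the loop is left scalar.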
Comment 2 GCC Commits 2025-03-13 19:12:57 UTC
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:57dbbdd8e34b80926e06b352b6c442c555b303ed

commit r15-8041-g57dbbdd8e34b80926e06b352b6c442c555b303ed
Author: Jan Hubicka <hubicka@ucw.cz>
Date:   Thu Mar 13 20:11:02 2025 +0100

    Fix speculation_useful_p
    
    This patch fixes issue with speculation and x264.  With profile feedback
    we first introduce speculative calls to mc_chroma which is called indirectly.
    Then we propagate constants across these calls (which is useful transform) but
    then speculation_useful_p decides that these speculations are not useful and
    we end up calling unspecialized version.
    
    This patch updates speculation_useful_p to consider edges redirected earlier
    to clones as useful, since we can expect that ipa-cp knows what it is doing
    (originally it only looked for inlined calls).  I also noticed that we want
    to keep edges even if they are not hot.
    
    Finally I noticed a typo in computing target in code which intends to keep
    devirtualized calls to functions where we propagated pureness/constness. Newly
    we also track ipa-modref summaries as they also may be useful.
    
    gcc/ChangeLog:
    
            PR ipa/119147
            * ipa-inline.cc: Include ipa-modref-tree.h and
            ipa-modref.h.
            (speculation_useful_p): If target is a clone, speculation is useful;
            fix mixup of caller and callee; speculate also calls not considered
            hot; consider modref summary also possibly useful for optimization.
            * ipa-profile.cc (ipa_profile): Keep non-hot speculations.
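
To make the interaction described in the commit message concrete, here is a hand-written C analogue of a speculative call whose direct edge has been redirected to a constant-propagated clone. All names (kernel_generic, kernel_stride16, call_site, STRIDE) are hypothetical and only illustrate the shape of the transform; they are not x264 or GCC code.

```c
#include <stdint.h>

#define STRIDE 16  /* constant known at the call site (illustrative value) */

typedef int (*kernel_fn)(const uint8_t *src, int stride, int n);

/* Generic version: stride is a runtime parameter.  */
static int kernel_generic(const uint8_t *src, int stride, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += src[i * stride];
    return sum;
}

/* Clone with stride == STRIDE propagated, analogous to a .constprop clone.  */
static int kernel_stride16(const uint8_t *src, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += src[i * STRIDE];
    return sum;
}

/* Shape of the call site after speculation and clone redirection.  */
static int call_site(kernel_fn fp, const uint8_t *src, int n)
{
    if (fp == kernel_generic)
        return kernel_stride16(src, n);  /* speculated direct call, redirected to the clone */
    return fp(src, STRIDE, n);           /* fallback: original indirect call */
}
```

If the speculation is later judged not useful and removed, only the indirect call remains and the specialized clone is never reached, which is the situation the commit message describes.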
Comment 3 Jan Hubicka 2025-04-03 14:53:54 UTC
With the speculation_useful_p fix we are now able to constant propagate stride into mc_chroma with PGO, but it does not help runtime.

https://gcc.gnu.org/pipermail/gcc-patches/2025-April/680055.html

solves the costing issue.

The vectorizer is still disabling itself when the loop trip count is small, even if vectorization with a cascaded epilogue is still a good idea.
Comment 4 Jan Hubicka 2025-04-03 16:23:37 UTC
Re-benchmarked current trunk with -flto -Ofast -march=native (base) and -flto -Ofast -march=native + PGO (peak) on znver3:
                       Estimated                       Estimated
                 Base     Base        Base        Peak     Peak        Peak
Benchmarks       Copies  Run Time     Rate        Copies  Run Time     Rate
--------------- -------  ---------  ---------    -------  ---------  ---------
525.x264_r            1       87.1       20.1  *       1        101       17.3  

-flto -Ofast profile is:
   7.67%  x264_r_base.tru  [.] x264_pixel_satd_8x4.lto_priv.0
   4.80%  x264_r_base.tru  [.] get_ref.lto_priv.0
   4.08%  x264_r_base.tru  [.] mc_chroma.lto_priv.0
   1.58%  x264_r_base.tru  [.] x264_me_search_ref
   1.41%  x264_r_base.tru  [.] pixel_hadamard_ac
   1.31%  x264_r_base.tru  [.] x264_pixel_satd_4x4.lto_priv.0
   1.17%  x264_r_base.tru  [.] sub4x4_dct.lto_priv.0
   1.11%  x264_r_base.tru  [.] refine_subpel.lto_priv.0
   1.10%  x264_r_base.tru  [.] quant_4x4.lto_priv.0
   0.98%  x264_r_base.tru  [.] quant_trellis_cabac.lto_priv.0
   0.77%  x264_r_base.tru  [.] hpel_filter.lto_priv.0
   0.68%  x264_r_base.tru  [.] x264_pixel_sad_x4_8x8.lto_priv.0
   0.56%  x264_r_base.tru  [.] frame_init_lowres_core.lto_priv.0
   0.55%  x264_r_base.tru  [.] x264_pixel_sad_x4_16x16.lto_priv.0
   0.54%  x264_r_base.tru  [.] x264_pixel_sad_16x16.lto_priv.0

While with PGO
   5.04%  x264_r_peak.tru  [.] refine_subpel.lto_priv.0
   4.42%  x264_r_peak.tru  [.] x264_pixel_satd_8x8.constprop.1
   3.66%  x264_r_peak.tru  [.] mc_chroma.constprop.1
   3.45%  x264_r_peak.tru  [.] x264_pixel_satd_16x16.lto_priv.0
   2.78%  x264_r_peak.tru  [.] x264_me_search_ref
   2.13%  x264_r_peak.tru  [.] x264_mb_analyse_intra.lto_priv.0
   2.06%  x264_r_peak.tru  [.] x264_macroblock_encode
   1.43%  x264_r_peak.tru  [.] x264_slicetype_mb_cost
   1.38%  x264_r_peak.tru  [.] mc_chroma.lto_priv.0
   1.22%  x264_r_peak.tru  [.] x264_pixel_hadamard_ac_16x16.constprop.0
   0.99%  x264_r_peak.tru  [.] x264_mb_encode_8x8_chroma
   0.96%  x264_r_peak.tru  [.] quant_trellis_cabac.lto_priv.0
   0.92%  x264_r_peak.tru  [.] x264_pixel_sad_x4_8x8.lto_priv.0
   0.77%  x264_r_peak.tru  [.] hpel_filter.lto_priv.0
   0.77%  x264_r_peak.tru  [.] x264_mb_mc_0xywh
   0.73%  x264_r_peak.tru  [.] x264_pixel_satd_4x4.constprop.1

We speculatively inline get_ref into refine_subpel (it is called indirectly, but the pointer is always the same). Similarly, we constant propagate stride into mc_chroma. This seems good, but the sum of time spent in the mc_chroma clones grows. Inlining decisions on pixel_satd differ but seem fine.

The next problem is that the vectorizer turns itself off when the trip count is low. The following hack:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 9413dcef702..8882a5dea11 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2483,14 +2483,16 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
       if (estimated_niter == -1)
        estimated_niter = likely_max_stmt_executions_int (loop);
     }
-  if (estimated_niter != -1
+  if (estimated_niter != -1 && 0
       && ((unsigned HOST_WIDE_INT) estimated_niter
          < MAX (th, (unsigned) min_profitable_estimate)))
     {
       if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "not vectorized: estimated iteration count too "
-                        "small.\n");
+                        "not vectorized: estimated iteration count %li smaller "
+                        "than threshold %li.\n",
+                        (long) estimated_niter,
+                        (long) MAX (th, (unsigned) min_profitable_estimate));
       if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location,
                         "not vectorized: estimated iteration count smaller "

improves the PGO score to 18.1 (runtime 96.6).

This speeds up mc_chroma.constprop.1 by about 50%. Unvectorized:

        │    for( int x = 0; x < i_width; x++ )
        │    dst[x] = ( cA*src[x]  + cB*src[x+1] + cC*srcp[x]
   0.00 │a0:┌─ movzbl (%rcx,%rax,1),%edx
   1.69 │   │  movzbl 0x1(%rcx,%rax,1),%r14d
   0.15 │   │  imul   %ebx,%edx
   2.57 │   │  imul   %r10d,%r14d
   1.95 │   │  add    %r14d,%edx
  24.93 │   │  movzbl (%rsi,%rax,1),%r14d
   0.65 │   │  imul   %r9d,%r14d
   0.12 │   │  add    %r14d,%edx
   7.48 │   │  movzbl 0x1(%rsi,%rax,1),%r14d
   1.60 │   │  imul   %r11d,%r14d
   0.03 │   │  lea    0x20(%rdx,%r14,1),%edx
  16.81 │   │  sar    $0x6,%edx
  34.78 │   │  mov    %dl,(%rdi,%rax,1)
        │   │for( int x = 0; x < i_width; x++ )
   0.01 │   │  inc    %rax
   0.01 │   ├──cmp    %rax,%r8
   0.02 │   └──jne    a0

But we still don't get the same speed as without PGO...
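
For readers without the x264 sources at hand, the scalar loop in the annotation can be sketched as the standalone C function below. The source line in the perf output is cut off after cC*srcp[x]; the rest of the expression here (the fourth product, the +32 rounding term and the shift by 6) is inferred from the assembly (the fourth imul, lea 0x20(...), sar $0x6) rather than copied from x264. The function name chroma_row and its signature are assumptions.

```c
#include <stdint.h>

/* One row of the interpolation loop, reconstructed from the annotated
   assembly.  Reads src[0..i_width] and srcp[0..i_width], so both input
   rows need i_width + 1 valid bytes.  Name and signature are hypothetical.  */
static void chroma_row(uint8_t *dst, const uint8_t *src, const uint8_t *srcp,
                       int i_width, int cA, int cB, int cC, int cD)
{
    for (int x = 0; x < i_width; x++)
        dst[x] = (uint8_t) ((cA * src[x] + cB * src[x + 1]
                             + cC * srcp[x] + cD * srcp[x + 1] + 32) >> 6);
}
```

Each iteration is four byte loads, four multiplies, and one byte store with no cross-iteration dependence, which is why the loop is a natural vectorization candidate even at small widths.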