[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022
luoxhu at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon Jul 25 09:44:25 GMT 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293
--- Comment #4 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> I can reproduce a regression with -Ofast -march=znver2 running on Haswell as
> well. -fopt-info doesn't reveal anything interesting besides
>
> -fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 32987933)
> +fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 129072791)
>
> obviously the slowdown is in P7Viterbi. There are only minimal changes on
> the GIMPLE side, one of them notable:
>
> niters_vector_mult_vf.205_2406 = niters.203_442 & 429496729 | _2041 = niters.203_438 & 3;
> _2408 = (int) niters_vector_mult_vf.205_2406;                | if (_2041 == 0)
> tmp.206_2407 = k_384 + _2408;                                |   goto <bb 66>; [25.00%]
> _2300 = niters.203_442 & 3;                                  <
> if (_2300 == 0)                                              <
>   goto <bb 65>; [25.00%]                                     <
> else                                                           else
>   goto <bb 36>; [75.00%]                                       goto <bb 36>; [75.00%]
>
> <bb 36> [local count: 41646173]:                             | <bb 36> [local count: 177683003]:
> # k_2403 = PHI <tmp.206_2407(35), tmp.239_2637(34)>          | niters_vector_mult_vf.205_2409 = niters.203_438 & 429496729
> # DEBUG k => k_2403                                          | _2411 = (int) niters_vector_mult_vf.205_2409;
>                                                              > tmp.206_2410 = k_382 + _2411;
>                                                              >
>                                                              > <bb 37> [local count: 162950122]:
>                                                              > # k_2406 = PHI <tmp.206_2410(36), tmp.239_2639(34)>
>
> the sink pass now does the transform where it did not do so before.
>
> That's apparently because of
>
> /* If BEST_BB is at the same nesting level, then require it to have
>    significantly lower execution frequency to avoid gratuitous movement.  */
> if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>     /* If result of comparsion is unknown, prefer EARLY_BB.
>        Thus use !(...>=..) rather than (...<...) */
>     && !(best_bb->count * 100 >= early_bb->count * threshold))
>   return best_bb;
>
> /* No better block found, so return EARLY_BB, which happens to be the
>    statement's original block.  */
> return early_bb;
>
> where the SRC count is 96726596 before and 236910671 after, and the
> destination count is 72544947 before and 177683003 after.
> The edge probabilities are 75% vs 25%, and param_sink_frequency_threshold
> is exactly 75 as well. Since 236910671 * 0.75 is rounded down, the new
> counts pass the test, while the previous state is an exact match,
> defeating it.
>
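To make the rounding effect concrete, here is a minimal standalone sketch
of that threshold test with the counts quoted above (plain uint64_t
arithmetic is an assumption for illustration; GCC's profile_count type is
more involved):

  #include <cstdint>
  #include <cstdio>

  int main ()
  {
    const uint64_t threshold = 75;   /* param_sink_frequency_threshold */

    /* Counts before the change: the products match exactly, the >= test
       succeeds, and select_best_block keeps EARLY_BB (no sinking).  */
    uint64_t best = 72544947, early = 96726596;
    std::printf ("before: sink = %d\n",
                 (int) !(best * 100 >= early * threshold));

    /* Counts after the change: 177683003 * 100 falls 25 short of
       236910671 * 75, so the >= test fails and the statement is sunk.  */
    best = 177683003; early = 236910671;
    std::printf ("after:  sink = %d\n",
                 (int) !(best * 100 >= early * threshold));
    return 0;
  }

This prints "before: sink = 0" and "after: sink = 1", matching the
behavior change described above.
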
> It's a little bit of an arbitrary choice,
>
> diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
> index 2e744d6ae50..9b368e13463 100644
> --- a/gcc/tree-ssa-sink.cc
> +++ b/gcc/tree-ssa-sink.cc
> @@ -230,7 +230,7 @@ select_best_block (basic_block early_bb,
>    if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>        /* If result of comparsion is unknown, prefer EARLY_BB.
>           Thus use !(...>=..) rather than (...<...) */
> -      && !(best_bb->count * 100 >= early_bb->count * threshold))
> +      && !(best_bb->count * 100 > early_bb->count * threshold))
>      return best_bb;
> 
>    /* No better block found, so return EARLY_BB, which happens to be the
>
> fixes the missed sinking but not the regression :/
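(Note that the tweak only changes the tie case: with '>' instead of '>=',
an exact best_bb->count * 100 == early_bb->count * threshold now selects
BEST_BB, i.e. ties go to sinking rather than to the statement's original
block.)
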
>
> The count differences start to appear when LC PHI blocks are added
> only for virtuals, and then pre-existing 'Invalid sum of incoming counts'
> issues eventually lead to mismatches. The 'Invalid sum of incoming counts'
> errors start with the loop splitting pass.
>
> fast_algorithms.c:145:10: optimized: loop split
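For reference, the invariant behind that 'Invalid sum of incoming counts'
diagnostic is that a block's count must equal the sum of the counts on its
incoming edges. A minimal sketch of such a check, using illustrative
stand-in types rather than GCC's CFG structures:

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  /* Illustrative stand-ins for CFG blocks/edges, not GCC's types.  */
  struct block
  {
    uint64_t count;                     /* execution count of the block */
    std::vector<uint64_t> pred_counts;  /* counts of the incoming edges */
  };

  /* The property the verifier checks: the incoming edge counts must
     add up to the block's own count.  */
  static bool
  incoming_counts_valid (const block &bb)
  {
    uint64_t sum = 0;
    for (uint64_t c : bb.pred_counts)
      sum += c;
    return sum == bb.count;
  }

  int main ()
  {
    /* A 75%/25% split of 236910671 adds up exactly, so this passes;
       a pass that mis-scales one edge after splitting would not.  */
    block bb = { 236910671, { 177683003, 59227668 } };
    std::printf ("valid: %d\n", (int) incoming_counts_valid (bb));
    return 0;
  }
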
>
> Xionghu Luo did profile count updates there; not sure if that made things
> worse in this case.
>
> At least with broken BB counts, splitting/unsplitting an edge can propagate
> bogus counts elsewhere, it seems.
:( Could you please try reverting cd5ae148c47c6dee05adb19acd6a523f7187be7f
and see whether the performance is back?
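(On a GCC git checkout that would be something like "git revert
cd5ae148c47c6dee05adb19acd6a523f7187be7f", assuming the revert still
applies cleanly, followed by a rebuild and a re-run of 456.hmmer with the
same -Ofast -march=native flags.)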