Consider the following example:

template <class T>
struct simple_optional {
    bool has_val;
    T val;

    auto begin() const -> T const* { return &val; }
#ifdef SIMPLE
    auto end() const -> T const* { return &val + (has_val ? 1 : 0); }
#else
    auto end() const -> T const* { return has_val ? &val + 1 : &val; }
#endif
};

void f(int);

void call_f(simple_optional<int> const& o) {
    for (int i : o) {
        f(i);
    }
}

This function should call f at most one time. With the SIMPLE implementation that adds (has_val ? 1 : 0), or simply has_val, or static_cast<int>(has_val), or any version thereof, gcc trunk at -O3 still emits a loop:

call_f(simple_optional<int> const&):
        push    rbp
        push    rbx
        lea     rbx, [rdi+4]
        sub     rsp, 8
        movzx   edx, BYTE PTR [rdi]
        lea     rbp, [rbx+rdx*4]
        test    dl, dl
        je      .L1
.L3:
        mov     edi, DWORD PTR [rbx]
        add     rbx, 4
        call    f(int)
        cmp     rbp, rbx
        jne     .L3
.L1:
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret

With the other approach, where end() explicitly returns either &val + 1 or &val, gcc does not emit a loop:

call_f(simple_optional<int> const&):
        cmp     BYTE PTR [rdi], 0
        jne     .L4
        ret
.L4:
        mov     edi, DWORD PTR [rdi+4]
        jmp     f(int)

On compiler explorer: https://godbolt.org/z/jaMxqsj5q
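For reference, a minimal sketch of what call_f should reduce to with either end() implementation: the range is either empty or has exactly one element, so f runs at most once. The name call_f_expected is made up for illustration; simple_optional and f are the declarations from the testcase above.

```
// Hand-written reduction of what call_f should compile down to:
// the optional holds zero or one element, so f is called at most once.
void call_f_expected(simple_optional<int> const& o) {
    if (o.has_val)
        f(o.val);
}
```

The second assembly listing above is exactly this shape: a test of has_val followed by a tail call to f.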
Confirmed.

Why didn't sink1 push _10, _13, _12, and _11 past the conditional here ... If it did that I think it might have optimized correctly.

  <bb 2> [local count: 118111600]:
  _11 = &o_3(D)->val;
  _5 = o_3(D)->has_val;
  _12 = (sizetype) _5;
  _13 = _12 << 2;
  _10 = _11 + _13;
  if (_5 != 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 9>; [11.00%]

  <bb 5> [local count: 105119324]:
(In reply to Andrew Pinski from comment #1)
> Confirmed.
>
> Why didn't sink1 push _10, _13, _12, and _11 past the conditional here ...
> If it did that I think it might have optimized correctly.

Because then _12 would be 1, _13 would be 4, and _10 would become `_11 + 4`, and the loop induction variable would be going from _11 to `_11 + 4`:

```
  # __for_begin_15 = PHI <__for_begin_8(6), _11(5)>
  # .MEM_16 = PHI <.MEM_7(6), .MEM_4(D)(5)>
  # VUSE <.MEM_16>
  i_6 = *__for_begin_15;
  # .MEM_7 = VDEF <.MEM_16>
  # USE = nonlocal escaped
  # CLB = nonlocal escaped
  _Z1fiD.2782 (i_6);
  # PT = nonlocal
  __for_begin_8 = __for_begin_15 + 4;
  if (__for_begin_8 != _10)
    goto <bb 6>; [89.00%]
  else
    goto <bb 10>; [11.00%]
```

And then we could figure out that this loop is only gone through once.
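To put that reasoning in source-level terms, here is a rough hand-written equivalent of the IR once the end-pointer computation has been sunk into the has_val branch. This is an illustration only, not actual GIMPLE; the function name is made up and the declarations are the ones from the testcase in comment #0.

```
// Illustration: once the end computation sits inside the branch,
// has_val is known to be 1 there, so end folds to begin + 1 and the
// loop degenerates to a single iteration.
void call_f_after_sinking(simple_optional<int> const& o) {
    int const* begin = &o.val;
    if (o.has_val) {
        int const* end = begin + 1;   // was begin + (has_val ? 1 : 0)
        for (int const* p = begin; p != end; ++p)
            f(*p);
    }
}
```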
This is the "old" issue with the bogus

  /* If BEST_BB is at the same nesting level, then require it to have
     significantly lower execution frequency to avoid gratuitous movement.  */
  if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
      /* If result of comparsion is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
      && !(best_bb->count * 100 >= early_bb->count * threshold))
    return best_bb;

check. We have a BB2 count of 118111600, and BB5, which we'd sink to, has 105119324, which is higher than 75% of BB2. This was supposed to prevent sinking across if (x) unlikely (); blocks, but it has odd side effects like this one. IIRC there are duplicates for this.

I'll note that we probably want to avoid sinking to a post-dominator unless it sinks out of a loop (that's cross-BB scheduling). But we no longer compute post-dominators. Without re-introducing those and limiting sinking this way, removing the odd code will cause regressions.
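For concreteness, plugging the counts from this testcase into that check shows why the sink is rejected. This is a standalone sketch; the threshold value 75 is taken from the "75% of BB2" remark above, and the variable names are made up.

```
#include <cstdio>

int main() {
    // Counts taken from the dump above.
    long long early_bb_count = 118111600;  // BB2, where the statements currently live
    long long best_bb_count  = 105119324;  // BB5, the candidate sink location
    int threshold = 75;                    // "significantly lower" cutoff, in percent

    // Mirrors !(best_bb->count * 100 >= early_bb->count * threshold):
    // BB5's count is about 89% of BB2's, above the 75% cutoff, so the
    // heuristic keeps the statements in BB2 and nothing is sunk.
    bool sink_allowed = !(best_bb_count * 100 >= early_bb_count * threshold);
    std::printf("sink allowed: %s\n", sink_allowed ? "yes" : "no");  // prints: no
    return 0;
}
```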
Let me work on this in stage1.
So when doing the sinking it's too late, since there's no VRP between early sinking and late unrolling, and threadfull also doesn't figure out

  <bb 3> [local count: 105119324]:
  _11 = &o_3(D)->val;
  _12 = 1;
  _13 = 4;
  _10 = _11 + 4;
  ivtmp.11_14 = (unsigned long) _11;

  <bb 4> [local count: 955630224]:
  # ivtmp.11_17 = PHI <ivtmp.11_2(6), ivtmp.11_14(3)>
  _18 = (void *) ivtmp.11_17;
  i_6 = MEM[(const int *)_18];
  f (i_6);
  ivtmp.11_2 = ivtmp.11_17 + 4;
  __for_begin_19 = (const int *) ivtmp.11_2;
  if (__for_begin_19 != _10)
    goto <bb 6>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 6> [local count: 850510900]:
  goto <bb 4>; [100.00%]

but IIRC it doesn't consider the loop entry "fallthru" as a "threading" source.

Nevertheless I have a fix for the sinking issue.
So actually it also needs -fno-ivopts since otherwise VRP is confused. With -fno-ivopts it's late DOM that figures out the loop doesn't iterate.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:99b1daae18c095d6c94d32efb77442838e11cbfb

commit r15-518-g99b1daae18c095d6c94d32efb77442838e11cbfb
Author: Richard Biener <rguenther@suse.de>
Date:   Fri May 3 14:04:41 2024 +0200

    tree-optimization/114589 - remove profile based sink heuristics

    The following removes the profile based heuristic limiting sinking
    and instead uses post-dominators to avoid sinking to places that are
    executed under the same conditions as the earlier location which the
    profile based heuristic should have guaranteed as well.  To avoid
    regressing this moves the empty-latch check to cover all sink cases.
    It also stream-lines the resulting select_best_block a bit but avoids
    adjusting heuristics more with this change.

    gfortran.dg/streamio_9.f90 starts execute failing with this on x86_64
    with -m32 because the (float)i * 9.9999...e-7 compute is sunk across
    a STOP causing it to be no longer spilled and thus the compare failing
    due to excess precision.  The patch adds -ffloat-store to avoid this,
    following other similar testcases.

    This change fixes the testcase in the PR only when using -fno-ivopts
    as otherwise VRP is confused.

            PR tree-optimization/114589
            * tree-ssa-sink.cc (select_best_block): Remove profile-based
            heuristics.  Instead reject sink locations that sink to
            post-dominators.  Move empty latch check here from
            statement_sink_location.  Also consider early_bb for the
            loop depth check.
            (statement_sink_location): Remove superfluous check.  Remove
            empty latch check.
            (pass_sink_code::execute): Compute/release post-dominators.
            * gfortran.dg/streamio_9.f90: Use -ffloat-store to avoid excess
            precision when not spilling.
            * g++.dg/tree-ssa/pr114589.C: New testcase.
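Reduced to a toy model, the rule described in the commit message (and sketched in comment #3) looks roughly like this. None of the names below are actual GCC code; they only restate the decision logic under the stated assumption that a post-dominating block runs under the same conditions as the original block.

```
// Toy model: a candidate sink block that post-dominates the statement's
// current block executes under the same conditions anyway, so sinking
// there is only useful if it also moves the statement out of a loop.
struct sink_candidate {
    bool post_dominates_early;  // candidate post-dominates the current block
    int  early_loop_depth;      // loop depth of the current block
    int  best_loop_depth;       // loop depth of the candidate block
};

bool worth_sinking(sink_candidate const& c) {
    if (c.post_dominates_early && c.best_loop_depth >= c.early_loop_depth)
        return false;  // same conditions and no loop left: keep the statement put
    return true;
}
```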
The missed sinking is now fixed for GCC 15. VRP is still confused by what IVOPTs does, so without -fno-ivopts the loop remains.