Bug 114589 - missed optimization: losing bool range information, missed sinking
Summary: missed optimization: losing bool range information, missed sinking
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 13.1.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: VRP
  Show dependency treegraph
 
Reported: 2024-04-04 17:28 UTC by Barry Revzin
Modified: 2024-05-15 16:14 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-04-04 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Barry Revzin 2024-04-04 17:28:42 UTC
Consider the following example:

template <class T>
struct simple_optional {
    bool has_val;
    T val;

    auto begin() const -> T const* { return &val; }
    #ifdef SIMPLE
    auto end() const -> T const* { return &val + (has_val ? 1 : 0); }
    #else
    auto end() const -> T const* { return has_val ? &val + 1 : &val; }
    #endif
};

void f(int);

void call_f(simple_optional<int> const& o) {   
    for (int i : o) {
        f(i);
    }
}

This function should call f at most one time. With the SIMPLE implementation that adds (has_val ? 1 : 0), or simply has_val, or static_cast<int>(has_val), or any version thereof - gcc trunk -O3 still emits a loop:

call_f(simple_optional<int> const&):
        push    rbp
        push    rbx
        lea     rbx, [rdi+4]
        sub     rsp, 8
        movzx   edx, BYTE PTR [rdi]
        lea     rbp, [rbx+rdx*4]
        test    dl, dl
        je      .L1
.L3:
        mov     edi, DWORD PTR [rbx]
        add     rbx, 4
        call    f(int)
        cmp     rbp, rbx
        jne     .L3
.L1:
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret

With the other approach, where end() explicitly returns either &val + 1 or &val, gcc does not emit a loop:

call_f(simple_optional<int> const&):
        cmp     BYTE PTR [rdi], 0
        jne     .L4
        ret
.L4:
        mov     edi, DWORD PTR [rdi+4]
        jmp     f(int)

On compiler explorer: https://godbolt.org/z/jaMxqsj5q
Comment 1 Andrew Pinski 2024-04-04 17:47:18 UTC
Confirmed.

Why didn't sink1 push _10, _13, _12, and _11 past the conditional here ...
If it did that I think it might have optimized correctly.

  <bb 2> [local count: 118111600]:
  _11 = &o_3(D)->val;
  _5 = o_3(D)->has_val;
  _12 = (sizetype) _5;
  _13 = _12 << 2;
  _10 = _11 + _13;
  if (_5 != 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 9>; [11.00%]

  <bb 5> [local count: 105119324]:
Comment 2 Andrew Pinski 2024-04-04 17:58:27 UTC
(In reply to Andrew Pinski from comment #1)
> Confirmed.
> 
> Why didn't sink1 push _10, _13, _12, and _11 past the conditional here ...
> If it did that I think it might have optimized correctly.

Because then _12 would be 1, _13 would be 4 and _10 would be come `_11 + 4` and the loop induction variable would be going from _11 to `_11 + 4` then:
```
  # __for_begin_15 = PHI <__for_begin_8(6), _11(5)>
  # .MEM_16 = PHI <.MEM_7(6), .MEM_4(D)(5)>
  # VUSE <.MEM_16>
  i_6 = *__for_begin_15;
  # .MEM_7 = VDEF <.MEM_16>
  # USE = nonlocal escaped 
  # CLB = nonlocal escaped 
  _Z1fiD.2782 (i_6);
  # PT = nonlocal 
  __for_begin_8 = __for_begin_15 + 4;
  if (__for_begin_8 != _10)
    goto <bb 6>; [89.00%]
  else
    goto <bb 10>; [11.00%]
```

And we could figure out this loop is only gone through once only.
Comment 3 Richard Biener 2024-04-05 07:52:58 UTC
This is the "old" issue with the bogus

  /* If BEST_BB is at the same nesting level, then require it to have
     significantly lower execution frequency to avoid gratuitous movement.  */
  if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
      /* If result of comparsion is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
      && !(best_bb->count * 100 >= early_bb->count * threshold))
    return best_bb;

check.  We have a BB2 count of 118111600 and BB5 we'd sink to has 105119324
which is higher than 75% of BB2.

This was supposed to prevent sinking across if (x) unlikely (); blocks
but it has odd side-effects like this one.

IIRC there's duplicates for this.

I'll note that we probably want to avoid sinking to a post-dominator
unless it sinks out of a loop (that's cross-BB scheduling).  But we no
longer compute post-dominators.  Without re-introducing those and
limiting sinking this way removing the odd code will cause regressions.
Comment 4 Richard Biener 2024-04-05 09:39:40 UTC
Let me work on this in stage1.
Comment 5 Richard Biener 2024-05-15 09:37:58 UTC
So when doing the sinking it's too late since there's no VRP between early
sinking and late unrolling and also threadfull doesn't figure out

  <bb 3> [local count: 105119324]:
  _11 = &o_3(D)->val;
  _12 = 1;
  _13 = 4;
  _10 = _11 + 4;
  ivtmp.11_14 = (unsigned long) _11;

  <bb 4> [local count: 955630224]:
  # ivtmp.11_17 = PHI <ivtmp.11_2(6), ivtmp.11_14(3)>
  _18 = (void *) ivtmp.11_17;
  i_6 = MEM[(const int *)_18];
  f (i_6);
  ivtmp.11_2 = ivtmp.11_17 + 4;
  __for_begin_19 = (const int *) ivtmp.11_2;
  if (__for_begin_19 != _10)
    goto <bb 6>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 6> [local count: 850510900]:
  goto <bb 4>; [100.00%]

but IIRC it doesn't consider the loop entry "fallthru" as "threading" source.

Nevertheless I have a fix for the sinking issue.
Comment 6 Richard Biener 2024-05-15 11:32:41 UTC
So actually it also needs -fno-ivopts since otherwise VRP is confused.  With
-fno-ivopts it's late DOM that figures out the loop doesn't iterate.
Comment 7 GCC Commits 2024-05-15 16:13:34 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:99b1daae18c095d6c94d32efb77442838e11cbfb

commit r15-518-g99b1daae18c095d6c94d32efb77442838e11cbfb
Author: Richard Biener <rguenther@suse.de>
Date:   Fri May 3 14:04:41 2024 +0200

    tree-optimization/114589 - remove profile based sink heuristics
    
    The following removes the profile based heuristic limiting sinking
    and instead uses post-dominators to avoid sinking to places that
    are executed under the same conditions as the earlier location which
    the profile based heuristic should have guaranteed as well.
    
    To avoid regressing this moves the empty-latch check to cover all
    sink cases.
    
    It also stream-lines the resulting select_best_block a bit but avoids
    adjusting heuristics more with this change.  gfortran.dg/streamio_9.f90
    starts execute failing with this on x86_64 with -m32 because the
    (float)i * 9.9999...e-7 compute is sunk across a STOP causing it
    to be no longer spilled and thus the compare failing due to excess
    precision.  The patch adds -ffloat-store to avoid this, following
    other similar testcases.
    
    This change fixes the testcase in the PR only when using -fno-ivopts
    as otherwise VRP is confused.
    
            PR tree-optimization/114589
            * tree-ssa-sink.cc (select_best_block): Remove profile-based
            heuristics.  Instead reject sink locations that sink
            to post-dominators.  Move empty latch check here from
            statement_sink_location.  Also consider early_bb for the
            loop depth check.
            (statement_sink_location): Remove superfluous check.  Remove
            empty latch check.
            (pass_sink_code::execute): Compute/release post-dominators.
    
            * gfortran.dg/streamio_9.f90: Use -ffloat-store to avoid
            excess precision when not spilling.
            * g++.dg/tree-ssa/pr114589.C: New testcase.
Comment 8 Richard Biener 2024-05-15 16:14:25 UTC
The missed sinking is now fixed for GCC 15, VRP is still confused by what IVOPTs does so without -fno-ivopts the loop remains.