82776 – Unable to optimize the loop when iteration count is unavailable.

Bug 82776 - Unable to optimize the loop when iteration count is unavailable.

Summary: Unable to optimize the loop when iteration count is unavailable.

Status:	ASSIGNED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	8.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	bin cheng

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2017-10-31 00:55 UTC by AK
Modified:	2021-05-04 12:32 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2017-11-02 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description AK 2017-10-31 00:55:29 UTC

Compiling with 

g++ -O2 --std=c++14 -msse4 -DENABLE_FORLOOP
vs.
g++ -O2 --std=c++14 -msse4

gives dramatically different results in the sense that the loop is completely optimized when for-loop is present instead of `while(true)` loop. Reproduces with g++-7.2 and g++-trunk.


$ cat test.cpp

#include <type_traits>
#include <cstdint>
#include <emmintrin.h>
#include <vector>
#include <cstdlib>
#include <array>
#include <memory>

struct Chunk {
  std::array<uint8_t,14> tags_;
  uint8_t control_;

  bool eof() const {
    return (control_ & 1) != 0;
  }

  static constexpr unsigned kFullMask = (1 << 14) - 1;

  __m128i const* tagVector() const {
    return static_cast<__m128i const*>(static_cast<void const*>(&tags_[0]));
  }

  unsigned emptyMask() const {
    auto tagV = _mm_load_si128(tagVector());
    auto emptyTagV = _mm_cmpeq_epi8(tagV, _mm_setzero_si128());
    return _mm_movemask_epi8(emptyTagV) & kFullMask;
  }

  unsigned occupiedMask() const {
    return emptyMask() ^ kFullMask;
  }
};

#define LIKELY(x) __builtin_expect((x), true)
#define UNLIKELY(x) __builtin_expect((x), false)

struct Iter {
  Chunk* chunk_;
  std::size_t index_;

  void advance() {
    // common case is packed entries
    while (index_ > 0) {
      --index_;
      if (LIKELY(chunk_->tags_[index_] != 0)) {
        return;
      }
    }

    // bar only skips the work of advance() if this loop can
    // be guaranteed to terminate
#ifdef ENABLE_FORLOOP
    for (std::size_t i = 1; i != 0; ++i) {
#else
    while (true) {
#endif
      // exhausted the current chunk
      if (chunk_->eof()) {
        chunk_ = nullptr;
        break;
      }
      ++chunk_;
      auto m = chunk_->occupiedMask();
      if (m != 0) {
        index_ = 31 - __builtin_clz(m);
        break;
      }
    }
  }
};

static Iter foo(Iter iter) {
  puts("hello");
  iter.advance();
  return iter;
}

void bar(Iter iter) {
  foo(iter);
}

Comment 1 Marc Glisse 2017-10-31 06:32:04 UTC

That could be because gcc sadly refuses to optimize away infinite loops (happens for other cases, and cddce2 dump (the pass that removes the whole thing when the macro is defined) says "can not prove finiteness of loop 2"). Although ++chunk_ should be enough to prove that the loop terminates (otherwise chunk_ eventually overflows).

(the unaligned vector use in this code seems strange)

Comment 2 Jakub Jelinek 2017-10-31 07:34:16 UTC

Perhaps at least when seeing an unconditional POINTER_PLUS_EXPR with constant increment and no other bumps for the same pointer in the loop, we could derive upper niter bound from that, even when the loop has multiple exits etc.

Comment 3 bin cheng 2017-10-31 10:07:59 UTC

(In reply to Jakub Jelinek from comment #2)
> Perhaps at least when seeing an unconditional POINTER_PLUS_EXPR with
> constant increment and no other bumps for the same pointer in the loop, we
> could derive upper niter bound from that, even when the loop has multiple
> exits etc.

Yes.  GCC's niter analyzer only analyzes exit condition with induction variable involved while it's not the case here.  As a result, niter analysis is skipped for the loop.  In this code, the no-wrapping information is actually with the IV, as you suggested, we should be able to refine bound information when niter analysis of exit condition is skipped.   I will have a look.  Thanks.

Comment 4 bin cheng 2017-10-31 10:37:51 UTC

Well, one decision needs to be made is whether such bound information should be covered by -faggressive-loop-optimizations.  We already did this for undefined behavior of sign type and array bound.  OTOH, this doesn't look like too aggressive since we already rely on undefined behavior for pointer/signed types in SCEV.
Note I made change assuming non-wrap pointer all the time in r250765, but seems some kernel code depends on that, i.e, PR82694.  We may need to revert the change and only assume non-wrap pointer when !flag_wrapv.  Thanks.

Comment 5 Jakub Jelinek 2017-10-31 10:43:15 UTC

(In reply to amker from comment #4)
> Well, one decision needs to be made is whether such bound information should
> be covered by -faggressive-loop-optimizations.  We already did this for
> undefined behavior of sign type and array bound.  OTOH, this doesn't look
> like too aggressive since we already rely on undefined behavior for
> pointer/signed types in SCEV.

Given:
  if (flag_aggressive_loop_optimizations)
    infer_loop_bounds_from_undefined (loop);
I think it should be keyed on flag_aggressive_loop_optimizations.
E.g. we want to avoid something like that when sanitizing etc.

> Note I made change assuming non-wrap pointer all the time in r250765, but
> seems some kernel code depends on that, i.e, PR82694.  We may need to revert
> the change and only assume non-wrap pointer when !flag_wrapv.  Thanks.

IMHO the kernel should be fixed, in any case, that is something to discuss in that PR, not here.

Comment 6 bin cheng 2017-10-31 11:00:25 UTC

(In reply to Jakub Jelinek from comment #5)
> (In reply to amker from comment #4)
> > Well, one decision needs to be made is whether such bound information should
> > be covered by -faggressive-loop-optimizations.  We already did this for
> > undefined behavior of sign type and array bound.  OTOH, this doesn't look
> > like too aggressive since we already rely on undefined behavior for
> > pointer/signed types in SCEV.
> 
> Given:
>   if (flag_aggressive_loop_optimizations)
>     infer_loop_bounds_from_undefined (loop);
> I think it should be keyed on flag_aggressive_loop_optimizations.
Base IVs are a bit special and different to existing undefined behavior here.  IIUC, non-wrap has been assumed all the places in IV/SCEV analysis and that information has been used in niter analysis without flag_aggressive_loop_optimizations already.

> E.g. we want to avoid something like that when sanitizing etc.
> 
> > Note I made change assuming non-wrap pointer all the time in r250765, but
> > seems some kernel code depends on that, i.e, PR82694.  We may need to revert
> > the change and only assume non-wrap pointer when !flag_wrapv.  Thanks.
> 
> IMHO the kernel should be fixed, in any case, that is something to discuss
> in that PR, not here.

Comment 7 bin cheng 2017-10-31 12:03:24 UTC

Testing a patch.

Comment 8 Marc Glisse 2017-11-01 13:49:13 UTC

At some point, we could also think of taking advantage of what the C++ standard (for instance) says:

"[intro.progress]
The implementation may assume that any thread will eventually do one of the following:
(1.1) — terminate,
(1.2) — make a call to a library I/O function,
(1.3) — perform an access through a volatile glvalue, or
(1.4) — perform a synchronization operation or an atomic operation.
[ Note: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. — end note ]"

The only potential "progress" in this loop is the call to __builtin_ia32_pmovmskb128, but replacing it with a call to a function with attribute((const)) does not help. And if there is no progress in the loop, the loop must be finite.

(we could have some new flag if people insist on for(;;); not being optimized away. I would even use a flag -fno-infinite-loop that says that no loop can be infinite, or -fmain-returns that says that no loop is infinite and the program cannot trap or terminate, etc, but that's getting a bit far from this PR)

Comment 9 AK 2017-11-01 21:20:16 UTC

Are we also taking advantage of this statement in the standard:

> An iteration statement that performs no input/output operations, does not access volatile objects, and performs no
synchronization or atomic operations in its body, controlling
expression, or (in the case of a for statement) its expression may be assumed by the implementation to terminate.