[Bug rtl-optimization/93165] New: avoidable 2x penalty on unpredicted overwrite

Mon Jan 6 06:43:00 GMT 2020

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

            Bug ID: 93165
           Summary: avoidable 2x penalty on unpredicted overwrite
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ncm at cantrip dot org
  Target Milestone: ---

=== Abstract:

In important cases where using a "conditional move" instruction would provide a
2x performance improvement, Gcc fails to generate the instruction. In testing,
a random field is sorted by the simplest possible Quicksort algorithm, varying
only the method for conditionally swapping values, with widely varying
runtimes, many substantially faster than `std::sort`. The 2x result is
obtainable only with Clang, which for amd64 generates cmov instructions.

A simple testing apparatus may be found at https://github.com/ncm/sortfast/ 
that depends only on shell tools.

=== Details:

The basic sort is:
```
int* partition(int* begin, int* end) {
  int pivot = end[-1];
  int* left = begin;
  for (int* right = begin; right < end - 1; ++right) {
    if (*right <= pivot) {
      std::swap(*left, *right);
      ++left;
    }
  }
  int tmp = *left; *left = end[-1], end[-1] = tmp;
  return left;
}

void quicksort(int* begin, int* end) {
  while (end - begin > 1) {
    int* mid = partition(begin, end);
    quicksort(begin, mid);
    begin = mid + 1;
} }
```
which runs about as fast as `std::sort`, on truly random input.

Replacing the body of the loop in `partition()` above with
```
  left += swap_if(*right <= pivot, *left, *right);
```
where `swap_if` is defined
```
inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, mask = -c;
  a = (b & mask) | (ta & ~mask);
  b = (ta & mask) | (b & ~mask);
  return c;
}
```
the sort is substantially faster than `std::sort` compiled with Gcc. Compiled
with Clang 8 or later, it runs fully 2x as fast as `std::sort`. Clang
recognizes the pattern and substitutes `cmov` instructions. (Clang also unrolls
the loop, which helps a little.) Another formulation,
```
inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, tb = b;
  a = c ? tb : ta;
  b = c ? ta : tb;
  return c;
}
```
also results in the same object code, with Clang, but is 2x slower compiled
with Gcc, which generates a branch. A third formulation,
```
inline bool swap_if(bool c, int& a, int& b) {
  int v[2] = { a, b };
  b = v[1-c], a = v[c];
  return c;
}
```
is much faster than `std::sort` compiled by both Gcc and Clang, but detours
values through L1 cache, at some cost, so is slower than the `cmov` version. A
fourth version,
```
inline bool swap_if(bool c, int& a, int& b) {
  int v[2] = { a, b };
  a = v[c], b = v[!c];
  return c;
}
```
is about the same speed as the third when built with Clang, but with Gcc is
quite a lot slower than `std::sort`. Order matters, somehow, as does the choice
of operator. (I don't know how to express a bug report for this last case.
Advice welcome.)

In
```
inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, tb = b;
  a = c ? tb : ta;
//  b = c ? ta : tb;
  return c;
}
```
(at least when it is not inlined) Gcc seems happy to generate the `cmov`
instruction. Apparently the optimization code is very jealous about what else
is allowed in the basic block where a `cmov` is considered.

Finally, even with 
```
inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, tb = b;
  a = __builtin_expect_with_probability(c, 0, 0.5) ? tb : ta;
  b = __builtin_expect_with_probability(c, 0, 0.5) ? ta : tb;
  return c;
}
```
Gcc still will not generate the `cmov` instructions.

=== Discussion:

Replacing a branch with `cmov` may result in slower code, particularly on older
CPU targets. However, when the programmer provides direct information that the
branch is unpredictable, it seems like the compiler should be willing to act on
that expectation.

In that light, Clang's conversion of "`(mask & a)|(~mask & b)`" to `cmov` seems
to be universally correct. It is a portable formula that gives better results
than the branching version even without specific optimization, but may easily
be rewritten using `cmov` for even better results.

In addition, when `__builtin_expect_with_probability` is used to indicate
unpredictability, there seems to be no defensible reason not to rewrite the
expression to use `cmov`.

Finally, the indexed form of `swap_if` may be recognized and turned into `cmov`
instructions without worry that a predictable branch has been replaced,
avoiding the unnecessary detour through memory.