=== Abstract:

In important cases where using a "conditional move" instruction would provide a 2x performance improvement, Gcc fails to generate the instruction. In testing, an array of random values is sorted by the simplest possible Quicksort algorithm, varying only the method used to conditionally swap values; runtimes vary widely, with many variants substantially faster than `std::sort`. The 2x result is obtainable only with Clang, which for amd64 generates `cmov` instructions. A simple testing apparatus that depends only on shell tools may be found at https://github.com/ncm/sortfast/ .

=== Details:

The basic sort is:

```
#include <algorithm>  // std::swap

int* partition(int* begin, int* end)
{
    int pivot = end[-1];
    int* left = begin;
    for (int* right = begin; right < end - 1; ++right) {
        if (*right <= pivot) {
            std::swap(*left, *right);
            ++left;
        }
    }
    std::swap(*left, end[-1]);
    return left;
}

void quicksort(int* begin, int* end)
{
    while (end - begin > 1) {
        int* mid = partition(begin, end);
        quicksort(begin, mid);  // recurse on the left half, iterate on the right
        begin = mid + 1;
    }
}
```

which runs about as fast as `std::sort` on truly random input. With the body of the loop in `partition()` above replaced by

```
left += swap_if(*right <= pivot, *left, *right);
```

where `swap_if` is defined

```
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, mask = -c;  // mask is all-ones when c is true, zero otherwise
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
}
```

the sort is substantially faster than `std::sort` when compiled with Gcc. Compiled with Clang 8 or later, it runs fully 2x as fast as `std::sort`: Clang recognizes the pattern and substitutes `cmov` instructions. (Clang also unrolls the loop, which helps a little.)

Another formulation,

```
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, tb = b;
    a = c ? tb : ta;
    b = c ? ta : tb;
    return c;
}
```

results in the same object code with Clang, but is 2x slower when compiled with Gcc, which generates a branch.

A third formulation,

```
inline bool swap_if(bool c, int& a, int& b)
{
    int v[2] = { a, b };
    b = v[1-c], a = v[c];
    return c;
}
```

is much faster than `std::sort` when compiled with either Gcc or Clang, but it detours the values through L1 cache, at some cost, so it is slower than the `cmov` version.

A fourth version,

```
inline bool swap_if(bool c, int& a, int& b)
{
    int v[2] = { a, b };
    a = v[c], b = v[!c];
    return c;
}
```

is about the same speed as the third when built with Clang, but with Gcc is quite a lot slower than `std::sort`. Order matters, somehow, as does the choice of operator. (I don't know how to express a bug report for this last case. Advice welcome.)

In

```
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, tb = b;
    a = c ? tb : ta;
    // b = c ? ta : tb;
    return c;
}
```

(at least when it is not inlined) Gcc seems happy to generate the `cmov` instruction. Apparently the optimizer is very jealous about what else is allowed in a basic block where a `cmov` is considered.

Finally, even with

```
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, tb = b;
    a = __builtin_expect_with_probability(c, 0, 0.5) ? tb : ta;
    b = __builtin_expect_with_probability(c, 0, 0.5) ? ta : tb;
    return c;
}
```

Gcc still will not generate the `cmov` instructions.

=== Discussion:

Replacing a branch with `cmov` may result in slower code, particularly on older CPU targets. However, when the programmer provides direct information that the branch is unpredictable, it seems like the compiler should be willing to act on that expectation. In that light, Clang's conversion of "`(mask & a) | (~mask & b)`" to `cmov` seems to be universally correct.
It is a portable formula that gives better results than the branching version even without specific optimization, and it may easily be rewritten to use `cmov` for still better results. In addition, when `__builtin_expect_with_probability` is used to indicate unpredictability, there seems to be no defensible reason not to rewrite the expression to use `cmov`. Finally, the indexed form of `swap_if` could be recognized and turned into `cmov` instructions without worry that a predictable branch has been replaced, avoiding the unnecessary detour through memory.
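For reference, a minimal, self-contained harness along the lines of the one in the linked repository is sketched below. This is an illustration, not the attached code; the array size, the seed, and the helper names are arbitrary choices of this sketch.

```
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// swap_if and quicksort as in the report (AND/OR mask formulation).
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, mask = -c;  // mask is all-ones when c is true
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
}

int* partition(int* begin, int* end)
{
    int pivot = end[-1];
    int* left = begin;
    for (int* right = begin; right < end - 1; ++right)
        left += swap_if(*right <= pivot, *left, *right);
    std::swap(*left, end[-1]);
    return left;
}

void quicksort(int* begin, int* end)
{
    while (end - begin > 1) {
        int* mid = partition(begin, end);
        quicksort(begin, mid);
        begin = mid + 1;
    }
}

// Time one sort over a private copy of the input, and sanity-check the result.
template <class Sort>
static long long time_ms(std::vector<int> v, Sort sort)
{
    auto t0 = std::chrono::steady_clock::now();
    sort(v.data(), v.data() + v.size());
    auto t1 = std::chrono::steady_clock::now();
    if (!std::is_sorted(v.begin(), v.end()))
        std::puts("NOT SORTED");
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

int main()
{
    std::vector<int> v(10000000);
    std::mt19937 gen(42);
    for (int& x : v)
        x = (int)gen();  // truly random input, as in the report
    std::printf("quicksort: %lld ms\n", time_ms(v, quicksort));
    std::printf("std::sort: %lld ms\n",
                time_ms(v, [](int* b, int* e) { std::sort(b, e); }));
}
```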
Created attachment 47592: code demonstrating the failure
Created attachment 47593: a makefile

This duplicates code found on the linked archive. E.g.:

```
make all
make CC=g++-9 INDEXED
make CC=clang++-9 ANDANDOR
make CC=clang++-9 INDEXED_PESSIMAL_ON_GCC
make CC=g++-9 CHECK=CHECK BOG
```
The compiler has no way of knowing ahead of time that you will be evaluating the result on random data; for mostly-sorted arrays branching is arguably preferable.

__builtin_expect_with_probability is a poor proxy for unpredictability: a condition that is true every other time leads to a branch that is both very predictable and has probability 0.5.

I think what you really need is a way to express branchless selection in the source code when you know you need it but the compiler cannot see that on its own. Other algorithms, like constant-time checks for security-sensitive applications, probably also need such a computational primitive.

So perhaps an unpopular opinion, but I'd say a __builtin_branchless_select(c, a, b) (guaranteed to live throughout the optimization pipeline as a non-branchy COND_EXPR) is badly missing.
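No such builtin exists in GCC today; as a sketch of what the proposed primitive would guarantee, one might approximate it with inline asm on x86-64 (forcing a `cmov`), falling back to the portable mask formulation elsewhere. The name `branchless_select` is hypothetical.

```
// Approximation of the proposed __builtin_branchless_select (hypothetical).
inline int branchless_select(bool c, int a, int b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    // Test the condition, then conditionally overwrite a with b when c is
    // false. The asm guarantees no branch is ever emitted for the select.
    __asm__("testb %b1, %b1\n\t"
            "cmovel %2, %0"
            : "+r"(a)
            : "q"(c), "r"(b)
            : "cc");
    return a;                         // a if c, else b
#else
    int mask = -(int)c;               // all-ones when c is true
    return (a & mask) | (b & ~mask);  // portable branchless fallback
#endif
}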
(In reply to Alexander Monakov from comment #3)
> So perhaps an unpopular opinion, but I'd say a
> __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> the optimization pipeline as a non-branchy COND_EXPR) is badly missing.

I am going to say otherwise. Much of the time a conditional move is faster than using a branch, even if the branch is predictable (there are a few exceptions), on most non-Intel/AMD targets. This is because the conditional move is just one cycle, and a "predictable" branch is only one cycle, too.

It is even worse when doing things like:

if (a && b)

where on aarch64 this can be done using only one cmp followed by one ccmp. NOTE: on PowerPC you could in theory use crand/cror (though this is not done currently, and I don't know whether they are profitable in any recent design).

Plus, aarch64 has conditional add and a few other things which extend the idea of a conditional move.
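To illustrate the aarch64 point above, a function like the following (illustrative, not from the thread) can be compiled to a cmp, a ccmp, and a cset, with no branch at all; whether a particular compiler version actually does so is not guaranteed.

```
// Candidate for cmp + ccmp + cset on aarch64: both comparisons are evaluated
// branchlessly, with the second compare made conditional on the first.
int both_positive(int a, int b)
{
    return a > 0 && b > 0;
}
```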
(In reply to Alexander Monakov from comment #3)
> The compiler has no way of knowing ahead of time that you will be
> evaluating the result on random data; for mostly-sorted arrays branching
> is arguably preferable.
>
> __builtin_expect_with_probability is a poor proxy for unpredictability: a
> condition that is true every other time leads to a branch that is both
> very predictable and has probability 0.5.

If putting it in made my code slower, I would take it back out. The only value it has is if it changes something. If it doesn't improve matters, I need to try something else. For it to do nothing helps nobody.

> I think what you really need is a way to express branchless selection in
> the source code when you know you need it but the compiler cannot see that
> on its own. Other algorithms, like constant-time checks for
> security-sensitive applications, probably also need such a computational
> primitive.
>
> So perhaps an unpopular opinion, but I'd say a
> __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> the optimization pipeline as a non-branchy COND_EXPR) is badly missing.

We can quibble over whether the name of the intrinsic means anything with a value of 0.5, but no other meaning would be useful. In general, though, I would rather write portable code to get the semantics I need. I don't have a preference between the AND/OR notation and the indexing version, except that the former seems like a more generally useful optimization. Best would be both.
(In reply to Andrew Pinski from comment #4)
> (In reply to Alexander Monakov from comment #3)
> > So perhaps an unpopular opinion, but I'd say a
> > __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> > the optimization pipeline as a non-branchy COND_EXPR) is badly missing.
>
> I am going to say otherwise. Much of the time a conditional move is faster
> than using a branch, even if the branch is predictable (there are a few
> exceptions), on most non-Intel/AMD targets. This is because the conditional
> move is just one cycle, and a "predictable" branch is only one cycle, too.

The issue with a conditional move is that it adds a data dependence, while branches are usually speculated and thus have zero overhead in the execution stage. The extra dependence can easily slow things down, depending on the (three!) instructions feeding the conditional move (condition, first and second source). This is why well-predicted branches are often so much faster.

> It is even worse when doing things like:
> if (a && b)
> where on aarch64 this can be done using only one cmp followed by one ccmp.
> NOTE: on PowerPC you could in theory use crand/cror (though this is not
> done currently, and I don't know whether they are profitable in any recent
> design).
>
> Plus, aarch64 has conditional add and a few other things which extend the
> idea of a conditional move.

I can see that conditional moves are almost always a win on less pipelined, less speculative implementations.
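A sketch of the dependence argument (illustrative, not from the thread): in a loop like the one below, a cmov puts both the compare result and the addend on the loop-carried chain through sum (roughly add + cmov per iteration), while a correctly predicted branch leaves only the add on that chain, so the branchy code can retire iterations faster when the condition almost always goes the same way.

```
// The compiler may emit either a branch or a cmov for the condition; which
// is faster depends on how predictable p[i] < limit is at run time.
int sum_below(const int* p, int n, int limit)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
        if (p[i] < limit)
            sum += p[i];
    return sum;
}
```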
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #4)
> > (In reply to Alexander Monakov from comment #3)
> > > So perhaps an unpopular opinion, but I'd say a
> > > __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> > > the optimization pipeline as a non-branchy COND_EXPR) is badly missing.
> >
> > I am going to say otherwise. Much of the time a conditional move is
> > faster than using a branch, even if the branch is predictable (there are
> > a few exceptions), on most non-Intel/AMD targets. This is because the
> > conditional move is just one cycle, and a "predictable" branch is only
> > one cycle, too.
>
> The issue with a conditional move is that it adds a data dependence, while
> branches are usually speculated and thus have zero overhead in the
> execution stage. The extra dependence can easily slow things down,
> depending on the (three!) instructions feeding the conditional move
> (condition, first and second source). This is why well-predicted branches
> are often so much faster.
>
> I can see that conditional moves are almost always a win on less
> pipelined, less speculative implementations.

Nobody wants a change that makes code slower on our pipelined, speculative targets, but this is a concrete case where code is already made slower. If the code before optimization has no branch, as in the case of "a = (m & b) | (~m & c)", we can be certain that replacing it with a cmov does not introduce any new data dependence.

Anyway, for the case of ?:, where a cmov would replace a branch, Gcc is already happy to substitute a cmov instruction. Gcc just refuses to put in a second cmov after it, for no apparent reason.
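Distilling that last observation into a minimal pair (a sketch based on the formulations in the report; the function names are mine, and the claimed codegen is as reported above for Gcc at the time):

```
int select_one(bool c, int a, int b)
{
    return c ? b : a;    // reported: Gcc emits a single cmov here
}

void swap_sel(bool c, int& a, int& b)
{
    int ta = a, tb = b;
    a = c ? tb : ta;     // reported: with the second select present,
    b = c ? ta : tb;     // Gcc emits a compare-and-branch instead of two cmovs
}
```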