Bug 81914 - [7 Regression] gcc 7.1 generates branch for code which was branchless in earlier gcc version
Summary: [7 Regression] gcc 7.1 generates branch for code which was branchless in earl...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 7.1.0
: P2 normal
Target Milestone: 8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-08-21 11:40 UTC by Daniel Fruzynski
Modified: 2019-11-14 10:20 UTC (History)
5 users (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work: 8.0
Known to fail: 7.5.0
Last reconfirmed: 2017-08-21 00:00:00


Attachments
gcc8-pr81914.patch (1.07 KB, patch)
2017-12-12 18:41 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Fruzynski 2017-08-21 11:40:31 UTC
Code:

#include <stdint.h>

int cmp(int64_t a, int64_t b)
{
    return a < b ? -1 : a > b;
}

Above function compiled with gcc 4.5 or above and -O2 is compiled in following way. It is branchless:

cmp(long, long):
  xor eax, eax
  cmp rdi, rsi
  mov edx, -1
  setg al
  cmovl eax, edx
  ret

gcc 7.1 generates different code. It has branch:

cmp(long, long):
  cmp rdi, rsi
  jl .L3
  setg al
  movzx eax, al
  ret
.L3:
  mov eax, -1
  ret
Comment 1 Marek Polacek 2017-08-21 11:50:15 UTC
Started with r237185.
Comment 2 Richard Biener 2017-08-21 12:27:23 UTC
But are you sure the branchless variant is faster?
Comment 3 Daniel Fruzynski 2017-08-21 15:32:07 UTC
Yes, branchless version is faster. Here are results for code compiled with gcc 4.8.5:

Benchmark            Time           CPU Iterations
--------------------------------------------------
BM_memcmp            6 ns          6 ns  111001949
BM_int64cmp          4 ns          4 ns  183761752

And here for gcc 7.1.0:

Benchmark            Time           CPU Iterations
--------------------------------------------------
BM_memcmp            6 ns          6 ns  113198940
BM_int64cmp          8 ns          8 ns   82036754

Note: Code tested accepted values passed via pointer instead of value, to better compare it with memcmp function. Functions were comparing random values, generated before tests and stored in arrays. srand was called with constant value to get repeatable results.
Comment 4 Jakub Jelinek 2017-12-12 15:15:40 UTC
Predictions for bb 2
  DS theory heuristics: 98.0%
  combined heuristics: 98.0%
  negative return heuristics of edge 2->4: 2.0%
Predictions for bb 3
1 edges in bb 3 predicted to even probabilities
Predictions for bb 4
1 edges in bb 4 predicted to even probabilities
cmp (long long int a, long long int b)
{
  _Bool _1;
  int iftmp.0_2;
  int iftmp.0_6;

  <bb 2> [local count: 1073741825]:
  if (a_3(D) >= b_4(D))
    goto <bb 3>; [98.00%]
  else
    goto <bb 4>; [2.00%]

Honza, I think it would be worth to recognize idioms like this (<=> operator) and don't apply Dempster-Shaffer in that case, but rather predict either equal probabilities for <, == and > (like before), or perhaps even somewhat smaller probability for the == case and slightly higher for < or >.
Comment 5 Jakub Jelinek 2017-12-12 15:18:10 UTC
Oops, it is most likely the PRED_NEGATIVE_RETURN stuff instead.
Comment 6 Jakub Jelinek 2017-12-12 15:39:36 UTC
Another testcase, which has even higher prediction of not returning -1:
int
cmp (int a, int b)
{
  if (a < b)
    return -1;
  if (a > b)
    return 1;
  return 0;
}

In the #c0 case, we have in the IL:
  <bb 2> :
  if (a_3(D) >= b_4(D))
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 3> :
  _1 = a_3(D) > b_4(D);
  iftmp.0_6 = (int) _1;

  <bb 4> :
  # iftmp.0_2 = PHI <iftmp.0_6(3), -1(2)>
  return iftmp.0_2;
while in this one:
  <bb 2>:
  if (a_2(D) < b_3(D))
    goto <bb 5>;
  else
    goto <bb 3>;

  <bb 3>:
  if (a_2(D) > b_3(D))
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:

  <bb 5>:
  # _1 = PHI <-1(2), 1(3), 0(4)>
  return _1;
In any case, I think it is worth recognizing these patterns.
We should also consider other examples, e.g. from gcc itself:
static int
oecount_cmp (const void *p1, const void *p2)
{
  const oecount *c1 = (const oecount *)p1;
  const oecount *c2 = (const oecount *)p2;
  if (c1->cnt != c2->cnt)
    return c1->cnt > c2->cnt ? 1 : -1;
  else
    /* If counts are identical, use unique IDs to stabilize qsort.  */
    return c1->id > c2->id ? 1 : -1;
}
static int
tm_alias_pair_cmp (const void *x, const void *y)
{
  const tm_alias_pair *p1 = (const tm_alias_pair *) x;
  const tm_alias_pair *p2 = (const tm_alias_pair *) y;
  if (p1->uid < p2->uid)
    return -1;
  if (p1->uid > p2->uid)
    return 1;
  return 0;
}
int
type_warning_cmp (const void *p1, const void *p2)
{
  const odr_type_warn_count *t1 = (const odr_type_warn_count *)p1;
  const odr_type_warn_count *t2 = (const odr_type_warn_count *)p2;

  if (t1->dyn_count < t2->dyn_count)
   return 1;
  if (t1->dyn_count > t2->dyn_count)
   return -1;
  return t2->count - t1->count;
}
static int
stack_var_cmp (const void *a, const void *b)
{
  size_t ia = *(const size_t *)a;
  size_t ib = *(const size_t *)b;
  unsigned int aligna = stack_vars[ia].alignb;
  unsigned int alignb = stack_vars[ib].alignb;
  HOST_WIDE_INT sizea = stack_vars[ia].size;
  HOST_WIDE_INT sizeb = stack_vars[ib].size;
  tree decla = stack_vars[ia].decl;
  tree declb = stack_vars[ib].decl;
  bool largea, largeb;
  unsigned int uida, uidb;

  /* Primary compare on "large" alignment.  Large comes first.  */
  largea = (aligna * BITS_PER_UNIT > MAX_SUPPORTED_STACK_ALIGNMENT);
  largeb = (alignb * BITS_PER_UNIT > MAX_SUPPORTED_STACK_ALIGNMENT);
  if (largea != largeb)
    return (int)largeb - (int)largea;

  /* Secondary compare on size, decreasing  */
  if (sizea > sizeb)
    return -1;
  if (sizea < sizeb)
    return 1;

  /* Tertiary compare on true alignment, decreasing.  */
  if (aligna < alignb)
    return -1;
  if (aligna > alignb)
    return 1;

  /* Final compare on ID for sort stability, increasing.
     Two SSA names are compared by their version, SSA names come before
     non-SSA names, and two normal decls are compared by their DECL_UID.  */
  if (TREE_CODE (decla) == SSA_NAME)
    {
      if (TREE_CODE (declb) == SSA_NAME)
        uida = SSA_NAME_VERSION (decla), uidb = SSA_NAME_VERSION (declb);
      else
        return -1;
    }
  else if (TREE_CODE (declb) == SSA_NAME)
    return 1;
  else
    uida = DECL_UID (decla), uidb = DECL_UID (declb);
  if (uida < uidb)
    return 1;
  if (uida > uidb)
    return -1;
  return 0;
}

etc.  Perhaps catching everything is too hard for heuristics, but at least figuring a pattern of these <=> comparisons, or consecutive comparisons like:
  if (uida < uidb)
    return 1;
  if (uida > uidb)
    return -1;
is highly desirable.
Comment 7 Jakub Jelinek 2017-12-12 17:53:53 UTC
Larger testcase with more cases.  Some of them, e.g. in f3, are predicted roughly reasonably, but the > 95% predictions are just wrong in these cases.

int
f1 (long long a, long long b)
{
  return a < b ? -1 : a > b;
}

int
f2 (int a, int b)
{
  if (a < b)
    return -1;
  if (a > b)
    return 1;
  return 0;
}

int
f3 (int *a, int *b)
{
  if (a[0] != b[0])
    return a[0] > b[0] ? 1 : -1;
  if (a[1] != b[1])
    return a[1] > b[1] ? 1 : -1;
  return 0;
}

int
f4 (int *a, int *b)
{
  if (a[0] > b[0])
    return -1;
  if (a[0] < b[0])
    return 1;

  if (a[1] > b[1])
    return -1;
  if (a[1] < b[1])
    return 1;

  if (a[2] > b[2])
    return -1;
  if (a[2] < b[2])
    return 1;
  return 0;
}
Comment 8 Jakub Jelinek 2017-12-12 18:41:53 UTC
Created attachment 42856 [details]
gcc8-pr81914.patch

Untested patch that handles all the cases in the #c7 testcase.  Though it will not handle some comparison functions in GCC (e.g. those which sometimes just return a - b or similar).
If it doesn't look like a completely dumb approach, we probably should test it on SPEC (unfortunately I don't have a reasonable setup for that right now).
Comment 9 Daniel Fruzynski 2017-12-12 22:20:09 UTC
In the meantime I found another case when gcc 7 inserts lots of jumps. I am not sure if your extra test cases covers it too:

#include <stdint.h>

int test(int data1[9][9], int data2[9][9])
{
  uint64_t b1 = 0, b2 = 0;
  for (int n = 0; n < 9; ++n)
  {
    for (int k = 0; k < 9; ++k)
    {
      int a = data1[n][k] * 9 + data2[n][k];
      (a < 64 ? b1 : b2) |= 1 << (a & 63);
    }
  }
  return __builtin_popcount(b1) + __builtin_popcount(b2);
}
Comment 10 Marc Glisse 2017-12-16 11:15:35 UTC
For the particular case of <=> (-1, 0 or 1), I've seen code like (a>b)-(a<b), which is branchless (IIRC we don't generate optimal code for this either, we could use sbb or adc).
Comment 11 Jakub Jelinek 2017-12-19 16:43:35 UTC
Author: jakub
Date: Tue Dec 19 16:43:04 2017
New Revision: 255829

URL: https://gcc.gnu.org/viewcvs?rev=255829&root=gcc&view=rev
Log:
	PR middle-end/81914
	* predict.c (zero_one_minusone): New function.
	(apply_return_prediction): Avoid return prediction for functions
	returning only -1, 0 and 1 values, unless they only return -1 and 0
	or 0 and 1.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/predict.c
Comment 12 Daniel Fruzynski 2017-12-21 15:53:52 UTC
One more test case. Code compiled with TEST defined is branchless, without it has branch.

[code]
#include <stdint.h>

#define TEST

void test(uint64_t* a)
{
  uint64_t n = *a / 8;
  if (0 == n)
    n = 1;
#ifdef TEST
  *a += n;
#else
  *a += 1 << n;
#endif
}
[/code]
Comment 13 Richard Biener 2018-01-25 08:23:42 UTC
GCC 7.3 is being released, adjusting target milestone.
Comment 14 Martin Liška 2018-05-11 14:03:34 UTC
Jakub I see you fixed the most interesting case, can we close that?
Comment 15 Jonny Grant 2019-08-29 18:45:38 UTC
Hello

I noticed this example below does not give a warning. I had expected something similar for -Wsign-conversion present behaviour. Sharing my notes as follows.

A) false+true  maybe not treated as signed?

B) the return conversion from size_t to int.

https://godbolt.org/z/KoVPqd


#include <cstddef>
int main()
{
    size_t j = false+true;

    return j;
}
Comment 16 Richard Biener 2019-11-14 10:20:03 UTC
Fixed in GCC 8.