Bug 85375 - possible missed optimisation / regression from 6.3 with while (__builtin_ffs(x) && x)
Summary: possible missed optimisation / regression from 6.3 with while (__builtin_ffs(...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 8.0.1
: P3 normal
Target Milestone: 11.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: VRP
  Show dependency treegraph
 
Reported: 2018-04-12 11:16 UTC by Vegard Nossum
Modified: 2021-09-05 00:03 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work: 11.1.0
Known to fail: 10.3.0
Last reconfirmed: 2018-04-12 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Vegard Nossum 2018-04-12 11:16:14 UTC
Input:

extern int a;

int f(int x)
{
    while (__builtin_ffs(x) && x)
        x -= a;

    return x;
}

gcc 6.3.0 with -O3 compiled this as:

f(int):
  movl %edi, %eax
  movl a(%rip), %esi
  movl $-1, %ecx
  jmp .L3
.L11:
  testl %eax, %eax
  je .L2
  subl %esi, %eax
.L3:
  bsfl %eax, %edx
  cmove %ecx, %edx
  cmpl $-1, %edx
  jne .L11
.L2:
  rep ret

whereas current trunk (also with -O3) compiles it as:

f(int):
  movl $-1, %ecx
  bsfl %edi, %eax
  cmove %ecx, %eax
  cmpl %ecx, %eax
  je .L5
  testl %edi, %edi
  je .L6
  movl a(%rip), %esi
  movl %edi, %eax
  jmp .L3
.L4:
  testl %eax, %eax
  je .L1
.L3:
  subl %esi, %eax
  bsfl %eax, %edx
  cmove %ecx, %edx
  cmpl $-1, %edx
  jne .L4
  ret
.L6:
  xorl %eax, %eax
.L1:
  ret
.L5:
  movl %edi, %eax
  ret

There are fewer instructions overall for the case where x is 0 on entry, but trunk still has longer code overall even if we change the while-condition to __builtin_expect(!!__builtin_ffs(x) && x, 1) which ideally should pessimise this case.

For the same input, clang trunk with -O3 gives:

f(int): # @f(int)
  test edi, edi
  je .LBB0_3
  mov eax, dword ptr [rip + a]
  neg edi
.LBB0_2: # =>This Inner Loop Header: Depth=1
  add edi, eax
  jne .LBB0_2
.LBB0_3:
  xor eax, eax
  ret

This seems to rely simply on the fact that (__builtin_ffs(x) == 0) and (x == 0) are equivalent.

If you simplify the while-condition to simply __builtin_ffs(x), then the difference is smaller but still there:

6.3.0:

f(int):
  movl %edi, %eax
  movl a(%rip), %esi
  movl $-1, %ecx
  jmp .L3
.L5:
  subl %esi, %eax
.L3:
  bsfl %eax, %edx
  cmove %ecx, %edx
  cmpl $-1, %edx
  jne .L5
  rep ret

trunk:

f(int):
  movl $-1, %ecx
  bsfl %edi, %eax
  cmove %ecx, %eax
  cmpl %ecx, %eax
  je .L4
  movl a(%rip), %esi
  movl %edi, %eax
.L3:
  subl %esi, %eax
  bsfl %eax, %edx
  cmove %ecx, %edx
  cmpl $-1, %edx
  jne .L3
  ret
.L4:
  movl %edi, %eax
  ret
Comment 1 Richard Biener 2018-04-12 12:21:20 UTC
.optimized shows

  <bb 2> [local count: 114863532]:
  _8 = __builtin_ffs (x_4(D));
  if (_8 != 0)
    goto <bb 3>; [94.50%]
  else
    goto <bb 7>; [5.50%]

  <bb 3> [local count: 108546038]:
  if (x_4(D) != 0)
    goto <bb 4>; [94.50%]
  else
    goto <bb 7>; [5.50%]

missed jump-threading.

  <bb 5> [local count: 958878293]:
  # x_10 = PHI <x_4(D)(4), x_6(6)>
  x_6 = x_10 - a.0_1;
  _2 = __builtin_ffs (x_6);
  if (_2 != 0)
    goto <bb 6>; [94.50%]
  else
    goto <bb 7>; [5.50%]

  <bb 6> [local count: 906139986]:
  if (x_6 != 0)
    goto <bb 5>; [94.50%]
  else
    goto <bb 7>; [5.50%]

likewise.  VRP needs to derive a range for x_6 from _2 != 0.
Comment 2 Andrew Macleod 2020-11-17 18:44:38 UTC
This appears to be resolved?
Comment 3 Andrew Pinski 2021-09-05 00:03:16 UTC
After r11-1080 (PR 95527), __builtin_ffs(x) && x becomes just x != 0 and optimized.

So yes fixed for GCC 11.