Input: extern int a; int f(int x) { while (__builtin_ffs(x) && x) x -= a; return x; } gcc 6.3.0 with -O3 compiled this as: f(int): movl %edi, %eax movl a(%rip), %esi movl $-1, %ecx jmp .L3 .L11: testl %eax, %eax je .L2 subl %esi, %eax .L3: bsfl %eax, %edx cmove %ecx, %edx cmpl $-1, %edx jne .L11 .L2: rep ret whereas current trunk (also with -O3) compiles it as: f(int): movl $-1, %ecx bsfl %edi, %eax cmove %ecx, %eax cmpl %ecx, %eax je .L5 testl %edi, %edi je .L6 movl a(%rip), %esi movl %edi, %eax jmp .L3 .L4: testl %eax, %eax je .L1 .L3: subl %esi, %eax bsfl %eax, %edx cmove %ecx, %edx cmpl $-1, %edx jne .L4 ret .L6: xorl %eax, %eax .L1: ret .L5: movl %edi, %eax ret There are fewer instructions overall for the case where x is 0 on entry, but trunk still has longer code overall even if we change the while-condition to __builtin_expect(!!__builtin_ffs(x) && x, 1) which ideally should pessimise this case. For the same input, clang trunk with -O3 gives: f(int): # @f(int) test edi, edi je .LBB0_3 mov eax, dword ptr [rip + a] neg edi .LBB0_2: # =>This Inner Loop Header: Depth=1 add edi, eax jne .LBB0_2 .LBB0_3: xor eax, eax ret This seems to rely simply on the fact that (__builtin_ffs(x) == 0) and (x == 0) are equivalent. If you simplify the while-condition to simply __builtin_ffs(x), then the difference is smaller but still there: 6.3.0: f(int): movl %edi, %eax movl a(%rip), %esi movl $-1, %ecx jmp .L3 .L5: subl %esi, %eax .L3: bsfl %eax, %edx cmove %ecx, %edx cmpl $-1, %edx jne .L5 rep ret trunk: f(int): movl $-1, %ecx bsfl %edi, %eax cmove %ecx, %eax cmpl %ecx, %eax je .L4 movl a(%rip), %esi movl %edi, %eax .L3: subl %esi, %eax bsfl %eax, %edx cmove %ecx, %edx cmpl $-1, %edx jne .L3 ret .L4: movl %edi, %eax ret
.optimized shows <bb 2> [local count: 114863532]: _8 = __builtin_ffs (x_4(D)); if (_8 != 0) goto <bb 3>; [94.50%] else goto <bb 7>; [5.50%] <bb 3> [local count: 108546038]: if (x_4(D) != 0) goto <bb 4>; [94.50%] else goto <bb 7>; [5.50%] missed jump-threading. <bb 5> [local count: 958878293]: # x_10 = PHI <x_4(D)(4), x_6(6)> x_6 = x_10 - a.0_1; _2 = __builtin_ffs (x_6); if (_2 != 0) goto <bb 6>; [94.50%] else goto <bb 7>; [5.50%] <bb 6> [local count: 906139986]: if (x_6 != 0) goto <bb 5>; [94.50%] else goto <bb 7>; [5.50%] likewise. VRP needs to derive a range for x_6 from _2 != 0.
This appears to be resolved?
After r11-1080 (PR 95527), __builtin_ffs(x) && x becomes just x != 0 and optimized. So yes fixed for GCC 11.