Bug 104376 - Failure to optimize clz equivalent to clz
Summary: Failure to optimize clz equivalent to clz
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 12.0
: P3 enhancement
Target Milestone: ---
Assignee: Andrew Pinski
URL:
Keywords: missed-optimization
Depends on: 104378 101822
Blocks:
  Show dependency treegraph
 
Reported: 2022-02-04 01:10 UTC by Gabriel Ravier
Modified: 2023-10-24 11:17 UTC (History)
1 user (show)

See Also:
Host:
Target: aarch64-*-* x86_64-*-* (with -mlzcnt)
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-02-17 00:00:00


Attachments
Patch which I am testing to fix the second issue (1.31 KB, patch)
2023-10-15 19:21 UTC, Andrew Pinski
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gabriel Ravier 2022-02-04 01:10:30 UTC
#include <stdint.h>

uint32_t countLeadingZeros32(uint32_t x)
{
    if (x == 0)
        return 32;
    return (31 - __builtin_clz(x)) ^ 31;
}

On x86, with `-mlzcnt`, GCC outputs this:

countLeadingZeros32(unsigned int):
  mov eax, 32
  test edi, edi
  je .L1
  mov eax, 31
  lzcnt edi, edi
  sub eax, edi
  xor eax, 31
.L1:
  ret

LLVM instead outputs this:

countLeadingZeros32(unsigned int):
  lzcnt eax, edi
  ret
Comment 1 Andrew Pinski 2022-02-04 04:26:07 UTC
There are two issues, both are tree level issues, though the second one works on the RTL level just fine.

Right now we have:
  _1 = __builtin_clz (x_5(D));
  _2 = 31 - _1;
  _3 = _2 ^ 31;

But the _3 can be optimized to just _1.
Comment 2 Andrew Pinski 2022-02-04 04:27:41 UTC
The second issue can be seen with:
#include <stdint.h>

uint32_t countLeadingZeros32(uint32_t x)
{
    if (x == 0)
        return 32;
    return (__builtin_clz(x)) ;
}

This gets optimized for aarch64 at the rtl level but not for x86_64 with -mlzcnt.
Comment 3 Andrew Pinski 2022-02-04 07:01:32 UTC
Filed PR 104378 for the (31 - x) ^ 31 issue.
Comment 4 Andrew Pinski 2022-02-09 06:10:27 UTC
(In reply to Andrew Pinski from comment #2)
> The second issue can be seen with:
> #include <stdint.h>
> 
> uint32_t countLeadingZeros32(uint32_t x)
> {
>     if (x == 0)
>         return 32;
>     return (__builtin_clz(x)) ;
> }

cond_removal_in_builtin_zero_pattern should have optimized the above but does not for some reason.
Let me take a look.
Comment 5 Andrew Pinski 2022-02-09 07:40:18 UTC
(In reply to Andrew Pinski from comment #4)
> cond_removal_in_builtin_zero_pattern should have optimized the above but
> does not for some reason.
> Let me take a look.

So one problem is we have:
  <bb 2> [local count: 1073741824]:
  if (x_3(D) == 0)
    goto <bb 4>; [21.72%]
  else
    goto <bb 3>; [78.28%]

  <bb 3> [local count: 840525097]:
  _1 = __builtin_clz (x_3(D));
  _4 = (uint32_t) _1;

  <bb 4> [local count: 1073741824]:
  # _2 = PHI <32(2), _4(3)>


Which we don't handle in cond_removal_in_builtin_zero_pattern, this similar to PR 99997 and PR 101822, that is the code which added to fix PR 71016 is getting in the way.
Comment 6 Andrew Pinski 2023-05-06 21:23:41 UTC
The cast issue is basically PR 101822.
Comment 7 Andrew Pinski 2023-10-15 18:58:25 UTC
I have a fix for the secondary issue which does not cause PR 71016 to show up again.
Basically we should allow nop conversions always in factor_out_conditional_operation .
Comment 8 Andrew Pinski 2023-10-15 19:21:18 UTC
Created attachment 56117 [details]
Patch which I am testing to fix the second issue
Comment 9 GCC Commits 2023-10-24 11:17:45 UTC
The trunk branch has been updated by Andrew Pinski <pinskia@gcc.gnu.org>:

https://gcc.gnu.org/g:0fc13e8c0e39c51e82deb93f324d9d86ad8d7460

commit r14-4889-g0fc13e8c0e39c51e82deb93f324d9d86ad8d7460
Author: Andrew Pinski <pinskia@gmail.com>
Date:   Sun Oct 15 19:15:38 2023 +0000

    Improve factor_out_conditional_operation for conversions and constants
    
    In the case of a NOP conversion (precisions of the 2 types are equal),
    factoring out the conversion can be done even if int_fits_type_p returns
    false and even when the conversion is defined by a statement inside the
    conditional. Since it is a NOP conversion there is no zero/sign extending
    happening which is why it is ok to be done here; we were trying to prevent
    an extra sign/zero extend from being moved away from definition which no-op
    conversions are not.
    
    Bootstrapped and tested on x86_64-linux-gnu with no regressions.
    
    gcc/ChangeLog:
    
            PR tree-optimization/104376
            PR tree-optimization/101541
            * tree-ssa-phiopt.cc (factor_out_conditional_operation):
            Allow nop conversions even if it is defined by a statement
            inside the conditional.
    
    gcc/testsuite/ChangeLog:
    
            PR tree-optimization/101541
            * gcc.dg/tree-ssa/phi-opt-39.c: New test.