104376 – Failure to optimize clz equivalent to clz

Bug 104376 - Failure to optimize clz equivalent to clz

Summary: Failure to optimize clz equivalent to clz

Status:	ASSIGNED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	12.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Andrew Pinski

URL:
Keywords:	missed-optimization

Depends on:	104378 101822
Blocks:
	Show dependency tree / graph

Reported:	2022-02-04 01:10 UTC by Gabriel Ravier
Modified:	2023-10-24 11:17 UTC (History)
CC List:	1 user (show)

See Also:	101822 99997 71016
Host:
Target:	aarch64-- x86_64-- (with -mlzcnt)
Build:
Known to work:
Known to fail:
Last reconfirmed:	2023-02-17 00:00:00

Attachments
Patch which I am testing to fix the second issue (1.31 KB, patch) 2023-10-15 19:21 UTC, Andrew Pinski	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Gabriel Ravier 2022-02-04 01:10:30 UTC

#include <stdint.h>

uint32_t countLeadingZeros32(uint32_t x)
{
    if (x == 0)
        return 32;
    return (31 - __builtin_clz(x)) ^ 31;
}

On x86, with `-mlzcnt`, GCC outputs this:

countLeadingZeros32(unsigned int):
  mov eax, 32
  test edi, edi
  je .L1
  mov eax, 31
  lzcnt edi, edi
  sub eax, edi
  xor eax, 31
.L1:
  ret

LLVM instead outputs this:

countLeadingZeros32(unsigned int):
  lzcnt eax, edi
  ret

Comment 1 Andrew Pinski 2022-02-04 04:26:07 UTC

There are two issues, both are tree level issues, though the second one works on the RTL level just fine.

Right now we have:
  _1 = __builtin_clz (x_5(D));
  _2 = 31 - _1;
  _3 = _2 ^ 31;

But the _3 can be optimized to just _1.

Comment 2 Andrew Pinski 2022-02-04 04:27:41 UTC

The second issue can be seen with:
#include <stdint.h>

uint32_t countLeadingZeros32(uint32_t x)
{
    if (x == 0)
        return 32;
    return (__builtin_clz(x)) ;
}

This gets optimized for aarch64 at the rtl level but not for x86_64 with -mlzcnt.

Comment 3 Andrew Pinski 2022-02-04 07:01:32 UTC

Filed PR 104378 for the (31 - x) ^ 31 issue.

Comment 4 Andrew Pinski 2022-02-09 06:10:27 UTC

(In reply to Andrew Pinski from comment #2)
> The second issue can be seen with:
> #include <stdint.h>
> 
> uint32_t countLeadingZeros32(uint32_t x)
> {
>     if (x == 0)
>         return 32;
>     return (__builtin_clz(x)) ;
> }

cond_removal_in_builtin_zero_pattern should have optimized the above but does not for some reason.
Let me take a look.

Comment 5 Andrew Pinski 2022-02-09 07:40:18 UTC

(In reply to Andrew Pinski from comment #4)
> cond_removal_in_builtin_zero_pattern should have optimized the above but
> does not for some reason.
> Let me take a look.

So one problem is we have:
  <bb 2> [local count: 1073741824]:
  if (x_3(D) == 0)
    goto <bb 4>; [21.72%]
  else
    goto <bb 3>; [78.28%]

  <bb 3> [local count: 840525097]:
  _1 = __builtin_clz (x_3(D));
  _4 = (uint32_t) _1;

  <bb 4> [local count: 1073741824]:
  # _2 = PHI <32(2), _4(3)>


Which we don't handle in cond_removal_in_builtin_zero_pattern, this similar to PR 99997 and PR 101822, that is the code which added to fix PR 71016 is getting in the way.

Comment 6 Andrew Pinski 2023-05-06 21:23:41 UTC

The cast issue is basically PR 101822.

Comment 7 Andrew Pinski 2023-10-15 18:58:25 UTC

I have a fix for the secondary issue which does not cause PR 71016 to show up again.
Basically we should allow nop conversions always in factor_out_conditional_operation .

Comment 8 Andrew Pinski 2023-10-15 19:21:18 UTC

Created attachment 56117 [details]
Patch which I am testing to fix the second issue

Comment 9 GCC Commits 2023-10-24 11:17:45 UTC

The trunk branch has been updated by Andrew Pinski <pinskia@gcc.gnu.org>:

https://gcc.gnu.org/g:0fc13e8c0e39c51e82deb93f324d9d86ad8d7460

commit r14-4889-g0fc13e8c0e39c51e82deb93f324d9d86ad8d7460
Author: Andrew Pinski <pinskia@gmail.com>
Date:   Sun Oct 15 19:15:38 2023 +0000

    Improve factor_out_conditional_operation for conversions and constants
    
    In the case of a NOP conversion (precisions of the 2 types are equal),
    factoring out the conversion can be done even if int_fits_type_p returns
    false and even when the conversion is defined by a statement inside the
    conditional. Since it is a NOP conversion there is no zero/sign extending
    happening which is why it is ok to be done here; we were trying to prevent
    an extra sign/zero extend from being moved away from definition which no-op
    conversions are not.
    
    Bootstrapped and tested on x86_64-linux-gnu with no regressions.
    
    gcc/ChangeLog:
    
            PR tree-optimization/104376
            PR tree-optimization/101541
            * tree-ssa-phiopt.cc (factor_out_conditional_operation):
            Allow nop conversions even if it is defined by a statement
            inside the conditional.
    
    gcc/testsuite/ChangeLog:
    
            PR tree-optimization/101541
            * gcc.dg/tree-ssa/phi-opt-39.c: New test.