This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][3/3][RTL ifcvt] PR middle-end/37780: Conditional expression with __builtin_clz() should be optimized out



On 09/06/16 13:14, Christophe Lyon wrote:
On 8 June 2016 at 18:40, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
On 07/06/16 20:34, Christophe Lyon wrote:
On 26 May 2016 at 11:53, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
wrote:
Hi all,

In this PR we want to optimise:
int foo (int i)
{
    return (i == 0) ? N : __builtin_clz (i);
}

on targets where CLZ is defined at zero to the constant 'N'.
This is determined at the RTL level through the CLZ_DEFINED_VALUE_AT_ZERO
macro.
The obvious place to implement this would be in combine through
simplify-rtx
where we'd
recognise an IF_THEN_ELSE of the form:
(set (reg:SI r1)
       (if_then_else:SI (ne (reg:SI r2)
                            (const_int 0 [0]))
         (clz:SI (reg:SI r2))
         (const_int 32)))

and if CLZ_DEFINED_VALUE_AT_ZERO is defined to 32 for SImode we'd
simplify
it into
just (clz:SI (reg:SI r2)).
However, I found this doesn't quite happen for a couple of reasons:
1) This depends on ifcvt or some other pass to have created a conditional
move of the
two branches that provide the IF_THEN_ELSE to propagate the const_int and
clz operation into.

2) Combine will refuse to propagate r2 from the above example into both
the
condition and the
CLZ at the same time, so the most we see is:
(set (reg:SI r1)
       (if_then_else:SI (ne (reg:CC cc)
              (const_int 0))
         (clz:SI (reg:SI r2))
         (const_int 32)))

which is not enough information to perform the simplification.

This patch implements the optimisation in ce1 using the noce ifcvt
framework.
During ifcvt noce_process_if_block can see that we're trying to optimise
something
of the form (x == 0 ? const_int : CLZ (x)) and so it has visibility of
all
the information
needed to perform the transformation.

The transformation is performed by adding a new noce_try* function that
tries to put the
condition and the 'then' and 'else' arms into an IF_THEN_ELSE rtx and try
to
simplify that
using the simplify-rtx machinery. That way, we can implement the
simplification logic in
simplify-rtx.c where it belongs.

A similar transformation for CTZ is implemented as well.
So for code:
int foo (int i)
{
    return (i == 0) ? 32 : __builtin_clz (i);
}

On aarch64 we now emit:
foo:
          clz     w0, w0
          ret

instead of:
foo:
          mov     w1, 32
          clz     w2, w0
          cmp     w0, 0
          csel    w0, w2, w1, ne
          ret

and for arm similarly we generate:
foo:
          clz     r0, r0
          bx      lr

instead of:
foo:
          cmp     r0, #0
          clzne   r0, r0
          moveq   r0, #32
          bx      lr


and for x86_64 with -O2 -mlzcnt we generate:
foo:
          xorl    %eax, %eax
          lzcntl  %edi, %eax
          ret

instead of:
foo:
          xorl    %eax, %eax
          movl    $32, %edx
          lzcntl  %edi, %eax
          testl   %edi, %edi
          cmove   %edx, %eax
          ret


I tried getting this to work on other targets as well, but encountered
difficulties.
For example on powerpc the two arms of the condition seen during ifcvt
are:

(insn 4 22 11 4 (set (reg:DI 156 [ <retval> ])
          (const_int 32 [0x20])) clz.c:3 434 {*movdi_internal64}
       (nil))
and
(insn 10 9 23 3 (set (subreg/s/u:SI (reg:DI 156 [ <retval> ]) 0)
          (clz:SI (subreg/u:SI (reg/v:DI 157 [ i ]) 0))) clz.c:3 132
{clzsi2}
       (expr_list:REG_DEAD (reg/v:DI 157 [ i ])
          (nil)))

So the setup code in noce_process_if_block sees that the set destination
is
not the same
((reg:DI 156 [ <retval> ]) and (subreg/s/u:SI (reg:DI 156 [ <retval> ])
0))
so it bails out on the rtx_interchangeable_p (x, SET_DEST (set_b)) check.
I suppose that's a consequence of how SImode operations are represented
in
early RTL
on powerpc, I don't know what to do there. Perhaps that part of ivcvt can
be
taught to handle
destinations that are subregs of one another, but that would be a
separate
patch.

Anyway, is this patch ok for trunk?

Bootstrapped and tested on arm-none-linux-gnueabihf,
aarch64-none-linux-gnu,
x86_64-pc-linux-gnu.

Thanks,
Kyrill

2016-05-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

      PR middle-end/37780
      * ifcvt.c (noce_try_ifelse_collapse): New function.
      Declare prototype.
      (noce_process_if_block): Call noce_try_ifelse_collapse.
      * simplify-rtx.c (simplify_cond_clz_ctz): New function.
      (simplify_ternary_operation): Use the above to simplify
      conditional CLZ/CTZ expressions.

2016-05-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

      PR middle-end/37780
      * gcc.c-torture/execute/pr37780.c: New test.
      * gcc.target/aarch64/pr37780_1.c: Likewise.
      * gcc.target/arm/pr37780_1.c: Likewise.
Hi Kyrylo,

Hi Christophe,

I've noticed that gcc.target/arm/pr37780_1.c fails on
arm if arch < v6.
I first tried to fix the effective-target guard (IIRC, the doc
says clz is available starting with v5t), but that isn't sufficient.

When compiling for armv5-t, the scan-assembler directives
fail. It seems to work with v6t2, so I am wondering whether
it's just a matter of increasing the effective-target arch version,
or if you really intended to make the test pass on these old
architectures?

I've dug into it a bit.
I think the problem is that CLZ is available with ARMv5T but only in arm
mode.
In Thumb mode it's only available with ARMv6T2.
So if your armv5t gcc compiles for Thumb by default you'll get the failure.
Actually this gcc is configured with default cpu & mode, target=arm-none-eabi.

So I think it's arm7tdmi, arm mode.

I think just bumping the effective to armv6t2 here is appropriate as I
intended
the test to just check the ifcvt transformation when the instruction is
available
rather than test that the CLZ instruction is available at one architecture
level
or another.

A patch to bump the effective target check is pre-approved if you want to do
it.
Otherwise I'll get to it in a few days.

I was a about to commit the arm_arch_v5_ok -> arm_arch_v6t2_ok, but
this would not be sufficient without adding
dg-add-options arm_arch_v6t2
but then if gcc is configured with a higher default architecture, we'll end
up testing if works for v6t2 only.

I'm okay with adding "dg-add-options arm_arch_v6t2".
We're not testing the availability of the instruction here, just the ifcvt
transformation when it *is* available, so I think we should just add whatever
option guarantees that instruction is available.

Kyrill

Thanks for spotting this,
Kyrill
Thanks,

Christophe.



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]