This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][3/3][RTL ifcvt] PR middle-end/37780: Conditional expression with __builtin_clz() should be optimized out


On 8 June 2016 at 18:40, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>
> On 07/06/16 20:34, Christophe Lyon wrote:
>>
>> On 26 May 2016 at 11:53, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> In this PR we want to optimise:
>>> int foo (int i)
>>> {
>>>    return (i == 0) ? N : __builtin_clz (i);
>>> }
>>>
>>> on targets where CLZ is defined at zero to the constant 'N'.
>>> This is determined at the RTL level through the CLZ_DEFINED_VALUE_AT_ZERO
>>> macro.
>>> The obvious place to implement this would be in combine through
>>> simplify-rtx
>>> where we'd
>>> recognise an IF_THEN_ELSE of the form:
>>> (set (reg:SI r1)
>>>       (if_then_else:SI (ne (reg:SI r2)
>>>                            (const_int 0 [0]))
>>>         (clz:SI (reg:SI r2))
>>>         (const_int 32)))
>>>
>>> and if CLZ_DEFINED_VALUE_AT_ZERO is defined to 32 for SImode we'd
>>> simplify
>>> it into
>>> just (clz:SI (reg:SI r2)).
>>> However, I found this doesn't quite happen for a couple of reasons:
>>> 1) This depends on ifcvt or some other pass to have created a conditional
>>> move of the
>>> two branches that provide the IF_THEN_ELSE to propagate the const_int and
>>> clz operation into.
>>>
>>> 2) Combine will refuse to propagate r2 from the above example into both
>>> the
>>> condition and the
>>> CLZ at the same time, so the most we see is:
>>> (set (reg:SI r1)
>>>       (if_then_else:SI (ne (reg:CC cc)
>>>              (const_int 0))
>>>         (clz:SI (reg:SI r2))
>>>         (const_int 32)))
>>>
>>> which is not enough information to perform the simplification.
>>>
>>> This patch implements the optimisation in ce1 using the noce ifcvt
>>> framework.
>>> During ifcvt noce_process_if_block can see that we're trying to optimise
>>> something
>>> of the form (x == 0 ? const_int : CLZ (x)) and so it has visibility of
>>> all
>>> the information
>>> needed to perform the transformation.
>>>
>>> The transformation is performed by adding a new noce_try* function that
>>> tries to put the
>>> condition and the 'then' and 'else' arms into an IF_THEN_ELSE rtx and try
>>> to
>>> simplify that
>>> using the simplify-rtx machinery. That way, we can implement the
>>> simplification logic in
>>> simplify-rtx.c where it belongs.
>>>
>>> A similar transformation for CTZ is implemented as well.
>>> So for code:
>>> int foo (int i)
>>> {
>>>    return (i == 0) ? 32 : __builtin_clz (i);
>>> }
>>>
>>> On aarch64 we now emit:
>>> foo:
>>>          clz     w0, w0
>>>          ret
>>>
>>> instead of:
>>> foo:
>>>          mov     w1, 32
>>>          clz     w2, w0
>>>          cmp     w0, 0
>>>          csel    w0, w2, w1, ne
>>>          ret
>>>
>>> and for arm similarly we generate:
>>> foo:
>>>          clz     r0, r0
>>>          bx      lr
>>>
>>> instead of:
>>> foo:
>>>          cmp     r0, #0
>>>          clzne   r0, r0
>>>          moveq   r0, #32
>>>          bx      lr
>>>
>>>
>>> and for x86_64 with -O2 -mlzcnt we generate:
>>> foo:
>>>          xorl    %eax, %eax
>>>          lzcntl  %edi, %eax
>>>          ret
>>>
>>> instead of:
>>> foo:
>>>          xorl    %eax, %eax
>>>          movl    $32, %edx
>>>          lzcntl  %edi, %eax
>>>          testl   %edi, %edi
>>>          cmove   %edx, %eax
>>>          ret
>>>
>>>
>>> I tried getting this to work on other targets as well, but encountered
>>> difficulties.
>>> For example on powerpc the two arms of the condition seen during ifcvt
>>> are:
>>>
>>> (insn 4 22 11 4 (set (reg:DI 156 [ <retval> ])
>>>          (const_int 32 [0x20])) clz.c:3 434 {*movdi_internal64}
>>>       (nil))
>>> and
>>> (insn 10 9 23 3 (set (subreg/s/u:SI (reg:DI 156 [ <retval> ]) 0)
>>>          (clz:SI (subreg/u:SI (reg/v:DI 157 [ i ]) 0))) clz.c:3 132
>>> {clzsi2}
>>>       (expr_list:REG_DEAD (reg/v:DI 157 [ i ])
>>>          (nil)))
>>>
>>> So the setup code in noce_process_if_block sees that the set destination
>>> is
>>> not the same
>>> ((reg:DI 156 [ <retval> ]) and (subreg/s/u:SI (reg:DI 156 [ <retval> ])
>>> 0))
>>> so it bails out on the rtx_interchangeable_p (x, SET_DEST (set_b)) check.
>>> I suppose that's a consequence of how SImode operations are represented
>>> in
>>> early RTL
>>> on powerpc, I don't know what to do there. Perhaps that part of ivcvt can
>>> be
>>> taught to handle
>>> destinations that are subregs of one another, but that would be a
>>> separate
>>> patch.
>>>
>>> Anyway, is this patch ok for trunk?
>>>
>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>>> aarch64-none-linux-gnu,
>>> x86_64-pc-linux-gnu.
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2016-05-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      PR middle-end/37780
>>>      * ifcvt.c (noce_try_ifelse_collapse): New function.
>>>      Declare prototype.
>>>      (noce_process_if_block): Call noce_try_ifelse_collapse.
>>>      * simplify-rtx.c (simplify_cond_clz_ctz): New function.
>>>      (simplify_ternary_operation): Use the above to simplify
>>>      conditional CLZ/CTZ expressions.
>>>
>>> 2016-05-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      PR middle-end/37780
>>>      * gcc.c-torture/execute/pr37780.c: New test.
>>>      * gcc.target/aarch64/pr37780_1.c: Likewise.
>>>      * gcc.target/arm/pr37780_1.c: Likewise.
>>
>> Hi Kyrylo,
>
>
> Hi Christophe,
>
>> I've noticed that gcc.target/arm/pr37780_1.c fails on
>> arm if arch < v6.
>> I first tried to fix the effective-target guard (IIRC, the doc
>> says clz is available starting with v5t), but that isn't sufficient.
>>
>> When compiling for armv5-t, the scan-assembler directives
>> fail. It seems to work with v6t2, so I am wondering whether
>> it's just a matter of increasing the effective-target arch version,
>> or if you really intended to make the test pass on these old
>> architectures?
>
>
> I've dug into it a bit.
> I think the problem is that CLZ is available with ARMv5T but only in arm
> mode.
> In Thumb mode it's only available with ARMv6T2.
> So if your armv5t gcc compiles for Thumb by default you'll get the failure.

Actually this gcc is configured with default cpu & mode, target=arm-none-eabi.

So I think it's arm7tdmi, arm mode.

> I think just bumping the effective to armv6t2 here is appropriate as I
> intended
> the test to just check the ifcvt transformation when the instruction is
> available
> rather than test that the CLZ instruction is available at one architecture
> level
> or another.
>
> A patch to bump the effective target check is pre-approved if you want to do
> it.
> Otherwise I'll get to it in a few days.
>
I was a about to commit the arm_arch_v5_ok -> arm_arch_v6t2_ok, but
this would not be sufficient without adding
dg-add-options arm_arch_v6t2
but then if gcc is configured with a higher default architecture, we'll end
up testing if works for v6t2 only.

> Thanks for spotting this,
> Kyrill
>>
>> Thanks,
>>
>> Christophe.
>
>


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]