[PATCH][simplify-rtx] (GTU (PLUS a C) (C - 1)) --> (LTU a -C)

Fri Sep 16 10:02:00 GMT 2016

On Fri, Sep 16, 2016 at 10:20 AM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> On 16/09/16 10:02, Richard Biener wrote:
>>
>> On Fri, Sep 16, 2016 at 10:40 AM, Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>
>>> Hi all,
>>>
>>> Currently the functions:
>>> int f1(int x, int t)
>>> {
>>>    if (x == -1 || x == -2)
>>>      t = 1;
>>>    return t;
>>> }
>>>
>>> int f2(int x, int t)
>>> {
>>>    if (x == -1 || x == -2)
>>>      return 1;
>>>    return t;
>>> }
>>>
>>> generate different code on AArch64 even though they have identical
>>> functionality:
>>> f1:
>>>          add     w0, w0, 2
>>>          cmp     w0, 1
>>>          csinc   w0, w1, wzr, hi
>>>          ret
>>>
>>> f2:
>>>          cmn     w0, #2
>>>          csinc   w0, w1, wzr, cc
>>>          ret
>>>
>>> The problem is that f2 performs the comparison (LTU w0 -2)
>>> whereas f1 performs (GTU (PLUS w0 2) 1). I think it is possible to
>>> simplify the f1 form to the f2 form with the simplify-rtx.c rule added
>>> in this patch. With this patch the codegen for both f1 and f2 on
>>> aarch64 at -O2 is identical (CMN, CSINC).
>>>
>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>>> aarch64-none-linux-gnu, x86_64.
>>> What do you think? Is this a correct generalisation of this issue?
>>> If so, ok for trunk?
>>
>> Do you see a difference on the GIMPLE level?  If so, this kind of
>> transform looks appropriate there, too.
>
>
> The GIMPLE for the two functions looks almost identical:
> f1 (intD.7 xD.3078, intD.7 tD.3079)
> {
>   intD.7 x_4(D) = xD.3078;
>   intD.7 t_5(D) = tD.3079;
>   unsigned int x.0_1;
>   unsigned int _2;
>   x.0_1 = (unsigned int) x_4(D);
>
>   _2 = x.0_1 + 2;
>   if (_2 <= 1)
>     goto <bb 3>;
>   else
>     goto <bb 4>;
> ;;   basic block 3, loop depth 0, count 0, freq 3977, maybe hot
> ;;   basic block 4, loop depth 0, count 0, freq 10000, maybe hot
>
>   # t_3 = PHI <t_5(D)(2), 1(3)>
>   return t_3;
> }
>
> f2 (intD.7 xD.3082, intD.7 tD.3083)
> {
>   intD.7 x_4(D) = xD.3082;
>   intD.7 t_5(D) = tD.3083;
>   unsigned int x.1_1;
>   unsigned int _2;
>   intD.7 _3;
>
>   x.1_1 = (unsigned int) x_4(D);
>
>   _2 = x.1_1 + 2;
>   if (_2 <= 1)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
>
> ;;   basic block 3, loop depth 0, count 0, freq 6761, maybe hot
> ;;   basic block 4, loop depth 0, count 0, freq 10000, maybe hot
>   # _3 = PHI <1(2), t_5(D)(3)>
>   return _3;
>
> }
>
> So at GIMPLE level we see a (x + 2 <=u 1) in both cases but with slightly
> different CFG.  RTL-level transformations (ce1) bring it to the pre-combine
> RTL where one does (LTU w0 -2) and the other does (GTU (PLUS w0 2) 1).
>
> The differences start at RTL level, so I think we need this
> transformation there.
> However, for the testcase:
> unsigned int
> foo (unsigned int a, unsigned int b)
> {
>   return (a + 2) > 1;
> }
>
> The differences do appear at GIMPLE level, so I think a match.pd pattern
> would help here.

Hi, may I ask what the function that this one differs from looks like?

Thanks,
bin
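
For reference, the rule in the subject line can be sanity-checked by brute
force. The sketch below is not part of the patch. It assumes an 8-bit mode so
that every (a, C) pair can be enumerated, and the helper names gtu_form,
ltu_form, count_mismatches and check_rule are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Left-hand side of the rule: (GTU (PLUS a C) (C - 1)),
   evaluated in 8-bit modular arithmetic (an assumption of this sketch).  */
static int gtu_form(uint8_t a, uint8_t c)
{
    uint8_t sum = (uint8_t)(a + c);    /* PLUS wraps modulo 2^8 */
    uint8_t bound = (uint8_t)(c - 1);  /* C - 1, also modulo 2^8 */
    return sum > bound;                /* unsigned greater-than */
}

/* Right-hand side of the rule: (LTU a -C).  */
static int ltu_form(uint8_t a, uint8_t c)
{
    uint8_t neg = (uint8_t)(0u - c);   /* -C modulo 2^8 */
    return a < neg;                    /* unsigned less-than */
}

/* Count the (a, C) pairs on which the two forms disagree.  */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned c = 0; c < 256; c++)
        for (unsigned a = 0; a < 256; a++)
            if (gtu_form((uint8_t)a, (uint8_t)c)
                != ltu_form((uint8_t)a, (uint8_t)c))
                bad++;
    return bad;
}

/* Abort if any pair disagrees.  */
static void check_rule(void)
{
    assert(count_mismatches() == 0);
}
```

The intuition is that, for nonzero C, (a + C) >u (C - 1) holds exactly when the
addition does not wrap, which is the same as a <u -C. With C = 2 this is the
GTU form seen for f1 against the LTU form seen for f2. A wider mode would need
a proof or random sampling rather than enumeration.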