[PATCH][simplify-rtx] (GTU (PLUS a C) (C - 1)) --> (LTU a -C)

Fri Sep 16 10:05:00 GMT 2016

On 16/09/16 10:50, Bin.Cheng wrote:
> On Fri, Sep 16, 2016 at 10:20 AM, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> On 16/09/16 10:02, Richard Biener wrote:
>>> On Fri, Sep 16, 2016 at 10:40 AM, Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>> Hi all,
>>>>
>>>> Currently the functions:
>>>> int f1(int x, int t)
>>>> {
>>>>     if (x == -1 || x == -2)
>>>>       t = 1;
>>>>     return t;
>>>> }
>>>>
>>>> int f2(int x, int t)
>>>> {
>>>>     if (x == -1 || x == -2)
>>>>       return 1;
>>>>     return t;
>>>> }
>>>>
>>>> generate different code on AArch64 even though they have identical
>>>> functionality:
>>>> f1:
>>>>           add     w0, w0, 2
>>>>           cmp     w0, 1
>>>>           csinc   w0, w1, wzr, hi
>>>>           ret
>>>>
>>>> f2:
>>>>           cmn     w0, #2
>>>>           csinc   w0, w1, wzr, cc
>>>>           ret
>>>>
>>>> The problem is that f2 performs the comparison (LTU w0 -2)
>>>> whereas f1 performs (GTU (PLUS w0 2) 1). I think it is possible to
>>>> simplify the f1 form to the f2 form with the simplify-rtx.c rule added
>>>> in this patch. With this patch the codegen for both f1 and f2 on
>>>> aarch64 at -O2 is identical (CMN, CSINC).
>>>>
>>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>>>> aarch64-none-linux-gnu, x86_64.
>>>> What do you think? Is this a correct generalisation of this issue?
>>>> If so, ok for trunk?
>>> Do you see a difference on the GIMPLE level?  If so, this kind of
>>> transform looks appropriate there, too.
>>
>> The GIMPLE for the two functions looks almost identical:
>> f1 (intD.7 xD.3078, intD.7 tD.3079)
>> {
>>    intD.7 x_4(D) = xD.3078;
>>    intD.7 t_5(D) = tD.3079;
>>    unsigned int x.0_1;
>>    unsigned int _2;
>>    x.0_1 = (unsigned int) x_4(D);
>>
>>    _2 = x.0_1 + 2;
>>    if (_2 <= 1)
>>      goto <bb 3>;
>>    else
>>      goto <bb 4>;
>> ;;   basic block 3, loop depth 0, count 0, freq 3977, maybe hot
>> ;;   basic block 4, loop depth 0, count 0, freq 10000, maybe hot
>>
>>    # t_3 = PHI <t_5(D)(2), 1(3)>
>>    return t_3;
>> }
>>
>> f2 (intD.7 xD.3082, intD.7 tD.3083)
>> {
>>    intD.7 x_4(D) = xD.3082;
>>    intD.7 t_5(D) = tD.3083;
>>    unsigned int x.1_1;
>>    unsigned int _2;
>>    intD.7 _3;
>>
>>    x.1_1 = (unsigned int) x_4(D);
>>
>>    _2 = x.1_1 + 2;
>>    if (_2 <= 1)
>>      goto <bb 4>;
>>    else
>>      goto <bb 3>;
>>
>> ;;   basic block 3, loop depth 0, count 0, freq 6761, maybe hot
>> ;;   basic block 4, loop depth 0, count 0, freq 10000, maybe hot
>>    # _3 = PHI <1(2), t_5(D)(3)>
>>    return _3;
>>
>> }
>>
>> So at GIMPLE level we see a (x + 2 <=u 1) in both cases but with slightly
>> different CFG.  RTL-level transformations (ce1) bring it to the pre-combine
>> RTL where one does (LTU w0 -2) and the other does (GTU (PLUS w0 2) 1).
>>
>> Since the differences start at RTL level, I think we need this
>> transformation there.
>> However, for the testcase:
>> unsigned int
>> foo (unsigned int a, unsigned int b)
>> {
>>    return (a + 2) > 1;
>> }
>>
>> The differences do appear at GIMPLE level, so I think a match.pd pattern
>> would help here.
> Hi, may I ask which function this one is being compared against?

Hi Bin,
I meant to say that the unsigned greater-than comparison is retained at the
GIMPLE level, so it could be optimised there.

Kyrill

> Thanks,
> bin
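
The rule in the subject line, (GTU (PLUS a C) (C - 1)) --> (LTU a -C), can be
sanity-checked outside the compiler. The following is only a sketch and is not
taken from the patch: it assumes an 8-bit unsigned mode so that every (a, C)
pair can be enumerated, and the helper names (gtu_form, ltu_form,
count_mismatches, eq_form, range_form) are hypothetical. It also compares the
source-level test from f1/f2 against the range check seen in the GIMPLE dump.

```c
#include <stdint.h>

/* Left-hand side of the rule: (GTU (PLUS a C) (C - 1)).
   All arithmetic wraps modulo 2^8 (assumed 8-bit mode). */
static int gtu_form(uint8_t a, uint8_t c)
{
    uint8_t sum = (uint8_t)(a + c);
    uint8_t bound = (uint8_t)(c - 1);
    return sum > bound;
}

/* Right-hand side of the rule: (LTU a -C), with -C taken modulo 2^8. */
static int ltu_form(uint8_t a, uint8_t c)
{
    uint8_t neg = (uint8_t)(0u - c);
    return a < neg;
}

/* Number of (a, C) pairs, out of all 256 * 256, where the forms disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (unsigned c = 0; c < 256; c++)
        for (unsigned a = 0; a < 256; a++)
            if (gtu_form((uint8_t)a, (uint8_t)c)
                != ltu_form((uint8_t)a, (uint8_t)c))
                mismatches++;
    return mismatches;
}

/* The condition as written in f1 and f2. */
static int eq_form(int x)
{
    return x == -1 || x == -2;
}

/* The condition as it appears in the GIMPLE dump: (unsigned) x + 2 <= 1. */
static int range_form(int x)
{
    return (unsigned int)x + 2u <= 1u;
}
```

If the rule holds, count_mismatches() returns 0. The intuition is that
a + C >= C (unsigned) is true exactly when the addition does not wrap, which is
the same as a < 2^n - C. An exhaustive 8-bit check does not prove the rule for
wider modes, but the same wrap-around argument applies to any width.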