This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PR middle-end/70359] uncoalesce IVs outside of loops


On Tue, Mar 20, 2018 at 7:15 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Mar 20, 2018 at 5:56 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On March 20, 2018 6:11:53 PM GMT+01:00, "Bin.Cheng" <amker.cheng@gmail.com> wrote:
>>>On Mon, Mar 19, 2018 at 5:08 PM, Aldy Hernandez <aldyh@redhat.com> wrote:
>>>> Hi Richard.
>>>>
>>>> As discussed in the PR, the problem here is that we have two different
>>>> iterations of an IV live outside of a loop.  This inhibits us from using
>>>> autoinc/dec addressing on ARM, and causes extra lea's on x86.
>>>>
>>>> An abbreviated example is this:
>>>>
>>>> loop:
>>>>   # p_9 = PHI <p_17(2), p_20(3)>
>>>>   p_20 = p_9 + 18446744073709551615;
>>>> goto loop
>>>>   p_24 = p_9 + 18446744073709551614;
>>>>   MEM[(char *)p_20 + -1B] = 45;
>>>>
>>>> Here we have both the previous IV (p_9) and the current IV (p_20) used
>>>> outside of the loop.  On Arm this keeps us from using auto-dec addressing,
>>>> because one use is -2 and the other one is -1.
>>>>
>>>> With the attached patch we attempt to rewrite out-of-loop uses of the IV in
>>>> terms of the current/last IV (p_20 in the case above).  With it, we end up
>>>> with:
>>>>
>>>>   p_24 = p_20 + 18446744073709551615;
>>>>   *p_24 = 45;
>>>>
>>>> ...which helps both x86 and Arm.
>>>>
>>>> As you have suggested in comment 38 on the PR, I handle specially
>>>> out-of-loop IV uses of the form IV+CST and propagate those accordingly
>>>> (along with the MEM_REF above).  Otherwise, in less specific cases, we un-do
>>>> the IV increment, and use that value in all out-of-loop uses.  For instance,
>>>> in the attached testcase, we rewrite:
>>>>
>>>>   george (p_9);
>>>>
>>>> into
>>>>
>>>>   _26 = p_20 + 1;
>>>>   ...
>>>>   george (_26);
>>>>
>>>> The attached testcase tests the IV+CST specific case, as well as the more
>>>> generic case with george().
>>>>
>>>> Although the original PR was for ARM, this behavior can be noticed on x86,
>>>> so I tested on x86 with a full bootstrap + tests.  I also ran the specific
>>>> test on an x86 cross ARM build and made sure we had 2 auto-dec with the
>>>> test.  For the original test (slightly different than the testcase in this
>>>> patch), with this patch we are at 104 bytes versus 116 without it.  There is
>>>> still the issue of a division optimization which would further reduce the
>>>> code size.  I will discuss this separately as it is independent from this
>>>> patch.
>>>>
>>>> Oh yeah, we could make this more generic, and maybe handle any multiple of
>>>> the constant, or perhaps *= and /=.  Perhaps something for next stage1...
>>>>
>>>> OK for trunk?
>>>Just FYI, this looks similar to what I did in
>>>https://gcc.gnu.org/ml/gcc-patches/2013-11/msg00535.html
>>>That change was non-trivial and didn't give an obvious improvement back
>>>then.  But I still wonder if this can be done when rewriting iv_use, with
>>>low overhead.
>>
>> Certainly, but the issue is we wreck it again at forwprop time as ivopts runs too early.
> So both values, p_9 and p_20, are used after the loop.
>
> loop:
>   # p_9 = PHI <p_17(2), p_20(3)>
>   p_20 = p_9 + 18446744073709551615;
> goto loop
>   p_24 = p_20 + 18446744073709551615;
>   MEM[(char *)p_20 + -1B] = 45;
>
> It looks like a fwprop issue: propagating p_20's definition in terms of
> p_9 into its use results in the code below:
>
> loop:
>   # p_9 = PHI <p_17(2), p_20(3)>
>   p_20 = p_9 + 18446744073709551615;
> goto loop
>   p_24 = p_9 + 18446744073709551614;
>   MEM[(char *)p_20 + -1B] = 45;
>
> It creates intersecting/longer live ranges, while it doesn't eliminate
> the copy or definition for p_9.

Yes.  It's actually general folding patterns that combine the two adds.
It may be profitable to do this even with intersecting live ranges, because
you gain some scheduling freedom and reduce dependences.

So it's not easy to avoid in general.
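
To make the arithmetic concrete, here is a minimal C sketch of what the
folding does.  It only models the pointer values as 64-bit unsigned
integers; the helper names (iv_next, use_via_current, use_via_previous,
undo_increment, check_equivalence) are made up for illustration and are
not taken from the patch or from GCC.  The constants
18446744073709551615 and 18446744073709551614 in the dumps are -1 and -2
modulo 2^64.

```c
#include <assert.h>
#include <stdint.h>

/* Step applied by the loop: p_20 = p_9 + 18446744073709551615,
   i.e. p_9 - 1 modulo 2^64.  */
#define IV_STEP UINT64_C(18446744073709551615)

/* Hypothetical helper: one IV update, as in the loop body.  */
static uint64_t iv_next(uint64_t p_9)
{
    return p_9 + IV_STEP;
}

/* Out-of-loop use written in terms of the current IV:
   p_24 = p_20 + 18446744073709551615.  */
static uint64_t use_via_current(uint64_t p_9)
{
    uint64_t p_20 = iv_next(p_9);
    return p_20 + UINT64_C(18446744073709551615);
}

/* The same value after the two adds are combined:
   p_24 = p_9 + 18446744073709551614.  This keeps p_9 live
   after the loop in addition to p_20.  */
static uint64_t use_via_previous(uint64_t p_9)
{
    return p_9 + UINT64_C(18446744073709551614);
}

/* The generic fallback described in the patch mail: recover the
   previous IV from the current one, _26 = p_20 + 1.  */
static uint64_t undo_increment(uint64_t p_20)
{
    return p_20 + 1;
}

/* Both forms compute the same value, and undoing the increment
   yields the previous IV again (unsigned wraparound is well defined).  */
static void check_equivalence(uint64_t p_9)
{
    assert(use_via_current(p_9) == use_via_previous(p_9));
    assert(undo_increment(iv_next(p_9)) == p_9);
}
```

Calling check_equivalence() with any value, including 0 where the
arithmetic wraps, should pass.  The two forms differ only in which IV
value has to stay live after the loop, which is the point of the
discussion above.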

> Ah, IIRC, RTL address forward propagation also has this issue.

I guess so.

Richard.

> Thanks,
> bin
>>
>> Richard.
>>>
>>>Thanks,
>>>bin
>>>> Aldy
>>

