[PR middle-end/70359] uncoalesce IVs outside of loops

Tue Mar 20 18:18:00 GMT 2018

On Tue, Mar 20, 2018 at 5:56 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On March 20, 2018 6:11:53 PM GMT+01:00, "Bin.Cheng" <amker.cheng@gmail.com> wrote:
>>On Mon, Mar 19, 2018 at 5:08 PM, Aldy Hernandez <aldyh@redhat.com> wrote:
>>> Hi Richard.
>>>
>>> As discussed in the PR, the problem here is that we have two different
>>> iterations of an IV live outside of a loop.  This inhibits us from using
>>> autoinc/dec addressing on ARM, and causes extra lea's on x86.
>>>
>>> An abbreviated example is this:
>>>
>>> loop:
>>>   # p_9 = PHI <p_17(2), p_20(3)>
>>>   p_20 = p_9 + 18446744073709551615;
>>> goto loop
>>>   p_24 = p_9 + 18446744073709551614;
>>>   MEM[(char *)p_20 + -1B] = 45;
>>>
>>> Here we have both the previous IV (p_9) and the current IV (p_20) used
>>> outside of the loop.  On Arm this keeps us from using auto-dec addressing,
>>> because one use is -2 and the other one is -1.
>>>
>>> With the attached patch we attempt to rewrite out-of-loop uses of the IV in
>>> terms of the current/last IV (p_20 in the case above).  With it, we end up
>>> with:
>>>
>>>   p_24 = p_20 + 18446744073709551615;
>>>   *p_24 = 45;
>>>
>>> ...which helps both x86 and Arm.
>>>
>>> As you have suggested in comment 38 on the PR, I handle specially
>>> out-of-loop IV uses of the form IV+CST and propagate those accordingly
>>> (along with the MEM_REF above).  Otherwise, in less specific cases, we un-do
>>> the IV increment, and use that value in all out-of-loop uses.  For instance,
>>> in the attached testcase, we rewrite:
>>>
>>>   george (p_9);
>>>
>>> into
>>>
>>>   _26 = p_20 + 1;
>>>   ...
>>>   george (_26);
>>>
>>> The attached testcase tests the IV+CST specific case, as well as the more
>>> generic case with george().
>>>
>>> Although the original PR was for ARM, this behavior can be noticed on x86,
>>> so I tested on x86 with a full bootstrap + tests.  I also ran the specific
>>> test on an x86 cross ARM build and made sure we had 2 auto-dec with the
>>> test.  For the original test (slightly different than the testcase in this
>>> patch), with this patch we are at 104 bytes versus 116 without it.  There is
>>> still the issue of a division optimization which would further reduce the
>>> code size.  I will discuss this separately as it is independent from this
>>> patch.
>>>
>>> Oh yeah, we could make this more generic, and maybe handle any multiple of
>>> the constant, or perhaps *= and /=.  Perhaps something for next stage1...
>>>
>>> OK for trunk?
>>Just FYI, this looks similar to what I did in
>>https://gcc.gnu.org/ml/gcc-patches/2013-11/msg00535.html
>>That change was non-trivial and didn't give an obvious improvement back
>>then.  But I still wonder if this can be done when rewriting iv_use, in a
>>low-overhead way.
>
> Certainly, but the issue is we wreck it again at forwprop time as ivopts runs too early.

So both p_9 and p_20 end up being used after the loop.  IVOPTs itself
produces:

loop:
  # p_9 = PHI <p_17(2), p_20(3)>
  p_20 = p_9 + 18446744073709551615;
goto loop
  p_24 = p_20 + 18446744073709551615;
  MEM[(char *)p_20 + -1B] = 45;

This looks like a fwprop issue: it propagates the definition of p_20
(in terms of p_9) into p_24, which results in the code below:

loop:
  # p_9 = PHI <p_17(2), p_20(3)>
  p_20 = p_9 + 18446744073709551615;
goto loop
  p_24 = p_9 + 18446744073709551614;
  MEM[(char *)p_20 + -1B] = 45;

This creates intersecting/longer live ranges without eliminating the
copy or the definition of p_9.
Ah, IIRC, RTL address forward propagation also has this issue.

Thanks,
bin
>
> Richard.
>>
>>Thanks,
>>bin
>>> Aldy
>