Add an "early rematerialisation" pass

Thu Dec 14 19:32:00 GMT 2017

On December 14, 2017 8:26:49 PM GMT+01:00, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>Jeff Law <law@redhat.com> writes:
>> On 12/14/2017 04:09 AM, Richard Biener wrote:
>>> On Fri, Nov 17, 2017 at 4:58 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> This patch looks for pseudo registers that are live across a call
>>>> and for which no call-preserved hard registers exist.  It then
>>>> recomputes the pseudos as necessary to ensure that they are no
>>>> longer live across a call.  The comment at the head of the file
>>>> describes the approach.
>>>>
>>>> A new target hook selects which modes should be treated in this
>way.
>>>> By default none are, in which case the pass is skipped very early.
>>>>
>>>> It might also be worth looking for cases like:
>>>>
>>>>    C1: R1 := f (...)
>>>>    ...
>>>>    C2: R2 := f (...)
>>>>    C3: R1 := C2
>>>>
>>>> and giving the same value number to C1 and C3, effectively treating
>>>> it like:
>>>>
>>>>    C1: R1 := f (...)
>>>>    ...
>>>>    C2: R2 := f (...)
>>>>    C3: R1 := f (...)
>>>>
>>>> Another (much more expensive) enhancement would be to apply value
>>>> numbering to all pseudo registers (not just rematerialisation
>>>> candidates), so that we can handle things like:
>>>>
>>>>   C1: R1 := f (...R2...)
>>>>   ...
>>>>   C2: R1 := f (...R3...)
>>>>
>>>> where R2 and R3 hold the same value.  But the current pass seems
>>>> to catch the vast majority of cases.
>>>>
>>>> Tested on aarch64-linux-gnu (with and without SVE),
>x86_64-linux-gnu
>>>> and powerpc64le-linux-gnu.  OK to install?
>>> 
>>> Can you tell anything about the complexity of the algorithm?
>
>Have to get back to you on that one. :-)
>
>>> How does it relate to what LRA can do?  AFAIK LRA doesn't try to
>find
>>> any global optimal solution and previous hardreg assignments may
>work
>>> against it?
>
>Yeah, both of those are problems.  But the more important problem is
>that it can't increase the live ranges of input registers as easily.
>Doing it before RA means that IRA gets to see the new ranges.
>
>>> That said - I would have expected remat to be done before the
>>> first scheduling pass?  Even before pass_sms (not sure
>>> what pass_live_range_shrinkage does).  Or be integrated
>>> with scheduling and it's register pressure cost model.
>
>SMS shouldn't be a problem.  Early remat wouldn't introduce new
>instructions into a loop unless the loop also had a call, which would
>prevent SMS.  And although it's theoretically possible that it could
>remove instructions from a loop, that would only happen if:
>
>  (a) the instruction actually computes the same value every time, so
>      could have been moved outside the loop; and
>
>  (b) the result is only used after a following call (and in particular
>      isn't used within the loop itself)
>
>(a) is a missed optimisation and (b) seems unlikely.
>
>Integrating remat into scheduling would make it much less powerful,
>since scheduling does only limited code motion between blocks.
>
>Doing it before scheduling would be good in principle, but there
>would then need to be a fake dependency between the call and remat
>instructions to stop the scheduler moving the remat instructions
>back before the call.  Adding early remat was a way of avoiding such
>fake dependencies in "every" pass, but it might be that scheduling
>is one case in which the dependencies make sense.
>
>Either way, being able to run the pass before scheduling seems
>like a future enhancement, blocked on a future enhancement to
>the scheduler.
>
>>> Also I would have expected the approach to apply to all modes,
>>> just the cost of spilling is different.  But if you can, say, reduce
>>> register pressure by one by rematerializing a bit-not then that
>>> should be always profitable, no?  postreload-cse will come to
>>> the rescue anyhow.
>
>But that would then mean taking the register pressure into account when
>deciding whether to rematerialise.  On the one hand would make it hard
>to do before scheduling (which decides the final pre-RA pressure).
>It would also make it a significantly different algorithm, since it
>wouldn't be a standard availability problem any more.
>
>For that use case, pressure-dependent remat in the scheduler might
>be a better approach, like you were suggesting.  The early remat
>pass is specifically for the extreme case of no registers being
>call-preserved, where it's more important that we don't miss
>remat opportunities, and more important that we treat it as
>a global problem.

On x86_64 all xmm registers are caller saved for example. That means all FP regs and all vectors. (yeah, stupid ABI decision....)

Richard. 

>Thanks,
>Richard