Add an "early rematerialisation" pass

Thu Dec 14 23:22:00 GMT 2017

Richard Biener <richard.guenther@gmail.com> writes:
> On December 14, 2017 8:26:49 PM GMT+01:00, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>>Jeff Law <law@redhat.com> writes:
>>> On 12/14/2017 04:09 AM, Richard Biener wrote:
>>>> On Fri, Nov 17, 2017 at 4:58 PM, Richard Sandiford
>>>> <richard.sandiford@linaro.org> wrote:
>>>>> This patch looks for pseudo registers that are live across a call
>>>>> and for which no call-preserved hard registers exist.  It then
>>>>> recomputes the pseudos as necessary to ensure that they are no
>>>>> longer live across a call.  The comment at the head of the file
>>>>> describes the approach.
>>>>>
>>>>> A new target hook selects which modes should be treated in this
>>way.
>>>>> By default none are, in which case the pass is skipped very early.
>>>>>
>>>>> It might also be worth looking for cases like:
>>>>>
>>>>>    C1: R1 := f (...)
>>>>>    ...
>>>>>    C2: R2 := f (...)
>>>>>    C3: R1 := C2
>>>>>
>>>>> and giving the same value number to C1 and C3, effectively treating
>>>>> it like:
>>>>>
>>>>>    C1: R1 := f (...)
>>>>>    ...
>>>>>    C2: R2 := f (...)
>>>>>    C3: R1 := f (...)
>>>>>
>>>>> Another (much more expensive) enhancement would be to apply value
>>>>> numbering to all pseudo registers (not just rematerialisation
>>>>> candidates), so that we can handle things like:
>>>>>
>>>>>   C1: R1 := f (...R2...)
>>>>>   ...
>>>>>   C2: R1 := f (...R3...)
>>>>>
>>>>> where R2 and R3 hold the same value.  But the current pass seems
>>>>> to catch the vast majority of cases.
>>>>>
>>>>> Tested on aarch64-linux-gnu (with and without SVE),
>>x86_64-linux-gnu
>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>> 
>>>> Can you tell anything about the complexity of the algorithm?
>>
>>Have to get back to you on that one. :-)
>>
>>>> How does it relate to what LRA can do?  AFAIK LRA doesn't try to
>>find
>>>> any global optimal solution and previous hardreg assignments may
>>work
>>>> against it?
>>
>>Yeah, both of those are problems.  But the more important problem is
>>that it can't increase the live ranges of input registers as easily.
>>Doing it before RA means that IRA gets to see the new ranges.
>>
>>>> That said - I would have expected remat to be done before the
>>>> first scheduling pass?  Even before pass_sms (not sure
>>>> what pass_live_range_shrinkage does).  Or be integrated
>>>> with scheduling and it's register pressure cost model.
>>
>>SMS shouldn't be a problem.  Early remat wouldn't introduce new
>>instructions into a loop unless the loop also had a call, which would
>>prevent SMS.  And although it's theoretically possible that it could
>>remove instructions from a loop, that would only happen if:
>>
>>  (a) the instruction actually computes the same value every time, so
>>      could have been moved outside the loop; and
>>
>>  (b) the result is only used after a following call (and in particular
>>      isn't used within the loop itself)
>>
>>(a) is a missed optimisation and (b) seems unlikely.
>>
>>Integrating remat into scheduling would make it much less powerful,
>>since scheduling does only limited code motion between blocks.
>>
>>Doing it before scheduling would be good in principle, but there
>>would then need to be a fake dependency between the call and remat
>>instructions to stop the scheduler moving the remat instructions
>>back before the call.  Adding early remat was a way of avoiding such
>>fake dependencies in "every" pass, but it might be that scheduling
>>is one case in which the dependencies make sense.
>>
>>Either way, being able to run the pass before scheduling seems
>>like a future enhancement, blocked on a future enhancement to
>>the scheduler.
>>
>>>> Also I would have expected the approach to apply to all modes,
>>>> just the cost of spilling is different.  But if you can, say, reduce
>>>> register pressure by one by rematerializing a bit-not then that
>>>> should be always profitable, no?  postreload-cse will come to
>>>> the rescue anyhow.
>>
>>But that would then mean taking the register pressure into account when
>>deciding whether to rematerialise.  On the one hand would make it hard
>>to do before scheduling (which decides the final pre-RA pressure).
>>It would also make it a significantly different algorithm, since it
>>wouldn't be a standard availability problem any more.
>>
>>For that use case, pressure-dependent remat in the scheduler might
>>be a better approach, like you were suggesting.  The early remat
>>pass is specifically for the extreme case of no registers being
>>call-preserved, where it's more important that we don't miss
>>remat opportunities, and more important that we treat it as
>>a global problem.
>
> On x86_64 all xmm registers are caller saved for example. That means all
> FP regs and all vectors. (yeah, stupid ABI decision....)

OK.  The patch uses a target hook to select the modes -- basing it off
whether they're variable-length is just the default.  So if this turns
out to be a win for x86_64 or for SPARC (was originally going to reply
to that in Jeff's message, sorry), then the target could opt in if it
wants to.

Thanks,
Richard