Add an "early rematerialisation" pass
Jeff Law
law@redhat.com
Fri Dec 15 21:50:00 GMT 2017
On 11/17/2017 08:58 AM, Richard Sandiford wrote:
> This patch looks for pseudo registers that are live across a call
> and for which no call-preserved hard registers exist. It then
> recomputes the pseudos as necessary to ensure that they are no
> longer live across a call. The comment at the head of the file
> describes the approach.
>
> A new target hook selects which modes should be treated in this way.
> By default none are, in which case the pass is skipped very early.
>
> It might also be worth looking for cases like:
>
> C1: R1 := f (...)
> ...
> C2: R2 := f (...)
> C3: R1 := C2
>
> and giving the same value number to C1 and C3, effectively treating
> it like:
>
> C1: R1 := f (...)
> ...
> C2: R2 := f (...)
> C3: R1 := f (...)
>
> Another (much more expensive) enhancement would be to apply value
> numbering to all pseudo registers (not just rematerialisation
> candidates), so that we can handle things like:
>
> C1: R1 := f (...R2...)
> ...
> C2: R1 := f (...R3...)
>
> where R2 and R3 hold the same value. But the current pass seems
> to catch the vast majority of cases.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu. OK to install?
>
> Richard
>
>
> 2017-11-17 Richard Sandiford <richard.sandiford@linaro.org>
>
> gcc/
> * Makefile.in (OBJS): Add early-remat.o.
> * target.def (select_early_remat_modes): New hook.
> * doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.
> * doc/tm.texi: Regenerate.
> * targhooks.h (default_select_early_remat_modes): Declare.
> * targhooks.c (default_select_early_remat_modes): New function.
> * timevar.def (TV_EARLY_REMAT): New timevar.
> * passes.def (pass_early_remat): New pass.
> * tree-pass.h (make_pass_early_remat): Declare.
> * early-remat.c: New file.
> * config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New
> function.
> (TARGET_SELECT_EARLY_REMAT_MODES): Define.
>
> gcc/testsuite/
> * gcc.target/aarch64/sve_spill_1.c: Also test that no predicates
> are spilled.
> * gcc.target/aarch64/sve_spill_2.c: New test.
> * gcc.target/aarch64/sve_spill_3.c: Likewise.
> * gcc.target/aarch64/sve_spill_4.c: Likewise.
> * gcc.target/aarch64/sve_spill_5.c: Likewise.
> * gcc.target/aarch64/sve_spill_6.c: Likewise.
> * gcc.target/aarch64/sve_spill_7.c: Likewise.
So it's not the immediate goal here, but it'd be nice if we could extend
this to cover all the other targets that have no registers of certain
classes that can be allocated across a call.
Mostly this is exactly what I'd expect -- not to diminish the work,
there's a ton of stuff here -- but the overall flow of the work and its
organization are basically what I'd anticipate.
I know Richi was a bit scared off by the RTL nature, but the vast
majority of the code here isn't related to RTL -- it's operating on
properties that got extracted from the RTL. I could almost take this
change bits at the start of the pass and the end and run it on gimple
:-) Which brings up an interesting question, is that a worthwhile thing
to ponder? Is it gimple optimizers that are causing objects to be live
across calls or is it usually teh RTL bits?
As I was reading, one of the thoughts I had was that this reminded me a
bit of LCM. Which makes sense -- it's code motion, after all. I was
pleasantly surprised to see the comment explaining why the LCM
infrastructure was not used.
Do you have to do anything about rounding modes? Or are those not an
issue for aarch64? It's something I'm pretty sure we'd need to handle
for x86 as we have to ensure the rounding mode is stable if we move/copy
FP computations from one context to another.
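To make the x86 concern concrete, here's a hypothetical C-level example
(purely illustrative, not from the patch or its testsuite):

  #include <fenv.h>

  extern void use (double);

  double
  f (double a, double b)
  {
    double x = a / b;          /* computed under the current rounding mode */
    use (x);
    fesetround (FE_UPWARD);    /* rounding mode changes here */
    return a / b + x;          /* rematerializing x after the fesetround
                                  call would recompute it under the new
                                  mode and could give a different value */
  }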
> Index: gcc/early-remat.c
> ===================================================================
> --- /dev/null 2017-11-14 14:28:07.424493901 +0000
> +++ gcc/early-remat.c 2017-11-17 15:56:53.021759920 +0000
> @@ -0,0 +1,2610 @@
> + Candidate validation and value numbering
> + ========================================
> +
> + Next we simultaneously decide which candidates are valid and look
> + for candidates that are equivalent to each other, assigning numbers
> + to each unique candidate value. A candidate C is invalid if:
> +
> + (a) C uses an invalid candidate;
> +
> + (b) there is a cycle of candidate uses involving C; or
> +
> + (c) C takes a candidate register R as input and the reaching
> + definitions of R do not have the same value number.
> +
> + We assign a "representative" candidate C to each value number and from
> + here on replace references to other candidates with that value number
> + with references to C. It is then only possible to rematerialize a
> + register R at point P if (after this replacement) there is a single
> + reaching definition of R at P.
So how often are you seeing candidates that are equal to each other? I
know Vlad did some experiments in the register allocator and the number
of equivalences seen by VN was relatively small.
Maybe there's something fundamentally different about what you're doing
that allows you to see more equivalences.
Of course if you're using VNs as part of the validation process, then I
guess you get equivalences for free, so you might as well go ahead and
use them.
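For reference, here's a generic sketch (my own illustration, not the
patch's data structures) of why equivalences fall out essentially for
free once you're hashing candidates on their code plus the value numbers
of their inputs:

  #include <map>
  #include <vector>

  struct cand_key
  {
    int code;                     /* rtx code of the candidate pattern */
    std::vector<int> input_vns;   /* value numbers of the input regs */
    bool operator< (const cand_key &other) const
    {
      return (code != other.code
              ? code < other.code
              : input_vns < other.input_vns);
    }
  };

  static std::map<cand_key, int> vn_table;
  static int next_vn = 1;

  /* Return the value number for a candidate, reusing an existing number
     when an equivalent candidate has already been seen.  */
  static int
  value_number (int code, const std::vector<int> &input_vns)
  {
    cand_key key = { code, input_vns };
    std::map<cand_key, int>::iterator it = vn_table.find (key);
    if (it != vn_table.end ())
      return it->second;          /* equivalent candidate seen before */
    return vn_table[key] = next_vn++;
  }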
Local/Global Phases seem reasonable at a high level. About what you'd
expect.
> +/* Return true if REGNO is worth rematerializing. */
> +
> +bool
> +early_remat::interesting_regno_p (unsigned int regno)
> +{
> + /* Ignore unused registers. */
> + rtx reg = regno_reg_rtx[regno];
> + if (!reg || DF_REG_DEF_COUNT (regno) == 0)
> + return false;
> +
> + /* Make sure the register has a mode that we want to rematerialize. */
> + if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))
> + return false;
> +
> + /* Ignore values that might sometimes be used uninitialized. We could
> + instead add dummy candidates for the entry block definition, and so
> + handle uses that are definitely not uninitialized, but the combination
> + of the two should be rare in practice. */
> + if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
> + return false;
> +
> + return true;
> +}
So triggering on modes isn't really ideal, but I can understand why
you're doing it that way. I think it's OK for now, but there may be a
day when we want to do something different.
That day could come when we're dealing with an architecture that has a
painful-to-save register accessed in the same mode as other pseudos, or
when we want a remat pass aimed at reducing register pressure.
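FWIW, the kind of hook implementation I'd expect a target to provide is
roughly along these lines (an illustrative sketch only, not the patch's
aarch64 code; it assumes the target's vector-boolean modes have no
call-preserved hard registers):

  static void
  example_select_early_remat_modes (sbitmap modes)
  {
    /* Mark every mode in the class as worth rematerializing early.  */
    machine_mode mode;
    FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_BOOL)
      bitmap_set_bit (modes, mode);
  }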
> +
> +/* Record the set of register REGNO in instruction INSN as a
> + rematerialization candidate. CAN_COPY_P is true unless we already
> + know that rematerialization is impossible (in which case the candidate
> + only exists for the reaching definition calculation).
> +
> + The candidate's index is not fixed at this stage. */
> +
> +remat_candidate *
> +early_remat::add_candidate (rtx_insn *insn, unsigned int regno,
> + bool can_copy_p)
> +{
> + remat_candidate cand;
> + memset (&cand, 0, sizeof (cand));
> + cand.regno = regno;
> + cand.insn = insn;
> + cand.remat_rtx = PATTERN (insn);
> + cand.can_copy_p = can_copy_p;
FWIW, if the fields you're assigning are at the start or the end of the
object, DSE will trim the memset to avoid unnecessary write traffic.
No need to try and rearrange the object to enable that optimization.
Just making you aware in case such things are of interest to you.
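Something along these lines, with a purely hypothetical layout just to
illustrate the point:

  #include <string.h>

  struct cand_like
  {
    void *insn;                 /* explicitly initialized below */
    void *remat_rtx;            /* explicitly initialized below */
    char flags[64];             /* only ever cleared */
  };

  void
  init_cand (struct cand_like *cand, void *insn, void *remat_rtx)
  {
    /* DSE may shrink this memset to clear only 'flags', since the
       leading fields are overwritten immediately afterwards.  */
    memset (cand, 0, sizeof (*cand));
    cand->insn = insn;
    cand->remat_rtx = remat_rtx;
  }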
> + else if (HARD_REGISTER_NUM_P (use_regno))
> + {
> + /* Allocate a dummy pseudo register and temporarily install it.
> + Make the register number depend on the mode, which should
> + provide enough sharing for match_dup while also weeding
> + out cases in which operands with different modes are
> + explicitly tied. */
> + rtx *loc = DF_REF_REAL_LOC (ref);
> + unsigned int size = RTX_CODE_SIZE (REG);
> + rtx new_reg = (rtx) alloca (size);
> + memset (new_reg, 0, size);
> + PUT_CODE (new_reg, REG);
> + set_mode_and_regno (new_reg, GET_MODE (*loc),
> + LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));
> + validate_change (insn, loc, new_reg, 1);
> + }
> + }
Ewww. But I guess it works OK and you undo it immediately. There was
another similar instance later. They're not ideal, but they're well
localized, and I don't think they're worth making a big deal out of.
> +
> +/* Assuming that every path reaching a point P contains a copy of a
> + use U of REGNO, return true if another copy of U at P would have
> + access to the same value of REGNO. */
> +
> +bool
> +early_remat::stable_use_p (unsigned int regno)
> +{
> + /* Conservatively assume not for hard registers. */
> + if (HARD_REGISTER_NUM_P (regno))
> + return false;
> +
> + /* See if REGNO has a single definition and is never used uninitialized.
> + In this case the definition of REGNO dominates the common dominator
> + of the uses U, which in turn dominates P. */
> + if (DF_REG_DEF_COUNT (regno) == 1
> + && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
> + return true;
> +
> + return false;
> +}
What about loops here? The def may dominate the use but the value might
be changing each iteration of the loop. Does that affect correctness of
the algorithm you're using?
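Just to make the question concrete, the kind of situation I have in mind
is (hypothetical C, not from the testsuite):

  extern void g (int), h (int);

  void
  f (int *a, int n)
  {
    for (int i = 0; i < n; ++i)
      {
        int mask = a[i] & 0xff;   /* one static definition, dominating
                                     both uses, but loop-variant */
        g (mask);                 /* mask is live across this call */
        h (mask);
      }
  }

Here the single definition of 'mask' dominates its uses, yet it produces
a different value on every iteration.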
> +
> +/* Emit a copy from SRC to DEST before candidate CAND_INDEX's instruction. */
> +
> +void
> +early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)
> +{
> + remat_candidate *cand = &m_candidates[cand_index];
> + if (dump_file)
> + {
> + fprintf (dump_file, ";; Stabilizing insn ");
> + dump_insn_id (cand->insn);
> + fprintf (dump_file, " by copying source reg %d:%s to temporary reg %d\n",
> + REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));
> + }
> + emit_insn_before (gen_move_insn (dest, src), cand->insn);
Might want to clarify that SRC and DEST are regs. I was about to ask
about it, but then saw the mention in the dumping code.
Again, this is largely what I'd expect. Just a few questions above to
clarify a couple of minor concerns, but I think this is fine.
Jeff