This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: Add an "early rematerialisation" pass


Possible ping: I wasn't sure whether this one needed more work or whether
it was OK to go in.  I've attached the patch with the improved comment
above early_remat::emit_copy_before.

Thanks,
Richard

Richard Sandiford <richard.sandiford@linaro.org> writes:
> Jeff Law <law@redhat.com> writes:
>> On 11/17/2017 08:58 AM, Richard Sandiford wrote:
>>> This patch looks for pseudo registers that are live across a call
>>> and for which no call-preserved hard registers exist.  It then
>>> recomputes the pseudos as necessary to ensure that they are no
>>> longer live across a call.  The comment at the head of the file
>>> describes the approach.
>>> 
>>> A new target hook selects which modes should be treated in this way.
>>> By default none are, in which case the pass is skipped very early.
>>> 
>>> It might also be worth looking for cases like:
>>> 
>>>    C1: R1 := f (...)
>>>    ...
>>>    C2: R2 := f (...)
>>>    C3: R1 := C2
>>> 
>>> and giving the same value number to C1 and C3, effectively treating
>>> it like:
>>> 
>>>    C1: R1 := f (...)
>>>    ...
>>>    C2: R2 := f (...)
>>>    C3: R1 := f (...)
>>> 
>>> Another (much more expensive) enhancement would be to apply value
>>> numbering to all pseudo registers (not just rematerialisation
>>> candidates), so that we can handle things like:
>>> 
>>>   C1: R1 := f (...R2...)
>>>   ...
>>>   C2: R1 := f (...R3...)
>>> 
>>> where R2 and R3 hold the same value.  But the current pass seems
>>> to catch the vast majority of cases.
>>> 
>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>> and powerpc64le-linux-gnu.  OK to install?
>>> 
>>> Richard
>>> 
>>> 
>>> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
>>> 
>>> gcc/
>>> 	* Makefile.in (OBJS): Add early-remat.o.
>>> 	* target.def (select_early_remat_modes): New hook.
>>> 	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.
>>> 	* doc/tm.texi: Regenerate.
>>> 	* targhooks.h (default_select_early_remat_modes): Declare.
>>> 	* targhooks.c (default_select_early_remat_modes): New function.
>>> 	* timevar.def (TV_EARLY_REMAT): New timevar.
>>> 	* passes.def (pass_early_remat): New pass.
>>> 	* tree-pass.h (make_pass_early_remat): Declare.
>>> 	* early-remat.c: New file.
>>> 	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New
>>> 	function.
>>> 	(TARGET_SELECT_EARLY_REMAT_MODES): Define.
>>> 
>>> gcc/testsuite/
>>> 	* gcc.target/aarch64/sve_spill_1.c: Also test that no predicates
>>> 	are spilled.
>>> 	* gcc.target/aarch64/sve_spill_2.c: New test.
>>> 	* gcc.target/aarch64/sve_spill_3.c: Likewise.
>>> 	* gcc.target/aarch64/sve_spill_4.c: Likewise.
>>> 	* gcc.target/aarch64/sve_spill_5.c: Likewise.
>>> 	* gcc.target/aarch64/sve_spill_6.c: Likewise.
>>> 	* gcc.target/aarch64/sve_spill_7.c: Likewise.
>> So it's not the immediate goal here, but it'd be nice if we could extend
>> this to cover all the other targets that have no registers of certain
>> classes that can be allocated across a call.
>>
>> Mostly this is exactly what I'd expect -- not to diminish the work,
>> there's a ton of stuff here.  But the overall flow of work and
>> organization is basically what I'd expect.
>
> That's good.  I wasn't setting out to do anything particularly
> innovative here, so the fact that it's not surprising is a positive
> sign. :-)
>
>> I know Richi was a bit scared off by the RTL nature, but the vast
>> majority of the code here isn't related to RTL -- it's operating on
>> properties that got extracted from the RTL.  I could almost take this,
>> change bits at the start and end of the pass, and run it on gimple
>> :-)  Which brings up an interesting question: is that a worthwhile thing
>> to ponder?  Is it the gimple optimizers that are causing objects to be
>> live across calls, or is it usually the RTL bits?
>
> It tends to be a mixture of both.  E.g. if we have two loops with
> the same bounds and a call in between, the gimple optimisers will reuse
> the WHILE_ULT in the first loop header for the second loop.  But there
> are also cases introduced in rtl.  E.g. we don't (and IMO shouldn't)
> expose to gimple that some SVE operations always require a predicate.
> The expanders instead generate all-true predicates where necessary.
> These are then an obvious candidate for PRE.
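>
> As a minimal (hypothetical) C example of the first kind of situation:
>
>   extern void g (void);
>
>   void
>   f (int *a, int *b, int n)
>   {
>     for (int i = 0; i < n; ++i)
>       a[i] += 1;
>     g ();
>     for (int i = 0; i < n; ++i)
>       b[i] += 1;
>   }
>
> When this is vectorised for SVE, both loops can share the loop-control
> predicate calculation (the WHILE_ULT based on N), so the gimple
> optimisers may reuse the first loop's predicate for the second loop,
> making it live across the call to g ().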
>
>> As I was reading one of the thoughts I had was that this reminded me a
>> bit of LCM.  Which makes sense -- it's code motion after all.  I was
>> pleasantly surprised to see the comment explaining why the LCM
>> infrastructure was not used.
>
> Yeah, the availability problem is basically the same.  The main
> difference is choosing the final placement based on block frequencies.
>
>> Do you have to do anything about rounding modes?  Or are those not an
>> issue for aarch64?  It's something I'm pretty sure we'd need to handle
>> for x86 as we have to ensure the rounding mode is stable if we move/copy
>> FP computations from one context to another.
>
> Rounding modes don't affect loads, stores or moves, which are just
> bit copies.  I suppose there is the problem that we could rematerialise
> an FP operation in a region with a different rounding mode, but we just
> don't have an IL that tracks that properly (in general, not just here).
>
>>> Index: gcc/early-remat.c
>>> ===================================================================
>>> --- /dev/null	2017-11-14 14:28:07.424493901 +0000
>>> +++ gcc/early-remat.c	2017-11-17 15:56:53.021759920 +0000
>>> @@ -0,0 +1,2610 @@
>>> +   Candidate validation and value numbering
>>> +   ========================================
>>> +
>>> +   Next we simultaneously decide which candidates are valid and look
>>> +   for candidates that are equivalent to each other, assigning numbers
>>> +   to each unique candidate value.  A candidate C is invalid if:
>>> +
>>> +     (a) C uses an invalid candidate;
>>> +
>>> +     (b) there is a cycle of candidate uses involving C; or
>>> +
>>> +     (c) C takes a candidate register R as input and the reaching
>>> +         definitions of R do not have the same value number.
>>> +
>>> +   We assign a "representative" candidate C to each value number and from
>>> +   here on replace references to other candidates with that value number
>>> +   with references to C.  It is then only possible to rematerialize a
>>> +   register R at point P if (after this replacement) there is a single
>>> +   reaching definition of R at P.
>> So how often are you seeing candidates that are equal to each other?  I
>> know Vlad did some experiments in the register allocator and the number
>> of equivalences seen by VN was relatively small.
>>
>> Maybe it's the case that there's something fundamentally different
>> about what you're doing that allows you to see more equivalences.
>>
>> Of course if you're using VNs as part of the validation process, then I
>> guess you get equivalences for free, so you might as well go ahead and
>> use them.
>
> It happened in at least one important case (can't remember which, sorry).
> That was enough to get me to implement it without first measuring how
> common it was.  Like you say, it almost came for free as part of
> validation.
>
>> Local/Global Phases seem reasonable at a high level.  About what you'd
>> expect.
>
> Thanks.
>
>>> +/* Return true if REGNO is worth rematerializing.  */
>>> +
>>> +bool
>>> +early_remat::interesting_regno_p (unsigned int regno)
>>> +{
>>> +  /* Ignore unused registers.  */
>>> +  rtx reg = regno_reg_rtx[regno];
>>> +  if (!reg || DF_REG_DEF_COUNT (regno) == 0)
>>> +    return false;
>>> +
>>> +  /* Make sure the register has a mode that we want to rematerialize.  */
>>> +  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))
>>> +    return false;
>>> +
>>> +  /* Ignore values that might sometimes be used uninitialized.  We could
>>> +     instead add dummy candidates for the entry block definition, and so
>>> +     handle uses that are definitely not uninitialized, but the combination
>>> +     of the two should be rare in practice.  */
>>> +  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
>>> +    return false;
>>> +
>>> +  return true;
>>> +}
>> So triggering on modes isn't really ideal, but I can understand why
>> you're doing it that way.  I think it's OK for now, but there may be a
>> day when we want to do something different.
>>
>> That could one day be dealing with an architecture with a
>> painful-to-save register that is accessed in the same mode as other
>> pseudos, or dealing with a remat pass to reduce pressure.
>
> I agree it might be useful to handle cases that aren't determined
> solely by the mode.  It should be easy to plug that in when necessary.
>
> I'm still not sure that this is the right place or algorithm for doing
> remat to reduce pressure (but I'd be happy to be proven wrong :-)).
>
>>> +/* Record the set of register REGNO in instruction INSN as a
>>> +   rematerialization candidate.  CAN_COPY_P is true unless we already
>>> +   know that rematerialization is impossible (in which case the candidate
>>> +   only exists for the reaching definition calculation).
>>> +
>>> +   The candidate's index is not fixed at this stage.  */
>>> +
>>> +remat_candidate *
>>> +early_remat::add_candidate (rtx_insn *insn, unsigned int regno,
>>> +			    bool can_copy_p)
>>> +{
>>> +  remat_candidate cand;
>>> +  memset (&cand, 0, sizeof (cand));
>>> +  cand.regno = regno;
>>> +  cand.insn = insn;
>>> +  cand.remat_rtx = PATTERN (insn);
>>> +  cand.can_copy_p = can_copy_p;
>> FWIW, if the fields you're assigning are at the start or the end of the
>> object, DSE will trim the memset to avoid unnecessary write traffic.
>> No need to try and rearrange the object to enable that optimization.
>> Just making you aware in case such things are of interest to you.
>
> OK.  This particular code isn't performance-sensitive, but it's
> definitely something that's worth bearing in mind.
>
>>> +      else if (HARD_REGISTER_NUM_P (use_regno))
>>> +	{
>>> +	  /* Allocate a dummy pseudo register and temporarily install it.
>>> +	     Make the register number depend on the mode, which should
>>> +	     provide enough sharing for match_dup while also weeding
>>> +	     out cases in which operands with different modes are
>>> +	     explicitly tied.  */
>>> +	  rtx *loc = DF_REF_REAL_LOC (ref);
>>> +	  unsigned int size = RTX_CODE_SIZE (REG);
>>> +	  rtx new_reg = (rtx) alloca (size);
>>> +	  memset (new_reg, 0, size);
>>> +	  PUT_CODE (new_reg, REG);
>>> +	  set_mode_and_regno (new_reg, GET_MODE (*loc),
>>> +			      LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));
>>> +	  validate_change (insn, loc, new_reg, 1);
>>> +	}
>>> +    }
>> Ewww.  But I guess it works OK and you undo it immediately.  There was
>> another similar instance later.  They're not ideal, but they're well
>> localized and I don't think worth making a big deal out of.
>
> Yeah, it's not pretty, but I don't think we can do much better given
> the current "try it and see" approach to insn recognition.
>
>>> +
>>> +/* Assuming that every path reaching a point P contains a copy of a
>>> +   use U of REGNO, return true if another copy of U at P would have
>>> +   access to the same value of REGNO.  */
>>> +
>>> +bool
>>> +early_remat::stable_use_p (unsigned int regno)
>>> +{
>>> +  /* Conservatively assume not for hard registers.  */
>>> +  if (HARD_REGISTER_NUM_P (regno))
>>> +    return false;
>>> +
>>> +  /* See if REGNO has a single definition and is never used uninitialized.
>>> +     In this case the definition of REGNO dominates the common dominator
>>> +     of the uses U, which in turn dominates P.  */
>>> +  if (DF_REG_DEF_COUNT (regno) == 1
>>> +      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
>>> +    return true;
>>> +
>>> +  return false;
>>> +}
>> What about loops here?  The def may dominate the use but the value might
>> be changing each iteration of the loop.  Does that affect correctness of
>> the algorithm you're using?
>
> No, that's OK.  Since the pre-existing copies of the candidate
> instruction CI all use REGNO, and since every path to the rematerialisation
> point P includes a copy of CI, there can't be a path from the definition
> of REGNO to P that doesn't also include a copy of CI.  So the value of
> REGNO at P will always be the correct one.
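>
> As a sketch (hypothetical CFG, with CI a copy of the candidate
> instruction and P the rematerialisation point):
>
>    +--> D: R := ...        <-- single definition of R
>    |    |
>    |   CI: ... f (R) ...   <-- on every path from D to P
>    |    |
>    +----+
>         |
>         P
>
> Even if D executes once per loop iteration, any path from D to P goes
> through a copy of CI before reaching P, so a new copy of CI at P reads
> the same value of R as the reaching copy did.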
>
>>> +
>>> +/* Emit a copy from SRC to DEST before candidate CAND_INDEX's
>>> +   instruction.  */
>>> +
>>> +void
>>> +early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)
>>> +{
>>> +  remat_candidate *cand = &m_candidates[cand_index];
>>> +  if (dump_file)
>>> +    {
>>> +      fprintf (dump_file, ";; Stabilizing insn ");
>>> +      dump_insn_id (cand->insn);
>>> +      fprintf (dump_file,
>>> +	       " by copying source reg %d:%s to temporary reg %d\n",
>>> +	       REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));
>>> +    }
>>> +  emit_insn_before (gen_move_insn (dest, src), cand->insn);
>> Might want to clarify that SRC and DEST are regs.  I was about to ask
>> about it, but then saw the mention in the dumping code.
>
> OK.
>
>> Again, generally this is largely what I'd expect.  Just a few questions
>> above to clarify a couple minor concerns, but I think this is fine.
>
> Thanks,
> Richard

2018-01-09  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* Makefile.in (OBJS): Add early-remat.o.
	* target.def (select_early_remat_modes): New hook.
	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_select_early_remat_modes): Declare.
	* targhooks.c (default_select_early_remat_modes): New function.
	* timevar.def (TV_EARLY_REMAT): New timevar.
	* passes.def (pass_early_remat): New pass.
	* tree-pass.h (make_pass_early_remat): Declare.
	* early-remat.c: New file.
	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New
	function.
	(TARGET_SELECT_EARLY_REMAT_MODES): Define.

gcc/testsuite/
	* gcc.target/aarch64/sve_spill_1.c: Also test that no predicates
	are spilled.
	* gcc.target/aarch64/sve_spill_2.c: New test.
	* gcc.target/aarch64/sve_spill_3.c: Likewise.
	* gcc.target/aarch64/sve_spill_4.c: Likewise.
	* gcc.target/aarch64/sve_spill_5.c: Likewise.
	* gcc.target/aarch64/sve_spill_6.c: Likewise.
	* gcc.target/aarch64/sve_spill_7.c: Likewise.

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2018-01-03 11:12:59.095630403 +0000
+++ gcc/Makefile.in	2018-01-09 15:28:36.199732981 +0000
@@ -1277,6 +1277,7 @@ OBJS = \
 	dwarf2asm.o \
 	dwarf2cfi.o \
 	dwarf2out.o \
+	early-remat.o \
 	emit-rtl.o \
 	et-forest.o \
 	except.o \
Index: gcc/target.def
===================================================================
--- gcc/target.def	2018-01-09 15:09:57.243794886 +0000
+++ gcc/target.def	2018-01-09 15:28:36.204732772 +0000
@@ -5571,6 +5571,19 @@ reload from using some alternatives, lik
  default_preferred_output_reload_class)
 
 DEFHOOK
+(select_early_remat_modes,
+ "On some targets, certain modes cannot be held in registers around a\n\
+standard ABI call and are relatively expensive to spill to the stack.\n\
+The early rematerialization pass can help in such cases by aggressively\n\
+recomputing values after calls, so that they don't need to be spilled.\n\
+\n\
+This hook returns the set of such modes by setting the associated bits\n\
+in @var{modes}.  The default implementation selects no modes, which has\n\
+the effect of disabling the early rematerialization pass.",
+ void, (sbitmap modes),
+ default_select_early_remat_modes)
+
+DEFHOOK
 (class_likely_spilled_p,
  "A target hook which returns @code{true} if pseudos that have been assigned\n\
 to registers of class @var{rclass} would likely be spilled because\n\
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	2018-01-09 15:09:57.242794927 +0000
+++ gcc/doc/tm.texi.in	2018-01-09 15:28:36.203732814 +0000
@@ -2307,6 +2307,8 @@ Do not define this macro if you do not d
 
 @hook TARGET_SECONDARY_MEMORY_NEEDED_MODE
 
+@hook TARGET_SELECT_EARLY_REMAT_MODES
+
 @hook TARGET_CLASS_LIKELY_SPILLED_P
 
 @hook TARGET_CLASS_MAX_NREGS
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	2018-01-09 15:09:57.241794968 +0000
+++ gcc/doc/tm.texi	2018-01-09 15:28:36.202732856 +0000
@@ -2774,6 +2774,17 @@ details.
 With LRA, the default is to use @var{mode} unmodified.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_SELECT_EARLY_REMAT_MODES (sbitmap @var{modes})
+On some targets, certain modes cannot be held in registers around a
+standard ABI call and are relatively expensive to spill to the stack.
+The early rematerialization pass can help in such cases by aggressively
+recomputing values after calls, so that they don't need to be spilled.
+
+This hook returns the set of such modes by setting the associated bits
+in @var{modes}.  The default implementation selects no modes, which has
+the effect of disabling the early rematerialization pass.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_CLASS_LIKELY_SPILLED_P (reg_class_t @var{rclass})
 A target hook which returns @code{true} if pseudos that have been assigned
 to registers of class @var{rclass} would likely be spilled because
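
(For illustration only: a target implementation of this hook might look
like the sketch below, where expensive_spill_mode_p stands in for
whatever test the target actually wants.  The aarch64 hook added by this
patch selects the SVE modes.)

  static void
  example_select_early_remat_modes (sbitmap modes)
  {
    /* Select every mode that is expensive to spill.
       expensive_spill_mode_p is a hypothetical target-specific test.  */
    for (int i = 0; i < NUM_MACHINE_MODES; ++i)
      if (expensive_spill_mode_p ((machine_mode) i))
	bitmap_set_bit (modes, i);
  }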
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	2018-01-09 15:09:34.792756213 +0000
+++ gcc/targhooks.h	2018-01-09 15:28:36.205732731 +0000
@@ -285,6 +285,7 @@ extern unsigned int default_min_arithmet
 extern enum flt_eval_method
 default_excess_precision (enum excess_precision_type ATTRIBUTE_UNUSED);
 extern bool default_stack_clash_protection_final_dynamic_probe (rtx);
+extern void default_select_early_remat_modes (sbitmap);
 
 extern rtx
 default_inhibit_load_speculation (machine_mode, rtx, rtx, rtx, rtx, rtx, rtx);
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	2018-01-09 15:09:34.792756213 +0000
+++ gcc/targhooks.c	2018-01-09 15:28:36.205732731 +0000
@@ -83,6 +83,7 @@ Software Foundation; either version 3, o
 #include "real.h"
 #include "langhooks.h"
 #include "dojump.h"
+#include "sbitmap.h"
 
 bool
 default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
@@ -2382,4 +2383,11 @@ default_inhibit_load_speculation (machin
   return result;
 }
 
+/* The default implementation of TARGET_EARLY_REMAT_MODES.  */
+
+void
+default_select_early_remat_modes (sbitmap)
+{
+}
+
 #include "gt-targhooks.h"
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	2018-01-03 11:12:57.799681474 +0000
+++ gcc/timevar.def	2018-01-09 15:28:36.205732731 +0000
@@ -253,6 +253,7 @@ DEFTIMEVAR (TV_MODE_SWITCH           , "
 DEFTIMEVAR (TV_SMS		     , "sms modulo scheduling")
 DEFTIMEVAR (TV_LIVE_RANGE_SHRINKAGE  , "live range shrinkage")
 DEFTIMEVAR (TV_SCHED                 , "scheduling")
+DEFTIMEVAR (TV_EARLY_REMAT           , "early rematerialization")
 DEFTIMEVAR (TV_IRA		     , "integrated RA")
 DEFTIMEVAR (TV_LRA		     , "LRA non-specific")
 DEFTIMEVAR (TV_LRA_ELIMINATE	     , "LRA virtuals elimination")
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	2018-01-03 11:12:57.677686279 +0000
+++ gcc/passes.def	2018-01-09 15:28:36.203732814 +0000
@@ -460,6 +460,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_sms);
       NEXT_PASS (pass_live_range_shrinkage);
       NEXT_PASS (pass_sched);
+      NEXT_PASS (pass_early_remat);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_reload);
       NEXT_PASS (pass_postreload);
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	2018-01-03 11:12:56.406736330 +0000
+++ gcc/tree-pass.h	2018-01-09 15:28:36.205732731 +0000
@@ -576,6 +576,7 @@ extern rtl_opt_pass *make_pass_mode_swit
 extern rtl_opt_pass *make_pass_sms (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_sched (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_live_range_shrinkage (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_early_remat (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ira (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);
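
(For illustration only: early-remat.c registers the pass through the
standard rtl_opt_pass boilerplate.  The sketch below shows the usual
idiom and is not necessarily identical to the file itself.)

  const pass_data pass_data_early_remat =
  {
    RTL_PASS, /* type */
    "early_remat", /* name */
    OPTGROUP_NONE, /* optinfo_flags */
    TV_EARLY_REMAT, /* tv_id */
    0, /* properties_required */
    0, /* properties_provided */
    0, /* properties_destroyed */
    0, /* todo_flags_start */
    TODO_df_finish, /* todo_flags_finish */
  };

  class pass_early_remat : public rtl_opt_pass
  {
  public:
    pass_early_remat (gcc::context *ctxt)
      : rtl_opt_pass (pass_data_early_remat, ctxt)
    {}

    virtual unsigned int execute (function *f)
    {
      /* Ask the target which modes to handle and skip the pass
	 if it selects none.  */
      auto_sbitmap modes (NUM_MACHINE_MODES);
      bitmap_clear (modes);
      targetm.select_early_remat_modes (modes);
      if (!bitmap_empty_p (modes))
	early_remat (f, modes).run ();
      return 0;
    }
  };

  rtl_opt_pass *
  make_pass_early_remat (gcc::context *ctxt)
  {
    return new pass_early_remat (ctxt);
  }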
Index: gcc/early-remat.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/early-remat.c	2018-01-09 15:28:36.203732814 +0000
@@ -0,0 +1,2611 @@
+/* Early (pre-RA) rematerialization
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+/* FIXME: The next two are only needed for gen_move_insn.  */
+#include "tree.h"
+#include "expr.h"
+#include "target.h"
+#include "inchash.h"
+#include "rtlhash.h"
+#include "print-rtl.h"
+#include "rtl-iter.h"
+
+/* This pass runs before register allocation and implements an aggressive
+   form of rematerialization.  It looks for pseudo registers R of mode M
+   for which:
+
+     (a) there are no call-preserved registers of mode M; and
+     (b) spilling R to the stack is expensive.
+
+   The assumption is that it's better to recompute R after each call instead
+   of spilling it, even if this extends the live ranges of other registers.
+
+   The motivating example for which these conditions hold are AArch64 SVE
+   vectors and predicates.  Spilling them to the stack makes the frame
+   variable-sized, which we'd like to avoid if possible.  It's also very
+   rare for SVE values to be "naturally" live across a call: usually this
+   happens as a result of CSE or other code motion.
+
+   The pass is split into the following phases:
+
+   Collection phase
+   ================
+
+   First we go through all pseudo registers looking for any that meet
+   the conditions above.  For each such register R, we go through each
+   instruction that defines R to see whether any of them are suitable
+   rematerialization candidates.  If at least one is, we treat all the
+   instructions that define R as candidates, but record which ones are
+   not in fact suitable.  These unsuitable candidates exist only for the
+   sake of calculating reaching definitions (see below).
+
+   A "candidate" is a single instruction that we want to rematerialize
+   and a "candidate register" is a register that is set by at least one
+   candidate.
+
+   Candidate sorting
+   =================
+
+   Next we sort the candidates based on the cfg postorder, so that if
+   candidate C1 uses candidate C2, C1 has a lower index than C2.
+   This is useful when iterating through candidate bitmaps.
+
+   Reaching definition calculation
+   ===============================
+
+   We then compute standard reaching-definition sets for each candidate.
+   Each set specifies which candidates might provide the current definition
+   of a live candidate register.
+
+   From here on, a candidate C is "live" at a point P if the candidate
+   register defined by C is live at P and if C's definition reaches P.
+   An instruction I "uses" a candidate C if I takes the register defined by
+   C as input and if C is one of the reaching definitions of that register.
+
+   Candidate validation and value numbering
+   ========================================
+
+   Next we simultaneously decide which candidates are valid and look
+   for candidates that are equivalent to each other, assigning numbers
+   to each unique candidate value.  A candidate C is invalid if:
+
+     (a) C uses an invalid candidate;
+
+     (b) there is a cycle of candidate uses involving C; or
+
+     (c) C takes a candidate register R as input and the reaching
+         definitions of R do not have the same value number.
+
+   We assign a "representative" candidate C to each value number and from
+   here on replace references to other candidates with that value number
+   with references to C.  It is then only possible to rematerialize a
+   register R at point P if (after this replacement) there is a single
+   reaching definition of R at P.
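+
+   (Hypothetical example of rule (b): if C1 is "R1 := f (R2)" and the
+   only reaching definition of R2 is C2 "R2 := g (R1)", then C1 and C2
+   use each other, so the cycle rule invalidates both.)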
+
+   Local phase
+   ===========
+
+   During this phase we go through each block and look for cases in which:
+
+     (a) an instruction I comes between two call instructions CI1 and CI2;
+
+     (b) I uses a candidate register R;
+
+     (c) a candidate C provides the only reaching definition of R; and
+
+     (d) C does not come between CI1 and I.
+
+   We then emit a copy of C after CI1, as well as the transitive closure
+   TC of the candidates used by C.  The copies of TC might use the original
+   candidate registers or new temporary registers, depending on circumstances.
+
+   For example, if elsewhere we have:
+
+       C3: R3 <- f3 (...)
+	   ...
+       C2: R2 <- f2 (...)
+	   ...
+       C1: R1 <- f1 (R2, R3, ...)  // uses C2 and C3
+
+   then for a block containing:
+
+      CI1: call
+	   ...
+	I: use R1  // uses C1
+	   ...
+      CI2: call
+
+   we would emit:
+
+      CI1: call
+      C3': R3' <- f3 (...)
+      C2': R2' <- f2 (...)
+      C1': R1 <- f1 (R2', R3', ...)
+	   ...
+	I: use R1
+	   ...
+      CI2: call
+
+   where R2' and R3' might be fresh registers.  If instead we had:
+
+      CI1: call
+	   ...
+       I1: use R1  // uses C1
+	   ...
+       I2: use R3  // uses C3
+	   ...
+      CI2: call
+
+   we would keep the original R3:
+
+      CI1: call
+      C3': R3 <- f3 (...)
+      C2': R2' <- f2 (...)
+      C1': R1 <- f1 (R2', R3, ...)
+	   ...
+       I1: use R1  // uses C1
+	   ...
+       I2: use R3  // uses C3
+	   ...
+      CI2: call
+
+   We also record the last call in each block (if any) and compute:
+
+     rd_after_call:
+       The set of candidates that either (a) are defined outside the block
+       and are live after the last call or (b) are defined within the block
+       and reach the end of the last call.  (We don't track whether the
+       latter values are live or not.)
+
+     required_after_call:
+       The set of candidates that need to be rematerialized after the
+       last call in order to satisfy uses in the block itself.
+
+     required_in:
+       The set of candidates that are live on entry to the block and are
+       used without an intervening call.
+
+   In addition, we compute the initial values of the sets required by
+   the global phase below.
+
+   Global phase
+   ============
+
+   We next compute a maximal solution to the following availability
+   problem:
+
+     available_in:
+       The set of candidates that are live on entry to a block and can
+       be used at that point without rematerialization.
+
+     available_out:
+       The set of candidates that are live on exit from a block and can
+       be used at that point without rematerialization.
+
+     available_locally:
+       The subset of available_out that is due to code in the block itself.
+       It contains candidates that are defined or used in the block and
+       not invalidated by a later call.
+
+   We then go through each block B and look for an appropriate place
+   to insert copies of required_in - available_in.  Conceptually we
+   start by placing the copies at the head of B, but then move the
+   copy of a candidate C to predecessors if:
+
+     (a) that seems cheaper;
+
+     (b) there is more than one reaching definition of C's register at
+	 the head of B; or
+
+     (c) copying C would clobber a hard register that is live on entry to B.
+
+   Moving a copy of C to a predecessor block PB involves:
+
+     (1) adding C to PB's required_after_call, if PB contains a call; or
+
+     (2) adding C to PB's required_in otherwise.
+
+   C is then available on output from each PB and on input to B.
+
+   Once all this is done, we emit instructions for the final required_in
+   and required_after_call sets.  */
+
+namespace {
+
+/* An invalid candidate index, used to indicate that there is more than
+   one reaching definition.  */
+const unsigned int MULTIPLE_CANDIDATES = -1U;
+
+/* Pass-specific information about one basic block.  */
+struct remat_block_info {
+  /* The last call instruction in the block.  */
+  rtx_insn *last_call;
+
+  /* The set of candidates that are live on entry to the block.  NULL is
+     equivalent to an empty set.  */
+  bitmap rd_in;
+
+  /* The set of candidates that are live on exit from the block.  This might
+     reuse rd_in.  NULL is equivalent to an empty set.  */
+  bitmap rd_out;
+
+  /* The subset of RD_OUT that comes from local definitions.  NULL is
+     equivalent to an empty set.  */
+  bitmap rd_gen;
+
+  /* The set of candidates that the block invalidates (because it defines
+     the register to something else, or because the register's value is
+     no longer important).  NULL is equivalent to an empty set.  */
+  bitmap rd_kill;
+
+  /* The set of candidates that either (a) are defined outside the block
+     and are live after LAST_CALL or (b) are defined within the block
+     and reach the instruction after LAST_CALL.  (We don't track whether
+     the latter values are live or not.)
+
+     Only used if LAST_CALL is nonnull.  NULL is equivalent to an
+     empty set.  */
+  bitmap rd_after_call;
+
+  /* Candidates that are live and available without rematerialization
+     on entry to the block.  NULL is equivalent to an empty set.  */
+  bitmap available_in;
+
+  /* Candidates that become available without rematerialization within the
+     block, and remain so on exit.  NULL is equivalent to an empty set.  */
+  bitmap available_locally;
+
+  /* Candidates that are available without rematerialization on exit from
+     the block.  This might reuse available_in or available_locally.  */
+  bitmap available_out;
+
+  /* Candidates that need to be rematerialized either at the start of the
+     block or before entering the block.  */
+  bitmap required_in;
+
+  /* Candidates that need to be rematerialized after LAST_CALL.
+     Only used if LAST_CALL is nonnull.  */
+  bitmap required_after_call;
+
+  /* The number of candidates in the block.  */
+  unsigned int num_candidates;
+
+  /* The earliest candidate in the block (i.e. the one with the
+     highest index).  Only valid if NUM_CANDIDATES is nonzero.  */
+  unsigned int first_candidate;
+
+  /* The best (lowest) execution frequency for rematerializing REQUIRED_IN.
+     This is the execution frequency of the block if LOCAL_REMAT_CHEAPER_P,
+     otherwise it is the sum of the execution frequencies of whichever
+     predecessor blocks would do the rematerialization.  */
+  int remat_frequency;
+
+  /* True if the block ends with an abnormal call.  */
+  unsigned int abnormal_call_p : 1;
+
+  /* Used to record whether a graph traversal has visited this block.  */
+  unsigned int visited_p : 1;
+
+  /* True if we have calculated REMAT_FREQUENCY.  */
+  unsigned int remat_frequency_valid_p : 1;
+
+  /* True if it is cheaper to rematerialize candidates at the start of
+     the block, rather than moving them to predecessor blocks.  */
+  unsigned int local_remat_cheaper_p : 1;
+};
+
+/* Information about a group of candidates with the same value number.  */
+struct remat_equiv_class {
+  /* The candidates that have the same value number.  */
+  bitmap members;
+
+  /* The candidate that was first added to MEMBERS.  */
+  unsigned int earliest;
+
+  /* The candidate that represents the others.  This is always the one
+     with the highest index.  */
+  unsigned int representative;
+};
+
+/* Information about an instruction that we might want to rematerialize.  */
+struct remat_candidate {
+  /* The pseudo register that the instruction sets.  */
+  unsigned int regno;
+
+  /* A temporary register used when rematerializing uses of this candidate,
+     if REGNO doesn't have the right value or isn't worth using.  */
+  unsigned int copy_regno;
+
+  /* True if we intend to rematerialize this instruction by emitting
+     a move of a constant into REGNO, false if we intend to emit a
+     copy of the original instruction.  */
+  unsigned int constant_p : 1;
+
+  /* True if we still think it's possible to rematerialize INSN.  */
+  unsigned int can_copy_p : 1;
+
+  /* Used to record whether a graph traversal has visited this candidate.  */
+  unsigned int visited_p : 1;
+
+  /* True if we have verified that it's possible to rematerialize INSN.
+     Once this is true, both it and CAN_COPY_P remain true.  */
+  unsigned int validated_p : 1;
+
+  /* True if we have "stabilized" INSN, i.e. ensured that all non-candidate
+     registers read by INSN will have the same value when rematerializing INSN.
+     Only ever true if CAN_COPY_P.  */
+  unsigned int stabilized_p : 1;
+
+  /* Hash value used for value numbering.  */
+  hashval_t hash;
+
+  /* The instruction that sets REGNO.  */
+  rtx_insn *insn;
+
+  /* If CONSTANT_P, the value that should be moved into REGNO when
+     rematerializing, otherwise the pattern of the instruction that
+     should be used.  */
+  rtx remat_rtx;
+
+  /* The set of candidates that INSN takes as input.  NULL is equivalent
+     to the empty set.  All candidates in this set have a higher index
+     than the current candidate.  */
+  bitmap uses;
+
+  /* The set of hard registers that would be clobbered by rematerializing
+     the candidate, including (transitively) all those that would be
+     clobbered by rematerializing USES.  */
+  bitmap clobbers;
+
+  /* The equivalence class to which the candidate belongs, or null if none.  */
+  remat_equiv_class *equiv_class;
+};
+
+/* Hash functions used for value numbering.  */
+struct remat_candidate_hasher : nofree_ptr_hash <remat_candidate>
+{
+  typedef value_type compare_type;
+  static hashval_t hash (const remat_candidate *);
+  static bool equal (const remat_candidate *, const remat_candidate *);
+};
+
+/* Main class for this pass.  */
+class early_remat {
+public:
+  early_remat (function *, sbitmap);
+  ~early_remat ();
+
+  void run (void);
+
+private:
+  bitmap alloc_bitmap (void);
+  bitmap get_bitmap (bitmap *);
+  void init_temp_bitmap (bitmap *);
+  void copy_temp_bitmap (bitmap *, bitmap *);
+
+  void dump_insn_id (rtx_insn *);
+  void dump_candidate_bitmap (bitmap);
+  void dump_all_candidates (void);
+  void dump_edge_list (basic_block, bool);
+  void dump_block_info (basic_block);
+  void dump_all_blocks (void);
+
+  bool interesting_regno_p (unsigned int);
+  remat_candidate *add_candidate (rtx_insn *, unsigned int, bool);
+  bool maybe_add_candidate (rtx_insn *, unsigned int);
+  bool collect_candidates (void);
+  void init_block_info (void);
+  void sort_candidates (void);
+  void finalize_candidate_indices (void);
+  void record_equiv_candidates (unsigned int, unsigned int);
+  static bool rd_confluence_n (edge);
+  static bool rd_transfer (int);
+  void compute_rd (void);
+  unsigned int canon_candidate (unsigned int);
+  void canon_bitmap (bitmap *);
+  unsigned int resolve_reaching_def (bitmap);
+  bool check_candidate_uses (unsigned int);
+  void compute_clobbers (unsigned int);
+  void assign_value_number (unsigned int);
+  void decide_candidate_validity (void);
+  bool stable_use_p (unsigned int);
+  void emit_copy_before (unsigned int, rtx, rtx);
+  void stabilize_pattern (unsigned int);
+  void replace_dest_with_copy (unsigned int);
+  void stabilize_candidate_uses (unsigned int, bitmap, bitmap, bitmap,
+				 bitmap);
+  void emit_remat_insns (bitmap, bitmap, bitmap, rtx_insn *);
+  bool set_available_out (remat_block_info *);
+  void process_block (basic_block);
+  void local_phase (void);
+  static bool avail_confluence_n (edge);
+  static bool avail_transfer (int);
+  void compute_availability (void);
+  void unshare_available_sets (remat_block_info *);
+  bool can_move_across_edge_p (edge);
+  bool local_remat_cheaper_p (unsigned int);
+  bool need_to_move_candidate_p (unsigned int, unsigned int);
+  void compute_minimum_move_set (unsigned int, bitmap);
+  void move_to_predecessors (unsigned int, bitmap, bitmap);
+  void choose_rematerialization_points (void);
+  void emit_remat_insns_for_block (basic_block);
+  void global_phase (void);
+
+  /* The function that we're optimizing.  */
+  function *m_fn;
+
+  /* The modes that we want to rematerialize.  */
+  sbitmap m_selected_modes;
+
+  /* All rematerialization candidates, identified by their index into the
+     vector.  */
+  auto_vec<remat_candidate> m_candidates;
+
+  /* The set of candidate registers.  */
+  bitmap_head m_candidate_regnos;
+
+  /* Temporary sets.  */
+  bitmap_head m_tmp_bitmap;
+  bitmap m_available;
+  bitmap m_required;
+
+  /* Information about each basic block.  */
+  auto_vec<remat_block_info> m_block_info;
+
+  /* A mapping from register numbers to the set of associated candidates.
+     Only valid for registers in M_CANDIDATE_REGNOS.  */
+  auto_vec<bitmap> m_regno_to_candidates;
+
+  /* An obstack used for allocating bitmaps, so that we can free them all
+     in one go.  */
+  bitmap_obstack m_obstack;
+
+  /* A hash table of candidates used for value numbering.  If a candidate
+     in the table is in an equivalence class, the candidate is marked as
+     the earliest member of the class.  */
+  hash_table<remat_candidate_hasher> m_value_table;
+
+  /* Used temporarily by callback functions.  */
+  static early_remat *er;
+};
+
+}
+
+early_remat *early_remat::er;
+
+/* rtx_equal_p_cb callback that treats any two SCRATCHes as equal.
+   This allows us to compare two copies of a pattern, even though their
+   SCRATCHes are always distinct.  */
+
+static int
+scratch_equal (const_rtx *x, const_rtx *y, rtx *nx, rtx *ny)
+{
+  if (GET_CODE (*x) == SCRATCH && GET_CODE (*y) == SCRATCH)
+    {
+      *nx = const0_rtx;
+      *ny = const0_rtx;
+      return 1;
+    }
+  return 0;
+}
+
+/* Hash callback functions for remat_candidate.  */
+
+hashval_t
+remat_candidate_hasher::hash (const remat_candidate *cand)
+{
+  return cand->hash;
+}
+
+bool
+remat_candidate_hasher::equal (const remat_candidate *cand1,
+			       const remat_candidate *cand2)
+{
+  return (cand1->regno == cand2->regno
+	  && cand1->constant_p == cand2->constant_p
+	  && (cand1->constant_p
+	      ? rtx_equal_p (cand1->remat_rtx, cand2->remat_rtx)
+	      : rtx_equal_p_cb (cand1->remat_rtx, cand2->remat_rtx,
+				scratch_equal))
+	  && (!cand1->uses || bitmap_equal_p (cand1->uses, cand2->uses)));
+}
+
+/* Return true if B is null or empty.  */
+
+inline bool
+empty_p (bitmap b)
+{
+  return !b || bitmap_empty_p (b);
+}
+
+/* Allocate a new bitmap.  It will be automatically freed at the end of
+   the pass.  */
+
+inline bitmap
+early_remat::alloc_bitmap (void)
+{
+  return bitmap_alloc (&m_obstack);
+}
+
+/* Initialize *PTR to an empty bitmap if it is currently null.  */
+
+inline bitmap
+early_remat::get_bitmap (bitmap *ptr)
+{
+  if (!*ptr)
+    *ptr = alloc_bitmap ();
+  return *ptr;
+}
+
+/* *PTR is either null or empty.  If it is null, initialize it to an
+   empty bitmap.  */
+
+inline void
+early_remat::init_temp_bitmap (bitmap *ptr)
+{
+  if (!*ptr)
+    *ptr = alloc_bitmap ();
+  else
+    gcc_checking_assert (bitmap_empty_p (*ptr));
+}
+
+/* Move *SRC to *DEST and leave *SRC empty.  */
+
+inline void
+early_remat::copy_temp_bitmap (bitmap *dest, bitmap *src)
+{
+  if (!empty_p (*src))
+    {
+      *dest = *src;
+      *src = NULL;
+    }
+  else
+    *dest = NULL;
+}
+
+/* Print INSN's identifier to the dump file.  */
+
+void
+early_remat::dump_insn_id (rtx_insn *insn)
+{
+  fprintf (dump_file, "%d[bb:%d]", INSN_UID (insn),
+	   BLOCK_FOR_INSN (insn)->index);
+}
+
+/* Print candidate set CANDIDATES to the dump file, with a leading space.  */
+
+void
+early_remat::dump_candidate_bitmap (bitmap candidates)
+{
+  if (empty_p (candidates))
+    {
+      fprintf (dump_file, " none");
+      return;
+    }
+
+  unsigned int cand_index;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (candidates, 0, cand_index, bi)
+    fprintf (dump_file, " %d", cand_index);
+}
+
+/* Print information about all candidates to the dump file.  */
+
+void
+early_remat::dump_all_candidates (void)
+{
+  fprintf (dump_file, "\n;; Candidates:\n;;\n");
+  fprintf (dump_file, ";; %5s %5s %8s %s\n", "#", "reg", "mode", "insn");
+  fprintf (dump_file, ";; %5s %5s %8s %s\n", "=", "===", "====", "====");
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      fprintf (dump_file, ";; %5d %5d %8s ", cand_index, cand->regno,
+	       GET_MODE_NAME (GET_MODE (regno_reg_rtx[cand->regno])));
+      dump_insn_id (cand->insn);
+      if (!cand->can_copy_p)
+	fprintf (dump_file, "   -- can't copy");
+      fprintf (dump_file, "\n");
+    }
+
+  fprintf (dump_file, "\n;; Register-to-candidate mapping:\n;;\n");
+  unsigned int regno;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (&m_candidate_regnos, 0, regno, bi)
+    {
+      fprintf (dump_file, ";; %5d:", regno);
+      dump_candidate_bitmap (m_regno_to_candidates[regno]);
+      fprintf (dump_file, "\n");
+    }
+}
+
+/* Print the predecessors or successors of BB to the dump file, with a
+   leading space.  DO_SUCC is true to print successors and false to print
+   predecessors.  */
+
+void
+early_remat::dump_edge_list (basic_block bb, bool do_succ)
+{
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, do_succ ? bb->succs : bb->preds)
+    dump_edge_info (dump_file, e, 0, do_succ);
+}
+
+/* Print information about basic block BB to the dump file.  */
+
+void
+early_remat::dump_block_info (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+  fprintf (dump_file, ";;\n;; Block %d:", bb->index);
+  int width = 25;
+
+  fprintf (dump_file, "\n;;%*s:", width, "predecessors");
+  dump_edge_list (bb, false);
+
+  fprintf (dump_file, "\n;;%*s:", width, "successors");
+  dump_edge_list (bb, true);
+
+  fprintf (dump_file, "\n;;%*s: %d", width, "frequency",
+	   bb->count.to_frequency (m_fn));
+
+  if (info->last_call)
+    fprintf (dump_file, "\n;;%*s: %d", width, "last call",
+	     INSN_UID (info->last_call));
+
+  if (!empty_p (info->rd_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD in");
+      dump_candidate_bitmap (info->rd_in);
+    }
+  if (!empty_p (info->rd_kill))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD kill");
+      dump_candidate_bitmap (info->rd_kill);
+    }
+  if (!empty_p (info->rd_gen))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD gen");
+      dump_candidate_bitmap (info->rd_gen);
+    }
+  if (!empty_p (info->rd_after_call))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD after call");
+      dump_candidate_bitmap (info->rd_after_call);
+    }
+  if (!empty_p (info->rd_out))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "RD out");
+      if (info->rd_in == info->rd_out)
+	fprintf (dump_file, " RD in");
+      else
+	dump_candidate_bitmap (info->rd_out);
+    }
+  if (!empty_p (info->available_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available in");
+      dump_candidate_bitmap (info->available_in);
+    }
+  if (!empty_p (info->available_locally))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available locally");
+      dump_candidate_bitmap (info->available_locally);
+    }
+  if (!empty_p (info->available_out))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "available out");
+      if (info->available_in == info->available_out)
+	fprintf (dump_file, " available in");
+      else if (info->available_locally == info->available_out)
+	fprintf (dump_file, " available locally");
+      else
+	dump_candidate_bitmap (info->available_out);
+    }
+  if (!empty_p (info->required_in))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "required in");
+      dump_candidate_bitmap (info->required_in);
+    }
+  if (!empty_p (info->required_after_call))
+    {
+      fprintf (dump_file, "\n;;%*s:", width, "required after call");
+      dump_candidate_bitmap (info->required_after_call);
+    }
+  fprintf (dump_file, "\n");
+}
+
+/* Print information about all basic blocks to the dump file.  */
+
+void
+early_remat::dump_all_blocks (void)
+{
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    dump_block_info (bb);
+}
+
+/* Return true if REGNO is worth rematerializing.  */
+
+bool
+early_remat::interesting_regno_p (unsigned int regno)
+{
+  /* Ignore unused registers.  */
+  rtx reg = regno_reg_rtx[regno];
+  if (!reg || DF_REG_DEF_COUNT (regno) == 0)
+    return false;
+
+  /* Make sure the register has a mode that we want to rematerialize.  */
+  if (!bitmap_bit_p (m_selected_modes, GET_MODE (reg)))
+    return false;
+
+  /* Ignore values that might sometimes be used uninitialized.  We could
+     instead add dummy candidates for the entry block definition, and so
+     handle uses that are definitely not uninitialized, but the combination
+     of the two should be rare in practice.  */
+  if (bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
+    return false;
+
+  return true;
+}
+
+/* Record the set of register REGNO in instruction INSN as a
+   rematerialization candidate.  CAN_COPY_P is true unless we already
+   know that rematerialization is impossible (in which case the candidate
+   only exists for the reaching definition calculation).
+
+   The candidate's index is not fixed at this stage.  */
+
+remat_candidate *
+early_remat::add_candidate (rtx_insn *insn, unsigned int regno,
+			    bool can_copy_p)
+{
+  remat_candidate cand;
+  memset (&cand, 0, sizeof (cand));
+  cand.regno = regno;
+  cand.insn = insn;
+  cand.remat_rtx = PATTERN (insn);
+  cand.can_copy_p = can_copy_p;
+  m_candidates.safe_push (cand);
+
+  bitmap_set_bit (&m_candidate_regnos, regno);
+
+  return &m_candidates.last ();
+}
+
+/* Return true if we can rematerialize the set of register REGNO in
+   instruction INSN, and add it as a candidate if so.  When returning
+   false, print the reason to the dump file.  */
+
+bool
+early_remat::maybe_add_candidate (rtx_insn *insn, unsigned int regno)
+{
+#define FAILURE_FORMAT ";; Can't rematerialize set of reg %d in %d[bb:%d]: "
+#define FAILURE_ARGS regno, INSN_UID (insn), BLOCK_FOR_INSN (insn)->index
+
+  /* The definition must come from an ordinary instruction.  */
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  if (!NONJUMP_INSN_P (insn)
+      || (insn == BB_END (bb)
+	  && has_abnormal_or_eh_outgoing_edge_p (bb)))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn alters control flow\n",
+		 FAILURE_ARGS);
+      return false;
+    }
+
+  /* Prefer to rematerialize constants directly -- it's much easier.  */
+  machine_mode mode = GET_MODE (regno_reg_rtx[regno]);
+  if (rtx note = find_reg_equal_equiv_note (insn))
+    {
+      rtx val = XEXP (note, 0);
+      if (CONSTANT_P (val)
+	  && targetm.legitimate_constant_p (mode, val))
+	{
+	  remat_candidate *cand = add_candidate (insn, regno, true);
+	  cand->constant_p = true;
+	  cand->remat_rtx = val;
+	  return true;
+	}
+    }
+
+  /* See whether the target has reasons to prevent a copy.  */
+  if (targetm.cannot_copy_insn_p && targetm.cannot_copy_insn_p (insn))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "target forbids copying\n",
+		 FAILURE_ARGS);
+      return false;
+    }
+
+  /* We can't copy trapping instructions.  */
+  rtx pat = PATTERN (insn);
+  if (may_trap_p (pat))
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn might trap\n", FAILURE_ARGS);
+      return false;
+    }
+
+  /* We can't copy instructions that read memory, unless we know that
+     the contents never change.  */
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, pat, ALL)
+    if (MEM_P (*iter) && !MEM_READONLY_P (*iter))
+      {
+	if (dump_file)
+	  fprintf (dump_file, FAILURE_FORMAT "insn references non-constant"
+		   " memory\n", FAILURE_ARGS);
+	return false;
+      }
+
+  /* Check each defined register.  */
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, insn)
+    {
+      unsigned int def_regno = DF_REF_REGNO (ref);
+      if (def_regno == regno)
+	{
+	  /* Make sure the definition is write-only.  (Partial definitions,
+	     such as setting the low part and clobbering the high part,
+	     are otherwise OK.)  */
+	  if (DF_REF_FLAGS_IS_SET (ref, DF_REF_READ_WRITE))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "destination is"
+			 " read-modify-write\n", FAILURE_ARGS);
+	      return false;
+	    }
+	}
+      else
+	{
+	  /* The instruction can set additional registers, provided that
+	     they're call-clobbered hard registers.  This is useful for
+	     instructions that alter the condition codes.  */
+	  if (!HARD_REGISTER_NUM_P (def_regno))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " pseudo reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	  if (global_regs[def_regno])
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " global reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	  if (!TEST_HARD_REG_BIT (regs_invalidated_by_call, def_regno))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn also sets"
+			 " call-preserved reg %d\n", FAILURE_ARGS, def_regno);
+	      return false;
+	    }
+	}
+    }
+
+  /* If the instruction uses fixed hard registers, check that those
+     registers have the same value throughout the function.  If the
+     instruction uses non-fixed hard registers, check that we can
+     replace them with pseudos.  */
+  FOR_EACH_INSN_USE (ref, insn)
+    {
+      unsigned int use_regno = DF_REF_REGNO (ref);
+      if (HARD_REGISTER_NUM_P (use_regno) && fixed_regs[use_regno])
+	{
+	  if (rtx_unstable_p (DF_REF_REAL_REG (ref)))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, FAILURE_FORMAT "insn uses fixed hard reg"
+			 " %d\n", FAILURE_ARGS, use_regno);
+	      return false;
+	    }
+	}
+      else if (HARD_REGISTER_NUM_P (use_regno))
+	{
+	  /* Allocate a dummy pseudo register and temporarily install it.
+	     Make the register number depend on the mode, which should
+	     provide enough sharing for match_dup while also weeding
+	     out cases in which operands with different modes are
+	     explicitly tied.  */
+	  rtx *loc = DF_REF_REAL_LOC (ref);
+	  unsigned int size = RTX_CODE_SIZE (REG);
+	  rtx new_reg = (rtx) alloca (size);
+	  memset (new_reg, 0, size);
+	  PUT_CODE (new_reg, REG);
+	  set_mode_and_regno (new_reg, GET_MODE (*loc),
+			      LAST_VIRTUAL_REGISTER + 1 + GET_MODE (*loc));
+	  validate_change (insn, loc, new_reg, 1);
+	}
+    }
+  bool ok_p = verify_changes (0);
+  cancel_changes (0);
+  if (!ok_p)
+    {
+      if (dump_file)
+	fprintf (dump_file, FAILURE_FORMAT "insn does not allow hard"
+		 " register inputs to be replaced\n", FAILURE_ARGS);
+      return false;
+    }
+
+#undef FAILURE_ARGS
+#undef FAILURE_FORMAT
+
+  add_candidate (insn, regno, true);
+  return true;
+}
+
+/* Calculate the set of rematerialization candidates.  Return true if
+   we find at least one.  */
+
+bool
+early_remat::collect_candidates (void)
+{
+  unsigned int nregs = DF_REG_SIZE (df);
+  for (unsigned int regno = FIRST_PSEUDO_REGISTER; regno < nregs; ++regno)
+    if (interesting_regno_p (regno))
+      {
+	/* Create candidates for all suitable definitions.  */
+	bitmap_clear (&m_tmp_bitmap);
+	unsigned int bad = 0;
+	unsigned int id = 0;
+	for (df_ref ref = DF_REG_DEF_CHAIN (regno); ref;
+	     ref = DF_REF_NEXT_REG (ref))
+	  {
+	    rtx_insn *insn = DF_REF_INSN (ref);
+	    if (maybe_add_candidate (insn, regno))
+	      bitmap_set_bit (&m_tmp_bitmap, id);
+	    else
+	      bad += 1;
+	    id += 1;
+	  }
+
+	/* If we found at least one suitable definition, add dummy
+	   candidates for the rest, so that we can see which definitions
+	   are live where.  */
+	if (!bitmap_empty_p (&m_tmp_bitmap) && bad)
+	  {
+	    id = 0;
+	    for (df_ref ref = DF_REG_DEF_CHAIN (regno); ref;
+		 ref = DF_REF_NEXT_REG (ref))
+	      {
+		if (!bitmap_bit_p (&m_tmp_bitmap, id))
+		  add_candidate (DF_REF_INSN (ref), regno, false);
+		id += 1;
+	      }
+	  }
+      }
+
+  return !m_candidates.is_empty ();
+}
+
+/* Initialize the m_block_info array.  */
+
+void
+early_remat::init_block_info (void)
+{
+  unsigned int n_blocks = last_basic_block_for_fn (m_fn);
+  m_block_info.safe_grow_cleared (n_blocks);
+}
+
+/* Maps basic block indices to their position in the post order.  */
+static unsigned int *postorder_index;
+
+/* Order remat_candidates X_IN and Y_IN according to the cfg postorder.  */
+
+static int
+compare_candidates (const void *x_in, const void *y_in)
+{
+  const remat_candidate *x = (const remat_candidate *) x_in;
+  const remat_candidate *y = (const remat_candidate *) y_in;
+  basic_block x_bb = BLOCK_FOR_INSN (x->insn);
+  basic_block y_bb = BLOCK_FOR_INSN (y->insn);
+  if (x_bb != y_bb)
+    /* Make X and Y follow block postorder.  */
+    return postorder_index[x_bb->index] - postorder_index[y_bb->index];
+
+  /* Make X and Y follow a backward traversal of the containing block.  */
+  return DF_INSN_LUID (y->insn) - DF_INSN_LUID (x->insn);
+}
+
+/* Sort the collected rematerialization candidates so that they follow
+   cfg postorder.  */
+
+void
+early_remat::sort_candidates (void)
+{
+  /* Make sure the DF LUIDs are up-to-date for all the blocks we
+     care about.  */
+  bitmap_clear (&m_tmp_bitmap);
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      basic_block bb = BLOCK_FOR_INSN (cand->insn);
+      if (bitmap_set_bit (&m_tmp_bitmap, bb->index))
+	df_recompute_luids (bb);
+    }
+
+  /* Create a mapping from block numbers to their position in the
+     postorder.  */
+  unsigned int n_blocks = last_basic_block_for_fn (m_fn);
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  postorder_index = new unsigned int[n_blocks];
+  for (unsigned int i = 0; i < postorder_len; ++i)
+    postorder_index[postorder[i]] = i;
+
+  m_candidates.qsort (compare_candidates);
+
+  delete[] postorder_index;
+}
+
+/* Commit to the current candidate indices and initialize cross-references.  */
+
+void
+early_remat::finalize_candidate_indices (void)
+{
+  /* Create a bitmap for each candidate register.  */
+  m_regno_to_candidates.safe_grow (max_reg_num ());
+  unsigned int regno;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (&m_candidate_regnos, 0, regno, bi)
+    m_regno_to_candidates[regno] = alloc_bitmap ();
+
+  /* Go through each candidate and record its index.  */
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT (m_candidates, cand_index, cand)
+    {
+      basic_block bb = BLOCK_FOR_INSN (cand->insn);
+      remat_block_info *info = &m_block_info[bb->index];
+      info->num_candidates += 1;
+      info->first_candidate = cand_index;
+      bitmap_set_bit (m_regno_to_candidates[cand->regno], cand_index);
+    }
+}
+
+/* Record that candidates CAND1_INDEX and CAND2_INDEX are equivalent.
+   CAND1_INDEX might already have an equivalence class, but CAND2_INDEX
+   doesn't.  */
+
+void
+early_remat::record_equiv_candidates (unsigned int cand1_index,
+				      unsigned int cand2_index)
+{
+  if (dump_file)
+    fprintf (dump_file, ";; Candidate %d is equivalent to candidate %d\n",
+	     cand2_index, cand1_index);
+
+  remat_candidate *cand1 = &m_candidates[cand1_index];
+  remat_candidate *cand2 = &m_candidates[cand2_index];
+  gcc_checking_assert (!cand2->equiv_class);
+
+  remat_equiv_class *ec = cand1->equiv_class;
+  if (!ec)
+    {
+      ec = XOBNEW (&m_obstack.obstack, remat_equiv_class);
+      ec->members = alloc_bitmap ();
+      bitmap_set_bit (ec->members, cand1_index);
+      ec->earliest = cand1_index;
+      ec->representative = cand1_index;
+      cand1->equiv_class = ec;
+    }
+  cand2->equiv_class = ec;
+  bitmap_set_bit (ec->members, cand2_index);
+  if (cand2_index > ec->representative)
+    ec->representative = cand2_index;
+}
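+
+/* For example (indices purely illustrative): if candidates 7 and 3
+   compute the same value, record_equiv_candidates (7, 3) first creates
+   the class { 7 } with earliest == representative == 7, then adds 3 as
+   a member; the representative would only change if a later member had
+   a higher index than 7.  */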
+
+/* Propagate information from the rd_out set of E->src to the rd_in set
+   of E->dest, when computing global reaching definitions.  Return true
+   if something changed.  */
+
+bool
+early_remat::rd_confluence_n (edge e)
+{
+  remat_block_info *src = &er->m_block_info[e->src->index];
+  remat_block_info *dest = &er->m_block_info[e->dest->index];
+
+  /* available_in temporarily contains the set of candidates whose
+     registers are live on entry.  */
+  if (empty_p (src->rd_out) || empty_p (dest->available_in))
+    return false;
+
+  return bitmap_ior_and_into (er->get_bitmap (&dest->rd_in),
+			      src->rd_out, dest->available_in);
+}
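+
+/* In dataflow terms, the confluence operation above computes
+
+     rd_in (dest) |= rd_out (src) & live-in candidates of dest
+
+   i.e. a union over the incoming edges, restricted to candidates whose
+   registers are still live on entry to the destination block.  */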
+
+/* Propagate information from the rd_in set of block BB_INDEX to rd_out.
+   Return true if something changed.  */
+
+bool
+early_remat::rd_transfer (int bb_index)
+{
+  remat_block_info *info = &er->m_block_info[bb_index];
+
+  if (empty_p (info->rd_in))
+    return false;
+
+  if (empty_p (info->rd_kill))
+    {
+      gcc_checking_assert (empty_p (info->rd_gen));
+      if (!info->rd_out)
+	info->rd_out = info->rd_in;
+      else
+	gcc_checking_assert (info->rd_out == info->rd_in);
+      /* Assume that we only get called if something changed.  */
+      return true;
+    }
+
+  if (empty_p (info->rd_gen))
+    return bitmap_and_compl (er->get_bitmap (&info->rd_out),
+			     info->rd_in, info->rd_kill);
+
+  return bitmap_ior_and_compl (er->get_bitmap (&info->rd_out), info->rd_gen,
+			       info->rd_in, info->rd_kill);
+}
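+
+/* The final statement above is the usual reaching-definitions transfer
+   function rd_out = rd_gen | (rd_in & ~rd_kill); the earlier returns
+   simply avoid bitmap operations when rd_gen or rd_kill is empty.  */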
+
+/* Calculate the rd_* sets for each block.  */
+
+void
+early_remat::compute_rd (void)
+{
+  /* First calculate the rd_kill and rd_gen sets, using the fact
+     that m_candidates is sorted in order of decreasing LUID.  */
+  unsigned int cand_index;
+  remat_candidate *cand;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand_index, cand)
+    {
+      rtx_insn *insn = cand->insn;
+      basic_block bb = BLOCK_FOR_INSN (insn);
+      remat_block_info *info = &m_block_info[bb->index];
+      bitmap kill = m_regno_to_candidates[cand->regno];
+      bitmap_ior_into (get_bitmap (&info->rd_kill), kill);
+      if (bitmap_bit_p (DF_LR_OUT (bb), cand->regno))
+	{
+	  bitmap_and_compl_into (get_bitmap (&info->rd_gen), kill);
+	  bitmap_set_bit (info->rd_gen, cand_index);
+	}
+    }
+
+  /* Set up the initial values of the other sets.  */
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    {
+      remat_block_info *info = &m_block_info[bb->index];
+      unsigned int regno;
+      bitmap_iterator bi;
+      EXECUTE_IF_AND_IN_BITMAP (DF_LR_IN (bb), &m_candidate_regnos,
+				0, regno, bi)
+	{
+	  /* Use available_in to record the set of candidates whose
+	     registers are live on entry (i.e. a maximum bound on rd_in).  */
+	  bitmap_ior_into (get_bitmap (&info->available_in),
+			   m_regno_to_candidates[regno]);
+
+	  /* Add registers that die in a block to the block's kill set,
+	     so that we don't needlessly propagate them through the rest
+	     of the function.  */
+	  if (!bitmap_bit_p (DF_LR_OUT (bb), regno))
+	    bitmap_ior_into (get_bitmap (&info->rd_kill),
+			     m_regno_to_candidates[regno]);
+	}
+
+      /* Initialize each block's rd_out to the minimal set (the set of
+	 local definitions).  */
+      if (!empty_p (info->rd_gen))
+	bitmap_copy (get_bitmap (&info->rd_out), info->rd_gen);
+    }
+
+  /* Iterate until we reach a fixed point.  */
+  er = this;
+  bitmap_clear (&m_tmp_bitmap);
+  bitmap_set_range (&m_tmp_bitmap, 0, last_basic_block_for_fn (m_fn));
+  df_simple_dataflow (DF_FORWARD, NULL, NULL, rd_confluence_n, rd_transfer,
+		      &m_tmp_bitmap, df_get_postorder (DF_FORWARD),
+		      df_get_n_blocks (DF_FORWARD));
+  er = 0;
+
+  /* Work out which definitions reach which candidates, again taking
+     advantage of the candidate order.  */
+  bitmap_head reaching;
+  bitmap_initialize (&reaching, &m_obstack);
+  basic_block old_bb = NULL;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand_index, cand)
+    {
+      bb = BLOCK_FOR_INSN (cand->insn);
+      if (bb != old_bb)
+	{
+	  /* Get the definitions that reach the start of the new block.  */
+	  remat_block_info *info = &m_block_info[bb->index];
+	  if (info->rd_in)
+	    bitmap_copy (&reaching, info->rd_in);
+	  else
+	    bitmap_clear (&reaching);
+	  old_bb = bb;
+	}
+      else
+	{
+	  /* Process the definitions of the previous instruction.  */
+	  bitmap kill = m_regno_to_candidates[cand[1].regno];
+	  bitmap_and_compl_into (&reaching, kill);
+	  bitmap_set_bit (&reaching, cand_index + 1);
+	}
+
+      if (cand->can_copy_p && !cand->constant_p)
+	{
+	  df_ref ref;
+	  FOR_EACH_INSN_USE (ref, cand->insn)
+	    {
+	      unsigned int regno = DF_REF_REGNO (ref);
+	      if (bitmap_bit_p (&m_candidate_regnos, regno))
+		{
+		  bitmap defs = m_regno_to_candidates[regno];
+		  bitmap_and (&m_tmp_bitmap, defs, &reaching);
+		  bitmap_ior_into (get_bitmap (&cand->uses), &m_tmp_bitmap);
+		}
+	    }
+	}
+    }
+  bitmap_clear (&reaching);
+}
+
+/* If CAND_INDEX is in an equivalence class, return the representative
+   of the class, otherwise return CAND_INDEX.  */
+
+inline unsigned int
+early_remat::canon_candidate (unsigned int cand_index)
+{
+  if (remat_equiv_class *ec = m_candidates[cand_index].equiv_class)
+    return ec->representative;
+  return cand_index;
+}
+
+/* Make candidate set *PTR refer to candidates using the representative
+   of each equivalence class.  */
+
+void
+early_remat::canon_bitmap (bitmap *ptr)
+{
+  bitmap old_set = *ptr;
+  if (empty_p (old_set))
+    return;
+
+  bitmap new_set = NULL;
+  unsigned int old_index;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (old_set, 0, old_index, bi)
+    {
+      unsigned int new_index = canon_candidate (old_index);
+      if (old_index != new_index)
+	{
+	  if (!new_set)
+	    {
+	      new_set = alloc_bitmap ();
+	      bitmap_copy (new_set, old_set);
+	    }
+	  bitmap_clear_bit (new_set, old_index);
+	  bitmap_set_bit (new_set, new_index);
+	}
+    }
+  if (new_set)
+    {
+      BITMAP_FREE (*ptr);
+      *ptr = new_set;
+    }
+}
+
+/* If the candidates in REACHING all have the same value, return the
+   earliest instance of that value (i.e. the first one to be added
+   to m_value_table), otherwise return MULTIPLE_CANDIDATES.  */
+
+unsigned int
+early_remat::resolve_reaching_def (bitmap reaching)
+{
+  unsigned int cand_index = bitmap_first_set_bit (reaching);
+  if (remat_equiv_class *ec = m_candidates[cand_index].equiv_class)
+    {
+      if (!bitmap_intersect_compl_p (reaching, ec->members))
+	return ec->earliest;
+    }
+  else if (bitmap_single_bit_set_p (reaching))
+    return cand_index;
+
+  return MULTIPLE_CANDIDATES;
+}
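+
+/* For example, a REACHING set { 3, 7 } (indices illustrative) resolves
+   to the equivalence class's earliest member if 3 and 7 belong to the
+   same class, and to MULTIPLE_CANDIDATES otherwise.  */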
+
+/* Check whether all candidate registers used by candidate CAND_INDEX have
+   unique definitions.  Return true if so, replacing the candidate's uses
+   set with the appropriate form for value numbering.  */
+
+bool
+early_remat::check_candidate_uses (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+
+  /* Process the uses for each register in turn.  */
+  bitmap_head uses;
+  bitmap_initialize (&uses, &m_obstack);
+  bitmap_copy (&uses, cand->uses);
+  bitmap uses_ec = alloc_bitmap ();
+  while (!bitmap_empty_p (&uses))
+    {
+      /* Get the register for the lowest-indexed candidate remaining,
+	 and the reaching definitions of that register.  */
+      unsigned int first = bitmap_first_set_bit (&uses);
+      unsigned int regno = m_candidates[first].regno;
+      bitmap_and (&m_tmp_bitmap, &uses, m_regno_to_candidates[regno]);
+
+      /* See whether all reaching definitions have the same value and if
+	 so get the index of the first candidate we saw with that value.  */
+      unsigned int def = resolve_reaching_def (&m_tmp_bitmap);
+      if (def == MULTIPLE_CANDIDATES)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, ";; Removing candidate %d because there is"
+		     " more than one reaching definition of reg %d\n",
+		     cand_index, regno);
+	  cand->can_copy_p = false;
+	  break;
+	}
+      bitmap_set_bit (uses_ec, def);
+      bitmap_and_compl_into (&uses, &m_tmp_bitmap);
+    }
+  BITMAP_FREE (cand->uses);
+  cand->uses = uses_ec;
+  return cand->can_copy_p;
+}
+
+/* Calculate the set of hard registers that would be clobbered by
+   rematerializing candidate CAND_INDEX.  At this point the candidate's
+   set of uses is final.  */
+
+void
+early_remat::compute_clobbers (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (cand->uses)
+    {
+      unsigned int use_index;
+      bitmap_iterator bi;
+      EXECUTE_IF_SET_IN_BITMAP (cand->uses, 0, use_index, bi)
+	if (bitmap clobbers = m_candidates[use_index].clobbers)
+	  bitmap_ior_into (get_bitmap (&cand->clobbers), clobbers);
+    }
+
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, cand->insn)
+    {
+      unsigned int def_regno = DF_REF_REGNO (ref);
+      if (def_regno != cand->regno)
+	bitmap_set_bit (get_bitmap (&cand->clobbers), def_regno);
+    }
+}
+
+/* Mark candidate CAND_INDEX as validated and add it to the value table.  */
+
+void
+early_remat::assign_value_number (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  gcc_checking_assert (cand->can_copy_p && !cand->validated_p);
+
+  compute_clobbers (cand_index);
+  cand->validated_p = true;
+
+  inchash::hash h;
+  h.add_int (cand->regno);
+  inchash::add_rtx (cand->remat_rtx, h);
+  cand->hash = h.end ();
+
+  remat_candidate **slot
+    = m_value_table.find_slot_with_hash (cand, cand->hash, INSERT);
+  if (!*slot)
+    {
+      *slot = cand;
+      if (dump_file)
+	fprintf (dump_file, ";; Candidate %d is not equivalent to"
+		 " others seen so far\n", cand_index);
+    }
+  else
+    record_equiv_candidates (*slot - m_candidates.address (), cand_index);
+}
+
+/* Make a final decision about which candidates are valid and assign
+   value numbers to those that are.  */
+
+void
+early_remat::decide_candidate_validity (void)
+{
+  auto_vec<unsigned int, 16> stack;
+  unsigned int cand1_index;
+  remat_candidate *cand1;
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand1_index, cand1)
+    {
+      if (!cand1->can_copy_p || cand1->validated_p)
+	continue;
+
+      if (empty_p (cand1->uses))
+	{
+	  assign_value_number (cand1_index);
+	  continue;
+	}
+
+      stack.safe_push (cand1_index);
+      while (!stack.is_empty ())
+	{
+	  unsigned int cand2_index = stack.last ();
+	  unsigned int watermark = stack.length ();
+	  remat_candidate *cand2 = &m_candidates[cand2_index];
+	  if (!cand2->can_copy_p || cand2->validated_p)
+	    {
+	      stack.pop ();
+	      continue;
+	    }
+	  cand2->visited_p = true;
+	  unsigned int cand3_index;
+	  bitmap_iterator bi;
+	  EXECUTE_IF_SET_IN_BITMAP (cand2->uses, 0, cand3_index, bi)
+	    {
+	      remat_candidate *cand3 = &m_candidates[cand3_index];
+	      if (!cand3->can_copy_p)
+		{
+		  if (dump_file)
+		    fprintf (dump_file, ";; Removing candidate %d because"
+			     " it uses removed candidate %d\n", cand2_index,
+			     cand3_index);
+		  cand2->can_copy_p = false;
+		  break;
+		}
+	      if (!cand3->validated_p)
+		{
+		  if (empty_p (cand3->uses))
+		    assign_value_number (cand3_index);
+		  else if (cand3->visited_p)
+		    {
+		      if (dump_file)
+			fprintf (dump_file, ";; Removing candidate %d"
+				 " because its definition is cyclic\n",
+				 cand2_index);
+		      cand2->can_copy_p = false;
+		      break;
+		    }
+		  else
+		    stack.safe_push (cand3_index);
+		}
+	    }
+	  if (!cand2->can_copy_p)
+	    {
+	      cand2->visited_p = false;
+	      stack.truncate (watermark - 1);
+	    }
+	  else if (watermark == stack.length ())
+	    {
+	      cand2->visited_p = false;
+	      if (check_candidate_uses (cand2_index))
+		assign_value_number (cand2_index);
+	      stack.pop ();
+	    }
+	}
+    }
+
+  /* Ensure that the candidates always use the same candidate index
+     to refer to an equivalence class.  */
+  FOR_EACH_VEC_ELT_REVERSE (m_candidates, cand1_index, cand1)
+    if (cand1->can_copy_p && !empty_p (cand1->uses))
+      {
+	canon_bitmap (&cand1->uses);
+	gcc_checking_assert (bitmap_first_set_bit (cand1->uses) > cand1_index);
+      }
+}
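+
+/* The stack-based walk above is effectively a depth-first search over
+   the use graph: a candidate stays on the stack until everything it
+   uses has been validated, and reaching a candidate that is still
+   marked visited_p indicates a cycle.  */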
+
+/* Assuming that every path reaching a point P contains a copy of a
+   use U of REGNO, return true if another copy of U at P would have
+   access to the same value of REGNO.  */
+
+bool
+early_remat::stable_use_p (unsigned int regno)
+{
+  /* Conservatively assume not for hard registers.  */
+  if (HARD_REGISTER_NUM_P (regno))
+    return false;
+
+  /* See if REGNO has a single definition and is never used uninitialized.
+     In this case the definition of REGNO dominates the common dominator
+     of the uses U, which in turn dominates P.  */
+  if (DF_REG_DEF_COUNT (regno) == 1
+      && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
+    return true;
+
+  return false;
+}
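+
+/* For example, a pseudo register that is set exactly once and is never
+   read before that definition satisfies this test: its single
+   definition must dominate every point at which a copy of U could be
+   emitted.  */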
+
+/* Emit a copy from register SRC to register DEST before candidate
+   CAND_INDEX's instruction.  */
+
+void
+early_remat::emit_copy_before (unsigned int cand_index, rtx dest, rtx src)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (dump_file)
+    {
+      fprintf (dump_file, ";; Stabilizing insn ");
+      dump_insn_id (cand->insn);
+      fprintf (dump_file, " by copying source reg %d:%s to temporary reg %d\n",
+	       REGNO (src), GET_MODE_NAME (GET_MODE (src)), REGNO (dest));
+    }
+  emit_insn_before (gen_move_insn (dest, src), cand->insn);
+}
+
+/* Check whether any inputs to candidate CAND_INDEX's instruction could
+   change at rematerialization points and replace them with new pseudo
+   registers if so.  */
+
+void
+early_remat::stabilize_pattern (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  if (cand->stabilized_p)
+    return;
+
+  remat_equiv_class *ec = cand->equiv_class;
+  gcc_checking_assert (!ec || cand_index == ec->representative);
+
+  /* Record the replacements we've made so far, so that we don't
+     create two separate registers for match_dups.  Lookup is O(n),
+     but the n is very small.  */
+  typedef std::pair<rtx, rtx> reg_pair;
+  auto_vec<reg_pair, 16> reg_map;
+
+  rtx_insn *insn = cand->insn;
+  df_ref ref;
+  FOR_EACH_INSN_USE (ref, insn)
+    {
+      unsigned int old_regno = DF_REF_REGNO (ref);
+      rtx *loc = DF_REF_REAL_LOC (ref);
+
+      if (HARD_REGISTER_NUM_P (old_regno) && fixed_regs[old_regno])
+	{
+	  /* We checked when adding the candidate that the value is stable.  */
+	  gcc_checking_assert (!rtx_unstable_p (*loc));
+	  continue;
+	}
+
+      if (bitmap_bit_p (&m_candidate_regnos, old_regno))
+	/* We already know which candidate provides the definition
+	   and will handle it during copying.  */
+	continue;
+
+      if (stable_use_p (old_regno))
+	/* We can continue to use the existing register.  */
+	continue;
+
+      /* We need to replace the register.  See whether we've already
+	 created a suitable copy.  */
+      rtx old_reg = *loc;
+      rtx new_reg = NULL_RTX;
+      machine_mode mode = GET_MODE (old_reg);
+      reg_pair *p;
+      unsigned int pi;
+      FOR_EACH_VEC_ELT (reg_map, pi, p)
+	if (REGNO (p->first) == old_regno
+	    && GET_MODE (p->first) == mode)
+	  {
+	    new_reg = p->second;
+	    break;
+	  }
+
+      if (!new_reg)
+	{
+	  /* Create a new register and initialize it just before
+	     the instruction.  */
+	  new_reg = gen_reg_rtx (mode);
+	  reg_map.safe_push (reg_pair (old_reg, new_reg));
+	  if (ec)
+	    {
+	      unsigned int member_index;
+	      bitmap_iterator bi;
+	      EXECUTE_IF_SET_IN_BITMAP (ec->members, 0, member_index, bi)
+		emit_copy_before (member_index, new_reg, old_reg);
+	    }
+	  else
+	    emit_copy_before (cand_index, new_reg, old_reg);
+	}
+      validate_change (insn, loc, new_reg, true);
+    }
+  if (num_changes_pending ())
+    {
+      if (!apply_change_group ())
+	/* We checked when adding the candidates that the pattern allows
+	   hard registers to be replaced.  Nothing else should make the
+	   changes invalid.  */
+	gcc_unreachable ();
+
+      if (ec)
+	{
+	  /* Copy the new pattern to other members of the equivalence
+	     class.  */
+	  unsigned int member_index;
+	  bitmap_iterator bi;
+	  EXECUTE_IF_SET_IN_BITMAP (ec->members, 0, member_index, bi)
+	    if (cand_index != member_index)
+	      {
+		rtx_insn *other_insn = m_candidates[member_index].insn;
+		if (!validate_change (other_insn, &PATTERN (other_insn),
+				      copy_insn (PATTERN (insn)), 0))
+		  /* If the original instruction was valid then the copy
+		     should be too.  */
+		  gcc_unreachable ();
+	      }
+	}
+    }
+
+  cand->stabilized_p = true;
+}
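+
+/* A sketch of the effect, with hypothetical register numbers: if the
+   candidate instruction is
+
+     (set (reg:SI 100) (plus:SI (reg:SI 101) (reg:SI 0)))
+
+   and hard register 0 is neither fixed nor a candidate register, the
+   code above emits a copy of reg 0 into a fresh pseudo, say reg 102,
+   before the instruction and rewrites the pattern to use reg 102, so
+   that later copies of the pattern read a stable value.  */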
+
+/* Change CAND's instruction so that it sets CAND->copy_regno instead
+   of CAND->regno.  */
+
+void
+early_remat::replace_dest_with_copy (unsigned int cand_index)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  df_ref def;
+  FOR_EACH_INSN_DEF (def, cand->insn)
+    if (DF_REF_REGNO (def) == cand->regno)
+      validate_change (cand->insn, DF_REF_REAL_LOC (def),
+		       regno_reg_rtx[cand->copy_regno], 1);
+}
+
+/* Make sure that the candidates used by candidate CAND_INDEX are available.
+   There are two ways of doing this for an input candidate I:
+
+   (1) Using the existing register number and ensuring that I is available.
+
+   (2) Using a new register number (recorded in copy_regno) and adding I
+       to VIA_COPY.  This guarantees that making I available does not
+       conflict with other uses of the original register.
+
+   REQUIRED is the set of candidates that are required but not available
+   before the copy of CAND_INDEX.  AVAILABLE is the set of candidates
+   that are already available before the copy of CAND_INDEX.  REACHING
+   is the set of candidates that reach the copy of CAND_INDEX.  VIA_COPY
+   is the set of candidates that will use new register numbers recorded
+   in copy_regno instead of the original ones.  */
+
+void
+early_remat::stabilize_candidate_uses (unsigned int cand_index,
+				       bitmap required, bitmap available,
+				       bitmap reaching, bitmap via_copy)
+{
+  remat_candidate *cand = &m_candidates[cand_index];
+  df_ref use;
+  FOR_EACH_INSN_USE (use, cand->insn)
+    {
+      unsigned int regno = DF_REF_REGNO (use);
+      if (!bitmap_bit_p (&m_candidate_regnos, regno))
+	continue;
+
+      /* Work out which candidate provides the definition.  */
+      bitmap defs = m_regno_to_candidates[regno];
+      bitmap_and (&m_tmp_bitmap, cand->uses, defs);
+      gcc_checking_assert (bitmap_single_bit_set_p (&m_tmp_bitmap));
+      unsigned int def_index = bitmap_first_set_bit (&m_tmp_bitmap);
+
+      /* First see if DEF_INDEX is the only reaching definition of REGNO
+	 at this point too and if it is or will become available.  We can
+	 continue to use REGNO if so.  */
+      bitmap_and (&m_tmp_bitmap, reaching, defs);
+      if (bitmap_single_bit_set_p (&m_tmp_bitmap)
+	  && bitmap_first_set_bit (&m_tmp_bitmap) == def_index
+	  && ((available && bitmap_bit_p (available, def_index))
+	      || bitmap_bit_p (required, def_index)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, ";; Keeping reg %d for use of candidate %d"
+		     " in candidate %d\n", regno, def_index, cand_index);
+	  continue;
+	}
+
+      /* Otherwise fall back to using a copy.  There are other cases
+	 in which we *could* continue to use REGNO, but there's not
+	 really much point.  Using a separate register ought to make
+	 things easier for the register allocator.  */
+      remat_candidate *def_cand = &m_candidates[def_index];
+      rtx *loc = DF_REF_REAL_LOC (use);
+      rtx new_reg;
+      if (bitmap_set_bit (via_copy, def_index))
+	{
+	  new_reg = gen_reg_rtx (GET_MODE (*loc));
+	  def_cand->copy_regno = REGNO (new_reg);
+	  if (dump_file)
+	    fprintf (dump_file, ";; Creating reg %d for use of candidate %d"
+		     " in candidate %d\n", REGNO (new_reg), def_index,
+		     cand_index);
+	}
+      else
+	new_reg = regno_reg_rtx[def_cand->copy_regno];
+      validate_change (cand->insn, loc, new_reg, 1);
+    }
+}
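+
+/* For example, if the copy of candidate C uses register R, which is
+   defined by candidate D, but other definitions of R also reach the
+   rematerialization point, the code above gives D a fresh destination
+   register (copy_regno) and adds D to VIA_COPY, so that the copy of C
+   reads the rematerialized value rather than whatever R happens to
+   hold at that point.  */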
+
+/* Rematerialize the candidates in REQUIRED after instruction INSN,
+   given that the candidates in AVAILABLE are already available
+   and that REACHING is the set of candidates live after INSN.
+   REQUIRED and AVAILABLE are disjoint on entry.
+
+   Clear REQUIRED on exit.  */
+
+void
+early_remat::emit_remat_insns (bitmap required, bitmap available,
+			       bitmap reaching, rtx_insn *insn)
+{
+  /* Quick exit if there's nothing to do.  */
+  if (empty_p (required))
+    return;
+
+  /* Only reaching definitions should be available or required.  */
+  gcc_checking_assert (!bitmap_intersect_compl_p (required, reaching));
+  if (available)
+    gcc_checking_assert (!bitmap_intersect_compl_p (available, reaching));
+
+  bitmap_head via_copy;
+  bitmap_initialize (&via_copy, &m_obstack);
+  while (!bitmap_empty_p (required) || !bitmap_empty_p (&via_copy))
+    {
+      /* Pick the lowest-indexed candidate left.  */
+      unsigned int required_index = (bitmap_empty_p (required)
+				     ? ~0U : bitmap_first_set_bit (required));
+      unsigned int via_copy_index = (bitmap_empty_p (&via_copy)
+				     ? ~0U : bitmap_first_set_bit (&via_copy));
+      unsigned int cand_index = MIN (required_index, via_copy_index);
+      remat_candidate *cand = &m_candidates[cand_index];
+
+      bool via_copy_p = (cand_index == via_copy_index);
+      if (via_copy_p)
+	bitmap_clear_bit (&via_copy, cand_index);
+      else
+	{
+	  /* Remove all candidates for the same register from REQUIRED.  */
+	  bitmap_and (&m_tmp_bitmap, reaching,
+		      m_regno_to_candidates[cand->regno]);
+	  bitmap_and_compl_into (required, &m_tmp_bitmap);
+	  gcc_checking_assert (!bitmap_bit_p (required, cand_index));
+
+	  /* Only rematerialize if we have a single reaching definition
+	     of the register.  */
+	  if (!bitmap_single_bit_set_p (&m_tmp_bitmap))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, ";; Can't rematerialize reg %d after ",
+			   cand->regno);
+		  dump_insn_id (insn);
+		  fprintf (dump_file, ": more than one reaching definition\n");
+		}
+	      continue;
+	    }
+
+	  /* Skip candidates that can't be rematerialized.  */
+	  if (!cand->can_copy_p)
+	    continue;
+
+	  /* Check the function precondition.  */
+	  gcc_checking_assert (!available
+			       || !bitmap_bit_p (available, cand_index));
+	}
+
+      /* Invalid candidates should have been weeded out by now.  */
+      gcc_assert (cand->can_copy_p);
+
+      rtx new_pattern;
+      if (cand->constant_p)
+	{
+	  /* Emit a simple move.  */
+	  unsigned int regno = via_copy_p ? cand->copy_regno : cand->regno;
+	  new_pattern = gen_move_insn (regno_reg_rtx[regno], cand->remat_rtx);
+	}
+      else
+	{
+	  /* If this is the first time we've copied the instruction, make
+	     sure that any inputs will have the same value after INSN.  */
+	  stabilize_pattern (cand_index);
+
+	  /* Temporarily adjust the original instruction so that it has
+	     the right form for the copy.  */
+	  if (via_copy_p)
+	    replace_dest_with_copy (cand_index);
+	  if (cand->uses)
+	    stabilize_candidate_uses (cand_index, required, available,
+				      reaching, &via_copy);
+
+	  /* Get the new instruction pattern.  */
+	  new_pattern = copy_insn (cand->remat_rtx);
+
+	  /* Undo the temporary changes.  */
+	  cancel_changes (0);
+	}
+
+      /* Emit the new instruction.  */
+      rtx_insn *new_insn = emit_insn_after (new_pattern, insn);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, ";; Rematerializing candidate %d after ",
+		   cand_index);
+	  dump_insn_id (insn);
+	  if (via_copy_p)
+	    fprintf (dump_file, " with new destination reg %d",
+		     cand->copy_regno);
+	  fprintf (dump_file, ":\n\n");
+	  print_rtl_single (dump_file, new_insn);
+	  fprintf (dump_file, "\n");
+	}
+    }
+}
+
+/* Recompute INFO's available_out set, given that it's distinct from
+   available_in and available_locally.  Return true if something
+   changed.  */
+
+bool
+early_remat::set_available_out (remat_block_info *info)
+{
+  if (empty_p (info->available_locally))
+    return bitmap_and_compl (get_bitmap (&info->available_out),
+			     info->available_in, info->rd_kill);
+
+  if (empty_p (info->rd_kill))
+    return bitmap_ior (get_bitmap (&info->available_out),
+		       info->available_locally, info->available_in);
+
+  return bitmap_ior_and_compl (get_bitmap (&info->available_out),
+			       info->available_locally, info->available_in,
+			       info->rd_kill);
+}
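+
+/* All three cases above implement the availability equation
+
+     available_out = available_locally | (available_in & ~rd_kill)
+
+   specialized for the situations in which available_locally or
+   rd_kill is empty.  */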
+
+/* If BB has more than one call, decide which candidates should be
+   rematerialized after the non-final calls and emit the associated
+   instructions.  Record other information about the block in preparation
+   for the global phase.  */
+
+void
+early_remat::process_block (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+  rtx_insn *last_call = NULL;
+  rtx_insn *insn;
+
+  /* Ensure that we always use the same candidate index to refer to an
+     equivalence class.  */
+  if (info->rd_out == info->rd_in)
+    {
+      canon_bitmap (&info->rd_in);
+      info->rd_out = info->rd_in;
+    }
+  else
+    {
+      canon_bitmap (&info->rd_in);
+      canon_bitmap (&info->rd_out);
+    }
+  canon_bitmap (&info->rd_kill);
+  canon_bitmap (&info->rd_gen);
+
+  /* The set of candidates that should be rematerialized on entry to the
+     block or after the previous call (whichever is more recent).  */
+  init_temp_bitmap (&m_required);
+
+  /* The set of candidates that reach the current instruction (i.e. are
+     live just before the instruction).  */
+  bitmap_head reaching;
+  bitmap_initialize (&reaching, &m_obstack);
+  if (info->rd_in)
+    bitmap_copy (&reaching, info->rd_in);
+
+  /* The set of candidates that are live and available without
+     rematerialization just before the current instruction.  This only
+     accounts for earlier candidates in the block, or those that become
+     available by being added to M_REQUIRED.  */
+  init_temp_bitmap (&m_available);
+
+  /* Get the range of candidates in the block.  */
+  unsigned int next_candidate = info->first_candidate;
+  unsigned int num_candidates = info->num_candidates;
+  remat_candidate *next_def = (num_candidates > 0
+			       ? &m_candidates[next_candidate]
+			       : NULL);
+
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      /* First process uses, since this is a forward walk.  */
+      df_ref ref;
+      FOR_EACH_INSN_USE (ref, insn)
+	{
+	  unsigned int regno = DF_REF_REGNO (ref);
+	  if (bitmap_bit_p (&m_candidate_regnos, regno))
+	    {
+	      bitmap defs = m_regno_to_candidates[regno];
+	      bitmap_and (&m_tmp_bitmap, defs, &reaching);
+	      gcc_checking_assert (!bitmap_empty_p (&m_tmp_bitmap));
+	      if (!bitmap_intersect_p (defs, m_available))
+		{
+		  /* There has been no definition of the register since
+		     the last call or the start of the block (whichever
+		     is most recent).  Mark the reaching definitions
+		     as required at that point and thus available here.  */
+		  bitmap_ior_into (m_required, &m_tmp_bitmap);
+		  bitmap_ior_into (m_available, &m_tmp_bitmap);
+		}
+	    }
+	}
+
+      if (CALL_P (insn))
+	{
+	  if (!last_call)
+	    {
+	      /* The first call in the block.  Record which candidates are
+		 required at the start of the block.  */
+	      copy_temp_bitmap (&info->required_in, &m_required);
+	      init_temp_bitmap (&m_required);
+	    }
+	  else
+	    /* The fully-local case: candidates that need to be
+	       rematerialized after a previous call in the block.  */
+	    emit_remat_insns (m_required, NULL, info->rd_after_call,
+			      last_call);
+	  last_call = insn;
+	  bitmap_clear (m_available);
+	  gcc_checking_assert (empty_p (m_required));
+	}
+
+      /* Now process definitions.  */
+      if (next_def && insn == next_def->insn)
+	{
+	  unsigned int gen = canon_candidate (next_candidate);
+
+	  /* Other candidates with the same regno are not available
+	     any more.  */
+	  bitmap kill = m_regno_to_candidates[next_def->regno];
+	  bitmap_and_compl_into (m_available, kill);
+	  bitmap_and_compl_into (&reaching, kill);
+
+	  /* Record that this candidate is available without
+	     rematerialization.  */
+	  bitmap_set_bit (m_available, gen);
+	  bitmap_set_bit (&reaching, gen);
+
+	  /* Find the next candidate in the block.  */
+	  num_candidates -= 1;
+	  next_candidate -= 1;
+	  if (num_candidates > 0)
+	    next_def -= 1;
+	  else
+	    next_def = NULL;
+	}
+
+      if (insn == last_call)
+	bitmap_copy (get_bitmap (&info->rd_after_call), &reaching);
+    }
+  bitmap_clear (&reaching);
+  gcc_checking_assert (num_candidates == 0);
+
+  /* Remove values from the available set if they aren't live (and so
+     aren't interesting to successor blocks).  */
+  if (info->rd_out)
+    bitmap_and_into (m_available, info->rd_out);
+
+  /* Record the accumulated information.  */
+  info->last_call = last_call;
+  info->abnormal_call_p = (last_call
+			   && last_call == BB_END (bb)
+			   && has_abnormal_or_eh_outgoing_edge_p (bb));
+  copy_temp_bitmap (&info->available_locally, &m_available);
+  if (last_call)
+    copy_temp_bitmap (&info->required_after_call, &m_required);
+  else
+    copy_temp_bitmap (&info->required_in, &m_required);
+
+  /* Assume at first that all live-in values are available without
+     rematerialization (i.e. start with the most optimistic assumption).  */
+  if (info->available_in)
+    {
+      if (info->rd_in)
+	bitmap_copy (info->available_in, info->rd_in);
+      else
+	BITMAP_FREE (info->available_in);
+    }
+
+  if (last_call || empty_p (info->available_in))
+    /* The values available on exit from the block are exactly those that
+       are available locally.  This set doesn't change.  */
+    info->available_out = info->available_locally;
+  else if (empty_p (info->available_locally) && empty_p (info->rd_kill))
+    /* The values available on exit are the same as those available on entry.
+       Updating one updates the other.  */
+    info->available_out = info->available_in;
+  else
+    set_available_out (info);
+}
+
+/* Process each block as for process_block, visiting dominators before
+   the blocks they dominate.  */
+
+void
+early_remat::local_phase (void)
+{
+  if (dump_file)
+    fprintf (dump_file, "\n;; Local phase:\n");
+
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  for (unsigned int i = postorder_len; i-- > 0; )
+    if (postorder[i] >= NUM_FIXED_BLOCKS)
+      process_block (BASIC_BLOCK_FOR_FN (m_fn, postorder[i]));
+}
+
+/* Return true if available values survive across edge E.  */
+
+static inline bool
+available_across_edge_p (edge e)
+{
+  return (e->flags & EDGE_EH) == 0;
+}
+
+/* Propagate information from the available_out set of E->src to the
+   available_in set of E->dest, when computing global availability.
+   Return true if something changed.  */
+
+bool
+early_remat::avail_confluence_n (edge e)
+{
+  remat_block_info *src = &er->m_block_info[e->src->index];
+  remat_block_info *dest = &er->m_block_info[e->dest->index];
+
+  if (!available_across_edge_p (e))
+    return false;
+
+  if (empty_p (dest->available_in))
+    return false;
+
+  if (!src->available_out)
+    {
+      bitmap_clear (dest->available_in);
+      return true;
+    }
+
+  return bitmap_and_into (dest->available_in, src->available_out);
+}
+
+/* Propagate information from the available_in set of block BB_INDEX
+   to available_out.  Return true if something changed.  */
+
+bool
+early_remat::avail_transfer (int bb_index)
+{
+  remat_block_info *info = &er->m_block_info[bb_index];
+
+  if (info->available_out == info->available_locally)
+    return false;
+
+  if (info->available_out == info->available_in)
+    /* Assume that we are only called if the input changed.  */
+    return true;
+
+  return er->set_available_out (info);
+}
+
+/* Compute global availability for the function, starting with the local
+   information computed by local_phase.  */
+
+void
+early_remat::compute_availability (void)
+{
+  /* We use df_simple_dataflow instead of the lcm routines for three reasons:
+
+     (1) it avoids recomputing the traversal order;
+     (2) many of the sets are likely to be sparse, so we don't necessarily
+	 want to use sbitmaps; and
+     (3) it means we can avoid creating an explicit kill set for the call.  */
+  er = this;
+  bitmap_clear (&m_tmp_bitmap);
+  bitmap_set_range (&m_tmp_bitmap, 0, last_basic_block_for_fn (m_fn));
+  df_simple_dataflow (DF_FORWARD, NULL, NULL,
+		      avail_confluence_n, avail_transfer,
+		      &m_tmp_bitmap, df_get_postorder (DF_FORWARD),
+		      df_get_n_blocks (DF_FORWARD));
+  er = 0;
+
+  /* Restrict the required_in sets to values that aren't available.  */
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    {
+      remat_block_info *info = &m_block_info[bb->index];
+      if (info->required_in && info->available_in)
+	bitmap_and_compl_into (info->required_in, info->available_in);
+    }
+}
+
+/* Make sure that INFO's available_out and available_in sets are unique.  */
+
+inline void
+early_remat::unshare_available_sets (remat_block_info *info)
+{
+  if (info->available_in && info->available_in == info->available_out)
+    {
+      info->available_in = alloc_bitmap ();
+      bitmap_copy (info->available_in, info->available_out);
+    }
+}
+
+/* Return true if it is possible to move rematerializations from the
+   destination of E to the source of E.  */
+
+inline bool
+early_remat::can_move_across_edge_p (edge e)
+{
+  return (available_across_edge_p (e)
+	  && !m_block_info[e->src->index].abnormal_call_p);
+}
+
+/* Return true if it is cheaper to rematerialize values at the head of
+   block QUERY_BB_INDEX instead of rematerializing in its predecessors.  */
+
+bool
+early_remat::local_remat_cheaper_p (unsigned int query_bb_index)
+{
+  if (m_block_info[query_bb_index].remat_frequency_valid_p)
+    return m_block_info[query_bb_index].local_remat_cheaper_p;
+
+  /* Iteratively compute the cost of rematerializing values in the
+     predecessor blocks, then compare that with the cost of
+     rematerializing at the head of the block.
+
+     A cycle indicates that there is no call on that execution path,
+     so it isn't necessary to rematerialize on that path.  */
+  auto_vec<basic_block, 16> stack;
+  stack.quick_push (BASIC_BLOCK_FOR_FN (m_fn, query_bb_index));
+  while (!stack.is_empty ())
+    {
+      basic_block bb = stack.last ();
+      remat_block_info *info = &m_block_info[bb->index];
+      if (info->remat_frequency_valid_p)
+	{
+	  stack.pop ();
+	  continue;
+	}
+
+      info->visited_p = true;
+      int frequency = 0;
+      bool can_move_p = true;
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (!can_move_across_edge_p (e))
+	  {
+	    can_move_p = false;
+	    break;
+	  }
+	else if (m_block_info[e->src->index].last_call)
+	  /* We'll rematerialize after the call.  */
+	  frequency += e->src->count.to_frequency (m_fn);
+	else if (m_block_info[e->src->index].remat_frequency_valid_p)
+	  /* Add the cost of rematerializing at the head of E->src
+	     or in its predecessors (whichever is cheaper).  */
+	  frequency += m_block_info[e->src->index].remat_frequency;
+	else if (!m_block_info[e->src->index].visited_p)
+	  /* Queue E->src and then revisit this block again.  */
+	  stack.safe_push (e->src);
+
+      /* Come back to this block later if we need to process some of
+	 its predecessors.  */
+      if (stack.last () != bb)
+	continue;
+
+      /* If rematerializing in and before the block have equal cost, prefer
+	 rematerializing in the block.  This should shorten the live range.  */
+      int bb_frequency = bb->count.to_frequency (m_fn);
+      if (!can_move_p || frequency >= bb_frequency)
+	{
+	  info->local_remat_cheaper_p = true;
+	  info->remat_frequency = bb_frequency;
+	}
+      else
+	info->remat_frequency = frequency;
+      info->remat_frequency_valid_p = true;
+      info->visited_p = false;
+      if (dump_file)
+	{
+	  if (!can_move_p)
+	    fprintf (dump_file, ";; Need to rematerialize at the head of"
+		     " block %d; cannot move to predecessors.\n", bb->index);
+	  else
+	    {
+	      fprintf (dump_file, ";; Block %d has frequency %d,"
+		       " rematerializing in predecessors has frequency %d",
+		       bb->index, bb_frequency, frequency);
+	      if (info->local_remat_cheaper_p)
+		fprintf (dump_file, "; prefer to rematerialize"
+			 " in the block\n");
+	      else
+		fprintf (dump_file, "; prefer to rematerialize"
+			 " in predecessors\n");
+	    }
+	}
+      stack.pop ();
+    }
+  return m_block_info[query_bb_index].local_remat_cheaper_p;
+}
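+
+/* A rough example with invented frequencies: if a block that executes
+   100 times has two predecessors that execute 60 and 70 times, the
+   predecessors would rematerialize 130 times in total, so doing it
+   once per execution at the head of the block is cheaper.  With
+   predecessor frequencies of 20 and 30, moving the rematerializations
+   up wins instead.  */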
+
+/* Return true if we cannot rematerialize candidate CAND_INDEX at the head of
+   block BB_INDEX.  */
+
+bool
+early_remat::need_to_move_candidate_p (unsigned int bb_index,
+				       unsigned int cand_index)
+{
+  remat_block_info *info = &m_block_info[bb_index];
+  remat_candidate *cand = &m_candidates[cand_index];
+  basic_block bb = BASIC_BLOCK_FOR_FN (m_fn, bb_index);
+
+  /* If there is more than one reaching definition of REGNO,
+     we'll need to rematerialize in predecessors instead.  */
+  bitmap_and (&m_tmp_bitmap, info->rd_in, m_regno_to_candidates[cand->regno]);
+  if (!bitmap_single_bit_set_p (&m_tmp_bitmap))
+    {
+      if (dump_file)
+	fprintf (dump_file, ";; Cannot rematerialize %d at the"
+		 " head of block %d because there is more than one"
+		 " reaching definition of reg %d\n", cand_index,
+		 bb_index, cand->regno);
+      return true;
+    }
+
+  /* Likewise if rematerializing CAND here would clobber a live register.  */
+  if (cand->clobbers
+      && bitmap_intersect_p (cand->clobbers, DF_LR_IN (bb)))
+    {
+      if (dump_file)
+	fprintf (dump_file, ";; Cannot rematerialize %d at the"
+		 " head of block %d because it would clobber live"
+		 " registers\n", cand_index, bb_index);
+      return true;
+    }
+
+  return false;
+}
+
+/* Set REQUIRED to the minimum set of candidates that must be rematerialized
+   in predecessors of block BB_INDEX instead of at the start of the block.  */
+
+void
+early_remat::compute_minimum_move_set (unsigned int bb_index,
+				       bitmap required)
+{
+  remat_block_info *info = &m_block_info[bb_index];
+  bitmap_head remaining;
+
+  bitmap_clear (required);
+  bitmap_initialize (&remaining, &m_obstack);
+  bitmap_copy (&remaining, info->required_in);
+  while (!bitmap_empty_p (&remaining))
+    {
+      unsigned int cand_index = bitmap_first_set_bit (&remaining);
+      remat_candidate *cand = &m_candidates[cand_index];
+      bitmap_clear_bit (&remaining, cand_index);
+
+      /* Leave invalid candidates where they are.  */
+      if (!cand->can_copy_p)
+	continue;
+
+      /* Decide whether to move this candidate.  */
+      if (!bitmap_bit_p (required, cand_index))
+	{
+	  if (!need_to_move_candidate_p (bb_index, cand_index))
+	    continue;
+	  bitmap_set_bit (required, cand_index);
+	}
+
+      /* Also move values used by the candidate, so that we don't
+	 rematerialize them twice.  */
+      if (cand->uses)
+	{
+	  bitmap_ior_and_into (required, cand->uses, info->required_in);
+	  bitmap_ior_and_into (&remaining, cand->uses, info->required_in);
+	}
+    }
+}
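+
+/* For instance, if candidate C has to move to the predecessors and C
+   uses candidate D, where D is also in REQUIRED_IN, then D moves too:
+   otherwise D would be rematerialized twice, once for C in the
+   predecessors and again at the head of the block.  */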
+
+/* Make the predecessors of BB_INDEX rematerialize the candidates in
+   REQUIRED.  Add any blocks whose required_in set changes to
+   PENDING_BLOCKS.  */
+
+void
+early_remat::move_to_predecessors (unsigned int bb_index, bitmap required,
+				   bitmap pending_blocks)
+{
+  if (empty_p (required))
+    return;
+  remat_block_info *dest_info = &m_block_info[bb_index];
+  basic_block bb = BASIC_BLOCK_FOR_FN (m_fn, bb_index);
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      remat_block_info *src_info = &m_block_info[e->src->index];
+
+      /* Restrict the set we add to the reaching definitions.  */
+      bitmap_and (&m_tmp_bitmap, required, src_info->rd_out);
+      if (bitmap_empty_p (&m_tmp_bitmap))
+	continue;
+
+      if (!can_move_across_edge_p (e))
+	{
+	  /* We can't move the rematerialization and we can't do it at
+	     the start of the block either.  In this case we just give up
+	     and rely on spilling to make the values available across E.  */
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, ";; Cannot rematerialize the following"
+		       " candidates in block %d:", e->src->index);
+	      dump_candidate_bitmap (required);
+	      fprintf (dump_file, "\n");
+	    }
+	  continue;
+	}
+
+      /* Remove candidates that are already available.  */
+      if (src_info->available_out)
+	{
+	  bitmap_and_compl_into (&m_tmp_bitmap, src_info->available_out);
+	  if (bitmap_empty_p (&m_tmp_bitmap))
+	    continue;
+	}
+
+      /* Add the remaining candidates to the appropriate required set.  */
+      if (dump_file)
+	{
+	  fprintf (dump_file, ";; Moving this set from block %d"
+		   " to block %d:", bb_index, e->src->index);
+	  dump_candidate_bitmap (&m_tmp_bitmap);
+	  fprintf (dump_file, "\n");
+	}
+      /* If the source block contains a call, we want to rematerialize
+	 after the call, otherwise we want to rematerialize at the start
+	 of the block.  */
+      bitmap src_required = get_bitmap (src_info->last_call
+					? &src_info->required_after_call
+					: &src_info->required_in);
+      if (bitmap_ior_into (src_required, &m_tmp_bitmap))
+	{
+	  if (!src_info->last_call)
+	    bitmap_set_bit (pending_blocks, e->src->index);
+	  unshare_available_sets (src_info);
+	  bitmap_ior_into (get_bitmap (&src_info->available_out),
+			   &m_tmp_bitmap);
+	}
+    }
+
+  /* The candidates are now available on entry to the block.  */
+  bitmap_and_compl_into (dest_info->required_in, required);
+  unshare_available_sets (dest_info);
+  bitmap_ior_into (get_bitmap (&dest_info->available_in), required);
+}
+
+/* Go through the candidates that are currently marked as being
+   rematerialized at the beginning of a block.  Decide in each case
+   whether that's valid and profitable; if it isn't, move the
+   rematerialization to predecessor blocks instead.  */
+
+void
+early_remat::choose_rematerialization_points (void)
+{
+  bitmap_head required;
+  bitmap_head pending_blocks;
+
+  int *postorder = df_get_postorder (DF_BACKWARD);
+  unsigned int postorder_len = df_get_n_blocks (DF_BACKWARD);
+  bitmap_initialize (&required, &m_obstack);
+  bitmap_initialize (&pending_blocks, &m_obstack);
+  do
+    /* Process the blocks in postorder, to reduce the number of iterations
+       of the outer loop.  */
+    for (unsigned int i = 0; i < postorder_len; ++i)
+      {
+	unsigned int bb_index = postorder[i];
+	remat_block_info *info = &m_block_info[bb_index];
+	bitmap_clear_bit (&pending_blocks, bb_index);
+
+	if (empty_p (info->required_in))
+	  continue;
+
+	if (info->available_in)
+	  gcc_checking_assert (!bitmap_intersect_p (info->required_in,
+						    info->available_in));
+
+	if (local_remat_cheaper_p (bb_index))
+	  {
+	    /* We'd prefer to rematerialize at the head of the block.
+	       Only move candidates if we need to.  */
+	    compute_minimum_move_set (bb_index, &required);
+	    move_to_predecessors (bb_index, &required, &pending_blocks);
+	  }
+	else
+	  move_to_predecessors (bb_index, info->required_in,
+				&pending_blocks);
+      }
+  while (!bitmap_empty_p (&pending_blocks));
+  bitmap_clear (&required);
+}
+
+/* Emit all rematerialization instructions queued for BB.  */
+
+void
+early_remat::emit_remat_insns_for_block (basic_block bb)
+{
+  remat_block_info *info = &m_block_info[bb->index];
+
+  if (info->last_call && !empty_p (info->required_after_call))
+    emit_remat_insns (info->required_after_call, NULL,
+		      info->rd_after_call, info->last_call);
+
+  if (!empty_p (info->required_in))
+    {
+      rtx_insn *insn = BB_HEAD (bb);
+      while (insn != BB_END (bb)
+	     && !INSN_P (NEXT_INSN (insn)))
+	insn = NEXT_INSN (insn);
+      emit_remat_insns (info->required_in, info->available_in,
+			info->rd_in, insn);
+    }
+}
+
+/* Decide which candidates in each block's REQUIRED_IN set need to be
+   rematerialized and decide where the rematerialization instructions
+   should go.  Emit queued rematerialization instructions at the start
+   of blocks and after the last calls in blocks.  */
+
+void
+early_remat::global_phase (void)
+{
+  compute_availability ();
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n;; Blocks after computing global"
+	       " availability:\n");
+      dump_all_blocks ();
+    }
+
+  choose_rematerialization_points ();
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n;; Blocks after choosing rematerialization"
+	       " points:\n");
+      dump_all_blocks ();
+    }
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, m_fn)
+    emit_remat_insns_for_block (bb);
+}
+
+/* Main function for the pass.  */
+
+void
+early_remat::run (void)
+{
+  df_analyze ();
+
+  if (!collect_candidates ())
+    return;
+
+  init_block_info ();
+  sort_candidates ();
+  finalize_candidate_indices ();
+  if (dump_file)
+    dump_all_candidates ();
+
+  compute_rd ();
+  decide_candidate_validity ();
+  local_phase ();
+  global_phase ();
+}
+
+early_remat::early_remat (function *fn, sbitmap selected_modes)
+  : m_fn (fn),
+    m_selected_modes (selected_modes),
+    m_available (0),
+    m_required (0),
+    m_value_table (63)
+{
+  bitmap_obstack_initialize (&m_obstack);
+  bitmap_initialize (&m_candidate_regnos, &m_obstack);
+  bitmap_initialize (&m_tmp_bitmap, &m_obstack);
+}
+
+early_remat::~early_remat ()
+{
+  bitmap_obstack_release (&m_obstack);
+}
+
+namespace {
+
+const pass_data pass_data_early_remat =
+{
+  RTL_PASS, /* type */
+  "early_remat", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_EARLY_REMAT, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_early_remat : public rtl_opt_pass
+{
+public:
+  pass_early_remat (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_early_remat, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return optimize > 1 && NUM_POLY_INT_COEFFS > 1;
+  }
+
+  virtual unsigned int execute (function *f)
+  {
+    auto_sbitmap selected_modes (NUM_MACHINE_MODES);
+    bitmap_clear (selected_modes);
+    targetm.select_early_remat_modes (selected_modes);
+    if (!bitmap_empty_p (selected_modes))
+      early_remat (f, selected_modes).run ();
+    return 0;
+  }
+}; // class pass_early_remat
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_early_remat (gcc::context *ctxt)
+{
+  return new pass_early_remat (ctxt);
+}
Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	2018-01-09 15:09:57.712775534 +0000
+++ gcc/config/aarch64/aarch64.c	2018-01-09 15:28:36.201732898 +0000
@@ -17439,6 +17439,22 @@ aarch64_can_change_mode_class (machine_m
   return true;
 }
 
+/* Implement TARGET_SELECT_EARLY_REMAT_MODES.  */
+
+static void
+aarch64_select_early_remat_modes (sbitmap modes)
+{
+  /* SVE values are not normally live across a call, so it should be
+     worth doing early rematerialization even in VL-specific mode.  */
+  for (int i = 0; i < NUM_MACHINE_MODES; ++i)
+    {
+      machine_mode mode = (machine_mode) i;
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      if (vec_flags & VEC_ANY_SVE)
+	bitmap_set_bit (modes, i);
+    }
+}
+
 /* Target-specific selftests.  */
 
 #if CHECKING_P
@@ -17909,6 +17925,9 @@ #define TARGET_COMPUTE_PRESSURE_CLASSES
 #undef TARGET_CAN_CHANGE_MODE_CLASS
 #define TARGET_CAN_CHANGE_MODE_CLASS aarch64_can_change_mode_class
 
+#undef TARGET_SELECT_EARLY_REMAT_MODES
+#define TARGET_SELECT_EARLY_REMAT_MODES aarch64_select_early_remat_modes
+
 #if CHECKING_P
 #undef TARGET_RUN_TARGET_SELFTESTS
 #define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_spill_1.c	2018-01-09 15:09:55.875851493 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_1.c	2018-01-09 15:28:36.205732731 +0000
@@ -26,3 +26,5 @@ TEST_LOOP (uint64_t, 511);
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #511\n} 2 } } */
 /* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
 /* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_2.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_2.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE)				\
+  void						\
+  multi_loop_##TYPE (TYPE *x, TYPE val)		\
+  {						\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+    for (int i = 0; i < 7; ++i)			\
+      x[i] += val;				\
+    consumer (x);				\
+  }
+
+/* One iteration is enough.  */
+TEST_LOOP (uint8_t);
+TEST_LOOP (uint16_t);
+/* Two iterations are enough.  Complete unrolling makes sense
+   even at -O2.  */
+TEST_LOOP (uint32_t);
+/* Four iterations are needed; ought to stay a loop.  */
+TEST_LOOP (uint64_t);
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.b} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.h} 3 } } */
+/* { dg-final { scan-assembler {\twhilelo\tp[0-9]\.s} } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.d} 6 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_3.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_3.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,48 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE) \
+  void								\
+  multi_loop_##TYPE (TYPE *x, TYPE val1, TYPE val2, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	x[i * 2] += val1;					\
+	x[i * 2 + 1] += val2;					\
+      }								\
+    consumer (x);						\
+  }
+
+/* One iteration is enough.  */
+TEST_LOOP (uint8_t);
+TEST_LOOP (uint16_t);
+/* Two iterations are enough.  Complete unrolling makes sense
+   even at -O2.  */
+TEST_LOOP (uint32_t);
+/* Four iterations are needed; ought to stay a loop.  */
+TEST_LOOP (uint64_t);
+
+/* { dg-final { scan-assembler {\tld1b\tz[0-9]\.b} } } */
+/* { dg-final { scan-assembler {\tld1h\tz[0-9]\.h} } } */
+/* { dg-final { scan-assembler {\tld1w\tz[0-9]\.s} } } */
+/* { dg-final { scan-assembler {\tld1d\tz[0-9]\.d} } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_4.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_4.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x)			\
+  {						\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL;				\
+    consumer (x);				\
+  }
+
+TEST_LOOP (uint16_t, 0x1234);
+TEST_LOOP (uint32_t, 0x12345);
+TEST_LOOP (uint64_t, 0x123456);
+
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.h,} 3 } } */
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.s,} 3 } } */
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.d,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s,} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d,} 3 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_5.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_5.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x)			\
+  {						\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+    for (int i = 0; i < 100; ++i)		\
+      x[i] += VAL + i;				\
+    consumer (x);				\
+  }
+
+TEST_LOOP (uint8_t, 3);
+TEST_LOOP (uint16_t, 4);
+TEST_LOOP (uint32_t, 5);
+TEST_LOOP (uint64_t, 6);
+TEST_LOOP (float, 2.5f);
+TEST_LOOP (double, 3.5);
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_6.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_6.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)						\
+  void									\
+  multi_loop_##TYPE (TYPE *x1, TYPE *x2, TYPE *x3, TYPE *x4, int which) \
+  {									\
+    if (which)								\
+      {									\
+	for (int i = 0; i < 7; ++i)					\
+	  x1[i] += VAL;							\
+	consumer (x1);							\
+	for (int i = 0; i < 7; ++i)					\
+	  x2[i] -= VAL;							\
+	consumer (x2);							\
+      }									\
+    else								\
+      {									\
+	for (int i = 0; i < 7; ++i)					\
+	  x3[i] &= VAL;							\
+	consumer (x3);							\
+      }									\
+    for (int i = 0; i < 7; ++i)						\
+      x4[i] |= VAL;							\
+    consumer (x4);							\
+  }
+
+TEST_LOOP (uint8_t, 0x12);
+TEST_LOOP (uint16_t, 0x1234);
+TEST_LOOP (uint32_t, 0x12345);
+TEST_LOOP (uint64_t, 0x123456);
+
+/* { dg-final { scan-assembler {\tld1b\tz[0-9]+\.b,} } } */
+/* { dg-final { scan-assembler {\tld1h\tz[0-9]+\.h,} } } */
+/* { dg-final { scan-assembler {\tld1w\tz[0-9]+\.s,} } } */
+/* { dg-final { scan-assembler {\tld1d\tz[0-9]+\.d,} } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_spill_7.c
===================================================================
--- /dev/null	2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_spill_7.c	2018-01-09 15:28:36.205732731 +0000
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+void consumer (void *);
+
+#define TEST_LOOP(TYPE, VAL)			\
+  void						\
+  multi_loop_##TYPE (TYPE *x, int n)		\
+  {						\
+    for (int k = 0; k < 4; ++k)			\
+      {						\
+	for (int j = 0; j < n; ++j)		\
+	  {					\
+	    for (int i = 0; i < 100; ++i)	\
+	      x[i] += VAL + i;			\
+	    asm volatile ("");			\
+	  }					\
+	for (int j = 0; j < n; ++j)		\
+	  consumer (x);				\
+	for (int j = 0; j < n; ++j)		\
+	  {					\
+	    for (int i = 0; i < 100; ++i)	\
+	      x[i] += VAL + i;			\
+	    asm volatile ("");			\
+	  }					\
+	consumer (x);				\
+	for (int i = 0; i < 100; ++i)		\
+	  x[i] += VAL + i;			\
+	consumer (x);				\
+      }						\
+  }
+
+TEST_LOOP (uint8_t, 3);
+TEST_LOOP (uint16_t, 4);
+TEST_LOOP (uint32_t, 5);
+TEST_LOOP (uint64_t, 6);
+TEST_LOOP (float, 2.5f);
+TEST_LOOP (double, 3.5);
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */
+/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

