RFC: LRA for x86/x86-64 [7/9] -- continuation

Fri Oct 12 14:44:00 GMT 2012

Hi Vlad,

Comments for the rest of ira-constraints.c.

Vladimir Makarov <vmakarov@redhat.com> writes:
> +  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
> +  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
> +  if (ad.base_reg_loc != NULL)
> +    {
> +      if (process_addr_reg
> +	  (ad.base_reg_loc, before,
> +	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
> +	    && find_regno_note (curr_insn, REG_DEAD,
> +				REGNO (*ad.base_reg_loc)) == NULL
> +	    ? after : NULL),
> +	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
> +	change_p = true;
> +      if (ad.base_reg_loc2 != NULL)
> +	*ad.base_reg_loc2 = *ad.base_reg_loc;
> +      saved_base_reg = *ad.base_reg_loc;
> +      lra_eliminate_reg_if_possible (ad.base_reg_loc);
> +      if (ad.base_reg_loc2 != NULL)
> +	{
> +	  saved_base_reg2 = *ad.base_reg_loc2;
> +	  lra_eliminate_reg_if_possible (ad.base_reg_loc2);
> +	}

We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it
might be clearer without saved_base_reg2.  More below...

> +      /* The following addressing is checked by constraints and
> +	 usually target specific legitimate address hooks do not
> +	 consider them valid.  */
> +      || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
> +      || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC

typo: two PRE_DECs, although:

> +      || GET_CODE (*addr_loc) == PRE_MODIFY
> +      || GET_CODE (*addr_loc) == POST_MODIFY

the whole lot could just be replaced by ad.base_modify_p, or perhaps
even removed entirely given:

> +      /* In this case we can not do anything because if it is wrong
> +	 that is because of wrong displacement.	 Remember that any
> +	 address was legitimate in non-strict sense before LRA.	 */
> +      || ad.disp_loc == NULL)

It doesn't seem worth validating the address at all for ad.disp_loc == NULL.
E.g. something like:

  if (ad.base_reg_loc != NULL
      && (process_addr_reg
	  (ad.base_reg_loc, before,
	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
	    && find_regno_note (curr_insn, REG_DEAD,
				REGNO (*ad.base_reg_loc)) == NULL
	    ? after : NULL),
	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
    {
      change_p = true;
      if (ad.base_reg_loc2 != NULL)
        *ad.base_reg_loc2 = *ad.base_reg_loc;
    }

  if (ad.index_reg_loc != NULL
      && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
    change_p = true;

  /* The address was valid before LRA.  We only change its form if the
     address has a displacement, so if it has no displacement it must
     still be valid.  */
  if (ad.disp_loc == NULL)
    return change_p;

  /* See whether the address is still valid.  Some ports do not check
     displacements for eliminable registers, so we replace them
     temporarily with the elimination target.  */
  saved_base_reg = saved_index_reg = NULL_RTX;
  ...
  if (ok_p)
    return change_p;

> +#ifdef HAVE_lo_sum
> +	  {
> +	    rtx insn;
> +	    rtx last = get_last_insn ();
> +
> +	    /* disp => lo_sum (new_base, disp)	*/
> +	    insn = emit_insn (gen_rtx_SET
> +			      (VOIDmode, new_reg,
> +			       gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
> +	    code = recog_memoized (insn);
> +	    if (code >= 0)
> +	      {
> +		rtx save = *ad.disp_loc;
> +
> +		*ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
> +		if (! valid_address_p (mode, *ad.disp_loc, as))
> +		  {
> +		    *ad.disp_loc = save;
> +		    code = -1;
> +		  }
> +	      }
> +	    if (code < 0)
> +	      delete_insns_since (last);
> +	  }
> +#endif

Nice :-)

Purely for the record, I wondered whether the high part should be
generated with emit_move_insn(_1) instead, with the rhs of the move
being the HIGH rtx.  That would allow targets to cope with cases where
the high part isn't represented directly as a HIGH.  E.g. on MIPS and
Alpha, small-data accesses use the global register as the high part instead.

However, both MIPS and Alpha accept small-data addresses as legitimate
constants and addresses before and during reload and only introduce the
split form after reload.  And I think that's how any other cases that
aren't simple HIGHs should be handled too.  E.g. MIPS also represents
GOT page loads as HIGHs until after reload, and only then lowers the
HIGH to a GOT load.  Allowing the backend to generate anything other
than a plain HIGH set here would be a double-edged sword.

So after all that I agree that the gen_rtx_SET above is better than
calling the move expanders.

> +	  /* index * scale + disp => new base + index * scale  */
> +	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
> +
> +	  lra_assert (INDEX_REG_CLASS != NO_REGS);
> +	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
> +	  lra_assert (GET_CODE (*addr_loc) == PLUS);
> +	  lra_emit_move (new_reg, *ad.disp_loc);
> +	  if (CONSTANT_P (XEXP (*addr_loc, 1)))
> +	    XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
> +	  XEXP (*addr_loc, 0) = new_reg;

The canonical form is (plus (mult ...) (reg)) rather than
(plus (reg) (mult ...)), but it looks like we create the latter.
I realise you try both forms here:

> +	  /* Some targets like ARM, accept address operands in
> +	     specific order -- try exchange them if necessary.	*/
> +	  if (! valid_address_p (mode, *addr_loc, as))
> +	    {
> +	      exchange_plus_ops (*addr_loc);
> +	      if (! valid_address_p (mode, *addr_loc, as))
> +		exchange_plus_ops (*addr_loc);
> +	    }

but I think we should try the canonical form first.  And I'd prefer it
if we didn't try the other form at all, especially in 4.8.  It isn't
really the backend's job to reject non-canonical rtl.  This might well
be another case where some targets need a (hopefully small) tweak in
order to play by the rules.

Also, I suppose this section of code feeds back to my question on
Wednesday about the distinction that LRA seems to make between the
compile-time constant in:

  (plus (reg X1) (const_int Y1))

and the link-time constant in:

  (plus (reg X2) (symbol_ref Y2))

It looked like extract_address_regs classified X1 as a base register and
X2 as an index register.  The difference between the two constants has
no run-time significance though, and I think we should handle both X1
and X2 as base registers (as I think reload does).

I think the path above would then be specific to scaled indices.
In the original address the "complex" index must come first and the
displacement second.  In the modified address, the index would stay
first and the new base register would be second.  More below.

> +      /* We don't use transformation 'base + disp => base + new index'
> +	 because of bad practice used in some machine descriptions
> +	 (see comments for emit_spill_move).  */
> +      /* base + disp => new base  */

As before when commenting on emit_spill_move, I think we should leave
the "bad machine description" stuff out of 4.8 and treat fixing the
machine descriptions as part of the LRA port.

In this case I think there's another reason not to reload the
displacement into an index though: IIRC postreload should be able
to optimise a sequence of address reloads that have the same base
and different displacements.  LRA itself might try using "anchor"
bases in future -- although obviously not in the initial merge --
since that was one thing that LEGITIMIZE_RELOAD_ADDRESS was used for.

E.g. maybe the justification could be:

      /* base + disp => new base  */
      /* Another option would be to reload the displacement into an
	 index register.  However, postreload has code to optimize
	 address reloads that have the same base and different
	 displacements, so reloading into an index register would
	 not necessarily be a win.  */

> +      /* base + scale * index + disp => new base + scale * index  */
> +      new_reg = base_plus_disp_to_reg (mode, as, &ad);
> +      *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
> +      if (! valid_address_p (mode, *addr_loc, as))
> +	{
> +	  /* Some targets like ARM, accept address operands in
> +	     specific order -- try exchange them if necessary.	*/
> +	  exchange_plus_ops (*addr_loc);
> +	  if (! valid_address_p (mode, *addr_loc, as))
> +	    exchange_plus_ops (*addr_loc);
> +	}

Same comment as above about canonical rtl.  Here we can have two
registers -- in which case the base should come first -- or a more
complex index -- in which case the index should come first.

We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
with the operands in either order, and let it take care of the details.
Using simplify_gen_binary would help with the earlier index+disp case too.

> +  /* If this is post-increment, first copy the location to the reload reg.  */
> +  if (post && real_in != result)
> +    emit_insn (gen_move_insn (result, real_in));

Nit, but real_in != result can never be true AIUI, and I was confused how
the code could be correct in that case.  Maybe just remove it, or make
it an assert?

> +  /* We suppose that there are insns to add/sub with the constant
> +     increment permitted in {PRE/POST)_{DEC/INC/MODIFY}.  At least the
> +     old reload worked with this assumption.  If the assumption
> +     becomes wrong, we should use approach in function
> +     base_plus_disp_to_reg.  */
> +  if (in == value)
> +    {
> +      /* See if we can directly increment INCLOC.  */
> +      last = get_last_insn ();
> +      add_insn = emit_insn (plus_p
> +			    ? gen_add2_insn (incloc, inc)
> +			    : gen_sub2_insn (incloc, inc));
> +
> +      code = recog_memoized (add_insn);
> +      /* We should restore recog_data for the current insn.  */

Looks like this comment might be a left-over, maybe from before the
cached insn data?

> +      /* Restore non-modified value for the result.  We prefer this
> +	 way because it does not require an addition hard
> +	 register.  */
> +      if (plus_p)
> +	{
> +	  if (CONST_INT_P (inc))
> +	    emit_insn (gen_add2_insn (result, GEN_INT (-INTVAL (inc))));
> +	  else
> +	    emit_insn (gen_sub2_insn (result, inc));
> +	}
> +      else if (CONST_INT_P (inc))
> +	emit_insn (gen_add2_insn (result, inc));

The last two lines look redundant.  The behaviour is the same as for
the following else:

> +      else
> +	emit_insn (gen_add2_insn (result, inc));

and I don't think there are any cases where !plus && CONST_INT_P (inc)
would hold.

> +/* Main entry point of this file: search the body of the current insn

s/this file/the constraints code/, since it's a static function.

> +  if (change_p)
> +    /* Changes in the insn might result in that we can not satisfy
> +       constraints in lately used alternative of the insn.  */
> +    lra_set_used_insn_alternative (curr_insn, -1);

Maybe:

  /* If we've changed the instruction then any alternative that
     we chose previously may no longer be valid.  */

> +      rtx x;
> +
> +      curr_swapped = !curr_swapped;
> +      if (curr_swapped)
> +	{
> +	  x = *curr_id->operand_loc[commutative];
> +	  *curr_id->operand_loc[commutative]
> +	    = *curr_id->operand_loc[commutative + 1];
> +	  *curr_id->operand_loc[commutative + 1] = x;
> +	  /* Swap the duplicates too.  */
> +	  lra_update_dup (curr_id, commutative);
> +	  lra_update_dup (curr_id, commutative + 1);
> +	  goto try_swapped;
> +	}
> +      else
> +	{
> +	  x = *curr_id->operand_loc[commutative];
> +	  *curr_id->operand_loc[commutative]
> +	    = *curr_id->operand_loc[commutative + 1];
> +	  *curr_id->operand_loc[commutative + 1] = x;
> +	  lra_update_dup (curr_id, commutative);
> +	  lra_update_dup (curr_id, commutative + 1);
> +	}

The swap code is the same in both cases, so I think it'd be better to
make it common.  Or possibly a helper function, since the same code
appears again later on.

> +	if (GET_CODE (op) == PLUS)
> +	  {
> +	    plus = op;
> +	    op = XEXP (op, 1);
> +	  }

Sorry, I'm complaining about old reload code again, but: does this
actually happen in LRA?  In reload, a register operand could become a
PLUS because of elimination, but I thought LRA did things differently.
Besides, this is only needed for:

> +	if (CONST_POOL_OK_P (mode, op)
> +	    && ((targetm.preferred_reload_class
> +		 (op, (enum reg_class) goal_alt[i]) == NO_REGS)
> +		|| no_input_reloads_p)
> +	    && mode != VOIDmode)
> +	  {
> +	    rtx tem = force_const_mem (mode, op);
> +	    
> +	    change_p = true;
> +	    /* If we stripped a SUBREG or a PLUS above add it back.  */
> +	    if (plus != NULL_RTX)
> +	      tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem);

and we shouldn't have (plus (constant ...) ...) after elimination
(or at all outside of a CONST).  I don't understand why the code is
needed even in reload.

> +  for (i = 0; i < n_operands; i++)
> +    {
> +      rtx old, new_reg;
> +      rtx op = *curr_id->operand_loc[i];
> +
> +      if (goal_alt_win[i])
> +	{
> +	  if (goal_alt[i] == NO_REGS
> +	      && REG_P (op)
> +	      && lra_former_scratch_operand_p (curr_insn, i))
> +	    change_class (REGNO (op), NO_REGS, "      Change", true);

I think this could do with a comment.  Does setting the class to NO_REGS
indirectly cause the operand to be switched back to a SCRATCH?

> +	  push_to_sequence (before);
> +	  rclass = base_reg_class (GET_MODE (op), MEM_ADDR_SPACE (op),
> +				   MEM, SCRATCH);
> +	  if (code == PRE_DEC || code == POST_DEC
> +	      || code == PRE_INC || code == POST_INC
> +	      || code == PRE_MODIFY || code == POST_MODIFY)

Very minor, but: GET_RTX_CLASS (code) == RTX_AUTOINC

> +	  enum machine_mode mode;
> +	  rtx reg, *loc;
> +	  int hard_regno, byte;
> +	  enum op_type type = curr_static_id->operand[i].type;
> +
> +	  loc = curr_id->operand_loc[i];
> +	  mode = get_op_mode (i);
> +	  if (GET_CODE (*loc) == SUBREG)
> +	    {
> +	      reg = SUBREG_REG (*loc);
> +	      byte = SUBREG_BYTE (*loc);
> +	      if (REG_P (reg)
> +		  /* Strict_low_part requires reload the register not
> +		     the sub-register.	*/
> +		  && (curr_static_id->operand[i].strict_low
> +		      || (GET_MODE_SIZE (mode)
> +			  <= GET_MODE_SIZE (GET_MODE (reg))
> +			  && (hard_regno
> +			      = get_try_hard_regno (REGNO (reg))) >= 0
> +			  && (simplify_subreg_regno
> +			      (hard_regno,
> +			       GET_MODE (reg), byte, mode) < 0)
> +			  && (goal_alt[i] == NO_REGS
> +			      || (simplify_subreg_regno
> +				  (ira_class_hard_regs[goal_alt[i]][0],
> +				   GET_MODE (reg), byte, mode) >= 0)))))
> +		{
> +		  loc = &SUBREG_REG (*loc);
> +		  mode = GET_MODE (*loc);
> +		}
> +	  old = *loc;

I think this needs a bit more justifying commentary (although I'm glad
to see it's much simpler than the reload version :-)).  One thing in
particular I didn't understand was why we don't reload the inner
register of a paradoxical subreg.

> +	  if (get_reload_reg (type, mode, old, goal_alt[i], "", &new_reg)
> +	      && type != OP_OUT)
> +	    {
> +	      push_to_sequence (before);
> +	      lra_emit_move (new_reg, old);
> +	      before = get_insns ();
> +	      end_sequence ();
> +	    }
> +	  *loc = new_reg;
> +	  if (type != OP_IN)
> +	    {
> +	      if (find_reg_note (curr_insn, REG_UNUSED, old) == NULL_RTX)
> +		{
> +		  start_sequence ();
> +		  /* We don't want sharing subregs as the pseudo can
> +		     get a memory and the memory can be processed
> +		     several times for eliminations.  */
> +		  lra_emit_move (GET_CODE (old) == SUBREG && type == OP_INOUT
> +				 ? copy_rtx (old) : old,
> +				 new_reg);

I think this should simply be:

  lra_emit_move (type == OP_INOUT ? copy_rtx (old) : old, new_reg);

leaving copy_rtx to figure out which rtxes can be shared.  No comment
would be needed for that.

> +		  emit_insn (after);
> +		  after = get_insns ();
> +		  end_sequence ();
> +		}
> +	      *loc = new_reg;
> +	    }

Very minor again, but: redundant *loc assignment (so that the two nested
if statements collapse to one).

> +      else
> +	{
> +	  lra_assert (INSN_CODE (curr_insn) < 0);
> +	  error_for_asm (curr_insn,
> +			 "inconsistent operand constraints in an %<asm%>");
> +	  /* Avoid further trouble with this insn.  */
> +	  PATTERN (curr_insn) = gen_rtx_USE (VOIDmode, const0_rtx);
> +	  return false;

Is this code handling a different case from the corresponding error
code in curr_insn_transform?  If so, it probably deserves a comment
explaining the difference.

> +/* Process all regs in debug location *LOC and change them on
> +   equivalent substitution.  Return true if any change was done.  */
> +static bool
> +debug_loc_equivalence_change_p (rtx *loc)

This doesn't keep the rtl in canonical form.  Probably the easiest and
best fix is to use simplify_replace_fn_rtx, which handles all that for you.
(simplify_replace_fn_rtx returns the original rtx if no change was made.)

> +  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
> +    ira_reg_equiv[i].profitable_p = true;
> +  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
> +    if (lra_reg_info[i].nrefs != 0)
> +      {
> +	if ((hard_regno = lra_get_regno_hard_regno (i)) >= 0)
> +	  {
> +	    int j, nregs = hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (i)];
> +	    
> +	    for (j = 0; j < nregs; j++)
> +	      df_set_regs_ever_live (hard_regno + j, true);
> +	  }
> +	else if ((x = get_equiv_substitution (regno_reg_rtx[i])) != NULL_RTX)
> +	  {
> +	    if (! first_p && contains_reg_p (x, false, false))
> +	      /* After RTL transformation, we can not guarantee that
> +		 pseudo in the substitution was not reloaded which
> +		 might make equivalence invalid.  For example, in
> +		 reverse equiv of p0
> +
> +		 p0 <- ...
> +		 ...
> +		 equiv_mem <- p0
> +
> +		 the memory address register was reloaded before the
> +		 2nd insn.  */
> +	      ira_reg_equiv[i].defined_p = false;
> +	    if (contains_reg_p (x, false, true))
> +	      ira_reg_equiv[i].profitable_p = false;
> +	  }
> +      }

Do we need two loops because the second may check for equivalences
of other pseudos besides "i"?  I couldn't see how offhand, but I might
well have missed something.  Might be worth a comment.

> +	      dest_reg = SET_DEST (set);
> +	      /* The equivalence pseudo could be set up as SUBREG in a
> +		 case when it is a call restore insn in a mode
> +		 different from the pseudo mode.  */
> +	      if (GET_CODE (dest_reg) == SUBREG)
> +		dest_reg = SUBREG_REG (dest_reg);
> +	      if ((REG_P (dest_reg)
> +		   && (x = get_equiv_substitution (dest_reg)) != dest_reg
> +		   /* Remove insns which set up a pseudo whose value
> +		      can not be changed.  Such insns might be not in
> +		      init_insns because we don't update equiv data
> +		      during insn transformations.
> +			  
> +		      As an example, let suppose that a pseudo got
> +		      hard register and on the 1st pass was not
> +		      changed to equivalent constant.  We generate an
> +		      additional insn setting up the pseudo because of
> +		      secondary memory movement.  Then the pseudo is
> +		      spilled and we use the equiv constant.  In this
> +		      case we should remove the additional insn and
> +		      this insn is not init_insns list.	 */
> +		   && (! MEM_P (x) || MEM_READONLY_P (x)
> +		       || in_list_p (curr_insn,
> +				     ira_reg_equiv
> +				     [REGNO (dest_reg)].init_insns)))

This is probably a stupid question, sorry, but when do we ever want
to keep an assignment to a substituted pseudo?  I.e. why isn't this just:

	      if ((REG_P (dest_reg)
		   && (x = get_equiv_substitution (dest_reg)) != dest_reg)

> +/* Info about last usage of registers in EBB to do inheritance/split
> +   transformation.  Inheritance transformation is done from a spilled
> +   pseudo and split transformations from a hard register or a pseudo
> +   assigned to a hard register.	 */
> +struct usage_insns
> +{
> +  /* If the value is equal to CURR_USAGE_INSNS_CHECK, then the member
> +     value INSNS is valid.  The insns is chain of optional debug insns
> +     and a finishing non-debug insn using the corresponding reg.  */
> +  int check;
> +  /* Value of global reloads_num at the ???corresponding next insns.  */
> +  int reloads_num;
> +  /* Value of global reloads_num at the ???corresponding next insns.  */
> +  int calls_num;

"???s".  Probably "at the last instruction in INSNS" if that's accurate
(because debug insns in INSNS don't affect these fields).

> +/* Process all regs OLD_REGNO in location *LOC and change them on the
> +   reload pseudo NEW_REG.  Return true if any change was done.	*/
> +static bool
> +substitute_pseudo (rtx *loc, int old_regno, rtx new_reg)

This is another case where I found the term "reload pseudo" a bit confusing,
since AIUI new_reg can be an inheritance or split pseudo rather than a pseudo
created solely for insn reloads.  I'll follow up about that on the original
thread.  Maybe just:

/* Replace all references to register OLD_REGNO in *LOC with pseudo register
   NEW_REG.  Return true if any change was made.  */

> +  code = GET_CODE (x);
> +  if (code == REG && (int) REGNO (x) == old_regno)
> +    {
> +      *loc = new_reg;
> +      return true;
> +    }

Maybe assert that the modes are the same?

> +/* Do inheritance transformation for insn INSN defining (if DEF_P) or
> +   using ORIGINAL_REGNO where the subsequent insn(s) in EBB (remember
> +   we traverse insns in the backward direction) for the original regno
> +   is NEXT_USAGE_INSNS.	 The transformations look like

Maybe:

/* Do interitance transformations for insn INSN, which defines (if DEF_P)
   or uses ORIGINAL_REGNO.  NEXT_USAGE_INSNS specifies which instruction
   in the EBB next uses ORIGINAL_REGNO; it has the same form as the
   "insns" field of usage_insns.

   The transformations look like:

> +
> +     p <- ...		  i <- ...
> +     ...		  p <- i    (new insn)
> +     ...	     =>
> +     <- ... p ...	  <- ... i ...
> +   or
> +     ...		  i <- p    (new insn)
> +     <- ... p ...	  <- ... i ...
> +     ...	     =>
> +     <- ... p ...	  <- ... i ...
> +   where p is a spilled original pseudo and i is a new inheritance pseudo.
> +   
> +   The inheritance pseudo has the smallest class of two classes CL and
> +   class of ORIGINAL REGNO.  It will have unique value if UNIQ_P.  The
> +   unique value is necessary for correct assignment to inheritance
> +   pseudo for input of an insn which should be the same as output
> +   (bound pseudos).  Return true if we succeed in such
> +   transformation.  */

This comment looks really good, but I still wasn't sure about the
UNIQ_P thing.  AIUI this is for cases like:

                       i <- p            [new insn]
   r <- ... p ...      r <- ... i ...    [input reload]
   r <- ... r ...   => r <- ... r ...    [original insn]
   <- r                <- r              [output reload]
   ....                ......
   <- ... p ...        <- ... i ...      [next ref]

where "r" is used on both sides of the original insn and where the
output reload assigns to something other than "p" (otherwise "next ref"
wouldn't be the next ref).  But why does this affect the way "i" is created?
I think it'd be worth expanding that part a bit.

> +  if (! ira_reg_classes_intersect_p[cl][rclass])
> +    {
> +      if (lra_dump_file != NULL)
> +	{
> +	  fprintf (lra_dump_file,
> +		   "	Rejecting inheritance for %d "
> +		   "because of too different classes %s and %s\n",

Suggest s/too different/disjoint/

> +  if ((ira_class_subset_p[cl][rclass] && cl != rclass)
> +      || ira_class_hard_regs_num[cl] < ira_class_hard_regs_num[rclass])
> +    {
> +      if (lra_dump_file != NULL)
> +	fprintf (lra_dump_file, "    Use smallest class of %s and %s\n",
> +		 reg_class_names[cl], reg_class_names[rclass]);
> +      
> +      rclass = cl;
> +    }

I don't understand the second line of the if statement.  Why do we prefer
classes with fewer allocatable registers?

My guess before reading the code was that we'd use the subunion of CL and
RCLASS, so maybe a comment explaining why we use this choice would help.

> +  if (NEXT_INSN (new_insns) != NULL_RTX)
> +    {
> +      if (lra_dump_file != NULL)
> +	{
> +	  fprintf (lra_dump_file,
> +		   "	Rejecting inheritance %d->%d "
> +		   "as it results in 2 or more insns:\n",
> +		   original_regno, REGNO (new_reg));
> +	  debug_rtl_slim (lra_dump_file, new_insns, NULL_RTX, -1, 0);
> +	  fprintf (lra_dump_file,
> +		   "	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
> +	}
> +      return false;
> +    }

Hmm, I wasn't sure about this at first.  Some targets define patterns for
multiword moves and split them later.  Others expose the split straight away.
The two approaches don't really imply any difference in cost, so I didn't
want us to penalise the latter.

But I suppose on targets that split straight away, lower-subreg would
tend to replace the multiword pseudo with individual word-sized pseudos,
so LRA shouldn't see them.  I suppose this check shouldn't matter in practice.

> +  if (def_p)
> +    lra_process_new_insns (insn, NULL_RTX, new_insns,
> +			   "Add original<-inheritance");
> +  else
> +    lra_process_new_insns (insn, new_insns, NULL_RTX,
> +			   "Add inheritance<-pseudo");

Maybe "original" rather than "pseudo" here too for consistency.

> +/* Return true if we need a split for hard register REGNO or pseudo
> +   REGNO which was assigned to a hard register.
> +   POTENTIAL_RELOAD_HARD_REGS contains hard registers which might be
> +   used for reloads since the EBB end.	It is an approximation of the
> +   used hard registers in the split range.  The exact value would
> +   require expensive calculations.  If we were aggressive with
> +   splitting because of the approximation, the split pseudo will save
> +   the same hard register assignment and will be removed in the undo
> +   pass.  We still need the approximation because too aggressive
> +   splitting would result in too inaccurate cost calculation in the
> +   assignment pass because of too many generated moves which will be
> +   probably removed in the undo pass.  */
> +static inline bool
> +need_for_split_p (HARD_REG_SET potential_reload_hard_regs, int regno)
> +{
> +  int hard_regno = regno < FIRST_PSEUDO_REGISTER ? regno : reg_renumber[regno];
> +
> +  lra_assert (hard_regno >= 0);
> +  return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
> +	   && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
> +	   && (usage_insns[regno].reloads_num
> +	       + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
> +	   && ((regno < FIRST_PSEUDO_REGISTER
> +		&& ! bitmap_bit_p (&ebb_global_regs, regno))
> +	       || (regno >= FIRST_PSEUDO_REGISTER
> +		   && lra_reg_info[regno].nrefs > 3
> +		   && bitmap_bit_p (&ebb_global_regs, regno))))
> +	  || (regno >= FIRST_PSEUDO_REGISTER && need_for_call_save_p (regno)));
> +}

Could you add more commentary about the thinking behind this particular
choice of heuristic?  E.g. I wasn't sure what the reloads_num check did,
or why we only split hard registers that are local to the EBB and only
split pseudos that aren't.

The 2 and 3 numbers seemed a bit magic too.  I suppose the 2 has
something to do with "one save and one restore", but I wasn't sure
why we applied it only for pseudos.  (AIUI that arm of the check
deals with "genuine" split pseudos rather than call saves & restores.)

Still, it says a lot for the high quality of LRA that, out of all the
1000s of lines of code I've read so far, this is the only part that
didn't seem to have an intuitive justification.

> +  for (i = 0;
> +       (cl = reg_class_subclasses[allocno_class][i]) != LIM_REG_CLASSES;
> +       i++)
> +    if (! SECONDARY_MEMORY_NEEDED (cl, hard_reg_class, mode)
> +	&& ! SECONDARY_MEMORY_NEEDED (hard_reg_class, cl, mode)
> +	&& TEST_HARD_REG_BIT (reg_class_contents[cl], hard_regno)
> +	&& (best_cl == NO_REGS
> +	    || (hard_reg_set_subset_p (reg_class_contents[best_cl],
> +				       reg_class_contents[cl])
> +		&& ! hard_reg_set_equal_p (reg_class_contents[best_cl],
> +					   reg_class_contents[cl]))))
> +      best_cl = cl;

OK, so this suggestion isn't backed up by any evidence, but what do
you think about this alternative:

	&& (best_cl == NO_REGS
	    || (ira_class_hard_regs_num[best_cl]
		< ira_class_hard_regs_num[cl]))

which should choose the largest class that requires no secondary memory.
It looks like the subset version could get "stuck" on a single-register
class that happens to be early in the list but has no superclass smaller
than allocno_class.

> +/* Do split transformation for insn INSN defining or
> +   using ORIGINAL_REGNO where the subsequent insn(s) in EBB (remember
> +   we traverse insns in the backward direction) for the original regno
> +   is NEXT_USAGE_INSNS.	 The transformations look like

Same suggestion as for the inheritance function above.

> +  if (call_save_p)
> +    save = emit_spill_move (true, new_reg, original_reg, -1);
> +  else
> +    {
> +      start_sequence ();
> +      emit_move_insn (new_reg, original_reg);
> +      save = get_insns ();
> +      end_sequence ();
> +    }
> +  if (NEXT_INSN (save) != NULL_RTX)
> +    {
> +      lra_assert (! call_save_p);

Is emit_spill_move really guaranteed to return only one instruction in
cases where emit_move_insn might not?  Both of them use emit_move_insn_1
internally, so I wouldn't have expected much difference.

In fact I wasn't really sure why:

  save = gen_move_insn (new, original_reg);

wouldn't be correct for both.

Same comments for the restore code.

> +      /* See which defined values die here.  */
> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +	if (reg->type == OP_OUT && ! reg->early_clobber
> +	    && (! reg->subreg_p
> +		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
> +	  bitmap_clear_bit (&live_regs, reg->regno);
> +      /* Mark each used value as live.	*/
> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +	if (reg->type == OP_IN
> +	    && bitmap_bit_p (&check_only_regs, reg->regno))
> +	  bitmap_set_bit (&live_regs, reg->regno);
> +      /* Mark early clobber outputs dead.  */
> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +	if (reg->type == OP_OUT && reg->early_clobber && ! reg->subreg_p)
> +	  bitmap_clear_bit (&live_regs, reg->regno);

I don't think this would be correct for unreloaded insns because an
unreloaded insn can have the same pseudo as an input and an earlyclobber
output.  (Probably not an issue here, since we're called after the
constraints pass.)  There's also the case of matched earlyclobber operands,
where the matched input is specifically not affected by the earlyclobber.

I'd have thought:

      /* See which defined values die here.  */
      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
	if (reg->type == OP_OUT
	    && (! reg->subreg_p
		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
	  bitmap_clear_bit (&live_regs, reg->regno);
      /* Mark each used value as live.	*/
      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
	if (reg->type == OP_IN
	    && bitmap_bit_p (&check_only_regs, reg->regno))
	  bitmap_set_bit (&live_regs, reg->regno);

ought to be correct, but perhaps I'm missing something.

(I'm still uneasy about the special treatment of bound pseudos here.
A clobber really does seem better.)

> +      /* It is quite important to remove dead move insns because it
> +	 means removing dead store, we don't need to process them for
> +	 constraints, and unfortunately some subsequent optimizations
> +	 (like shrink-wrapping) currently based on assumption that
> +	 there are no trivial dead insns.  */

Maybe best to drop the "subsequent optimizations" part.  This comment
is unlikely to be updated after any change to shrink-wrapping & co.,
and the first two justifications seem convincing enough on their own.

> +/* Add inheritance info REGNO and INSNS.  */
> +static void
> +add_to_inherit (int regno, rtx insns)
> +{
> +  int i;
> +
> +  for (i = 0; i < to_inherit_num; i++)
> +    if (to_inherit[i].regno == regno)
> +      return;

Is the existing "insns" field guaranteed to match the "insns" parameter
in this case, or might they be different?  Probably worth an assert or
comment respectively.

> +/* Return first (if FIRST_P) or last non-debug insn in basic block BB.
> +   Return null if there are no non-debug insns in the block.  */
> +static rtx
> +get_non_debug_insn (bool first_p, basic_block bb)
> +{
> +  rtx insn;
> +
> +  for (insn = first_p ? BB_HEAD (bb) : BB_END (bb);
> +       insn != NULL_RTX && ! NONDEBUG_INSN_P (insn);
> +       insn = first_p ? NEXT_INSN (insn) : PREV_INSN (insn))
> +    ;
> +  if (insn != NULL_RTX && BLOCK_FOR_INSN (insn) != bb)
> +    insn = NULL_RTX;
> +  return insn;
> +}

It probably doesn't matter in practice, but it looks like it'd be better
to limit the walk to the bb, rather than walking until null and then
testing the bb after the walk.

Maybe it would be eaiser to split into two functions, since first_p is
always constant.  E.g.:

  rtx insn;

  FOR_BB_INSNS (bb, insn)
    if (NONDEBUG_INSN_P (insn))
      return insn;
  return NULL_RTX;

for first_p.  s/FOR_BB_INSNS/FOR_BB_INSNS_REVERSE/ for !first_p.

> +/* Set up RES by registers living on edges FROM except the edge (FROM,
> +   TO) or by registers set up in a jump insn in BB FROM.  */
> +static void
> +get_live_on_other_edges (basic_block from, basic_block to, bitmap res)
> +{
> +  int regno;
> +  rtx last;
> +  struct lra_insn_reg *reg;
> +  edge e;
> +  edge_iterator ei;
> +
> +  lra_assert (to != NULL);
> +  bitmap_clear (res);
> +  FOR_EACH_EDGE (e, ei, from->succs)
> +    if (e->dest != to)
> +      bitmap_ior_into (res, DF_LR_IN (e->dest));
> +  if ((last = get_non_debug_insn (false, from)) == NULL_RTX || ! JUMP_P (last))
> +    return;
> +  curr_id = lra_get_insn_recog_data (last);
> +  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +    if (reg->type != OP_IN
> +	&& (regno = reg->regno) >= FIRST_PSEUDO_REGISTER)
> +      bitmap_set_bit (res, regno);
> +}

Probably a silly question, sorry, but: why does the JUMP_P part only
include pseudo registers?  The other calculations (here and elsewhere)
seem to handle both hard and pseudo registers.

> +/* Do inheritance/split transformations in EBB starting with HEAD and
> +   finishing on TAIL.  We process EBB insns in the reverse order.
> +   Return true if we did any inheritance/split transformation in the
> +   EBB.
> +
> +   We should avoid excessive splitting which results in worse code
> +   because of inaccurate cost calculations for spilling new split
> +   pseudos in such case.  To achieve this we do splitting only if
> +   register pressure is high in given basic block and there reload

"...and there are reload"

> +   pseudos requiring hard registers.  We could do more register
> +   pressure calculations at any given program point to avoid necessary
> +   splitting even more but it is to expensive and the current approach
> +   is well enough.  */

"works well enough".

> +  change_p = false;
> +  curr_usage_insns_check++;
> +  reloads_num = calls_num = 0;
> +  /* Remember: we can remove the current insn.	*/
> +  bitmap_clear (&check_only_regs);
> +  last_processed_bb = NULL;

I couldn't tell which part of the code the comment is referring to.
Maybe left over?

> +	  after_p = (last_insn != NULL_RTX && ! JUMP_P (last_insn)
> +		     && (! CALL_P (last_insn)
> +			 || (find_reg_note (last_insn,
> +					   REG_NORETURN, NULL) == NULL_RTX
> +			     && ((next_insn
> +				  = next_nonnote_nondebug_insn (last_insn))
> +				 == NULL_RTX
> +				 || GET_CODE (next_insn) != BARRIER))));

Genuine question, but: when are the last four lines needed?  The condition
that they're testing for sounds like a noreturn call.

> +      if (src_regno < lra_constraint_new_regno_start
> +	  && src_regno >= FIRST_PSEUDO_REGISTER
> +	  && reg_renumber[src_regno] < 0
> +	  && dst_regno >= lra_constraint_new_regno_start
> +	  && (cl = lra_get_allocno_class (dst_regno)) != NO_REGS)
> +	{
> +	  /* 'reload_pseudo <- original_pseudo'.  */
> +	  reloads_num++;
> +	  succ_p = false;
> +	  if (usage_insns[src_regno].check == curr_usage_insns_check
> +	      && (next_usage_insns = usage_insns[src_regno].insns) != NULL_RTX)
> +	    succ_p = inherit_reload_reg (false,
> +					 bitmap_bit_p (&lra_matched_pseudos,
> +						       dst_regno),
> +					 src_regno, cl,
> +					 curr_insn, next_usage_insns);
> +	  if (succ_p)
> +	    change_p = true;
> +	  else
> +	    {
> +	      usage_insns[src_regno].check = curr_usage_insns_check;
> +	      usage_insns[src_regno].insns = curr_insn;
> +	      usage_insns[src_regno].reloads_num = reloads_num;
> +	      usage_insns[src_regno].calls_num = calls_num;
> +	      usage_insns[src_regno].after_p = false;
> +	    }

Looks like this and other places could use the add_next_usage_insn
helper function.

> +	  if (cl != NO_REGS
> +	      && hard_reg_set_subset_p (reg_class_contents[cl],
> +					live_hard_regs))
> +	    IOR_HARD_REG_SET (potential_reload_hard_regs,
> +			      reg_class_contents[cl]);

Redundant "cl != NO_REGS" check.  (Was a bit confused by that at first.)

I don't understand the way potential_reload_hard_regs is set up.
Why does it only include reload pseudos involved in moves of the form
"reload_pseudo <- original_pseudo" and "original_pseudo <- reload_pseudo",
but include those reloads regardless of whether inheritance is possible?

I wondered whether it might be deliberately selective in order to speed
up LRA, but we walk all the registers in an insn regardless.

Same for reloads_num.

> +	  if (cl != NO_REGS
> +	      && hard_reg_set_subset_p (reg_class_contents[cl],
> +					live_hard_regs))
> +	    IOR_HARD_REG_SET (potential_reload_hard_regs,
> +			      reg_class_contents[cl]);

Same comment as for the previous block.

> +		if (reg_renumber[dst_regno] < 0
> +		    || (reg->type == OP_OUT && ! reg->subreg_p))
> +		/* Invalidate.	*/
> +		usage_insns[dst_regno].check = 0;

Could you explain this condition a bit more?  Why does reg_renumber
affect things?

> +/* This value affects EBB forming.  If probability of edge from EBB to
> +   a BB is not greater than the following value, we don't add the BB
> +   to EBB.  */ 
> +#define EBB_PROBABILITY_CUTOFF (REG_BR_PROB_BASE / 2)

It looks like schedule_ebbs uses a higher default cutoff for FDO.
Would the same distinction be useful here?

Maybe schedule_ebbs-like params would be good here too.

> +  bitmap_and (&temp_bitmap_head, removed_pseudos, live);
> +  EXECUTE_IF_SET_IN_BITMAP (&temp_bitmap_head, 0, regno, bi)

This isn't going to have much effect on compile time, but
EXECUTE_IF_AND_IN_BITMAP avoids the need for a temporary bitmap.

> +/* Remove inheritance/split pseudos which are in REMOVE_PSEUDOS and
> +   return true if we did any change.  The undo transformations for
> +   inheritance looks like
> +      i <- i2
> +      p <- i	  =>   p <- i2
> +   or removing
> +      p <- i, i <- p, and i <- i3
> +   where p is original pseudo from which inheritance pseudo i was
> +   created, i and i3 are removed inheritance pseudos, i2 is another
> +   not removed inheritance pseudo.  All split pseudos or other
> +   occurrences of removed inheritance pseudos are changed on the
> +   corresponding original pseudos.  */
> +static bool
> +remove_inheritance_pseudos (bitmap remove_pseudos)
> +{
> +  basic_block bb;
> +  int regno, sregno, prev_sregno, dregno, restore_regno;
> +  rtx set, prev_set, prev_insn;
> +  bool change_p, done_p;
> +
> +  change_p = ! bitmap_empty_p (remove_pseudos);

I wondered from the comment why we couldn't just return straight away
for the empty set, but it looks like the function also schedules a
constraints pass for instructions that keep their inheritance or
split pseudos.  Is that right?  Might be worth mentioning that
in the function comment if so.

> +	      else if (bitmap_bit_p (remove_pseudos, sregno)
> +		       && bitmap_bit_p (&lra_inheritance_pseudos, sregno))
> +		{
> +		  /* Search the following pattern:
> +		       inherit_or_split_pseudo1 <- inherit_or_split_pseudo2
> +		       original_pseudo <- inherit_or_split_pseudo1
> +		    where the 2nd insn is the current insn and
> +		    inherit_or_split_pseudo2 is not removed.  If it is found,
> +		    change the current insn onto:
> +		       original_pseudo1 <- inherit_or_split_pseudo2.  */

s/original_pseudo1/original_pseudo/ I think (we don't change the destination).

> +		  for (prev_insn = PREV_INSN (curr_insn);
> +		       prev_insn != NULL_RTX && ! NONDEBUG_INSN_P (prev_insn);
> +		       prev_insn = PREV_INSN (prev_insn))
> +		    ;
> +		  if (prev_insn != NULL_RTX && BLOCK_FOR_INSN (prev_insn) == bb
> +		      && (prev_set = single_set (prev_insn)) != NULL_RTX
> +		      /* There should be no subregs in insn we are
> +			 searching because only the original reg might
> +			 be in subreg when we changed the mode of
> +			 load/store for splitting.  */
> +		      && REG_P (SET_DEST (prev_set))
> +		      && REG_P (SET_SRC (prev_set))
> +		      && (int) REGNO (SET_DEST (prev_set)) == sregno
> +		      && ((prev_sregno = REGNO (SET_SRC (prev_set)))
> +			  >= FIRST_PSEUDO_REGISTER)
> +		      && (lra_reg_info[sregno].restore_regno
> +			  == lra_reg_info[prev_sregno].restore_regno)
> +		      && ! bitmap_bit_p (remove_pseudos, prev_sregno))

I'm sure the restore_regno comparison near the end is correct,
but could you add a comment to explain it?  The substitution
itself seems OK either way.

> +	      struct lra_insn_reg *reg;
> +	      bool insn_change_p = false;
> +
> +	      curr_id = lra_get_insn_recog_data (curr_insn);
> +	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +		if ((regno = reg->regno) >= lra_constraint_new_regno_start
> +		    && lra_reg_info[regno].restore_regno >= 0)

Is the first part of the comparison needed?  Most other places don't check,
so it looked at first glance like there was something special here.

> +		  {
> +		    if (change_p && bitmap_bit_p (remove_pseudos, regno))
> +		      {
> +			restore_regno = lra_reg_info[regno].restore_regno;
> +			substitute_pseudo (&curr_insn, regno,
> +					   regno_reg_rtx[restore_regno]);
> +			insn_change_p = true;
> +		      }
> +		    else if (NONDEBUG_INSN_P (curr_insn))
> +		      {
> +			lra_push_insn_and_update_insn_regno_info (curr_insn);
> +			lra_set_used_insn_alternative_by_uid
> +			  (INSN_UID (curr_insn), -1);
> +		      }
> +		  }
> +	      if (insn_change_p)
> +		{
> +		  lra_update_insn_regno_info (curr_insn);
> +		  if (lra_dump_file != NULL)
> +		    {
> +		      fprintf (lra_dump_file, "	   Restore original insn:\n");
> +		      debug_rtl_slim (lra_dump_file,
> +				      curr_insn, curr_insn, -1, 0);
> +		    }
> +		}

AIUI we could have a partial restore, keeping some registers but
restoring others.  Is that right?  The dump entry made it sounds
like a full restore.

Maybe something like:

	      struct lra_insn_reg *reg;
	      bool restored_regs_p = false;
	      bool kept_regs_p = false;

	      curr_id = lra_get_insn_recog_data (curr_insn);
	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
		{
		  regno = reg->regno;
		  restore_regno = lra_reg_info[regno].restore_regno;
		  if (restore_regno >= 0)
		    {
		      if (change_p && bitmap_bit_p (remove_pseudos, regno))
			{
			  substitute_pseudo (&curr_insn, regno,
					     regno_reg_rtx[restore_regno]);
			  restored_regs_p = true;
			}
		      else
			kept_regs_p = true;
		    }
		}
	      if (NONDEBUG_INSN_P (curr_insn) && kept_regs_p)
		{
		  /* The instruction has changed since the previous
		     constraints pass.  */
		  lra_push_insn_and_update_insn_regno_info (curr_insn);
		  lra_set_used_insn_alternative_by_uid
		    (INSN_UID (curr_insn), -1);
		}
	      else if (restored_regs_p)
		/* The instruction has been restored to the form that
		   it had during the previous constraints pass.  */
		lra_update_insn_regno_info (curr_insn);

	      if (restored_regs_p && lra_dump_file != NULL)
		{
		  fprintf (lra_dump_file,
			   "	   Insn after restoring regs:\n");
		  debug_rtl_slim (lra_dump_file, curr_insn, curr_insn, -1, 0);
		}

(if correct) might make the partial case clearer, but that's personal
preference, so please feel free to ignore, chop or change.

Also, is regno_reg_rtx[restore_regno] always correct when restoring
registers?  I thought restore_regno could be a hard register and that
the hard register might not necessarily be used in the same mode as
the regno_reg_rtx[] entry.

That just leaves lra.h, lra-int.h and lra.c itself.  I'm hoping to have
read through those by the middle of next week, but any comments about them
will probably just be banal stuff (even more than my comments so far :-))
so I deliberately left them to last.

Richard