RFC: LRA for x86/x86-64 [7/9] -- continuation

Wed Oct 17 01:18:00 GMT 2012

On 12-10-12 10:29 AM, Richard Sandiford wrote:
> Hi Vlad,
>
> Comments for the rest of ira-constraints.c.
>
> Vladimir Makarov<vmakarov@redhat.com>  writes:
>> +  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
>> +  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
>> +  if (ad.base_reg_loc != NULL)
>> +    {
>> +      if (process_addr_reg
>> +	  (ad.base_reg_loc, before,
>> +	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
>> +	    && find_regno_note (curr_insn, REG_DEAD,
>> +				REGNO (*ad.base_reg_loc)) == NULL
>> +	    ? after : NULL),
>> +	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
>> +	change_p = true;
>> +      if (ad.base_reg_loc2 != NULL)
>> +	*ad.base_reg_loc2 = *ad.base_reg_loc;
>> +      saved_base_reg = *ad.base_reg_loc;
>> +      lra_eliminate_reg_if_possible (ad.base_reg_loc);
>> +      if (ad.base_reg_loc2 != NULL)
>> +	{
>> +	  saved_base_reg2 = *ad.base_reg_loc2;
>> +	  lra_eliminate_reg_if_possible (ad.base_reg_loc2);
>> +	}
> We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it
> might be clearer without saved_base_reg2.  More below...
>> +      /* The following addressing is checked by constraints and
>> +	 usually target specific legitimate address hooks do not
>> +	 consider them valid.  */
>> +      || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
>> +      || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC
> typo: two PRE_DECs, although:
>> +      || GET_CODE (*addr_loc) == PRE_MODIFY
>> +      || GET_CODE (*addr_loc) == POST_MODIFY
> the whole lot could just be replaced by ad.base_modify_p, or perhaps
> even removed entirely given:
>> +      /* In this case we can not do anything because if it is wrong
>> +	 that is because of wrong displacement.	 Remember that any
>> +	 address was legitimate in non-strict sense before LRA.	 */
>> +      || ad.disp_loc == NULL)
> It doesn't seem worth validating the address at all for ad.disp_loc == NULL.
> E.g. something like:
>
>    if (ad.base_reg_loc != NULL
>        && (process_addr_reg
> 	  (ad.base_reg_loc, before,
> 	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
> 	    && find_regno_note (curr_insn, REG_DEAD,
> 				REGNO (*ad.base_reg_loc)) == NULL
> 	    ? after : NULL),
> 	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
>      {
>        change_p = true;
>        if (ad.base_reg_loc2 != NULL)
>          *ad.base_reg_loc2 = *ad.base_reg_loc;
>      }
>
>    if (ad.index_reg_loc != NULL
>        && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
>      change_p = true;
>
>    /* The address was valid before LRA.  We only change its form if the
>       address has a displacement, so if it has no displacement it must
>       still be valid.  */
>    if (ad.disp_loc == NULL)
>      return change_p;
>
>    /* See whether the address is still valid.  Some ports do not check
>       displacements for eliminable registers, so we replace them
>       temporarily with the elimination target.  */
>    saved_base_reg = saved_index_reg = NULL_RTX;
>    ...
>    if (ok_p)
>      return change_p;
Yes, it has sense.  I changed the code as you propose.
>> +#ifdef HAVE_lo_sum
>> +	  {
>> +	    rtx insn;
>> +	    rtx last = get_last_insn ();
>> +
>> +	    /* disp => lo_sum (new_base, disp)	*/
>> +	    insn = emit_insn (gen_rtx_SET
>> +			      (VOIDmode, new_reg,
>> +			       gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
>> +	    code = recog_memoized (insn);
>> +	    if (code >= 0)
>> +	      {
>> +		rtx save = *ad.disp_loc;
>> +
>> +		*ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
>> +		if (! valid_address_p (mode, *ad.disp_loc, as))
>> +		  {
>> +		    *ad.disp_loc = save;
>> +		    code = -1;
>> +		  }
>> +	      }
>> +	    if (code < 0)
>> +	      delete_insns_since (last);
>> +	  }
>> +#endif
> Nice :-)
>
> Purely for the record, I wondered whether the high part should be
> generated with emit_move_insn(_1) instead, with the rhs of the move
> being the HIGH rtx.  That would allow targets to cope with cases where
> the high part isn't represented directly as a HIGH.  E.g. on MIPS and
> Alpha, small-data accesses use the global register as the high part instead.
>
> However, both MIPS and Alpha accept small-data addresses as legitimate
> constants and addresses before and during reload and only introduce the
> split form after reload.  And I think that's how any other cases that
> aren't simple HIGHs should be handled too.  E.g. MIPS also represents
> GOT page loads as HIGHs until after reload, and only then lowers the
> HIGH to a GOT load.  Allowing the backend to generate anything other
> than a plain HIGH set here would be a double-edged sword.
>
> So after all that I agree that the gen_rtx_SET above is better than
> calling the move expanders.
Thanks for sharing your knowledge.
>> +	  /* index * scale + disp => new base + index * scale  */
>> +	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
>> +
>> +	  lra_assert (INDEX_REG_CLASS != NO_REGS);
>> +	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
>> +	  lra_assert (GET_CODE (*addr_loc) == PLUS);
>> +	  lra_emit_move (new_reg, *ad.disp_loc);
>> +	  if (CONSTANT_P (XEXP (*addr_loc, 1)))
>> +	    XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
>> +	  XEXP (*addr_loc, 0) = new_reg;
> The canonical form is (plus (mult ...) (reg)) rather than
> (plus (reg) (mult ...)), but it looks like we create the latter.
> I realise you try both forms here:
It might happen because equiv substitution in LRA.
>> +	  /* Some targets like ARM, accept address operands in
>> +	     specific order -- try exchange them if necessary.	*/
>> +	  if (! valid_address_p (mode, *addr_loc, as))
>> +	    {
>> +	      exchange_plus_ops (*addr_loc);
>> +	      if (! valid_address_p (mode, *addr_loc, as))
>> +		exchange_plus_ops (*addr_loc);
>> +	    }
> but I think we should try the canonical form first.  And I'd prefer it
> if we didn't try the other form at all, especially in 4.8.  It isn't
> really the backend's job to reject non-canonical rtl.  This might well
> be another case where some targets need a (hopefully small) tweak in
> order to play by the rules.
>
> Also, I suppose this section of code feeds back to my question on
> Wednesday about the distinction that LRA seems to make between the
> compile-time constant in:
>
>    (plus (reg X1) (const_int Y1))
>
> and the link-time constant in:
>
>    (plus (reg X2) (symbol_ref Y2))
>
> It looked like extract_address_regs classified X1 as a base register and
> X2 as an index register.  The difference between the two constants has
> no run-time significance though, and I think we should handle both X1
> and X2 as base registers (as I think reload does).
>
> I think the path above would then be specific to scaled indices.
> In the original address the "complex" index must come first and the
> displacement second.  In the modified address, the index would stay
> first and the new base register would be second.  More below.
As I wrote above the problem is also in that equiv substitution can 
create non-canonical forms.
>> +      /* We don't use transformation 'base + disp => base + new index'
>> +	 because of bad practice used in some machine descriptions
>> +	 (see comments for emit_spill_move).  */
>> +      /* base + disp => new base  */
> As before when commenting on emit_spill_move, I think we should leave
> the "bad machine description" stuff out of 4.8 and treat fixing the
> machine descriptions as part of the LRA port.
>
> In this case I think there's another reason not to reload the
> displacement into an index though: IIRC postreload should be able
> to optimise a sequence of address reloads that have the same base
> and different displacements.  LRA itself might try using "anchor"
> bases in future -- although obviously not in the initial merge --
> since that was one thing that LEGITIMIZE_RELOAD_ADDRESS was used for.
>
> E.g. maybe the justification could be:
>
>        /* base + disp => new base  */
>        /* Another option would be to reload the displacement into an
> 	 index register.  However, postreload has code to optimize
> 	 address reloads that have the same base and different
> 	 displacements, so reloading into an index register would
> 	 not necessarily be a win.  */
Fixed.
>> +      /* base + scale * index + disp => new base + scale * index  */
>> +      new_reg = base_plus_disp_to_reg (mode, as, &ad);
>> +      *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
>> +      if (! valid_address_p (mode, *addr_loc, as))
>> +	{
>> +	  /* Some targets like ARM, accept address operands in
>> +	     specific order -- try exchange them if necessary.	*/
>> +	  exchange_plus_ops (*addr_loc);
>> +	  if (! valid_address_p (mode, *addr_loc, as))
>> +	    exchange_plus_ops (*addr_loc);
>> +	}
> Same comment as above about canonical rtl.  Here we can have two
> registers -- in which case the base should come first -- or a more
> complex index -- in which case the index should come first.
>
> We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
> with the operands in either order, and let it take care of the details.
> Using simplify_gen_binary would help with the earlier index+disp case too.
Equiv substitution can create non-canonical forms.  There are 2 approaches:
o have a code for dealing with non-canonical forms (equiv substitution, 
target stupidity)
o always support canonical forms and require them from targets.

I decided to use the 1st variant but I am reconsidering it.  I'll try to 
fix before inclusion.  But I am not sure I have time for this.  All 
these changes makes LRA unstable. In fact, I've just found that changes 
I already made so far resulted in 2 SPEC2000 tests broken although GCC 
testsuite and bootstrap looks good.

As I wrote we also have two different approaches to implement LRA. Fist 
one is to make easier porting LRA to other targets (which still requires 
some changes on target side).  The 2nd one is to require targets play by 
strict rules.  I believe that the 2nd approach will slow down transition 
to LRA and removing reload.  But probably it is more right.

After recent changes required by you, we are now in some middle 
approach.  A lot of targets on lra branch are broken by these changes 
and I will need some efforts on target dependent side to make them working.

If it were only target stupidity (as in ARM case), I would remove this 
code right away.  But unfortunately, equiv substitution is also a 
problem.  And right now, I have no time to fix it using your approach.  
The whole idea of including LRA for x86 for 4.8 was to get a real 
experience and harsh testing for LRA.  I am sorry, Richard but I think 
solving this problem as you want might result in postponing LRA 
inclusion until gcc4.9.

>> +  /* If this is post-increment, first copy the location to the reload reg.  */
>> +  if (post && real_in != result)
>> +    emit_insn (gen_move_insn (result, real_in));
> Nit, but real_in != result can never be true AIUI, and I was confused how
> the code could be correct in that case.  Maybe just remove it, or make
> it an assert?
No, it might be true:

real_in = in == value ? incloc : in;
...
if (cond)
   result = incloc;
else
   result = ...

if (post && real_in != result)

So it is true if in==value && cond
>> +  /* We suppose that there are insns to add/sub with the constant
>> +     increment permitted in {PRE/POST)_{DEC/INC/MODIFY}.  At least the
>> +     old reload worked with this assumption.  If the assumption
>> +     becomes wrong, we should use approach in function
>> +     base_plus_disp_to_reg.  */
>> +  if (in == value)
>> +    {
>> +      /* See if we can directly increment INCLOC.  */
>> +      last = get_last_insn ();
>> +      add_insn = emit_insn (plus_p
>> +			    ? gen_add2_insn (incloc, inc)
>> +			    : gen_sub2_insn (incloc, inc));
>> +
>> +      code = recog_memoized (add_insn);
>> +      /* We should restore recog_data for the current insn.  */
> Looks like this comment might be a left-over, maybe from before the
> cached insn data?
Yes.  The very first variant used recog_data and was much slower.  I 
removed the comment.
>> +      /* Restore non-modified value for the result.  We prefer this
>> +	 way because it does not require an addition hard
>> +	 register.  */
>> +      if (plus_p)
>> +	{
>> +	  if (CONST_INT_P (inc))
>> +	    emit_insn (gen_add2_insn (result, GEN_INT (-INTVAL (inc))));
>> +	  else
>> +	    emit_insn (gen_sub2_insn (result, inc));
>> +	}
>> +      else if (CONST_INT_P (inc))
>> +	emit_insn (gen_add2_insn (result, inc));
> The last two lines look redundant.  The behaviour is the same as for
> the following else:
>> +      else
>> +	emit_insn (gen_add2_insn (result, inc));
Fixed.
> and I don't think there are any cases where !plus && CONST_INT_P (inc)
> would hold.
It seems I thought about this and therefore I added a placeholder which 
I can remove now.
>> +/* Main entry point of this file: search the body of the current insn
> s/this file/the constraints code/, since it's a static function.
Fixed.
>> +  if (change_p)
>> +    /* Changes in the insn might result in that we can not satisfy
>> +       constraints in lately used alternative of the insn.  */
>> +    lra_set_used_insn_alternative (curr_insn, -1);
> Maybe:
>
>    /* If we've changed the instruction then any alternative that
>       we chose previously may no longer be valid.  */
Fixed.
>> +      rtx x;
>> +
>> +      curr_swapped = !curr_swapped;
>> +      if (curr_swapped)
>> +	{
>> +	  x = *curr_id->operand_loc[commutative];
>> +	  *curr_id->operand_loc[commutative]
>> +	    = *curr_id->operand_loc[commutative + 1];
>> +	  *curr_id->operand_loc[commutative + 1] = x;
>> +	  /* Swap the duplicates too.  */
>> +	  lra_update_dup (curr_id, commutative);
>> +	  lra_update_dup (curr_id, commutative + 1);
>> +	  goto try_swapped;
>> +	}
>> +      else
>> +	{
>> +	  x = *curr_id->operand_loc[commutative];
>> +	  *curr_id->operand_loc[commutative]
>> +	    = *curr_id->operand_loc[commutative + 1];
>> +	  *curr_id->operand_loc[commutative + 1] = x;
>> +	  lra_update_dup (curr_id, commutative);
>> +	  lra_update_dup (curr_id, commutative + 1);
>> +	}
> The swap code is the same in both cases, so I think it'd be better to
> make it common.  Or possibly a helper function, since the same code
> appears again later on.
Fixed.
>> +	if (GET_CODE (op) == PLUS)
>> +	  {
>> +	    plus = op;
>> +	    op = XEXP (op, 1);
>> +	  }
> Sorry, I'm complaining about old reload code again, but: does this
> actually happen in LRA?  In reload, a register operand could become a
> PLUS because of elimination, but I thought LRA did things differently.
> Besides, this is only needed for:
No, I don't think it happens in LRA.  It is a leftover from reload 
code.  I removed it.
>> +	if (CONST_POOL_OK_P (mode, op)
>> +	    && ((targetm.preferred_reload_class
>> +		 (op, (enum reg_class) goal_alt[i]) == NO_REGS)
>> +		|| no_input_reloads_p)
>> +	    && mode != VOIDmode)
>> +	  {
>> +	    rtx tem = force_const_mem (mode, op);
>> +	
>> +	    change_p = true;
>> +	    /* If we stripped a SUBREG or a PLUS above add it back.  */
>> +	    if (plus != NULL_RTX)
>> +	      tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem);
> and we shouldn't have (plus (constant ...) ...) after elimination
> (or at all outside of a CONST).  I don't understand why the code is
> needed even in reload.
I removed the plus code.
>> +  for (i = 0; i < n_operands; i++)
>> +    {
>> +      rtx old, new_reg;
>> +      rtx op = *curr_id->operand_loc[i];
>> +
>> +      if (goal_alt_win[i])
>> +	{
>> +	  if (goal_alt[i] == NO_REGS
>> +	      && REG_P (op)
>> +	      && lra_former_scratch_operand_p (curr_insn, i))
>> +	    change_class (REGNO (op), NO_REGS, "      Change", true);
> I think this could do with a comment.  Does setting the class to NO_REGS
> indirectly cause the operand to be switched back to a SCRATCH?
When we assign NO_REGS it means that we will not assign a hard register 
to the scratch pseudo and the scratch pseudo will be spilled.  Spilled 
scratch pseudos are transformed back to scratches at the LRA end.  I've 
added a comment.
>> +	  push_to_sequence (before);
>> +	  rclass = base_reg_class (GET_MODE (op), MEM_ADDR_SPACE (op),
>> +				   MEM, SCRATCH);
>> +	  if (code == PRE_DEC || code == POST_DEC
>> +	      || code == PRE_INC || code == POST_INC
>> +	      || code == PRE_MODIFY || code == POST_MODIFY)
> Very minor, but: GET_RTX_CLASS (code) == RTX_AUTOINC
Fixed.
>> +	  enum machine_mode mode;
>> +	  rtx reg, *loc;
>> +	  int hard_regno, byte;
>> +	  enum op_type type = curr_static_id->operand[i].type;
>> +
>> +	  loc = curr_id->operand_loc[i];
>> +	  mode = get_op_mode (i);
>> +	  if (GET_CODE (*loc) == SUBREG)
>> +	    {
>> +	      reg = SUBREG_REG (*loc);
>> +	      byte = SUBREG_BYTE (*loc);
>> +	      if (REG_P (reg)
>> +		  /* Strict_low_part requires reload the register not
>> +		     the sub-register.	*/
>> +		  && (curr_static_id->operand[i].strict_low
>> +		      || (GET_MODE_SIZE (mode)
>> +			  <= GET_MODE_SIZE (GET_MODE (reg))
>> +			  && (hard_regno
>> +			      = get_try_hard_regno (REGNO (reg))) >= 0
>> +			  && (simplify_subreg_regno
>> +			      (hard_regno,
>> +			       GET_MODE (reg), byte, mode) < 0)
>> +			  && (goal_alt[i] == NO_REGS
>> +			      || (simplify_subreg_regno
>> +				  (ira_class_hard_regs[goal_alt[i]][0],
>> +				   GET_MODE (reg), byte, mode) >= 0)))))
>> +		{
>> +		  loc = &SUBREG_REG (*loc);
>> +		  mode = GET_MODE (*loc);
>> +		}
>> +	  old = *loc;
> I think this needs a bit more justifying commentary (although I'm glad
> to see it's much simpler than the reload version :-)).  One thing in
> particular I didn't understand was why we don't reload the inner
> register of a paradoxical subreg.
It seems not necessary. As we use the biggest mode, pseudo gets all hard 
registers or wide stack slot for paradoxical subreg.
>> +	  if (get_reload_reg (type, mode, old, goal_alt[i], "", &new_reg)
>> +	      && type != OP_OUT)
>> +	    {
>> +	      push_to_sequence (before);
>> +	      lra_emit_move (new_reg, old);
>> +	      before = get_insns ();
>> +	      end_sequence ();
>> +	    }
>> +	  *loc = new_reg;
>> +	  if (type != OP_IN)
>> +	    {
>> +	      if (find_reg_note (curr_insn, REG_UNUSED, old) == NULL_RTX)
>> +		{
>> +		  start_sequence ();
>> +		  /* We don't want sharing subregs as the pseudo can
>> +		     get a memory and the memory can be processed
>> +		     several times for eliminations.  */
>> +		  lra_emit_move (GET_CODE (old) == SUBREG && type == OP_INOUT
>> +				 ? copy_rtx (old) : old,
>> +				 new_reg);
> I think this should simply be:
>
>    lra_emit_move (type == OP_INOUT ? copy_rtx (old) : old, new_reg);
>
> leaving copy_rtx to figure out which rtxes can be shared.  No comment
> would be needed for that.
Fixed.
>> +		  emit_insn (after);
>> +		  after = get_insns ();
>> +		  end_sequence ();
>> +		}
>> +	      *loc = new_reg;
>> +	    }
> Very minor again, but: redundant *loc assignment (so that the two nested
> if statements collapse to one).
Fixed.  It seems a leftover from previous modifications.  First variants 
of the function were more complicated.
>> +      else
>> +	{
>> +	  lra_assert (INSN_CODE (curr_insn) < 0);
>> +	  error_for_asm (curr_insn,
>> +			 "inconsistent operand constraints in an %<asm%>");
>> +	  /* Avoid further trouble with this insn.  */
>> +	  PATTERN (curr_insn) = gen_rtx_USE (VOIDmode, const0_rtx);
>> +	  return false;
> Is this code handling a different case from the corresponding error
> code in curr_insn_transform?  If so, it probably deserves a comment
> explaining the difference.
I have no idea.  It is taken from reload.  I think we could try to 
remove it.  If process_alt_operands finds an alternative we should 
generate a code anyway.  I'll make it unreachable.
>> +/* Process all regs in debug location *LOC and change them on
>> +   equivalent substitution.  Return true if any change was done.  */
>> +static bool
>> +debug_loc_equivalence_change_p (rtx *loc)
> This doesn't keep the rtl in canonical form.  Probably the easiest and
> best fix is to use simplify_replace_fn_rtx, which handles all that for you.
> (simplify_replace_fn_rtx returns the original rtx if no change was made.)
See comments above about canonical forms.
>> +  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
>> +    ira_reg_equiv[i].profitable_p = true;
>> +  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
>> +    if (lra_reg_info[i].nrefs != 0)
>> +      {
>> +	if ((hard_regno = lra_get_regno_hard_regno (i)) >= 0)
>> +	  {
>> +	    int j, nregs = hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (i)];
>> +	
>> +	    for (j = 0; j < nregs; j++)
>> +	      df_set_regs_ever_live (hard_regno + j, true);
>> +	  }
>> +	else if ((x = get_equiv_substitution (regno_reg_rtx[i])) != NULL_RTX)
>> +	  {
>> +	    if (! first_p && contains_reg_p (x, false, false))
>> +	      /* After RTL transformation, we can not guarantee that
>> +		 pseudo in the substitution was not reloaded which
>> +		 might make equivalence invalid.  For example, in
>> +		 reverse equiv of p0
>> +
>> +		 p0 <- ...
>> +		 ...
>> +		 equiv_mem <- p0
>> +
>> +		 the memory address register was reloaded before the
>> +		 2nd insn.  */
>> +	      ira_reg_equiv[i].defined_p = false;
>> +	    if (contains_reg_p (x, false, true))
>> +	      ira_reg_equiv[i].profitable_p = false;
>> +	  }
>> +      }
> Do we need two loops because the second may check for equivalences
> of other pseudos besides "i"?  I couldn't see how offhand, but I might
> well have missed something.  Might be worth a comment.
No, we don't need two loops.  I merged them.
>> +	      dest_reg = SET_DEST (set);
>> +	      /* The equivalence pseudo could be set up as SUBREG in a
>> +		 case when it is a call restore insn in a mode
>> +		 different from the pseudo mode.  */
>> +	      if (GET_CODE (dest_reg) == SUBREG)
>> +		dest_reg = SUBREG_REG (dest_reg);
>> +	      if ((REG_P (dest_reg)
>> +		   && (x = get_equiv_substitution (dest_reg)) != dest_reg
>> +		   /* Remove insns which set up a pseudo whose value
>> +		      can not be changed.  Such insns might be not in
>> +		      init_insns because we don't update equiv data
>> +		      during insn transformations.
>> +			
>> +		      As an example, let suppose that a pseudo got
>> +		      hard register and on the 1st pass was not
>> +		      changed to equivalent constant.  We generate an
>> +		      additional insn setting up the pseudo because of
>> +		      secondary memory movement.  Then the pseudo is
>> +		      spilled and we use the equiv constant.  In this
>> +		      case we should remove the additional insn and
>> +		      this insn is not init_insns list.	 */
>> +		   && (! MEM_P (x) || MEM_READONLY_P (x)
>> +		       || in_list_p (curr_insn,
>> +				     ira_reg_equiv
>> +				     [REGNO (dest_reg)].init_insns)))
> This is probably a stupid question, sorry, but when do we ever want
> to keep an assignment to a substituted pseudo?  I.e. why isn't this just:
>
> 	      if ((REG_P (dest_reg)
> 		   && (x = get_equiv_substitution (dest_reg)) != dest_reg)
Equivalence can be memory location.  It means you can assign many 
different values to the location.  Removing them would generate a wrong 
code.  For example, if you use your simple variant of code, you will 
have > 100 test failures of GCC testsuite.
>> +/* Info about last usage of registers in EBB to do inheritance/split
>> +   transformation.  Inheritance transformation is done from a spilled
>> +   pseudo and split transformations from a hard register or a pseudo
>> +   assigned to a hard register.	 */
>> +struct usage_insns
>> +{
>> +  /* If the value is equal to CURR_USAGE_INSNS_CHECK, then the member
>> +     value INSNS is valid.  The insns is chain of optional debug insns
>> +     and a finishing non-debug insn using the corresponding reg.  */
>> +  int check;
>> +  /* Value of global reloads_num at the ???corresponding next insns.  */
>> +  int reloads_num;
>> +  /* Value of global reloads_num at the ???corresponding next insns.  */
>> +  int calls_num;
> "???s".  Probably "at the last instruction in INSNS" if that's accurate
> (because debug insns in INSNS don't affect these fields).
Fixed.
>> +/* Process all regs OLD_REGNO in location *LOC and change them on the
>> +   reload pseudo NEW_REG.  Return true if any change was done.	*/
>> +static bool
>> +substitute_pseudo (rtx *loc, int old_regno, rtx new_reg)
> This is another case where I found the term "reload pseudo" a bit confusing,
> since AIUI new_reg can be an inheritance or split pseudo rather than a pseudo
> created solely for insn reloads.  I'll follow up about that on the original
> thread.  Maybe just:
>
> /* Replace all references to register OLD_REGNO in *LOC with pseudo register
>     NEW_REG.  Return true if any change was made.  */
Fixed.
>> +  code = GET_CODE (x);
>> +  if (code == REG && (int) REGNO (x) == old_regno)
>> +    {
>> +      *loc = new_reg;
>> +      return true;
>> +    }
> Maybe assert that the modes are the same?
I've just realized that the modes might be different because of 
secondary_memory_needed_mode in split_reg (probably small chance).  So I 
added a code dealing with it.  Thanks, Richard.
>> +/* Do inheritance transformation for insn INSN defining (if DEF_P) or
>> +   using ORIGINAL_REGNO where the subsequent insn(s) in EBB (remember
>> +   we traverse insns in the backward direction) for the original regno
>> +   is NEXT_USAGE_INSNS.	 The transformations look like
> Maybe:
>
> /* Do interitance transformations for insn INSN, which defines (if DEF_P)
>     or uses ORIGINAL_REGNO.  NEXT_USAGE_INSNS specifies which instruction
>     in the EBB next uses ORIGINAL_REGNO; it has the same form as the
>     "insns" field of usage_insns.
>
>     The transformations look like:
Fixed.
>> +
>> +     p <- ...		  i <- ...
>> +     ...		  p <- i    (new insn)
>> +     ...	     =>
>> +     <- ... p ...	  <- ... i ...
>> +   or
>> +     ...		  i <- p    (new insn)
>> +     <- ... p ...	  <- ... i ...
>> +     ...	     =>
>> +     <- ... p ...	  <- ... i ...
>> +   where p is a spilled original pseudo and i is a new inheritance pseudo.
>> +
>> +   The inheritance pseudo has the smallest class of two classes CL and
>> +   class of ORIGINAL REGNO.  It will have unique value if UNIQ_P.  The
>> +   unique value is necessary for correct assignment to inheritance
>> +   pseudo for input of an insn which should be the same as output
>> +   (bound pseudos).  Return true if we succeed in such
>> +   transformation.  */
> This comment looks really good, but I still wasn't sure about the
> UNIQ_P thing.  AIUI this is for cases like:
>
>                         i <- p            [new insn]
>     r <- ... p ...      r <- ... i ...    [input reload]
>     r <- ... r ...   => r <- ... r ...    [original insn]
>     <- r                <- r              [output reload]
>     ....                ......
>     <- ... p ...        <- ... i ...      [next ref]
>
> where "r" is used on both sides of the original insn and where the
> output reload assigns to something other than "p" (otherwise "next ref"
> wouldn't be the next ref).  But why does this affect the way "i" is created?
> I think it'd be worth expanding that part a bit.
I realized that code is not necessary anymore.  It was necessary when I 
generated 2 pseudos for reloading matching operands with different 
modes.  Now I use one pseudo for this and use (may be illegal in other 
parts of GCC) subregs of this pseudo.  I modified the code.
>> +  if (! ira_reg_classes_intersect_p[cl][rclass])
>> +    {
>> +      if (lra_dump_file != NULL)
>> +	{
>> +	  fprintf (lra_dump_file,
>> +		   "	Rejecting inheritance for %d "
>> +		   "because of too different classes %s and %s\n",
> Suggest s/too different/disjoint/
Fixed.
>> +  if ((ira_class_subset_p[cl][rclass] && cl != rclass)
>> +      || ira_class_hard_regs_num[cl] < ira_class_hard_regs_num[rclass])
>> +    {
>> +      if (lra_dump_file != NULL)
>> +	fprintf (lra_dump_file, "    Use smallest class of %s and %s\n",
>> +		 reg_class_names[cl], reg_class_names[rclass]);
>> +
>> +      rclass = cl;
>> +    }
> I don't understand the second line of the if statement.  Why do we prefer
> classes with fewer allocatable registers?
>
> My guess before reading the code was that we'd use the subunion of CL and
> RCLASS, so maybe a comment explaining why we use this choice would help.
I added a comment

      /* We don't use a subset of two classes because it can be
         NO_REGS.  This transformation is still profitable in most
         cases even if the classes are not intersected as register
move is probably cheaper than a memory load.  */

>> +  if (NEXT_INSN (new_insns) != NULL_RTX)
>> +    {
>> +      if (lra_dump_file != NULL)
>> +	{
>> +	  fprintf (lra_dump_file,
>> +		   "	Rejecting inheritance %d->%d "
>> +		   "as it results in 2 or more insns:\n",
>> +		   original_regno, REGNO (new_reg));
>> +	  debug_rtl_slim (lra_dump_file, new_insns, NULL_RTX, -1, 0);
>> +	  fprintf (lra_dump_file,
>> +		   "	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
>> +	}
>> +      return false;
>> +    }
> Hmm, I wasn't sure about this at first.  Some targets define patterns for
> multiword moves and split them later.  Others expose the split straight away.
> The two approaches don't really imply any difference in cost, so I didn't
> want us to penalise the latter.
>
> But I suppose on targets that split straight away, lower-subreg would
> tend to replace the multiword pseudo with individual word-sized pseudos,
> so LRA shouldn't see them.  I suppose this check shouldn't matter in practice.
I think I saw it on practice and therefore I added the code.  I don't 
remember details although.  I tried to document in commentaries some 
specific cases in LRA.  But probably I should have made more comments 
with examples in the code because it is easy to forget them (too many 
cases and details).
>> +  if (def_p)
>> +    lra_process_new_insns (insn, NULL_RTX, new_insns,
>> +			   "Add original<-inheritance");
>> +  else
>> +    lra_process_new_insns (insn, new_insns, NULL_RTX,
>> +			   "Add inheritance<-pseudo");
> Maybe "original" rather than "pseudo" here too for consistency.
Fixed.
>> +/* Return true if we need a split for hard register REGNO or pseudo
>> +   REGNO which was assigned to a hard register.
>> +   POTENTIAL_RELOAD_HARD_REGS contains hard registers which might be
>> +   used for reloads since the EBB end.	It is an approximation of the
>> +   used hard registers in the split range.  The exact value would
>> +   require expensive calculations.  If we were aggressive with
>> +   splitting because of the approximation, the split pseudo will save
>> +   the same hard register assignment and will be removed in the undo
>> +   pass.  We still need the approximation because too aggressive
>> +   splitting would result in too inaccurate cost calculation in the
>> +   assignment pass because of too many generated moves which will be
>> +   probably removed in the undo pass.  */
>> +static inline bool
>> +need_for_split_p (HARD_REG_SET potential_reload_hard_regs, int regno)
>> +{
>> +  int hard_regno = regno < FIRST_PSEUDO_REGISTER ? regno : reg_renumber[regno];
>> +
>> +  lra_assert (hard_regno >= 0);
>> +  return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
>> +	   && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
>> +	   && (usage_insns[regno].reloads_num
>> +	       + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
>> +	   && ((regno < FIRST_PSEUDO_REGISTER
>> +		&& ! bitmap_bit_p (&ebb_global_regs, regno))
>> +	       || (regno >= FIRST_PSEUDO_REGISTER
>> +		   && lra_reg_info[regno].nrefs > 3
>> +		   && bitmap_bit_p (&ebb_global_regs, regno))))
>> +	  || (regno >= FIRST_PSEUDO_REGISTER && need_for_call_save_p (regno)));
>> +}
> Could you add more commentary about the thinking behind this particular
> choice of heuristic?  E.g. I wasn't sure what the reloads_num check did,
> or why we only split hard registers that are local to the EBB and only
> split pseudos that aren't.
>
> The 2 and 3 numbers seemed a bit magic too.  I suppose the 2 has
> something to do with "one save and one restore", but I wasn't sure
> why we applied it only for pseudos.  (AIUI that arm of the check
> deals with "genuine" split pseudos rather than call saves & restores.)
>
> Still, it says a lot for the high quality of LRA that, out of all the
> 1000s of lines of code I've read so far, this is the only part that
> didn't seem to have an intuitive justification.
Yes, right.  I checked many parameters and finally picked out the above 
ones.  What I found is that aggressive and even moderate splitting 
usually result in worse code.  Most splitting is undone and if we do 
splitting aggressively we generate a lot of insns which will be removed 
but their costs are taken into account during assignment pass.  
Inheritance has quite bigger chance to successful.  One reason why 
splitting is undone is that spilling + inheritance of short living 
pseudos is a substitution of splitting in some way.  Therefore I do 
splitting for long living pseudos. That means non local pseudos.

I added the following comments.

   return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
            && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
            /* We need at least 2 reloads to make pseudo splitting
               profitable.  We should provide hard regno splitting in
               any case to solve 1st insn scheduling problem when
               moving hard register definition up might result in
               impossibility to find hard register for reload pseudo of
               small register class.  */
            && (usage_insns[regno].reloads_num
                + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
            && (regno < FIRST_PSEUDO_REGISTER
                /* For short living pseudos, spilling + inheritance can
                   be considered a substitution for splitting.
                   Therefore we do not splitting for local pseudos.  It
                   decreases also aggressiveness of splitting.  The
                   minimal number of references is chosen taking into
                   account that for 2 references splitting has no sense
                   as we can just spill the pseudo.  */
                || (regno >= FIRST_PSEUDO_REGISTER
                    && lra_reg_info[regno].nrefs > 3
                    && bitmap_bit_p (&ebb_global_regs, regno))))

I removed checking that hard_regno is local.  It was a parameter to 
decrease aggressiveness of hard register splitting but I realized that 
it contradicts the reason for splitting to solve 1st insn scheduling 
problem because regional scheduling can move hard register definition 
beyond BB.
>> +  for (i = 0;
>> +       (cl = reg_class_subclasses[allocno_class][i]) != LIM_REG_CLASSES;
>> +       i++)
>> +    if (! SECONDARY_MEMORY_NEEDED (cl, hard_reg_class, mode)
>> +	&& ! SECONDARY_MEMORY_NEEDED (hard_reg_class, cl, mode)
>> +	&& TEST_HARD_REG_BIT (reg_class_contents[cl], hard_regno)
>> +	&& (best_cl == NO_REGS
>> +	    || (hard_reg_set_subset_p (reg_class_contents[best_cl],
>> +				       reg_class_contents[cl])
>> +		&& ! hard_reg_set_equal_p (reg_class_contents[best_cl],
>> +					   reg_class_contents[cl]))))
>> +      best_cl = cl;
> OK, so this suggestion isn't backed up by any evidence, but what do
> you think about this alternative:
>
> 	&& (best_cl == NO_REGS
> 	    || (ira_class_hard_regs_num[best_cl]
> 		< ira_class_hard_regs_num[cl]))
>
> which should choose the largest class that requires no secondary memory.
> It looks like the subset version could get "stuck" on a single-register
> class that happens to be early in the list but has no superclass smaller
> than allocno_class.
It seems reasonable.  I choose your variant.
>> +/* Do split transformation for insn INSN defining or
>> +   using ORIGINAL_REGNO where the subsequent insn(s) in EBB (remember
>> +   we traverse insns in the backward direction) for the original regno
>> +   is NEXT_USAGE_INSNS.	 The transformations look like
> Same suggestion as for the inheritance function above.
Fixed.
>> +  if (call_save_p)
>> +    save = emit_spill_move (true, new_reg, original_reg, -1);
>> +  else
>> +    {
>> +      start_sequence ();
>> +      emit_move_insn (new_reg, original_reg);
>> +      save = get_insns ();
>> +      end_sequence ();
>> +    }
>> +  if (NEXT_INSN (save) != NULL_RTX)
>> +    {
>> +      lra_assert (! call_save_p);
> Is emit_spill_move really guaranteed to return only one instruction in
> cases where emit_move_insn might not?  Both of them use emit_move_insn_1
> internally, so I wouldn't have expected much difference.
>
> In fact I wasn't really sure why:
>
>    save = gen_move_insn (new, original_reg);
For caller save, modes new and original can be different.
> wouldn't be correct for both.
>
> Same comments for the restore code.
I fixed it using only emit_spill_move.
>> +      /* See which defined values die here.  */
>> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +	if (reg->type == OP_OUT && ! reg->early_clobber
>> +	    && (! reg->subreg_p
>> +		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
>> +	  bitmap_clear_bit (&live_regs, reg->regno);
>> +      /* Mark each used value as live.	*/
>> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +	if (reg->type == OP_IN
>> +	    && bitmap_bit_p (&check_only_regs, reg->regno))
>> +	  bitmap_set_bit (&live_regs, reg->regno);
>> +      /* Mark early clobber outputs dead.  */
>> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +	if (reg->type == OP_OUT && reg->early_clobber && ! reg->subreg_p)
>> +	  bitmap_clear_bit (&live_regs, reg->regno);
> I don't think this would be correct for unreloaded insns because an
> unreloaded insn can have the same pseudo as an input and an earlyclobber
> output.  (Probably not an issue here, since we're called after the
> constraints pass.)  There's also the case of matched earlyclobber operands,
> where the matched input is specifically not affected by the earlyclobber.
>
> I'd have thought:
>
>        /* See which defined values die here.  */
>        for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> 	if (reg->type == OP_OUT
> 	    && (! reg->subreg_p
> 		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
> 	  bitmap_clear_bit (&live_regs, reg->regno);
>        /* Mark each used value as live.	*/
>        for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> 	if (reg->type == OP_IN
> 	    && bitmap_bit_p (&check_only_regs, reg->regno))
> 	  bitmap_set_bit (&live_regs, reg->regno);
>
> ought to be correct, but perhaps I'm missing something.
>
> (I'm still uneasy about the special treatment of bound pseudos here.
> A clobber really does seem better.)
It seems a right variant.  I am going to use it.
>> +      /* It is quite important to remove dead move insns because it
>> +	 means removing dead store, we don't need to process them for
>> +	 constraints, and unfortunately some subsequent optimizations
>> +	 (like shrink-wrapping) currently based on assumption that
>> +	 there are no trivial dead insns.  */
> Maybe best to drop the "subsequent optimizations" part.  This comment
> is unlikely to be updated after any change to shrink-wrapping & co.,
> and the first two justifications seem convincing enough on their own.
Fixed.
>> +/* Add inheritance info REGNO and INSNS.  */
>> +static void
>> +add_to_inherit (int regno, rtx insns)
>> +{
>> +  int i;
>> +
>> +  for (i = 0; i < to_inherit_num; i++)
>> +    if (to_inherit[i].regno == regno)
>> +      return;
> Is the existing "insns" field guaranteed to match the "insns" parameter
> in this case, or might they be different?  Probably worth an assert or
> comment respectively.
I added a comment.
>> +/* Return first (if FIRST_P) or last non-debug insn in basic block BB.
>> +   Return null if there are no non-debug insns in the block.  */
>> +static rtx
>> +get_non_debug_insn (bool first_p, basic_block bb)
>> +{
>> +  rtx insn;
>> +
>> +  for (insn = first_p ? BB_HEAD (bb) : BB_END (bb);
>> +       insn != NULL_RTX && ! NONDEBUG_INSN_P (insn);
>> +       insn = first_p ? NEXT_INSN (insn) : PREV_INSN (insn))
>> +    ;
>> +  if (insn != NULL_RTX && BLOCK_FOR_INSN (insn) != bb)
>> +    insn = NULL_RTX;
>> +  return insn;
>> +}
> It probably doesn't matter in practice, but it looks like it'd be better
> to limit the walk to the bb, rather than walking until null and then
> testing the bb after the walk.
>
> Maybe it would be eaiser to split into two functions, since first_p is
> always constant.  E.g.:
>
>    rtx insn;
>
>    FOR_BB_INSNS (bb, insn)
>      if (NONDEBUG_INSN_P (insn))
>        return insn;
>    return NULL_RTX;
>
> for first_p.  s/FOR_BB_INSNS/FOR_BB_INSNS_REVERSE/ for !first_p.
That is better.  Fixed.
>> +/* Set up RES by registers living on edges FROM except the edge (FROM,
>> +   TO) or by registers set up in a jump insn in BB FROM.  */
>> +static void
>> +get_live_on_other_edges (basic_block from, basic_block to, bitmap res)
>> +{
>> +  int regno;
>> +  rtx last;
>> +  struct lra_insn_reg *reg;
>> +  edge e;
>> +  edge_iterator ei;
>> +
>> +  lra_assert (to != NULL);
>> +  bitmap_clear (res);
>> +  FOR_EACH_EDGE (e, ei, from->succs)
>> +    if (e->dest != to)
>> +      bitmap_ior_into (res, DF_LR_IN (e->dest));
>> +  if ((last = get_non_debug_insn (false, from)) == NULL_RTX || ! JUMP_P (last))
>> +    return;
>> +  curr_id = lra_get_insn_recog_data (last);
>> +  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +    if (reg->type != OP_IN
>> +	&& (regno = reg->regno) >= FIRST_PSEUDO_REGISTER)
>> +      bitmap_set_bit (res, regno);
>> +}
> Probably a silly question, sorry, but: why does the JUMP_P part only
> include pseudo registers?  The other calculations (here and elsewhere)
> seem to handle both hard and pseudo registers.
No, that is not a silly question.  I think it is a potential bug 
although it is probability is very small.  Originally, splitting and 
inheritance were done only on pseudos.  After that, splitting hard 
registers were added and that code was not changed.  Thanks, Richard.  I 
removed this condition.
>> +/* Do inheritance/split transformations in EBB starting with HEAD and
>> +   finishing on TAIL.  We process EBB insns in the reverse order.
>> +   Return true if we did any inheritance/split transformation in the
>> +   EBB.
>> +
>> +   We should avoid excessive splitting which results in worse code
>> +   because of inaccurate cost calculations for spilling new split
>> +   pseudos in such case.  To achieve this we do splitting only if
>> +   register pressure is high in given basic block and there reload
> "...and there are reload"
Fixed.
>> +   pseudos requiring hard registers.  We could do more register
>> +   pressure calculations at any given program point to avoid necessary
>> +   splitting even more but it is to expensive and the current approach
>> +   is well enough.  */
> "works well enough".
Fixed.
>> +  change_p = false;
>> +  curr_usage_insns_check++;
>> +  reloads_num = calls_num = 0;
>> +  /* Remember: we can remove the current insn.	*/
>> +  bitmap_clear (&check_only_regs);
>> +  last_processed_bb = NULL;
> I couldn't tell which part of the code the comment is referring to.
> Maybe left over?
It is a leftover.  I removed it.
>> +	  after_p = (last_insn != NULL_RTX && ! JUMP_P (last_insn)
>> +		     && (! CALL_P (last_insn)
>> +			 || (find_reg_note (last_insn,
>> +					   REG_NORETURN, NULL) == NULL_RTX
>> +			     && ((next_insn
>> +				  = next_nonnote_nondebug_insn (last_insn))
>> +				 == NULL_RTX
>> +				 || GET_CODE (next_insn) != BARRIER))));
> Genuine question, but: when are the last four lines needed?  The condition
> that they're testing for sounds like a noreturn call.
Yes, probably you are right.  I also can not imagine such situation.  I 
am removing them.
>> +      if (src_regno < lra_constraint_new_regno_start
>> +	  && src_regno >= FIRST_PSEUDO_REGISTER
>> +	  && reg_renumber[src_regno] < 0
>> +	  && dst_regno >= lra_constraint_new_regno_start
>> +	  && (cl = lra_get_allocno_class (dst_regno)) != NO_REGS)
>> +	{
>> +	  /* 'reload_pseudo <- original_pseudo'.  */
>> +	  reloads_num++;
>> +	  succ_p = false;
>> +	  if (usage_insns[src_regno].check == curr_usage_insns_check
>> +	      && (next_usage_insns = usage_insns[src_regno].insns) != NULL_RTX)
>> +	    succ_p = inherit_reload_reg (false,
>> +					 bitmap_bit_p (&lra_matched_pseudos,
>> +						       dst_regno),
>> +					 src_regno, cl,
>> +					 curr_insn, next_usage_insns);
>> +	  if (succ_p)
>> +	    change_p = true;
>> +	  else
>> +	    {
>> +	      usage_insns[src_regno].check = curr_usage_insns_check;
>> +	      usage_insns[src_regno].insns = curr_insn;
>> +	      usage_insns[src_regno].reloads_num = reloads_num;
>> +	      usage_insns[src_regno].calls_num = calls_num;
>> +	      usage_insns[src_regno].after_p = false;
>> +	    }
> Looks like this and other places could use the add_next_usage_insn
> helper function.
Fixed.
>> +	  if (cl != NO_REGS
>> +	      && hard_reg_set_subset_p (reg_class_contents[cl],
>> +					live_hard_regs))
>> +	    IOR_HARD_REG_SET (potential_reload_hard_regs,
>> +			      reg_class_contents[cl]);
> Redundant "cl != NO_REGS" check.  (Was a bit confused by that at first.)
Fixed.
> I don't understand the way potential_reload_hard_regs is set up.
> Why does it only include reload pseudos involved in moves of the form
> "reload_pseudo <- original_pseudo" and "original_pseudo <- reload_pseudo",
> but include those reloads regardless of whether inheritance is possible?
>
> I wondered whether it might be deliberately selective in order to speed
> up LRA, but we walk all the registers in an insn regardless.
>
> Same for reloads_num.
It is just simple heuristics.  They can be changed.  Adding new 
heuristics (more complicated ones) requires a lot of experiments. This 
work can be done lately.  LRA as now is not frozen.  Its development 
will be continued.
>> +	  if (cl != NO_REGS
>> +	      && hard_reg_set_subset_p (reg_class_contents[cl],
>> +					live_hard_regs))
>> +	    IOR_HARD_REG_SET (potential_reload_hard_regs,
>> +			      reg_class_contents[cl]);
> Same comment as for the previous block.
Fixed.
>> +		if (reg_renumber[dst_regno] < 0
>> +		    || (reg->type == OP_OUT && ! reg->subreg_p))
>> +		/* Invalidate.	*/
>> +		usage_insns[dst_regno].check = 0;
> Could you explain this condition a bit more?  Why does reg_renumber
> affect things?
I added a comment.
>> +/* This value affects EBB forming.  If probability of edge from EBB to
>> +   a BB is not greater than the following value, we don't add the BB
>> +   to EBB.  */
>> +#define EBB_PROBABILITY_CUTOFF (REG_BR_PROB_BASE / 2)
> It looks like schedule_ebbs uses a higher default cutoff for FDO.
> Would the same distinction be useful here?
>
> Maybe schedule_ebbs-like params would be good here too.
I am thinking about adding several parametersfor IRA and LRA.  But I'd 
like to add for the next release.
>> +  bitmap_and (&temp_bitmap_head, removed_pseudos, live);
>> +  EXECUTE_IF_SET_IN_BITMAP (&temp_bitmap_head, 0, regno, bi)
> This isn't going to have much effect on compile time, but
> EXECUTE_IF_AND_IN_BITMAP avoids the need for a temporary bitmap.
Fixed.
>> +/* Remove inheritance/split pseudos which are in REMOVE_PSEUDOS and
>> +   return true if we did any change.  The undo transformations for
>> +   inheritance looks like
>> +      i <- i2
>> +      p <- i	  =>   p <- i2
>> +   or removing
>> +      p <- i, i <- p, and i <- i3
>> +   where p is original pseudo from which inheritance pseudo i was
>> +   created, i and i3 are removed inheritance pseudos, i2 is another
>> +   not removed inheritance pseudo.  All split pseudos or other
>> +   occurrences of removed inheritance pseudos are changed on the
>> +   corresponding original pseudos.  */
>> +static bool
>> +remove_inheritance_pseudos (bitmap remove_pseudos)
>> +{
>> +  basic_block bb;
>> +  int regno, sregno, prev_sregno, dregno, restore_regno;
>> +  rtx set, prev_set, prev_insn;
>> +  bool change_p, done_p;
>> +
>> +  change_p = ! bitmap_empty_p (remove_pseudos);
> I wondered from the comment why we couldn't just return straight away
> for the empty set, but it looks like the function also schedules a
> constraints pass for instructions that keep their inheritance or
> split pseudos.  Is that right?  Might be worth mentioning that
> in the function comment if so.
Yes, that is right.  I've added a comment.
>> +	      else if (bitmap_bit_p (remove_pseudos, sregno)
>> +		       && bitmap_bit_p (&lra_inheritance_pseudos, sregno))
>> +		{
>> +		  /* Search the following pattern:
>> +		       inherit_or_split_pseudo1 <- inherit_or_split_pseudo2
>> +		       original_pseudo <- inherit_or_split_pseudo1
>> +		    where the 2nd insn is the current insn and
>> +		    inherit_or_split_pseudo2 is not removed.  If it is found,
>> +		    change the current insn onto:
>> +		       original_pseudo1 <- inherit_or_split_pseudo2.  */
> s/original_pseudo1/original_pseudo/ I think (we don't change the destination).
Yes, it is a typo.  Fixed.
>> +		  for (prev_insn = PREV_INSN (curr_insn);
>> +		       prev_insn != NULL_RTX && ! NONDEBUG_INSN_P (prev_insn);
>> +		       prev_insn = PREV_INSN (prev_insn))
>> +		    ;
>> +		  if (prev_insn != NULL_RTX && BLOCK_FOR_INSN (prev_insn) == bb
>> +		      && (prev_set = single_set (prev_insn)) != NULL_RTX
>> +		      /* There should be no subregs in insn we are
>> +			 searching because only the original reg might
>> +			 be in subreg when we changed the mode of
>> +			 load/store for splitting.  */
>> +		      && REG_P (SET_DEST (prev_set))
>> +		      && REG_P (SET_SRC (prev_set))
>> +		      && (int) REGNO (SET_DEST (prev_set)) == sregno
>> +		      && ((prev_sregno = REGNO (SET_SRC (prev_set)))
>> +			  >= FIRST_PSEUDO_REGISTER)
>> +		      && (lra_reg_info[sregno].restore_regno
>> +			  == lra_reg_info[prev_sregno].restore_regno)
>> +		      && ! bitmap_bit_p (remove_pseudos, prev_sregno))
> I'm sure the restore_regno comparison near the end is correct,
> but could you add a comment to explain it?  The substitution
> itself seems OK either way.
I added a comment.
>> +	      struct lra_insn_reg *reg;
>> +	      bool insn_change_p = false;
>> +
>> +	      curr_id = lra_get_insn_recog_data (curr_insn);
>> +	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +		if ((regno = reg->regno) >= lra_constraint_new_regno_start
>> +		    && lra_reg_info[regno].restore_regno >= 0)
> Is the first part of the comparison needed?  Most other places don't check,
> so it looked at first glance like there was something special here.
I think it is a leftover from older implementation where I worked 
differently with restore_reg.  I removed.
>> +		  {
>> +		    if (change_p && bitmap_bit_p (remove_pseudos, regno))
>> +		      {
>> +			restore_regno = lra_reg_info[regno].restore_regno;
>> +			substitute_pseudo (&curr_insn, regno,
>> +					   regno_reg_rtx[restore_regno]);
>> +			insn_change_p = true;
>> +		      }
>> +		    else if (NONDEBUG_INSN_P (curr_insn))
>> +		      {
>> +			lra_push_insn_and_update_insn_regno_info (curr_insn);
>> +			lra_set_used_insn_alternative_by_uid
>> +			  (INSN_UID (curr_insn), -1);
>> +		      }
>> +		  }
>> +	      if (insn_change_p)
>> +		{
>> +		  lra_update_insn_regno_info (curr_insn);
>> +		  if (lra_dump_file != NULL)
>> +		    {
>> +		      fprintf (lra_dump_file, "	   Restore original insn:\n");
>> +		      debug_rtl_slim (lra_dump_file,
>> +				      curr_insn, curr_insn, -1, 0);
>> +		    }
>> +		}
> AIUI we could have a partial restore, keeping some registers but
> restoring others.  Is that right?  The dump entry made it sounds
> like a full restore.
Yes, the dump is a bit misleading.  It might be a partial restore.
> Maybe something like:
>
> 	      struct lra_insn_reg *reg;
> 	      bool restored_regs_p = false;
> 	      bool kept_regs_p = false;
>
> 	      curr_id = lra_get_insn_recog_data (curr_insn);
> 	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> 		{
> 		  regno = reg->regno;
> 		  restore_regno = lra_reg_info[regno].restore_regno;
> 		  if (restore_regno >= 0)
> 		    {
> 		      if (change_p && bitmap_bit_p (remove_pseudos, regno))
> 			{
> 			  substitute_pseudo (&curr_insn, regno,
> 					     regno_reg_rtx[restore_regno]);
> 			  restored_regs_p = true;
> 			}
> 		      else
> 			kept_regs_p = true;
> 		    }
> 		}
> 	      if (NONDEBUG_INSN_P (curr_insn) && kept_regs_p)
> 		{
> 		  /* The instruction has changed since the previous
> 		     constraints pass.  */
> 		  lra_push_insn_and_update_insn_regno_info (curr_insn);
> 		  lra_set_used_insn_alternative_by_uid
> 		    (INSN_UID (curr_insn), -1);
> 		}
> 	      else if (restored_regs_p)
> 		/* The instruction has been restored to the form that
> 		   it had during the previous constraints pass.  */
> 		lra_update_insn_regno_info (curr_insn);
>
> 	      if (restored_regs_p && lra_dump_file != NULL)
> 		{
> 		  fprintf (lra_dump_file,
> 			   "	   Insn after restoring regs:\n");
> 		  debug_rtl_slim (lra_dump_file, curr_insn, curr_insn, -1, 0);
> 		}
>
> (if correct) might make the partial case clearer, but that's personal
> preference, so please feel free to ignore, chop or change.
  I used your variant.
> Also, is regno_reg_rtx[restore_regno] always correct when restoring
> registers?  I thought restore_regno could be a hard register and that
> the hard register might not necessarily be used in the same mode as
> the regno_reg_rtx[] entry.
Right.  I fixed it.
> That just leaves lra.h, lra-int.h and lra.c itself.  I'm hoping to have
> read through those by the middle of next week, but any comments about them
> will probably just be banal stuff (even more than my comments so far :-))
> so I deliberately left them to last.
>
Richard, thanks again for your invaluable feedback which made and will 
make LRA better.