This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: RFC: LRA for x86/x86-64 [7/9] -- continuation

From: Richard Sandiford <rdsandiford at googlemail dot com>
To: Vladimir Makarov <vmakarov at redhat dot com>
Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>
Date: Wed, 17 Oct 2012 12:24:31 +0100
Subject: Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
References: <5064DA43.90803@redhat.com> <87k3uvpxuj.fsf@sandifor-thinkpad.stglab.manchester.uk.ibm.com> <507E01BC.90006@redhat.com>
Thanks for all the updates.

Vladimir Makarov <vmakarov@redhat.com> writes:
>>> +	  /* index * scale + disp => new base + index * scale  */
>>> +	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
>>> +
>>> +	  lra_assert (INDEX_REG_CLASS != NO_REGS);
>>> +	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
>>> +	  lra_assert (GET_CODE (*addr_loc) == PLUS);
>>> +	  lra_emit_move (new_reg, *ad.disp_loc);
>>> +	  if (CONSTANT_P (XEXP (*addr_loc, 1)))
>>> +	    XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
>>> +	  XEXP (*addr_loc, 0) = new_reg;
>> The canonical form is (plus (mult ...) (reg)) rather than
>> (plus (reg) (mult ...)), but it looks like we create the latter.
>> I realise you try both forms here:
> It might happen because equiv substitution in LRA.
>>> +	  /* Some targets like ARM, accept address operands in
>>> +	     specific order -- try exchange them if necessary.	*/
>>> +	  if (! valid_address_p (mode, *addr_loc, as))
>>> +	    {
>>> +	      exchange_plus_ops (*addr_loc);
>>> +	      if (! valid_address_p (mode, *addr_loc, as))
>>> +		exchange_plus_ops (*addr_loc);
>>> +	    }
>> but I think we should try the canonical form first.  And I'd prefer it
>> if we didn't try the other form at all, especially in 4.8.  It isn't
>> really the backend's job to reject non-canonical rtl.  This might well
>> be another case where some targets need a (hopefully small) tweak in
>> order to play by the rules.
>>
>> Also, I suppose this section of code feeds back to my question on
>> Wednesday about the distinction that LRA seems to make between the
>> compile-time constant in:
>>
>>    (plus (reg X1) (const_int Y1))
>>
>> and the link-time constant in:
>>
>>    (plus (reg X2) (symbol_ref Y2))
>>
>> It looked like extract_address_regs classified X1 as a base register and
>> X2 as an index register.  The difference between the two constants has
>> no run-time significance though, and I think we should handle both X1
>> and X2 as base registers (as I think reload does).
>>
>> I think the path above would then be specific to scaled indices.
>> In the original address the "complex" index must come first and the
>> displacement second.  In the modified address, the index would stay
>> first and the new base register would be second.  More below.
> As I wrote above the problem is also in that equiv substitution can 
> create non-canonical forms.

Right.  Just in case there's a misunderstanding: I'm not complaining
about these routines internally using forms that are noncanonical
(which could happen because of equiv substitution, like you say).
I just think that what we eventually try to validate should be canonical.
In a way it's similar to how the simplify-rtx.c routines work.

If there are targets that only accept noncanonical rtl (which is after
all just a specific type of invalid rtl), they need to be fixed.

>>> +      /* base + scale * index + disp => new base + scale * index  */
>>> +      new_reg = base_plus_disp_to_reg (mode, as, &ad);
>>> +      *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
>>> +      if (! valid_address_p (mode, *addr_loc, as))
>>> +	{
>>> +	  /* Some targets like ARM, accept address operands in
>>> +	     specific order -- try exchange them if necessary.	*/
>>> +	  exchange_plus_ops (*addr_loc);
>>> +	  if (! valid_address_p (mode, *addr_loc, as))
>>> +	    exchange_plus_ops (*addr_loc);
>>> +	}
>> Same comment as above about canonical rtl.  Here we can have two
>> registers -- in which case the base should come first -- or a more
>> complex index -- in which case the index should come first.
>>
>> We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
>> with the operands in either order, and let it take care of the details.
>> Using simplify_gen_binary would help with the earlier index+disp case too.
> Equiv substitution can create non-canonical forms.  There are 2 approaches:
> o have a code for dealing with non-canonical forms (equiv substitution, 
> target stupidity)
> o always support canonical forms and require them from targets.
>
> I decided to use the 1st variant but I am reconsidering it.  I'll try to 
> fix before inclusion.  But I am not sure I have time for this.  All 
> these changes makes LRA unstable. In fact, I've just found that changes 
> I already made so far resulted in 2 SPEC2000 tests broken although GCC 
> testsuite and bootstrap looks good.

OK.  I'm happy to try fixing the noncanonical thing.

>>> +  /* If this is post-increment, first copy the location to the reload reg.  */
>>> +  if (post && real_in != result)
>>> +    emit_insn (gen_move_insn (result, real_in));
>> Nit, but real_in != result can never be true AIUI, and I was confused how
>> the code could be correct in that case.  Maybe just remove it, or make
>> it an assert?
> No, it might be true:
>
> real_in = in == value ? incloc : in;
> ...
> if (cond)
>    result = incloc;
> else
>    result = ...
>
> if (post && real_in != result)
>
> So it is true if in==value && cond

Sorry, what I meant was that cond is "! post && REG_P (incloc)":

  if (! post && REG_P (incloc))
    result = incloc;
  else
    result = lra_create_new_reg (GET_MODE (value), value, new_rclass,
				 "INC/DEC result");

so it can never be true in the "post" case quoted above.

>>> +	      dest_reg = SET_DEST (set);
>>> +	      /* The equivalence pseudo could be set up as SUBREG in a
>>> +		 case when it is a call restore insn in a mode
>>> +		 different from the pseudo mode.  */
>>> +	      if (GET_CODE (dest_reg) == SUBREG)
>>> +		dest_reg = SUBREG_REG (dest_reg);
>>> +	      if ((REG_P (dest_reg)
>>> +		   && (x = get_equiv_substitution (dest_reg)) != dest_reg
>>> +		   /* Remove insns which set up a pseudo whose value
>>> +		      can not be changed.  Such insns might be not in
>>> +		      init_insns because we don't update equiv data
>>> +		      during insn transformations.
>>> +			
>>> +		      As an example, let suppose that a pseudo got
>>> +		      hard register and on the 1st pass was not
>>> +		      changed to equivalent constant.  We generate an
>>> +		      additional insn setting up the pseudo because of
>>> +		      secondary memory movement.  Then the pseudo is
>>> +		      spilled and we use the equiv constant.  In this
>>> +		      case we should remove the additional insn and
>>> +		      this insn is not init_insns list.	 */
>>> +		   && (! MEM_P (x) || MEM_READONLY_P (x)
>>> +		       || in_list_p (curr_insn,
>>> +				     ira_reg_equiv
>>> +				     [REGNO (dest_reg)].init_insns)))
>> This is probably a stupid question, sorry, but when do we ever want
>> to keep an assignment to a substituted pseudo?  I.e. why isn't this just:
>>
>> 	      if ((REG_P (dest_reg)
>> 		   && (x = get_equiv_substitution (dest_reg)) != dest_reg)
> Equivalence can be memory location.  It means you can assign many 
> different values to the location.  Removing them would generate a wrong 
> code.  For example, if you use your simple variant of code, you will 
> have > 100 test failures of GCC testsuite.

OK :-)

>>> +/* Return true if we need a split for hard register REGNO or pseudo
>>> +   REGNO which was assigned to a hard register.
>>> +   POTENTIAL_RELOAD_HARD_REGS contains hard registers which might be
>>> +   used for reloads since the EBB end.	It is an approximation of the
>>> +   used hard registers in the split range.  The exact value would
>>> +   require expensive calculations.  If we were aggressive with
>>> +   splitting because of the approximation, the split pseudo will save
>>> +   the same hard register assignment and will be removed in the undo
>>> +   pass.  We still need the approximation because too aggressive
>>> +   splitting would result in too inaccurate cost calculation in the
>>> +   assignment pass because of too many generated moves which will be
>>> +   probably removed in the undo pass.  */
>>> +static inline bool
>>> +need_for_split_p (HARD_REG_SET potential_reload_hard_regs, int regno)
>>> +{
>>> +  int hard_regno = regno < FIRST_PSEUDO_REGISTER ? regno : reg_renumber[regno];
>>> +
>>> +  lra_assert (hard_regno >= 0);
>>> +  return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
>>> +	   && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
>>> +	   && (usage_insns[regno].reloads_num
>>> +	       + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
>>> +	   && ((regno < FIRST_PSEUDO_REGISTER
>>> +		&& ! bitmap_bit_p (&ebb_global_regs, regno))
>>> +	       || (regno >= FIRST_PSEUDO_REGISTER
>>> +		   && lra_reg_info[regno].nrefs > 3
>>> +		   && bitmap_bit_p (&ebb_global_regs, regno))))
>>> +	  || (regno >= FIRST_PSEUDO_REGISTER && need_for_call_save_p (regno)));
>>> +}
>> Could you add more commentary about the thinking behind this particular
>> choice of heuristic?  E.g. I wasn't sure what the reloads_num check did,
>> or why we only split hard registers that are local to the EBB and only
>> split pseudos that aren't.
>>
>> The 2 and 3 numbers seemed a bit magic too.  I suppose the 2 has
>> something to do with "one save and one restore", but I wasn't sure
>> why we applied it only for pseudos.  (AIUI that arm of the check
>> deals with "genuine" split pseudos rather than call saves & restores.)
>>
>> Still, it says a lot for the high quality of LRA that, out of all the
>> 1000s of lines of code I've read so far, this is the only part that
>> didn't seem to have an intuitive justification.
> Yes, right.  I checked many parameters and finally picked out the above 
> ones.  What I found is that aggressive and even moderate splitting 
> usually result in worse code.  Most splitting is undone and if we do 
> splitting aggressively we generate a lot of insns which will be removed 
> but their costs are taken into account during assignment pass.  
> Inheritance has quite bigger chance to successful.  One reason why 
> splitting is undone is that spilling + inheritance of short living 
> pseudos is a substitution of splitting in some way.  Therefore I do 
> splitting for long living pseudos. That means non local pseudos.
>
> I added the following comments.
>
>    return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
>             && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
>             /* We need at least 2 reloads to make pseudo splitting
>                profitable.  We should provide hard regno splitting in
>                any case to solve 1st insn scheduling problem when
>                moving hard register definition up might result in
>                impossibility to find hard register for reload pseudo of
>                small register class.  */
>             && (usage_insns[regno].reloads_num
>                 + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
>             && (regno < FIRST_PSEUDO_REGISTER
>                 /* For short living pseudos, spilling + inheritance can
>                    be considered a substitution for splitting.
>                    Therefore we do not splitting for local pseudos.  It
>                    decreases also aggressiveness of splitting.  The
>                    minimal number of references is chosen taking into
>                    account that for 2 references splitting has no sense
>                    as we can just spill the pseudo.  */
>                 || (regno >= FIRST_PSEUDO_REGISTER
>                     && lra_reg_info[regno].nrefs > 3
>                     && bitmap_bit_p (&ebb_global_regs, regno))))
>
> I removed checking that hard_regno is local.  It was a parameter to 
> decrease aggressiveness of hard register splitting but I realized that 
> it contradicts the reason for splitting to solve 1st insn scheduling 
> problem because regional scheduling can move hard register definition 
> beyond BB.

Thanks, this looks much better to me FWIW.

Richard
Follow-Ups:
- Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
  - From: Vladimir Makarov
References:
- Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
  - From: Richard Sandiford
- Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
  - From: Vladimir Makarov
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]