[RFC][PATCH][MIPS] Patch to enable LRA for MIPS backend

Wed Apr 9 21:24:00 GMT 2014

Robert Suchanek <Robert.Suchanek@imgtec.com> writes:
>> FYI, all other targets that have LRA optionally selectable or deselectable
>> use -mno-lra for this (even when -mlra is the default), it would be better
>> for consistency not to invent new switch names for that.
>
> Agreed.
>
>>> -    return !strict_p || GET_MODE_SIZE (mode) == 4 || GET_MODE_SIZE (mode) == 8;
>>> +    return GET_MODE_SIZE (mode) == 4 || GET_MODE_SIZE (mode) == 8;
>>>  
>>>    return TARGET_MIPS16 ? M16_REG_P (regno) : GP_REG_P (regno);
>>>  }
>> Not sure about this one.  We would need to update the comment that
>> explains why "!strict_p" is there, but AFAIK reason (1) would still apply.
>>
>> Was this needed for correctness or because it gave better code?
>
> "!strict_p" has been removed because of correctness issue. When LRA
> validates memory addresses pseudos are temporarily eliminated to hard
> registers (if possible) but the target hook is always called as
> non-strict. This only affects MIPS16 instructions with not directly
> accessible $sp. The strict variant, as I understand, was used in the
> reload pass to indicate if a pseudo-register has been allocated a hard
> register. Unless LRA should be setting the strict/non-strict depending
> on whether a temporal elimination to hard reg was successful or there
> is something else that I missed?

Hmm, OK, in that case I agree reason (2) doesn't apply.  That part was
always more of a consistency thing anyway, so I agree it's not worth
keeping around for reload.  I also had a look to see why
instantiate_virtual_regs_in_insn didn't complain about cases like:

  struct s { unsigned char c; };
  void foo (int, int, int, int, struct s);
  void bar (struct s *ptr) { foo (1, 2, 3, 4, *ptr); }

and I think it's because of the later:

2008-02-14  Michael Matz  <matz@suse.de>

	PR target/34930
	* function.c (instantiate_virtual_regs_in_insn): Reload address
	before falling back to reloading the whole operand.

which correctly reloads the address if necessary.

So yeah, I agree this is right after all, sorry.  Let's delete the
comment starting at "There are two problems here:" at the same time.

>>> +  M16F_REGS,			/* mips16 + frame */
>>
>> Constraints are supposed to be operating on real registers, after
>> elimination, so it seems odd to include a fake register.  What went
>> wrong with just M16_REGS?
>
> Only the stack pointer has been added to M16_REGS.

Sorry, I'd read "frame" as meaning "$frame", the soft frame pointer.
I agree M16_REGS + $sp is OK.

mips_regno_to_class should then map $sp to the new class, since it's now
the smallest containing class.  (We really should set that up automatically
one day...)

> A number of patterns need to accept it otherwise LRA inserts a lot of
> reloads and the code size goes up by about 10%.  The change does have
> also a positive effect on reload but marginally.  "frame" meant to
> indicate inclusion of both the stack and hard frame pointers in the
> class but perhaps I should name it differently to avoid confusion.

How about M16_SP_REGS, to match M16_T_REGS?

Also, the BASE_REG_CLASS/ADDR_REG_CLASS distinction isn't all that
obvious from the names.  ADDR_REG_CLASS is only needed for the "d"
constraint so maybe we could just use TARGET_MIPS16 ? M16_REGS : GR_REGS
directly for now.

>>> + SPILL_REGS, /* All but $sp and call preserved regs are in here */
>>...
>>> + { 0x0003fffc, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
>>> 0x00000000 }, /* SPILL_REGS */ \
>>
>> These two don't seem to match.  I think literally it would be 0x0300fffc,
>> but maybe you had to make SPILL_REGS a superset of M16_REGs?
>
> I initially used 0x0300fffc but did some experiments and it turned out
> that 0x0003fffc (with $16, $17 regs) gives slightly better code. I
> haven't updated the comment though.

I can imagine including all M16_REGS makes sense, but it seems odd to
drop the 2 temporaries.  Does 0x0303fffc have the same problem?

> There is yet more to do and need to return to another thread with
> MIPS16 at some point as I found some limitations of IRA/LRA to
> generate better code. $8-$15 are currently inaccessible as temporary
> storage because these registers are marked as fixed (when optimizing
> for size) but leaving them as fixed are better for the code size. I
> don't expect a big gain by using hard registers for spilling but it
> more likely to improve the performance.

Hmm, marking them fixed was supposed to be a temporary reload-only thing,
until the move to LRA.  It should never be worse to spill to these GPRs
over spilling to the stack, if the value isn't live across a call.

But that certainly doesn't need to be part of the initial patch.

>>> +/* Add costs to hard registers based on frequency. This helps to negate
>>> +   some of the reduced cost associated with argument registers which 
>>> +   unfairly promotes their use and increases register pressure */
>>> +#define IRA_HARD_REGNO_ADD_COST_MULTIPLIER(REGNO)                       \
>>> +  (TARGET_MIPS16 && optimize_size                                       \
>>> +   ? ((REGNO) >= 4 && (REGNO) <= 7 ? 2 : 0) : 0)
>>
>> So we would be trying to use, say, $4 for the first incoming argument
>> even after it had been spilled?  Hmm.
>>
>> Since this change is aimed specifically at one heuristic, I wonder
>> whether we should parameterise that heuristic somehow rather than
>> try to use a general hook to undo it.  But I don't think there's
>> anything particularly special about MIPS16 here, so maybe it's too
>> eager for all targets.
>
> In a number of cases argument registers appeared to be unfairly promoted
> increasing the register pressure and increasing the number of reloads.
> Bumping up the cost of using those registers encourages IRA to spill
> into memory but this appears to help LRA to do a better allocation. Of course,
> not always it is a win but generally the gain outweighs the loss. 
>
> I've seen an codesize improvements for other optimization levels
> but I'm unsure whether we should make this change generic.
> This part of the patch is not crucial though and can be send separately. 

OK, thanks, doing it separately sounds better.

>>>  (define_insn "*and<mode>3_mips16"
>>> ...
>>
>> I think we want to keep the LWU case at the very least, since I assume
>> otherwise we'll load the full 64-bit spill slot and use a pair of shifts
>> to zero-extend it.  Reloading the stack address into a base register
>> should always be better than that.
>>
>> I agree it's less clear for the byte and halfword cases.  All other
>> things -- including size -- being equal, reloading a stack address into
>> a base register and using an extending load should be better than
>> reloading the full spill slot and extending it, since the reloaded stack
>> address is more likely to be reused in other instructions.
>>
>> The original MIPS16 didn't have ZEB and ZEH, so on the basis above,
>> I think using LBU and LHU there was probably a win.  I can imagine
>> it making sense to disable LBU and LHU for MIPS16e though.
>>
>> It might be better to do any changes to this pattern as a follow-on
>> patch, since I think LRA should cope with it as-is.
>
> Keeping LWU case should be fine. Alternatives were removed because
> LRA was ICEing for the same reason of $sp not accessible for the byte 
> and halfword cases. I tried to find a testcase to trigger LWU case 
> but couldn't. I presume it must be a rare case? The problem with other
> alternatives was that spilled pseudos were changed to memory without
> checking if the use of $sp is valid. On the other hand, keeping LWU 
> may only delay triggering the problem because the stack pointer 
> should not be used. It could be a missing case in LRA too.

Did you see the failures even after your mips_regno_mode_ok_for_base_p
change?  LRA should know how to reload a "W" address.

>>> diff --git gcc/rtlanal.c gcc/rtlanal.c
>>> index 7a9efec..f699d17 100644
>>> --- gcc/rtlanal.c
>>> +++ gcc/rtlanal.c
>>> @@ -5577,7 +5577,8 @@ get_base_term (rtx *inner)
>>>      inner = strip_address_mutations (&XEXP (*inner, 0));
>>>    if (REG_P (*inner)
>>>        || MEM_P (*inner)
>>> -      || GET_CODE (*inner) == SUBREG)
>>> +      || GET_CODE (*inner) == SUBREG
>>> +      || CONSTANT_P (*inner))
>>>      return inner;
>>>    return 0;
>>>  }
>> 
>> This doesn't look right; the general form is base+index+displacement+segment,
>> with the constant term being treated as the displacement.
>
> I'm not particularly happy with this either. This was an attempt to fix an ICE for 
> the following RTL (gcc.dg/torture/asm-subreg-1.c compiled with -mips32r2 -mips16 -O1):
>
> (insn 9 8 0 2 (asm_operands/v ("") ("") 0 [
>             (mem/v/j/c:HI (lo_sum:SI (const:SI (unspec:SI [
>                                 (const_int 0 [0])
>                             ] UNSPEC_GP))
>                     (symbol_ref:SI ("_const_32") [flags 0x6]  <var_decl 0x7f50acd17558 _const_32>)) [0 _const_32+0 S2 A32])]
>  [(asm_input:HI ("X") (null):0)]
>  [] asm-subreg-1.c:13) asm-subreg-1.c:13 -1 (nil))
>
> Any suggestions to handle this case?

Thanks for the pointer.  I think this shows a more fundamental problem
with the handling of "X" constraints.  With something like:

void
foo (int **x, int y, int z)
{
  int *ptr = *x + y * z / 11;
  __asm__ __volatile__ ("" : : "X" (*ptr));
}

the entire expression gets treated as a MEM address, which neither
reload nor LRA can handle.  And with something like that, it isn't
obvious what class all the registers in the address should have.
With a sufficiently-complicated expression you could run out of registers.

So perhaps we should limit the propagation to things that
decompose_mem_address can handle.

Thanks,
Richard