Code bloat due to silly IRA cost model?

Georg-Johann Lay gjl@gcc.gnu.org
Thu Jan 9 09:52:00 GMT 2020


Am 13.12.19 um 13:45 schrieb Richard Sandiford:
> Georg-Johann Lay <gjl@gcc.gnu.org> writes:
>> Am 11.12.19 um 18:55 schrieb Richard Sandiford:
>>> Georg-Johann Lay <gjl@gcc.gnu.org> writes:
>>>> Hi, doesn't actually anybody know know to make memory more expensive
>>>> than registers when it comes to allocating registers?
>>>>
>>>> Whatever I am trying for TARGET_MEMORY_MOVE_COST and
>>>> TARGET_REGISTER_MOVE_COST, ira-costs.c always makes registers more
>>>> expensive than mem and therefore allocates values to stack slots instead
>>>> of keeping them in registers.
>>>>
>>>> Test case (for avr) is as simple as it gets:
>>>>
>>>> float func (float);
>>>>
>>>> float call (float f)
>>>> {
>>>>        return func (f);
>>>> }
>>>>
>>>> What am I missing?
>>>>
>>>> Johann
>>>>
>>>>
>>>> Georg-Johann Lay schrieb:
>>>>> Hi,
>>>>>
>>>>> I am trying to track down a code bloat issue and am stuck because I do
>>>>> not understand IRA's cost model.
>>>>>
>>>>> The test case is as simple as it gets:
>>>>>
>>>>> float func (float);
>>>>>
>>>>> float call (float f)
>>>>> {
>>>>>       return func (f);
>>>>> }
>>>>>
>>>>> IRA dump shows the following insns:
>>>>>
>>>>>
>>>>> (insn 14 4 2 2 (set (reg:SF 44)
>>>>>           (reg:SF 22 r22 [ f ])) "bloat.c":4:1 85 {*movsf}
>>>>>        (expr_list:REG_DEAD (reg:SF 22 r22 [ f ])
>>>>>           (nil)))
>>>>> (insn 2 14 3 2 (set (reg/v:SF 43 [ f ])
>>>>>           (reg:SF 44)) "bloat.c":4:1 85 {*movsf}
>>>>>        (expr_list:REG_DEAD (reg:SF 44)
>>>>>           (nil)))
>>>>> (note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
>>>>> (insn 6 3 7 2 (set (reg:SF 22 r22)
>>>>>           (reg/v:SF 43 [ f ])) "bloat.c":5:12 85 {*movsf}
>>>>>        (expr_list:REG_DEAD (reg/v:SF 43 [ f ])
>>>>>           (nil)))
>>>>> (call_insn/j 7 6 8 2 (parallel [
>>>>>
>>>>> #14 sets pseudo 44 from arg register R22.
>>>>> #2 moves it to pseudo 43
>>>>> #6 moves it to R22 as it prepares for call_insn #7.
>>>>>
>>>>> There are 2 allocnos and cost:
>>>>>
>>>>> Pass 0 for finding pseudo/allocno costs
>>>>>
>>>>>       a1 (r44,l0) best NO_REGS, allocno NO_REGS
>>>>>       a0 (r43,l0) best NO_REGS, allocno NO_REGS
>>>>>
>>>>>     a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000
>>>>> NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000
>>>>>     a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000
>>>>> NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000
>>>>>
>>>>> which is quite odd because MEM is way more expensive here than any REG.
>>>>>
>>>>> Okay, so let's boost the MEM cost (TARGET_MEMORY_MOVE_COST) by a factor
>>>>> of 100:
>>>>>
>>>>>       a1 (r44,l0) best NO_REGS, allocno NO_REGS
>>>>>       a0 (r43,l0) best NO_REGS, allocno NO_REGS
>>>>>
>>>>>     a0(r43,l0) costs: ADDW_REGS:3200000 SIMPLE_LD_REGS:3200000
>>>>> LD_REGS:3200000 NO_LD_REGS:3200000 GENERAL_REGS:3200000 MEM:801000
>>>>>     a1(r44,l0) costs: ADDW_REGS:3200000 SIMPLE_LD_REGS:3200000
>>>>> LD_REGS:3200000 NO_LD_REGS:3200000 GENERAL_REGS:3200000 MEM:801000
>>>>>
>>>>> What??? The REG costs are 100 times higher, and stille higher that the
>>>>> MEM costs.  What the heck is going on?
>>>>>
>>>>> Setting TARGET_REGISTER_MOVE_COST and also TARGET_MEMORY_MOVE_COST to 0
>>>>> yiels:
>>>>>
>>>>>     a0(r43,l0) costs: ADDW_REGS:0 SIMPLE_LD_REGS:0 LD_REGS:0 NO_LD_REGS:0
>>>>> GENERAL_REGS:0 MEM:0
>>>>>     a1(r44,l0) costs: ADDW_REGS:0 SIMPLE_LD_REGS:0 LD_REGS:0 NO_LD_REGS:0
>>>>> GENERAL_REGS:0 MEM:0
>>>>>
>>>>> as expected, i.e. there is no other hidden source of costs considered by
>>>>> IRA.  And even TARGET_REGISTER_MOVE_COST = 0  and
>>>>> TARGET_MEMORY_MOVE_COST = original gives:
>>>>>
>>>>>     a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000
>>>>> NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000
>>>>>     a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000
>>>>> NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000
>>>>>
>>>>> How the heck do I tell ira-costs that registers are way cheaper than MEM?
>>>
>>> I think this is coming from:
>>>
>>>     /* FIXME: Ideally, the following test is not needed.
>>>           However, it turned out that it can reduce the number
>>>           of spill fails.  AVR and it's poor endowment with
>>>           address registers is extreme stress test for reload.  */
>>>
>>>     if (GET_MODE_SIZE (mode) >= 4
>>>         && regno >= REG_X)
>>>       return false;
>>
>> This was introduced to "fix" unable to find a register to spill ICE.
>>
>> What I do not understand is that the code with long (which is SImode on
>> avr) is fine:
>>
>> long lunc (long);
>>
>> long callL (long f)
>> {
>>       return lunc (f);
>> }
>>
>> callL:
>> 	rjmp lunc	 ;  7	[c=24 l=1]  call_value_insn/3
> 
> This is due to differences in the way that lower-subreg.c lowers
> SF moves vs. SI moves.  For SI it generates pure QI moves and so
> gets rid of the SI entirely.  For SF it still builds the QI values
> back into an SF:
> 
>        || (!SCALAR_INT_MODE_P (dest_mode)
> 	  && !resolve_reg_p (dest)
> 	  && !resolve_subreg_p (dest)))
> 
> I imagine this is because non-int modes are held in FPRs rather than
> GPRs on most targets, but TBH I'm not sure.  I couldn't see a comment
> that explains the above decision.
> 
> With -fno-split-wide-types I see the same RA behaviour for both SI and SF
> (i.e. both spill to memory).
> 
>>> in avr_hard_regno_mode_ok.  This forbids SFmode in r26+ and means that
>>> moves between pointer registers and general registers have the highest
>>> possible cost (65535) to prevent them for being used for SFmode.  So:
>>>
>>>      ira_register_move_cost[SFmode][POINTER_REGS][GENERAL_REGS] = 65535;
>>>
>>> The costs for union classes are the maximum (worst-case) cost of
>>> for each subclass, so this means that:
>>>
>>>      ira_register_move_cost[SFmode][GENERAL_REGS][GENERAL_REGS] = 65535;
>>>
>>> as well.
>>
>> This means that, when there is an expensive class (because it only
>> contains one register for example),
> 
> Having one register doesn't automatically make it expensive.
> E.g. there's only one "c" register on x86, but it's not more expensive
> than other registers because of that.
> 
> Move costs aren't a good way of deterring unnecessary uses of small
> classes.  The costs should just describe the actual size or speed
> overhead of moving the register.
> 
>> then it will blow the cost of GENERAL_REGS to crazy values no matter
>> what?
> 
> Yeah.  This is because (with the above intended use of costs) the
> worst-case cost of a superclass X can't be less than the worst-case cost
> of one of its subclasses Y.  If the RA decides to allocate an X, it might
> get unlucky and be forced to use a register in Y.
> 
> If a class X - Y exists then it won't be affected by the Y costs.
> So taking Y's cost into account when calculating X's cost means that
> the RA will prefer X - Y over X, which is exactly what making Y
> expensive should achieve.
> 
> FWIW, that's why I suggested seeing what would happen if you added a new
> class for GENERAL_REGS - POINTER_REGS.

Hi, I have tried that.  Didn't fix the issue.

For reference, the test case compiled with -Os -fsplit-wide-types-early:

float func (float);

float call (float f)
{
     return func (f);
}

with the attached delta.

It's still the case that there are 16 superfluous instructions, dead 
store, setup of frame even though not needed.


call:
	push r28		 ;  17	[c=4 l=1]  pushqi1/0
	push r29		 ;  18	[c=4 l=1]  pushqi1/0
	 ; SP -= 4	 ;  22	[c=4 l=2]  *addhi3_sp
	rcall .	
	rcall .	
	in r28,__SP_L__	 ;  23	[c=4 l=2]  *movhi/7
	in r29,__SP_H__
/* prologue: function */
/* frame size = 4 */
/* stack size = 6 */
.L__stack_usage = 6
	std Y+1,r22	 ;  14	[c=4 l=4]  *movsf/3
	std Y+2,r23
	std Y+3,r24
	std Y+4,r25
/* epilogue start */
	 ; SP += 4	 ;  34	[c=4 l=4]  *addhi3_sp
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop r29		 ;  35	[c=4 l=1]  popqi
	pop r28		 ;  36	[c=4 l=1]  popqi
	rjmp func	 ;  7	[c=24 l=1]  call_value_insn/3

In the IRA dump, it's still the case that REGs are consistently more 
expensive than MEM:

Pass 0 for finding pseudo/allocno costs

     a1 (r44,l0) best NO_REGS, allocno NO_REGS
     a0 (r43,l0) best NO_REGS, allocno NO_REGS

   a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 
NO_LD_REGS:32000 NO_POINTER_REGS:32000 MEM:9000
   a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 
NO_LD_REGS:32000 NO_POINTER_REGS:32000 MEM:9000


Pass 1 for finding pseudo/allocno costs

     r44: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
     r43: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS

   a0(r43,l0) costs: ADDW_REGS:48000 SIMPLE_LD_REGS:48000 LD_REGS:48000 
NO_LD_REGS:48000 NO_POINTER_REGS:48000 MEM:17000
   a1(r44,l0) costs: ADDW_REGS:40008 SIMPLE_LD_REGS:40008 LD_REGS:40008 
NO_LD_REGS:40008 NO_POINTER_REGS:40008 MEM:17000

Johann

p.s.: Also added a reg-class for the intersection of R0..r25 and 
r24..r31.  Don't know if that's a requirement for regclass layout though.


>> What's also strange is that the register allocator would not need to
>> allocate a register at all:  The incoming parameter comes in SI:22 and
>> is just be passed through to the callee, which also receives the value
>> in SI:22.  Why would one move that value to memory?  Even if memory was
>> cheaper, moving the value to mem just to load it again to the same
>> register is not very sensible...  because in almost any case, /no/
>> instruction is cheaper than /some/ instructions?
> 
> Earlier passes could perhaps propagate the pseudo registers away in very
> simple cases like this.  It would be a very special-case optimisation
> though.  If there was anything other than "move register X to register X"
> between the calls, getting rid of the pseudo registers before RA could
> introduce spill failures.
> 
> combine's to blame for the fact that we have two pseudo registers rather
> than one.  See the comments about the avr-elf results in:
> 
>     https://gcc.gnu.org/ml/gcc-patches/2019-11/msg02150.html
> 
> for more details.
> 
>>> Removing the code above fixes it.  If you don't want to do that, an
>>> alternative might be to add a class for r0-r25 (but I've not tested that).
>>
>> Is there a way that it would use a similar path like SImode?
> 
> AFAICT the SI and SF costs are the same.  The difference is coming
> from -fsplit-wide-types rather than RA.
> 
> Thanks,
> Richard
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: x.diff
Type: text/x-patch
Size: 4255 bytes
Desc: not available
URL: <https://gcc.gnu.org/pipermail/gcc/attachments/20200109/c7dc97f2/attachment.bin>


More information about the Gcc mailing list