This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Code bloat due to silly IRA cost model?

From: Georg-Johann Lay <gjl at gcc dot gnu dot org>
To: gcc at gcc dot gnu dot org
Cc: richard dot sandiford at arm dot com
Date: Thu, 9 Jan 2020 10:52:05 +0100
Subject: Re: Code bloat due to silly IRA cost model?
References: <47b10439-21b8-81f6-3838-a2bcd0fa8048@gjlay.de> <5DEFFCD4.4080903@gcc.gnu.org> <mpteexa21d5.fsf@arm.com> <84093aae-3567-65a1-581a-17ef599b09ad@gcc.gnu.org> <mpt1rt8z950.fsf@arm.com>

Am 13.12.19 um 13:45 schrieb Richard Sandiford:

Georg-Johann Lay <gjl@gcc.gnu.org> writes:

Am 11.12.19 um 18:55 schrieb Richard Sandiford:

Georg-Johann Lay <gjl@gcc.gnu.org> writes:

Hi, doesn't actually anybody know know to make memory more expensive
than registers when it comes to allocating registers?

Whatever I am trying for TARGET_MEMORY_MOVE_COST and
TARGET_REGISTER_MOVE_COST, ira-costs.c always makes registers more
expensive than mem and therefore allocates values to stack slots instead
of keeping them in registers.

Test case (for avr) is as simple as it gets:

float func (float);

float call (float f)
{
       return func (f);
}

What am I missing?

Johann


Georg-Johann Lay schrieb:

Hi,

I am trying to track down a code bloat issue and am stuck because I do
not understand IRA's cost model.

The test case is as simple as it gets:

float func (float);

float call (float f)
{
      return func (f);
}

IRA dump shows the following insns:


(insn 14 4 2 2 (set (reg:SF 44)
          (reg:SF 22 r22 [ f ])) "bloat.c":4:1 85 {*movsf}
       (expr_list:REG_DEAD (reg:SF 22 r22 [ f ])
          (nil)))
(insn 2 14 3 2 (set (reg/v:SF 43 [ f ])
          (reg:SF 44)) "bloat.c":4:1 85 {*movsf}
       (expr_list:REG_DEAD (reg:SF 44)
          (nil)))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (set (reg:SF 22 r22)
          (reg/v:SF 43 [ f ])) "bloat.c":5:12 85 {*movsf}
       (expr_list:REG_DEAD (reg/v:SF 43 [ f ])
          (nil)))
(call_insn/j 7 6 8 2 (parallel [

#14 sets pseudo 44 from arg register R22.
#2 moves it to pseudo 43
#6 moves it to R22 as it prepares for call_insn #7.

There are 2 allocnos and cost:

Pass 0 for finding pseudo/allocno costs

      a1 (r44,l0) best NO_REGS, allocno NO_REGS
      a0 (r43,l0) best NO_REGS, allocno NO_REGS

    a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000
NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000
    a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000
NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000

which is quite odd because MEM is way more expensive here than any REG.

Okay, so let's boost the MEM cost (TARGET_MEMORY_MOVE_COST) by a factor
of 100:

      a1 (r44,l0) best NO_REGS, allocno NO_REGS
      a0 (r43,l0) best NO_REGS, allocno NO_REGS

    a0(r43,l0) costs: ADDW_REGS:3200000 SIMPLE_LD_REGS:3200000
LD_REGS:3200000 NO_LD_REGS:3200000 GENERAL_REGS:3200000 MEM:801000
    a1(r44,l0) costs: ADDW_REGS:3200000 SIMPLE_LD_REGS:3200000
LD_REGS:3200000 NO_LD_REGS:3200000 GENERAL_REGS:3200000 MEM:801000

What??? The REG costs are 100 times higher, and stille higher that the
MEM costs.  What the heck is going on?

Setting TARGET_REGISTER_MOVE_COST and also TARGET_MEMORY_MOVE_COST to 0
yiels:

    a0(r43,l0) costs: ADDW_REGS:0 SIMPLE_LD_REGS:0 LD_REGS:0 NO_LD_REGS:0
GENERAL_REGS:0 MEM:0
    a1(r44,l0) costs: ADDW_REGS:0 SIMPLE_LD_REGS:0 LD_REGS:0 NO_LD_REGS:0
GENERAL_REGS:0 MEM:0

as expected, i.e. there is no other hidden source of costs considered by
IRA.  And even TARGET_REGISTER_MOVE_COST = 0  and
TARGET_MEMORY_MOVE_COST = original gives:

    a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000
NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000
    a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000
NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000

How the heck do I tell ira-costs that registers are way cheaper than MEM?


I think this is coming from:

    /* FIXME: Ideally, the following test is not needed.
          However, it turned out that it can reduce the number
          of spill fails.  AVR and it's poor endowment with
          address registers is extreme stress test for reload.  */

    if (GET_MODE_SIZE (mode) >= 4
        && regno >= REG_X)
      return false;


This was introduced to "fix" unable to find a register to spill ICE.

What I do not understand is that the code with long (which is SImode on
avr) is fine:

long lunc (long);

long callL (long f)
{
      return lunc (f);
}

callL:
	rjmp lunc	 ;  7	[c=24 l=1]  call_value_insn/3


This is due to differences in the way that lower-subreg.c lowers
SF moves vs. SI moves.  For SI it generates pure QI moves and so
gets rid of the SI entirely.  For SF it still builds the QI values
back into an SF:

       || (!SCALAR_INT_MODE_P (dest_mode)
	  && !resolve_reg_p (dest)
	  && !resolve_subreg_p (dest)))

I imagine this is because non-int modes are held in FPRs rather than
GPRs on most targets, but TBH I'm not sure.  I couldn't see a comment
that explains the above decision.

With -fno-split-wide-types I see the same RA behaviour for both SI and SF
(i.e. both spill to memory).

in avr_hard_regno_mode_ok.  This forbids SFmode in r26+ and means that
moves between pointer registers and general registers have the highest
possible cost (65535) to prevent them for being used for SFmode.  So:

     ira_register_move_cost[SFmode][POINTER_REGS][GENERAL_REGS] = 65535;

The costs for union classes are the maximum (worst-case) cost of
for each subclass, so this means that:

     ira_register_move_cost[SFmode][GENERAL_REGS][GENERAL_REGS] = 65535;

as well.


This means that, when there is an expensive class (because it only
contains one register for example),


Having one register doesn't automatically make it expensive.
E.g. there's only one "c" register on x86, but it's not more expensive
than other registers because of that.

Move costs aren't a good way of deterring unnecessary uses of small
classes.  The costs should just describe the actual size or speed
overhead of moving the register.

then it will blow the cost of GENERAL_REGS to crazy values no matter
what?


Yeah.  This is because (with the above intended use of costs) the
worst-case cost of a superclass X can't be less than the worst-case cost
of one of its subclasses Y.  If the RA decides to allocate an X, it might
get unlucky and be forced to use a register in Y.

If a class X - Y exists then it won't be affected by the Y costs.
So taking Y's cost into account when calculating X's cost means that
the RA will prefer X - Y over X, which is exactly what making Y
expensive should achieve.

FWIW, that's why I suggested seeing what would happen if you added a new
class for GENERAL_REGS - POINTER_REGS.


Hi, I have tried that.  Didn't fix the issue.

For reference, the test case compiled with -Os -fsplit-wide-types-early:

float func (float);

float call (float f)
{
    return func (f);
}

with the attached delta.

It's still the case that there are 16 superfluous instructions, deadstore, setup of frame even though not needed.



call:
	push r28		 ;  17	[c=4 l=1]  pushqi1/0
	push r29		 ;  18	[c=4 l=1]  pushqi1/0
	 ; SP -= 4	 ;  22	[c=4 l=2]  *addhi3_sp
	rcall .	
	rcall .	
	in r28,__SP_L__	 ;  23	[c=4 l=2]  *movhi/7
	in r29,__SP_H__
/* prologue: function */
/* frame size = 4 */
/* stack size = 6 */
.L__stack_usage = 6
	std Y+1,r22	 ;  14	[c=4 l=4]  *movsf/3
	std Y+2,r23
	std Y+3,r24
	std Y+4,r25
/* epilogue start */
	 ; SP += 4	 ;  34	[c=4 l=4]  *addhi3_sp
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop r29		 ;  35	[c=4 l=1]  popqi
	pop r28		 ;  36	[c=4 l=1]  popqi
	rjmp func	 ;  7	[c=24 l=1]  call_value_insn/3

In the IRA dump, it's still the case that REGs are consistently moreexpensive than MEM:


Pass 0 for finding pseudo/allocno costs

    a1 (r44,l0) best NO_REGS, allocno NO_REGS
    a0 (r43,l0) best NO_REGS, allocno NO_REGS

a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000NO_LD_REGS:32000 NO_POINTER_REGS:32000 MEM:9000a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000NO_LD_REGS:32000 NO_POINTER_REGS:32000 MEM:9000



Pass 1 for finding pseudo/allocno costs

    r44: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    r43: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS

a0(r43,l0) costs: ADDW_REGS:48000 SIMPLE_LD_REGS:48000 LD_REGS:48000NO_LD_REGS:48000 NO_POINTER_REGS:48000 MEM:17000a1(r44,l0) costs: ADDW_REGS:40008 SIMPLE_LD_REGS:40008 LD_REGS:40008NO_LD_REGS:40008 NO_POINTER_REGS:40008 MEM:17000


Johann

p.s.: Also added a reg-class for the intersection of R0..r25 andr24..r31. Don't know if that's a requirement for regclass layout though.

What's also strange is that the register allocator would not need to
allocate a register at all:  The incoming parameter comes in SI:22 and
is just be passed through to the callee, which also receives the value
in SI:22.  Why would one move that value to memory?  Even if memory was
cheaper, moving the value to mem just to load it again to the same
register is not very sensible...  because in almost any case, /no/
instruction is cheaper than /some/ instructions?


Earlier passes could perhaps propagate the pseudo registers away in very
simple cases like this.  It would be a very special-case optimisation
though.  If there was anything other than "move register X to register X"
between the calls, getting rid of the pseudo registers before RA could
introduce spill failures.

combine's to blame for the fact that we have two pseudo registers rather
than one.  See the comments about the avr-elf results in:

    https://gcc.gnu.org/ml/gcc-patches/2019-11/msg02150.html

for more details.

Removing the code above fixes it.  If you don't want to do that, an
alternative might be to add a class for r0-r25 (but I've not tested that).


Is there a way that it would use a similar path like SImode?


AFAICT the SI and SF costs are the same.  The difference is coming
from -fsplit-wide-types rather than RA.

Thanks,
Richard

Index: config/avr/avr.c
===================================================================
--- config/avr/avr.c	(revision 279994)
+++ config/avr/avr.c	(working copy)
@@ -850,7 +850,7 @@ avr_regno_reg_class (int r)
       SIMPLE_LD_REGS, SIMPLE_LD_REGS, SIMPLE_LD_REGS, SIMPLE_LD_REGS,
       SIMPLE_LD_REGS, SIMPLE_LD_REGS, SIMPLE_LD_REGS, SIMPLE_LD_REGS,
       /* r24, r25 */
-      ADDW_REGS, ADDW_REGS,
+      R24_R25_REGS, R24_R25_REGS,
       /* X: r26, 27 */
       POINTER_X_REGS, POINTER_X_REGS,
       /* Y: r28, r29 */
@@ -12704,6 +12704,7 @@ avr_conditional_register_usage (void)
 
       CLEAR_HARD_REG_SET (reg_class_contents[(int) ADDW_REGS]);
       CLEAR_HARD_REG_SET (reg_class_contents[(int) NO_LD_REGS]);
+	  reg_class_contents[NO_POINTER_REGS] &= reg_class_contents[LD_REGS];
     }
 }
 
Index: config/avr/avr.h
===================================================================
--- config/avr/avr.h	(revision 279994)
+++ config/avr/avr.h	(working copy)
@@ -219,6 +219,7 @@ These two properties are reflected by bu
 enum reg_class {
   NO_REGS,
   R0_REG,			/* r0 */
+  R24_R25_REGS,			/* r24 - r25 */
   POINTER_X_REGS,		/* r26 - r27 */
   POINTER_Y_REGS,		/* r28 - r29 */
   POINTER_Z_REGS,		/* r30 - r31 */
@@ -229,6 +230,7 @@ enum reg_class {
   SIMPLE_LD_REGS,		/* r16 - r23 */
   LD_REGS,			/* r16 - r31 */
   NO_LD_REGS,			/* r0 - r15 */
+  NO_POINTER_REGS,		/* r0 - r25 */
   GENERAL_REGS,			/* r0 - r31 */
   ALL_REGS, LIM_REG_CLASSES
 };
@@ -236,25 +238,28 @@ enum reg_class {
 
 #define N_REG_CLASSES (int)LIM_REG_CLASSES
 
-#define REG_CLASS_NAMES {					\
-		 "NO_REGS",					\
-		   "R0_REG",	/* r0 */                        \
-		   "POINTER_X_REGS", /* r26 - r27 */		\
-		   "POINTER_Y_REGS", /* r28 - r29 */		\
-		   "POINTER_Z_REGS", /* r30 - r31 */		\
-		   "STACK_REG",	/* STACK */			\
-		   "BASE_POINTER_REGS",	/* r28 - r31 */		\
-		   "POINTER_REGS", /* r26 - r31 */		\
-		   "ADDW_REGS",	/* r24 - r31 */			\
-                   "SIMPLE_LD_REGS", /* r16 - r23 */            \
-		   "LD_REGS",	/* r16 - r31 */			\
-                   "NO_LD_REGS", /* r0 - r15 */                 \
-		   "GENERAL_REGS", /* r0 - r31 */		\
-		   "ALL_REGS" }
+#define REG_CLASS_NAMES {			\
+	"NO_REGS",				\
+	"R0_REG",	     /* r0 */		\
+	"R24_R25_REGS",	     /* r24 - r25 */	\
+	"POINTER_X_REGS",    /* r26 - r27 */	\
+	"POINTER_Y_REGS",    /* r28 - r29 */	\
+	"POINTER_Z_REGS",    /* r30 - r31 */	\
+	"STACK_REG",	     /* STACK */	\
+	"BASE_POINTER_REGS", /* r28 - r31 */	\
+	"POINTER_REGS",	     /* r26 - r31 */	\
+	"ADDW_REGS",	     /* r24 - r31 */	\
+	"SIMPLE_LD_REGS",    /* r16 - r23 */	\
+	"LD_REGS",	     /* r16 - r31 */	\
+	"NO_LD_REGS",	     /* r0 - r15 */	\
+	"NO_POINTER_REGS",   /* r0 - r35 */	\
+	"GENERAL_REGS",	     /* r0 - r31 */	\
+	"ALL_REGS" }
 
 #define REG_CLASS_CONTENTS {						\
   {0x00000000,0x00000000},	/* NO_REGS */				\
   {0x00000001,0x00000000},	/* R0_REG */                            \
+  {3u << 24,0x00000000},        /* R24_R25_REGS,   r24 - r25 */		\
   {3u << REG_X,0x00000000},     /* POINTER_X_REGS, r26 - r27 */		\
   {3u << REG_Y,0x00000000},     /* POINTER_Y_REGS, r28 - r29 */		\
   {3u << REG_Z,0x00000000},     /* POINTER_Z_REGS, r30 - r31 */		\
@@ -269,6 +274,7 @@ enum reg_class {
   {(3u << REG_X)|(3u << REG_Y)|(3u << REG_Z)|(3u << REG_W)|(0xffu << 16),\
      0x00000000},	/* LD_REGS, r16 - r31 */			\
   {0x0000ffff,0x00000000},	/* NO_LD_REGS  r0 - r15 */              \
+  {0x03ffffff,0x00000000},	/* NO_POINTER_REGS  r0 - r25 */         \
   {0xffffffff,0x00000000},	/* GENERAL_REGS, r0 - r31 */		\
   {0xffffffff,0x00000003}	/* ALL_REGS */				\
 }
Index: config/avr/constraints.md
===================================================================
--- config/avr/constraints.md	(revision 279992)
+++ config/avr/constraints.md	(working copy)
@@ -53,6 +53,12 @@ (define_register_constraint "z" "POINTER
 (define_register_constraint "q" "STACK_REG"
   "Stack pointer register (SPH:SPL).")
 
+(define_register_constraint "R24R25" "R24_R25_REGS"
+  "Register pair r25:r24.")
+
+(define_register_constraint "R00R25" "NO_POINTER_REGS"
+  "Non-pointer registers r0 to r25.")
+
 (define_constraint "I"
   "Integer constant in the range 0 @dots{} 63."
   (and (match_code "const_int")

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]