This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Move costs, allocno costs, and per-register costs

[Andreas, I'm cc:ing you as m68k maintainer; scroll down for why]

This patch helps to fix the MIPS dspr2-MULT{,U}.c regressions
that were introduced by my MIPS IRA patch.  I think the underlying
problem is a general one.

MIPS defines the following REGISTER_MOVE_COSTS:

  GR_REGS  -> GR_REGS  : 2
  GR_REGS  -> ACC_REGS : 6
  ACC_REGS -> GR_REGS  : 6
  ACC_REGS -> ACC_REGS : 12  (because this move has to through a GPR)

  [ At this point I should insert my usual caveat that these costs are
    historical, and are not accurate for all MIPS processors.  They should
    really be part of the cost structure, so that we can define processor-
    specific costs instead.  But that's not relevant to this dicussion. ]

MIPS also has a union and cover class called GR_AND_ACC_REGS.
The problem relates to the allocno costs of this class.

First of all, move_cost[MODE][FROM][TO] is the maximum of all:


such that FROM' \subseteq FROM and TO \subseteq TO.  So we have:

  move_cost[MODE][GR_AND_ACC_REGS][GR_REGS] = 6    (= ACC_REGS -> GR_REGS)
  move_cost[MODE][GR_AND_ACC_REGS][ACC_REGS] = 12  (= ACC_REGS -> ACC_REGS)
  move_cost[MODE][GR_REGS][GR_AND_ACC_REGS] = 6    (= GR_REGS  -> ACC_REGS)
  move_cost[MODE][ACC_REGS][GR_AND_ACC_REGS] = 12  (= ACC_REGS -> ACC_REGS)

and so on.  This is correct from a MIPS point of view.

We also have:

  /* Similar, but here we don't have to move if the first index is a subset
     of the second so in that case the cost is zero.  */
  move_table *may_move_in_cost[MAX_MACHINE_MODE];

  /* Similar, but here we don't have to move if the first index is a superset
     of the second so in that case the cost is zero.  */
  move_table *may_move_out_cost[MAX_MACHINE_MODE];


    = { 0 if FROM \subseteq TO
      { move_cost[MODE][FROM][TO] otherwise

This is intended to be used in cases where:

  - FROM is a class that the register allocator might use for
    a pseudo register; and

  - TO is the class required by a particular alternative of a
    particular insn operand.

So the cost is rightly zero when the source register already belongs
to TO; no input reload would be needed in that case .  However,
I think the "otherwise" case is an overestimate.

To explain why, suppose FROM == GR_AND_ACC_REGS and TO == ACC_REGS.
We're therefore calculating the cost of reloading a GR_AND_ACC_REGS
value into an ACC_REGS operand.  At the moment:

    = move_cost[MODE][GR_AND_ACC_REGS][ACC_REGS]

And, from above, move_cost[MODE][GR_AND_ACC_REGS][ACC_REGS] == 12,
because ACC_REGS to ACC_REGS moves have a cost of 12.  But no reload
is needed if the source is already a member of ACC_REGS, so a more
accurate value would be:

    = move_cost[MODE][GR_REGS][ACC_REGS]

which is 6 rather than 12.  In other words, if we have an insn that
requires an accumulator register, we will overestimate the cost of

More generally, a tighter definition of may_move_in_cost would be:

    = { 0 if FROM \subseteq TO
      { move_cost[MODE][FROM \ TO][TO] otherwise

(Note that the "if" case is still needed: move_cost[MODE][NO_REGS][TO]
is the maximum cost because NO_REGS has no registers of mode MODE.
Also, FROM \ TO might not exist as a class, so this would be a
"superdiff"; c.f. our current "superunions".)

The same reasoning applies to may_move_out_cost.

Although this definition would solve the overestimate for a single insn,
there's still an inherent overestimate in the combined costs.
E.g. consider a pseudo register P that is used in two instructions:

  - I1, which writes to a GPR
  - I2, which reads from an accumulator

where I1 and I2 have the same frequency F.  The cost of using

  may_move_out_cost[MODE][GR_REGS][GR_AND_ACC_REGS] * F
  + may_move_in_cost[MODE][GR_AND_ACC_REGS][ACC_REGS] * F

which after the suggested change would be equivalent to:

  move_cost[MODE][GR_REGS][ACC_REGS] * 2 * F

But only one of the two moves is needed.  The allocator will either
pick a GPR (in which case the second move is needed) or an accumulator
(in which case the first move is needed).  This will be reflected
in the costs of P as follows:

  GR_REGS : 6 * F               (for I2)
  ACC_REGS : 6 * F              (for I1)
  GR_AND_ACC_REGS : 12 * F      (for both I1 and I2)

(Before the changed may-move costs we would have:

  GR_AND_ACC_REGS : 24 * F


Now the old regclass.c code recognised this problem, and picked the
"preferred" class as the union of all classes with the minimum cost.
So P's preferred class is the union of GR_REGS and ACC_REGS,
i.e. GR_AND_ACC_REGS, even though GR_AND_ACC_REGS itself was
given a much higher cost than the individual classes.

The regclass.c code was adopted for IRA, so it too makes the
same (correct) choice of preferred class.

However -- getting to the point at last -- once it has decided to
use this union class, IRA sets the ALLOCNO_COVER_CLASS_COST to the
cost of the _union class_, not the cost of the classes whose unions
were taken.  So in the example above, ALLOCNO_COVER_CLASS_COST would be
12 * F or 24 * F rather than 6 * F.  This is an overestimate for the
reasons given above.

In dspr2-MULTU.c, the overestimate is such that the GR_AND_ACC_REGS
cost exceeds the memory cost, even though the individual GR_REGS
and ACC_REGS costs are much lower than the memory cost.  So we end
up thinking that it is cheaper to spill the register to memory.

Things are slightly more complicated with IRA than the old allocators
because we have two sets of costs:

  - per-allocno (allocno_costs)
  - per-register (total_costs)

The "preferred" class is chosen using total_costs, but the
ALLOCNO_COVER_CLASS_COST is taken from allocno_costs.  So although
find_allocno_class_costs knows what the minimum *total_cost* is --
namely "best_cost" -- this isn't what we want the ALLOCNO_COVER_CLASS_COST
to be.  We want a value derived from allocno_costs instead.

One fix would be work out a set of partitions PP for each class C such that
C's allocno cost would be no greater than:

  MIN_{P \in PP} (MAX_{C' \in P} (allocno_cost of C'))

That would be a little complicated though, and the calculation for
each C would be quadratic in the number of register classes.

Instead, let's suppose that the preferred class is calculated as
the union of C1 ... Cn, where each Ci has the cheapest total_cost.
The approach I've taken in the patch is to calculate the maximum
of the allocno_costs for each such Ci, and use that value as the
ALLOCNO_COVER_CLASS_COST.  Then the cost of any register in the
"preferred" class should be that of the ALLOCNO_COVER_CLASS_COST.
(The costs of other registers are the same as before the patch.)

This fixes the MIPS problem: the allocno above would now have
cost 6 * F rather than 12 * F or 24 * F, and we no longer spill
accumulator values to memory for dspr2-MULT{,U}.c.  I also think
it makes conceptual sense, for the reasons given above.

I tested the patch by compiling the gcc.c-torture, gcc.dg and g++.dg
testsuites at -O2 for the targets listed below and then compared the
before and after assembly code.  The differences were as follows:

  alpha-linux-gnu	no change
  arc-elf		no change
  arm-eabi		no change
  avr-rtems		improvements in 2 tests
  bfin-elf		some tests get better, some get worse; see below
  cris-elf		no change
  fr30-elf		no change
  frv-elf		like-for-like changes in 1 test
  h8300-elf		no change
  hppa2.0-linux-gnu	no change
  i686-linux-gnu	changes in 2 tests, both improvements by my reading
  iq2000-elf		no change
  m32c-elf		no change
  m32r-elf		no change
  m68hc11-elf		no change
  m68k-elf		some tests get better, some get worse; see below
  mcore-elf		no change
  mipsisa64-elf		changes in 3 tests, all improvements
  mmix-knuth-mmixware	no change
  pdp11-bsd		no change
  powerpc64-linux-gnu	no change
  powerpc-linux-gnu	no change
  s390-linux-gnu	no change
  score-elf		no change
  sh64-elf		no change
  sh-elf		lots of like-for-like changes
  sparc-sun-solaris2.8	no change
  spu-elf		no change
  v850-elf		no change
  vax-netbsdelf		no change
  x86_64-linux-gnu	like-for-like changes in 1 test
  xstormy16-elf		like-for-like changes in 32 tests
  xtensa-elf		no change

(Some of these targets haven't been ported to IRA yet.)

I've attached the diffs for all targets that saw some change.


  There are a lot of like-for-like changes, a few improvements, and a
  few regressions.  One of the regressions is in struct-return-2_x.c,
  in which one problem is repeated in several similar functions.
  Although the test is hardly representative of typical code,
  the problem is interesting.

  Reload decides it has to free up a register by spilling a pseudo,
  so IRA has to pick the best candidate.  The two cheapest candidates are:

    P1: a register that is set once and used as the source address
        in a rep_movhi instruction

    P2: a register that is set once and used as the R1 argument
        to a function.

  So P1 and P2 are set and used the same number of times, but P2 has
  the additional constraint that its value must be moved into a specific
  register -- R1 -- at the point P2 dies.  So:

    - P2:R1 has an extra negative cost, to make R1 seem very cheap
    - IRA has duly allocated R1 to P2, as hoped
    - this negative cost is added to the ALLOCNO_COVER_CLASS_COST of P2
      (probably correctly)

  So if P1 and P2 had the same ALLOCNO_MEMORY_COST and _unadjusted_
  ALLOCNO_COVER_CLASS_COST, adding the negative R1 cost to P2's
  ALLOCNO_COVER_CLASS_COST would make it seem cheaper to spill P1.
  And that's exactly what we want to do in this case.

  However, the rep_movhi instruction that uses P1 also sets P1 to
  the post-incremented source address:

      (parallel [
        (set (reg:SI P1)    <---
             (plus:SI (plus:SI (reg:SI P1) ..)))
        (set (mem:BLK (...))
             (mem:BLK (reg:SI P1)))

  This updated P1 is not used, and there is a REG_UNUSED note to
  indicate this.  However, ira-costs.c still takes the destination
  into account when calculating the cost of P1.  That is, we calculate
  the cost of _three_ operands:

    - the output operand for the original P1 definition
    - the input operand for the rep_movhi
    - the output operand for the unused rep_movhi destination

  We're really considering the cost of potential reloads here,
  so I don't think the last operand should be counted.  As it stands,
  P1 has an inflated ALLOCNO_MEMORY_COST, because we wrongly assume
  that we'd need to copy P1 to memory twice.

  Now P1 can use either IREGS or PREGS, so its preferred class
  is IPREGS.  The cost of IPREGS itself is overestimated for
  the reason given above, and before the patch, we would set
  ALLOCNO_COVER_CLASS_COST to this overestimated value.
  This overestimate of the cover class cost goes some way
  to cancelling out the overestimate of the memory cost,
  so before the patch, spilling P1 had the same cost as
  spilling P2.  We ended up choosing to spill P1.

  After the patch, the ALLOCNO_COVER_CLASS_COST is the cost of using
  either IREGS or PREGS, so the gap between the ALLOCNO_COVER_CLASS_COST
  and the ALLOCNO_MEMORY_COST is greater.  This makes it seem cheaper to
  spill P2.

  In summary, I think ira-costs.c:record_reg_classes should not add
  ira_may_move_out_cost and ira_memory_cost for unused output operands.
  I'd like to test that at some point, but I've got a lot on my plate
  and probably won't find time.  (An agreement in principle might
  add a bit of incentive though.)


  The tests that get worse have XFmode registers that only occur in move
  instructions.  The situation here is similar to the MIPS one that I'm
  trying to fix: GENERAL_REGS and FP_REGS have a cost of 0 for an XFmode
  pseudo register P, but the union class (ALL_REGS) has a higher cost
  than memory.  (Move instructions have memory-to-memory alternatives,
  so memory looks cheap.  And FPR<->GPR moves have a cost of 4 rather
  than 2.)

  So before the patch, IRA would decide that spilling P to memory was
  cheaper than allocating a register (which is not in fact true).  We then
  rely on reload inheritance rather than register allocation to handle P

  In the problem cases, reload inheritance is able to treat P as though
  it had been allocated a register: the definition and uses are in the
  same block, and reload is able to inherit the definition's output
  reload.  The difference is that reload sees the "!" markers in the
  movxf instruction, so it only considers using an FPR, never a data
  or address register.  IRA (and the old reg-class.c code) ignore "!"
  markers, so GPRs and FPRs seem equally cheap.

  Thus if we allocate the register using IRA, the reg_alloc_order causes
  us to prefer a GPR.  If we don't allocate a register and rely on reload
  inheritance, we would only ever use an FPR.  And GPR XFmode moves are
  more expensive than FPR moves, which leads to the code-quality regression.

  Andrew recently raised a similar issue for PowerPC:

  and I believe the decision reached on that thread is that IRA and
  reg-class.c are doing the right thing by ignoring "!".  If so, I think
  this is a bug in the m68k port.  The movxf GPR alternatives are more
  expensive than the FPR ones, so they should have "?" markers as well
  as "!" markers.  (Or maybe "*" markers, but they've always seemed like
  a bit of a hack.)

Bootstrapped & regression-tested on x86_64-linux-gnu.  OK to install?


	* ira-costs.c (find_allocno_class_costs): Work out the maximum
	allocno_costs value of the classees with the lowest total_costs
	value.  Use this to set ALLOCNO_COVER_CLASS_COST here...
	(setup_allocno_cover_class_and_costs): ...rather than here.
	Use the ALLOCNO_COVER_CLASS_COST for all registers in the
	preferred class.

Index: gcc/ira-costs.c
--- gcc/ira-costs.c	2008-11-02 15:37:40.000000000 +0000
+++ gcc/ira-costs.c	2008-11-07 23:58:50.000000000 +0000
@@ -1149,7 +1149,7 @@ find_allocno_class_costs (void)
 	  ira_allocno_t a, parent_a;
 	  int rclass, a_num, parent_a_num;
 	  ira_loop_tree_node_t parent;
-	  int best_cost;
+	  int best_cost, allocno_cost;
 	  enum reg_class best, alt_class, common_class;
 	  int inc_dec_p = false;
@@ -1255,6 +1255,7 @@ find_allocno_class_costs (void)
 		  /* Finding best class which is subset of the common
 		     class.  */
 		  best_cost = (1 << (HOST_BITS_PER_INT - 2)) - 1;
+		  allocno_cost = best_cost;
 		  best = ALL_REGS;
 		  for (k = 0; k < cost_classes_num; k++)
@@ -1279,12 +1280,21 @@ find_allocno_class_costs (void)
 			    = COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k];
+			  allocno_cost
+			    = COSTS_OF_ALLOCNO (allocno_costs, a_num)->cost[k];
 			  best = (enum reg_class) rclass;
 		      else if (COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k]
 			       == best_cost)
-			best = ira_reg_class_union[best][rclass];
+			{
+			  best = ira_reg_class_union[best][rclass];
+			  allocno_cost
+			    = MAX (allocno_cost,
+				   COSTS_OF_ALLOCNO (allocno_costs,
+						     a_num)->cost[k]);
+			}
+		  ALLOCNO_COVER_CLASS_COST (a) = allocno_cost;
 	      if (internal_flag_ira_verbose > 2 && ira_dump_file != NULL
 		  && (pass == 0 || allocno_pref[a_num] != best))
@@ -1398,6 +1408,7 @@ setup_allocno_cover_class_and_costs (voi
   int *reg_costs;
   enum reg_class cover_class, rclass;
   enum machine_mode mode;
+  HARD_REG_SET *pref;
   ira_allocno_t a;
   ira_allocno_iterator ai;
@@ -1412,10 +1423,7 @@ setup_allocno_cover_class_and_costs (voi
       if (cover_class == NO_REGS)
       ALLOCNO_AVAILABLE_REGS_NUM (a) = ira_available_class_regs[cover_class];
-      num = cost_class_nums[allocno_pref[i]];
-      ira_assert (num >= 0);
-	= COSTS_OF_ALLOCNO (allocno_costs, i)->cost[num];
+      pref = &reg_class_contents[allocno_pref[i]];
       if (optimize && ALLOCNO_COVER_CLASS (a) != allocno_pref[i])
 	  n = ira_class_hard_regs_num[cover_class];
@@ -1424,17 +1432,23 @@ setup_allocno_cover_class_and_costs (voi
 	  for (j = n - 1; j >= 0; j--)
 	      regno = ira_class_hard_regs[cover_class][j];
-	      rclass = REGNO_REG_CLASS (regno);
-	      num = cost_class_nums[rclass];
-	      if (num < 0)
+	      if (TEST_HARD_REG_BIT (*pref, regno))
+		reg_costs[j] = ALLOCNO_COVER_CLASS_COST (a);
+	      else
-		  /* The hard register class is not a cover class or a
-		     class not fully inside in a cover class -- use
-		     the allocno cover class.  */
-		  ira_assert (ira_hard_regno_cover_class[regno] == cover_class);
-		  num = cost_class_nums[cover_class];
+		  rclass = REGNO_REG_CLASS (regno);
+		  num = cost_class_nums[rclass];
+		  if (num < 0)
+		    {
+		      /* The hard register class is not a cover class or a
+			 class not fully inside in a cover class -- use
+			 the allocno cover class.  */
+		      ira_assert (ira_hard_regno_cover_class[regno]
+				  == cover_class);
+		      num = cost_class_nums[cover_class];
+		    }
+		  reg_costs[j] = COSTS_OF_ALLOCNO (allocno_costs, i)->cost[num];
-	      reg_costs[j] = COSTS_OF_ALLOCNO (allocno_costs, i)->cost[num];

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]