
Register allocation: caller-save vs spilling


Hi,

I'm investigating various register allocation inefficiencies. The first thing that stands out is
that GCC supports both caller-save and spilling. Spilling appears to spill all definitions and
all uses of a live range, which means you often end up with multiple reloads close together, while
it would be more efficient to do a single load and then reuse the loaded value several times.
Caller-save does better in that case, but it is inefficient in that it repeatedly stores registers
across every call, even if the value is unchanged. If both were fixed to minimise the number of
loads/stores, I can't see how one could beat the other, so you'd no longer need both.
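
To make the trade-off concrete, here is a small, invented C example (not from my actual test case):
K is live across both calls and never modified, so caller-save stores and reloads it around each
call, while the current spilling approach reloads it at every use instead of loading once and
reusing the register.

/* Hypothetical illustration; f() and the constant are made up.  */
extern float f (float);

float
scale_twice (float x)
{
  const float K = 1.2345f;
  x = f (x * K);   /* K must survive this call...  */
  x = f (x * K);   /* ...and this one too.  */
  return x * K;
}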

Anyway, due to the current implementation there are clearly cases where caller-save is best and
cases where spilling is best. However, I do not see GCC making the correct decision, even though it
tries to account for the costs: some code is significantly faster with -fno-caller-saves, while
other code wins with -fcaller-saves. As an example, I see code like this on AArch64:

        ldr     s4, .LC20
        fmul    s0, s0, s4
        str     s4, [x29, 104]
        bl      f
        ldr     s4, [x29, 104]
        fmul    s0, s0, s4

With -fno-caller-saves it spills and rematerializes the constant as you'd expect:

        ldr     s1, .LC20
        fmul    s0, s0, s1
        bl      f
        ldr     s5, .LC20
        fmul    s0, s0, s5

So given this, is the cost calculation correct, and does it include rematerialization? The spill
code knows how to rematerialize, so this should be taken into account in the costs. I did find some
code in ira-costs.c, in scan_one_insn(), that attempts something that looks like an adjustment for
rematerialization, but it doesn't appear to handle all cases (simple immediates, 2-instruction
immediates, address constants, non-aliased loads such as literal-pool and const-data loads).
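
As a hedged sketch of the kinds of values I mean (the function and names below are invented for
illustration), each of these locals can be recreated more cheaply than it can be spilled to and
reloaded from the stack:

extern void use (long, long, const char *, int);

static const int lut[4] = { 1, 2, 3, 4 };

void
remat_kinds (void)
{
  long a = 42;                 /* simple immediate: one mov           */
  long b = 0x12345678;         /* 2-instruction immediate: mov + movk */
  const char *c = "constant";  /* address constant: adrp + add        */
  int d = lut[2];              /* non-aliased load from const data    */
  use (a, b, c, d);
}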

Also, the hook CALLER_SAVE_PROFITABLE appears to have disappeared. Overall performance improves
significantly if I add this (basically the default heuristic, based on instruction frequencies):

--- a/gcc/ira-costs.c
+++ b/gcc/ira-costs.c
@@ -2230,6 +2230,8 @@ ira_tune_allocno_costs (void)
                           * ALLOCNO_FREQ (a)
                           * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
 #endif
+                  if (ALLOCNO_FREQ (a) < 4 * ALLOCNO_CALL_FREQ (a))
+                    cost = INT_MAX;
                }
              if (INT_MAX - cost < reg_costs[j])
                reg_costs[j] = INT_MAX;

If such a simple heuristic can beat the costs, they can't be quite right. 
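
For clarity, here is my reading of that check as standalone code (a sketch, not GCC source, with
plain ints in place of the allocno accessors): when a value is live across calls frequently
relative to how often it is referenced, the hard-register cost is made prohibitive, so the
allocator keeps it in memory rather than caller-saving it.

#include <limits.h>

static int
tune_cost_for_calls (int cost, int allocno_freq, int call_freq)
{
  /* Crossing calls often relative to the number of uses makes a hard
     register look prohibitively expensive, so the value is spilled
     instead of caller-saved.  */
  if (allocno_freq < 4 * call_freq)
    return INT_MAX;
  return cost;
}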

Is there anyone who understands the cost calculations?

Wilco


