This is the mail archive of the
mailing list for the GCC project.
Register allocation: caller-save vs spilling
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: <gcc at gcc dot gnu dot org>
- Date: Wed, 27 Aug 2014 17:24:46 +0100
- Subject: Register allocation: caller-save vs spilling
- Authentication-results: sourceware.org; auth=none
I'm investigating various register allocation inefficiencies. The first thing that stands out is
that GCC both supports caller-saves as well as spilling. Spilling seems to spill all definitions and
all uses of a liverange. This means you often end up with multiple reloads close together, while it
would be more efficient to do a single load and then reuse the loaded value several times.
Caller-save does better in that case, but it is inefficient in that it repeatedly stores registers
across every call even if unchanged. If both were fixed to minimise the number of loads/stores I
can't see how one could beat the other, so you'd no longer need both.
Anyway due to the current implementation there are clearly cases where caller-save is best and cases
where spilling is best. However I do not see it making the correct decision despite trying to
account for the costs - some code is significantly faster with -fno-caller-saves, other code wins
with -fcaller-saves. As an example, I see code like this on AArch64:
ldr s4, .LC20
fmul s0, s0, s4
str s4, [x29, 104]
ldr s4, [x29, 104]
fmul s0, s0, s4
With -fno-caller-saves it spills and rematerializes the constant as you'd expect:
ldr s1, .LC20
fmul s0, s0, s1
ldr s5, .LC20
fmul s0, s0, s5
So given this, is the cost calculation correct and does it include rematerialization? The spill code
understands how to rematerialize so it should take this into account in the costs. I did find some
code in ira-costs.c in scan_one_insn() that attempts something that looks like an adjustment for
rematerialization but it doesn't appear to handle all cases (simple immediates, 2-instruction
immediates, address-constants, non-aliased loads such as literal pool and const data loads).
Also the hook CALLER_SAVE_PROFITABLE appears to have disappeared - overall performance improves
significantly if I add this (basically the default heuristic used on instruction frequencies):
@@ -2230,6 +2230,8 @@ ira_tune_allocno_costs (void)
* ALLOCNO_FREQ (a)
* IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
+ if (ALLOCNO_FREQ (a) < 4 * ALLOCNO_CALL_FREQ (a))
+ cost = INT_MAX;
if (INT_MAX - cost < reg_costs[j])
reg_costs[j] = INT_MAX;
If such a simple heuristic can beat the costs, they can't be quite right.
Is there anyone who understands the cost calculations?