On 22 Mar 2005, Ian Lance Taylor wrote:
Miles Bader <miles@lsi.nec.co.jp> writes:
I've defined SECONDARY_*_RELOAD_CLASS (and PREFERRED_* to try to help
things along), and am now running into more understandable reload
problems: "unable to find a register to spill in class" :-/
The problem, as I understand, is that reload doesn't deal with
conflicts
between secondary and primary reloads -- which are common with my
arch
because it's an accumulator architecture.
For instance, slightly modifying my previous example:
Say I've got a mov instruction that only works via an accumulator
A,
and a two-operand add instruction. "r" regclass includes regs
A,X,Y,
and "a" regclass only includes reg A.
mov has constraints like: 0 = "g,a" 1 = "a,gi"
and add3 has constraints: 0 = "a" 1 = "0" 2 = "ri"
(say)
So if before reload you've got an instruction like:
add temp, [sp + 4], [sp + 6]
and v2 and v3 are in memory, it will have to have generate something
like:
mov A, [sp + 4] ; primary reload 1 in X, with secondary reload
0 A
mov X, A ; ""
mov A, [sp + 6] ; primary reload 2 in A, with no secondary
reload
add A, X
mov temp, A
There's really only _one_ register that can be used for many
reloads, A.
I don't think there is any way that reload can cope with this
directly. reload would have to get a lot smarter about ordering the
reloads.
Since you need the accumulator for so much, one approach you should
consider is not exposing the accumulator register until after reload.
You could do this by writing pretty much every insn as a
define_insn_and_split, with reload_completed as the split condition.
Then you split into code that uses the accumulator. Your add
instruction permits you to add any two general registers, and you
split into moving one into the accumulator, doing the add, and moving
the result whereever it should go. If you then split all the insns
before the postreload pass, perhaps the generated code won't even be
too horrible.
Ian
This approach by itself has obvious problems.
It will generate a lot of redundant moves to/from the accumulator
because
the accumulator is exposed much too late.
Consider the 3AC code:
add i,j,k
add k,l,m
it will be broken down into:
mov i,a
add j,a
mov a,k
mov k,a
add l,a
mov a,m
where the third and fourth instructions are basically redundant.
I did a lot of processor architecture research about three years ago,
and
I came to some interesting conclusions about accumulator architectures.
Basically, with naive code generation, you will generate 3x as many
instructions for an accumulator machine than for a 3AC machine.
If you have a SWAP instruction so you can swap the accumulator with the
index registers, then you can lower the instruction count penalty to
about
2x that of a 3AC machine. If you think about this for a while, the
reason
will become readily apparent.
In order to reach this 2x figure, it requires a good understanding of
how
the data flows through the accumulator in an accumulator arch.
Toshi