This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: Autoincrement examples
- To: m dot hayes at elec dot canterbury dot ac dot nz (Michael Hayes)
- Subject: Re: Autoincrement examples
- From: Toshiyasu Morita <tm at netcom dot com>
- Date: Thu, 18 Nov 1999 23:06:20 -0800 (PST)
- Cc: gcc at gcc dot gnu dot org
Michael Hayes wrote:
...
> > - I'm not sure if your code can rewrite
> > (mem (plus (reg ..) (const_int ..))) to use a different offset?
> > FWIW, mine can't, but it is feasible with some effort to
> > implement it.
>
> With the information that I collect, these transformations would be
> straightforward. More importantly, I would like to reorder some of
> the memory references to improve autoinc generation.
IMHO this is bad.
Improving autoinc generation prior to sched1 may have the nasty side effect
of reducing sched1's ability to reorder instructions since sched1 is
unable to recalculate memory offsets.
This problem with autoinc generation is basically a subset of the address
inheritance issue, and I believe a better solution to the entire address
inheritance issue is to eliminate as much address arithmetic as possible
(incl. autoinc generation) prior to sched1 and regenerate it after scheduling.
Consider this code on an in-order superscalar processor:
    *dest++ += 1;
    *dest++ += 1;
If autoinc is generated before sched1, then it will generate code
somewhat like this:
    move.l (r0),r1
    add #1,r1
    move.l r1,(r0)+
    move.l (r0),r2
    add #1,r2
    move.l r2,(r0)+
When sched1 tries to reorder this code, it will try to hide the latency
of the two memory loads but fail, because the post-increment stores
inhibit proper scheduling: the first store updates r0, so the second load
cannot be moved above it.
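The dependence that blocks the scheduler can be made explicit with a small sketch. This is Python, not GCC code. The tuple encoding of each instruction (text, registers read, registers written) and the name `true_deps` are assumptions made for illustration, and only read-after-write register dependences are tracked.

```python
def true_deps(insns):
    """Return (producer, consumer) index pairs for read-after-write register deps."""
    last_writer = {}  # register -> index of the most recent instruction writing it
    deps = set()
    for i, (_text, reads, writes) in enumerate(insns):
        for r in reads:
            if r in last_writer:
                deps.add((last_writer[r], i))
        for w in writes:
            last_writer[w] = i
    return deps


# The autoinc sequence: each post-increment store also writes r0.
autoinc = [
    ("move.l (r0),r1",  ["r0"],       ["r1"]),
    ("add #1,r1",       ["r1"],       ["r1"]),
    ("move.l r1,(r0)+", ["r1", "r0"], ["r0"]),
    ("move.l (r0),r2",  ["r0"],       ["r2"]),
    ("add #1,r2",       ["r2"],       ["r2"]),
    ("move.l r2,(r0)+", ["r2", "r0"], ["r0"]),
]

# The same work with plain offset addressing and one fixup at the end.
offsets = [
    ("move.l (r0),r1",   ["r0"],       ["r1"]),
    ("add #1,r1",        ["r1"],       ["r1"]),
    ("move.l r1,(r0)",   ["r1", "r0"], []),
    ("move.l (4,r0),r2", ["r0"],       ["r2"]),
    ("add #1,r2",        ["r2"],       ["r2"]),
    ("move.l r2,(4,r0)", ["r2", "r0"], []),
    ("add #8,r0",        ["r0"],       ["r0"]),
]

# Instruction 3 (second load) depends on instruction 2 (first store)
# only in the autoinc form.
print((2, 3) in true_deps(autoinc))   # True
print((2, 3) in true_deps(offsets))   # False
```

In the autoinc form the six instructions form a single dependence chain, so there is nothing for the scheduler to overlap. In the offset form the two load/add/store groups are independent of each other.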
If the target processor is a typical in-order single-issue scalar processor
with a memory load latency of two clocks (Hitachi SH2/SH3, 486, R4000, etc.),
the previous code saves two clocks (the two add #4,r0 instructions) but loses
two clocks to memory latency, for a net gain of zero clocks.
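The clock counts for the single-issue case can be checked with a toy timing model. This is only a sketch under stated assumptions: one instruction issues per clock in program order, a load result is usable two clocks after issue, every other result one clock after issue, and memory dependences are ignored. The function name `clocks` and the tuple encoding are hypothetical.

```python
def clocks(insns, load_latency=2):
    """Clocks needed to issue `insns` in order on a single-issue machine."""
    ready = {}   # register -> first clock at which its value can be read
    cycle = 0    # next free issue slot
    for _text, reads, writes, is_load in insns:
        # Stall until every source register is available.
        issue = max([cycle] + [ready.get(r, 0) for r in reads])
        latency = load_latency if is_load else 1
        for w in writes:
            ready[w] = issue + latency
        cycle = issue + 1
    return cycle


# Autoinc form: each add stalls one clock waiting for its load.
autoinc = [
    ("move.l (r0),r1",  ["r0"],       ["r1"], True),
    ("add #1,r1",       ["r1"],       ["r1"], False),
    ("move.l r1,(r0)+", ["r1", "r0"], ["r0"], False),
    ("move.l (r0),r2",  ["r0"],       ["r2"], True),
    ("add #1,r2",       ["r2"],       ["r2"], False),
    ("move.l r2,(r0)+", ["r2", "r0"], ["r0"], False),
]

# No autoinc, two explicit add #4,r0, loads hoisted by the scheduler.
scheduled = [
    ("move.l (r0),r1",   ["r0"],       ["r1"], True),
    ("move.l (4,r0),r2", ["r0"],       ["r2"], True),
    ("add #1,r1",        ["r1"],       ["r1"], False),
    ("add #1,r2",        ["r2"],       ["r2"], False),
    ("move.l r1,(r0)",   ["r1", "r0"], [],     False),
    ("move.l r2,(4,r0)", ["r2", "r0"], [],     False),
    ("add #4,r0",        ["r0"],       ["r0"], False),
    ("add #4,r0",        ["r0"],       ["r0"], False),
]

print(clocks(autoinc))    # 8: six instructions plus two load-use stalls
print(clocks(scheduled))  # 8: eight instructions, no stalls
```

Under these assumptions both forms take eight clocks, which is the "net gain of zero" described above.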
If the target processor is a typical in-order superscalar processor
issuing two instructions per clock with a memory load latency of two
clocks (Hitachi SH4, Pentium, R5000, etc.), then you will have saved two
half-clocks (the two add #4,r0 instructions), but the resulting code is
unable to hide the memory load latency, so the processor stalls for either
four or six half-clocks (depending on pairing), for a net loss of two or
four half-clocks.
I believe that the best solution to this problem is to "flatten" all the
autoinc addressing modes prior to sched1, i.e. pretend the target supports
large offsets for memory references and convert all pre/post inc/dec
instructions to offset memory references followed by a fixup at the end of
the basic block. This gives the scheduler maximum freedom for reordering
instructions. After scheduling, the address inheritance can be regenerated.
Applying this to the previous sample would give the following code
prior to sched1:
    move.l (r0),r1
    add #1,r1
    move.l r1,(r0)
    move.l (4,r0),r2
    add #1,r2
    move.l r2,(4,r0)
    add #8,r0
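The flattening step that produces the listing above can be sketched roughly as follows. This is an illustration in Python, not the GCC implementation. The toy instruction encoding `(op, reg, arg)`, the function name `flatten_postinc`, and the fixed 4-byte access size are all assumptions. The sketch also assumes that within the basic block the base register is modified only through post-increment memory operands.

```python
def flatten_postinc(insns, size=4):
    """Rewrite post-increment references as offset references plus one fixup.

    Each instruction is (op, reg, arg):
      ("load",  dst, (base, offset, postinc))
      ("store", src, (base, offset, postinc))
      ("addi",  reg, immediate)
    """
    pending = {}  # base register -> bytes of increment not yet applied
    out = []
    for op, reg, arg in insns:
        if op in ("load", "store"):
            base, offset, postinc = arg
            # Fold the increments skipped so far into the offset.
            out.append((op, reg, (base, offset + pending.get(base, 0), False)))
            if postinc:
                pending[base] = pending.get(base, 0) + size
        else:
            out.append((op, reg, arg))
    # One fixup per base register at the end of the basic block.
    for base, amount in pending.items():
        if amount:
            out.append(("addi", base, amount))
    return out


block = [
    ("load",  "r1", ("r0", 0, False)),   # move.l (r0),r1
    ("addi",  "r1", 1),                  # add #1,r1
    ("store", "r1", ("r0", 0, True)),    # move.l r1,(r0)+
    ("load",  "r2", ("r0", 0, False)),   # move.l (r0),r2
    ("addi",  "r2", 1),                  # add #1,r2
    ("store", "r2", ("r0", 0, True)),    # move.l r2,(r0)+
]

for insn in flatten_postinc(block):
    print(insn)
```

A real pass would also have to handle pre-increment and decrement modes, other uses of the base register, and offsets the target cannot encode. None of that is attempted here.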
This code could be properly optimized by the scheduler to:
    move.l (r0),r1
    move.l (4,r0),r2
    add #1,r1
    add #1,r2
    move.l r1,(r0)
    move.l r2,(4,r0)
    add #8,r0
Post-sched1, the autoinc could be generated:
    move.l (r0),r1
    move.l (4,r0),r2
    add #1,r1
    add #1,r2
    move.l r1,(r0)+
    move.l r2,(r0)+
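One possible way to regenerate the post-increments after scheduling is to walk the block backward from the fixup and absorb it into the trailing memory references. The sketch below is an assumption about how this could be done, not a description of any existing GCC pass. The `(op, reg, arg)` encoding, the name `regenerate_postinc`, the 4-byte access size, and the all-or-nothing policy are hypothetical. It assumes the block ends with the fixup and that nothing else in the block modifies the base register.

```python
def regenerate_postinc(insns, base, size=4):
    """Fold a trailing ("addi", base, n) fixup into post-increment references.

    Each instruction is (op, reg, arg), with arg = (base, offset, postinc)
    for "load"/"store" and an immediate for "addi".
    """
    if not insns or insns[-1][0] != "addi" or insns[-1][1] != base:
        return list(insns)
    # delta: how far `base` is ahead of its block-entry value
    # after the instruction currently being visited.
    delta = insns[-1][2]
    out = []
    for op, reg, arg in reversed(insns[:-1]):
        if op in ("load", "store") and arg[0] == base:
            offset = arg[1]
            if delta > 0 and offset + size == delta:
                # The address equals the pre-increment value of `base`.
                out.append((op, reg, (base, 0, True)))
                delta -= size
            else:
                out.append((op, reg, (base, offset - delta, False)))
        else:
            out.append((op, reg, arg))
    if delta != 0:
        return list(insns)  # fixup not fully absorbed: keep the flattened form
    out.reverse()
    return out


scheduled = [
    ("load",  "r1", ("r0", 0, False)),   # move.l (r0),r1
    ("load",  "r2", ("r0", 4, False)),   # move.l (4,r0),r2
    ("addi",  "r1", 1),                  # add #1,r1
    ("addi",  "r2", 1),                  # add #1,r2
    ("store", "r1", ("r0", 0, False)),   # move.l r1,(r0)
    ("store", "r2", ("r0", 4, False)),   # move.l r2,(4,r0)
    ("addi",  "r0", 8),                  # add #8,r0
]

for insn in regenerate_postinc(scheduled, "r0"):
    print(insn)
```

On this input the two stores become post-increment stores and the fixup disappears, matching the listing above. Walking backward is a design choice: a forward walk would attach the increments to the loads instead, which is also valid code but leaves the stores with negative offsets.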
This, IMHO, is a better way to generate address inheritance.
Toshi