This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Autoincrement examples


Michael Hayes wrote:

...
>  > - I'm not sure if your code can rewrite
>  >   (mem (plus (reg ..) (const_int ..))) to use a different offset?
>  >   FWIW, mine can't, but it is feasible with some effort to
>  > implement it.
> 
> With the information that I collect, these transformations would be
> straightforward.  More importantly, I would like to reorder some of
> the memory references to improve autoinc generation.

IMHO this is bad.

Improving autoinc generation prior to sched1 may have the nasty side effect
of reducing sched1's ability to reorder instructions since sched1 is 
unable to recalculate memory offsets.

This problem with autoinc generation is basically a subset of the address
inheritance issue, and I believe a better solution to the entire address
inheritance issue is to eliminate as much address arithmetic as possible
(incl. autoinc generation) prior to sched1 and regenerate it after scheduling.

Consider this code on an in-order superscalar processor:

	*dest++ += 1;
	*dest++ += 1;

If autoinc is generated before sched1, then it will generate code
somewhat like this:

	move.l	(r0),r1
	add	#1,r1
	move.l	r1,(r0)+
	move.l	(r0),r2
	add	#1,r2
	move.l	r2,(r0)+

When sched1 tries to reorder this code, it will try to hide the latency
of the two memory loads but fail, because the post-increment stores
inhibit proper scheduling: the second load takes its address from r0 as
updated by the first store, so it cannot be moved above that store.

If the target processor is a typical in-order, single-issue scalar
processor with a memory load latency of two clocks (Hitachi SH2/SH3, 486,
R4000, etc.), the previous code saves two clocks (two add #4,r0
instructions) but loses two clocks to load latency, for a net gain of
zero clocks.
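
The stall count above can be checked with a toy model. The following is
only a sketch in Python of a single-issue, in-order pipeline; the latency
constants, the helper names (load, add_imm, store, count_clocks,
stall_clocks) and the instruction encoding are assumptions made for
illustration and do not describe any real SH, x86 or MIPS pipeline. It
models the autoinc sequence above and the reordered sequence shown at the
end of this message.

```python
# Toy single-issue, in-order pipeline model. The latencies and the
# instruction encoding below are illustrative assumptions.
LOAD_LATENCY = 2  # clocks before a loaded value can be used
ALU_LATENCY = 1   # clocks before an add or post-increment result can be used


def load(dst, base):
    """A load through base into dst; the offset does not affect timing."""
    return ([base], {dst: LOAD_LATENCY})


def add_imm(reg):
    """add #imm,reg: reads and writes reg."""
    return ([reg], {reg: ALU_LATENCY})


def store(src, base, postinc=False):
    """A store of src through base; a post-increment also writes base."""
    writes = {base: ALU_LATENCY} if postinc else {}
    return ([src, base], writes)


def count_clocks(program):
    """Return the number of clocks needed to issue every instruction."""
    ready = {}  # register -> first clock at which its value is usable
    clock = 0   # earliest clock at which the next instruction may issue
    for reads, writes in program:
        # Stall until every source register is available.
        issue = max([clock] + [ready.get(r, 0) for r in reads])
        for reg, latency in writes.items():
            ready[reg] = issue + latency
        clock = issue + 1  # at most one instruction issues per clock
    return clock


def stall_clocks(program):
    """Clocks lost to waiting, beyond one clock per instruction."""
    return count_clocks(program) - len(program)


# Autoinc generated before sched1 (the six-instruction sequence above).
AUTOINC_BEFORE_SCHED1 = [
    load("r1", "r0"), add_imm("r1"), store("r1", "r0", postinc=True),
    load("r2", "r0"), add_imm("r2"), store("r2", "r0", postinc=True),
]

# Autoinc regenerated after sched1 (the final sequence in this message).
AUTOINC_AFTER_SCHED1 = [
    load("r1", "r0"), load("r2", "r0"),
    add_imm("r1"), add_imm("r2"),
    store("r1", "r0", postinc=True), store("r2", "r0", postinc=True),
]

print(stall_clocks(AUTOINC_BEFORE_SCHED1), stall_clocks(AUTOINC_AFTER_SCHED1))
```

Under these assumptions the first sequence loses two clocks, one per
load, and the reordered sequence loses none.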

If the target processor is a typical in-order superscalar processor
issuing two instructions per clock with a memory load latency of two
clocks (Hitachi SH4, Pentium, R5000, etc.), then you will have saved two
half-clocks (two add #4,r0 instructions), but the resulting code is unable
to hide the memory load latency, so the processor stalls for either four
or six half-clocks (depending on pairing) for a net loss of two or four
half-clocks.

I believe that the best solution to this problem is to "flatten" all the
autoinc addressing modes prior to sched1, i.e. pretend the target supports
large offsets for memory references and convert all pre/post inc/dec
instructions to offset memory references followed by a fixup at the end of
the basic block. This gives the scheduler maximum freedom for reordering
instructions. After scheduling, address inheritance can be generated.
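
As a rough illustration of the flattening step, here is a minimal Python
sketch. It is not GCC code: the Insn record, its field names and the
flatten function are hypothetical, it handles only post-increment, and it
assumes an address register is used only for addressing within the block
once it has been incremented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical instruction record, for illustration only."""
    op: str           # "load", "store", or "add"
    reg: str          # data register, or the register an add modifies
    base: str = ""    # address register of a load/store
    offset: int = 0   # displacement of a load/store, immediate of an add
    postinc: int = 0  # bytes added to base after the access (0 = none)


def flatten(block):
    """Rewrite post-increment accesses as offset accesses plus one fixup."""
    pending = {}  # base register -> increment not yet applied to it
    out = []
    for insn in block:
        if insn.reg in pending:
            # The register's real value lags behind; this sketch gives up.
            raise ValueError("base register used as data after increment")
        if insn.op in ("load", "store"):
            delta = pending.get(insn.base, 0)
            # Fold the increments seen so far into the displacement.
            out.append(Insn(insn.op, insn.reg, insn.base, insn.offset + delta))
            if insn.postinc:
                pending[insn.base] = delta + insn.postinc
        else:
            out.append(insn)
    # One fixup per base register at the end of the basic block.
    for base, delta in pending.items():
        out.append(Insn("add", base, offset=delta))
    return out


# The autoinc sequence from earlier in this message.
AUTOINC = [
    Insn("load", "r1", "r0"),
    Insn("add", "r1", offset=1),
    Insn("store", "r1", "r0", postinc=4),
    Insn("load", "r2", "r0"),
    Insn("add", "r2", offset=1),
    Insn("store", "r2", "r0", postinc=4),
]

FLAT = flatten(AUTOINC)
```

Because every access in FLAT is expressed relative to the value r0 had
on entry to the block, and only the final add writes r0, the loads no
longer depend on the stores through r0.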

Applying this to the previous sample would give the following code
prior to sched1:

	move.l	(r0),r1
	add	#1,r1
	move.l	r1,(r0)
	move.l	(4,r0),r2
	add	#1,r2
	move.l	r2,(4,r0)
	add	#8,r0

This code could be properly optimized by the scheduler to:

	move.l	(r0),r1
	move.l	(4,r0),r2
	add	#1,r1
	add	#1,r2
	move.l	r1,(r0)
	move.l	r2,(4,r0)
	add	#8,r0

Post-sched1, the autoinc could be generated:

	move.l	(r0),r1
	move.l	(4,r0),r2
	add	#1,r1
	add	#1,r2
	move.l	r1,(r0)+
	move.l	r2,(r0)+
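
The regeneration step can be sketched in the same spirit. The Python
below is an assumption-laden illustration, not an existing GCC pass: the
Insn record and regenerate_postinc are hypothetical names, only a single
trailing "add #N,base" fixup is handled, every access is assumed to be
four bytes wide, and the block is returned unchanged unless the whole
fixup can be absorbed. It walks the block backwards, turning an access
into a post-increment whenever the base register must be exactly one
access size past that access's address afterwards.

```python
from dataclasses import dataclass, replace

ACCESS_SIZE = 4  # bytes moved by move.l; the only increment step modelled


@dataclass(frozen=True)
class Insn:
    """Hypothetical instruction record, for illustration only."""
    op: str           # "load", "store", or "add"
    reg: str          # data register, or the register an add modifies
    base: str = ""    # address register of a load/store
    offset: int = 0   # displacement of a load/store, immediate of an add
    postinc: int = 0  # bytes added to base after the access (0 = none)


def regenerate_postinc(block):
    """Fold a trailing 'add #N,base' into post-increment accesses."""
    if not block or block[-1].op != "add":
        return list(block)
    base = block[-1].reg
    after = block[-1].offset  # base value (relative to entry) after insn
    rewritten = []
    for insn in reversed(block[:-1]):
        if insn.reg == base or insn.postinc:
            return list(block)  # outside what this sketch handles
        if insn.op in ("load", "store") and insn.base == base:
            if after - insn.offset == ACCESS_SIZE:
                # Base must end up one access past this address: use (base)+.
                insn = replace(insn, offset=0, postinc=ACCESS_SIZE)
                after -= ACCESS_SIZE
            else:
                # Keep an offset access, relative to the base value here.
                insn = replace(insn, offset=insn.offset - after)
        rewritten.append(insn)
    if after != 0:
        return list(block)  # fixup not fully absorbed: leave block alone
    rewritten.reverse()
    return rewritten


# The scheduled, flattened sequence shown earlier in this message.
SCHEDULED = [
    Insn("load", "r1", "r0", 0),
    Insn("load", "r2", "r0", 4),
    Insn("add", "r1", offset=1),
    Insn("add", "r2", offset=1),
    Insn("store", "r1", "r0", 0),
    Insn("store", "r2", "r0", 4),
    Insn("add", "r0", offset=8),
]

RESULT = regenerate_postinc(SCHEDULED)
```

For the scheduled block, the two stores become post-increment stores,
the loads keep their offsets of 0 and 4, and the trailing add
disappears, matching the sequence above.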

This, IMHO, is a better way to generate address inheritance.

Toshi

