This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: ARM compiler rewriting code to be longer and slower


Hi Zoltan,

<some parts snipped>
On Fri, Mar 13, 2009 at 9:16 AM,  <zoltan@bendor.com.au> wrote:

> Note that it is sub-optimal on two counts.
>
> First, each loading of a constant takes 3 instructions and 3 clocks.
> Storing the constant and fetching it using an ldr also takes 3 clocks but
> only two 32-bit words and identical constants need to be stored only once.
> The speed increase is only true on the ARM7TDMI-S, which has no caches, so
> that's just a minor issue, but the memory saving is true no matter what
> ARM core you have (note that -Os was specified).
>
> Second, and this is the real problem, if the compiler did not want to be
> overly clever and compiled the code as it was written, then instead of
> loading the constants 4 times, at the cost of 3 instructions each, it could
> have loaded it only once and then generated the next constants at the cost
> of a single-word, single clock shift. The code would have been rather
> shorter *and* faster, plus some of the jumps could have been eliminated.
> Practically each C statement line (except the braces) corresponds to one
> assembly instruction, so without being clever, just translating what's
> written, it could be done in 20 words instead of 30.

I looked at this for a while on Friday and found that the conditional
constant propagation pass (tree-ssa-ccp.c) has pushed the constant
values down into their uses. CCP runs early in the optimization
pipeline because constant propagation is, in general, a good idea, and
a number of other tree optimizers likewise fold constants into
expressions. The problem appears to be that on an architecture where
loading an immediate costs more than a simple arithmetic operation,
the final code generated might not be the most efficient.


With some more experimentation in the last hour or so I found that for
this particular case, I can get the following code

divby1e9:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L7
	cmp	r0, r3
	mov	r2, #0
	bcc	.L2
	mov	r3, r3, asl #2
	cmp	r0, r3
	rsbcs	r0, r3, r0
	addcs	r2, r2, #4
	bcs	.L2
	mov	r3, r3, lsr #1
	cmp	r0, r3
	rsbcs	r0, r3, r0
	mov	r3, r3, lsr #1
	movcs	r2, #2
	cmp	r0, r3
	rsbcs	r0, r3, r0
	addcs	r2, r2, #1
.L2:
	str	r2, [r1, #0]
	bx	lr
.L8:
	.align	2
.L7:
	.word	1000000000
	.size	divby1e9, .-divby1e9
	.ident	"GCC: (GNU) 4.4.0 20090313 (experimental) [trunk revision 143499]"


but with the following command line options.

./xgcc -B`pwd` -S -Os newpr.c -fno-tree-ccp -fno-tree-fre \
    -fno-tree-vrp -fno-tree-dominator-opts -fno-gcse
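
For readers who prefer C, the control flow of the divby1e9 listing can be
modelled roughly as below. This is a transcription of the generated
assembly, not the original newpr.c. The function and parameter names are
hypothetical, and treating the value left in r0 as the returned remainder
is an assumption.

```c
#include <stdint.h>

/* Hypothetical C model of the listing: the quotient is stored through
   *q, and the remainder is returned (assumed, since r0 holds it at
   the bx lr). */
uint32_t divby1e9_model(uint32_t x, uint32_t *q)
{
    uint32_t t = 1000000000u;      /* ldr r3, .L7 */
    uint32_t n = 0;                /* mov r2, #0 */

    if (x >= t) {                  /* bcc .L2 skips this when x < t */
        t <<= 2;                   /* mov r3, r3, asl #2 */
        if (x >= t) {              /* rsbcs / addcs / bcs .L2 */
            x -= t;
            n += 4;
        } else {
            t >>= 1;               /* mov r3, r3, lsr #1: 2000000000 */
            if (x >= t) {          /* rsbcs / movcs */
                x -= t;
                n = 2;
            }
            t >>= 1;               /* mov r3, r3, lsr #1: 1000000000 */
            if (x >= t) {          /* rsbcs / addcs */
                x -= t;
                n += 1;
            }
        }
    }
    *q = n;                        /* str r2, [r1, #0] */
    return x;
}
```

The constant is loaded once and every later threshold comes from a
single shift, which is the shape the original poster was asking for.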


I'm not sure about the best way to fix this but I've filed this for
the moment as

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468



cheers
Ramana

---
Ramana Radhakrishnan
ARM Ltd.

