This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: ARM : code less efficient with gcc-trunk ?


On Mon, 16 Feb 2009 10:17:36 -0500
Daniel Jacobowitz <drow@false.org> wrote:

> On Mon, Feb 16, 2009 at 12:19:52PM +0100, Vincent R. wrote:
> > 00011000 <WinMainCRTStartup>:
> > [...]
> 
> Notice how many more registers used to be pushed?  I expect the new
> code is faster.

Assuming an ARM7 core with 0 wait-state memory and removing all the
identical call bits from the functions, the clocks are on the right
hand side:

   11000:	e92d40f0 	push	{r4, r5, r6, r7, lr}  7
   11004:	e1a04000 	mov	r4, r0                1
   11008:	e1a05001 	mov	r5, r1                1
   1100c:	e1a06002 	mov	r6, r2                1
   11010:	e1a07003 	mov	r7, r3                1
   11024:	e1a01005 	mov	r1, r5                1
   11028:	e1a00004 	mov	r0, r4                1
   1102c:	e1a02006 	mov	r2, r6                1
   11030:	e1a03007 	mov	r3, r7                1
   11038:	e1a04000 	mov	r4, r0                1
   11040:	e1a01004 	mov	r1, r4                1
   11044:	e3a00042 	mov	r0, #66               1

Total: 12 insns, 18 clocks

   11000:	e92d4010 	push	{r4, lr}              4
   11004:	e1a04000 	mov	r4, r0                1
   11008:	e24dd00c 	sub	sp, sp, #12           1
   1100c:	e58d1008 	str	r1, [sp, #8]          2
   11010:	e58d2004 	str	r2, [sp, #4]          2
   11014:	e58d3000 	str	r3, [sp]              2
   11028:	e59d1008 	ldr	r1, [sp, #8]          3
   1102c:	e1a00004 	mov	r0, r4                1
   11030:	e59d2004 	ldr	r2, [sp, #4]          3
   11034:	e59d3000 	ldr	r3, [sp]              3
   1103c:	e1a04000 	mov	r4, r0                1
   11044:	e1a01004 	mov	r1, r4                1
   11048:	e3a00042 	mov	r0, #66               1

Total: 13 insns, 25 clocks.
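
For anyone who wants to re-check the two tallies, here is a small
Python sketch that just adds up the per-instruction clocks from the
listings above. The constant and function names are made up for the
illustration, and the clock values are only the ones assumed in this
message (ARM7, 0 wait-state memory), not measured figures.

```python
# Per-instruction clocks as assumed in this message
# (ARM7 core, 0 wait-state memory). Names are illustrative only.
MOV = 1
SUB = 1
STR = 2
LDR = 3


def push_clocks(n_regs):
    """Clocks for a push of n_regs registers: 2 + n, per the listings."""
    return 2 + n_regs


# 4.1.x: push {r4, r5, r6, r7, lr}, then 11 register moves
old = [push_clocks(5)] + [MOV] * 11

# 4.4.x: push {r4, lr}, mov, sub, 3 stores, then loads and moves
# in the order they appear in the listing
new = [push_clocks(2), MOV, SUB, STR, STR, STR,
       LDR, MOV, LDR, LDR, MOV, MOV, MOV]

old_total = sum(old)
new_total = sum(new)
slowdown = (new_total - old_total) / old_total

print(len(old), old_total)   # 12 18
print(len(new), new_total)   # 13 25
print(round(slowdown, 4))    # 0.3889
```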

So the code generated by the 4.4.x compiler is almost 40%
slower ((25-18)/18 = 0.3889) than the 4.1.x version and it is also
longer. Pushing many registers is cheap because it takes 2+n clocks
to move n registers to memory, and then it is n extra clocks to copy
your n registers to the call-saved ones that you pushed. Total cost
2+2n. Storing them individually costs you 1 clock to make space on the
stack, 2n clocks to store them on the stack, i.e. 1+2n. In addition,
when you get them to become parameters to the function calls, a reg-reg
move costs you 1 clock while a load from memory is 3. The example
function does not actually return, but if it did, the old compiler
would lose some of its advantage. The old compiler would finish the
function with

  pop {r4,r5,r6,r7,pc} (9 clocks, final: 13 insns 27 clocks)

and the new compiler's version would be

  add sp,sp,#12 (1 clock)
  pop {r4,pc}   (6 clocks, final: 15 insns 32 clocks)

Even then the old compiler would still beat the new one both in size
and speed.
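
The same comparison can be written per preserved value. The sketch
below is only a rough model under assumptions that are mine, not the
compiler's: it uses the clocks from the listings (mov 1, str 2, ldr 3,
one extra clock per extra register in a push or pop), it assumes each
value is moved back into an argument register exactly once, and the
function names are hypothetical.

```python
def keep_in_register_cost(k, returns=False):
    """Extra clocks to keep k values in call-saved registers.

    Per value: 1 extra clock in the push, 1 mov into the saved
    register, 1 mov back into the argument register.
    """
    clocks = 3 * k
    if returns:
        clocks += k  # 1 extra clock per register in the final pop
    return clocks


def spill_to_stack_cost(k, returns=False):
    """Extra clocks to spill k values to the stack instead.

    1 clock for 'sub sp', then per value a str (2) and an ldr (3).
    """
    clocks = 1 + (2 + 3) * k
    if returns:
        clocks += 1  # 'add sp' before the final pop
    return clocks


# The example function preserves 3 values this way (r1, r2, r3).
k = 3
print(spill_to_stack_cost(k) - keep_in_register_cost(k))              # 7
print(spill_to_stack_cost(k, True) - keep_in_register_cost(k, True))  # 5
```

For k = 3 the model gives back the 7 clock gap between the two listings
(25 vs 18) and the 5 clock gap once the function returns (32 vs 27).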

Zoltan

