This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: ARM : code less efficient with gcc-trunk ?
On Mon, 16 Feb 2009 10:17:36 -0500
Daniel Jacobowitz <drow@false.org> wrote:
> On Mon, Feb 16, 2009 at 12:19:52PM +0100, Vincent R. wrote:
> > 00011000 <WinMainCRTStartup>:
> > [...]
>
> Notice how many more registers used to be pushed? I expect the new
> code is faster.
Assuming an ARM7 core with 0 wait-state memory and removing all the
identical call bits from the functions, the clocks are on the right
hand side:
Old code (4.1.x):

  11000: e92d40f0  push {r4, r5, r6, r7, lr}   7
  11004: e1a04000  mov r4, r0                  1
  11008: e1a05001  mov r5, r1                  1
  1100c: e1a06002  mov r6, r2                  1
  11010: e1a07003  mov r7, r3                  1
  11024: e1a01005  mov r1, r5                  1
  11028: e1a00004  mov r0, r4                  1
  1102c: e1a02006  mov r2, r6                  1
  11030: e1a03007  mov r3, r7                  1
  11038: e1a04000  mov r4, r0                  1
  11040: e1a01004  mov r1, r4                  1
  11044: e3a00042  mov r0, #66                 1

Total: 12 insns, 18 clocks.
New code (4.4.x):

  11000: e92d4010  push {r4, lr}               4
  11004: e1a04000  mov r4, r0                  1
  11008: e24dd00c  sub sp, sp, #12             1
  1100c: e58d1008  str r1, [sp, #8]            2
  11010: e58d2004  str r2, [sp, #4]            2
  11014: e58d3000  str r3, [sp]                2
  11028: e59d1008  ldr r1, [sp, #8]            3
  1102c: e1a00004  mov r0, r4                  1
  11030: e59d2004  ldr r2, [sp, #4]            3
  11034: e59d3000  ldr r3, [sp]                3
  1103c: e1a04000  mov r4, r0                  1
  11044: e1a01004  mov r1, r4                  1
  11048: e3a00042  mov r0, #66                 1

Total: 13 insns, 25 clocks.
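
The two totals can be re-checked mechanically. The Python sketch below
copies the per-instruction clock figures from the listings as given (they
are the estimates from this message, not datasheet values) and adds them
up; the names OLD_4_1, NEW_4_4 and summarize are made up for this
illustration.

```python
# Clock counts per instruction, copied from the two listings.
OLD_4_1 = [
    ("push {r4, r5, r6, r7, lr}", 7),
    ("mov r4, r0", 1), ("mov r5, r1", 1), ("mov r6, r2", 1), ("mov r7, r3", 1),
    ("mov r1, r5", 1), ("mov r0, r4", 1), ("mov r2, r6", 1), ("mov r3, r7", 1),
    ("mov r4, r0", 1), ("mov r1, r4", 1), ("mov r0, #66", 1),
]
NEW_4_4 = [
    ("push {r4, lr}", 4),
    ("mov r4, r0", 1), ("sub sp, sp, #12", 1),
    ("str r1, [sp, #8]", 2), ("str r2, [sp, #4]", 2), ("str r3, [sp]", 2),
    ("ldr r1, [sp, #8]", 3), ("mov r0, r4", 1),
    ("ldr r2, [sp, #4]", 3), ("ldr r3, [sp]", 3),
    ("mov r4, r0", 1), ("mov r1, r4", 1), ("mov r0, #66", 1),
]


def summarize(listing):
    """Return (instruction count, total clocks) for one listing."""
    return len(listing), sum(clocks for _, clocks in listing)


old_insns, old_clocks = summarize(OLD_4_1)
new_insns, new_clocks = summarize(NEW_4_4)
# Relative slowdown of the new code, measured against the old code.
slowdown = (new_clocks - old_clocks) / old_clocks

print(f"old: {old_insns} insns, {old_clocks} clocks")
print(f"new: {new_insns} insns, {new_clocks} clocks")
print(f"slowdown: {slowdown:.4f}")
```
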
So the code generated by the 4.4.x compiler is almost 40% slower
((25-18)/18 = 0.3889) than the 4.1.x version, and it is also
longer. Pushing many registers is cheap because it takes 2+n clocks
to move n registers to memory, and then it is n extra clocks to copy
your n registers to the call-saved ones that you pushed. Total cost
2+2n. Storing them individually costs you 1 clock to make space on the
stack and 2n clocks to store them (a str is 2 clocks in the listing),
i.e. 1+2n on top of the push that is still needed. In addition,
when you get them to become parameters to the function calls, a reg-reg
move costs you 1 clock while a load from memory is 3. The example
function does not actually return, but if it did, the old compiler
would lose some of its advantage. The old compiler would finish the
function with

  pop {r4,r5,r6,r7,pc}    (9 clocks, final: 13 insns, 27 clocks)

and the new compiler's version would be

  add sp,sp,#12           (1 clock)
  pop {r4,pc}             (6 clocks, final: 15 insns, 32 clocks)

Even then the old compiler would still beat the new one both in size
and speed.
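
As a cross-check of the comparison, here is a rough marginal-cost model
in Python. It is only a sketch under the assumptions used in this
message: mov = 1, str = 2 and ldr = 3 clocks, 1 clock to adjust sp, and
1 clock for each additional register in a push or pop list (inferred
from the 7 vs 4 and 9 vs 6 figures). The function names are hypothetical.
It counts only the extra clocks needed to keep n arguments alive across
one call, so the shared instructions cancel out.

```python
# Per-instruction clocks as assumed in the listings (ARM7, 0 wait states).
MOV, STR, LDR = 1, 2, 3
SP_ADJUST = 1        # sub sp / add sp
PER_LISTED_REG = 1   # each additional register in a push or pop list


def extra_clocks_in_registers(n, returns=False):
    """Extra clocks to keep n arguments in call-saved registers."""
    clocks = n * PER_LISTED_REG   # n more registers in the push list
    clocks += n * MOV             # copy arguments into call-saved registers
    clocks += n * MOV             # copy them back before the call
    if returns:
        clocks += n * PER_LISTED_REG  # n more registers in the final pop
    return clocks


def extra_clocks_on_stack(n, returns=False):
    """Extra clocks to keep n arguments in stack slots instead."""
    clocks = SP_ADJUST            # sub sp to make room
    clocks += n * STR             # store each argument
    clocks += n * LDR             # reload each argument before the call
    if returns:
        clocks += SP_ADJUST       # add sp before the final pop
    return clocks


# In the example three arguments (r1, r2, r3) are treated differently.
for returns in (False, True):
    regs = extra_clocks_in_registers(3, returns)
    stack = extra_clocks_on_stack(3, returns)
    print(f"returns={returns}: registers {regs}, stack {stack}, "
          f"difference {stack - regs}")
```

With n = 3 the difference comes out as 7 clocks without a return and
5 clocks with one, which agrees with 25-18 and 32-27 above.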
Zoltan