This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
- From: "rguenth at gcc dot gnu dot org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 25 Apr 2010 20:03:20 -0000
- Subject: [Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
- References: <bug-43884-6649@http.gcc.gnu.org/bugzilla/>
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
------- Comment #3 from rguenth at gcc dot gnu dot org 2010-04-25 20:03 -------
Well, the innermost loop with current trunk is
.L3:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        movl    %eax, (%esp)
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L3
which is pretty much optimal. The Intel compiler doesn't detect the
tail-recursion (huh) but has multiple entry points into the function
and uses register passing conventions for the recursions.
With -fwhole-program GCC does the same (or with static fib), and we
then end up with a program faster than what ICC produces (16s).
A 4.3 compiled version is indeed a bit faster (as fast as 4.4 on i?86, 15.4s).
A 4.1 compiled version is even faster (14.1s), the 3.4 baseline is 21.5s.
That's on i?86-linux, all -O2.
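For readers following the listings, here is a rough C sketch of what the loop at the top of this comment computes. The original test source is not quoted in this message, so the plain recursive definition below (fib_rec) is an assumption, and both function names are made up for illustration. fib_loop mirrors the generated code: the fib(n-2) call becomes the loop (subl $2, %ebx), only fib(n-1) remains a real call, and the results are accumulated (addl %eax, %esi) before the final +1 (leal 1(%esi), %eax).

```c
/* Assumed original: the textbook doubly recursive definition.
   The name fib_rec is hypothetical; the bug report's source is not shown here. */
int fib_rec(int n)
{
    if (n <= 2)
        return 1;
    return fib_rec(n - 1) + fib_rec(n - 2);
}

/* Shape of the code GCC emits: the second recursive call is turned
   into iteration, the first one stays a call. */
int fib_loop(int n)
{
    int sum = 0;            /* xorl %esi, %esi */

    if (n <= 2)             /* cmpl $2, %ebx ; jle */
        return 1;
    do {
        sum += fib_loop(n - 1);   /* leal -1(%ebx), %eax ; call fib ; addl */
        n -= 2;                   /* subl $2, %ebx */
    } while (n > 2);              /* cmpl $2, %ebx ; jg */
    return sum + 1;               /* leal 1(%esi), %eax */
}
```

Both functions return the same values for every n, which is why the loop can be considered close to optimal for this source.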
4.1 assembly, fib is not inlined:
fib:
        pushl   %esi
        pushl   %ebx
        movl    %eax, %ebx
        cmpl    $2, %ebx
        movl    $1, %eax
        jle     .L5
        xorl    %esi, %esi
        .p2align 4,,7
.L6:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L6
        leal    1(%esi), %eax
.L5:
        popl    %ebx
        popl    %esi
        ret
trunk assembler:
fib:
        pushl   %esi
        pushl   %ebx
        movl    %eax, %ebx
        subl    $4, %esp
        cmpl    $2, %ebx
        movl    $1, %eax
        jle     .L2
        xorl    %esi, %esi
        .p2align 4,,7
        .p2align 3
.L3:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L3
        leal    1(%esi), %eax
.L2:
        addl    $4, %esp
        popl    %ebx
        popl    %esi
        ret
where the only differences are the loop alignment and keeping the
stack 16-byte aligned. Indeed we get the same speed as 4.1 when
building with -mpreferred-stack-boundary=2. Why do we bother to
keep the stack aligned for leaf functions?
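The extra subl $4, %esp / addl $4, %esp pair in the trunk listing can be accounted for with a little arithmetic. The sketch below is only an illustration under stated assumptions: the stack is taken to be aligned to the preferred boundary just before each call instruction, and the helper names are hypothetical. -mpreferred-stack-boundary=N requests 2^N-byte alignment, so the default of 4 means 16 bytes and 2 means 4 bytes.

```c
#include <stddef.h>

/* Padding needed so that `used` bytes of stack end on a multiple of
   `boundary` bytes (boundary is assumed to be a power of two). */
size_t stack_padding(size_t used, size_t boundary)
{
    return (boundary - used % boundary) % boundary;
}

/* Bytes fib occupies before its own call on i386:
   4-byte return address plus pushl %esi and pushl %ebx. */
size_t fib_frame_bytes(void)
{
    return 4 + 2 * 4;
}
```

With a 16-byte boundary, stack_padding(fib_frame_bytes(), 16) is 4, matching the subl $4, %esp above. With a 4-byte boundary (-mpreferred-stack-boundary=2) the padding is 0, which gives back the 4.1 prologue.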
--
rguenth at gcc dot gnu dot org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl at gcc dot gnu dot org,
                   |                            |hubicka at gcc dot gnu dot
                   |                            |org
          Component|c++                         |target
 GCC target triplet|                            |i?86-*-*
           Keywords|                            |missed-optimization
      Known to work|                            |4.1.3
            Summary|[4.4/4.5 Regression]        |[4.4/4.5/4.6 Regression]
                   |Performance degradation for |Performance degradation for
                   |simple fibonacci numbers    |simple fibonacci numbers
                   |calculation                 |calculation due to extra
                   |                            |stack alignment
   Target Milestone|---                         |4.4.4
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884