This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug rtl-optimization/70408] New: reusing the same call-preserved register would give smaller code in some cases
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 25 Mar 2016 07:35:11 +0000
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408
Bug ID: 70408
Summary: reusing the same call-preserved register would give
smaller code in some cases
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
int foo(int); // not inlineable
int bar(int a) {
return foo(a+2) + 5 * foo (a);
}
gcc, clang, and icc all generate larger code than necessary for x86. gcc uses
two call-preserved registers to save `a` and `foo(a+2)`. Besides the extra
push/pop, keeping the stack 16-byte aligned at the calls then requires a
sub/add rsp,8 pair.
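The alignment cost can be checked with a little arithmetic. This is a minimal sketch, assuming the x86-64 System V convention that %rsp is 16-byte aligned immediately before a CALL (so it is 8 bytes off on function entry, after the return address is pushed). The helper name `call_padding` is hypothetical and only for illustration.

```c
/* Bytes of padding needed so that %rsp is 16-byte aligned at a call site.
   Assumption: x86-64 System V, where %rsp is 16-byte aligned just before
   the caller's CALL, so on entry it is offset by the 8-byte return address. */
int call_padding(int pushes)
{
    int offset = 8 + 8 * pushes;      /* return address + pushed registers */
    return (16 - offset % 16) % 16;   /* bytes to subtract from %rsp */
}
```

With two pushed registers the result is 8, which matches the subq/addq pair in gcc's output; with a single pushed register it is 0, so no adjustment is needed.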
Combining data-movement with arithmetic wherever possible is also a win (using
lea), but gcc also misses out on that.
# gcc6 snapshot 20160221 on godbolt (with -O3): http://goo.gl/dN5OXD
pushq %rbp
pushq %rbx
movl %edi, %ebx
leal 2(%rdi), %edi # why lea instead of add rdi,2?
subq $8, %rsp
call foo # foo(a+2)
movl %ebx, %edi
movl %eax, %ebp
call foo # foo(a)
addq $8, %rsp
leal (%rax,%rax,4), %eax
popq %rbx
addl %ebp, %eax
popq %rbp
ret
clang 3.8 makes essentially the same code (but wastes an extra mov because it
doesn't produce the result in %eax).
By hand, the best I can come up with is:
push %rbx
lea 2(%rdi), %ebx # stash ebx=a+2
call foo # foo(a)
mov %ebx, %edi
lea (%rax,%rax,4), %ebx # reuse ebx to stash 5*foo(a)
call foo # foo(a+2)
add %ebx, %eax
pop %rbx
ret
Note that I do the calls to foo() in the other order, which allows more folding
of MOV into LEA. The savings from that are somewhat orthogonal to the savings
from reusing the same call-preserved register.
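At the C level, the hand-written sequence corresponds roughly to the sketch below. It is an illustration only: the body of `foo` is a made-up stand-in (the `foo` in this report is an opaque external function), `bar_one_temp` and `check_same` are hypothetical names, and `__attribute__((noinline))` is a GCC/clang extension. C leaves the evaluation order of the two operands of `+` unspecified, so a compiler is allowed to make the calls in either order. Writing the source this way does not force any compiler to emit the assembly above.

```c
#include <assert.h>

/* Made-up stand-in for the report's opaque foo(); any pure function works. */
__attribute__((noinline)) int foo(int x) { return 3 * x + 1; }

/* The function from the report. */
int bar(int a) { return foo(a + 2) + 5 * foo(a); }

/* Same result, written so that only one value is live across each call. */
int bar_one_temp(int a)
{
    int saved = a + 2;       /* live across the first call */
    int r = foo(a);          /* foo(a) is called first here */
    int arg = saved;         /* argument for the second call */
    saved = 5 * r;           /* same variable now holds 5*foo(a) */
    return foo(arg) + saved;
}

/* Both versions must agree when foo() has no side effects. */
void check_same(int a) { assert(bar(a) == bar_one_temp(a)); }
```

The single variable `saved` plays the role of %ebx in the hand-written version: it first carries `a+2` across the first call, then `5*foo(a)` across the second.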
Should I open a separate bug report for the failure to optimize by reordering
the calls?
I haven't looked closely at ARM or PPC code to see whether they succeed at
combining data movement with math (probably worth testing with `foo(a) * 4`, since
x86's shift+add LEA is not widely available). I didn't mark this as an
i386/x86-64 bug, because the reuse of call-preserved registers affects all
architectures.
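A possible test case for that experiment, sketched under assumptions: `bar4` is a hypothetical name, and the `foo` body is a made-up stand-in so that the snippet compiles on its own (`__attribute__((noinline))` is a GCC/clang extension).

```c
/* Made-up stand-in for the report's opaque foo(). */
__attribute__((noinline)) int foo(int x) { return 3 * x + 1; }

/* Variant of bar() with a power-of-two multiplier, so the multiply is a
   plain shift that can be combined with the add on more architectures. */
int bar4(int a)
{
    return foo(a + 2) + 4 * foo(a);
}
```

To inspect the generated assembly for another target, replace the stand-in definition with the bare declaration `int foo(int);` so the compiler cannot see into the callee.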
I don't know whether teaching gcc either of these tricks would help real code in
many cases, or how hard it would be.