This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

[Bug rtl-optimization/70408] New: reusing the same call-preserved register would give smaller code in some cases

From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Fri, 25 Mar 2016 07:35:11 +0000
Subject: [Bug rtl-optimization/70408] New: reusing the same call-preserved register would give smaller code in some cases
Auto-submitted: auto-generated

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408

            Bug ID: 70408
           Summary: reusing the same call-preserved register would give
                    smaller code in some cases
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

int foo(int);  // not inlineable
int bar(int a) {
  return foo(a+2) + 5 * foo (a);
}

gcc (and clang and icc) all make bigger code than necessary for x86.  gcc uses
two call-preserved registers to save `a` and `foo(a+2)`.  Besides the extra
push/pop, stack alignment requires a sub/add esp,8 pair.

Combining data-movement with arithmetic wherever possible is also a win (using
lea), but gcc also misses out on that.

    # gcc6 snapshot 20160221 on godbolt (with -O3): http://goo.gl/dN5OXD
    pushq   %rbp
    pushq   %rbx
    movl    %edi, %ebx
    leal    2(%rdi), %edi      # why lea instead of add rdi,2?
    subq    $8, %rsp
    call    foo                # foo(a+2)
    movl    %ebx, %edi
    movl    %eax, %ebp
    call    foo                # foo(a)
    addq    $8, %rsp
    leal    (%rax,%rax,4), %eax
    popq    %rbx
    addl    %ebp, %eax
    popq    %rbp
    ret

clang 3.8 makes essentially the same code (but wastes an extra mov because it
doesn't produce the result in %eax).

By hand, the best I can come up with is:

    push    %rbx
    lea     2(%rdi), %ebx          # stash ebx=a+2
    call    foo                    # foo(a)
    mov     %ebx, %edi
    lea     (%rax,%rax,4), %ebx    # reuse ebx to stash 5*foo(a)
    call    foo                    # foo(a+2)
    add     %ebx, %eax
    pop     %rbx
    ret

Note that I do the calls to foo() in the other order, which allows more folding
of MOV into LEA.  The savings from that are somewhat orthogonal to the savings
from reusing the same call-preserved register.

Should I open a separate bug report for the failure to optimize by reordering
the calls?

I haven't tried to look closely at ARM or PPC code to see if they succeed at
combining data movement with math (prob. worth testing with `foo(a) * 4` since
x86's shift+add LEA is not widely available).  I didn't mark this as an
i386/x86-64 but, because the reuse of call-preserved registers affects all
architectures.


IDK if teaching gcc about either of these tricks would help with real code in
many cases, or how hard it would be.

Follow-Ups:
- [Bug c/70408] reusing the same call-preserved register would give smaller code in some cases
  - From: pinskia at gcc dot gnu.org
- [Bug c/70408] reusing the same call-preserved register would give smaller code in some cases
  - From: peter at cordes dot ca

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]