This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: GCC 3.0.3 produces large code
On Thu, Jan 31, 2002 at 04:47:23PM +0100, Ingo Krabbe wrote:
>
> I'm not sure about that topic, but I don't think that code size
> reduction is pushed too far in the development process of gcc, since
> it isn't used by too many people ?! I would be interested in the
> results of -O2 replacing -Os. The code structure optimizations of
> -O2 may also result in a reduction of code size. BTW, in my opinion
> the use of two return statements in one function is a design
> fault. It is also remarkable that the cleanest code concerning
> function design compiles into the smallest result. That's exactly
> what I want from my gcc.
In *my* opinion, GCC should generate equally good code for all three
functions, rather than registering a preference for one style or
another. Also, GCC should care more about code size than it currently
does. It's true that most people use -O2, but on modern computers
code size has direct effects on performance, since bigger code puts
more pressure on the instruction cache.
Let's look at this in a bit more depth. Here's what we get with
Nicholas' switches and the current mainline. (Warning, long lines.
The numbers in parens are size in bytes as reported by nm
--size-sort.)
_ZN1b7DoThingEv (50):       _ZN1b8DoThing2Ev (44):      _ZN1b8DoThing3Ev (44):
    pushl %esi                  pushl %esi                  pushl %esi
    pushl %ebx                  pushl %ebx                  pushl %ebx
    pushl %ebx                  pushl %edx                  pushl %ebx
    pushl %ebx                  pushl %edx                  pushl %ebx
    movl 20(%esp), %esi         movl 20(%esp), %esi         movl 20(%esp), %esi
    pushl $.LC0                 pushl $.LC0                 pushl $.LC0
    call printf                 call printf                 call printf
    popl %ecx                   popl %eax                   popl %ecx
    leal 8(%esi), %ebx          leal 8(%esi), %ebx          leal 8(%esi), %ebx
    movl %ebx, (%esp)           movl %ebx, (%esp)           movl %ebx, (%esp)
    movl $1, 4(%esp)            movl $1, 4(%esp)            xorl %ebx, %ebx
    cmpl $0, 4(%esi)            cmpl $0, 4(%esi)            movl $1, 4(%esp)
    je .L23                     jne .L43                    cmpl $0, 4(%esi)
    call rand                   xorl %ebx, %ebx             jne .L51
    pushl $.LC1             .L38:                       .L46:
    movl %eax, %ebx             pushl $.LC1                 pushl $.LC1
    call printf                 call printf                 call printf
    popl %edx                   addl $12, %esp              addl $12, %esp
    movl %ebx, %eax             movl %ebx, %eax             movl %ebx, %eax
.L21:                           popl %ebx                   popl %ebx
    popl %ebx                   popl %esi                   popl %esi
    popl %esi                   ret                         ret
    popl %ebx               .L43:                       .L51:
    popl %esi                   call rand                   call rand
    ret                         movl %eax, %ebx             movl %eax, %ebx
.L23:                           jmp .L38                    jmp .L46
    pushl $.LC1
    call printf
    popl %eax
    xorl %eax, %eax
    jmp .L21
First, you will notice that the code generated for DoThing2 and
DoThing3 is identical except for the position of one xorl instruction
(and a few scratch-register choices that don't affect size).
That's good. We ought to have hoisted the xor operation in DoThing2,
but the global optimizer isn't up to it yet.
Second, the differences between DoThing and DoThing2 are entirely
caused by branch prediction. GCC decided that in DoThing, it was more
likely for m_pa to be non-NULL, and in DoThing2, it was more likely
for it to be NULL. I am not sure why the printf operation got
duplicated in DoThing; I would have expected to see code like this:
    xorl %ebx, %ebx
    cmpl $0, 4(%esi)
    je .L23
    call rand
    movl %eax, %ebx
.L23:
    pushl $.LC1
    call printf
    <tear down stack frame and return>
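For what it's worth, when GCC's static guess about a branch is not the one you want, you can state your own expectation with __builtin_expect (a GCC extension), which influences which arm ends up on the fall-through path. The following is only a rough sketch: a_like and call_if_present are made-up names, not the code from the test case, and the body of DoOtherThing is assumed from the "call rand" in the listings.

```cpp
#include <cstdlib>

// Hypothetical stand-in for class a; the real class is not shown in this post.
struct a_like {
    int DoOtherThing() { return std::rand(); }  // assumed body
};

// Single-exit version of the test: result defaults to 0 and is
// overwritten only when the pointer is set.
int call_if_present(a_like *p)
{
    int result = 0;
    // Hint to GCC that p is expected to be non-null, so the call
    // should sit on the fall-through (likely) path.
    if (__builtin_expect(p != 0, 1))
        result = p->DoOtherThing();
    return result;
}
```

The hint only changes block layout and prediction assumptions; the function returns the same values with or without it.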
-O2 produces similar code except for the stack manipulations.
_ZN1b7DoThingEv (70):       _ZN1b8DoThing2Ev (60):      _ZN1b8DoThing3Ev (60):
    subl $28, %esp              subl $28, %esp              subl $28, %esp
    movl %esi, 24(%esp)         movl %esi, 24(%esp)         movl %esi, 24(%esp)
    movl 32(%esp), %esi         movl 32(%esp), %esi         movl 32(%esp), %esi
    movl %ebx, 20(%esp)         movl %ebx, 20(%esp)         movl %ebx, 20(%esp)
    movl $.LC0, (%esp)          movl $.LC0, (%esp)          movl $.LC0, (%esp)
    call printf                 call printf                 call printf
    movl $1, 12(%esp)           movl $1, 12(%esp)           movl $1, 12(%esp)
    leal 8(%esi), %ebx          leal 8(%esi), %ebx          leal 8(%esi), %ebx
    movl %ebx, 8(%esp)          movl %ebx, 8(%esp)          movl %ebx, 8(%esp)
    movl 4(%esi), %edx          movl 4(%esi), %ecx          xorl %ebx, %ebx
    testl %edx, %edx            testl %ecx, %ecx            movl 4(%esi), %esi
    je .L23                     jne .L43                    testl %esi, %esi
    call rand                   xorl %ebx, %ebx             jne .L51
    movl $.LC1, (%esp)      .L38:                       .L46:
    movl %eax, %ebx             movl $.LC1, (%esp)          movl $.LC1, (%esp)
    call printf                 call printf                 call printf
    movl %ebx, %eax             movl 24(%esp), %esi         movl 24(%esp), %esi
.L21:                           movl %ebx, %eax             movl %ebx, %eax
    movl 20(%esp), %ebx         movl 20(%esp), %ebx         movl 20(%esp), %ebx
    movl 24(%esp), %esi         addl $28, %esp              addl $28, %esp
    addl $28, %esp              ret                         ret
    ret                     .L43:                       .L51:
.L23:                           call rand                   call rand
    movl $.LC1, (%esp)          movl %eax, %ebx             movl %eax, %ebx
    call printf                 jmp .L38                    jmp .L46
    xorl %eax, %eax
    jmp .L21
This code _looks_ smaller, but it produces bigger object code. That's
probably because the instructions being used take more bytes in
machine language: an %esp-relative movl is several bytes long, whereas
pushl or popl of a register is a single byte. Other than that, it's
the same thing.
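To put rough numbers on the encoding difference, here is a small sketch that writes out the IA-32 machine code for just the save and restore of %esi and %ebx in each style. The byte values are hand-assembled from the instruction format, not copied from the object files discussed here, and the array and function names are invented for the illustration.

```cpp
#include <cstddef>

// -Os style: one-byte push/pop of each callee-saved register.
static const unsigned char os_save_restore[] = {
    0x56,                    // pushl %esi
    0x53,                    // pushl %ebx
    0x5b,                    // popl  %ebx
    0x5e,                    // popl  %esi
};

// -O2 style: opcode + ModRM + SIB + 8-bit displacement per instruction.
static const unsigned char o2_save_restore[] = {
    0x89, 0x74, 0x24, 0x18,  // movl %esi, 24(%esp)
    0x89, 0x5c, 0x24, 0x14,  // movl %ebx, 20(%esp)
    0x8b, 0x5c, 0x24, 0x14,  // movl 20(%esp), %ebx
    0x8b, 0x74, 0x24, 0x18,  // movl 24(%esp), %esi
};

// Size in bytes of each save/restore sequence.
std::size_t os_bytes() { return sizeof os_save_restore; }
std::size_t o2_bytes() { return sizeof o2_save_restore; }
```

That is 4 bytes against 16 for the register traffic alone. The remaining stack adjustments (subl/addl against the extra pushes) are left out of this comparison.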
Okay, so why does the Visual C++ compiler do so much better on this?
Well, let's look at (part of) the source code...
struct b
{
    b() { m_pa = new a; }
    ~b() { delete m_pa; }
    virtual int DoThing()
    {
        AutoRWL Lock(&m_RWL, 1);
        if( m_pa )
            return m_pa->DoOtherThing();
        else
            return 0;
    }
};
You can see that gcc has inlined the calls to AutoRWL's constructor
and destructor, and to a::DoOtherThing. Now suppose we were going to
write assembly language for DoThing by hand. The first thing we'd
probably notice is that DoThing can only be called on a validly
constructed object of class b, which means that m_pa cannot possibly
be NULL, and therefore we could throw away the else branch entirely:
_ZN1b7DoThingEv:
    pushl %esi
    pushl %ebx
    pushl %ebx
    pushl %ebx
    movl 20(%esp), %esi
    pushl $.LC0
    call printf
    popl %ecx
    leal 8(%esi), %ebx
    movl %ebx, (%esp)
    movl $1, 4(%esp)
    call rand
    pushl $.LC1
    movl %eax, %ebx
    call printf
    popl %edx
    movl %ebx, %eax
    popl %ebx
    popl %esi
    popl %ebx
    popl %esi
    ret
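At the source level, the null-check argument amounts to something like the sketch below. It is only an illustration under stated assumptions: b_sketch is a made-up name, the body of a::DoOtherThing is guessed from the "call rand" in the listings, the AutoRWL lock is left out to keep the point visible, and it takes for granted that nothing else in the program ever stores a null pointer into m_pa, which is exactly what a compiler would need whole-class or whole-program knowledge to prove.

```cpp
#include <cstdlib>

// Assumed body; the real a::DoOtherThing is not shown in this post.
struct a {
    int DoOtherThing() { return std::rand(); }
};

struct b_sketch {
    a *m_pa;

    // Plain new either returns a valid pointer or throws, so a fully
    // constructed b_sketch always has m_pa != 0.
    b_sketch() { m_pa = new a; }
    ~b_sketch() { delete m_pa; }

    // Copying would lead to a double delete, so forbid it in this sketch.
    b_sketch(const b_sketch &) = delete;
    b_sketch &operator=(const b_sketch &) = delete;

    // With the invariant above, the "else return 0" arm is dead code.
    int DoThing() { return m_pa->DoOtherThing(); }
};
```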
We would then notice that there is no point in doing a complete
construction of the AutoRWL object, since that data is never used
again. That in turn means we don't ever use the this pointer.
Furthermore, the program cannot tell whether the second printf call
occurs before or after the call to rand, and if we swap them we can
use a sibling call. Finally, we've managed to eliminate all need for
a stack frame.
_ZN1b7DoThingEv:
    pushl $.LC0
    call printf
    popl %eax
    pushl $.LC1
    call printf
    popl %eax
    jmp rand
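Written back as C++, that hand-optimized routine corresponds roughly to the function below. This is a sketch, not anything a compiler produced: DoThing_by_hand is an invented name, and the two strings are placeholders, since the contents of .LC0 and .LC1 are not shown in this post.

```cpp
#include <cstdio>
#include <cstdlib>

// Source-level picture of the final assembly: no this pointer, no lock
// object, no locals that live across a call.
int DoThing_by_hand()
{
    std::printf("lock taken\n");     // placeholder for .LC0
    std::printf("lock released\n");  // placeholder for .LC1, moved ahead of rand
    // The call is in tail position, so it can become "jmp rand".
    return std::rand();
}
```

Moving the second printf ahead of rand is only safe because rand produces no output of its own, so the program's visible behavior is unchanged.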
You didn't post the code generated by Visual C++, but I bet it's
capable of one or more of those optimizations. GCC has basically no
framework for whole-program analysis, but we're working on it.
zw