This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: GCC 3.0.3 produces large code

From: Zack Weinberg <zack at codesourcery dot com>
To: Ingo Krabbe <i dot krabbe at dokom dot net>
Cc: Nicholas Adrian Vinen <hb at x256 dot org>, gcc at gcc dot gnu dot org
Date: Thu, 31 Jan 2002 08:56:03 -0800
Subject: Re: GCC 3.0.3 produces large code
References: <Pine.LNX.4.33.0201310543010.20842-100000@pandora.x256.com> <E16WJSi-0006V1-00@mail.dokom.net>

On Thu, Jan 31, 2002 at 04:47:23PM +0100, Ingo Krabbe wrote:
>
> I'm not sure about that topic, but I don't think that code size
> reduction is pushed too far in the development process of gcc, since
> it isn't used by too many people ?! I would be interested in the
> results of -O2 replacing -Os. The code structure optimizations of
> -O2 may also result in a reduction of code size.  BTW, in my opinion
> the use of two return statements in one function is a design
> fault. It is also remarkable that the cleanest code concerning
> function design compiles into the smallest result. That's exactly
> what I want from my gcc.

In *my* opinion, GCC should generate equally good code for all three
functions, rather than registering a preference for one style or
another.  Also, GCC should care more about code size than it currently
does.  It's true that most people use -O2, but on modern computers
code size has direct effects on performance (bigger code puts more
pressure on the instruction cache, for one thing).

Let's look at this in a bit more depth.  Here's what we get with
Nicholas' switches and the current mainline.  (Warning, long lines.
The numbers in parens are sizes in bytes as reported by nm
--size-sort.)

_ZN1b7DoThingEv (50):		_ZN1b8DoThing2Ev (44):		_ZN1b8DoThing3Ev (44):
	pushl	%esi			pushl	%esi			pushl	%esi
	pushl	%ebx			pushl	%ebx			pushl	%ebx
	pushl	%ebx			pushl	%edx			pushl	%ebx
	pushl	%ebx			pushl	%edx			pushl	%ebx
	movl	20(%esp), %esi		movl	20(%esp), %esi		movl	20(%esp), %esi
	pushl	$.LC0			pushl	$.LC0			pushl	$.LC0
	call	printf			call	printf			call	printf
	popl	%ecx			popl	%eax			popl	%ecx
	leal	8(%esi), %ebx		leal	8(%esi), %ebx		leal	8(%esi), %ebx
	movl	%ebx, (%esp)		movl	%ebx, (%esp)		movl	%ebx, (%esp)
	movl	$1, 4(%esp)		movl	$1, 4(%esp)		xorl	%ebx, %ebx
	cmpl	$0, 4(%esi)		cmpl	$0, 4(%esi)		movl	$1, 4(%esp)
	je	.L23			jne	.L43			cmpl	$0, 4(%esi)
	call	rand			xorl	%ebx, %ebx		jne	.L51
	pushl	$.LC1		.L38:				.L46:
	movl	%eax, %ebx		pushl	$.LC1			pushl	$.LC1
	call	printf			call	printf			call	printf
	popl	%edx			addl	$12, %esp		addl	$12, %esp
	movl	%ebx, %eax		movl	%ebx, %eax		movl	%ebx, %eax
.L21:					popl	%ebx			popl	%ebx
	popl	%ebx			popl	%esi			popl	%esi
	popl	%esi			ret				ret
	popl	%ebx		.L43:				.L51:
	popl	%esi			call	rand			call	rand
	ret				movl	%eax, %ebx		movl	%eax, %ebx
.L23:					jmp	.L38			jmp	.L46
	pushl	$.LC1
	call	printf
	popl	%eax
	xorl	%eax, %eax
	jmp	.L21

First, you will notice that the code generated for DoThing2 and
DoThing3 is identical except for the position of one xorl instruction.
That's good.  We ought to have hoisted the xor operation in DoThing2,
but the global optimizer isn't up to it yet.
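
For readers who don't have Nicholas' original post at hand, here is a
rough guess at what the three styles look like.  This is a
reconstruction inferred from the assembly, not the code from the
thread: the bodies of DoThing2 and DoThing3, the stub class a, the
pointer-taking constructor, and the check_styles_agree helper are all
assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for the class behind m_pa.
struct a {
    int DoOtherThing() { return std::rand(); }
};

struct b {
    a* m_pa;

    // Hypothetical constructor so that a caller can pass a null pointer.
    explicit b(a* pa) : m_pa(pa) {}

    // Style 1: two return statements.
    int DoThing() {
        if (m_pa)
            return m_pa->DoOtherThing();
        else
            return 0;
    }

    // Style 2 (assumed shape): one return, result assigned on both paths.
    int DoThing2() {
        int result;
        if (m_pa)
            result = m_pa->DoOtherThing();
        else
            result = 0;
        return result;
    }

    // Style 3 (assumed shape): one return, result zeroed up front.
    int DoThing3() {
        int result = 0;
        if (m_pa)
            result = m_pa->DoOtherThing();
        return result;
    }
};

// Runs all three styles from the same rand() seed and checks they agree.
void check_styles_agree(b& obj, unsigned seed) {
    std::srand(seed); int r1 = obj.DoThing();
    std::srand(seed); int r2 = obj.DoThing2();
    std::srand(seed); int r3 = obj.DoThing3();
    assert(r1 == r2 && r2 == r3);
}
```

All three compute the same thing, which is why one would hope for the
same object code from each of them.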

Second, the differences between DoThing and DoThing2 are entirely
caused by GCC's static branch prediction.  GCC decided that in
DoThing, it was more likely for m_pa to be non-NULL, and in DoThing2,
it was more likely for it to be NULL.  I am not sure why the printf
operation got duplicated in DoThing; I would have expected to see
code like this:

	xorl	%ebx, %ebx
	cmpl	$0, 4(%esi)
	je	.L23
	call	rand
	movl	%eax, %ebx
.L23:
	pushl	$.LC1
	call	printf
	<tear down stack frame and return>

-O2 produces similar code except for the stack manipulations.

_ZN1b7DoThingEv (70):		_ZN1b8DoThing2Ev (60):		_ZN1b8DoThing3Ev (60):
	subl	$28, %esp		subl	$28, %esp		subl	$28, %esp
	movl	%esi, 24(%esp)		movl	%esi, 24(%esp)		movl	%esi, 24(%esp)
	movl	32(%esp), %esi		movl	32(%esp), %esi		movl	32(%esp), %esi
	movl	%ebx, 20(%esp)		movl	%ebx, 20(%esp)		movl	%ebx, 20(%esp)
	movl	$.LC0, (%esp)		movl	$.LC0, (%esp)		movl	$.LC0, (%esp)
	call	printf			call	printf			call	printf
	movl	$1, 12(%esp)		movl	$1, 12(%esp)		movl	$1, 12(%esp)
	leal	8(%esi), %ebx		leal	8(%esi), %ebx		leal	8(%esi), %ebx
	movl	%ebx, 8(%esp)		movl	%ebx, 8(%esp)		movl	%ebx, 8(%esp)
	movl	4(%esi), %edx		movl	4(%esi), %ecx		xorl	%ebx, %ebx
	testl	%edx, %edx		testl	%ecx, %ecx		movl	4(%esi), %esi
	je	.L23			jne	.L43			testl	%esi, %esi
	call	rand			xorl	%ebx, %ebx		jne	.L51
	movl	$.LC1, (%esp)	.L38:				.L46:
	movl	%eax, %ebx		movl	$.LC1, (%esp)		movl	$.LC1, (%esp)
	call	printf			call	printf			call	printf
	movl	%ebx, %eax		movl	24(%esp), %esi		movl	24(%esp), %esi
.L21:					movl	%ebx, %eax		movl	%ebx, %eax
	movl	20(%esp), %ebx		movl	20(%esp), %ebx		movl	20(%esp), %ebx
	movl	24(%esp), %esi		addl	$28, %esp		addl	$28, %esp
	addl	$28, %esp		ret				ret
	ret			.L43:				.L51:
.L23:					call	rand			call	rand
	movl	$.LC1, (%esp)		movl	%eax, %ebx		movl	%eax, %ebx
	call	printf			jmp	.L38			jmp	.L46
	xorl	%eax, %eax
	jmp	.L21

This code _looks_ smaller, but it produces bigger object code.  That's
probably because the instructions being used take more bytes in
machine language: a pushl of a register is one byte, while a movl to
an %esp-relative slot is four.  Other than that, it's the same thing.
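
To put numbers on that, the sketch below adds up the IA-32 encoding
lengths of the DoThing prologue (everything before the first call to
printf) in both listings.  The byte counts are the usual shortest
encodings for these instruction forms; the constant and function
names are made up for this illustration, and the totals cover only
the prologue, not the whole function.

```cpp
#include <cassert>

// Encoding lengths in bytes for the instruction forms used in the prologues.
constexpr int kPushReg      = 1;  // pushl %reg          : 50+r
constexpr int kPushImm32    = 5;  // pushl $sym          : 68 imm32
constexpr int kMovEspDisp8  = 4;  // movl reg, d8(%esp)  : opcode, ModRM, SIB, disp8
                                  // (same length for the load direction)
constexpr int kMovImm32Esp  = 7;  // movl $sym, (%esp)   : C7, ModRM, SIB, imm32
constexpr int kSubImm8Esp   = 3;  // subl $28, %esp      : 83 EC imm8

// -Os prologue: pushl %esi; pushl %ebx; pushl %ebx; pushl %ebx;
//               movl 20(%esp), %esi; pushl $.LC0
constexpr int kOsPrologueBytes = 4 * kPushReg + kMovEspDisp8 + kPushImm32;

// -O2 prologue: subl $28, %esp; movl %esi, 24(%esp); movl 32(%esp), %esi;
//               movl %ebx, 20(%esp); movl $.LC0, (%esp)
constexpr int kO2PrologueBytes = kSubImm8Esp + 3 * kMovEspDisp8 + kMovImm32Esp;

// Extra bytes the -O2 prologue costs over the -Os one.
int extra_prologue_bytes_at_O2() {
    assert(kO2PrologueBytes >= kOsPrologueBytes);
    return kO2PrologueBytes - kOsPrologueBytes;
}
```

That comes to 13 bytes for -Os against 22 for -O2, with fewer
instructions in the larger one.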

Okay, so why does the Visual C++ compiler do so much better on this?
Well, let's look at (part of) the source code...

struct b
{
        // data members (m_pa, m_RWL) and the rest of the class omitted
        b() { m_pa = new a; }
        ~b() { delete m_pa; }
        virtual int DoThing()
        {
                AutoRWL Lock(&m_RWL, 1);
                if( m_pa )
                        return m_pa->DoOtherThing();
                else
                        return 0;
        }
};

You can see that gcc has inlined the calls to AutoRWL's constructor
and destructor, and to a::DoOtherThing.  Now suppose we were going to
write assembly language for DoThing by hand.  The first thing we'd
probably notice is that DoThing can only be called on a validly
constructed object of class b, which means that m_pa cannot possibly
be NULL (the constructor sets it from new, which throws on failure
rather than returning NULL, and nothing shown here changes it
afterward), and therefore we could throw away the else branch
entirely:

_ZN1b7DoThingEv:
	pushl	%esi
	pushl	%ebx
	pushl	%ebx
	pushl	%ebx
	movl	20(%esp), %esi
	pushl	$.LC0
	call	printf
	popl	%ecx
	leal	8(%esi), %ebx
	movl	%ebx, (%esp)
	movl	$1, 4(%esp)
	call	rand
	pushl	$.LC1
	movl	%eax, %ebx
	call	printf
	popl	%edx
	movl	%ebx, %eax
	popl	%ebx
	popl	%esi
	popl	%ebx
	popl	%esi
	ret

We would then notice that there is no point in doing a complete
construction of the AutoRWL object, since that data is never used
again.  That in turn means we don't ever use the this pointer.
Furthermore, the program cannot tell whether the second printf call
occurs before or after the call to rand, and if we swap them we can
use a sibling call.  Finally, we've managed to eliminate all need for
a stack frame.

_ZN1b7DoThingEv:
	pushl	$.LC0
	call	printf
	popl	%eax
	pushl	$.LC1
	call	printf
	popl	%eax
	jmp	rand
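
At the source level, that hand-written version amounts to something
like the sketch below.  The stub classes are guesses inferred from
the assembly (a constructor and destructor that each call printf, and
a DoOtherThing that calls rand); the message strings, the type of
m_RWL, the member names inside AutoRWL, and the name DoThingByHand
are all placeholders, and DoThing is left non-virtual to keep the
sketch short.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-ins for the classes used by b.
struct a {
    int DoOtherThing() { return std::rand(); }
};

struct AutoRWL {
    void* m_lock;   // placeholder member names
    int   m_mode;
    AutoRWL(void* lock, int mode) : m_lock(lock), m_mode(mode) {
        std::printf("lock\n");      // stands in for .LC0
    }
    ~AutoRWL() { std::printf("unlock\n"); }  // stands in for .LC1
};

struct b {
    a*  m_pa;
    int m_RWL;      // placeholder type for the lock object

    b() : m_pa(new a), m_RWL(0) {}
    ~b() { delete m_pa; }

    // The function as written in the post.
    int DoThing() {
        AutoRWL Lock(&m_RWL, 1);
        if (m_pa)
            return m_pa->DoOtherThing();
        else
            return 0;
    }
};

// What the seven-instruction version does: no null check, no Lock
// object, no use of `this`, and rand() moved last so that it can be
// reached with a jump instead of a call.
int DoThingByHand() {
    std::printf("lock\n");
    std::printf("unlock\n");
    return std::rand();
}
```

Seeding rand identically before each call gives the same return value
and the same output from both versions, which is the sense in which
the program cannot tell them apart.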

You didn't post the code generated by Visual C++, but I bet it's
capable of one or more of those optimizations.  GCC has basically no
framework for whole-program analysis, but we're working on it.

zw

Follow-Ups:
- Re: GCC 3.0.3 produces large code
  - From: Nicholas Adrian Vinen

References:
- GCC 3.0.3 produces large code
  - From: Nicholas Adrian Vinen
- Re: GCC 3.0.3 produces large code
  - From: Ingo Krabbe
