This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: gcc will become the best optimizing x86 compiler

From: Zoltán Kócsi <zoltan at bendor dot com dot au>
To: gcc at gcc dot gnu dot org
Date: Thu, 24 Jul 2008 20:27:54 +1000
Subject: Re: gcc will become the best optimizing x86 compiler
References: <2E073B3ABB3F664DBA1D1C4D5FB47EF40EBDAD8E@NT-IRVA-0752.brcm.ad.broadcom.com> <4887592E.4040804@agner.org>

> [...]
> I have made a few optimized functions myself and published them as a 
> multi-platform library (www.agner.org/optimize/asmlib.zip). It is
> faster than most other libraries on an Intel Core2 and up to ten
> times faster than gcc using builtin functions. My library is
> published with GPL license, but I will allow you to use my code in
> gnu libc if you wish (Sorry, I don't have the time to work on the gnu
> project myself, but you may contact me for details about the code).
> [...]

But then it is not gcc that is the best optimising compiler; rather, it
is the best library, *hand optimised so that gcc compiles it very well*.

Here's an example:

void foo( void )
{
unsigned x;

    for ( x = 0 ; x < 200 ; x++ ) func();
}

void bar( void )
{
unsigned x;

    for ( x = 201 ; --x ; ) func();
}

foo() and bar() are completely equivalent: each calls func() 200
times and that's all. Yet if you compile them with -O3 for the arm-elf
target with version 4.0.2 (yes, I know, it's an ancient version, but
still), bar() will be 6 insns long with the loop itself being 3, while
foo() compiles to 7 insns, of which 4 are the loop. In fact, the compiler
is clever enough to transform bar()'s loop from

    for ( x = 201 ; --x ; ) func();
to
    x = 200; do func(); while ( --x );

internally. The latter form is shorter to evaluate, and since x is
not used other than as the loop counter, the change doesn't matter.
However, gcc is not clever enough to figure out that foo()'s loop is
doing exactly what bar()'s is doing. Since x is only the loop counter,
gcc could freely transform foo()'s loop to bar()'s, but it doesn't. It
generates the equivalent of this:

    x = 0; do { x += 1; func(); } while ( x != 200 );

which is not as efficient as what it generates from bar()'s code.

Of course, you are in for a surprise when you change -O3 to -Os, in
which case gcc suddenly realises that foo() can indeed be transformed to
the internal representation that it used for bar() with -O3. Thus, we
have foo() now being only 6 insns long with a 3 insn loop. Unfortunately,
bar() is not that lucky. Although its loop remains 3 insns long, the
entire function grows by one instruction, for bar() internally now looks
like this:

   x = 201;
   goto label;
   do {
      func();
label: ;
   } while ( --x );

You can play with gcc and see which of the equivalent C
constructs it compiles to better code at any particular -O level
(and if you have to work with severely constrained embedded systems,
you often do), but hand-crafting your C code to fit gcc's taste is
actually not that good an idea. With the next release, when different
constructs are recognised, you may end up with larger and/or slower
code (as happened to me when changing 4.0.x -> 4.3.x, and before that
when going from 2.9.x to 3.1.x).

Gcc will be the best optimising compiler when it generates
faster/shorter code than the other compilers on the majority of
a large set of arbitrary, *not* hand-optimised sources. Preferably
for most targets, not only for the x86, if possible :-)

Zoltan

References:
- Is cross-section inlining valid behaviour?
  - From: Bingfeng Mei
- gcc will become the best optimizing x86 compiler
  - From: Agner Fog
