This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: gcc300 benchmarks slower than gcc295.3?
- To: tm at kloo dot net, gcc at gcc dot gnu dot org
- Subject: Re: gcc300 benchmarks slower than gcc295.3?
- From: Linus Torvalds <torvalds at transmeta dot com>
- Date: Wed, 18 Jul 2001 09:09:14 -0700
- Newsgroups: linux.egcs
- References: <Pine.LNX.4.21.0107171847130.20101-100000@ab-initio.mit.edu>
In article <3B55596B.F8AEDE6@kloo.net> you write:
>
>I think I understand what's happening.
>
>In gcc 3.0, the 'addl' instruction is being forced to use register
>operands.
That would be stupid if true.
> In this case, it causes two extra spills/restores to
>be generated, which would be fine except as a side effect this
>causes another two extra spills/restores to be generated.
It wouldn't be fine. Modern CPUs will do better "cracking" of the addl
than the compiler can ever do. These days the only reason to ever _not_
use a memory operand is if that same memory operand is re-used for other
things, and could be profitably cached in a register for multiple uses.
>On the x86, forcing the destination of an arithmetic instruction
>into a register is a huge win because it avoids the read/modify/write
>pipeline interlock.
It's not a win at all. It's a loss on newer CPUs, and even on older
CPUs the added register pressure it implies probably forces it to be a
wash.
I would strongly discourage the notion that x86 performs "better" at
doing RISC-like operations. It was somewhat true for the Pentium and
the i486, but even then mainly on small test-cases that didn't show any
of the downsides (bigger code size resulting in icache pressure, and
more register pressure resulting in worse code for a compiler that is
already bad at handling register pressure).
There's no read-modify-write interlock on any modern CPU (i.e., PPro and later) that
does out-of-order execution with the instructions already split up into
simpler parts. So using the memory operand form gives you:
- better decoding throughput (fewer instructions)
- better code density
- less register pressure
with basically no downsides. This is true both for source and
destination: write a small benchmark if you like.
If the value gets re-used, you obviously want to keep it in a register.
But even then you have to worry about the register pressure thing. But
you should NEVER EVER just force non-memory operands just because you
think they are faster.
Anyway, I actually wrote the benchmark: load+add and add-from-memory are
exactly the same speed in the absence of cache effects and register
pressure on the PPro and P4 I have access to. VERY simplistic silly
loop, but you get the idea.
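A minimal C stand-in for that kind of simplistic loop might look like the sketch below. The original benchmark isn't shown in the mail, so this is an assumption about its shape; note also that the compiler, not the source, picks the operand form — inspect the generated code with `gcc -S` to see which one you got.

```c
#include <time.h>

/* Repeatedly add a value that lives in memory. The volatile qualifier
 * keeps the load inside the loop instead of being hoisted out, so each
 * iteration really does a memory-sourced add (either add-from-memory
 * or load+add, at the compiler's discretion). */
long sum_loop(volatile int *p, long iters)
{
    long acc = 0;
    for (long i = 0; i < iters; i++)
        acc += *p;
    return acc;
}

/* Time one run in seconds; the sum is returned through *out so the
 * work can't be optimized away. */
double time_sum_loop(volatile int *p, long iters, long *out)
{
    clock_t t0 = clock();
    *out = sum_loop(p, iters);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Comparing two variants of the loop body (one constrained to register operands via inline asm, one not) against the wall clock is how a test like Linus's would distinguish the forms.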
For load-add-store vs add-to-memory, the expanded version is 1% faster
on a PPro, but 8% slower on a P4 in my simplistic tests.
And remember: even that 1% (assuming you want to optimize for PPro
rather than the newest) has to overcome the fact that the single 4-byte
instruction got expanded into 3 instructions totalling 9 bytes (yeah,
this will depend on register allocation and what kind of memory access
it is, of course, but you get the idea). That's a code-size expansion of
more than a factor of two.
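One concrete encoding that reproduces those 4-byte vs 9-byte figures is an add-immediate to a stack slot. The `8(%ebp)` operand is an illustrative choice, not taken from the mail:

```c
/* AT&T syntax; byte counts from the ia32 encodings:
 *
 *   add-to-memory form:
 *     addl $1, 8(%ebp)      ; 83 45 08 01        = 4 bytes
 *
 *   load-add-store form:
 *     movl 8(%ebp), %eax    ; 8B 45 08
 *     addl $1, %eax         ; 83 C0 01
 *     movl %eax, 8(%ebp)    ; 89 45 08           = 9 bytes
 */
void increment(int *p)
{
    (*p)++;   /* gcc can emit this as a single add-to-memory instruction */
}
```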
Now, on a Pentium you might get an added benefit from trying to hide
memory access latencies etc. Who knows. I don't have old machines to
even benchmark on any more. But fairly clearly it is a loss to not
accept memory arguments in general.
Linus