This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: gcc300 benchmarks slower than gcc295.3?
- To: tm at kloo dot net, gcc at gcc dot gnu dot org
- Subject: Re: gcc300 benchmarks slower than gcc295.3?
- From: Linus Torvalds <torvalds at transmeta dot com>
- Date: Wed, 18 Jul 2001 09:09:14 -0700
- Newsgroups: linux.egcs
- References: <Pine.LNX.4.21.0107171847130.20101-100000@ab-initio.mit.edu>
In article <3B55596B.F8AEDE6@kloo.net> you write:
>
>I think I understand what's happening.
>
>In gcc 3.0, the 'addl' instruction is being forced to use register
>operands.
That would be stupid if true.
> In this case, it causes two extra spills/restores to
>be generated, which would be fine except as a side effect this
>causes another two extra spills/restores to be generated.
It wouldn't be fine. Modern CPUs will do better "cracking" of the addl
than the compiler can ever do. These days the only reason to ever _not_
use a memory operand is if that same memory operand is re-used for other
things, and could be profitably cached in a register for multiple uses.
>On the x86, forcing the destination of an arithmetic instruction
>into a register is a huge win because it avoids the read/modify/write
>pipeline interlock.
It's not a win at all. It's a loss on newer CPUs, and even on older
CPUs the added register pressure it implies probably forces it to be a
wash.
I would strongly discourage the notion that x86 performs "better" at
doing RISC-like operations. It was somewhat true for the Pentium and
the i486, but even then mainly on small test-cases that didn't show any
of the downsides (bigger code size resulting in icache pressure, and
more register pressure resulting in worse code for a compiler that is
already bad at handling register pressure).
There's no read-modify-write interlock on any modern CPU (i.e., PPro and later) that
does out-of-order execution with the instructions already split up into
simpler parts. So using the memory operand form gives you:
- better decoding throughput (fewer instructions)
- better code density
- less register pressure
with basically no downsides. This is true both for source and
destination: write a small benchmark if you like.
If the value gets re-used, you obviously want to keep it in a register.
But even then you have to worry about the register pressure thing. But
you should NEVER EVER just force non-memory operands just because you
think they are faster.
Anyway, I actually wrote the benchmark: load+add and add-from-memory are
exactly the same speed in the absence of cache effects and register
pressure on the PPro and P4 I have access to. VERY simplistic silly
loop, but you get the idea.
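A minimal C stand-in for that kind of simplistic loop might look like the sketch below. The original benchmark isn't shown in the mail, so this is an assumption about its shape; note also that the compiler, not the source, picks the operand form — inspect the generated code with `gcc -S` to see which one you got.

```c
#include <time.h>

/* Repeatedly add a value that lives in memory. The volatile qualifier
 * keeps the load inside the loop instead of being hoisted out, so each
 * iteration really does a memory-sourced add (either add-from-memory
 * or load+add, at the compiler's discretion). */
long sum_loop(volatile int *p, long iters)
{
    long acc = 0;
    for (long i = 0; i < iters; i++)
        acc += *p;
    return acc;
}

/* Time one run in seconds; the sum is returned through *out so the
 * work can't be optimized away. */
double time_sum_loop(volatile int *p, long iters, long *out)
{
    clock_t t0 = clock();
    *out = sum_loop(p, iters);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Comparing two variants of the loop body (one constrained to register operands via inline asm, one not) against the wall clock is how a test like Linus's would distinguish the forms.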
For load-add-store vs add-to-memory, the expanded version is 1% faster
on a PPro, but 8% slower on a P4 in my simplistic tests.
And remember: even that 1% (assuming you want to optimize for PPro
rather than the newest) has to overcome the fact that the single 4-byte
instruction got expanded into 3 instructions totalling 9 bytes (yeah,
this will depend on register allocation and what kind of memory access
it is, of course, but you get the idea). That's a code-size expansion of
more than a factor of two.
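One concrete encoding that reproduces those 4-byte vs 9-byte figures is an add-immediate to a stack slot. The `8(%ebp)` operand is an illustrative choice, not taken from the mail:

```c
/* AT&T syntax; byte counts from the ia32 encodings:
 *
 *   add-to-memory form:
 *     addl $1, 8(%ebp)      ; 83 45 08 01        = 4 bytes
 *
 *   load-add-store form:
 *     movl 8(%ebp), %eax    ; 8B 45 08
 *     addl $1, %eax         ; 83 C0 01
 *     movl %eax, 8(%ebp)    ; 89 45 08           = 9 bytes
 */
void increment(int *p)
{
    (*p)++;   /* gcc can emit this as a single add-to-memory instruction */
}
```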
Now, on a Pentium you might get an added benefit from trying to hide
memory access latencies etc. Who knows. I don't have old machines to
even benchmark on any more. But fairly clearly it is a loss to not
accept memory arguments in general.
Linus