This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: gcc300 benchmarks slower than gcc295.3?



On Thu, 19 Jul 2001, Richard Henderson wrote:
>
> The memory/register splitting that is done at present is done after
> normal register allocation, and so does not impact allocation.

Ok.

What else is done after normal register allocation? It might still make it
impossible to do other (better) tricks.

> It is done when possible, given available registers, in order to
> help P1 scheduling, and P2 dispatch.  (Even read-only memory operand
> is at least 2 uops, which means decoder 0; separate read and operate
> instructions are 1 uop each, which can use decoder 1 or 2.)
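
To make the decoder argument concrete, here is a toy model in C. It is only a sketch under stated assumptions: three decode slots per cycle, instructions taken in order, slot 0 accepting anything and slots 1 and 2 accepting only single-uop instructions. The function name `decode_cycles` and the two sample streams are made up for illustration and do not describe any real CPU cycle-accurately.

```c
#include <stddef.h>

/* Toy model of a decoder group (an assumption for illustration only):
 * each cycle up to three instructions are decoded in program order.
 * Slot 0 accepts any instruction; slots 1 and 2 accept only
 * instructions that decode to exactly one uop. */
static size_t decode_cycles(const int *uops, size_t n)
{
    size_t cycles = 0;
    size_t i = 0;

    while (i < n) {
        i++;                    /* slot 0 takes the next instruction */
        for (int slot = 1; slot <= 2 && i < n; slot++) {
            if (uops[i] != 1)
                break;          /* multi-uop: must wait for slot 0 */
            i++;
        }
        cycles++;
    }
    return cycles;
}

/* Three load-and-operate instructions, assumed to be 2 uops each. */
static const int fused_stream[] = { 2, 2, 2 };

/* The same work as separate load and operate instructions, 1 uop each. */
static const int split_stream[] = { 1, 1, 1, 1, 1, 1 };
```

In this model the three 2-uop instructions need three decode cycles, while the six 1-uop instructions need only two. The model ignores everything else (fetch bandwidth, code size, the extra register), which is exactly what the rest of this message questions.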

If I remember correctly, the K7 has more regular (but fewer) decoders,
and is often decode limited. And my P4 test definitely implies that the P4
doesn't like it - the main reason I can imagine being that it ends up
doing the address arithmetic twice when you split the read-modify-write
(r-m-w) instruction into separate operations.
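
As an illustration of what "splitting" means here, a minimal C sketch follows. The function names `bump_fused` and `bump_split` are hypothetical, and the assembly in the comments is only one plausible 32-bit x86 rendering; a compiler is free to emit either form for either function.

```c
#include <stddef.h>

/* Single read-modify-write statement. A compiler may emit one
 * instruction with a memory destination, for example:
 *     addl $1, (%eax,%ebx,4)
 * The effective address is formed once. */
static void bump_fused(int *p, size_t i)
{
    p[i] += 1;
}

/* The same update written as separate load, operate and store steps,
 * roughly corresponding to:
 *     movl (%eax,%ebx,4), %edx
 *     addl $1, %edx
 *     movl %edx, (%eax,%ebx,4)
 * The effective address appears in two instructions, and a scratch
 * register is needed. */
static void bump_split(int *p, size_t i)
{
    int tmp = p[i];     /* load: address formed here */
    tmp += 1;           /* operate on the register copy */
    p[i] = tmp;         /* store: same address formed again */
}
```

Both functions leave memory in the same state; the difference is purely in how many instructions, address computations and registers the update costs.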

> It did make a measurable difference at one point -- it's entirely
> possible that this wants disabling for Athlon and P4.

Is the difference measurable on small benchmarks that fit in the L1
icache, like dhrystone? Or is it measurable on real applications that have
to load?

If the former, the measurement never even takes the code expansion into
account. For the read-modify-write case in particular, this optimization
pretty much always makes the code at least twice as big.
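
The size claim can be checked on one example by counting encoded bytes. The sketch below uses hand-assembled 32-bit x86 encodings for a single illustrative addressing mode, `8(%eax,%ebx,4)`; the array and function names are hypothetical, and other operands or addressing modes will give different numbers.

```c
#include <stddef.h>

/* addl $1, 8(%eax,%ebx,4)
 * opcode 83 /0, ModRM 44, SIB 98, disp8 08, imm8 01 */
static const unsigned char rmw_single[] = {
    0x83, 0x44, 0x98, 0x08, 0x01,
};

/* The same update split into three instructions, using %edx as scratch. */
static const unsigned char rmw_split[] = {
    0x8b, 0x54, 0x98, 0x08,     /* movl 8(%eax,%ebx,4), %edx */
    0x83, 0xc2, 0x01,           /* addl $1, %edx             */
    0x89, 0x54, 0x98, 0x08,     /* movl %edx, 8(%eax,%ebx,4) */
};

/* Encoded size in bytes of the single-instruction form. */
static size_t single_size(void)
{
    return sizeof rmw_single;
}

/* Encoded size in bytes of the three-instruction form. */
static size_t split_size(void)
{
    return sizeof rmw_split;
}
```

For this case the single instruction is 5 bytes and the split sequence is 11 bytes, so the split form is a bit more than twice as large.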

Although I don't actually know how common this is at all. It was pointed
to as the reason for the slowdown on the FP benchmark, but if it doesn't
affect register allocation I suspect that something else was the _real_
issue. The FP slowdown seemed to be due to extra memory accesses.

			Linus

