This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: An unusual Performance approach using Synthetic registers


On Wed, 8 Jan 2003, Michael S.Zick wrote:

> I do not make any claims of this being anything other than a WAFG...
> 
> It wasn't used as a numerical measure, just "==", "<", ">" to
> determine an order among alternative code sequences.
> 
> But I used it as my guide in the past, and that is why I suggested XCHG.
> 
> Why:
> If user wanted "Best Size" I dropped the "C" term
> if user wanted "Best Speed" I dropped the "D" term
> Otherwise, just use the diagonal of a cube.
> 
> How:
> Scaled everything so it could be done with integer math.
> 
> Legend:
> B == Buss Cycles
> C == Clock Cycles
> S == Instruction Size
> D == (Instruction Size DIV D-Cache Size)
> Cost == SQRT(256*( B*B + C*C + D*D))
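
For reference, the quoted formula can be sketched in integer-only C. This is a minimal sketch, not the original poster's code: the names `cost_x4`, `isqrt_u64`, `enum cost_goal` and `DCACHE_LINE` are hypothetical, and the scaling is an assumption chosen so that no fractions are needed. With D = S / 64, the function returns 4 * Cost = sqrt(4096 * (B*B + C*C) + S*S).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the scaling below is an assumption, not the
 * original poster's implementation. */
enum cost_goal { GOAL_BALANCED, GOAL_SIZE, GOAL_SPEED };

#define DCACHE_LINE 64u   /* "D-Cache line size 64 bytes" from the post */

/* Integer square root (floor), classic bit-by-bit method. */
static uint32_t isqrt_u64(uint64_t n)
{
    uint64_t r = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four */

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= r + bit) {
            n -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)r;
}

/* Returns 4 * Cost, where Cost = SQRT(256 * (B*B + C*C + D*D)) and
 * D = S / DCACHE_LINE.  GOAL_SIZE drops the C term and GOAL_SPEED
 * drops the D term, as described in the quoted text. */
static uint32_t cost_x4(uint32_t b, uint32_t c, uint32_t s,
                        enum cost_goal goal)
{
    assert(b < 65536 && c < 65536 && s < 65536);   /* avoid overflow */

    uint64_t line2 = (uint64_t)DCACHE_LINE * DCACHE_LINE;      /* 4096 */
    uint64_t bus = (uint64_t)b * b * line2;
    uint64_t clk = (goal == GOAL_SIZE)  ? 0 : (uint64_t)c * c * line2;
    uint64_t siz = (goal == GOAL_SPEED) ? 0 : (uint64_t)s * s;

    return isqrt_u64(bus + clk + siz);
}
```

For the quoted Case 1 (B = 0, C = 5, S = 3), `cost_x4(0, 5, 3, GOAL_BALANCED) / 4` evaluates to 80, matching the table. The result is only meaningful as an ordering key, which is how the quoted text says it was used.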

This is an arbitrary and nonsensical cost metric.

GCC either optimizes for size or speed, not both.
 
> Presumes:
> 1) Write to Stack meets the "Write Before Read" requirement
> So the first stack read does not generate a buss cycle.
> 2) If temporary is required, use EAX 
> 3) If EAX not available, spill/restore with push/pop
> 4) Newer processors will never be worse than 80386
> 5) D-Cache line size 64 bytes
> 
> Notes:
> Case 1 leaves a buss write pending
> Follow with a Reg <-> Reg to hide write cycle
> 
> Case 2 the "load/store" version, needs register
> Follow with another Reg <-> Reg if available
> 
> Case 3 leaves a buss write pending
> Case 4 puts other Reg <-> Reg ops to hide buss write
> 
> PATH____________B_|_C_|_S_|__D__|__Cost
> 
> Case 1 == Cost 80
> xchg ebx, [esp+16]__0_|_5_|_3_|_0.05_|___80
> With a pending Buss Cycle so,
> Reg <-> Reg pad here

We've tried to tell you several times.

XCHG issues a BUS LOCK on the external bus for the duration of the
instruction if a memory operand is used.

Assuming an 800 MHz P3 with a 133 MHz external bus, you first need to:

1. Synchronize with the external bus

   There's a 6:1 differential between the internal clock and the external
   clock, so it can take up to 5 clock cycles to sync the buses.

2. Get a bus lock

   Takes at least one bus clock, or 6 internal clocks.

3. Execute the instruction

   Possibly one or two clocks.

4. Release the bus lock.

   Requires resyncing with the external bus again, so up to 5 clocks,
   plus one bus clock (6 internal clocks) for the release itself.

So basically, it's at least 18 clocks, and might be as bad as 23 clocks
to execute XCHG on an 800 MHz P3.
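
The tally above can be sketched as a small C function. This is a rough model under stated assumptions, not a measurement: the names `xchg_mem_clocks` and `struct clock_range` are hypothetical, synchronization is assumed to cost between 0 and (ratio - 1) core clocks, and acquiring or releasing the lock is assumed to cost exactly one bus clock. Summing the four steps as listed gives 13 to 24 core clocks for the 800/133 case, a window that brackets the 18 to 23 figure; the exact bounds depend on how the synchronization delays are counted.

```c
#include <assert.h>

/* Hypothetical helper types and names; the model is an assumption. */
struct clock_range {
    unsigned min;   /* best case, in core clocks  */
    unsigned max;   /* worst case, in core clocks */
};

/* Estimate the cost of XCHG reg, mem from the four steps in the post:
 * sync with the bus, acquire the lock, execute, release the lock. */
static struct clock_range xchg_mem_clocks(unsigned core_mhz, unsigned bus_mhz)
{
    assert(bus_mhz > 0 && core_mhz >= bus_mhz);

    /* Core-to-bus clock ratio, rounded to nearest (800/133 -> 6). */
    unsigned ratio = (core_mhz + bus_mhz / 2) / bus_mhz;
    unsigned sync_max = ratio - 1;   /* "up to 5 clock cycles" at 6:1 */

    struct clock_range r;
    /* sync (0) + lock (ratio) + execute (1) + resync (0) + release (ratio) */
    r.min = 0 + ratio + 1 + 0 + ratio;
    /* sync + lock + execute (2) + resync + release, all worst case */
    r.max = sync_max + ratio + 2 + sync_max + ratio;
    return r;
}
```

Calling `xchg_mem_clocks(2400, 133)` in the same model yields 37 to 72 clocks, which shows how the penalty grows with the clock ratio.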

It's worse on faster processors, because the ratio between the CPU clock
and the bus clock is higher.

It's even worse on SMP systems, because another processor might be doing
transactions on the external bus. In that case, you need to wait for the
other processor to finish its transactions before you can even acquire a
bus lock.

XCHG is NOT designed for simply swapping data between a register and a
memory location. It is an instruction designed to guarantee atomic updates
on a multiprocessor system.
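
To illustrate the intended use, here is a minimal spinlock sketch built on C11 `atomic_exchange_explicit`, which x86 compilers commonly implement with XCHG on a memory operand. The names `xchg_lock`, `xchg_trylock` and `xchg_unlock` are hypothetical, and this is a sketch of the idea rather than a production lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
struct xchg_lock {
    atomic_int held;
};

/* Atomically store 1 and look at the previous value.  If it was 0,
 * the caller now owns the lock.  The exchange is a single atomic
 * read-modify-write, which is what XCHG with a memory operand gives. */
static bool xchg_trylock(struct xchg_lock *l)
{
    return atomic_exchange_explicit(&l->held, 1,
                                    memory_order_acquire) == 0;
}

/* Releasing needs only a plain atomic store, not an exchange. */
static void xchg_unlock(struct xchg_lock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

A caller would loop on `xchg_trylock()` until it returns true, do its work, then call `xchg_unlock()`. The atomicity is the point of the instruction here, and it is also the source of the cost described above.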

Toshi

