This is the mail archive of the
mailing list for the GCC project.
Re: patch for volatile/immediate comparison
- To: tm at netcom dot com
- Subject: Re: patch for volatile/immediate comparison
- From: christian dot bruel at st dot com (Christian Bruel)
- Date: Tue, 23 Mar 1999 09:02:48 +0100
- Cc: egcs-patches at cygnus dot com
OK, this is a good explanation for dual-issue Pentium optimization, but anyway
we should expect the -fno-force-mem option (don't force memory operands to
registers before arith) to work if we want, even (especially ?) if the operand
On a general point of vue, I think that the dual-issue Pentium instruction
scheduler should not impact other non-risc architectures, for example I'm
afraid that a register would also be wasted on the m68k on this particular
test, as I didn't see force_mem forced to 0 is its description. Somebody knows
thanks for the pointers,
> > With the attached modification we now get (-O2 -fno-force-mem):
> > foo:
> > pushl %ebp
> > movl %esp,%ebp
> > cmpl $2,i <---
> > je .L2
> > xorl %eax,%eax
> > jmp .L3
> > .align 4
> > .L2:
> > movl $1,%eax
> > .L3:
> > movl %ebp,%esp
> > popl %ebp
> > ret
> > I'm not familiar with the Intel cost of the intructions, but there are some
> > architectures where loading a register and doing the comparison is more
> > expensive that directly doing the comparison with the memory (if the value
> > not reused after) (at least to all non-risk machines allowing memory
> > operations).
> This is bad. :(
> For the Pentium and above processors, Intel recommends a "RISClike code
> generation strategy" where memory references are decoupled from the
> operation dependent on them.
> 1) The Pentium is a dual-issue in-order processor. This means that
> there are two execution pipelines which run in lockstep, e.g.
> if one pipe stalls, then both pipes stall.
> 2) The register load latency on the Pentium is two clocks.
> For this sample:
> cmpl $2,i
> assuming these two instructions are paired, the "cmpl $2,i" will stall
> the processor for at least two clocks while the data at variable "i" is
> read from cache and compared against $2. During this time, the second
> pipe executing someinst2 will be stalled because the first pipe is stalled.
> Basicall, these two instructions will take 3 clocks to execute (1 clock
> for issue, two for latency) which gives you an IPC of 0.66 out of a
> possible maximum of 2, which is only about 33% resource utilization.
> If the memory reference is generated separately from the compare, the
> instruction scheduler has a chance of moving the memory reference
> backwards so the register will already be loaded by the time the compare
> occurs. This is good because:
> 1) The instruction scheduler may find another instruction which pairs
> with the load instruction so the IPC will hopefully be better, and
> 2) By the time the compare is executed, the register may already be
> loaded, which would eliminate the stall.
> Currently, I see a lot of instructions generated which are pessimal for
> Pentium and above; the really horrible case I see generated is
> increments and decrements of memory. These are truly evil because these
> instructions must be atomic, so the second pipe is stalled for the
> entire duration of the instruction, which is something like 4 clocks - 2
> for the load latency, one for the increment, and one for the write. This
> instruction lowers the IPC to about 0.5 (two instructions executed in
> four clocks, if the second instruction isn't dependent on the first).
> There's actually a really good optimization white paper available from
> Intel which covers these issues entitled something like "Optimization
> Strategies for Blended Code Generation" or something like this. It may be
> available from their web site at http://developer.intel.com.