patch for volatile/immediate comparison

Christian Bruel christian.bruel@st.com
Wed Mar 31 18:59:00 GMT 1999


OK, this is a good explanation for dual-issue Pentium optimization, but anyway 
we should expect the -fno-force-mem option (don't force memory operands to 
registers before arith) to work if we want, even (especially ?) if the operand 
is volatile.

On a general point of vue, I think that the dual-issue Pentium instruction 
scheduler should not impact other non-risc architectures, for example I'm 
afraid that a register would also be wasted on the m68k on this particular 
test, as I didn't see force_mem forced to 0 is its description. Somebody knows 
?

thanks for the pointers, 

Regards, 

 Christian


> > With the attached modification we now get (-O2 -fno-force-mem):
> > 
> > foo:
> >         pushl %ebp
> >         movl %esp,%ebp
> >         cmpl $2,i        <---
> >         je .L2
> >         xorl %eax,%eax
> >         jmp .L3
> >         .align 4
> > .L2:
> >         movl $1,%eax
> > .L3:
> >         movl %ebp,%esp
> >         popl %ebp
> >         ret
> > 
> > I'm not familiar with the Intel cost of the intructions, but there are some
 
> > architectures where loading a register and doing the comparison is more 
> > expensive that directly doing the comparison with the memory (if the value 
is 
> > not reused after) (at least to all non-risk machines allowing memory 
> > operations).
> 
> This is bad. :(
> 
> For the Pentium and above processors, Intel recommends a "RISClike code 
> generation strategy" where memory references are decoupled from the 
> operation dependent on them.
> 
> Consider:
> 
> 1) The Pentium is a dual-issue in-order processor. This means that
>    there are two execution pipelines which run in lockstep, e.g.
>    if one pipe stalls, then both pipes stall.
> 
> 2) The register load latency on the Pentium is two clocks.
> 
> For this sample:
> 
> 	cmpl $2,i
> 	someinst2
> 
> assuming these two instructions are paired, the "cmpl $2,i" will stall
> the processor for at least two clocks while the data at variable "i" is
> read from cache and compared against $2. During this time, the second
> pipe executing someinst2 will be stalled because the first pipe is stalled.
> 
> Basicall, these two instructions will take 3 clocks to execute (1 clock 
> for issue, two for latency) which gives you an IPC of 0.66 out of a 
> possible maximum of 2, which is only about 33% resource utilization.
> 
> If the memory reference is generated separately from the compare, the 
> instruction scheduler has a chance of moving the memory reference 
> backwards so the register will already be loaded by the time the compare 
> occurs. This is good because:
> 
> 1) The instruction scheduler may find another instruction which pairs
>    with the load instruction so the IPC will hopefully be better, and
> 
> 2) By the time the compare is executed, the register may already be 
>    loaded, which would eliminate the stall.
> 
> Currently, I see a lot of instructions generated which are pessimal for 
> Pentium and above; the really horrible case I see generated is 
> increments and decrements of memory. These are truly evil because these 
> instructions must be atomic, so the second pipe is stalled for the 
> entire duration of the instruction, which is something like 4 clocks - 2 
> for the load latency, one for the increment, and one for the write. This 
> instruction lowers the IPC to about 0.5 (two instructions executed in 
> four clocks, if the second instruction isn't dependent on the first).
> 
> There's actually a really good optimization white paper available from 
> Intel which covers these issues entitled something like "Optimization 
> Strategies for Blended Code Generation" or something like this. It may be 
> available from their web site at http://developer.intel.com .
> 
> Toshi
> 






More information about the Gcc-patches mailing list