patch for volatile/immediate comparison
Toshiyasu Morita
tm@netcom.com
Mon Mar 22 11:26:00 GMT 1999
> With the attached modification we now get (-O2 -fno-force-mem):
>
> foo:
> pushl %ebp
> movl %esp,%ebp
> cmpl $2,i <---
> je .L2
> xorl %eax,%eax
> jmp .L3
> .align 4
> .L2:
> movl $1,%eax
> .L3:
> movl %ebp,%esp
> popl %ebp
> ret
>
> I'm not familiar with the Intel cost of the intructions, but there are some
> architectures where loading a register and doing the comparison is more
> expensive that directly doing the comparison with the memory (if the value is
> not reused after) (at least to all non-risk machines allowing memory
> operations).
This is bad. :(
For the Pentium and above processors, Intel recommends a "RISClike code
generation strategy" where memory references are decoupled from the
operation dependent on them.
Consider:
1) The Pentium is a dual-issue in-order processor. This means that
there are two execution pipelines which run in lockstep, e.g.
if one pipe stalls, then both pipes stall.
2) The register load latency on the Pentium is two clocks.
For this sample:
cmpl $2,i
someinst2
assuming these two instructions are paired, the "cmpl $2,i" will stall
the processor for at least two clocks while the data at variable "i" is
read from cache and compared against $2. During this time, the second
pipe executing someinst2 will be stalled because the first pipe is stalled.
Basicall, these two instructions will take 3 clocks to execute (1 clock
for issue, two for latency) which gives you an IPC of 0.66 out of a
possible maximum of 2, which is only about 33% resource utilization.
If the memory reference is generated separately from the compare, the
instruction scheduler has a chance of moving the memory reference
backwards so the register will already be loaded by the time the compare
occurs. This is good because:
1) The instruction scheduler may find another instruction which pairs
with the load instruction so the IPC will hopefully be better, and
2) By the time the compare is executed, the register may already be
loaded, which would eliminate the stall.
Currently, I see a lot of instructions generated which are pessimal for
Pentium and above; the really horrible case I see generated is
increments and decrements of memory. These are truly evil because these
instructions must be atomic, so the second pipe is stalled for the
entire duration of the instruction, which is something like 4 clocks - 2
for the load latency, one for the increment, and one for the write. This
instruction lowers the IPC to about 0.5 (two instructions executed in
four clocks, if the second instruction isn't dependent on the first).
There's actually a really good optimization white paper available from
Intel which covers these issues entitled something like "Optimization
Strategies for Blended Code Generation" or something like this. It may be
available from their web site at http://developer.intel.com .
Toshi
More information about the Gcc-patches
mailing list