patch for volatile/immediate comparison

Toshiyasu Morita tm@netcom.com
Mon Mar 22 11:26:00 GMT 1999


> With the attached modification we now get (-O2 -fno-force-mem):
> 
> foo:
>         pushl %ebp
>         movl %esp,%ebp
>         cmpl $2,i        <---
>         je .L2
>         xorl %eax,%eax
>         jmp .L3
>         .align 4
> .L2:
>         movl $1,%eax
> .L3:
>         movl %ebp,%esp
>         popl %ebp
>         ret
> 
> I'm not familiar with the Intel cost of the intructions, but there are some 
> architectures where loading a register and doing the comparison is more 
> expensive that directly doing the comparison with the memory (if the value is 
> not reused after) (at least to all non-risk machines allowing memory 
> operations).

This is bad. :(

For the Pentium and above processors, Intel recommends a "RISClike code 
generation strategy" where memory references are decoupled from the 
operation dependent on them.

Consider:

1) The Pentium is a dual-issue in-order processor. This means that
   there are two execution pipelines which run in lockstep, e.g.
   if one pipe stalls, then both pipes stall.

2) The register load latency on the Pentium is two clocks.

For this sample:

	cmpl $2,i
	someinst2

assuming these two instructions are paired, the "cmpl $2,i" will stall
the processor for at least two clocks while the data at variable "i" is
read from cache and compared against $2. During this time, the second
pipe executing someinst2 will be stalled because the first pipe is stalled.

Basicall, these two instructions will take 3 clocks to execute (1 clock 
for issue, two for latency) which gives you an IPC of 0.66 out of a 
possible maximum of 2, which is only about 33% resource utilization.

If the memory reference is generated separately from the compare, the 
instruction scheduler has a chance of moving the memory reference 
backwards so the register will already be loaded by the time the compare 
occurs. This is good because:

1) The instruction scheduler may find another instruction which pairs
   with the load instruction so the IPC will hopefully be better, and

2) By the time the compare is executed, the register may already be 
   loaded, which would eliminate the stall.

Currently, I see a lot of instructions generated which are pessimal for 
Pentium and above; the really horrible case I see generated is 
increments and decrements of memory. These are truly evil because these 
instructions must be atomic, so the second pipe is stalled for the 
entire duration of the instruction, which is something like 4 clocks - 2 
for the load latency, one for the increment, and one for the write. This 
instruction lowers the IPC to about 0.5 (two instructions executed in 
four clocks, if the second instruction isn't dependent on the first).

There's actually a really good optimization white paper available from 
Intel which covers these issues entitled something like "Optimization 
Strategies for Blended Code Generation" or something like this. It may be 
available from their web site at http://developer.intel.com .

Toshi



More information about the Gcc-patches mailing list