This is the mail archive of the mailing list for the GCC project.
Re: [PATCH] Tweak cost of lea on Pentium4
> On Sun, 20 Jun 2004, Jan Hubicka wrote:
> > The idea was that the P4 and some earlier cores execute lea in the
> > same time as the equivalent sequence of primitive operations (shifts
> > and adds). This is still what the P4 does.
> Indeed, shifts and adds, not multiplications and adds. The canonical
> form for these instructions in GCC's RTL is to use a multiplication rather
> than a shift, hence it's the multiplication that needs to be (and is)
> matched in rtx_costs.
> > I think all we need with the old scheme is to teach rtx_cost that
> > multiplication by a power of 2 is in fact a shift and thus
> > significantly cheaper.
> Here I disagree. The role of the backend's rtx_cost is to report the
> cost of the specified operation as given. It's not for each back-end
> to second-guess the middle-end's optimizations. rtx_costs should
It was meant only as a trick to get the LEA cost estimated right without
duplicating rtx_cost's recursive walk. It should not hurt in non-lea
cases, as we should be facing the canonicalized form of RTL, which
contains no such multiplications except in addressing; but as an
alternative, we can simply match all the supported lea instruction cases
in the lea discovery and compute the cost based on that.
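As a rough illustration of that trick (a hypothetical sketch, not the actual i386.c code; the function names and cycle values here are made up):

```c
/* Sketch of pricing (mult X N) as a shift when N is a power of two,
   since the middle-end canonicalizes shifts inside addresses to MULT.
   COST_SHIFT/COST_MULT and mult_cost are illustrative, not GCC's API.  */
#include <stdbool.h>

enum { COST_SHIFT = 1, COST_MULT = 14 };  /* made-up cycle counts */

static bool
power_of_two_p (unsigned long x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

/* Cost of (mult X N): a multiplication by a power of two is really
   a shift and should be priced as one.  */
static int
mult_cost (unsigned long n)
{
  return power_of_two_p (n) ? COST_SHIFT : COST_MULT;
}
```

With this, "(mult x 4)" would be priced like "(ashift x 2)" while a genuine multiplication keeps its full cost.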
> report the cost of an addition for "(plus x x)", the cost of a shift
> for "(ashift x 1)" and the cost of a multiplication for "(mult x 2)".
> It is by asking the backend how much each instruction pattern costs
> that it gets to choose which one is most suitable. There should be no
> need to special-case every backend to tweak the costs of multiplications
> by zero, one, two, three, four, etc... This is the middle-end's job.
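The division of labour described above could be sketched like so (purely illustrative names and cost values, not GCC's actual rtx_costs interface):

```c
/* Sketch: the middle-end asks the backend what each equivalent RTL
   form costs and keeps the cheapest, instead of the backend
   second-guessing which form the middle-end will pick.
   rtx_kind, form_cost and cheapest_double are hypothetical.  */

enum rtx_kind { RTX_PLUS, RTX_ASHIFT, RTX_MULT };

/* Illustrative backend answer for "what does doubling X cost?"  */
static int
form_cost (enum rtx_kind kind)
{
  switch (kind)
    {
    case RTX_PLUS:   return 1;   /* (plus x x)   */
    case RTX_ASHIFT: return 1;   /* (ashift x 1) */
    case RTX_MULT:   return 14;  /* (mult x 2)   */
    }
  return 0;
}

/* Middle-end side: compare the alternatives, keep the cheapest.  */
static enum rtx_kind
cheapest_double (void)
{
  enum rtx_kind best = RTX_PLUS;
  if (form_cost (RTX_ASHIFT) < form_cost (best))
    best = RTX_ASHIFT;
  if (form_cost (RTX_MULT) < form_cost (best))
    best = RTX_MULT;
  return best;
}
```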
> The approach of pretending that "lea" doesn't exist at all on the
> Pentium4 just produces inferior code. If the middle-end wants to
Performance-wise, lea does not exist for the P4, since the CPU's decoder
frontend simply decomposes it into a sequence of primitive operations
(the only advantage is the 3-address nature of the instruction). Most of
the code in the backend already prefers the decomposed sequence over the
instruction, but perhaps it would make sense to compute the cost of the
decomposed sequence on TARGET_DECOMPOSE_LEA, treating MULT as SHIFT.
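Pricing an lea as its decomposed sequence could look roughly like this (a hypothetical sketch; lea_decomposed_cost and the cost constants are illustrative, not the real processor_cost entries):

```c
/* Sketch: cost of lea (base + index*scale + disp) computed as the
   sum of the primitive operations the P4 decoder splits it into.
   A scale > 1 contributes one shift; each extra component one add.  */

enum { COST_ADD = 1, COST_SHIFT = 1 };  /* made-up cycle counts */

static int
lea_decomposed_cost (int has_base, int scale, int has_disp)
{
  int cost = 0;
  if (scale > 1)
    cost += COST_SHIFT;   /* index << log2 (scale) */
  if (has_base)
    cost += COST_ADD;     /* + base */
  if (has_disp)
    cost += COST_ADD;     /* + disp */
  return cost;
}
```

Under this accounting, X+Y+DISP prices at two adds while X*2+Y+DISP prices one cycle higher, which matches the distinction discussed below.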
> know how much an instruction to compute "(X*4 + Y)" costs, it should
> be able to find out, and avoid that instruction if it's a bad choice.
> As I pointed out, we really do want to use an lea on P4 when optimizing
> for size.
> > Then the pattern above would result in 3 cycles as you suggest,
> > while more complex leas will be more expensive....
> If some leas are more expensive than others, this reflects a deficiency
> in the i386.c backend's processor_cost structure, which just provides
> a single value for all lea insns. However, on the P4 all shifts (and
> multiplications) are currently parameterized in i386.c as having the
> same cost, so I'd be surprised if "X*2 + Y", "X*4 + Y" and "X*8 + Y"
> were significantly different speeds.
The main point was to distinguish X+Y+DISP, which is a cycle faster,
from X*2+Y+DISP, which is a cycle slower (precisely as if it were
rewritten into the primitive operations).
My schedule is somewhat busy right now, but I might give this a try
later in stage2. The point of TARGET_DECOMPOSE_LEA is not to prevent
the backend from using lea at all, but to tell the backend that using
LEA brings no performance advantage, as the CPU decomposes it anyway
(though due to a bug in the RTX cost estimates it resulted in somewhat
different behaviour, unfortunately). It would also be nice to have some
way to dump the costs so we know what is going on...