This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Tweak cost of lea on Pentium4


On Sun, 20 Jun 2004, Jan Hubicka wrote:
> The idea was that P4 and some of earlier cores executes lea in same time
> as if it were equivalent sequence of primitive operations (shifts and
> adds). This is still what P4 does.

Indeed, shifts and adds, not multiplications and adds.  The canonical
form for these instructions in GCC's RTL is to use a multiplication rather
than a shift, hence it's the multiplication that needs to be (and is)
matched in rtx_costs.


> I think all we need with the old scheme is to teach rtx_cost that
> multiplication by power of 2 is in fact shifting and thus it is
> significantly cheaper.

Here I disagree.  The role of the backend's rtx_cost is to report the
cost of the specified operation as given.  It's not for each backend
to second-guess the middle-end's optimizations.  rtx_costs should
report the cost of an addition for "(plus x x)", the cost of a shift
for "(ashift x 1)" and the cost of a multiplication for "(mult x 2)".
It is by asking the backend how much each instruction pattern costs
that the middle-end gets to choose which one is most suitable.  There
should be no need to special-case every backend to tweak the costs of
multiplications by zero, one, two, three, four, etc.  This is the
middle-end's job.
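
To make the division of labour concrete, here is a minimal sketch in C.
It is not GCC code: toy_code, toy_costs, toy_rtx_cost and toy_cheapest
are hypothetical names, and any cost values are supplied by the caller.
The "backend" function reports the cost of each form exactly as given,
and the "middle-end" function picks the cheapest of several equivalent
forms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation codes that only mimic the RTL forms
   (plus x x), (ashift x 1) and (mult x 2). */
enum toy_code { TOY_PLUS, TOY_ASHIFT, TOY_MULT };

/* Hypothetical per-processor cost table; the values are up to the caller. */
struct toy_costs {
    int add;
    int shift;
    int mult;
};

/* "Backend" side: report the cost of the operation as given, without
   guessing that a multiplication could be done as a shift or an add. */
static int toy_rtx_cost(const struct toy_costs *c, enum toy_code code)
{
    switch (code) {
    case TOY_PLUS:   return c->add;
    case TOY_ASHIFT: return c->shift;
    case TOY_MULT:   return c->mult;
    }
    return c->mult; /* not reached for valid codes */
}

/* "Middle-end" side: given n equivalent forms, ask for the cost of each
   and keep the cheapest.  Ties go to the earlier candidate. */
static enum toy_code toy_cheapest(const struct toy_costs *c,
                                  const enum toy_code *forms, size_t n)
{
    assert(n > 0);
    enum toy_code best = forms[0];
    for (size_t i = 1; i < n; i++)
        if (toy_rtx_cost(c, forms[i]) < toy_rtx_cost(c, best))
            best = forms[i];
    return best;
}
```

With a table where shifts are expensive, toy_cheapest would pick the
addition; with a table where shifts are cheap, it would pick the shift.
Neither decision requires the cost function to know the forms are
equivalent.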

The approach of pretending that "lea" doesn't exist at all on the
Pentium4 just produces inferior code.  If the middle-end wants to
know how much an instruction to compute "(X*4 + Y)" costs, it should
be able to find out, and avoid that instruction if it's a bad choice.
As I pointed out, we really do want to use an lea on P4 when optimizing
for size.
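
As a rough illustration of why the answer can differ between speed and
size, the C sketch below compares a single lea against a shift followed
by an add under either metric.  Everything here is hypothetical:
toy_insn_cost and toy_prefer_lea are invented names, the cycle and byte
counts come from the caller, and the sketch ignores any extra register
copy the two-instruction sequence might need.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-instruction cost: latency in cycles, size in bytes. */
struct toy_insn_cost {
    int cycles;
    int bytes;
};

/* Select the metric that matters for the current optimization goal. */
static int toy_metric(struct toy_insn_cost c, bool optimize_size)
{
    return optimize_size ? c.bytes : c.cycles;
}

/* Return true if one lea computing (X*4 + Y) is no worse than a shift
   followed by an add, under the chosen metric. */
static bool toy_prefer_lea(struct toy_insn_cost lea,
                           struct toy_insn_cost shift,
                           struct toy_insn_cost add,
                           bool optimize_size)
{
    assert(lea.bytes > 0 && shift.bytes > 0 && add.bytes > 0);
    int sequence = toy_metric(shift, optimize_size)
                 + toy_metric(add, optimize_size);
    return toy_metric(lea, optimize_size) <= sequence;
}
```

If the caller supplies a lea that is slower than the sequence but
shorter in bytes, the function rejects it for speed and accepts it for
size, which is the behaviour that hiding lea from the middle-end would
prevent.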


> Then the pattern above would result in 3 cycles as you suggest,
> while more complex leas will be more expensive....

If some leas are more expensive than others, this reflects a deficiency
in the i386.c backend's processor_cost structure, which just provides
a single value for all lea insns.  However, on the P4 all shifts (and
multiplications) are currently parameterized in i386.c as having the
same cost, so I'd be surprised if "X*2 + Y", "X*4 + Y" and "X*8 + Y"
ran at significantly different speeds.
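
The limitation can be seen in a toy version of such a table, sketched
in C below.  This is not the real i386.c structure: toy_processor_cost
and toy_lea_cost are hypothetical, and the field set is reduced to what
the argument needs.  With only one lea field, the scale factor cannot
influence the reported cost.

```c
#include <assert.h>

/* Hypothetical, heavily reduced cost table with a single lea entry. */
struct toy_processor_cost {
    int add;         /* cost of an add */
    int lea;         /* one cost shared by every lea form */
    int shift_const; /* cost of a shift by a constant */
};

/* Cost of a lea computing X*scale + Y.  The scale is validated but
   cannot change the result, because the table holds one lea value. */
static int toy_lea_cost(const struct toy_processor_cost *c, int scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    (void)scale; /* unused apart from the check above */
    return c->lea;
}
```

Distinguishing cheap leas from expensive ones would mean adding fields
to the table, rather than hiding the instruction from the middle-end.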


Roger
--

