This is the mail archive of the
mailing list for the GCC project.
Re: Performance of Integer Multiplication on PIII
----- Original Message -----
From: "Jan Hubicka" <email@example.com>
To: "Kevin Atkinson" <firstname.lastname@example.org>
Cc: "Tim Prince" <email@example.com>;
Sent: Monday, November 05, 2001 5:08 AM
Subject: Re: Performance of Integer Multiplication on PIII
> > On Sat, 3 Nov 2001, Tim Prince wrote:
> > > > From Agner Fog (http://www.agner.org/assem/) pentopt.zip
> > > >
> > > >                   PPlain  PMMX  PPro  PII  PIII
> > > > IMUL latency      9       9     4     4    4
> > > > IMUL throughput   1/9     1/9   1/1   1/1  1/1
> > > >
> > > > That means, imul is pipelined on i686 ...
> > > >
> > > > > So I guess the lesson here is that on PIII integer
> > > fast
> > > > > enough that doing special tricks to avoid integer
> > > hurt
> > > > > > performance instead of helping it.
> > > >
> > > Even on the P4, code which permits full pipelining will run well
> > > imul, while the add and shift sequences are preferable in contexts
> > > that is not possible. I haven't seen any compiler which is able to
> > > distinguish those situations.
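To make the latency/throughput distinction concrete, here is a rough sketch of a cycle estimate for a run of multiplies. The helper names and the simple model are my own assumptions, not anything from gcc or the Intel manuals; the numbers plugged in below come from the Agner Fog table quoted above, with "interval" meaning the reciprocal of throughput (cycles between issues).

```c
/* Hypothetical helpers: rough cycle estimates for n multiplies.
   The model ignores everything except imul latency and issue interval. */

/* Each multiply waits for the previous result: latency-bound. */
unsigned dependent_cycles(unsigned n, unsigned latency)
{
    return n * latency;
}

/* Multiplies are independent and can overlap: throughput-bound.
   The first result appears after `latency` cycles, then one more
   result every `interval` cycles. */
unsigned independent_cycles(unsigned n, unsigned latency, unsigned interval)
{
    return n == 0 ? 0 : latency + (n - 1) * interval;
}
```

With the PIII figures (latency 4, interval 1), eight dependent multiplies come to 32 cycles in this model and eight independent ones to 11. With the PMMX figures (latency 9, interval 9) the independent case is still 72, which is why avoiding imul paid off there.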
> > The Intel compiler seems to be able to as it gets about the same
> > as my hand coded assembly did. It should not be too difficult to
> Maybe it is because gcc's algorithm is based purely on the
> latencies, not the throughput (as it is difficult to estimate the
> generated sequence in early compilation passes).
Also, the longer add and shift sequences would not produce the expected
benefit, due to their effect on the trace cache or instruction queue.
This would defy static prediction, except that we know that the current
costs exaggerate the preference to avoid imul or shift in most cases.
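As an illustration of the two code shapes being compared, here is a minimal sketch of a multiply by the constant 10 written both ways. The function names are hypothetical, and a compiler is free to turn either form into the other, so the generated assembly has to be inspected to see which sequence was actually emitted.

```c
#include <stdint.h>

/* One multiply (imul on x86, if the compiler chooses to emit it). */
uint32_t mul10_imul(uint32_t x)
{
    return x * 10u;
}

/* Shift-and-add form: 10*x = 8*x + 2*x.
   No multiply, but more instructions to fetch, decode, and cache. */
uint32_t mul10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Both functions use unsigned arithmetic, so they agree for every input, including values that wrap around.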
> Perhaps we can artificially lower the gcc's imul cost on Pentiums if it
> in better code, but this needs some larger scale benchmark than your
> You can try to edit the config/i386/i386.c the pentiumpro_cost
> see what happens.
I embarked already on such an experiment, making the cost of the P4 shift
4 rather than 8, in accord with the quotation from the Intel literature.
Someone might wish to tune the compiler against a well-known benchmark,
or against their own application, taking the costs more as an indication
of which code sequence is preferred rather than as an attempt to predict
> At the moment it says:
> 1, /* cost of an add instruction */
> 1, /* cost of a lea instruction */
> 1, /* variable shift costs */
> 1, /* constant shift costs */
> 4, /* cost of starting a multiply */
> That is consistent with the numbers above.
> probably I see the problem - gcc relies on fact that lea instruction
> cycle, while on pentiums IMO the lea is decomposed to primitive
> Perhaps all we need is to show this fact to gcc in RTX_COST.
> I will prepare a patch shortly.
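For readers following along, here is a toy sketch of how relative costs like the ones quoted above could steer the choice between a multiply and a synthesized sequence. The struct, its field names, and both functions are hypothetical and are not gcc's actual data structures or logic. The lea cost of 2 used in the note below is purely an assumed value to show the effect of a more expensive lea; it is not a measured figure.

```c
/* Toy cost table; field names are hypothetical, not gcc's. */
struct toy_costs {
    int add;          /* cost of an add instruction */
    int lea;          /* cost of a lea instruction */
    int shift_const;  /* cost of a shift by a constant */
    int mult_init;    /* cost of starting a multiply */
};

/* Total cost of a synthesized sequence made of n_lea lea instructions,
   n_shift constant shifts, and n_add adds. */
int sequence_cost(const struct toy_costs *c, int n_lea, int n_shift, int n_add)
{
    return n_lea * c->lea + n_shift * c->shift_const + n_add * c->add;
}

/* Returns 1 if the synthesized sequence is strictly cheaper than a multiply. */
int prefer_sequence(const struct toy_costs *c, int n_lea, int n_shift, int n_add)
{
    return sequence_cost(c, n_lea, n_shift, n_add) < c->mult_init;
}
```

Take x*45, which can be built from two lea instructions (x*5, then that result times 9). With the quoted costs (lea = 1, multiply = 4) the sequence costs 2 and wins. If lea is assumed to cost 2, the sequence costs 4 and no longer beats the multiply, so the same constant would be compiled to imul.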