This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Performance of Integer Multiplication on PIII



----- Original Message -----
From: "Jan Hubicka" <jh@suse.cz>
To: "Kevin Atkinson" <kevin@atkinson.dhs.org>
Cc: "Tim Prince" <tprince@computer.org>;
<pete@ltoi.iap.physik.tu-darmstadt.de>; <gcc@gcc.gnu.org>
Sent: Monday, November 05, 2001 5:08 AM
Subject: Re: Performance of Integer Multiplication on PIII


> > On Sat, 3 Nov 2001, Tim Prince wrote:
> >
> > > >  From Agner Fog (http://www.agner.org/assem/) pentopt.zip
> > > >
> > > >                  PPlain      PMMX    PPro    PII   PIII
> > > >  IMUL latency       9          9       4      4      4
> > > >  IMUL throughput   1/9        1/9     1/1    1/1    1/1
> > > >
> > > >  That means, imul is pipelined on i686 ...
> > > >
> > > > > So I guess the lesson here is that on PIII integer multiplication
> > > > > is fast enough that doing special tricks to avoid integer
> > > > > multiplication will hurt performance instead of helping it.
> > > >
> > > Even on the P4, code which permits full pipelining will run well with
> > > imul, while the add and shift sequences are preferable in contexts
> > > where that is not possible.  I haven't seen any compiler which is able
> > > to distinguish those situations.
> >
> > The Intel compiler seems to be able to, as it gets about the same results
> > as my hand coded assembly did.  It should not be too difficult to tell when
>
> Maybe it is because gcc's algorithm is based purely on the instruction
> latencies, not the throughput (as it is difficult to estimate the
> throughput of a generated sequence in early compilation passes).
>
Also, the longer add and shift sequences would not produce the expected
benefit, due to their effect on the trace cache or instruction queue.
This would defy static prediction, except that we know that the current
costs exaggerate the preference to avoid imul or shift in most cases.

> Perhaps we can artificially lower gcc's imul cost on Pentiums if it results
> in better code, but this needs a larger-scale benchmark than your testcase.
>
> You can try to edit the pentiumpro_cost structure in config/i386/i386.c
> and see what happens.
>
I embarked already on such an experiment, making the cost of the P4 shift
4 rather than 8, in accord with the quotation from the Intel literature.
Someone might wish to tune the compiler against a well-known benchmark,
or against their own application, taking the costs more as an indication
of which code sequence is preferred rather than as an attempt to predict
performance.
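
For concreteness, the two kinds of code sequence that the costs choose
between can be sketched in C as below. This is only an illustration with a
multiplier of 10 picked arbitrarily and hypothetical function names; an
optimizing compiler is free to turn either form into the other, so the
generated assembly (for example from gcc -S) is what has to be inspected
and timed, not the source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* x * 10 written as a single multiply; a candidate for one imul. */
static uint32_t mul10_imul(uint32_t x)
{
    return x * 10u;
}

/* x * 10 written as shifts and an add: 10*x = 8*x + 2*x. */
static uint32_t mul10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}

int main(void)
{
    /* Unsigned arithmetic wraps modulo 2^32, so the forms always agree. */
    for (uint32_t x = 0; x < 100000u; ++x)
        assert(mul10_imul(x) == mul10_shift_add(x));
    assert(mul10_imul(0xFFFFFFFFu) == mul10_shift_add(0xFFFFFFFFu));

    printf("both forms agree\n");
    return 0;
}
```
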
> At the moment it says:
>
>   1, /* cost of an add instruction */
>   1, /* cost of a lea instruction */
>   1, /* variable shift costs */
>   1, /* constant shift costs */
>   4, /* cost of starting a multiply */
>
> That is consistent with the numbers above.
>
> BTW, I probably see the problem: gcc relies on the fact that the lea
> instruction takes one cycle, while on Pentiums IMO the lea is decomposed
> into primitive instructions.
> Perhaps all we need is to show this fact to gcc in RTX_COST.
>
> I will prepare a patch shortly.
>
> Honza


