This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: Performance of Integer Multiplication on PIII

To: <pete at ltoi dot iap dot physik dot tu-darmstadt dot de>
Subject: Re: Performance of Integer Multiplication on PIII
From: "Tim Prince" <tprince at computer dot org>
Date: Sun, 4 Nov 2001 08:27:25 -0800
Cc: <kevin at atkinson dot dhs dot org>, <gcc at gcc dot gnu dot org>
References: <Pine.A32.4.40.0111041600070.15165-100000@ltoi.iap.physik.tu-darmstadt.de>


----- Original Message -----
From: <pete@ltoi.iap.physik.tu-darmstadt.de>
To: "Tim Prince" <tprince@computer.org>
Cc: <kevin@atkinson.dhs.org>; <gcc@gcc.gnu.org>
Sent: Sunday, November 04, 2001 7:22 AM
Subject: Re: Performance of Integer Multiplication on PIII


>
> > gcc-3.1 has -march=pentium4 for the P4.
> Good
>
> > Of course, subsequent NetBurst versions should reduce a few of the
> > excessive operation costs

> Now you add the P4, with the result:
>
>   Method      P55C   P6     P4
>   imul        slow   fast   well
>   shift&add   fast   slow   preferable for some cases
>
> Here is the rationale:
>
> On page 2-55 of the Intel P4 Optimization manual we read: imul incurs some
> extra latency [p. C-13: latency 14, throughput 3] due to being executed on
> the FPU. And Assembly/Compiler Coding Rule 44: replace imul by a small
> constant with two or more add and lea instructions, especially when the
> imul is part of a dependency chain.
>
> And on p. 2-54: shifts have a longer latency than on previous processors
> [p. C-13: latency 4, throughput 1]
>
> And specifically, Assembly/Compiler Coding Rule 42: if a shift is on the
> critical path, replace it by a sequence of up to three adds. (sic! Not more)
>
So, the cost of shift by 4 should be set less than the cost of 4 adds,
even if this does not exactly agree with the current table, while the
cost of shift by 3 must be more than the cost of 3 adds. This interesting
statement might even be taken as an indication of intent to make future
processors conform with this assessment of shift performance.  Shift by 3
is a frequent case where MSVC code is slow on current P4. Thanks for the
excellent summary of available documents.
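
As an illustration of the rule quoted above, here is a minimal C sketch of
the arithmetic identity involved: a shift by 3 written as three dependent
adds, and a shift by 4 left as one shift. The function names are
hypothetical, and the C only shows the equivalence; which instructions are
actually emitted is up to the compiler and its cost tables.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: x << 3 written as three dependent adds. */
uint32_t shl3_by_adds(uint32_t x)
{
    x += x;   /* 2x */
    x += x;   /* 4x */
    x += x;   /* 8x */
    return x;
}

/* Hypothetical helper: x << 4 kept as a single shift, since four adds
   would exceed the "up to three adds" limit in the quoted rule. */
uint32_t shl4_by_shift(uint32_t x)
{
    return x << 4;
}

/* Sanity check: the add chain agrees with the plain shift, including
   when the unsigned result wraps around. */
void check_shift_forms(void)
{
    assert(shl3_by_adds(5u) == (5u << 3));
    assert(shl3_by_adds(0xFFFFFFFFu) == (uint32_t)(0xFFFFFFFFu << 3));
    assert(shl4_by_shift(3u) == 48u);
}
```

Unsigned arithmetic is used so that the two forms agree for every input.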
> Thus, if (and you could see this in the .s files) gcc uses too many
> equivalent replacement instructions for imul, then, even on the P4, with
> its (not fully pipelined) imul, your hand-coded version runs faster ...
> thus leading us again to the question: in what respect is the new
> gcc-3.0.x x86 backend improved?
Yes, even when there are sequential dependencies, a large expansion may
lose by overflowing the trace cache.  Besides, it's useful to have
options which work well on a variety of processors.
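
To make the trade-off concrete, here is a rough C sketch of what replacing a
multiply by a small constant looks like. Each step is shaped like a single
lea or add (a register plus the same or another register scaled by 2, 4, or
8), and each step depends on the one before it. The constants 10 and 45 and
the function names are chosen only for illustration; they are not taken
from gcc's output.

```c
#include <assert.h>
#include <stdint.h>

/* x * 10 in two dependent steps: 5x = x + 4x (lea-shaped), 10x = 5x + 5x. */
uint32_t mul10_expanded(uint32_t x)
{
    uint32_t t = x + (x << 2);   /* 5x */
    return t + t;                /* 10x */
}

/* x * 45 in two dependent steps: 9x = x + 8x, then 45x = 9x + 4 * 9x. */
uint32_t mul45_expanded(uint32_t x)
{
    uint32_t t = x + (x << 3);   /* 9x */
    return t + (t << 2);         /* 45x */
}

/* Reference form: a plain multiply, which the compiler may emit as imul. */
uint32_t mul_plain(uint32_t x, uint32_t c)
{
    return x * c;
}

/* Sanity check: expanded forms match the plain multiply, wraparound included. */
void check_mul_forms(void)
{
    assert(mul10_expanded(7u) == mul_plain(7u, 10u));
    assert(mul45_expanded(3u) == mul_plain(3u, 45u));
    assert(mul45_expanded(0xFFFFFFFFu) == mul_plain(0xFFFFFFFFu, 45u));
}
```

Constants that need more steps than these produce longer sequences, which
is where the code-size concern comes in.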
>
> - Supports: -march=athlon
> - ?
I know that the SuSE people have been working hard on this.
I only wish it were not so difficult to change the OS on these AthlonMP
SCSI drive boxes which come from Taiwan with Win98 installed.  Sorry to
be OT, but does anything work? SuSE?

