This is the mail archive of the
mailing list for the GCC project.
Re: Performance of Integer Multiplication on PIII
- To: <pete at ltoi dot iap dot physik dot tu-darmstadt dot de>
- Subject: Re: Performance of Integer Multiplication on PIII
- From: "Tim Prince" <tprince at computer dot org>
- Date: Sun, 4 Nov 2001 08:27:25 -0800
- Cc: <kevin at atkinson dot dhs dot org>, <gcc at gcc dot gnu dot org>
- References: <Pine.A32.firstname.lastname@example.org>
----- Original Message -----
To: "Tim Prince" <email@example.com>
Cc: <firstname.lastname@example.org>; <email@example.com>
Sent: Sunday, November 04, 2001 7:22 AM
Subject: Re: Performance of Integer Multiplication on PIII
> > gcc-3.1 has -march=pentium4 for the P4.
> > Of course, subsequent NetBurst versions should reduce a few of the
> > operation costs
> Now you add the P4, with the result:
> Method      P55C   P6     P4
> imul        slow   fast   well
> shift&add   fast   slow   preferable for some cases
> Here is the rationale:
> On page 2-55 of the Intel P4 Optimization manual we read: imul incurs some
> latency [p. C-13: latency 14, throughput 3] due to being executed on the
> FPU. And
> Assembly/Compiler Coding Rule 44: Replace imul by a small constant with two
> add and lea instructions, especially when the imul is part of a dependency chain.
> And on p. 2-54: shifts have a longer latency than on previous
> processors [p. C-13: latency 4, throughput 1].
> And specifically, Assembly/Compiler Coding Rule 42: if a shift is on the critical
> path, replace it by a sequence of up to three adds. (sic! Not more)
So, the cost of shift by 4 should be set less than the cost of 4 adds,
even if this does not exactly agree with the current table, while the
cost of shift by 3 must be more than the cost of 3 adds. This interesting
statement might even be taken as an indication of intent to make future
processors conform with this assessment of shift performance. Shift by 3
is a frequent case where MSVC code is slow on current P4. Thanks for the
excellent summary of available documents.
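
To make the shift-versus-adds comparison concrete, here is a minimal C sketch. The helper names and the cost units are my own for illustration; they are not GCC internals and the numbers are not taken from the Intel tables. The first function writes a left shift as a chain of dependent adds, and the second checks whether n adds undercut one shift for a given pair of costs.

```c
#include <assert.h>
#include <stdint.h>

/* Left shift by n written as n dependent adds: each add doubles x.
   Unsigned arithmetic wraps, so the result matches x << n. */
static uint32_t shl_by_adds(uint32_t x, unsigned n)
{
    assert(n < 32);
    while (n-- > 0)
        x += x;
    return x;
}

/* Hypothetical cost check (not a GCC function): returns nonzero when
   n adds are cheaper than a single shift, for costs in arbitrary units. */
static int adds_beat_shift(unsigned n, unsigned add_cost, unsigned shift_cost)
{
    return n * add_cost < shift_cost;
}
```

With the assumed values add_cost = 2 and shift_cost = 7, three adds win over a shift by 3 and four adds lose to a shift by 4, which is the ordering that Rule 42 implies for the cost table.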
> Thus, if (and you could see this in the .s files) gcc uses too many
> equivalent replacement instructions for imul, then, even on the P4 with its
> (not fully pipelined) imul, your handcoded version runs faster ...
> thus leading us again to the question: in what respect is the new
> x86 backend improved?
Yes, even when there are sequential dependencies, a large expansion may
lose because it overflows the trace cache. Besides, it's useful to have
options which work well on a variety of processors.
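
As a rough illustration of how quickly such an expansion grows, the C sketch below multiplies by a constant using the plain binary shift-and-add method and counts the operations it would need. This is only a sketch under my own assumptions: the function names and the max_ops limit are hypothetical, and this is not claimed to be the sequence gcc actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply x by the constant c using only shifts and adds (binary method).
   *ops receives the number of shift and add operations the expansion needs:
   one shift per set bit above bit 0, one add per set bit after the first. */
static uint32_t mul_shift_add(uint32_t x, uint32_t c, unsigned *ops)
{
    uint32_t result = 0;
    unsigned count = 0;
    int first = 1;

    assert(ops != 0);
    for (unsigned bit = 0; bit < 32; bit++) {
        if (c & ((uint32_t)1 << bit)) {
            uint32_t term = x << bit;
            if (bit != 0)
                count++;            /* the shift producing this term */
            if (!first)
                count++;            /* the add accumulating this term */
            result = first ? term : result + term;
            first = 0;
        }
    }
    *ops = count;
    return result;
}

/* Hypothetical policy: expand only when the sequence stays short;
   otherwise a single imul is the better choice. */
static int expansion_worthwhile(unsigned ops, unsigned max_ops)
{
    return ops <= max_ops;
}
```

For example, multiplying by 10 needs 3 operations under this scheme, while multiplying by 255 needs 14, which is where a single imul starts to look attractive even with its long latency.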
> - Supports: -march=athlon
> - ?
I know that the SuSE people have been working hard on this.
I only wish it were not so difficult to change the OS on these AthlonMP
SCSI drive boxes which come from Taiwan with Win98 installed. Sorry to
be OT, but does anything work? SuSE?