This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Performance of Integer Multiplication on PIII
- To: Tim Prince <tprince at computer dot org>
- Subject: Re: Performance of Integer Multiplication on PIII
- From: pete at ltoi dot iap dot physik dot tu-darmstadt dot de
- Date: Sun, 4 Nov 2001 17:22:41 +0200 (MEST)
- Cc: kevin at atkinson dot dhs dot org, <gcc at gcc dot gnu dot org>
... Delay due to big time zone shift constant ...
> gcc-3.1 has -march=pentium4 for the P4.
Good
> Of course, subsequent NetBurst versions should reduce a few of the excessive
> operation costs
But who knows ... Even then, the fact remains that the P4 is sufficiently
different from its "successor" (by number ;-) the P5, and from the successor
of its successor, the P6, to justify its own -march option. As said
previously, that is not the case for the various P6 variants (modulo the
architectural extensions).
> > > When running these same tests on a Mobile Pentium MMX (using -march=i586)
> > > gcc's code does outperform mine. I do not have anything in between to run
> > > these tests on so I would appreciate it if someone with a Pentium Pro and
> > > PII (or is that the same thing as a Pentium Pro?) could run them and post
> > > the results.
> >
> > From Agner Fog (http://www.agner.org/assem/) pentopt.zip
> >
> >                    PPlain  PMMX  PPro  PII  PIII
> > IMUL latency          9      9     4    4     4
> > IMUL throughput      1/9    1/9   1/1  1/1   1/1
> >
> > That means, imul is pipelined on i686 ...
> >
> > > So I guess the lesson here is that on PIII integer multiplication is fast
> > > enough that doing special tricks to avoid integer multiplication will hurt
> > > performance instead of helping it.
> >
> Even on the P4, code which permits full pipelining will run well with
> imul, while the add and shift sequences are preferable in contexts where
> that is not possible. I haven't seen any compiler which is able to
> distinguish those situations.
Hm, I thought that initially you spoke about the PIII (a P6) and the P55C,
and that you found:
Method      P55C   P6
imul        slow   fast
shift&add   fast   slow
and that is because of the pipelined imul and the out-of-order P6 core
{another word for Jan Hubicka's "very complicated pipeline"}.
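To make the latency/throughput distinction concrete, here is a rough
back-of-the-envelope model in C. The function names and the model itself are
my own simplification (hypothetical, not from Agner Fog's document): it looks
only at the two numbers from the table above and ignores pairing, decoding
and everything else a real core does.

```c
#include <assert.h>

/* Rough cycle estimate for n multiplies where each one needs the
 * previous result (a dependency chain): nothing can overlap, so
 * every imul pays its full latency. Simplified model only. */
unsigned chain_cycles(unsigned n, unsigned latency)
{
    return n * latency;
}

/* Rough cycle estimate for n independent multiplies: a new imul can
 * start every `interval` cycles (the reciprocal of the throughput),
 * and the last one still needs `latency` cycles to complete. */
unsigned independent_cycles(unsigned n, unsigned latency, unsigned interval)
{
    assert(n > 0 && interval > 0);
    return (n - 1) * interval + latency;
}
```

With the table's numbers, eight independent multiplies come out at 72 cycles
on the P55C (latency 9, one imul every 9 clocks) but at 11 cycles on the P6
(latency 4, one imul per clock). A dependent chain of eight costs 72 versus
32 cycles, so the P6 gains most when the multiplies can overlap.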
gcc's -march=i586 code was not shown, but since imul needs 9 clocks there
(and is unpairable), quite a few add/lea and shift instructions can be spent
on replacing it before the replacement becomes the slower choice.
Thus:
On i586: gcc's code generator is not bad
On i686: gcc's code generator could/should be smarter.
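
For readers who have not looked at such sequences: the sketch below shows, in
C, the kind of replacement being discussed, using a multiply by 10 as an
example. The constant and the function names are my own choice for
illustration, not taken from the tests in this thread, and what gcc actually
emits for each variant depends on -march and the optimisation level (check
the .s output).

```c
#include <stdint.h>

/* x * 10 written as a plain multiply; the compiler may emit imul. */
uint32_t mul10_imul(uint32_t x)
{
    return x * 10u;
}

/* The same product as shifts and an add: 10*x = 8*x + 2*x. */
uint32_t mul10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* lea-style variant: t = x + 4*x (a single lea on x86), then t + t. */
uint32_t mul10_lea_style(uint32_t x)
{
    uint32_t t = x + (x << 2);
    return t + t;
}
```

All three return the same value for every input, because unsigned arithmetic
wraps modulo 2^32. The only question is which form is cheaper on a given
core, and that is exactly where the i586 and i686 differ.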
But, for a bit of variety: did you cross-check with gcc-2.95.x?
It may be that this is just another example of
the new, de-improved x86 backend of the gcc-3.0 line (?).
As for the other topic (whether the Pentium Pro and PII perform differently):
I said (or guessed, especially for this example) that the particular model of
the P6 line (note: the P4 is not a P6 variant in this context) will not
matter, because they are all essentially equivalent (this answers Honza's
question too)
{essentially: in 32-bit mode (PPro ...), neglecting architectural extensions
and cache issues, which are not covered by -march=i686 anyway}.
Now you add the P4, with the result:
Method      P55C   P6     P4
imul        slow   fast   well
shift&add   fast   slow   preferable for some cases
Here is the rationale:
On page 2-55 of the Intel P4 Optimization manual we read that imul incurs
some extra latency [p. C-13: latency 14, throughput 3] because it is executed
on the FPU, and
Assembly/Compiler Coding Rule 44: replace imuls by a small constant with two
or more add and lea instructions, especially when the imul is part of a
dependency chain.
And on p. 2-54: shifts have a longer latency than on previous processors
[p. C-13: latency 4, throughput 1].
And specifically, Assembly/Compiler Coding Rule 42: if a shift is on the
critical path, replace it by a sequence of up to three adds (sic! Not more).
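
A minimal sketch of what rule 42 amounts to, written in C for illustration
only. The function names are hypothetical, and a compiler is free to turn the
additions back into shifts, so the real decision is made in the code
generator, not in the source.

```c
#include <stdint.h>

/* x << 1 as one add: adding x to itself doubles it. */
uint32_t shl1_adds(uint32_t x)
{
    return x + x;
}

/* x << 2 as two dependent adds. */
uint32_t shl2_adds(uint32_t x)
{
    uint32_t t = x + x;
    return t + t;
}

/* x << 3 as three dependent adds, the upper limit named in the rule. */
uint32_t shl3_adds(uint32_t x)
{
    uint32_t t = x + x;
    t = t + t;
    return t + t;
}
```

Each add depends on the one before it, so the replacement only pays off while
the chain of adds is shorter than the shift latency, which is presumably why
the rule stops at three.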
Thus, if gcc uses too many equivalent replacement instructions for imul (and
you could see this in the .s files), then even on the P4, with its (not fully
pipelined) imul, your hand-coded version runs faster ...
which leads us again to the question: in what respect is the new gcc-3.0.x
x86 backend improved?
- Supports: -march=athlon
- ?
Peter