This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Performance of Integer Multiplication on PIII (Results for gcc-2.95 & Athlon)
- To: Jan Hubicka <jh at suse dot cz>
- Subject: Re: Performance of Integer Multiplication on PIII (Results for gcc-2.95 & Athlon)
- From: pete at ltoi dot iap dot physik dot tu-darmstadt dot de
- Date: Tue, 6 Nov 2001 18:09:14 +0200 (MEST)
- Cc: gcc at gcc dot gnu dot org
> > > Note that with -march=i686 new gcc often performs worse on Athlon, but it is
> > > mainly because it does more Athlon-specific stuff.
> >
> > well, actually i compare gcc-3.0.x with gcc-2.95.x (doesn't matter much
> > which, since both schedulers are switched off for i686) for my floating-point
> > heavy stuff. And gcc-3.0.x performs worse compared with gcc-2.95.x (with
> > -march=i686) regardless of -march=athlon or -march=i686. Other "oddities"
> > may render any improvements from this flag nearly invisible ...
> In case you have examples easy enough to analyze by hand, I can take a look
> at them.
Well, initially i thought of a full Radix-8 example ... but that's
too large. Is a (full) Radix-2 FFT {C or Fortran} easy enough?
{together with testbed}
But please note: at the moment i don't know whether this case exhibits
patterns similar to what i could extract from C. Whaley's analysis.
I only know {or think that i know ;-)} that the gcc-3.0.x version is
slower than the gcc-2.95 version. ... But since this FFT stuff by far does not
"ride the bull" the way Clint's ATLAS stuff does, the differences are also not
_that_ drastic *), but i could send you a list of how ~50 different FFT
implementations {mostly stolen from the net, some written by me} perform
with gcc-2.95 & gcc-3.0.x
*) but you can see the reason for the slowdown in the asm files of the FFT
kernels, concerning length & "how they look".
> Speaking about the Atlas problems, I made some patches that should track the
> problem, but I am not quite sure they got reviewed.
>
> The problem gcc is running into is a quite nasty interaction between the
> Athlon's on-chip scheduler and gcc's scheduler: both do a good job given the
> information they have, but because that information is incomplete, it causes
> problems.
>
> On FP-specific code, there appear to be issues with the loop optimizer and
> strength reduction (so nothing related to the i386 backend itself). Basically
> gcc is now a bit stronger on strength reduction, as bugs that disabled its loop
> optimizer have been fixed. Fortran uses the -freduce-all-givs switch, which
> often makes the strength reduction pass create too many temporaries (and in
> other testcases reduce the temporaries), which in turn results in unstable
> performance.
>
> For Fortran I recommend playing around with -fno-reduce-all-givs, which helps
> in the testcases I have.
>
> There are also other problems I am tracking down, but the issue is that gcc's
> loop optimizer is too outdated and interferes badly with gcse and other passes.
> I believe the proper solution is to rewrite it; I made some steps in this
> direction, but it is more probably a 3.2.x issue.
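To make the point about temporaries concrete, here is a minimal hand-written
sketch in C. It is an illustration only, not actual gcc output: the function
names and the loop body are made up for this example. The first function
indexes with linear functions of i (general induction variables, "givs"); the
second shows roughly what strength-reducing every giv amounts to, namely one
extra running temporary per giv, each updated by an add instead of a multiply.

```c
#include <stddef.h>

/* Straightforward form: 2*i, 2*i + 1 and 3*i are general induction
   variables (givs), each a linear function of the loop counter i. */
double dot_strided(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[2 * i] * y[3 * i] + x[2 * i + 1];
    return s;
}

/* Hand-reduced form (hypothetical, for illustration): every giv gets
   its own temporary that is advanced by a constant step, so the index
   multiplies disappear but three more values stay live in the loop. */
double dot_strided_reduced(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    size_t k2 = 0;    /* tracks 2*i     */
    size_t k2p1 = 1;  /* tracks 2*i + 1 */
    size_t k3 = 0;    /* tracks 3*i     */
    for (size_t i = 0; i < n; i++) {
        s += x[k2] * y[k3] + x[k2p1];
        k2 += 2;
        k2p1 += 2;
        k3 += 3;
    }
    return s;
}
```

Both functions compute the same value; they differ only in how the indices
are produced. On a register-starved target like i386, the extra live
temporaries in the second form can plausibly cost more than the multiplies
they replace, which is one way to read the unstable performance described
above.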
Thank you for your explanation, which improves the understanding of
someone who can only judge from the outside {of the black box gcc}!
Cheers,
Peter