This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [BENCHMARK]-mfpmath=sse should disable x387 intrinsics


On Thu, 25 Nov 2004 19:12:55 +0100, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Thu, 25 Nov 2004 08:57:08 -0700 (MST), Roger Sayle
> <roger@eyesopen.com> wrote:
> >
> > On Thu, 25 Nov 2004, Richard Guenther wrote:
> > > On Thu, 25 Nov 2004 10:18:27 +0100, Uros Bizjak <uros@kss-loka.si> wrote:
> > > > -mfpmath=sse is the worst choice in case of pentium4. The result is
> > > > lower by 18%, comparing to the default. That is, -mfpmath=sse,387 is
> > > > faster by 28%, comparing to -mfpmath=sse on pentium4.
> > >
> > > For me, specifying -mfpmath=sse,387 is 4% slower than -mfpmath=sse.
> > > I would prefer the -mfpmath=sse behavior _not_ to be changed for ia32.
> >
> > Could you present the performance results for your testcase with
> > "-mfpmath=387", "-mfpmath=sse" and "-mfpmath=sse,387"?  It's relatively
> > rare for "-mfpmath=sse" to be a win on a Pentium4 benchmark, and to quote
> > Robert Scott Ladd from his Coyote Gulch benchmarking:
> >
> > From http://www.coyotegulch.com/products/acovea/acovea_4.html
> > >> Much to my surprise, I have yet to find any consistent evidence that
> > >> options like -mfpmath=sse improve program performance. Thus Acovea
> > >> bears out my personal experience, though it does not explain why so
> > >> many people continue to suggest that I should use -mfpmath=sse to
> > >> generate floating-point code. If someone could suggest a good
> > >> "-mfpmath=sse", I'd appreciate seeing it.
> >
> > If your result is reproducible, there may be a latent bug in GCC that
> > is unable to handle the competition for resources between the SSE unit
> > and the FP unit.  Probably not a surprise as Pentium4 doesn't even use
> > the DFA's scheduler.  If you can reduce a small test case, I'll try
> > and fix it and thereby resolve your issue.
> 
> I guess unrolling loops increases register pressure and as such makes
> use of the extra FP registers.  The testcase is again 50 iterations of
> my famous tramp3d-v3.cpp.
> To address the DFA issue, I present the numbers for -march=athlon64
> (and of course run on an Athlon64) - note this is without your patch and
> -D__NO_MATH_INLINES is not only due to a very old libc from Debian woody:
> 
> -mfpmath=sse -D__NO_MATH_INLINES: 55.3s
> -mfpmath=sse,387 -D__NO_MATH_INLINES: 57.6s
> -mfpmath=387 -D__NO_MATH_INLINES: 59.1s
> -mfpmath=sse -fno-builtin-pow -fno-builtin-sqrt -D__NO_MATH_INLINES: 1m32s
> -mfpmath=sse -fno-builtin-pow -fno-builtin-sqrt: 1m34.7s

Now I actually read the patch and looked at the patch context.  Simulating the
effect of your patch results in

-mfpmath=sse -mno-fancy-math-387 -D__NO_MATH_INLINES: 55.8s
-mfpmath=sse -mno-fancy-math-387: 55.8s
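
To put the numbers side by side, here is a rough sketch that converts the
timings quoted in this message to seconds and reports each one relative to
plain -mfpmath=sse. The helper names (parse_seconds, slowdown_percent) and
the choice of baseline are only assumptions made for illustration; the
timings themselves are copied verbatim from above.

```python
def parse_seconds(text):
    """Convert a timing such as '55.3s' or '1m34.7s' to seconds."""
    text = text.strip()
    minutes = 0.0
    if "m" in text:
        minutes_part, text = text.split("m", 1)
        minutes = float(minutes_part)
    return minutes * 60.0 + float(text.rstrip("s"))


# Timings exactly as reported in this thread (50 iterations of tramp3d-v3).
timings = {
    "-mfpmath=sse -D__NO_MATH_INLINES": "55.3s",
    "-mfpmath=sse,387 -D__NO_MATH_INLINES": "57.6s",
    "-mfpmath=387 -D__NO_MATH_INLINES": "59.1s",
    "-mfpmath=sse -fno-builtin-pow -fno-builtin-sqrt -D__NO_MATH_INLINES": "1m32s",
    "-mfpmath=sse -fno-builtin-pow -fno-builtin-sqrt": "1m34.7s",
    "-mfpmath=sse -mno-fancy-math-387 -D__NO_MATH_INLINES": "55.8s",
    "-mfpmath=sse -mno-fancy-math-387": "55.8s",
}

# Assumption: use the plain -mfpmath=sse run as the reference point.
BASELINE = "-mfpmath=sse -D__NO_MATH_INLINES"


def slowdown_percent(flags):
    """Percentage by which `flags` is slower than the baseline run."""
    base = parse_seconds(timings[BASELINE])
    return (parse_seconds(timings[flags]) / base - 1.0) * 100.0


if __name__ == "__main__":
    for flags in timings:
        print(f"{slowdown_percent(flags):+6.1f}%  {flags}")
```

On these numbers -mfpmath=sse,387 comes out roughly 4% slower than
-mfpmath=sse, and the simulated patch (-mno-fancy-math-387) costs a bit
under 1%.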

Oh - and your patch does not update the documentation of -mfpmath (and
possibly of -mno-fancy-math-387).

So can I still get the old behavior with -mfpmath=sse -mfancy-math-387?  I guess
not - at least I could not find where that argument is processed, so I could not
check whether your override happens before or after it.  Can you check that?

Thanks,
Richard.

