[BENCHMARK]-mfpmath=sse should disable x387 intrinsics

Thu Nov 25 18:50:00 GMT 2004

On Thu, 25 Nov 2004 08:57:08 -0700 (MST), Roger Sayle
<roger@eyesopen.com> wrote:
> 
> On Thu, 25 Nov 2004, Richard Guenther wrote:
> > On Thu, 25 Nov 2004 10:18:27 +0100, Uros Bizjak <uros@kss-loka.si> wrote:
> > > -mfpmath=sse is the worst choice in case of pentium4. The result is
> > > lower by 18%, comparing to the default. That is, -mfpmath=sse,387 is
> > > faster by 28%, comparing to -mfpmath=sse on pentium4.
> >
> > For me, specifying -mfpmath=sse,387 is 4% slower than -mfpmath=sse.
> > I would prefer the -mfpmath=sse behavior _not_ to be changed for ia32.
> 
> Could you present the performance results for your testcase with
> "-mfpmath=387", "-mfpmath=sse" and "-mfpmath=sse,387"?  It's relatively
> rare for "-mfpmath=sse" to be a win on a Pentium4 benchmark, and to quote
> Robert Scott Ladd from his Coyote Gulch benchmarking:
> 
> From http://www.coyotegulch.com/products/acovea/acovea_4.html
> >> Much to my surprise, I have yet to find any consistent evidence that
> >> options like -mfpmath=sse improve program performance. Thus Acovea
> >> bears out my personal experience, though it does not explain why so
> >> many people continue to suggest that I should use -mfpmath=sse to
> >> generate floating-point code. If someone could suggest a good
> >> "-mfpmath=sse", I'd appreciate seeing it.
> 
> If your result is reproducible, there may be a latent bug in GCC that
> is unable to handle the competition for resources between the SSE unit
> and the FP unit.  Probably not a surprise as Pentium4 doesn't even use
> the DFA's scheduler.  If you can reduce a small test case, I'll try
> and fix it and thereby resolve your issue.

I guess unrolling loops increases register pressure and as such makes
use of the extra FP registers.  The testcase is again 50 iterations of
my famous tramp3d-v3.cpp.
To address the DFA issue, I present the numbers for -march=athlon64
(and of course run on an Athlon64) - note this is without your patch and
-D__NO_MATH_INLINES is not only due to a very old libc from Debian woody:

-mfpmath=sse -D__NO_MATH_INLINES: 55.3s
-mfpmath=sse,387 -D__NO_MATH_INLINES: 57.6s
-mfpmath=387 -D__NO_MATH_INLINES: 59.1s
-mfpmath=sse -fno-builtin-pow -fno-builtin-sqrt -D__NO_MATH_INLINES: 1m32s
-mfpmath=sse -fno-builtin-pow -fno-builtin-sqrt: 1m34.7s

Other switches used are -ffast-math -funroll-loops -march=athlon64
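
For reference, the relative differences between these runs can be worked
out with a small script.  This is only a sketch: the helper names
(to_seconds, slowdown_percent) are made up for illustration, and the
timings are simply copied from the list above.

```python
import re

# Wall-clock times copied from the list above (50 iterations of tramp3d-v3).
TIMINGS = {
    "-mfpmath=sse -D__NO_MATH_INLINES": "55.3s",
    "-mfpmath=sse,387 -D__NO_MATH_INLINES": "57.6s",
    "-mfpmath=387 -D__NO_MATH_INLINES": "59.1s",
    "-mfpmath=sse -fno-builtin-pow -fno-builtin-sqrt -D__NO_MATH_INLINES": "1m32s",
    "-mfpmath=sse -fno-builtin-pow -fno-builtin-sqrt": "1m34.7s",
}


def to_seconds(text):
    """Convert a time such as '55.3s' or '1m34.7s' to seconds (hypothetical helper)."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


def slowdown_percent(baseline, other):
    """How much slower 'other' is than 'baseline', in percent of the baseline."""
    return 100.0 * (other - baseline) / baseline


# Use the fastest configuration, plain -mfpmath=sse, as the baseline.
baseline = to_seconds(TIMINGS["-mfpmath=sse -D__NO_MATH_INLINES"])
for flags, text in TIMINGS.items():
    seconds = to_seconds(text)
    print(f"{seconds:6.1f}s  {slowdown_percent(baseline, seconds):+6.1f}%  {flags}")
```

With these numbers, -mfpmath=sse,387 comes out about 4% slower than
-mfpmath=sse, and the two -fno-builtin runs are roughly 66% and 71% slower.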

The last two should be equivalent to the numbers with your patch applied (pow
and sqrt are the only math fns used in my testcase), but maybe I'm confused
about the exact meaning of -fno-builtin-pow -fno-builtin-sqrt.  I'll build an
updated mainline soon.

> Additionally, Uros asked if you used "-D__NO_MATH_INLINES" to which
> you replied "Yes, I did".  To which I'd recommend that you now stop
> using it if you now want x87 intrinsics but insist on turning them
> off with "-mfpmath=sse".

What I am most unhappy with is the changed semantics of -mfpmath=sse between
3.4 and 4.0 - wouldn't a -fno-builtin-XXX work, too?  I guess your new
semantics make sense for the amd64 ABI (for which it should be the default anyway?),
but they are confusing for ia32.

> Finally, you may find that if you want to use "-mfpmath=sse" effectively,
> it may help to build a libm multi-lib (either sse-specific or soft-float)
> that maximizes performance.  Again, most Linux distributions don't bother
> with such a specialization as -mfpmath=sse is so rarely a win.

Sorry, but you usually are not root at a supercomputing facility.

> I don't think it's unreasonable for you to ask for this patch to be
> reverted.  An even better compromise is to only use this logic on
> TARGET_64BIT where it's a clear advantage by default.  There's also

Yes, I would support that - for TARGET_64BIT we should only generate
387 intrinsics, if asked to, as we default to SSE math anyways.

> the complication that the *BSD support in the i386.c backend makes
> it difficult to enable and disable 387 intrinsics independently.
> Apparently, their kernel x87 emulator doesn't handle "fancy math",
> so i386.c plays games with "-mfancy-math-387", such that
> "-mno-fancy-math-387" no longer works on Pentium4, and the only way
> to disable x87 intrinsics on the command-line is to use the corrected
> "-mfpmath=sse".
> 
> However, my guess is that you're in a small minority where taking
> advantage of a bug in the intention/implementation of the "-mfpmath=sse"
> flag results in marginally better code.  Hopefully, once a few more
> opinions have been voiced we'll reach a consensus.  At the moment
> Uros is clearly for the patch, and you're clearly against it.

I'm against making it impossible to activate previous behavior which your
patch effectively disables.

> 
> In my defense not only did I e-mail a request for further benchmarking
> when I posted the patch, including e-mailing Robert Scott Ladd directly,
> but I also waited 48 hours after its approval before committing it, to
> ensure that all the responses were taken under consideration.  In your
> defense, you did ask about support for SSE intrinsics (and in a later
> e-mail for sqrt and pow specifically).  The good news is that SSE sqrt
> is already supported as an SSE inline intrinsic.  GCC is a volunteer
> project and if someone contributes a suitable SSE pow (or other)
> intrinsic pattern, I'm sure the x86 backend maintainers would be happy
> to accept it.
> 
> I hope the above comments are not unreasonable?

Certainly not - and neither should mine be - especially given that I can no
longer have the old fastest-for-me behavior.

Richard.

> 
> Roger