This is the mail archive of the
mailing list for the GCC project.
Re: [BENCHMARK]-mfpmath=sse should disable x387 intrinsics
- From: Roger Sayle <roger at eyesopen dot com>
- To: Richard Guenther <richard dot guenther at gmail dot com>
- Cc: Uros Bizjak <uros at kss-loka dot si>, <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 26 Nov 2004 07:38:43 -0700 (MST)
- Subject: Re: [BENCHMARK]-mfpmath=sse should disable x387 intrinsics
On Fri, 26 Nov 2004, Richard Guenther wrote:
> This tests were with todays CVS, using -mfpmath=sse -mfancy-math-387
> does not show any difference to my surprise, using g++ from Nov21 with
> -mfpmath=sse, the difference is in the noise, too. So I guess your patch
> is ok - sorry for not testing enough before complaining.
Hey, no problem. You've actually helped more than you can imagine.
You've prompted enough discussion that we now have a much better
understanding of fpmath, especially when it's a win and when it's not.
Reading through the assembly dumps of tramp3d.cpp and tramp3d-v3.cpp
with different compiler options for GCC and Intel compilers has
helped explain a lot.
The first myth that is busted, and this is even supported by Martin
Reinecke's posting, is that -mfpmath=sse can be a significant win, but
primarily on register-hungry code doing lots of FP math with relatively
few function calls or uses of math functions. When there's no
requirement to move things to/from FP registers, the corresponding SSE
math is faster than the 80-bit x87 equivalents.
As soon as an ABI or inline intrinsics require frequent shuffling between
registers, as in almabench or whetstone, the scales tip and x87 again
The second myth that is busted is that tramp3d spends a significant
amount of time in x87 math functions. This may seem bizarre given the
dramatic slowdowns with -fno-builtin-sqrt and -fno-builtin-pow, but
looking at the generated code reveals a different story.
Firstly, there are admittedly a huge number of calls to sqrt, but
-mfpmath=sse has always used the SSE's inline sqrt intrinsic, so
disabling the x87 "fsqrt" instruction has no effect on this code.
Hence, you can see why -fno-builtin-sqrt (which disables the use
of both the SSE and x87 intrinsics) cripples performance, but why
my patch had no effect.
Secondly, there are in the original tramp3d a large number of calls
to the "pow" function. The interesting aspect here is that all but
one of them use either 2 or 3 as the exponent. Here the middle-end
is optimizing "pow(x,2)" to "x*x", and "pow(x,3)" to "x*x*x", where
these floating point multiplications use either the SSE mult or the
x87 mult as appropriate. In fact, GCC doesn't even have an inline
intrinsic for pow! Once again you can see that -fno-builtin-pow
would have a catastrophic effect on performance, but there would be
no effect with my patch to disable x87 intrinsics.
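
To make the rewrite described above concrete, here is a minimal sketch
in C of what the code amounts to once pow with a small constant exponent
has been expanded into multiplications. The helper names square_by_mul
and cube_by_mul are hypothetical and exist only for illustration; they
are not GCC internals. Whether a given GCC version performs each rewrite
(particularly for an exponent of 3) may depend on options such as
-ffast-math, which is an assumption here rather than something stated
above.

```c
/* Hand-written equivalents of the expansions described above.
   Both helper names are hypothetical, chosen for this sketch only. */

/* What a call like pow(x, 2.0) reduces to: a single multiplication. */
double square_by_mul(double x)
{
    return x * x;
}

/* What a call like pow(x, 3.0) reduces to: two multiplications.
   Note that x*x*x rounds twice, so it is not guaranteed to be
   bit-identical to a correctly rounded pow(x, 3.0) for every input. */
double cube_by_mul(double x)
{
    return x * x * x;
}
```

Neither function calls into the math library or needs an x87 intrinsic;
the multiplications are emitted as SSE or x87 multiplies depending on
-mfpmath, which is consistent with the patch having no effect on this
code.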
Admittedly, there are differences: a call to "exp" no longer gets
inlined, and some rounding functions are now implemented differently,
but I suspect these changes aren't on any hot paths and may improve
performance as much as they hurt it.
But now the best bit, for which I'll thank you in advance. In looking
at so much floating point code, it's become apparent that GCC's
reg-stack.c pass can do a much better job at shuffling floating point
registers. I was up late last night working on an improvement/rewrite
of change_stack that should reduce the number of fxch instructions we
generate, and replace more uses of "fstp %st(x)" with "ffreep %st(0)"
(which is faster on AMD processors). I know there are PRs in this area,
so these changes might even make it into GCC v4.0.
Anyway, many thanks again. It's been far more constructive than if
you'd just run your benchmark in a few weeks time and not noticed