This is the mail archive of the
mailing list for the GCC project.
Re: [RFC PATCH] implement fma() as builtin x87 or SSE intrinsic
- From: Roger Sayle <roger at eyesopen dot com>
- To: Uros Bizjak <uros at kss-loka dot si>
- Cc: "Joseph S. Myers" <jsm at polyomino dot org dot uk>, Falk Hueffner <hueffner at informatik dot uni-tuebingen dot de>, <gcc-patches at gcc dot gnu dot org>
- Date: Thu, 13 May 2004 05:19:23 -0600 (MDT)
- Subject: Re: [RFC PATCH] implement fma() as builtin x87 or SSE intrinsic
On Thu, 13 May 2004, Uros Bizjak wrote:
> Following this analysis, I still suggest for fma() and hypot() to be
> implemented as builtin i387 function.
I agree with your arguments, however, I also appreciate Joseph's side.
Unlike many of the other x87 specific builtins, there are several other
architectures including powerpc and s390 that have fused multiply add
instructions and some consideration has to be given to their requirements.
Indeed, its because IA-32 doesn't actually have a fused-multiply-add
and therefore implements this functionality using regular XFmode
multiplications and additions that causes some of the confusion.
However providing the backends with an fma expander is a good thing.
At the moment, GCC's RTL optimizers can replace/implement additions and
multiplications with fma instructions. However it doesn't currently
have a portable way to specify their use. One important aspect of
"__builtin_fma" is that when used by the programmer it shouldn't be
decomposed into its primitve operation (introducing additional rounding),
hence most backends will need to implement it using an UNSPEC rtl that
will be untouched by GCC's optimizers. I believe that the rs6000 and
s390 currently describe these instructions using MULT and PLUS rtl,
which can be rearranged by combine and RTL simplification.
There are changes that do need to be made to your patch. The first is
that the expanders should use XFmode multiplications and additions and
store the intermediate in an XFmode pseudo. The same extend_noop and
trunc_noop that we use elsewhere. This prevents the register allocator
spilling the intermediate value into the stack as a SFmode float, loosing
much of the fma-emulation. The math-inline/glibc implementations achieve
this as their code sequences are effectively atomic.
Secondly, using XFmode prevents the use of the SSE implementation of
hypot that you describe (except perhaps for -ffast-math). If you look
at a typical IEEE implementation on hypot, such as glibc's
libc/sysdeps/ieee754/dbl-64/e_hypot.c, you'll notice the implementation
is far more advanced than sqrt(x*x + y*y), handling cases where x*x,
y*y or x*x+y*y would overflow or underflow a double, but where the
result still falls in the desired range and must be correctly rounded.
On x87, and only using the x87, the additional precision provided by
the 80-bit XFmode intermediates means that its possible to implement
this directly as sqrt(x*x + y*y). However, if !TARGET_80387, this
expansion can't be used and may be significantly less accurate than
the system call. I suspect that mathinline.h's hypot macro was added
or written before the development of SSE, and was never updated to
reflect the availability on lower precision floating point.
Having said that, on x87 atleast, fma and hypot are safely expanded
even without -ffast-math, for singles and doubles as they return the
expected perfectly rounded results. [Which is why mathinline.h doesn't
have a FAST_MATH guard]. And certainly on other platforms the "fmadf4"
expander could be used without -ffast-math.
I'll leave it to this list to decide whether expanding "sqrt(x*x + y*y)"
for hypot on IEEE targets is a reasonable thing in the presence of
flag_unsafe_math_optimizations. Likewise whether "fmaf" could be
implemented generically using double (or even long double) operations?
I hope this explains/resolves some of the controversy. Could you
repost your fma patch with XFmode intermediates, and without the test
for flag_unsafe_math_optimizations on the fmasf4 and fmadf4 expanders,
i.e. just "TARGET_80387" conditions, and remove the same test from
the expand_builtin so that the expanders are considered even without