[RFC PATCH] implement fma() as builtin x87 or SSE intrinsic

Thu May 13 21:40:00 GMT 2004

On May 13, 2004, at 13:55, Richard Guenther wrote:
> Geert Bosch writes:
> > In order to comply with the C99 standard, GCC should implement
> > a __builtin_fma(x,y,z) that always has an error of at most 0.5 ulp
> > compared to the exact result of x*y + z.
>
> Ugh.  This opens a whole can of worms and we really need to discuss if 
> we want to go this route.  Think of pow, sqrt and other C99 standard 
> math functions.  I doubt libm or math inlines provide 0.5 upl accuracy 
> on those, does it?

Although these may all seem like "math functions" to you, they are
not comparable. The sqrt function is an operation that is specified to
round exactly for IEEE floating-point arithmetic. So, yes, these will
always be correctly rounded, just like * + / and -.

The power function is a completely different beast and doesn't belong in
this discussion. For one reason, this function cannot be exactly rounded
in all cases unless arbitrary precision arithmetic is used.

The "fma" function is completely different, and belongs in the same 
class
as * + / and -. Many recent algorithms to implement functions such as 
"pow"
are based on the presence of a fused multiply-add in order to compute
a result in the arithmetic unit with more precision than the final unit
to get correctly rounded results in most cases.

One can say that these algorithms should not be used on machines where
they'll give meaningless results because we implement fma() incorrectly,
but that seems a misguided direction to go. The C99 standard explicitly
tries to avoid this mess by unambiguously stating required semantics of 
fma()!

If one wants to compute x*y+z without specific rounding requirements,
taking advantage of fma where available just write x*y+z and run the
compiler in a mode that allows contraction of operations, which is
currently the default.

On the other hand, if you're implementing algorithms that depend on
fused multipy-add, you'd use fma(x,y,z) and get the required behavior,
either through a direct instruction if available, or through explicit
code.

Note that I don't advocate replacing x*y+z in the user's code by
the expanded code I gave. This would be ridiculous, one would only
replace x*y+z if two conditions hold:
    - it is faster than performing a multiplication and addition
    - contraction of floating-point operations is allows (dependent on
      compiler options and language semantics)

   -Geert