This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: What is acceptable for -ffast-math? (Was: associative law in combine)
- To: <dewar at gnat dot com>
- Subject: Re: What is acceptable for -ffast-math? (Was: associative law in combine)
- From: Linus Torvalds <torvalds at transmeta dot com>
- Date: Tue, 31 Jul 2001 19:01:50 -0700 (PDT)
- cc: <fjh at cs dot mu dot oz dot au>, <gcc at gcc dot gnu dot org>, <gdr at codesourcery dot com>, <moshier at moshier dot ne dot mediaone dot net>, <tprince at computer dot org>
On Tue, 31 Jul 2001 dewar@gnat.com wrote:
>
> I still see *no* quantitative data showing that the transformation most
> discussed here (associative redistribution) is effective in improving
> performance.
Actually, the transformation that was discussed here was really
a/b/c -> a/(b*c)
and I don't know where the associative redistribution thing really came
from. Probably just an example of another thing like it..
> Linus seems to be getting a little emotional in this discussion but swearing
> does not replace data.
Hey, I called people silly, not <censored>. You must have a very low
tolerance ;)
And the replacement-with-reciprocal kind of thing can certainly make a
difference. I'm too lazy to dig out my old quake3 source CD to check
what I ended up doing there, though.
> This seems such a reasonable position that I really have a heck of a time
> understanding why Linus is so incensed by it.
I'm incensed by people who think that the current -ffast-math already does
something that is positively EVIL - even though the current setup with
inline replacements of "fsin" etc has been there for years.
> Let's take an example, which is multiplication by a reciprocal. This of
> course does give "wrong" results. But division is often quite slow. On the
> other hand, I can't imagine floating-point programmers, including those
> doing game stuff, not knowing this, and thus writing the multiplication
> in the first place.
Actually, in the case of quake3, I distinctly remember how the x86 version
actually used division on purpose, because the x86 code was hand-tuned
assembler and they had counted cycles (for a Pentium machine) where the
division overlapped the other arithmetic.
The x86 hand-tuned stuff also took care to fit inside the FP stack etc -
they had started off with a C version and compiled that, and then once
they got as far as they could they just extracted the asm and did the rest
by hand.
Doing a division was also the much more obvious way to write the
algorithm.
On alpha (and hey, this was before the 21264), the same did not hold true,
and it was faster to do a reciprocal. But it's been too long.
> <<Oh, round-to-zero is definitely acceptable in the world of "who cares
> about IEEE, we want fast math, and we'll use fixed arithmetic if the FP
> code is too slow".
> >>
>
> That's probably a bad idea, on most modern machines, full IEEE floating-point
> is faster than integer arithmetic (particularly in the case of multiplication)
Ehh.. You seem to think that "hardware == fast".
Denormal handling on x86 is done in microcode as an internal trap in the
pipeline. Round-to-zero is noticeably faster - even when the _code_ is the
same. Which is why Intel has special mode flags - where it internally
rounds to zero instead of making a denormal.
I'm not kidding. Look at the "Denormals are zero flag", bit 6 in the
MXCSR.
And I quote (all typos are probably mine):
"The denormals-are-zero mode is not compatible with IEEE Standard 754.
The denormals-are-zeros mode is provided to improve processor
performance for applications such as streaming media processing, where
rounding a denormal to zero does not appreciably affect the quality of
the processed data"
What I'm saying is that a huge company like Intel chose to do this
optimization IN HARDWARE because they thought that (a) it was meaningful
and (b) the problem warranted it.
Linus