This is the mail archive of the
mailing list for the GCC project.
Re: Floating point trouble with x86's extended precision
- From: Volker Reichelt <reichelt at igpm dot rwth-aachen dot de>
- To: wilson at tuliptree dot org
- Cc: lucier at math dot purdue dot edu, gcc at gcc dot gnu dot org
- Date: Thu, 21 Aug 2003 15:50:33 +0200 (CEST)
- Subject: Re: Floating point trouble with x86's extended precision
- Reply-to: Volker Reichelt <reichelt at igpm dot rwth-aachen dot de>
First off, thanks for the explanation!
On 20 Aug, Jim Wilson wrote:
> Volker Reichelt wrote:
>> In the process of revamping the non-bugs section of the bug reporting
>> instructions I came across a problem with the excess precision of the
>> x86 FPU:
> This is a complicated issue.
> There is a bug in the x86 port that causes it to emit buggy FP code.
> This is partly a flaw in the x86 hardware; it lacks SFmode/DFmode
> operations on the floating point register stack. This is partly a flaw
> in the x86 backend. It lies, and claims that SFmode/DFmode operations
> are available. Thus the gcc optimizer thinks it is emitting
> SFmode/DFmode instructions when it is actually emitting XFmode
> instructions, and this causes unexpected rounding problems. The easiest
> way to see this is to write an expression that needs more than 8
> register to evaluate. Reload will spill registers in the middle of the
> expression, and they will be truncated to 64-bits when spilled because
> the optimizer thinks we have 64-bit values. This results in rounding
> errors.
Just to make sure I get this right: The register is spilled with
the last bits truncated (cut off) instead of rounded, right?
And one question out of curiosity. What happens to values in the FPU
that finally get written into memory *not* because the floating point
stack is filled, but because of other reasons (like the variable y in my
example). Do they also suffer from not being rounded but truncated?
I'd think so, right?
But I think the difference between rounding and truncating is not the
problem for the users.
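
The rounding-versus-truncation question can at least be checked at the C
level. The sketch below is not taken from GCC. The function names are made
up, and it assumes that long double is the x87 80-bit extended format (as
with GCC on x86 GNU/Linux). It only tests what an explicit conversion to
double does; that a register spill behaves the same way is an assumption.

```c
#include <float.h>

/* Narrow a wide value to double through memory. The volatile store keeps
   the compiler from folding the conversion away or from keeping the
   value in a register. */
double narrow_to_double(long double wide)
{
    volatile double stored = (double)wide;
    return stored;
}

/* Build 1 + 2^-53 + 2^-60, which lies just above the midpoint between
   1.0 and the next double, 1.0 + DBL_EPSILON.
   Assumption: long double has at least 64 mantissa bits, so the sum
   is exact. The volatile forces the additions to happen at run time. */
long double just_above_midpoint(void)
{
    volatile long double one = 1.0L;
    return one + 0x1p-53L + 0x1p-60L;
}

/* Returns 1 if narrowing produced 1.0 + DBL_EPSILON (rounded to nearest),
   0 if it produced anything else, such as 1.0 (extra bits cut off). */
int narrowing_rounds(void)
{
    return narrow_to_double(just_above_midpoint()) == 1.0 + DBL_EPSILON;
}
```

If narrowing_rounds() returns 1, the conversion rounded to nearest. A
result of 0 means the low bits were cut off, or that long double is no
wider than double on that platform.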
> The same expression can give different results at different
> optimization levels because different pseudos get spilled. This is
> clearly a gcc bug. This problem has been known for over a decade, and
> has not yet been fixed, and probably never will be. Getting correct
> results will require emitting a lot of explicit rounding operations
> which will reduce performance noticeably, and may cause more complaints
> than the rounding bug.
> The easiest way to fix this problem is to fix the hardware. Both Intel
> and AMD have done so, but in different ways. Intel has the IPF (aka
> IA-64) architecture which has explicit SFmode/DFmode operations and thus
> no problem. AMD has the AMD64 architecture which has an ABI that
> requires use of the SSE registers instead of the floating point register
> stack, and the SSE registers have explicit SFmode/DFmode operations and
> hence do not have this problem.
> There is also another problem here that excess precision can cause
> problems even when it doesn't result in rounding errors. This is the
> immediate case you are discussing with Brad Lucier. Ideally, we should
> have no excess precision, and the testcase should work. However, due to
> the design of the x86, eliminating the excess precision is a burden on
> the compiler, and hurts performance, hence it is easier to ask users to
> program around it. Excess precision has even been accepted by the IEEE
> FP standard in some cases. For instance, the powerpc has a multiply
> accumulate instruction that doesn't round the intermediate result. This
> means you get a different answer with separate multiply and add
> instructions than you do if you use the multiply accumulate instruction.
That's one part of the problem: Omitting rounding for intermediate
results. The other part is: What's an intermediate result and what is
not. I'd take the more radical approach of saying that everything is
intermediate - postpone rounding to 64 bit as long as possible.
Others (e.g. Brad, if I get him right), would say that an assignment to
a variable requires rounding.
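
The two positions can be put side by side in a small sketch. It does not
use a real multiply-accumulate instruction; a long double intermediate
stands in for the unrounded product, which assumes long double is wider
than double (x87 extended format, for instance). The function names are
only for illustration.

```c
#include <float.h>

/* a*b + c with the product rounded to double before the add, as with
   separate multiply and add instructions, or with "an assignment to a
   variable requires rounding". The volatile store forces the rounding
   and also keeps the compiler from fusing the two operations. */
double mul_then_add(double a, double b, double c)
{
    volatile double product = a * b;
    return product + c;
}

/* The same expression with the product kept in long double, standing in
   for an unrounded intermediate result. Only the final value is rounded
   to double. */
double mul_add_wide(double a, double b, double c)
{
    long double wide = (long double)a * (long double)b + (long double)c;
    return (double)wide;
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1 the exact product is
1 - 2^-54. That is a tie which rounds to 1.0 in double, so mul_then_add()
returns 0, while mul_add_wide() returns -2^-54 when long double has a
64-bit mantissa.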
> This was officially blessed by the IEEE FP committee as being OK,
> because the multiply accumulate result was more accurate even though it
> is different. In most numerical calculations, you have to expect some
> rounding error, and one could argue that this case is no different.
> In this case, I think we have to admit that both viewpoints are valid,
> and then agree to disagree.
> As for solutions to the problem...
> 1) If you care about numerical accuracy, don't use x86. Seriously.
I don't agree to that one. Just try to compute the sum of the entries
of a large double vector "array" like below:
  double sum = 0;
  for (size_t i=0; i<10000; ++i) sum += array[i];
You'll get much better results with the extended precision than without
(since I get 1 rounding to 64 bits instead of 10000).
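
A self-contained version of this comparison might look as follows. It is
only a sketch: the function names are invented, the long double
accumulator stands in for a sum kept in an x87 register, and the volatile
accumulator stands in for code that rounds to 64 bits after every
addition.

```c
#include <stddef.h>

/* Sum with a double accumulator. The volatile forces a rounding to
   64 bits after every addition, i.e. n roundings for n entries. */
double sum_rounded_each_step(const double *array, size_t n)
{
    volatile double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += array[i];
    return sum;
}

/* Sum with a long double accumulator and round to double only once,
   at the very end. Assumption: long double is at least as precise as
   double (wider on x87). */
double sum_rounded_once(const double *array, size_t n)
{
    long double sum = 0.0L;
    for (size_t i = 0; i < n; ++i)
        sum += array[i];
    return (double)sum;
}
```

For an array of 10000 entries that are all 0.1, the result of
sum_rounded_once() should be at least as close to 1000 as the result of
sum_rounded_each_step(). Where long double is no wider than double, both
give the same value.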
I think the definition of accuracy is the major problem here.
i) You could say that accurate results are obtained, if you do the
rounding to 64 bit after each computation.
(That has the consequence that the result does not change with
optimization, if computations aren't rearranged using commutativity
or associativity etc.)
ii) You could say that accurate results are obtained, if the final
result is close to the exact arithmetic result. This means, that you
should postpone rounding to 64 bit as long as possible.
(In my understanding this is what GCC currently does - if one
ignores the fact that it doesn't round, but truncates in several
cases.)
But you said it yourself: "In this case, I think we have to admit that both
viewpoints are valid, and then agree to disagree."
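
C99 has a macro, FLT_EVAL_METHOD in <float.h>, with which an implementation
reports how it evaluates intermediate results. The sketch below maps its
values onto definitions i) and ii). The mapping and the function names are
my own reading, and the macro only reports what the implementation claims,
not what the generated code really does.

```c
#include <float.h>

/* Classify a C99 FLT_EVAL_METHOD value for arithmetic on doubles:
     1  intermediates may carry more precision than double
        (method 2: everything is evaluated as long double; roughly
        definition ii)
     0  every double operation is rounded to double
        (methods 0 and 1; roughly definition i)
    -1  the implementation does not say (negative or other values) */
int double_keeps_excess_precision(int method)
{
    if (method == 2)
        return 1;
    if (method == 0 || method == 1)
        return 0;
    return -1;
}

/* The same classification for the compiler and flags actually in use. */
int this_build_keeps_excess_precision(void)
{
    return double_keeps_excess_precision(FLT_EVAL_METHOD);
}
```

Calling this_build_keeps_excess_precision() from a test program compiled
with and without a given option shows which of the two definitions the
compiler claims to follow for that build.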
> AMD64 and IPF (aka IA-64) are OK, but IA-32 is not. I realize this is
> impractical in most cases, but it is something that should be mentioned.
> If you must use x86 FP hardware then...
> 2) Do FP arithmetic in the SSE registers via the -mfpmath=sse option. I
> haven't tried this myself, so I don't know how practical it is.
> 3) Set the FP reg rounding precision to 64 bits. This has the flaw that
> you can no longer perform XFmode operations. This is only a partial
> fix, in that we still have excess precision problems for SFmode operations.
> 4) Try using -ffloat-store. This works for some but not all programs.
> -ffloat-store forces user declared variables to be allocated on the
> stack, and hence avoid the in register excess precision problems.
> However, temporaries are still allocated to registers, and can still
> cause rounding errors due to excess precision, so this is not a complete
> fix.
> 5) Fix the x86 backend to stop lying about availability of SFmode/DFmode
> operations, probably via an option, since this will reduce performance
> so much as to cause other problems. This would at least give people the
> option of getting slow but correct code instead of the current fast but
> incorrect code.
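
For reference, workaround 3) might be sketched as below. The sketch assumes
glibc's <fpu_control.h> on x86 and does nothing elsewhere; the helper
names and the X87_* macros are made up for illustration. Workarounds 2)
and 4) need no code, only the options mentioned above (-mfpmath=sse,
which on 32-bit x86 also needs -msse2 for doubles, and -ffloat-store).

```c
#include <stdint.h>

#if defined(__GLIBC__) && (defined(__i386__) || defined(__x86_64__))
#include <fpu_control.h>
#define HAVE_X87_CONTROL 1
#else
#define HAVE_X87_CONTROL 0
#endif

/* Bits 8-9 of the x87 control word select the mantissa precision:
   00 = 24 bits (single), 10 = 53 bits (double), 11 = 64 bits (extended).
   "Rounding precision of 64 bits" in the list above means the 64-bit
   double format, i.e. the 53-bit mantissa setting. */
#define X87_PC_MASK   0x0300u
#define X87_PC_DOUBLE 0x0200u

/* Return the given control word with precision control set to double.
   All other bits (rounding mode, exception masks) are left alone. */
uint16_t with_double_precision(uint16_t cw)
{
    return (uint16_t)((cw & ~X87_PC_MASK) | X87_PC_DOUBLE);
}

/* Switch the x87 unit to double precision.
   Returns 1 on success, 0 where this sketch has no way to do it. */
int set_x87_double_precision(void)
{
#if HAVE_X87_CONTROL
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = with_double_precision((uint16_t)cw);
    _FPU_SETCW(cw);
    return 1;
#else
    return 0;
#endif
}
```

Calling set_x87_double_precision() once at program start is enough for a
single-threaded program. As noted in 3), long double arithmetic then
loses its extra precision, and float operations still carry excess
precision.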
Your proposed fixes all try to enforce definition i) of accuracy, but
I think definition ii) is also a valid position.
IMHO, that's what Richard wants to say with his comment:
>No, not better, just different. You anger a different set of folk.
I think I'll rewrite the patch to incorporate both points of view.
I'd still like to add it to the non-bugs section, though, since
a) The behavior *can* be regarded as correct.
b) It's more a hardware feature than a buggy compiler.
c) The workaround would cause severe performance regressions.
The only thing GCC can really be blamed for is that there's no option to turn
on the workaround in c). Therefore, I'd rather call this a missing feature.
> I've found FP bugs in all of the x86 compilers that I've ever used. I
> am not sure if there are any that get it right, so I am skeptical that
> gcc will ever get it right. I haven't tried any of the compilers from
> companies that specialize in FP though, maybe some of them get it right.
At least, we're not alone in our struggle with floating point