Bug 30255 - register spills in x87 unit need to be 80-bit, not 64
Status: RESOLVED DUPLICATE of bug 323
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Depends on:
Reported: 2006-12-18 20:08 UTC by R. Clint Whaley
Modified: 2014-02-16 13:12 UTC

See Also:
Known to work:
Known to fail:
Last reconfirmed:


Description R. Clint Whaley 2006-12-18 20:08:26 UTC

I am aware that gcc attempts to avoid any reordering of floating-point operations by default, as this leads to slightly different answers on different runs.  There appears to be a similar problem on the x87, where from my assembly-diving, I believe I've established that when a register spill is required, gcc only stores to the precision of the computation (e.g., 64 bits for double precision).  On the x87 unit, this therefore introduces an unpredictable (in the sense that the source does not have a store with its implicit round, but the executable does) round operation in the middle of the computation.  This unasked-for round operation has the exact same effect as reordering two fp computations (e.g., it introduces an epsilon error).  This means that not only do you have differing answers where you don't expect them, but theoretically, the 80-bit x87 could produce less accurate results than true 32 or 64-bit (though this would almost never happen in practice, as it would require massive spilling).

It came to my attention because a user of my ATLAS library noted that ATLAS failed to produce a true symmetric matrix when C = A * transpose(A) was taken.  If there are no reorderings, the lower triangle of C should exactly match the upper triangle.  When using gcc 4.2.0 20060807 (experimental) a register spill is introduced in the calculation of a 4x1 sub-block of C.  The spill affects only C[0], and that element gets an additional round that other elements do not, leading to a slightly non-symmetric matrix.
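
A minimal sketch of the symmetry property described above. This is not the ATLAS kernel: the names aat and is_exactly_symmetric are hypothetical, and the naive triple loop is an assumption chosen only for illustration. Since C[i][j] and C[j][i] sum the same products in the same order, they should compare equal exactly unless one of them picks up an extra rounding step that the other does not.

```c
#include <stddef.h>

/* Hypothetical helper: C = A * transpose(A) for a row-major n x k matrix A.
   C is written as a row-major n x n matrix. */
void aat(size_t n, size_t k, const double *a, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double dot = 0.0;
            /* Same products, same summation order for (i,j) and (j,i). */
            for (size_t p = 0; p < k; p++)
                dot += a[i * k + p] * a[j * k + p];
            c[i * n + j] = dot;
        }
    }
}

/* Returns 1 when C[i][j] == C[j][i] exactly for every pair, else 0. */
int is_exactly_symmetric(size_t n, const double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (c[i * n + j] != c[j * n + i])
                return 0;
    return 1;
}
```

The comparison is deliberately exact (no tolerance), since the report is about a one-rounding difference between two computations that are otherwise identical.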

Note that this is not stores in the algorithm causing rounding (which is inevitable), but stores unpredictably introduced into the algorithm by gcc.

A complete fix for this problem is to always do 80-bit register spills for the x87, regardless of the data type of the final calculation, and thus avoid the unpredictable round steps.
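
The effect of a narrowed spill can be imitated in source, as a rough sketch. The helper names below are hypothetical, and the volatile slots only stand in for a compiler-generated spill; they are not what gcc emits. The sketch assumes long double is wider than double, as it is with the x87 80-bit format, which some ABIs do not provide.

```c
#include <float.h>

/* Imitates a spill narrowed to double: the volatile slot forces a real
   store and reload, so the value is rounded to double precision. */
long double spill_as_double(long double x)
{
    volatile double slot = (double)x;
    return (long double)slot;
}

/* Imitates the proposed full-width spill: nothing is rounded away. */
long double spill_as_extended(long double x)
{
    volatile long double slot = x;
    return slot;
}

/* 1 + 2^-60 needs 61 significand bits, so it fits an x87 extended value
   (64 bits) but not a double (53 bits). */
int long_double_can_hold_test_value(void)
{
    return LDBL_MANT_DIG >= 61;
}

/* What is left of the 2^-60 term after a narrowed spill. */
long double residual_after_narrow_spill(void)
{
    long double t = 1.0L + 0x1p-60L;
    return spill_as_double(t) - 1.0L;
}

/* What is left of the 2^-60 term after a full-width spill. */
long double residual_after_wide_spill(void)
{
    long double t = 1.0L + 0x1p-60L;
    return spill_as_extended(t) - 1.0L;
}
```

With the narrowed slot the low-order term is rounded away. With the full-width slot it survives, provided long double really carries the extra bits.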

In order to get the problem, you need a code that has a spill, and depends on getting the same answer to one spilled and one unspilled redundant calculation.  I have a test case that does so for the above experimental gcc, but not for gcc 4.1.1 20060525 (Red Hat 4.1.1-1), since this earlier one doesn't inject a spill in the right place.  I have not tried on various other compiler versions, because I figure this is a general policy, and if I have figured the problem right, you can confirm easily how many bits you spill from the x87.

If you are interested in making the x87 produce the same answer in this case, and it is helpful, I can certainly post my tester that demonstrates the problem.  I don't want to go through the trouble if the answer is either "confirmed, not going to be fixed", or "confirmed, see how it would cause the error, will fix".

Let me know,
Comment 1 Andrew Pinski 2006-12-18 20:16:28 UTC

*** This bug has been marked as a duplicate of 323 ***
Comment 2 R. Clint Whaley 2006-12-18 20:43:01 UTC

While it may be decided not to fix this problem, this is not a duplicate of bug 323, and so it should be closed for another reason if you want to ignore it.  323 has a problem because of the function call, where a programmer knows that a round-down can occur by examining the code.  This problem is due to register spilling, and so no amount of source examination can figure out if this could occur.  Therefore, 323 can be worked around by the knowledgeable user, and this one cannot.  Also, 323 would require pragmas or something similar to prevent, whereas this problem could be completely avoided merely by spilling the 80-bit value when gcc decides to spill.  Since this problem cannot be worked around, and has a much more discrete fix, it is very different indeed from the much harder to fix 323.

Comment 3 R. Clint Whaley 2006-12-18 21:16:40 UTC
BTW, in case it isn't obvious, here's the fix that I typically use for problems like bug 323, and that I cannot use when it is gcc itself that is unpredictably spilling the computation:

#include <stdio.h>

void test(double x, double y)
{
  const double y2 = x + 1.0;
  volatile double v[2];  /* volatile keeps the store/load, which rounds to 64 bits */
  v[0] = y2;
  v[1] = y;
  if (v[0] != v[1]) printf("error\n");
}

The idea being that the volatile keyword prevents gcc from getting rid of the store/load cycle, which forces the round-down.  This allows me to still do this kind of comparison, without the speed loss associated with -ffloat-store (the compare itself becomes slow due to the store/load, but the body of the code runs as fast as normal), or the loss of precision associated with always rounding to 64 bit, as when you change the x87 control word.
Comment 4 Andrew Pinski 2006-12-18 22:04:05 UTC
The problem with register spilling and what PR 323 is talking about is all the same issue really, it is just exposed differently.

*** This bug has been marked as a duplicate of 323 ***
Comment 5 R. Clint Whaley 2006-12-18 22:14:20 UTC
I cannot, of course, force you to admit it, but 323 is a bug fixable by the programmer, and this one is not.  The other requires a lot of work in the compiler, and this does not.  So, viewing them as the same can be done, in the same way that all x87/gcc bugs are the same, or all precision bugs are the same, but since neither their genesis nor their solution is the same, it is misleading to do so.  Saying you don't care to fix it is an honest answer; closing it because it is a duplicate of a much larger and harder problem for which known workarounds exist is not.
Comment 6 Andrew Pinski 2006-12-18 23:02:39 UTC
>I cannot, of course, force you to admit it, but 323 is a bug fixable by the
> programmer, and this one is not. 

Depends on what you mean by fixable by the programmer because most people don't know anything about precision issues.  This was a design decision in the x86 back-end because it gives a nice speed.
Comment 7 R. Clint Whaley 2006-12-19 00:31:48 UTC
>Depends on what you mean by fixable by the programmer because most people don't
>know anything about precision issues.

Most people don't know programming at all, so I guess you are suggesting that errors that are fixable at the source-code level must nonetheless always be fixed by the compiler?   More to the point, the people who truly care about precision *are* often aware of these kinds of fixes, but they are helpless in this case, unlike for bug 323 (which is why they should not be conflated).

My point was that for bug 323, there is something the user can do to fix, and that something does not hurt overall performance or accuracy.  Since the problem I reported is caused completely by gcc, impacts accuracy in the same way as reordering (which gcc prohibits), and there is nothing that the user can do to fix without drastic loss of performance or accuracy, gcc is the only place it can be fixed.  This problem is a narrow discrete case that can clearly be fixed by gcc, whereas 323 is a broad class of problems which cannot be fixed without adding to the C language the concept of mixed precisions within a type.  Therefore, I strongly believe that it is perfectly valid to say that 323 cannot be solved in gcc, but clearly untrue to say that about this case, and so this bug report should have been closed as "we don't care", not as "duplicate".

>This was a design of the x86 back-end because it gives a nice speed.

The fix I suggested would only slow spill code (note: I mean gcc-spilled code, not explicit load/stores by the programmer), and would therefore make a noticeable performance difference in very few cases.  Note that unlike the straw-man of bug 323 I am *not* advocating gcc handle all extra precision behavior, just its undetectable spill rounding.

If the performance issue is greater than I suspect, obviously there could be a flag for this behavior.  I find it a bit anomalous that a compiler that is so picky about bit-level accuracy that it forbids reordering operations without a special flag, feels free to randomly round in an algorithm, even though the fix would not hurt performance as much as not performing reordering optimizations does, and introduces the same type of error.  That it does so on the most common platform on earth just adds to the beauty :)

Comment 8 Ian Lance Taylor 2006-12-19 14:57:12 UTC
I think I agree that if we spill an 80387 register to the stack, and then load the value back into an 80387 register, that we should spill all 80 bits, rather than implicitly converting to DFmode or SFmode.

This would unfortunately be rather difficult to implement in the context of gcc's register allocator, because it is perfectly normal for gcc to spill values from one type of register and reload them into a different type of register.  Thus the value might move between an 80387 register, a pair of ordinary x86 registers, and an SSE/SSE2 register, all in the same function.  It would just depend on how the value was being used.

Currently gcc simply says the value is DFmode or SFmode, and more or less ignores the fact that it is being represented as an 80-bit value in an 80387 register.  To implement this suggestion we would need to add a new notion: the mode of the spill value.  And we would need to support secondary reloads to convert 80-bit spill values as required.  That sounds rather complicated, but if we didn't do that, then I think we would still be inconsistent in some cases.  I don't see any point to making this change unless we can always be consistent.

All in all it's pretty hard for me to get excited about better support for 80387 when all modern x86 chips support SSE2 which is more consistent and faster.  See the option -mfpmath=sse.
Comment 9 R. Clint Whaley 2006-12-19 16:04:21 UTC

Thanks for the info.  I see I failed to consider the cross-register moves you mentioned.  However, can't those be moved through memory, where something destined for a 64-bit register is first written from the 80-bit reg with round-down?  Thus, you only do the round down when you have to change register sets.  In a code compiled with -mfpmath=387, I would think that would occur pretty much only at function epilogue for the return value . . .  Anyway, I see how, depending on the framework, this may be more complicated than it seemed.  However, my own compilation experience is that cross-precision/type conversions are always complicated?

>All in all it's pretty hard for me to get excited about better support for
>80387 when all modern x86 chips support SSE2 which is more consistent and
>faster.  See the option -mfpmath=sse.

First, it is consistent only in that it always has 64-bit precision.  This is like preferring a car that can only achieve 30 MPH to one that can go to 60, but only for short stretches, and must sometimes slow down to 30.  The first is more consistent, but hardly to be preferred :)

It is certainly the case that the x87 is of decreasing importance.  However, scalar SSE (the default with gcc) does *not* in general on the present generation run as fast as the x87 (I believe this common misconception comes from conflating vector and scalar performance; on AMDs, even vector performance is less than x87 for double precision).  

In particular, single precision scalar SSE seems to be much slower than x87 code, and double precision seems to be slightly slower *even when all 16 SSE regs are used, in contrast to the crappy 8-reg x87 stack*.  Without proof, I ascribe the closer double performance to the availability of movlpd, which provides a low-cost scalar load not enjoyed by single precision (which must use movss).  The only platform where scalar SSE *may* be competitive or better is Core2Duo, and I haven't had a chance to do benchmarks there to see.  Note that there is one performance advantage that x87 code will pretty much always have, even once the archs improve their scalar SSE performance: it's much more compact due to being defined earlier in the CISC instruction set, which can massively reduce your instruction load on heavily unrolled loops, and allow more instructions to fit in the selection window.

Now, if the performance were even (rather than x87 being faster), numerical guys would still sometimes prefer the x87, in order to get that free extra precision.  If 10,000 flops are done in 80-bit precision, your worst-case error is roughly epsilon.  If they are done in 64-bit (SSE), your worst-case error is 10,000*epsilon.  Which would you prefer if you were in the space ship whose flight path was being calculated? :)
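
A rough illustration of the accumulation argument above, not a worst-case analysis. The function names are hypothetical, and the sketch assumes long double is at least as wide as double (on x87 targets it is typically the 80-bit format). The double accumulator is declared volatile so that every step is rounded to 64 bits, which stands in for SSE-style arithmetic even if the compiler would otherwise keep the value in a wider register.

```c
/* Adds `term` to an accumulator `count` times, rounding to double at
   every step (volatile forces the store and reload). */
double sum_in_double(double term, int count)
{
    volatile double acc = 0.0;
    for (int i = 0; i < count; i++)
        acc = acc + term;
    return acc;
}

/* Same loop, but the running sum is carried in long double. */
long double sum_in_long_double(double term, int count)
{
    long double acc = 0.0L;
    for (int i = 0; i < count; i++)
        acc += term;
    return acc;
}

/* Absolute difference between a computed value and a reference. */
long double abs_error(long double value, long double reference)
{
    long double d = value - reference;
    return d < 0.0L ? -d : d;
}
```

Comparing both sums of 0.1 over 10,000 steps against 10000.0L * (long double)0.1 shows the wider accumulator ending no further from the reference than the double one. The size of the gap depends on the target's long double format.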

Comment 10 R. Clint Whaley 2006-12-19 17:18:00 UTC

In the interests of full disclosure, I did some quick timings on the Core2Duo, and as I kind of suspected, scalar SSE crushed x87 there.  I was pretty sure scalar SSE could achieve 2 flop/cycle, while Intel kept the x87 at 1 flop/cycle, and that's what my timings show.  So, it does appear likely that the only people using the x87 in the future on the Intel will be people who need the extra precision (and those people would really like this fix, I will point out :).  All other Intel archs (P4, PIII, etc) do 1 flop/cycle for both scalar SSE and x87.

On the AMDs, both x87 and scalar SSE can achieve 2 flop/cycle, with x87 running somewhat faster, with only a slight advantage in double precision, and a more commanding one in single.  It looks like the next generation of AMDs will increase the maximal flop rate of vector SSE, but it does not look like they will increase the max flop rate of scalar SSE, so this may continue to be the case going forward . . .

Comment 11 Richard Biener 2006-12-27 16:21:53 UTC
Just to mention it - you can use 'long double' to force 80-bit spills.
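
A minimal sketch of that suggestion, assuming a target where long double is the x87 80-bit format (on some ABIs it is no wider than double, in which case nothing is gained). The function name dot_extended is hypothetical. Because the accumulator's declared type is long double, any spill of it stores the full long double value, and the only rounding to double is the explicit one at the return.

```c
#include <stddef.h>

/* Dot product with a long double accumulator; rounds to double once,
   when the result is returned. */
double dot_extended(size_t n, const double *x, const double *y)
{
    long double acc = 0.0L;
    for (size_t i = 0; i < n; i++)
        acc += (long double)x[i] * (long double)y[i];
    return (double)acc;
}
```

The trade-off is that the source now names the wider type explicitly, so the same code behaves differently on targets where long double has another format.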