Bug 61399 - LDBL_MAX is incorrect with IBM long double format / overflow issues near large values
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: c
Version: 4.7.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2014-06-02 23:37 UTC by Vincent Lefèvre
Modified: 2022-12-19 20:38 UTC (History)
1 user

See Also:
Host:
Target: powerpc64-linux-gnu powerpc-darwin
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Vincent Lefèvre 2014-06-02 23:37:09 UTC
On PowerPC, which uses the IBM long double format for the type long double, one gets:

LDBL_MANT_DIG = 106 (this is the precision of the FP model),
LDBL_MAX_EXP = 1024 (this is the maximum exponent of the FP model).

Even though the IBM long double format, a.k.a. double-double format, is not a floating-point format per se, it is a superset of a floating-point format as specified by the C standard (see C11, §5.2.4.2.2, p1 to p3), with p = 106 and emax = 1024.

By definition, for radix b = 2, LDBL_MAX is the value (1 - 2^(-p)) * 2^emax (see §5.2.4.2.2p12), which is the largest value representable in the floating-point model.
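
This model maximum can be checked with exact integer arithmetic. A minimal sketch (Python rather than C, since (1 - 2^(-106)) * 2^1024 is wider than any native C type on most hosts) that renders the value as a C99 hex-float literal:

```python
# Hex-float rendering of the model maximum (1 - 2**-p) * 2**emax
# for the IBM long double parameters p = 106, emax = 1024.
p, emax = 106, 1024

frac_bits = p - 1                      # 105 fraction bits, all ones
pad = (-frac_bits) % 4                 # pad up to a whole number of hex digits
frac = ((1 << frac_bits) - 1) << pad   # shift so nibbles align
ndigits = (frac_bits + pad) // 4
print(f"0x1.{frac:0{ndigits}x}p+{emax - 1}")
# → 0x1.ffffffffffffffffffffffffff8p+1023
```

This is exactly the value 0x1.ffffffffffffffffffffffffff8p+1023 expected below.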

However, with GCC 4.7.2 20121109 (Red Hat 4.7.2-8), I get:

LDBL_MAX = 0x1.fffffffffffff7ffffffffffff8p+1023

instead of 0x1.ffffffffffffffffffffffffff8p+1023.

The following program shows that this is not a display bug:

#include <stdio.h>
#include <float.h>

int main (void)
{
  long double dmax = DBL_MAX;
  printf ("%La\n", LDBL_MAX);                       /* LDBL_MAX itself */
  printf ("%La\n", LDBL_MAX - dmax);                /* its low part */
  printf ("%La\n", dmax + dmax * DBL_EPSILON / 4);  /* larger than LDBL_MAX */
  printf ("%La\n", dmax + dmax * DBL_EPSILON / 2);  /* overflows */
  return 0;
}

It outputs:

0x1.fffffffffffff7ffffffffffff8p+1023
0x1.ffffffffffffep+969
0x1.fffffffffffff7ffffffffffffcp+1023
inf

also showing that the arithmetic is buggy. I suppose that the high double is the full value rounded to double precision, but this rule breaks down near the largest values because the rounding can overflow.

One may choose to keep the behavior, i.e. consider that the high double is the value rounded to double precision, but this means that the floating-point model would need to be changed; otherwise some values are not representable, as shown above.
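
The "high double = value rounded to nearest double" rule and its overflow near the top can be emulated on any host with exact rationals (a sketch, not GCC's actual code path; Python floats are IEEE doubles):

```python
import sys
from fractions import Fraction

eps = Fraction(1, 2**52)               # DBL_EPSILON
dmax = Fraction(sys.float_info.max)    # DBL_MAX, exactly

v1 = dmax + dmax * eps / 4             # third value in the program above
v2 = dmax + dmax * eps / 2             # fourth value

# v1 is below the midpoint DBL_MAX + 2**970, so it rounds back down:
# hi = DBL_MAX, lo = dmax*eps/4, and the pair is representable.
assert float(v1) == sys.float_info.max

# v2 is above that midpoint, so rounding to nearest sends the high
# double to 2**1024 -- overflow, hence the "inf" output above.
try:
    float(v2)                          # correctly-rounded conversion
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)   # → True
```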
Comment 1 Vincent Lefèvre 2014-06-03 00:07:53 UTC
(In reply to Vincent Lefèvre from comment #0)
> One may choose to keep the behavior, i.e. consider that the high double is
> the value rounded to double precision, but this means that the
> floating-point model would need to be changed; otherwise some values are not
> representable, as shown above.

By "the floating-point model would need to be changed", I mean, for instance, choose LDBL_MAX_EXP = 1023. I think that this would be correct. A possible drawback is that one would have LDBL_MAX_EXP < DBL_MAX_EXP, but I don't think that this is a problem (note that one already has LDBL_MIN_EXP > DBL_MIN_EXP). This would just mean that one has "not normalized" values outside the normal range.
Comment 2 Jakub Jelinek 2014-06-03 07:04:12 UTC
I think lowering LDBL_MAX_EXP would be far worse than this.
The double-double format is unusable for any real numerics in so many ways that this is just one small part of it; the variable precision, from 106 bits up to thousands of bits depending on the exact value, is far worse.
Comment 3 Vincent Lefèvre 2014-06-03 07:45:46 UTC
The variable precision is unavoidable with this format (this is even a feature, despite the drawbacks). But the fact that the variable precision is problematic by itself isn't a reason not to try to solve other issues.
Comment 4 Vincent Lefèvre 2016-11-17 15:37:21 UTC
(In reply to Vincent Lefèvre from comment #0)
> By definition, for radix b = 2, LDBL_MAX is the value (1 - 2^(-p)) * 2^emax
> (see §5.2.4.2.2p12), which is the largest value representable in the
> floating-point model.

There has been a defect report and this is no longer the case. See:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2092.htm

"DR 467 decided that FLT_MAX, DBL_MAX, LDBL_MAX are the maximum representable finite numbers for their respective types."

So, this solves this issue: LDBL_MAX is correct according to this DR. As a consequence, I'm resolving the PR as invalid.
Comment 5 Vincent Lefèvre 2017-06-19 01:36:11 UTC
I'm reopening this PR since it is actually not solved. There are two remaining issues.

The first issue is no longer the formula, but the fact that LDBL_MAX is strictly less than the maximum normalized number in the floating-point model. I think that either LDBL_MAX_EXP should be reduced to 1023 (thus any representable value with exponent 1024 would be regarded as unnormalized), or the standard should be corrected in some other way, e.g. to allow an incomplete binade for the largest exponent.

The second issue is that one can get a finite value 0x1.fffffffffffff7ffffffffffffcp+1023 that is strictly larger than LDBL_MAX = 0x1.fffffffffffff7ffffffffffff8p+1023 (see the first and third numbers output by my program given in comment 0). Thus LDBL_MAX is not the maximum representable finite number as required by the standard.
Comment 6 jsm-csl@polyomino.org.uk 2017-06-19 17:41:01 UTC
The current proposed wording for DR#467 
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2148.htm#dr_467> changes 
"maximum representable finite floating-point number, [ math formula ]" to 
"maximum representable finite floating-point number; if that value is 
normalized, its value is [ math formula ],", which I think addresses the 
first issue without requiring any change to LDBL_MAX_EXP; LDBL_MAX should 
be the largest value, which in this case is not normalized, and 
LDBL_MAX_EXP has nothing to do with what exponents do or do not have 
normalized values.  (The last proposed change would require LDBL_EPSILON 
to change, however.)

It's true that GCC cannot handle IBM long double constants with more than 
106 mantissa bits, including the largest representable finite value.
Comment 7 Vincent Lefèvre 2017-06-20 13:51:18 UTC
(In reply to joseph@codesourcery.com from comment #6)
> The current proposed wording for DR#467 
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2148.htm#dr_467> changes 
> "maximum representable finite floating-point number, [ math formula ]" to 
> "maximum representable finite floating-point number; if that value is 
> normalized, its value is [ math formula ],", which I think addresses the 
> first issue without requiring any change to LDBL_MAX_EXP;

This change concerns the LDBL_MAX definition. However, there is another issue, related to the floating-point model 5.2.4.2.2p2 and p3. I assume that the intent of this definition is to require that *all* normalized floating-point numbers from this model are elements of the floating-point type (otherwise emin and emax would no longer make any sense). But with the current status, some floating-point numbers of exponent LDBL_MAX_EXP = 1024 (those with f54 = 1, if I'm not mistaken) are not elements of the type "long double". The reason is that the first double must be the exact value rounded to nearest (with the even rounding rule).
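
As an illustration of the claim above, here is a minimal sketch (Python, using exact rationals to emulate round-to-nearest-even; the chosen value has its first 54 significand bits, f1 through f54, all equal to 1):

```python
from fractions import Fraction

# A value of exponent 1024 that no (hi, lo) pair can represent under
# the "hi = value rounded to nearest double" rule.
x = (2**54 - 1) * Fraction(2) ** (1024 - 54)   # = 2**1024 - 2**970

# Rounding x to the nearest double is a tie between DBL_MAX and
# 2**1024; ties-to-even picks 2**1024, which overflows, so no valid
# high double exists for x.
try:
    float(x)                                   # correctly-rounded conversion
    representable_hi = True
except OverflowError:
    representable_hi = False
print(representable_hi)   # → False
```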

> LDBL_MAX should 
> be the largest value, which in this case is not normalized, and 
> LDBL_MAX_EXP has nothing to do with what exponents do or do not have 
> normalized values.

But the largest value must be at least the largest normalized value, which equals (1−b^(−p))b^emax. Currently, LDBL_MAX is smaller than this value.

> It's true that GCC cannot handle IBM long double constants with more than 
> 106 mantissa bits, including the largest representable finite value.

With the current LDBL_MAX_EXP definition, GCC can handle all values with 53 mantissa bits, but this becomes wrong for 54 mantissa bits.
Comment 8 Vincent Lefèvre 2017-06-20 14:19:21 UTC
It seems that the choice emin = -968, emax = 1024, p = 106 had been made originally because all normalized values in this floating-point system (as specified by 5.2.4.2.2p2) are representable as a sum of two double's (and even two non-overlapping double's). However, the mistake is that not all pairs of double's are valid in the type "long double" (as you can see with the LDBL_MAX value). If only valid pairs are considered, then possible choices are:

* emin = -968, emax = 1024, p = 53 (but a poor choice as it would appear that this is no better than the type double);

* emin = -968, emax = 1023, p = 106 (thus just decrease emax from the current parameters);

* emin = -967, emax = 1023, p = 107 (I think that this would be a better choice). The extra bit precision comes from the sign of the second double.
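
The extra bit in the last choice can be illustrated with a small sketch (Python, with exact rationals standing in for the exact value):

```python
from fractions import Fraction

# 2 - 2**-106 has a 107-bit all-ones significand, one bit more than
# p = 106, yet it is exactly hi + lo with a *negative* low part:
x = Fraction(2) - Fraction(1, 2**106)
hi = 2.0                               # x rounded to nearest double
lo = float(x - Fraction(hi))           # exact residual: -2**-106

assert Fraction(hi) + Fraction(lo) == x
assert lo < 0                          # the sign of lo carries the extra bit
```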
Comment 9 jsm-csl@polyomino.org.uk 2017-06-20 17:38:55 UTC
On Tue, 20 Jun 2017, vincent-gcc at vinc17 dot net wrote:

> > The current proposed wording for DR#467 
> > <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2148.htm#dr_467> changes 
> > "maximum representable finite floating-point number, [ math formula ]" to 
> > "maximum representable finite floating-point number; if that value is 
> > normalized, its value is [ math formula ],", which I think addresses the 
> > first issue without requiring any change to LDBL_MAX_EXP;
> 
> This change concerns the LDBL_MAX definition. However, there is another issue,
> related to the floating-point model 5.2.4.2.2p2 and p3. I assume that the
> intent of this definition is to require that *all* normalized floating-point
> numbers from this model are elements of the floating-point type (otherwise emin
> and emax would no longer make any sense). But with the current status, some
> floating-point numbers of exponent LDBL_MAX_EXP = 1024 (those with f54 = 1, if
> I'm not mistaken) are not elements of the type "long double". The reason is
> that the first double must be the exact value rounded to nearest (with the even
> rounding rule).

That wording defines what "normalized" is, so that values with maximum 
exponent in this case don't count as normalized because not all values 
with that exponent in the floating-point model are representable, and 
defines LDBL_MAX so that it doesn't need to be normalized (and in this 
case isn't, by that definition of normalized).  The definition of 
LDBL_MAX_EXP is unchanged; it still just requires 2^(LDBL_MAX_EXP-1) to be 
representable, without requiring it to be normalized.

> > LDBL_MAX should 
> > be the largest value, which in this case is not normalized, and 
> > LDBL_MAX_EXP has nothing to do with what exponents do or do not have 
> > normalized values.
> 
> But the largest value must be at least the largest normalized value, which
> equals (1−b^(−p))b^emax. Currently, LDBL_MAX is smaller than this value.

That's not a normalized value, because not all values with exponent emax 
are representable in the type, so none of them are normalized.
Comment 10 Vincent Lefèvre 2017-06-20 19:03:59 UTC
(In reply to joseph@codesourcery.com from comment #9)
> That wording defines what "normalized" is, so that values with maximum 
> exponent in this case don't count as normalized because not all values 
> with that exponent in the floating-point model are representable, and 
> defines LDBL_MAX so that it doesn't need to be normalized (and in this 
> case isn't, by that definition of normalized).  The definition of 
> LDBL_MAX_EXP is unchanged; it still just requires 2^(LDBL_MAX_EXP-1) to be 
> representable, without requiring it to be normalized.

This would be pretty useless as a definition. This would mean that emin is in the "normalized" range (due to the LDBL_MIN_EXP definition), but one doesn't have any guarantee for the larger exponents.

Thus a type that contains only 0, the normalized values of exponent emin, 1+LDBL_EPSILON, and 2^(LDBL_MAX_EXP-1) could be a valid type (with, say, long double = double = float), simply because you assume that emax has no relation to the normalized values.

Note that the standard doesn't specify a range for the property related to LDBL_DIG, and this property obviously cannot hold for *every* floating-point number with q decimal digits. In particular, the property is not satisfied in general when the target is in the range of the subnormal numbers. So, I don't expect it to be necessarily satisfied outside the range of the normalized numbers.
Comment 11 Vincent Lefèvre 2021-08-18 15:04:24 UTC
In addition to the maximum exponent issue, for LDBL_MAX following the defect report, instead of

  0x1.fffffffffffff7ffffffffffff8p+1023

I would expect

  0x1.fffffffffffff7ffffffffffffcp+1023 = DBL_MAX + DBL_MAX * DBL_EPSILON / 4

as it is larger (it has one more trailing 1) and representable.
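
That this sum is exactly the quoted hex-float can be verified with exact rational arithmetic (a Python sketch; the conversion loop below is an assumed helper, not a standard routine):

```python
from fractions import Fraction

DBL_MAX = (2**53 - 1) * Fraction(2) ** (1024 - 53)
DBL_EPSILON = Fraction(1, 2**52)

v = DBL_MAX + DBL_MAX * DBL_EPSILON / 4     # exact rational sum

# Render v as a hex-float with exponent +1023 (v is in [2**1023, 2**1024)).
frac = v / Fraction(2) ** 1023 - 1          # fractional part, dyadic
digits = ""
while frac:                                 # terminates: frac is dyadic
    frac *= 16
    d = int(frac)
    digits += "0123456789abcdef"[d]
    frac -= d
print(f"0x1.{digits}p+1023")
# → 0x1.fffffffffffff7ffffffffffffcp+1023
```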
Comment 12 Jakub Jelinek 2021-08-18 15:13:03 UTC
(In reply to Vincent Lefèvre from comment #11)
> In addition to the maximum exponent issue, for LDBL_MAX following the defect
> report, instead of
> 
>   0x1.fffffffffffff7ffffffffffff8p+1023
> 
> I would expect
> 
>   0x1.fffffffffffff7ffffffffffffcp+1023 = DBL_MAX + DBL_MAX * DBL_EPSILON / 4
> 
> as it is larger (it has one more trailing 1) and representable.

That isn't representable in the GCC internal representation, which pretends the type has a fixed 106-bit precision (like double has 53-bit precision); 0x1.fffffffffffff7ffffffffffffcp+1023 needs 107 bits of precision (and in general the type has variable precision).
The only way around that would be to actually represent it in the GCC internal representation as a sum of two doubles and to rewrite all operations on this mode to treat it specially. That is a lot of work, and given that powerpc64le is switching from this floating-point format to the IEEE quad long double format, I'm certain nobody is willing to spend that much time (months) on it.
Comment 13 Vincent Lefèvre 2021-08-18 15:40:17 UTC
(In reply to Jakub Jelinek from comment #12)
> That isn't representable in the GCC internal representation, which pretends
> the type has fixed 106 bit precision
[...]

So, if I understand correctly, this is due to a limitation of GCC internals.

> The only way around that would be to actually represent it in GCC internal
> representation as sum of two doubles and rewrite all operations on this mode
> to treat it specially.  That is a lot of work and given that powerpc64le is
> switching from this floating point format to IEEE quad long double format,
> I'm certain nobody is willing to spend that much time (months) on it.

Well, if there is an implementation difficulty, perhaps leave LDBL_MAX as is, and consider, for instance, that an operation whose result would be 0x1.fffffffffffff7ffffffffffffcp+1023 overflows (which is AFAIK undefined behavior for non-IEEE-754 formats, so this would be conforming). But this should be documented.