On PowerPC, which uses the IBM long double format for the type long double, one gets:

  LDBL_MANT_DIG = 106 (this is the precision of the FP model),
  LDBL_MAX_EXP = 1024 (this is the maximum exponent of the FP model).

Even though the IBM long double format, a.k.a. double-double format, is not a floating-point format, it is a superset of a floating-point format as specified by the C standard (see C11, §5.2.4.2.2 from p1 to p3), where p = 106 and emax = 1024. By definition, for radix b = 2, LDBL_MAX is the value (1 - 2^(-p)) * 2^emax (see §5.2.4.2.2p12), which is the largest value representable in the floating-point model. However, with GCC 4.7.2 20121109 (Red Hat 4.7.2-8), I get:

  LDBL_MAX = 0x1.fffffffffffff7ffffffffffff8p+1023

instead of 0x1.ffffffffffffffffffffffffff8p+1023. The following program shows that this is not a display bug:

#include <stdio.h>
#include <float.h>

int main (void)
{
  long double dmax = DBL_MAX;
  printf ("%La\n", LDBL_MAX);
  printf ("%La\n", LDBL_MAX - dmax);
  printf ("%La\n", dmax + dmax * DBL_EPSILON / 4);
  printf ("%La\n", dmax + dmax * DBL_EPSILON / 2);
  return 0;
}

It outputs:

0x1.fffffffffffff7ffffffffffff8p+1023
0x1.ffffffffffffep+969
0x1.fffffffffffff7ffffffffffffcp+1023
inf

also showing that the arithmetic is buggy. I suppose that the high double is the value rounded to double precision, but this rule is incorrect near the largest values, due to the overflow.

One may choose to keep the behavior, i.e. consider that the high double is the value rounded to double precision, but this means that the floating-point model would need to be changed; otherwise some values are not representable, as shown above.
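For concreteness, here is a short derivation (added for illustration, not part of the original report) of why the model maximum cannot be represented if the high double has to be the value rounded to double precision: the model maximum is (1 - 2^(-106)) * 2^1024 = 2^1024 - 2^918. The two candidate doubles around it are DBL_MAX = 2^1024 - 2^971 and 2^1024; the value is much closer to 2^1024 (distance 2^918) than to DBL_MAX (distance 2^971 - 2^918), so rounding to nearest gives 2^1024, which overflows. Hence no finite high double exists for this value, and it cannot form a valid pair of doubles.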
(In reply to Vincent Lefèvre from comment #0)
> One may choose to keep the behavior, i.e. consider that the high double is
> the value rounded to double precision, but this means that the
> floating-point model would need to be changed; otherwise some values are not
> representable, as shown above.

By "the floating-point model would need to be changed", I mean, for instance, choosing LDBL_MAX_EXP = 1023. I think that this would be correct. A possible drawback is that one would have LDBL_MAX_EXP < DBL_MAX_EXP, but I don't think that this is a problem (note that one already has LDBL_MIN_EXP > DBL_MIN_EXP). This would just mean that one has "not normalized" values outside the normal range.
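A supporting sketch of why this choice would be correct (added here, not in the original comment, and to be double-checked): with emax = 1023 and p = 106, any normalized value v = f * 2^e of the model (1/2 <= f < 1, e <= 1023) rounds to a double hi of magnitude at most 2^1023, which is finite, and the error v - hi is a multiple of 2^(e-106) bounded in magnitude by 2^(e-54), so it fits exactly in a second double. Thus every normalized value of that reduced model would be a valid pair of doubles.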
I think lowering LDBL_MAX_EXP would be far worse than this. The double-double format is unusable for any real numerics in so many ways that this is just one small part of that; the variable precision, from 106 bits to thousands of precision bits depending on the exact value, is far worse.
The variable precision is unavoidable with this format (this is even a feature, despite the drawbacks). But the fact that the variable precision is problematic by itself isn't a reason not to try to solve other issues.
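To illustrate the variable precision (an added example, assuming a powerpc target where long double is the IBM double-double format): a value such as 1 + 2^(-1000) is held exactly as the pair of doubles (1, 2^(-1000)), even though a fixed-precision format would need 1001 significand bits for it.

#include <stdio.h>

int main (void)
{
  /* Exactly representable as the pair of doubles (1, 2^-1000). */
  volatile long double x = 1.0L + 0x1p-1000L;

  /* On IBM double-double, the addition and the subtraction below should be
     exact, so this should print 0x1p-1000 (or an equivalent hex form),
     i.e. about 1001 bits of effective precision for this particular value. */
  printf ("%La\n", x - 1.0L);
  return 0;
}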
(In reply to Vincent Lefèvre from comment #0)
> By definition, for radix b = 2, LDBL_MAX is the value (1 - 2^(-p)) * 2^emax
> (see §5.2.4.2.2p12), which is the largest value representable in the
> floating-point model.

There has been a defect report and this is no longer the case. See:

  http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2092.htm

"DR 467 decided that FLT_MAX, DBL_MAX, LDBL_MAX are the maximum representable finite numbers for their respective types."

So, this solves this issue: LDBL_MAX is correct according to this DR. As a consequence, I'm resolving the PR as invalid.
I'm reopening this PR since it is actually not solved. There are two remaining issues.

The first issue is no longer the formula, but the fact that LDBL_MAX is strictly less than the maximum normalized number in the floating-point model. I think that either LDBL_MAX_EXP should be reduced to 1023 (thus any representable value with exponent 1024 would be regarded as unnormalized), or the standard should be corrected in some other way, e.g. to allow an incomplete binade for the largest exponent.

The second issue is that one can get a finite value 0x1.fffffffffffff7ffffffffffffcp+1023 that is strictly larger than LDBL_MAX = 0x1.fffffffffffff7ffffffffffff8p+1023 (see the first and third numbers output by my program given in comment 0). Thus LDBL_MAX is not the maximum representable finite number as required by the standard.
The current proposed wording for DR#467 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2148.htm#dr_467> changes "maximum representable finite floating-point number, [ math formula ]" to "maximum representable finite floating-point number; if that value is normalized, its value is [ math formula ],", which I think addresses the first issue without requiring any change to LDBL_MAX_EXP; LDBL_MAX should be the largest value, which in this case is not normalized, and LDBL_MAX_EXP has nothing to do with which exponents do or do not have normalized values. (The last proposed change would require LDBL_EPSILON to change, however.)

It's true that GCC cannot handle IBM long double constants with more than 106 mantissa bits, including the largest representable finite value.
(In reply to joseph@codesourcery.com from comment #6)
> The current proposed wording for DR#467
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2148.htm#dr_467> changes
> "maximum representable finite floating-point number, [ math formula ]" to
> "maximum representable finite floating-point number; if that value is
> normalized, its value is [ math formula ],", which I think addresses the
> first issue without requiring any change to LDBL_MAX_EXP;

This change concerns the LDBL_MAX definition. However, there is another issue, related to the floating-point model 5.2.4.2.2p2 and p3. I assume that the intent of this definition is to require that *all* normalized floating-point numbers from this model are elements of the floating-point type (otherwise emin and emax would no longer make any sense). But with the current status, some floating-point numbers of exponent LDBL_MAX_EXP = 1024 (those with f54 = 1, if I'm not mistaken) are not elements of the type "long double". The reason is that the first double must be the exact value rounded to nearest (with the even rounding rule).

> LDBL_MAX should
> be the largest value, which in this case is not normalized, and
> LDBL_MAX_EXP has nothing to do with which exponents do or do not have
> normalized values.

But the largest value must be at least the largest normalized value, which equals (1 - b^(-p)) * b^emax. Currently, LDBL_MAX is smaller than this value.

> It's true that GCC cannot handle IBM long double constants with more than
> 106 mantissa bits, including the largest representable finite value.

With the current LDBL_MAX_EXP definition, GCC can handle all values with 53 mantissa bits, but this becomes wrong for 54 mantissa bits.
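A concrete example of such a non-representable model value (added for illustration): take 2^1024 - 2^970, i.e. DBL_MAX plus half an ulp of DBL_MAX, which has exponent 1024 and a 54-bit significand. Its high double would have to be the value rounded to nearest; the value is exactly halfway between DBL_MAX (odd significand) and 2^1024 (even significand), so the even rounding rule goes up to 2^1024, which overflows. Hence this model value has no valid representation in the type.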
It seems that the choice emin = -968, emax = 1024, p = 106 had been made originally because all normalized values in this floating-point system (as specified by 5.2.4.2.2p2) are representable as a sum of two doubles (and even two non-overlapping doubles). However, the mistake is that not all pairs of doubles are valid in the type "long double" (as you can see with the LDBL_MAX value). If only valid pairs are considered, then possible choices are:

* emin = -968, emax = 1024, p = 53 (but a poor choice as it would appear that this is no better than the type double);
* emin = -968, emax = 1023, p = 106 (thus just decrease emax from the current parameters);
* emin = -967, emax = 1023, p = 107 (I think that this would be a better choice).

The extra bit of precision comes from the sign of the second double.
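To illustrate the extra bit coming from the sign of the low double (an added example, again assuming a powerpc target with IBM long double): the value 1 - 2^(-107) has a 107-bit significand, yet it is the valid pair (1, -2^(-107)), because 1 is that value rounded to the nearest double.

#include <stdio.h>

int main (void)
{
  /* 1 - 2^-107: representable as the pair of doubles (1, -2^-107),
     although its significand has 107 bits. */
  volatile long double x = 1.0L - 0x1p-107L;

  /* Both operations below should be exact on IBM double-double,
     so this should print 1. */
  printf ("%d\n", 1.0L - x == 0x1p-107L);
  return 0;
}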
On Tue, 20 Jun 2017, vincent-gcc at vinc17 dot net wrote:

> > The current proposed wording for DR#467
> > <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2148.htm#dr_467> changes
> > "maximum representable finite floating-point number, [ math formula ]" to
> > "maximum representable finite floating-point number; if that value is
> > normalized, its value is [ math formula ],", which I think addresses the
> > first issue without requiring any change to LDBL_MAX_EXP;
>
> This change concerns the LDBL_MAX definition. However, there is another issue,
> related to the floating-point model 5.2.4.2.2p2 and p3. I assume that the
> intent of this definition is to require that *all* normalized floating-point
> numbers from this model are elements of the floating-point type (otherwise emin
> and emax would no longer make any sense). But with the current status, some
> floating-point numbers of exponent LDBL_MAX_EXP = 1024 (those with f54 = 1, if
> I'm not mistaken) are not elements of the type "long double". The reason is
> that the first double must be the exact value rounded to nearest (with the even
> rounding rule).

That wording defines what "normalized" is, so that values with maximum exponent in this case don't count as normalized because not all values with that exponent in the floating-point model are representable, and defines LDBL_MAX so that it doesn't need to be normalized (and in this case isn't, by that definition of normalized). The definition of LDBL_MAX_EXP is unchanged; it still just requires 2^(LDBL_MAX_EXP-1) to be representable, without requiring it to be normalized.

> > LDBL_MAX should
> > be the largest value, which in this case is not normalized, and
> > LDBL_MAX_EXP has nothing to do with which exponents do or do not have
> > normalized values.
>
> But the largest value must be at least the largest normalized value, which
> equals (1 - b^(-p)) * b^emax. Currently, LDBL_MAX is smaller than this value.

That's not a normalized value, because not all values with exponent emax are representable in the type, so none of them are normalized.
(In reply to joseph@codesourcery.com from comment #9)
> That wording defines what "normalized" is, so that values with maximum
> exponent in this case don't count as normalized because not all values
> with that exponent in the floating-point model are representable, and
> defines LDBL_MAX so that it doesn't need to be normalized (and in this
> case isn't, by that definition of normalized). The definition of
> LDBL_MAX_EXP is unchanged; it still just requires 2^(LDBL_MAX_EXP-1) to be
> representable, without requiring it to be normalized.

This would be pretty useless as a definition. It would mean that emin is in the "normalized" range (due to the LDBL_MIN_EXP definition), but one doesn't have any guarantee for the larger exponents. Thus a type that contains only 0, the normalized values of exponent emin, 1 + LDBL_EPSILON, and 2^(LDBL_MAX_EXP-1) could be a valid type (with, say, long double = double = float), simply because you assume that emax has no relation to normalized values.

Note that the standard doesn't specify a range for the property related to LDBL_DIG, and this property obviously cannot hold for *all* floating-point numbers with q decimal digits. In particular, the property is not satisfied in general when the target is in the range of the subnormal numbers. So, I don't expect it to be necessarily satisfied outside the range of the normalized numbers.
In addition to the maximum exponent issue, for LDBL_MAX following the defect report, instead of

0x1.fffffffffffff7ffffffffffff8p+1023

I would expect

0x1.fffffffffffff7ffffffffffffcp+1023 = DBL_MAX + DBL_MAX * DBL_EPSILON / 4

as it is larger (it has one more trailing 1) and representable.
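To spell out why this value is representable (an added note): the low part DBL_MAX * DBL_EPSILON / 4 equals 0x1.fffffffffffffp+969, which is strictly below half an ulp of DBL_MAX (2^970), so DBL_MAX is indeed the sum rounded to nearest, and the pair (DBL_MAX, 0x1.fffffffffffffp+969) is a valid double-double representation of that value.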
(In reply to Vincent Lefèvre from comment #11)
> In addition to the maximum exponent issue, for LDBL_MAX following the defect
> report, instead of
>
> 0x1.fffffffffffff7ffffffffffff8p+1023
>
> I would expect
>
> 0x1.fffffffffffff7ffffffffffffcp+1023 = DBL_MAX + DBL_MAX * DBL_EPSILON / 4
>
> as it is larger (it has one more trailing 1) and representable.

That isn't representable in the GCC internal representation, which pretends the type has a fixed 106-bit precision (like double has 53-bit precision); 0x1.fffffffffffff7ffffffffffffcp+1023 needs 107 bits of precision (and generally the type has variable precision). The only way around that would be to actually represent it in the GCC internal representation as a sum of two doubles and rewrite all operations on this mode to treat it specially. That is a lot of work, and given that powerpc64le is switching from this floating-point format to the IEEE quad long double format, I'm certain nobody is willing to spend that much time (months) on it.
(In reply to Jakub Jelinek from comment #12)
> That isn't representable in the GCC internal representation, which pretends
> the type has a fixed 106-bit precision [...]

So, if I understand correctly, this is due to a limitation of GCC internals.

> The only way around that would be to actually represent it in the GCC internal
> representation as a sum of two doubles and rewrite all operations on this mode
> to treat it specially. That is a lot of work, and given that powerpc64le is
> switching from this floating-point format to the IEEE quad long double format,
> I'm certain nobody is willing to spend that much time (months) on it.

Well, if there is a difficulty of implementation, perhaps leave LDBL_MAX as is, and consider, for instance, that an operation whose result would be 0x1.fffffffffffff7ffffffffffffcp+1023 overflows (which is AFAIK undefined behavior for non-IEEE-754 formats, so that this would be conforming). But this should be documented.