[PATCH] Practical Improvement to libgcc Complex Divide

Tue Oct 13 14:54:59 GMT 2020

Ping - still need review of version 4 of this patch.
It has been over a month since the last comment.

- patrick

On 9/9/2020 2:13 AM, Richard Biener wrote:
> On Tue, Sep 8, 2020 at 8:50 PM Patrick McGehearty via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>> (Version 4)
>>
>> (Added in version 4)
>> Fixed Changelog entry to include __divsc3, __divdc3, __divxc3, __divtc3.
>> Revised description to avoid incorrect use of "ulp (units last place)".
>> Modified float precison case to use double precision when double
>> precision hardware is available. Otherwise float uses the new algorithm.
>> Added code to scale subnormal numerator arguments when appropriate.
>> This change reduces 16 bit errors in double precision by a factor of 140.
>> Revised results charts to match current version of code.
>> Added background of tuning approach.
>>
>> Summary of Purpose
>>
>> The following patch to libgcc/libgcc2.c __divdc3 provides an
>> opportunity to gain important improvements to the quality of answers
>> for the default complex divide routine (half, float, double, extended,
>> long double precisions) when dealing with very large or very small exponents.
>>
>> The current code correctly implements Smith's method (1962) [2]
>> further modified by c99's requirements for dealing with NaN (not a
>> number) results. When working with input values where the exponents
>> are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are
>> substantially different from the answers provided by quad precision
>> more than 1% of the time. This error rate may be unacceptable for many
>> applications that cannot a priori restrict their computations to the
>> safe range. The proposed method reduces the frequency of
>> "substantially different" answers by more than 99% for double
>> precision at a modest cost of performance.
>>
>> Differences between current gcc methods and the new method will be
>> described. Then accuracy and performance differences will be discussed.
>>
>> Background
>>
>> This project started with an investigation related to
>> https://urldefense.com/v3/__https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714__;!!GqivPVa7Brio!NjjEhnQQ38VNyP_v8nlAm9uVjvZldUnobfY5hZdq22cMMVauop64MFw3nOHIQXUmy8PToRw$ .  Study of Beebe[1]
>> provided an overview of past and recent practice for computing complex
>> divide. The current glibc implementation is based on Robert Smith's
>> algorithm [2] from 1962.  A google search found the paper by Baudin
>> and Smith [3] (same Robert Smith) published in 2012. Elen Kalda's
>> proposed patch [4] is based on that paper.
>>
>> I developed two sets of test set by randomly distributing values over
>> a restricted range and the full range of input values. The current
>> complex divide handled the restricted range well enough, but failed on
>> the full range more than 1% of the time. Baudin and Smith's primary
>> test for "ratio" equals zero reduced the cases with 16 or more error
>> bits by a factor of 5, but still left too many flawed answers. Adding
>> debug print out to cases with substantial errors allowed me to see the
>> intermediate calculations for test values that failed. I noted that
>> for many of the failures, "ratio" was a subnormal. Changing the
>> "ratio" test from check for zero to check for subnormal reduced the 16
>> bit error rate by another factor of 12. This single modified test
>> provides the greatest benefit for the least cost, but the percentage
>> of cases with greater than 16 bit errors (double precision data) is
>> still greater than 0.027% (2.7 in 10,000).
>>
>> Continued examination of remaining errors and their intermediate
>> computations led to the various tests of input value tests and scaling
>> to avoid under/overflow. The current patch does not handle some of the
>> rarest and most extreme combinations of input values, but the random
>> test data is only showing 1 case in 10 million that has an error of
>> greater than 12 bits. That case has 18 bits of error and is due to
>> subtraction cancellation. These results are significantly better
>> than the results reported by Baudin and Smith.
>>
>> Support for half, float, double, extended, and long double precision
>> is included as all are handled with suitable preprocessor symbols in a
>> single source routine. Since half precision is computed with float
>> precision as per current libgcc practice, the enhanced algorithm
>> provides no benefit for half precision and would cost performance.
>> Therefore half precision is left unchanged.
>>
>> The existing constants for each precision:
>> float: FLT_MAX, FLT_MIN;
>> double: DBL_MAX, DBL_MIN;
>> extended and/or long double: LDBL_MAX, LDBL_MIN
>> are used for avoiding the more common overflow/underflow cases.
>>
>> Testing for when both parts of the denominator had exponents roughly
>> small enough to allow shifting any subnormal values to normal values,
>> all input values could be scaled up without risking unnecessary
>> overflow and gaining a clear improvement in accuracy. Similarly, when
>> either numerator was subnormal and the other numerator and both
>> denominator values were not too large, scaling could be used to reduce
>> risk of computing with subnormals.  The test and scaling values used
>> all fit within the allowed exponent range for each precision required
>> by the C standard.
>>
>> Float precision has even more difficulty with getting correct answers
>> than double precision. When hardware for double precision floating
>> point operations is available, float precision is now handled in
>> double precision intermediate calculations with the original Smith
>> algorithm (i.e. the current approach). Using the higher precision
>> yields exact results for all tested input values (64-bit double,
>> 32-bit float) with the only performance cost being the requirement to
>> convert the four input values from float to double. If double
>> precision hardware is not available, then float complex divide will
>> use the same algorithm as the other precisions with similar
>> decrease in performance.
>>
>> Further Improvement
>>
>> The most common remaining substantial errors are due to accuracy loss
>> when subtracting nearly equal values. This patch makes no attempt to
>> improve that situation.
>>
>> NOTATION
>>
>> For all of the following, the notation is:
>> Input complex values:
>>    a+bi  (a= real part, b= imaginary part)
>>    c+di
>> Output complex value:
>>    e+fi = (a+bi)/(c+di)
>>
>> For the result tables:
>> current = current method (SMITH)
>> b1div = method proposed by Elen Kalda
>> b2div = alternate method considered by Elen Kalda
>> new = new method proposed by this patch
>>
>> DESCRIPTIONS of different complex divide methods:
>>
>> NAIVE COMPUTATION (-fcx-limited-range):
>>    e = (a*c + b*d)/(c*c + d*d)
>>    f = (b*c - a*d)/(c*c + d*d)
>>
>> Note that c*c and d*d will overflow or underflow if either
>> c or d is outside the range 2^-538 to 2^512.
>>
>> This method is available in gcc when the switch -fcx-limited-range is
>> used. That switch is also enabled by -ffast-math. Only one who has a
>> clear understanding of the maximum range of all intermediate values
>> generated by an application should consider using this switch.
>>
>> SMITH's METHOD (current libgcc):
>>    if(fabs(c)<fabs(d) {
>>      r = c/d;
>>      denom = (c*r) + d;
>>      e = (a*r + b) / denom;
>>      f = (b*r - a) / denom;
>>    } else {
>>      r = d/c;
>>      denom = c + (d*r);
>>      e = (a + b*r) / denom;
>>      f = (b - a*r) / denom;
>>    }
>>
>> Smith's method is the current default method available with __divdc3.
>>
>> Elen Kalda's METHOD
>>
>> Elen Kalda proposed a patch about a year ago, also based on Baudin and
>> Smith, but not including tests for subnormals:
>> https://urldefense.com/v3/__https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html__;!!GqivPVa7Brio!NjjEhnQQ38VNyP_v8nlAm9uVjvZldUnobfY5hZdq22cMMVauop64MFw3nOHIQXUm2_PCYts$  [4]
>> It is compared here for accuracy with this patch.
>>
>> This method applies the most significant part of the algorithm
>> proposed by Baudin&Smith (2012) in the paper "A Robust Complex
>> Division in Scilab" [3]. Elen's method also replaces two divides by
>> one divide and two multiplies due to the high cost of divide on
>> aarch64. In the comparison sections, this method will be labeled
>> b1div. A variation discussed in that patch which does not replace the
>> two divides will be labeled b2div.
>>
>>    inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>>    {
>>      r = d/c;
>>      t = 1.0 / (c + (d * r));
>>      if (r != 0) {
>>          x = (a + (b * r)) * t;
>>          y = (b - (a * r)) * t;
>>      }  else {
>>      /* Changing the order of operations avoids the underflow of r impacting
>>       the result. */
>>          x = (a + (d * (b / c))) * t;
>>          y = (b - (d * (a / c))) * t;
>>      }
>>    }
>>
>>    if (FABS (d) < FABS (c)) {
>>        improved_internal (a, b, c, d);
>>    } else {
>>        improved_internal (b, a, d, c);
>>        y = -y;
>>    }
>>
>> NEW METHOD (proposed by patch) to replace the current default method:
>>
>> The proposed method starts with an algorithm proposed by Baudin&Smith
>> (2012) in the paper "A Robust Complex Division in Scilab" [3]. The
>> patch makes additional modifications to that method for further
>> reductions in the error rate. The following code shows the #define
>> values for double precision. See the patch for #define values used
>> for other precisions.
>>
>>    #define RBIG ((DBL_MAX)/2.0)
>>    #define RMIN (DBL_MIN)
>>    #define RMIN2 (0x1.0p-53)
>>    #define RMINSCAL (0x1.0p+51)
>>    #define RMAX2  ((RBIG)*(RMIN2))
>>
>>    if (FABS(c) < FABS(d)) {
>>    /* prevent overflow when arguments are near max representable */
>>    if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG) ) {
>>        a = a * 0.5;
>>        b = b * 0.5;
>>        c = c * 0.5;
>>        d = d * 0.5;
>>    }
>>    /* minimize overflow/underflow issues when c and d are small */
>>    else if (FABS (d) < RMIN2) {
>>        a = a * RMINSCAL;
>>        b = b * RMINSCAL;
>>        c = c * RMINSCAL;
>>        d = d * RMINSCAL;
>>    }
>>    else {
>>      if(((FABS (a) < RMIN) && (FABS (b) < RMAX2) && (FABS (d) < RMAX2)) ||
>>         ((FABS (b) < RMIN) && (FABS (a) < RMAX2) && (FABS (d) < RMAX2))) {
>>          a = a * RMINSCAL;
>>          b = b * RMINSCAL;
>>          c = c * RMINSCAL;
>>          d = d * RMINSCAL;
>>      }
>>    }
>>    r = c/d; denom = (c*r) + d;
>>    if( r > RMIN ) {
>>        e = (a*r + b) / denom   ;
>>        f = (b*r - a) / denom
>>    } else {
>>        e = (c * (a/d) + b) / denom;
>>        f = (c * (b/d) - a) / denom;
>>    }
>>    }
>> [ only presenting the fabs(c) < fabs(d) case here, full code in patch. ]
>>
>> Before any computation of the answer, the code checks for any input
>> values near maximum to allow down scaling to avoid overflow.  These
>> scalings almost never harm the accuracy since they are by 2. Values that
>> are over RBIG are relatively rare but it is easy to test for them and
>> allow aviodance of overflows.
>>
>> Testing for RMIN2 reveals when both c and d are less than 2^-53 (for
>> double precision, see patch for other values).  By scaling all values
>> by 2^51, the code avoids many underflows in intermediate computations
>> that otherwise might occur. If scaling a and b by 2^51 causes either
>> to overflow, then the computation will overflow whatever method is
>> used.
>>
>> Finally, we test for either a or b being subnormal (RMIN) and if so,
>> for the other three values being small enough to allow scaling.  We
>> only need to test a single denominator value since we have already
>> determined which of c and d is larger.
>>
>> Next, r (the ratio of c to d) is checked for being near zero. Baudin
>> and Smith checked r for zero. This code improves that approach by
>> checking for values less than DBL_MIN (subnormal) covers roughly 12
>> times as many cases and substantially improves overall accuracy. If r
>> is too small, then when it is used in a multiplication, there is a
>> high chance that the result will underflow to zero, losing significant
>> accuracy. That underflow is avoided by reordering the computation.
>> When r is subnormal, the code replaces a*r (= a*(c/d)) with ((a/d)*c)
>> which is mathematically the same but avoids the unnecessary underflow.
>>
>> TEST Data
>>
>> Two sets of data are presented to test these methods. Both sets
>> contain 10 million pairs of complex values.  The exponents and
>> mantissas are generated using multiple calls to random() and then
>> combining the results. Only values which give results to complex
>> divide that are representable in the appropriate precision after
>> being computed in quad precision are used.
>>
>> The first data set is labeled "moderate exponents".
>> The exponent range is limited to -DBL_MAX_EXP/2 to DBL_MAX_EXP/2
>> for Double Precision (use FLT_MAX_EXP or LDBL_MAX_EXP for the
>> appropriate precisions.
>> The second data set is labeled "full exponents".
>> The exponent range for these cases is the full exponent range
>> including subnormals for a given precision.
>>
>> ACCURACY Test results:
>>
>> Note: The following accuracy tests are based on IEEE-754 arithmetic.
>>
>> Note: All results reporteed are based on use of fused multiply-add. If
>> fused multiply-add is not used, the error rate increases, giving more
>> 1 and 2 bit errors for both current and new complex divide.
>> Differences between using fused multiply and not using it that are
>> greater than 2 bits are less than 1 in a million.
>>
>> The complex divide methods are evaluated by determining the percentage
>> of values that exceed differences in low order bits.  If a "2 bit"
>> test results show 1%, that would mean that 1% of 10,000,000 values
>> (100,000) have either a real or imaginary part that differs from the
>> quad precision result by more than the last 2 bits.
>>
>> Results are reported for differences greater than or equal to 1 bit, 2
>> bits, 8 bits, 16 bits, 24 bits, and 52 bits for double precision.  Even
>> when the patch avoids overflows and underflows, some input values are
>> expected to have errors due to the potential for catastrophic roundoff
>> from floating point subtraction. For example, when b*c and a*d are
>> nearly equal, the result of subtraction may lose several places of
>> accuracy. This patch does not attempt to detect or minimize this type
>> of error, but neither does it increase them.
>>
>> I only show the results for Elen Kalda's method (with both 1 and
>> 2 divides) and the new method for only 1 divide in the double
>> precision table.
>>
>> In the following charts, lower values are better.
>>
>> current - current complex divide in libgcc
>> b1div - Elen Kalda's method from Baudin & Smith with one divide
>> b2div - Elen Kalda's method from Baudin & Smith with two divides
>> new   - This patch which uses 2 divides
>>
>> ===================================================
>> Errors   Moderate Dataset
>> gtr eq     current    b1div      b2div        new
>> ======    ========   ========   ========   ========
>>   1 bit    0.24707%   0.92986%   0.24707%   0.24707%
>>   2 bits   0.01762%   0.01770%   0.01762%   0.01762%
>>   8 bits   0.00026%   0.00026%   0.00026%   0.00026%
>> 16 bits   0.00000%   0.00000%   0.00000%   0.00000%
>> 24 bits         0%         0%         0%         0%
>> 52 bits         0%         0%         0%         0%
>> ===================================================
>> Table 1: Errors with Moderate Dataset (Double Precision)
>>
>> Note in Table 1 that both the old and new methods give identical error
>> rates for data with moderate exponents. Errors exceeding 16 bits are
>> exceedingly rare. There are substantial increases in the 1 bit error
>> rates for b1div (the 1 divide/2 multiplys method) as compared to b2div
>> (the 2 divides method). These differences are minimal for 2 bits and
>> larger error measurements.
>>
>> ===================================================
>> Errors   Full Dataset
>> gtr eq     current    b1div      b2div        new
>> ======    ========   ========   ========   ========
>>   1 bit      2.05%   1.23842%    0.67130%   0.16664%
>>   2 bits     1.88%   0.51615%    0.50354%   0.00900%
>>   8 bits     1.77%   0.42856%    0.42168%   0.00011%
>> 16 bits     1.63%   0.33840%    0.32879%   0.00001%
>> 24 bits     1.51%   0.25583%    0.24405%   0.00000%
>> 52 bits     1.13%   0.01886%    0.00350%   0.00000%
>> ===================================================
>> Table 2: Errors with Full Dataset (Double Precision)
>>
>> Table 2 shows significant differences in error rates. First, the
>> difference between b1div and b2div show a significantly higher error
>> rate for the b1div method both for single bit errros and well
>> beyond. Even for 52 bits, we see the b1div method gets completely
>> wrong answers more than 5 times as often as b2div. To retain
>> comparable accuracy with current complex divide results for small
>> exponents and due to the increase in errors for large exponents, I
>> choose to use the more accurate method of two divides.
>>
>> The current method has more 1.6% of cases where it is getting results
>> where the low 24 bits of the mantissa differ from the correct
>> answer. More than 1.1% of cases where the answer is completely wrong.
>> The new method shows less than one case in 10,000 with greater than
>> two bits of error and only one case in 10 million with greater than
>> 16 bits of errors. The new patch reduces 8 bit errors by
>> a factor of 16,000 and virtually eliminates completely wrong
>> answers.
>>
>> As noted above, for architectures with double precision
>> hardware, the new method uses that hardware for the
>> intermediate calculations before returning the
>> result in float precision. Testing of the new patch
>> has shown zero errors found as seen in Tables 3 and 4.
>>
>> Correctness for float
>> =============================
>> Errors   Moderate Dataset
>> gtr eq     current     new
>> ======    ========   ========
>>   1 bit   28.68070%         0%
>>   2 bits   0.64386%         0%
>>   8 bits   0.00401%         0%
>> 16 bits   0.00001%         0%
>> 24 bits         0%         0%
>> =============================
>> Table 3: Errors with Moderate Dataset (float)
>>
>> =============================
>> Errors   Full Dataset
>> gtr eq     current     new
>> ======    ========   ========
>>   1 bit     19.97%         0%
>>   2 bits     3.20%         0%
>>   8 bits     1.90%         0%
>> 16 bits     1.03%         0%
>> 24 bits     0.52%         0%
>> =============================
>> Table 4: Errors with Full Dataset (float)
>>
>> As before, the current method shows an troubling rate of extreme
>> errors.
>>
>> There is no change in accuracy for half-precision since the code is
>> unchanged. libgcc computes half-precision functions in float precision
>> allowing the existing methods to avoid overflow/underflow issues
>> for the allowed range of exponents for half-precision.
>>
>> Extended precision (using x87 80-bit format on x86) and Long double
>> (using IEEE-754 128-bit on x86 and aarch64) both have 15-bit exponents
>> as compared to 11-bit exponents in double precision. We note that the
>> C standard also allows Long Double to be implemented in the equivalent
>> range of Double. The RMIN2 and RMINSCAL constants are selected to work
>> within the Double range as well as with extended and 128-bit ranges.
>> We will limit our performance and accurancy discussions to the 80-bit
>> and 128-bit formats as seen on x86 here.
>>
>> The extended and long double precision investigations were more
>> limited. Aarch64 does not support extended precision but does support
>> the software implementation of 128-bit long double precision. For x86,
>> long double defaults to the 80-bit precision but using the
>> -mlong-double-128 flag switches to using the software implementation
>> of 128-bit precision. Both 80-bit and 128-bit precisions have the same
>> exponent range, with the 128-bit precision has extended mantissas.
>> Since this change is only aimed at avoiding underflow/overflow for
>> extreme exponents, I studied the extended precision results on x86 for
>> 100,000 values. The limited exponent dataset showed no differences.
>> For the dataset with full exponent range, the current and new values
>> showed major differences (greater than 32 bits) in 567 cases out of
>> 100,000 (0.56%). In every one of these cases, the ratio of c/d or d/c
>> (as appropriate) was zero or subnormal, indicating the advantage of
>> the new method and its continued correctness where needed.
>>
>> PERFORMANCE Test results
>>
>> In order for a library change to be practical, it is necessary to show
>> the slowdown is tolerable. The slowdowns observed are much less than
>> would be seen by (for example) switching from hardware double precison
>> to a software quad precision, which on the tested machines causes a
>> slowdown of around 100x).
>>
>> The actual slowdown depends on the machine architecture. It also
>> depends on the nature of the input data. If underflow/overflow is
>> rare, then implementations that have strong branch prediction will
>> only slowdown by a few cycles. If underflow/overflow is common, then
>> the branch predictors will be less accurate and the cost will be
>> higher.
>>
>> Results from two machines are presented as examples of the overhead
>> for the new method. The one labeled x86 is a 5 year old Intel x86
>> processor and the one labeled aarch64 is a 3 year old arm64 processor.
>>
>> In the following chart, the times are averaged over a one million
>> value data set. All values are scaled to set the time of the current
>> method to be 1.0. Lower values are better. A value of less than 1.0
>> would be faster than the current method and a value greater than 1.0
>> would be slower than the current method.
>>
>> ================================================
>>                 Moderate set          full set
>>                 x86  aarch64        x86  aarch64
>> ========     ===============     ===============
>> float         0.68    1.26        0.79    1.26
>> double        1.08    1.33        1.47    1.76
>> long double   1.08    1.23        1.08    1.24
>> ================================================
>> Table 5: Performance Comparisons (ratio new/current)
>>
>> The above tables omit the timing for the 1 divide and 2 multiply
>> comparison with the 2 divide approach.
>>
>> For the proposed change, the moderate dataset shows less overhead for
>> the newer methods than the full dataset. That's because the moderate
>> dataset does not ever take the new branches which protect from
>> under/overflow. The better the branch predictor, the lower the cost
>> for these untaken branches. Both platforms are somewhat dated, with
>> the x86 having a better branch predictor which reduces the cost of the
>> additional branches in the new code. Of course, the relative slowdown
>> may be greater for some architectures, especially those with limited
>> branch prediction combined with a high cost of misprediction.
>>
>> Special note for x86 float: On the particular model of x86 used for
>> these tests, float complex divide runs slower than double complex
>> divide. While the issue has not been investigated, I suspect
>> the issue to be floating point register assignment which results
>> in false sharing being detected by the hardware. A combination
>> of HW/compiler without the glitch would likely show something
>> like 10-20% slowdown for the new method.
>>
>> The observed cost for all precisions is claimed to be tolerable on the
>> grounds that:
>>
>> (a) the cost is worthwhile considering the accuracy improvement shown.
>> (b) most applications will only spend a small fraction of their time
>>      calculating complex divide.
>> (c) it is much less than the cost of extended precision
>> (d) users are not forced to use it (as described below)
>>
>> Those users who find this degree of slowdown unsatisfactory may use
>> the gcc switch -fcx-fortran-rules which does not use the library
>> routine, instead inlining Smith's method without the C99 requirement
>> for dealing with NaN results. The proposed patch for libgcc complex
>> divide does not affect the code generated by -fcx-fortran-rules.
>>
>> SUMMARY
>>
>> When input data to complex divide has exponents whose absolute value
>> is less than half of *_MAX_EXP, this patch makes no changes in
>> accuracy and has only a modest effect on performance.  When input data
>> contains values outside those ranges, the patch eliminates more than
>> 99.9% of major errors with a tolerable cost in performance.
>>
>> In comparison to Elen Kalda's method, this patch introduces more
>> performance overhead but reduces major errors by a factor of
>> greater than 4000.
> Thanks for working on this.  Speaking about performance and
> accuracy I spot a few opportunities to use FMAs [and eventually
> vectorization] - do FMAs change anything on the accuracy analysis
> (is there the chance they'd make it worse?).  We might want to use
> IFUNCs in libgcc to version for ISA variants (with/without FMA)?
>
> Thanks,
> Richard.
>
>> REFERENCES
>>
>> [1] Nelson H.F. Beebe, "The Mathematical-Function Computation Handbook.
>> Springer International Publishing AG, 2017.
>>
>> [2] Robert L. Smith. Algorithm 116: Complex division.  Commun. ACM,
>>   5(8):435, 1962.
>>
>> [3] Michael Baudin and Robert L. Smith. "A robust complex division in
>> Scilab," October 2012, available at https://urldefense.com/v3/__http://arxiv.org/abs/1210.4539__;!!GqivPVa7Brio!NjjEhnQQ38VNyP_v8nlAm9uVjvZldUnobfY5hZdq22cMMVauop64MFw3nOHIQXUm6W2O4QM$ .
>>
>> [4] Elen Kalda: Complex division improvements in libgcc
>> https://urldefense.com/v3/__https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html__;!!GqivPVa7Brio!NjjEhnQQ38VNyP_v8nlAm9uVjvZldUnobfY5hZdq22cMMVauop64MFw3nOHIQXUm2_PCYts$
>> ---
>>   gcc/c-family/c-cppbuiltin.c |   5 ++
>>   libgcc/ChangeLog            |   7 ++
>>   libgcc/libgcc2.c            | 178 ++++++++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 182 insertions(+), 8 deletions(-)
>>
>> diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
>> index 74ecca8..02c06d8 100644
>> --- a/gcc/c-family/c-cppbuiltin.c
>> +++ b/gcc/c-family/c-cppbuiltin.c
>> @@ -1343,6 +1343,11 @@ c_cpp_builtins (cpp_reader *pfile)
>>         builtin_define_with_value ("__LIBGCC_INIT_SECTION_ASM_OP__",
>>                                   INIT_SECTION_ASM_OP, 1);
>>   #endif
>> +      /* For libgcc float/double optimization */
>> +#ifdef HAVE_adddf3
>> +      builtin_define_with_int_value ("__LIBGCC_HAVE_HWDBL__",
>> +                                HAVE_adddf3);
>> +#endif
>>   #ifdef INIT_ARRAY_SECTION_ASM_OP
>>         /* Despite the name of this target macro, the expansion is not
>>           actually used, and may be empty rather than a string
>> diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog
>> index ccfd6f6..8bd66c5 100644
>> --- a/libgcc/ChangeLog
>> +++ b/libgcc/ChangeLog
>> @@ -1,3 +1,10 @@
>> +2020-08-27  Patrick McGehearty <patrick.mcgehearty@oracle.com>
>> +
>> +       * libgcc2.c (__divsc3, __divdc3, __divxc3, __divtc3): Enhance
>> +       accuracy of complex divide by avoiding underflow/overflow when
>> +       ratio underflows or when arguments have very large or very
>> +       small exponents.
>> +
>>   2020-08-26  Jozef Lawrynowicz  <jozef.l@mittosystems.com>
>>
>>          * config/msp430/slli.S (__gnu_mspabi_sllp): New.
>> diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
>> index e0a9fd7..a9866f3 100644
>> --- a/libgcc/libgcc2.c
>> +++ b/libgcc/libgcc2.c
>> @@ -2036,27 +2036,189 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>>   CTYPE
>>   CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>>   {
>> +#if defined(L_divsc3)
>> +#define RBIG   ((FLT_MAX)/2.0)
>> +#define RMIN   (FLT_MIN)
>> +#define RMIN2  (0x1.0p-21)
>> +#define RMINSCAL (0x1.0p+19)
>> +#define RMAX2  ((RBIG)*(RMIN2))
>> +#endif
>> +
>> +#if defined(L_divdc3)
>> +#define RBIG   ((DBL_MAX)/2.0)
>> +#define RMIN   (DBL_MIN)
>> +#define RMIN2  (0x1.0p-53)
>> +#define RMINSCAL (0x1.0p+51)
>> +#define RMAX2  ((RBIG)*(RMIN2))
>> +#endif
>> +
>> +#if (defined(L_divxc3) || defined(L_divtc3))
>> +#define RBIG   ((LDBL_MAX)/2.0)
>> +#define RMIN   (LDBL_MIN)
>> +#define RMIN2  (0x1.0p-53)
>> +#define RMINSCAL (0x1.0p+51)
>> +#define RMAX2  ((RBIG)*(RMIN2))
>> +#endif
>> +
>> +#if defined(L_divhc3)
>> +  /* half precision is handled with float precision.
>> +     no extra measures are needed to avoid overflow/underflow */
>> +
>> +  float aa, bb, cc, dd;
>> +  float denom, ratio, x, y;
>> +  CTYPE res;
>> +  aa = a;
>> +  bb = b;
>> +  cc = c;
>> +  dd = d;
>> +
>> +  /* Scale by max(c,d) to reduce chances of denominator overflowing */
>> +  if (FABS (c) < FABS (d))
>> +    {
>> +      ratio = cc / dd;
>> +      denom = (cc * ratio) + dd;
>> +      x = ((aa * ratio) + bb) / denom;
>> +      y = ((bb * ratio) - aa) / denom;
>> +    }
>> +  else
>> +    {
>> +      ratio = dd / cc;
>> +      denom = (dd * ratio) + cc;
>> +      x = ((bb * ratio) + aa) / denom;
>> +      y = (bb - (aa * ratio)) / denom;
>> +    }
>> +
>> +#elif (defined(L_divsc3) && \
>> +       (defined(__LIBGCC_HAVE_HWDBL__) && __LIBGCC_HAVE_HWDBL__ == 1))
>> +
>> +  /* float is handled with double precision,
>> +     no extra measures are needed to avoid overflow/underflow */
>> +  double aa, bb, cc, dd;
>> +  double denom, ratio, x, y;
>> +  CTYPE res;
>> +
>> +  aa = a;
>> +  bb = b;
>> +  cc = c;
>> +  dd = d;
>> +  /* Scale by max(c,d) to reduce chances of denominator overflowing */
>> +  if (FABS (c) < FABS (d))
>> +    {
>> +      ratio = cc / dd;
>> +      denom = (cc * ratio) + dd;
>> +      x = ((aa * ratio) + bb) / denom;
>> +      y = ((bb * ratio) - aa) / denom;
>> +    }
>> +  else
>> +    {
>> +      ratio = dd / cc;
>> +      denom = (dd * ratio) + cc;
>> +      x = ((bb * ratio) + aa) / denom;
>> +      y = (bb - (aa * ratio)) / denom;
>> +    }
>> +
>> +#else
>>     MTYPE denom, ratio, x, y;
>>     CTYPE res;
>>
>> -  /* ??? We can get better behavior from logarithmic scaling instead of
>> -     the division.  But that would mean starting to link libgcc against
>> -     libm.  We could implement something akin to ldexp/frexp as gcc builtins
>> -     fairly easily...  */
>> +  /* double, extended, long double have significant potential
>> +    underflow/overflow errors than can be greatly reduced with
>> +    a limited number of adjustments.
>> +    float is handled the same way when no HW double is available
>> +  */
>> +
>> +  /* Scale by max(c,d) to reduce chances of denominator overflowing */
>>     if (FABS (c) < FABS (d))
>>       {
>> +      /* prevent underflow when denominator is near max representable */
>> +      if (FABS (d) >= RBIG)
>> +       {
>> +         a = a * 0.5;
>> +         b = b * 0.5;
>> +         c = c * 0.5;
>> +         d = d * 0.5;
>> +       }
>> +      /* avoid overflow/underflow issues when c and d are small */
>> +      /* scaling up helps avoid some underflows */
>> +      /* No new overflow possible since c&d < RMIN2 */
>> +      if (FABS (d) < RMIN2)
>> +       {
>> +         a = a * RMINSCAL;
>> +         b = b * RMINSCAL;
>> +         c = c * RMINSCAL;
>> +         d = d * RMINSCAL;
>> +       }
>> +      else
>> +       {
>> +         if(((FABS (a) < RMIN) && (FABS (b) < RMAX2) && (FABS (d) < RMAX2)) ||
>> +            ((FABS (b) < RMIN) && (FABS (a) < RMAX2) && (FABS (d) < RMAX2)))
>> +           {
>> +             a = a * RMINSCAL;
>> +             b = b * RMINSCAL;
>> +             c = c * RMINSCAL;
>> +             d = d * RMINSCAL;
>> +           }
>> +       }
>>         ratio = c / d;
>>         denom = (c * ratio) + d;
>> -      x = ((a * ratio) + b) / denom;
>> -      y = ((b * ratio) - a) / denom;
>> +      /* choose alternate order of computation if ratio is subnormal */
>> +      if (FABS (ratio) > RMIN)
>> +       {
>> +         x = ((a * ratio) + b) / denom;
>> +         y = ((b * ratio) - a) / denom;
>> +       }
>> +      else
>> +       {
>> +         x = ((c * (a / d)) + b) / denom;
>> +         y = ((c * (b / d)) - a) / denom;
>> +       }
>>       }
>>     else
>>       {
>> +      /* prevent underflow when denominator is near max representable */
>> +      if (FABS(c) >= RBIG)
>> +       {
>> +         a = a * 0.5;
>> +         b = b * 0.5;
>> +         c = c * 0.5;
>> +         d = d * 0.5;
>> +       }
>> +      /* avoid overflow/underflow issues when both c and d are small */
>> +      /* scaling up helps avoid some underflows */
>> +      /* No new overflow possible since both c&d are less than RMIN2 */
>> +      if (FABS(c) < RMIN2)
>> +       {
>> +         a = a * RMINSCAL;
>> +         b = b * RMINSCAL;
>> +         c = c * RMINSCAL;
>> +         d = d * RMINSCAL;
>> +       }
>> +      else
>> +       {
>> +         if(((FABS(a) < RMIN) && (FABS(b) < RMAX2) && (FABS(c) < RMAX2)) ||
>> +            ((FABS(b) < RMIN) && (FABS(a) < RMAX2) && (FABS(c) < RMAX2)))
>> +           {
>> +             a = a * RMINSCAL;
>> +             b = b * RMINSCAL;
>> +             c = c * RMINSCAL;
>> +             d = d * RMINSCAL;
>> +           }
>> +       }
>>         ratio = d / c;
>>         denom = (d * ratio) + c;
>> -      x = ((b * ratio) + a) / denom;
>> -      y = (b - (a * ratio)) / denom;
>> +      /* choose alternate order of computation if ratio is subnormal */
>> +      if (FABS(ratio) > RMIN)
>> +       {
>> +         x = ((b * ratio) + a) / denom;
>> +         y = (b - (a * ratio)) / denom;
>> +       }
>> +      else
>> +       {
>> +         x = (a + (d * (b / c))) / denom;
>> +         y = (b - (d * (a / c))) / denom;
>> +       }
>>       }
>> +#endif
>>
>>     /* Recover infinities and zeros that computed as NaN+iNaN; the only cases
>>        are nonzero/zero, infinite/finite, and finite/infinite.  */
>> --
>> 1.8.3.1
>>