This is the mail archive of the
mailing list for the GCC project.
Re: [PATCH][GCC] Complex division improvements in Libgcc
- From: Jeff Law <law at redhat dot com>
- To: Elen Kalda <Elen dot Kalda at arm dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Cc: nd <nd at arm dot com>, "ian at airs dot com" <ian at airs dot com>, "Joseph S. Myers" <jsm at polyomino dot org dot uk>
- Date: Fri, 23 Aug 2019 15:18:36 -0600
- Subject: Re: [PATCH][GCC] Complex division improvements in Libgcc
- References: <AM0PR08MB336270B1CBEBF8122A3D36FD84A40@AM0PR08MB3362.eurprd08.prod.outlook.com>
On 8/23/19 5:19 AM, Elen Kalda wrote:
> Hi all,
> Libgcc currently implements the Smith's algorithm
> (https://dl.acm.org/citation.cfm?id=368661) for complex division. It is
> designed to prevent overflows and underflows and it does that well for around
> 97.7% cases. However, there has been some interest in improving that result, as
> an example Baudin improves the robustness to ~99.5% by introducing an extra
> underflow check and also gains in performance by replacing two divisions by two
> multiplications and one division (https://arxiv.org/abs/1210.4539).
> This patch proposes the implementation of Baudin algorithm to Libgcc (in case
> the increase in rounding error is acceptable).
> In general, when it comes to deciding what is the best way of doing the
> complex divisions, there are three things to consider:
> - speed
> - accuracy
> - robustness (how often overflow/underflow happens, how many division results
> are way off from the exact value)
> Improving the algorithms by looking at the failures of specific cases is not
> simple. There are 4 real numbers to be manipulated to get the final result and
> some minimum number of operations that needs to be done. The quite minimal
> Smiths algorithm currently does 9 operations, of which 6 are multiplications
> or divisions which are susceptible to overflow and underflow. It is not
> feasible to check for both, underflow and overflow, in all 6 operations
> without significantly impacting the performance. Most of the complex division
> algorithms have been created to solve some special difficult cases, however,
> because of the abundance of failure opportunities, algorithm that works well
> for some type of divisions is likely to fail for other types of divisions.
> The simplest way to dodge overflow and underflow without impacting performance
> much is to vary the order of operations. When naively expanding the complex
> division x/y = (...)/(real(y)*real(y) + imag(y)*imag(y)), the squaring in the
> denominator is the greatest source of overflow and underflow. Smiths
> improvement was to introduce scaling factor r = real(y)/imag(y), such that the
> denominator becomes real(y) + imag(y)*r. Baudin's improvement was to check
> explicitly for underflow in r and change the order of operations if necessary.
> One way of comparing the algorithms is to generate lots of random complex
> numbers and see how the different algorithms perform. That approach is closer
> to the "real world" situation where the complex numbers are often random, not
> powers of two (which oddly is the assumption many authors make, including
> Baudin) and have an exponents in a less "dangerous" ranges. Smiths algorithm
> was compared to two versions of Baudin. The difference is that in one version
> (b2div), real and imaginary parts of the result are both divided through by
> the same denominator d, in the other one (b1div) the real and imaginary
> parts are multiplied through by the reciprocal of that d, effectively
> replacing two divisions with one division and two multiplications. Since
> division is significantly slower on AArch64, that swap introduces gains in
> performance (though with the cost of rounding error, unfortunately).
> To compare the performance, I used a test that generates 1800 random complex
> double pairs (which fit nicely into the 64 kb cache) and divide each pair 200
> 000 times, effectively doing 360 000 000 operations.
> | CPU time
> smiths | 7 296 924
> b1div | 6 558 148
> b2div | 8 658 079
> On AArch64, b1div is 10.13% faster than Smiths, b2div is 18.65% slower than
> For the accuracy, 1 000 000 pairs of complex doubles were divided and compared
> to the exact results (assuming that the division of the same numbers as long
> doubles gives the exact value).
> | >2ulp | >1ulp | >0.5ulp | <0.5ulp
> smiths | 22 937 | 23 944 | 124 228 | 828 891
> b1div | 4 450 | 59 716 | 350 792 | 585 042
> b2div | 4 036 | 24 488 | 127 144 | 844 332
> Same data, but showing the percentage of divisions falling into each category:
> | >2ulp | >1ulp | >0.5ulp | <0.5ulp
> smiths | 2.29% | 2.39% | 12.42% | 82.89%
> b1div | 0.45% | 5.97% | 35.08% | 58.50%
> b2div | 0.40% | 2.45% | 12.71% | 84.43%
> The rightmost column (<0.5ulp) shows the number of calculations for which the
> ulp error of both, the real and imaginary parts, were inside 0.5 ulp, meaning
> they were correctly rounded. >0.5ulp gives the number of calculations for
> which the largest ulp error of the real and imaginary parts were between 0.5
> and 1.0 ulp, meaning that it was rounded to one of the nearest doubles, but
> not THE nearest double. Anything less accurate is not great...
> FMA doesn't create any problems on AArch64. Bootstrapped and tested on
> aarch64-none-linux-gnu -- no problems with bootstrap, but some regressions in
> gcc/testsuite/gcc.dg/torture/builtin-math-7.c. The regressions are a result of
> the loss in accuracy - for a division (3. + 4.i) / 5i = 0.8 - 0.6i, Smiths
> 0.800000000000000044408920985006 - 0.599999999999999977795539507497 i
> ulp error in real part: 0.4
> ulp error in imaginary part: 0.2
> b1div gives:
> 0.800000000000000044408920985006 - 0.600000000000000088817841970013 i
> ulp error in real part: 0.4
> ulp error in imaginary part: 0.8
> That means the imaginary part of the Baudin result is rounded to one of the two
> nearest floats but not the one which is the nearest. Similar thing happens to
> another division in that test.
> Some questions:
> - Is the 10.13% improvement in performance with b1div worth the loss in
> - In the case of b1div, most of the results that ceased to be correctly
> rounded as a result of replacing the two divisions with multiplications, ended
> up in being in 1.0ulp. Maybe that is acceptable?
> - The improved algorithm reduces the number of bad fails from 2.3% to 0.5%. Is
> that a significant/useful improvement?
> Best wishes,
> 2019-07-31 Elen Kalda <email@example.com>
> * libgcc2.c CONCAT3(__div,MODE,3): Implement the Baudin's algorithm
I'd really like to hear from Joseph on this. He's got *far* more
experience in evaluating stuff in this space than I do and is almost
certainly going to be able to give a more informed opinion than myself.
I believe he's digging out from some time off, so it might be a little
while before he can respond.