This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
- From: Charles Baylis <charles dot baylis at linaro dot org>
- To: Ramana Radhakrishnan <ramana dot radhakrishnan at foss dot arm dot com>
- Cc: Prathamesh Kulkarni <prathamesh dot kulkarni at linaro dot org>, gcc Patches <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 31 Jul 2015 13:23:00 +0100
- Subject: Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
- References: <CAAgBjMk0Hdask2JU8xs4fj_Ai1e0ggxB+h3ayb=NOGQBYJ8ccQ at mail dot gmail dot com> <55BB4127 dot 5050202 at foss dot arm dot com>
On 31 July 2015 at 10:34, Ramana Radhakrishnan
<ramana.radhakrishnan@foss.arm.com> wrote:
> I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2 at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree.
My understanding is that 2 iterations is sufficient for single
precision floating point (although not for double precision), because
each iteration of Newton-Raphson doubles the number of bits of
accuracy.
I haven't worked through the maths myself, but
https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division
says
"This squaring of the error at each iteration step - the so-called
quadratic convergence of Newton-Raphson's method - has the
effect that the number of correct digits in the result roughly
doubles for every iteration, a property that becomes extremely
valuable when the numbers involved have many digits"
Therefore:
vrecpe -> 8 bits of accuracy
+1 iteration -> 16 bits of accuracy
+2 iterations -> 32 bits of accuracy (but in reality limited to
the precision of a 32-bit float)
Since 32 bits is much more accuracy than the 24 bits of precision in a
single precision FP value, 2 iterations should be sufficient.
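To make the bit-doubling argument concrete, here is a minimal portable C sketch of the refinement. It is not the patch itself and uses no NEON intrinsics: estimate_recip() is a hypothetical stand-in for vrecpe that keeps only about 8 significant bits of the true reciprocal, and refine() performs the x * (2 - d*x) step, where the (2 - d*x) factor is what vrecps computes. All function names here are my own, chosen for illustration.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for vrecpe (an assumption, not the real
   instruction): the true reciprocal with its mantissa truncated so
   that only about 8 significant bits survive. */
static float estimate_recip(float d)
{
    float r = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF0000u;  /* keep sign, exponent, top 7 stored mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step towards 1/d, starting from x. */
static float refine(float d, float x)
{
    return x * (2.0f - d * x);
}

/* Relative error of x as an approximation to 1/d, measured in double. */
static double rel_error(float d, float x)
{
    double exact = 1.0 / (double)d;
    return fabs((double)x - exact) / exact;
}

/* a / b as estimate + two refinement steps + multiply. */
static float approx_div(float a, float b)
{
    float x = estimate_recip(b);
    x = refine(b, x);
    x = refine(b, x);
    return a * x;
}
```

Calling rel_error() on the estimate and again after each refine() step shows the error being roughly squared each time, until it reaches the rounding limit of single precision. This only models the typical case; it says nothing about how a particular core's vrecpe behaves.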
> I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable as well as running something like SPEC2k(6) with at least one iteration to ensure correctness.
I can't argue with confirming theory matches practice :)
Some corner cases (e.g. numbers around FLT_MAX or FLT_MIN) may produce
denormals or out-of-range values during the reciprocal calculation,
which could make the answers less accurate than in the typical case,
but I think that is acceptable with -ffast-math.
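As a rough illustration of which inputs are affected, the following C sketch classifies the reciprocal of a divisor using plain IEEE division. The helper names are hypothetical, and it does not model NEON's handling of denormals; it only shows where the intermediate 1/d falls outside the normal range.

```c
#include <float.h>
#include <math.h>

/* Returns nonzero when 1/d is not a normal float, i.e. it is
   subnormal, zero or infinite. (Hypothetical helper.) */
static int recip_leaves_normal_range(float d)
{
    volatile float r = 1.0f / d;  /* volatile: force rounding to float */
    return fpclassify(r) != FP_NORMAL;
}

/* 1/FLT_MAX is about 2.9e-39, which is below FLT_MIN (about 1.2e-38),
   so the reciprocal of the largest finite float is a denormal. */
static int flt_max_is_problematic(void)
{
    return recip_leaves_normal_range(FLT_MAX);
}
```

In the other direction, a denormal divisor such as FLT_MIN / 8 has a reciprocal that overflows to infinity, while FLT_MIN itself still has a normal reciprocal.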
Charles