[PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math

Mon Jul 13 19:09:00 GMT 2015

FWIW, I was curious about the precision of the results when using these instructions for the standard sqrt{,f} functions.  This is not a wide sample, but it does point to a minimum of 3 series iterations for DP and 2 for SP:

x               sqrt(x)         1 Step (ulps)           2 Steps (ulps)          3 Steps (ulps)
2.2251e-308     1.4917e-154     1.4917e-154 (999)       1.4917e-154 (999)       1.4917e-154 (000)
1.6022e-19      4.0027e-10      4.0027e-10 (999)        4.0027e-10 (999)        4.0027e-10 (000)
1.0000e+00      1.0000e+00      1.0000e+00 (001)        1.0000e+00 (001)        1.0000e+00 (001)
1.0000e+00      1.0000e+00      9.9999e-01 (999)        1.0000e+00 (999)        1.0000e+00 (000)
1.0000e+00      1.0000e+00      9.9999e-01 (999)        1.0000e+00 (999)        1.0000e+00 (000)
2.0000e+00      1.4142e+00      1.4142e+00 (999)        1.4142e+00 (999)        1.4142e+00 (000)
2.2500e+00      1.5000e+00      1.5000e+00 (999)        1.5000e+00 (999)        1.5000e+00 (000)
2.5600e+00      1.6000e+00      1.6000e+00 (000)        1.6000e+00 (000)        1.6000e+00 (000)
3.1416e+00      1.7725e+00      1.7725e+00 (999)        1.7725e+00 (999)        1.7725e+00 (000)
6.0221e+23      7.7602e+11      7.7602e+11 (999)        7.7602e+11 (999)        7.7602e+11 (000)
1.7977e+308     1.3408e+154     1.3408e+154 (000)       1.3408e+154 (000)       1.3408e+154 (000)

x               sqrtf(x)        1 Step (ulps)           2 Steps (ulps)          3 Steps (ulps)
1.1755e-38      1.0842e-19      1.0842e-19 (096)        1.0842e-19 (000)        1.0842e-19 (000)
1.6022e-19      4.0027e-10      4.0027e-10 (008)        4.0027e-10 (000)        4.0027e-10 (000)
1.0000e+00      1.0000e+00      1.0000e+00 (001)        1.0000e+00 (001)        1.0000e+00 (001)
1.0000e+00      1.0000e+00      9.9999e-01 (096)        1.0000e+00 (000)        1.0000e+00 (000)
1.0000e+00      1.0000e+00      9.9999e-01 (094)        1.0000e+00 (001)        1.0000e+00 (000)
2.0000e+00      1.4142e+00      1.4142e+00 (146)        1.4142e+00 (001)        1.4142e+00 (000)
2.2500e+00      1.5000e+00      1.5000e+00 (018)        1.5000e+00 (000)        1.5000e+00 (001)
2.5600e+00      1.6000e+00      1.6000e+00 (001)        1.6000e+00 (001)        1.6000e+00 (001)
3.1416e+00      1.7725e+00      1.7725e+00 (006)        1.7725e+00 (001)        1.7725e+00 (001)
6.0221e+23      7.7602e+11      7.7602e+11 (069)        7.7602e+11 (001)        7.7602e+11 (000)
3.4028e+38      1.8447e+19      1.8447e+19 (000)        1.8447e+19 (000)        1.8447e+19 (000)

The error in ULPs saturates at 999 above.

Needing this many iterations to achieve accuracy would defeat the purpose of using the Newton series, as the resulting sequence would likely be slower than the FSQRT instruction.
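
The iteration counts suggested by the tables are what a back-of-the-envelope model would predict.  The sketch below assumes an idealized Newton iteration that doubles the number of correct bits per step and ignores rounding error in the intermediate operations.  The function name newton_steps_needed is hypothetical, and the 8-bit starting precision used in the comments is an assumption, not a figure taken from this thread.

```c
/* Number of Newton-Raphson steps needed to reach target_bits of precision,
   assuming (idealized) that each step doubles the number of correct bits.
   Returns -1 if the starting precision is not positive. */
int newton_steps_needed(int estimate_bits, int target_bits)
{
    int steps = 0;

    if (estimate_bits <= 0)
        return -1;

    for (int bits = estimate_bits; bits < target_bits; bits *= 2)
        steps++;

    return steps;
}

/* With an assumed 8-bit initial estimate:
     SP (24-bit significand): 8 -> 16 -> 32, i.e. 2 steps
     DP (53-bit significand): 8 -> 16 -> 32 -> 64, i.e. 3 steps  */
```

Under these assumptions the model gives 2 steps for SP and 3 for DP.  Rounding in each step can still leave the last bit wrong, so the model should be read as a lower bound rather than a guarantee.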

Unlike on x86, I have the impression that the initial estimate in AArch64 is meant for applications that do not require precision, such as graphics.  In that case, a single series iteration for SP would perhaps be good enough.

-- 
Evandro Menezes                              Austin, TX

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Monday, June 29, 2015 6:45
> To: James Greenhalgh
> Cc: Kumar, Venkataramanan; pinskia@gmail.com; Benedikt Huber;
> gcc-patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> James,
> 
> On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> >
> > On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> >>
> >>> -----Original Message-----
> >>> From: Dr. Philipp Tomsich
> >>> [mailto:philipp.tomsich@theobroma-systems.com]
> >>> Sent: Monday, June 29, 2015 2:17 PM
> >>> To: Kumar, Venkataramanan
> >>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> >>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>> (rsqrt) estimation in -ffast-math
> >>>
> >>> Kumar,
> >>>
> >>> This is not unexpected, as the initial estimation and each
> >>> iteration will add an architecturally-defined number of bits of
> >>> precision (ARMv8 guarantees only a minimum number of bits provided
> >>> per operation… the exact number is specific to each micro-arch, though).
> >>> Depending on your architecture and on the number of precise bits
> >>> required by any given benchmark, one may see miscompares.
> >>
> >> True.
> >
> > I would be very uncomfortable with this approach.
> 
> Same here. The default must be safe. Always.
> Unlike other architectures, we don’t have a problem with making the proper
> defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
> 
> > From Richard Biener's post in the thread Michael Matz linked
> > earlier:
> >
> >    It would follow existing practice of things we allow in
> >    -funsafe-math-optimizations.  Existing practice in that we
> >    want to allow -ffast-math use with common benchmarks we care
> >    about.
> >
> >    https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> >
> > With the solution you seem to be converging on (2-steps for some
> > microarchitectures, 3 for others), a binary generated for one
> > micro-arch may drop below a minimum guarantee of precision when run on
> > another. This seems to go against the spirit of the practice above. I
> > would only support adding this optimization to -Ofast if we could keep
> > to architectural guarantees of precision in the generated code (i.e.
> > 3-steps everywhere).
> >
> > I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> > which would be off by default, would enable the 2-step mode, and would
> > need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I
> > don't see what this buys you beyond the Gromacs boost (and even there
> > you would be creating an Invalid Run as optimization flags must be
> > applied across all workloads).
> 
> Any flag that reduces precision (and thus breaks IEEE floating-point
> semantics) needs to be gated with an “unsafe” flag (i.e. one that is never on
> by default).
> As a consequence, the “peak”-tuning for SPEC will turn this on… but barely
> anyone else would.
> 
> > For the 3-step optimization, it is clear to me that for "generic"
> > tuning we don't want this to be enabled by default; experimental
> > results and advice in this thread argue against it for thunderx and
> > cortex-a57 targets.
> > However, enabling it based on the CPU tuning selected seems fine to me.
> 
> I do not agree on this one, as I would like to see the safe form (i.e. 3 and
> 5 iterations respectively) become the default. Most “server-type” chips
> should not see a performance regression, and it will be easier to optimise
> for this in hardware than for a (potentially microcoded) sqrt-instruction
> (and subsequent, dependent divide).
> 
> I have not heard anyone claim a performance regression (either on thunderx or
> on cortex-a57), but merely heard a “no speed-up”.
> 
> So I am strongly in favor of defaulting to the ‘safe’ number of iterations,
> even when compiling for a generic target.
> 
> Best,
> Philipp.