This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math

From: Evandro Menezes <e dot menezes at samsung dot com>
To: "'Dr. Philipp Tomsich'" <philipp dot tomsich at theobroma-systems dot com>, 'James Greenhalgh' <james dot greenhalgh at arm dot com>
Cc: "'Kumar, Venkataramanan'" <Venkataramanan dot Kumar at amd dot com>, pinskia at gmail dot com, 'Benedikt Huber' <benedikt dot huber at theobroma-systems dot com>, gcc-patches at gcc dot gnu dot org, 'Marcus Shawcroft' <Marcus dot Shawcroft at arm dot com>, 'Ramana Radhakrishnan' <ramrad01 at arm dot com>, 'Richard Earnshaw' <rearnsha at arm dot com>
Date: Mon, 13 Jul 2015 14:09:00 -0500
Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
Authentication-results: sourceware.org; auth=none
References: <1434629045-24650-1-git-send-email-benedikt dot huber at theobroma-systems dot com> <8B73CF78-11D4-4963-A60A-E1C2A3B219E2 at gmail dot com> <F2FF9755-1DF9-4000-8602-77AB12077240 at theobroma-systems dot com> <7794A52CE4D579448B959EED7DD0A4723DD10430 at satlexdag06 dot amd dot com> <1E4680F0-02C8-4999-958C-8B531BC850DA at theobroma-systems dot com> <7794A52CE4D579448B959EED7DD0A4723DD104AF at satlexdag06 dot amd dot com> <08D3EBD5-B67B-4D97-9940-3CAE6D020DC6 at gmail dot com> <7794A52CE4D579448B959EED7DD0A4723DD109D3 at satlexdag06 dot amd dot com> <1FEA8C0A-15E0-4309-B10D-B45032A68306 at theobroma-systems dot com> <7794A52CE4D579448B959EED7DD0A4723DD10A1C at satlexdag06 dot amd dot com> <20150629113635 dot GA14400 at arm dot com> <00DB569E-D1C5-4CC5-AA2A-7572DCFEDB11 at theobroma-systems dot com>

FWIW, I was curious about the precision of the results when such instructions are used for the standard sqrt{,f} functions.  This is not a wide sample, but it does suggest that at least 3 series iterations are needed for DP and 2 for SP:

x               sqrt(x)         1 Step (ulps)           2 Steps (ulps)          3 Steps (ulps)
2.2251e-308     1.4917e-154     1.4917e-154 (999)       1.4917e-154 (999)       1.4917e-154 (000)
1.6022e-19      4.0027e-10      4.0027e-10 (999)        4.0027e-10 (999)        4.0027e-10 (000)
1.0000e+00      1.0000e+00      1.0000e+00 (001)        1.0000e+00 (001)        1.0000e+00 (001)
1.0000e+00      1.0000e+00      9.9999e-01 (999)        1.0000e+00 (999)        1.0000e+00 (000)
1.0000e+00      1.0000e+00      9.9999e-01 (999)        1.0000e+00 (999)        1.0000e+00 (000)
2.0000e+00      1.4142e+00      1.4142e+00 (999)        1.4142e+00 (999)        1.4142e+00 (000)
2.2500e+00      1.5000e+00      1.5000e+00 (999)        1.5000e+00 (999)        1.5000e+00 (000)
2.5600e+00      1.6000e+00      1.6000e+00 (000)        1.6000e+00 (000)        1.6000e+00 (000)
3.1416e+00      1.7725e+00      1.7725e+00 (999)        1.7725e+00 (999)        1.7725e+00 (000)
6.0221e+23      7.7602e+11      7.7602e+11 (999)        7.7602e+11 (999)        7.7602e+11 (000)
1.7977e+308     1.3408e+154     1.3408e+154 (000)       1.3408e+154 (000)       1.3408e+154 (000)

x               sqrtf(x)        1 Step (ulps)           2 Steps (ulps)          3 Steps (ulps)
1.1755e-38      1.0842e-19      1.0842e-19 (096)        1.0842e-19 (000)        1.0842e-19 (000)
1.6022e-19      4.0027e-10      4.0027e-10 (008)        4.0027e-10 (000)        4.0027e-10 (000)
1.0000e+00      1.0000e+00      1.0000e+00 (001)        1.0000e+00 (001)        1.0000e+00 (001)
1.0000e+00      1.0000e+00      9.9999e-01 (096)        1.0000e+00 (000)        1.0000e+00 (000)
1.0000e+00      1.0000e+00      9.9999e-01 (094)        1.0000e+00 (001)        1.0000e+00 (000)
2.0000e+00      1.4142e+00      1.4142e+00 (146)        1.4142e+00 (001)        1.4142e+00 (000)
2.2500e+00      1.5000e+00      1.5000e+00 (018)        1.5000e+00 (000)        1.5000e+00 (001)
2.5600e+00      1.6000e+00      1.6000e+00 (001)        1.6000e+00 (001)        1.6000e+00 (001)
3.1416e+00      1.7725e+00      1.7725e+00 (006)        1.7725e+00 (001)        1.7725e+00 (001)
6.0221e+23      7.7602e+11      7.7602e+11 (069)        7.7602e+11 (001)        7.7602e+11 (000)
3.4028e+38      1.8447e+19      1.8447e+19 (000)        1.8447e+19 (000)        1.8447e+19 (000)

The error in ULPs is capped at 999 in the tables above.
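
For reference, one way to compute such a capped ULP error for SP values is sketched below. This is not the harness used for the tables above; the helper names float_key and ulp_error_capped are hypothetical, and the sketch assumes finite IEEE 754 binary32 inputs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map a finite float onto an integer scale that is
   monotonic in the float's value, so that adjacent floats differ by 1. */
static int64_t float_key(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    /* Negative floats sort in reverse order of their bit patterns. */
    return (bits & 0x80000000u) ? -(int64_t)(bits & 0x7fffffffu)
                                : (int64_t)bits;
}

/* Distance in ULPs between two finite floats, capped at 999. */
static int ulp_error_capped(float got, float want)
{
    int64_t d = float_key(got) - float_key(want);
    if (d < 0)
        d = -d;
    return d > 999 ? 999 : (int)d;
}
```

Calling ulp_error_capped(approx, sqrtf(x)) would give one entry of the SP table; a DP version would need a 64-bit key and a wider difference.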

Needing that many iterations to achieve full accuracy would defeat the purpose of the Newton series, as the sequence would likely be slower than the FSQRT instruction.

Unlike on x86, I have the impression that the initial estimate in AArch64 is meant for applications that do not require precision, such as graphics.  In that case, a single series iteration for SP would perhaps be good enough.

-- 
Evandro Menezes                              Austin, TX


> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Monday, June 29, 2015 6:45
> To: James Greenhalgh
> Cc: Kumar, Venkataramanan; pinskia@gmail.com; Benedikt Huber; gcc-
> patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> James,
> 
> On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> >
> > On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> >>
> >>> -----Original Message-----
> >>> From: Dr. Philipp Tomsich
> >>> [mailto:philipp.tomsich@theobroma-systems.com]
> >>> Sent: Monday, June 29, 2015 2:17 PM
> >>> To: Kumar, Venkataramanan
> >>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> >>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>> (rsqrt) estimation in -ffast-math
> >>>
> >>> Kumar,
> >>>
> >>> This does not come unexpected, as the initial estimation and each
> >>> iteration will add an architecturally-defined number of bits of
> >>> precision (ARMv8 guarantees only a minimum number of bits provided
> >>> per operation -- the exact number is specific to each micro-arch, though).
> >>> Depending on your architecture and on the required number of precise
> >>> bits by any given benchmark, one may see miscompares.
> >>
> >> True.
> >
> > I would be very uncomfortable with this approach.
> 
> Same here. The default must be safe. Always.
> Unlike other architectures, we don't have a problem with making the proper
> defaults for "safety", as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
> 
> > From Richard Biener's post in the thread Michael Matz linked earlier
> > in the thread:
> >
> >    It would follow existing practice of things we allow in
> >    -funsafe-math-optimizations.  Existing practice in that we
> >    want to allow -ffast-math use with common benchmarks we care
> >    about.
> >
> >    https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> >
> > With the solution you seem to be converging on (2-steps for some
> > microarchitectures, 3 for others), a binary generated for one
> > micro-arch may drop below a minimum guarantee of precision when run on
> > another. This seems to go against the spirit of the practice above. I
> > would only support adding this optimization to -Ofast if we could keep
> > to architectural guarantees of precision in the generated code (i.e.
> > 3-steps everywhere).
> >
> > I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> > which would be off by default, would enable the 2-step mode, and would
> > need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I
> > don't see what this buys you beyond the Gromacs boost (and even there
> > you would be creating an Invalid Run as optimization flags must be
> > applied across all workloads).
> 
> Any flag that reduces precision (and thus breaks IEEE floating-point
> semantics) needs to be gated with an "unsafe" flag (i.e. one that is never on
> by default).
> As a consequence, the "peak"-tuning for SPEC will turn this on, but barely
> anyone else would.
> 
> > For the 3-step optimization, it is clear to me that for "generic"
> > tuning we don't want this to be enabled by default; experimental
> > results and advice in this thread argue against it for thunderx and
> > cortex-a57 targets.
> > However, enabling it based on the CPU tuning selected seems fine to me.
> 
> I do not agree on this one, as I would like to see the safe form (i.e. 3 and
> 5 iterations respectively) to become the default. Most "server-type" chips
> should not see a performance regression, while it will be easier to optimise
> for this in hardware than for a (potentially microcoded) sqrt-instruction
> (and subsequent, dependent divide).
> 
> I have not heard anyone claim a performance regression (either on thunderx or
> on cortex-a57), but merely heard a "no speed-up".
> 
> So I am strongly in favor of defaulting to the "safe" number of iterations,
> even when compiling for a generic target.
> 
> Best,
> Philipp.
