[PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math

Kumar, Venkataramanan Venkataramanan.Kumar@amd.com
Mon Jun 29 19:07:00 GMT 2015


Hi,

> -----Original Message-----
> From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> Sent: Monday, June 29, 2015 10:23 PM
> To: Dr. Philipp Tomsich
> Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber; gcc-
> patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard
> Earnshaw
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> 
> 
> 
> 
> > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> > <philipp.tomsich@theobroma-systems.com> wrote:
> >
> > James,
> >
> >> On 29 Jun 2015, at 13:36, James Greenhalgh
> >> <james.greenhalgh@arm.com> wrote:
> >>
> >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan
> >>> wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: Dr. Philipp Tomsich
> >>>> [mailto:philipp.tomsich@theobroma-systems.com]
> >>>> Sent: Monday, June 29, 2015 2:17 PM
> >>>> To: Kumar, Venkataramanan
> >>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>> (rsqrt) estimation in -ffast-math
> >>>>
> >>>> Kumar,
> >>>>
> >>>> This is not unexpected, as the initial estimate and each
> >>>> iteration add an architecturally-defined number of bits of
> >>>> precision (ARMv8 guarantees only a minimum number of bits
> >>>> per operation… the exact number is specific to each micro-arch,
> >>>> though).
> >>>> Depending on your micro-architecture and on the number of precise
> >>>> bits required by any given benchmark, one may see miscompares.
> >>>
> >>> True.
> >>
> >> I would be very uncomfortable with this approach.
> >
> > Same here. The default must be safe. Always.
> > Unlike other architectures, we don’t have a problem with making the
> > proper defaults for “safety”, as the ARMv8 ISA guarantees a minimum
> > number of precise bits per iteration.
> >
> >> From Richard Biener's post in the thread Michael Matz linked earlier
> >> in the thread:
> >>
> >>   It would follow existing practice of things we allow in
> >>   -funsafe-math-optimizations.  Existing practice in that we
> >>   want to allow -ffast-math use with common benchmarks we care
> >>   about.
> >>
> >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> >>
> >> With the solution you seem to be converging on (2-steps for some
> >> microarchitectures, 3 for others), a binary generated for one
> >> micro-arch may drop below a minimum guarantee of precision when run
> >> on another. This seems to go against the spirit of the practice
> >> above. I would only support adding this optimization to -Ofast if we
> >> could keep to architectural guarantees of precision in the generated
> >> code (i.e. 3-steps everywhere).
> >>
> >> I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> >> which would be off by default, would enable the 2-step mode, and
> >> would need to be explicitly enabled (i.e. not implied by -mcpu=foo)
> >> but I don't see what this buys you beyond the Gromacs boost (and even
> >> there you would be creating an Invalid Run as optimization flags must
> >> be applied across all workloads).
> >
> > Any flag that reduces precision (and thus breaks IEEE floating-point
> > semantics) needs to be gated with an “unsafe” flag (i.e. one that is
> > never on by default).
> > As a consequence, the “peak”-tuning for SPEC will turn this on… but
> > barely anyone else would.
> >
> >> For the 3-step optimization, it is clear to me that for "generic"
> >> tuning we don't want this to be enabled by default; experimental
> >> results and advice in this thread argue against it for thunderx and
> >> cortex-a57 targets.
> >> However, enabling it based on the CPU tuning selected seems fine to me.
> >
> > I do not agree on this one, as I would like to see the safe form
> > (i.e. 3 and 5 iterations respectively) become the default. Most
> > “server-type” chips should not see a performance regression, while it
> > will be easier to optimise for this in hardware than for a
> > (potentially microcoded) sqrt instruction (and subsequent, dependent
> > divide).
> >
> > I have not heard anyone claim a performance regression (either on
> > thunderx or on cortex-a57), but merely heard a “no speed-up”.
> 
> Actually it does regress performance on ThunderX; I just assumed that
> when I said it was not going to be a win, it would be taken as a
> slowdown. It regresses gromacs by more than 10% on ThunderX, but I
> can't remember exactly how much, as I had someone else run it. The
> latency difference is also over 40%; for example, in single precision:
> 29 cycles with div (12) and sqrt (17) directly vs 42 cycles with
> rsqrte and 2 iterations of 2 mul/rsqrts (double is 53 vs 60). That is
> a huge difference right there. ThunderX has a fast div and a fast sqrt
> for 32-bit and a reasonable one for double. So again, this is not just
> not a win but rather a regression for ThunderX, and I suspect the same
> is true for cortex-a57.
> 
> Thanks,
> Andrew
> 

Yes, theoretically that should be true for the cortex-a57 case as well.  But I believe hardware pipelining, together with instruction scheduling in the compiler, helps a little in the gromacs case: ~3% to 4% with the original patch.

I have not tested other FP benchmarks.  As James said, a flag such as -mlow-precision-recip-sqrt, if allowed, could be used as a peak flag.
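
For reference, here is a minimal sketch (mine, not code from the patch) of
the kind of expansion being discussed, written with the AArch64 ACLE scalar
intrinsics for single precision and the 2-step variant; the function name
and the choice of two refinement steps are illustrative only:

  #include <arm_neon.h>

  /* Illustrative sketch: approximate 1.0f / sqrtf (x) with FRSQRTE plus
     two FRSQRTS Newton-Raphson steps.  FRSQRTS (a, b) computes
     (3 - a*b) / 2, so each step is y = y * (3 - x*y*y) / 2, i.e. the
     "2 mul / rsqrts" per iteration mentioned above.  A 3-step variant
     (or the longer sequences needed for double precision) just repeats
     the refinement line.  */
  static inline float
  rsqrt_2step (float x)
  {
    float y = vrsqrtes_f32 (x);        /* FRSQRTE: initial estimate */
    y = y * vrsqrtss_f32 (x * y, y);   /* FRSQRTS: refinement step 1 */
    y = y * vrsqrtss_f32 (x * y, y);   /* FRSQRTS: refinement step 2 */
    return y;
  }

Whether two steps recover enough bits on every micro-arch is exactly the
point of contention above; the sketch is only meant to make the instruction
counts behind the latency numbers concrete.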

> >
> > So I am strongly in favor of defaulting to the ‘safe’ number of
> > iterations, even when compiling for a generic target.
> >
> > Best,
> > Philipp.
> >

Regards,
Venkat.

