This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
- From: Evandro Menezes <e dot menezes at samsung dot com>
- To: "'Kumar, Venkataramanan'" <Venkataramanan dot Kumar at amd dot com>, pinskia at gmail dot com, "'Dr. Philipp Tomsich'" <philipp dot tomsich at theobroma-systems dot com>
- Cc: 'James Greenhalgh' <james dot greenhalgh at arm dot com>, 'Benedikt Huber' <benedikt dot huber at theobroma-systems dot com>, gcc-patches at gcc dot gnu dot org, 'Marcus Shawcroft' <Marcus dot Shawcroft at arm dot com>, 'Ramana Radhakrishnan' <ramrad01 at arm dot com>, 'Richard Earnshaw' <rearnsha at arm dot com>
- Date: Tue, 14 Jul 2015 17:14:30 -0500
- Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
I ran a simple test on A57 rev. 0, looping a million times around sqrt{,f} and the respective series iterations with the values in the sequence 1..1000000 and got these results:
sqrt(x):   36593844/s    1/sqrt(x):  18283875/s
3 steps:   47922557/s    3 steps:    49005194/s
sqrtf(x): 143988480/s    1/sqrtf(x): 69516857/s
2 steps:   78740157/s    2 steps:    80385852/s
I'm a bit surprised that the 3-iteration series for DP is faster than sqrt(), but not that it's much faster than the reciprocal of sqrt(). As for SP, the 2-iteration series is faster only for the reciprocal of sqrtf().
This patch might still have legs in real-world cases, which I'd like to investigate.
--
Evandro Menezes Austin, TX
> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Kumar, Venkataramanan
> Sent: Monday, June 29, 2015 13:50
> To: pinskia@gmail.com; Dr. Philipp Tomsich
> Cc: James Greenhalgh; Benedikt Huber; gcc-patches@gcc.gnu.org; Marcus
> Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
>
> Hi,
>
> > -----Original Message-----
> > From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> > Sent: Monday, June 29, 2015 10:23 PM
> > To: Dr. Philipp Tomsich
> > Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber; gcc-
> > patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard
> > Earnshaw
> > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > (rsqrt) estimation in -ffast-math
> >
> >
> >
> >
> >
> > > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> > <philipp.tomsich@theobroma-systems.com> wrote:
> > >
> > > James,
> > >
> > >> On 29 Jun 2015, at 13:36, James Greenhalgh
> > <james.greenhalgh@arm.com> wrote:
> > >>
> > >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan
> > wrote:
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Dr. Philipp Tomsich
> > >>>> [mailto:philipp.tomsich@theobroma-systems.com]
> > >>>> Sent: Monday, June 29, 2015 2:17 PM
> > >>>> To: Kumar, Venkataramanan
> > >>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > >>>> (rsqrt) estimation in -ffast-math
> > >>>>
> > >>>> Kumar,
> > >>>>
> > >>>> This does not come unexpected, as the initial estimation and each
> > >>>> iteration will add an architecturally-defined number of bits of
> > >>>> precision (ARMv8 guarantees only a minimum number of bits
> > >>>> provided per operation; the exact number is specific to each
> > >>>> micro-arch, though).
> > >>>> Depending on your architecture and on the required number of
> > >>>> precise bits by any given benchmark, one may see miscompares.
> > >>>
> > >>> True.
> > >>
> > >> I would be very uncomfortable with this approach.
> > >
> > > Same here. The default must be safe. Always.
> > > Unlike other architectures, we don't have a problem with making the
> > > proper defaults for "safety", as the ARMv8 ISA guarantees a minimum
> > > number of precise bits per iteration.
> > >
> > >> From Richard Biener's post in the thread Michael Matz linked
> > >> earlier in the thread:
> > >>
> > >> It would follow existing practice of things we allow in
> > >> -funsafe-math-optimizations. Existing practice in that we
> > >> want to allow -ffast-math use with common benchmarks we care
> > >> about.
> > >>
> > >> https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> > >>
> > >> With the solution you seem to be converging on (2-steps for some
> > >> microarchitectures, 3 for others), a binary generated for one
> > >> micro-arch may drop below a minimum guarantee of precision when run
> > >> on another. This seems to go against the spirit of the practice
> > >> above. I would only support adding this optimization to -Ofast if
> > >> we could keep to architectural guarantees of precision in the
> > >> generated code
> > (i.e. 3-steps everywhere).
> > >>
> > >> I don't object to adding a "-mlow-precision-recip-sqrt" style
> > >> option, which would be off by default, would enable the 2-step
> > >> mode, and would need to be explicitly enabled (i.e. not implied by
> > >> -mcpu=foo) but I don't see what this buys you beyond the Gromacs
> > >> boost (and even there you would be creating an Invalid Run as
> > >> optimization flags must be applied across all workloads).
> > >
> > > Any flag that reduces precision (and thus breaks IEEE floating-point
> > > semantics) needs to be gated with an "unsafe" flag (i.e. one that is
> > > never on by default).
> > > As a consequence, the "peak" tuning for SPEC will turn this on, but
> > > barely anyone else would.
> > >
> > >> For the 3-step optimization, it is clear to me that for "generic"
> > >> tuning we don't want this to be enabled by default; experimental
> > >> results and advice in this thread argue against it for thunderx
> > >> and cortex-a57 targets.
> > >> However, enabling it based on the CPU tuning selected seems fine to me.
> > >
> > > I do not agree on this one, as I would like to see the safe form (i.e.
> > > 3 and 5 iterations, respectively) become the default. Most
> > > "server-type" chips should not see a performance regression, while
> > > it will be easier to optimise for this in hardware than for a
> > > (potentially microcoded) sqrt instruction (and a subsequent, dependent
> > > divide).
> > >
> > > I have not heard anyone claim a performance regression (either on
> > > thunderx or on cortex-a57), merely a "no speed-up".
> >
> > Actually, it does regress performance on ThunderX; I just assumed that
> > when I said it was not going to be a win, it was taken as a slowdown. It
> > regresses gromacs by more than 10% on ThunderX, though I can't remember
> > exactly how much, as I had someone else run it. The latency difference is
> > also over 40%; for example, single precision: 29 cycles with div (12) and
> > sqrt (17) directly vs 42 cycles with rsqrte and 2 iterations of
> > 2 mul/rsqrts (double is 53 vs 60). That is a huge difference right there.
> > ThunderX has a fast div and a fast sqrt for 32-bit and a reasonable one
> > for double. So again, this is not just not a win but rather a regression
> > for ThunderX. I suspect the same is true for cortex-a57.
> >
> > Thanks,
> > Andrew
> >
>
> Yes, theoretically that should be true for the cortex-a57 case as well. But
> I believe hardware pipelining together with instruction scheduling in the
> compiler helps a little for the gromacs case: ~3% to 4% with the original
> patch.
>
> I have not tested other FP benchmarks. As James said, a flag
> -mlow-precision-recip-sqrt, if allowed, could be used as a peak flag.
>
> > >
> > > So I am strongly in favor of defaulting to the "safe" number of
> > > iterations, even when compiling for a generic target.
> > >
> > > Best,
> > > Philipp.
> > >
>
> Regards,
> Venkat.