This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math


On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> 
> > -----Original Message-----
> > From: Dr. Philipp Tomsich [mailto:philipp.tomsich@theobroma-systems.com]
> > Sent: Monday, June 29, 2015 2:17 PM
> > To: Kumar, Venkataramanan
> > Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> > estimation in -ffast-math
> > 
> > Kumar,
> > 
> > This does not come unexpected, as the initial estimation and each iteration
> > will add an architecturally-defined number of bits of precision (ARMv8
> > guarantuees only a minimum number of bits provided per operationâ the
> > exact number is specific to each micro-arch, though).
> > Depending on your architecture and on the required number of precise bits
> > by any given benchmark, one may see miscompares.
> 
> True.  

I would be very uncomfortable with this approach.

>From Richard Biener's post in the thread Michael Matz linked earlier
in the thread:

    It would follow existing practice of things we allow in
    -funsafe-math-optimizations.  Existing practice in that we
    want to allow -ffast-math use with common benchmarks we care
    about.

    https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html

With the solution you seem to be converging on (2-steps for some
microarchitectures, 3 for others), a binary generated for one micro-arch
may drop below a minimum guarantee of precision when run on another. This
seems to go against the spirit of the practice above. I would only support
adding this optimization to -Ofast if we could keep to architectural
guarantees of precision in the generated code (i.e. 3-steps everywhere).

I don't object to adding a "-mlow-precision-recip-sqrt" style option,
which would be off by default, would enable the 2-step mode, and would
need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I don't
see what this buys you beyond the Gromacs boost (and even there you would
be creating an Invalid Run as optimization flags must be applied across
all workloads). 

For the 3-step optimization, it is clear to me that for "generic" tuning
we don't want this to be enabled by default experimental results and advice
in this thread argues against it for thunderx and cortex-a57 targets.
However, enabling it based on the CPU tuning selected seems fine to me.

Thanks,
James

> > 
> > Do you know the exact number of bits that the initial estimate and the
> > subsequent refinement steps add for your micro-arch?
> 
> I am not sure on this. I  need to check for cortex-a57 case.  
> 
> But best thing for "cortex-a57" case is not to use this optimization by default. 
> If we get an -mrecip-sqrt command line , then we can add it for "gromacs" kind application to get gains.
> 
> Any thoughts on this?
>     
> Regards,
> Venkat.
>  
> 
> > 
> > Thanks,
> > Philipp.
> > 
> > > On 29 Jun 2015, at 10:17, Kumar, Venkataramanan
> > <Venkataramanan.Kumar@amd.com> wrote:
> > >
> > >
> > > Hmm,  Reducing the iterations to "1 step for float" and "2 steps for double"
> > >
> > > I got VE (miscompares) on following benchmarks 416.gamess
> > > 453.povray
> > > 454.calculix
> > > 459.GemsFDTD
> > >
> > > Benedikt , I have ICE for 444.namd with your patch,  not sure if something
> > wrong in my local tree.
> > >
> > > Regards,
> > > Venkat.
> > >
> > >> -----Original Message-----
> > >> From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> > >> Sent: Sunday, June 28, 2015 8:35 PM
> > >> To: Kumar, Venkataramanan
> > >> Cc: Dr. Philipp Tomsich; Benedikt Huber; gcc-patches@gcc.gnu.org
> > >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > >> (rsqrt) estimation in -ffast-math
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>> On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan
> > >> <Venkataramanan.Kumar@amd.com> wrote:
> > >>>
> > >>> I got around ~12% gain with -Ofast -mcpu=cortex-a57.
> > >>
> > >> I get around 11/12% on thunderX with the patch and the decreasing the
> > >> iterations change (1/2) compared to without the patch.
> > >>
> > >> Thanks,
> > >> Andrew
> > >>
> > >>
> > >>>
> > >>> Regards,
> > >>> Venkat.
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> > >>>> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
> > >>>> Sent: Thursday, June 25, 2015 9:13 PM
> > >>>> To: Kumar, Venkataramanan
> > >>>> Cc: Benedikt Huber; pinskia@gmail.com; gcc-patches@gcc.gnu.org
> > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > >>>> (rsqrt) estimation in -ffast-math
> > >>>>
> > >>>> Kumar,
> > >>>>
> > >>>> what is the relative gain that you see on Cortex-A57?
> > >>>>
> > >>>> Thanks,
> > >>>> Philipp.
> > >>>>
> > >>>>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
> > >>>>> <Venkataramanan.Kumar@amd.com> wrote:
> > >>>>>
> > >>>>> Changing to  "1 step for float" and "2 steps for double" gives
> > >>>>> better gains
> > >>>> now for gromacs on cortex-a57.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Venkat.
> > >>>>>> -----Original Message-----
> > >>>>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> > >>>>>> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> > >>>>>> Sent: Thursday, June 25, 2015 4:09 PM
> > >>>>>> To: pinskia@gmail.com
> > >>>>>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-
> > >> systems.com
> > >>>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > >>>>>> (rsqrt) estimation in -ffast-math
> > >>>>>>
> > >>>>>> Andrew,
> > >>>>>>
> > >>>>>>> This is NOT a win on thunderX at least for single precision
> > >>>>>>> because you have
> > >>>>>> to do the divide and sqrt in the same time as it takes 5
> > >>>>>> multiples (estimate and step are multiplies in the thunderX pipeline).
> > >>>>>> Doubles is 10 multiplies which is just the same as what the patch
> > >>>>>> does (but it is really slightly less than 10, I rounded up). So
> > >>>>>> in the end this is NOT a win at all for thunderX unless we do one
> > >>>>>> less step for both single
> > >>>> and double.
> > >>>>>>
> > >>>>>> Yes, the expected benefit from rsqrt estimation is implementation
> > >>>>>> specific. If one has a better initial rsqrte or an application
> > >>>>>> that can trade precision for execution time, we could offer a
> > >>>>>> command line option to do only 2 steps for doulbe and 1 step for
> > >>>>>> float; similar to -
> > >>>> mrecip-precision for PowerPC.
> > >>>>>> What are your thoughts on that?
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>> Benedikt
> > >>>
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]