This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt


On Tue, Jan 12, 2016 at 05:53:21AM +0000, Kumar, Venkataramanan wrote:
> Hi James,
> 
> > -----Original Message-----
> > From: James Greenhalgh [mailto:james.greenhalgh@arm.com]
> > Sent: Monday, January 11, 2016 5:24 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: nd@arm.com; marcus.shawcroft@arm.com;
> > richard.earnshaw@arm.com; Kumar, Venkataramanan;
> > philipp.tomsich@theobroma-systems.com; pinskia@gmail.com;
> > Kyrylo.Tkachov@arm.com; e.menezes@samsung.com
> > Subject: [Patch AArch64] Use software sqrt expansion always for
> > -mlow-precision-recip-sqrt
> > 
> > 
> > Hi,
> > 
> > I'd like to switch the logic around in aarch64.c such that
> > -mlow-precision-recip-sqrt causes us to always emit the low-precision
> > software expansion for reciprocal square root. I have two reasons to do
> > this; first is consistency across -mcpu targets, second is enabling more
> > -mcpu targets to use the flag for peak tuning.
> > 
> > I don't much like that the precision we use for -mlow-precision-recip-sqrt
> > differs between cores (and possibly compiler revisions). Yes, we're under
> > -ffast-math but I take this flag to mean the user explicitly wants the
> > low-precision expansion, and we should not diverge from that based on an
> > internal decision as to what is optimal for performance in the high-precision
> > case. I'd prefer to keep things as predictable as possible, and here that
> > means always emitting the low-precision expansion when asked.
> > 
> > Judging by the comments in the thread proposing the reciprocal square root
> > optimisation, this will benefit all cores currently supported by GCC.
> > To be clear, we would still not expand in the high-precision case for any cores
> > which do not explicitly ask for it. Currently that is Cortex-A57 and xgene,
> > though I will be proposing a patch to remove Cortex-A57 from that list
> > shortly.
> > 
> > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> > is intended as a tuning flag for situations where performance is more
> > important than precision, but the current logic requires setting an internal
> > flag which also changes the performance characteristics where high-precision
> > is needed. This conflates two decisions the target might want to make, and
> > reduces the applicability of an option targets might want to enable for
> > performance. In particular, I'd still like to see -mlow-precision-recip-sqrt
> > continue to emit the cheaper, low-precision sequence for floats under
> > Cortex-A57.
> > 
> > Based on that reasoning, this patch makes the appropriate change to the
> > logic. I've checked with the current -mcpu values to ensure that behaviour
> > without -mlow-precision-recip-sqrt does not change, and that behaviour
> > with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> > 
> > I've also put this through bootstrap and test on aarch64-none-linux-gnu with
> > no issues.
> > 
> > OK?
> > 
> > Thanks,
> > James
> > 
> 
> Yes, I like enabling this optimization for all CPU targets via
> -mlow-precision-recip-sqrt.
>
> If my understanding is correct, for cortex-a57 we now need only
> -mlow-precision-recip-sqrt to emit the software sqrt expansion?
> 
> In the below code 
> ---snip---
> void
> aarch64_emit_swrsqrt (rtx dst, rtx src)
> {
> ............
> ............
>   int iterations = double_mode ? 3 : 2;
> 
>   if (flag_mrecip_low_precision_sqrt)
>     iterations--;
>  ---snip---
> 
> Now in the cortex-a57 case we will always do 2 and 1 steps for double and
> float, and 3 and 2 will never be used. Should we make 2 and 1 the default?
> Or does any target still need to use 3 and 2?

The code here should handle two cases:

  1) Normal -Ofast case -> Some targets use the estimate expansion with
     3 iterations for double, 2 for float. Other targets use the hardware
     fsqrt/fdiv instructions.
  2) -mlow-precision-recip-sqrt -> All targets use the estimate expansion
     with 2 iterations for double, 1 for float.
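
As a minimal sketch of those two cases (the names plan_rsqrt,
target_wants_expansion and the struct below are hypothetical, chosen for
illustration; this is not the code in aarch64.c), the decision could be
written as:

```c
#include <stdbool.h>

/* Hypothetical types for illustration; not GCC's internal data structures.  */
enum sqrt_strategy { USE_HARDWARE_FSQRT, USE_ESTIMATE_EXPANSION };

struct rsqrt_plan
{
  enum sqrt_strategy strategy;
  int iterations;   /* Refinement steps; 0 when the hardware path is used.  */
};

/* low_precision_flag stands in for -mlow-precision-recip-sqrt;
   target_wants_expansion stands in for the per-core tuning choice.  */
static struct rsqrt_plan
plan_rsqrt (bool low_precision_flag, bool target_wants_expansion,
            bool double_mode)
{
  struct rsqrt_plan plan = { USE_HARDWARE_FSQRT, 0 };

  /* Case 1, targets that did not ask for the expansion: keep fsqrt/fdiv.  */
  if (!low_precision_flag && !target_wants_expansion)
    return plan;

  /* Case 1 (targets that asked) and case 2 (all targets): expand.  */
  plan.strategy = USE_ESTIMATE_EXPANSION;
  plan.iterations = double_mode ? 3 : 2;

  /* Case 2: the flag drops one refinement step.  */
  if (low_precision_flag)
    plan.iterations--;

  return plan;
}
```

The point of the sketch is that the flag and the per-core tuning choice are
independent inputs: the flag alone is enough to select the expansion.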

-mlow-precision-recip-sqrt is a specialisation to be used only when the
programmer knows the lower precision is acceptable. It should not be on
by default...

> PS: I remember reducing iterations benefited gromacs but caused some VE in
> other FP benchmarks.

... For exactly this reason :-)
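
To illustrate why dropping a step costs precision, here is a rough sketch,
assuming the refinement is the standard Newton-Raphson step for 1/sqrt(x).
The helper refine_rsqrt is hypothetical and works on plain C doubles; it is
not the RTL that aarch64_emit_swrsqrt generates.

```c
/* Hypothetical helper, for illustration only: refine an initial guess
   for 1/sqrt(x) with `iterations` Newton-Raphson steps.  */
static double
refine_rsqrt (double x, double estimate, int iterations)
{
  double y = estimate;

  for (int i = 0; i < iterations; i++)
    y = y * (1.5 - 0.5 * x * y * y);   /* y <- y * (3 - x*y*y) / 2  */

  return y;
}
```

Starting from a guess of 0.7 for 1/sqrt(2), one step leaves an error of
roughly 1e-4 and two steps roughly 2e-8. Each step about doubles the number
of correct bits, so removing one step about halves them.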

Thanks,
James

