This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math





> On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich <philipp.tomsich@theobroma-systems.com> wrote:
> 
> James,
> 
>> On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>> 
>>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
>>> 
>>>> -----Original Message-----
>>>> From: Dr. Philipp Tomsich [mailto:philipp.tomsich@theobroma-systems.com]
>>>> Sent: Monday, June 29, 2015 2:17 PM
>>>> To: Kumar, Venkataramanan
>>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>>>> estimation in -ffast-math
>>>> 
>>>> Kumar,
>>>> 
>>>> This does not come unexpected, as the initial estimation and each iteration
>>>> will add an architecturally-defined number of bits of precision (ARMv8
>>>> guarantees only a minimum number of bits provided per operation; the
>>>> exact number is specific to each micro-arch, though).
>>>> Depending on your architecture and on the required number of precise bits
>>>> by any given benchmark, one may see miscompares.
>>> 
>>> True.  
>> 
>> I would be very uncomfortable with this approach.
> 
> Same here. The default must be safe. Always.
> Unlike other architectures, we don't have a problem with making the proper
> defaults for "safety", as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
> 
>> From Richard Biener's post in the thread Michael Matz linked earlier
>> in the thread:
>> 
>>   It would follow existing practice of things we allow in
>>   -funsafe-math-optimizations.  Existing practice in that we
>>   want to allow -ffast-math use with common benchmarks we care
>>   about.
>> 
>>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
>> 
>> With the solution you seem to be converging on (2-steps for some
>> microarchitectures, 3 for others), a binary generated for one micro-arch
>> may drop below a minimum guarantee of precision when run on another. This
>> seems to go against the spirit of the practice above. I would only support
>> adding this optimization to -Ofast if we could keep to architectural
>> guarantees of precision in the generated code (i.e. 3-steps everywhere).
>> 
>> I don't object to adding a "-mlow-precision-recip-sqrt" style option,
>> which would be off by default, would enable the 2-step mode, and would
>> need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I don't
>> see what this buys you beyond the Gromacs boost (and even there you would
>> be creating an Invalid Run as optimization flags must be applied across
>> all workloads).
> 
> Any flag that reduces precision (and thus breaks IEEE floating-point semantics)
> needs to be gated with an "unsafe" flag (i.e. one that is never on by default).
> As a consequence, the "peak"-tuning for SPEC will turn this on, but barely
> anyone else would.
> 
>> For the 3-step optimization, it is clear to me that for "generic" tuning
>> we don't want this to be enabled by default; experimental results and advice
>> in this thread argue against it for thunderx and cortex-a57 targets.
>> However, enabling it based on the CPU tuning selected seems fine to me.
> 
> I do not agree on this one, as I would like to see the safe form (i.e. 3 and 5
> iterations respectively) to become the default. Most "server-type" chips
> should not see a performance regression, while it will be easier to optimise for
> this in hardware than for a (potentially microcoded) sqrt-instruction (and
> subsequent, dependent divide).
> 
> I have not heard anyone claim a performance regression (either on thunderx
> or on cortex-a57), but merely heard a "no speed-up".

Actually, it does regress performance on ThunderX; I just assumed that when I said it was not going to be a win, that was taken as a slowdown. It regresses Gromacs by more than 10% on ThunderX, but I can't remember exactly how much, as I had someone else run it. The latency difference is also over 40%; for example, single precision: 29 cycles with div (12) and sqrt (17) directly vs. 42 cycles with the rsqrte and 2 iterations of 2 mul/rsqrts (double is 53 vs. 60). That is a huge difference right there. ThunderX has a fast div and a fast sqrt for 32-bit and a reasonable one for double. So again, this is not just not a win but rather a regression for ThunderX. I suspect the same is true for cortex-a57.

Thanks,
Andrew

> 
> So I am strongly in favor of defaulting to the "safe" number of iterations, even
> when compiling for a generic target.
> 
> Best,
> Philipp.
> 

