[AArch64] Add more precision choices for the reciprocal square root approximation

Tue Apr 12 18:16:00 GMT 2016

On 04/04/16 11:13, Evandro Menezes wrote:
> On 04/01/16 18:08, Wilco Dijkstra wrote:
>> Evandro Menezes wrote:
>>> I hope that this gets in the ballpark of what's been discussed 
>>> previously.
>> Yes that's very close to what I had in mind. A minor issue is that the
>> vector modes cannot work as they start at MAX_MODE_FLOAT (which is > 32):
>>
>> +/* Control approximate alternatives to certain FP operators. */
>> +#define AARCH64_APPROX_MODE(MODE) \
>> +  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
>> +   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
>> +   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
>> +     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT + 1)) \
>> +     : (0))
>>
>> That should be:
>>
>> +     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
>>
>> It would be worth testing all the obvious cases to be sure they work.
>>
>> Also I don't think it is a good idea to enable all modes on Exynos-M1
>> and XGene-1 - I haven't seen any evidence that shows it gives a speedup
>> on real code for all modes (or at least on a good micro benchmark like
>> the unit vector test I suggested - a simple throughput test does not
>> count!).
>
> This approximation does benefit M1 in general across several
> benchmarks.  As for my choice for XGene-1, it preserves the original
> setting.  I believe that, with this more granular option, developers
> can fine-tune their targets.
>
>> The issue is it hides performance gains from an improved divider/sqrt
>> on new revisions or microarchitectures. That means you should only
>> enable cases where there is evidence of a major speedup that cannot be
>> matched by a future improved divider/sqrt.
>
> I did notice that, in some benchmarks with heavy use of multiplication
> or multiply-accumulation, the series may be detrimental, since it
> increases the competition for the unit(s) that perform such operations.
>
> But those micro-architectures that get a better unit for division or
> sqrt() are free to add their own tuning parameters.  Granted, I assume
> that running legacy code is a non-issue in only a few markets.

Ping^1
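
For reference, the bit-index arithmetic discussed above can be checked in
isolation.  What follows is only a sketch: the four mode bounds are made-up
stand-in values, not GCC's real machine mode numbering, and the helper names
(bit_index_posted, bit_index_corrected, approx_mode_mask) are hypothetical
and do not appear in the patch.  It computes the bit index instead of
shifting directly, so the out-of-range case can be observed without
invoking undefined behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in values for illustration only; NOT GCC's real mode numbering.  */
enum {
  MIN_MODE_FLOAT = 30,
  MAX_MODE_FLOAT = 33,
  MIN_MODE_VECTOR_FLOAT = 70,
  MAX_MODE_VECTOR_FLOAT = 75
};

/* Bit index as computed by the macro as posted; -1 for non-FP modes.
   The vector branch adds the absolute value of MAX_MODE_FLOAT.  */
static int
bit_index_posted (int mode)
{
  if (MIN_MODE_FLOAT <= mode && mode <= MAX_MODE_FLOAT)
    return mode - MIN_MODE_FLOAT;
  if (MIN_MODE_VECTOR_FLOAT <= mode && mode <= MAX_MODE_VECTOR_FLOAT)
    return mode - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT + 1;
  return -1;
}

/* Bit index with the suggested correction: the vector bits start right
   after the scalar bits, i.e. at the number of scalar FP modes.  */
static int
bit_index_corrected (int mode)
{
  if (MIN_MODE_FLOAT <= mode && mode <= MAX_MODE_FLOAT)
    return mode - MIN_MODE_FLOAT;
  if (MIN_MODE_VECTOR_FLOAT <= mode && mode <= MAX_MODE_VECTOR_FLOAT)
    return mode - MIN_MODE_VECTOR_FLOAT
	   + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1;
  return -1;
}

/* Mask for a mode using the corrected index; 0 for non-FP modes.
   The assert documents the assumption that all bits fit in 32 bits.  */
static uint32_t
approx_mode_mask (int mode)
{
  int bit = bit_index_corrected (mode);
  if (bit < 0)
    return 0;
  assert (bit < 32);
  return (uint32_t) 1 << bit;
}
```

With these stand-in values, the posted form puts the first vector mode at
bit 34, which does not fit in a 32-bit mask, while the corrected form puts
it at bit 4, directly after the four scalar modes.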

-- 
Evandro Menezes