# [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

Richard Biener richard.guenther@gmail.com
Wed May 13 11:32:00 GMT 2015

On Fri, May 8, 2015 at 5:09 PM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> On 08/05/15 14:56, Kyrill Tkachov wrote:
>>
>> On 08/05/15 11:18, Richard Biener wrote:
>>>
>>> On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow
>>>> (x,
>>>> (int)k + 0.5)
>>>> using square roots. So, for the above examples it would generate sqrt
>>>> (x) *
>>>> sqrt (sqrt (x)),
>>>> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
>>>> will calculate the
>>>> reciprocal of that).
>>>>
>>>> However, the implementation of these optimisations is done on a bit of
>>>> an
>>>> the 0.25, 0.5, 0.75 cases hardcoded.
>>>> Judging by
>>>>
>>>> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
>>>> these are the most commonly used exponents (at least in SPEC ;))
>>>>
>>>> This patch generalises this optimisation into a (hopefully) more robust
>>>> algorithm.
>>>> In particular, it expands calls to pow (x, CST) by expanding the integer
>>>> part of CST
>>>> using a powi, like it does already, and then expanding the fractional
>>>> part
>>>> as a product
>>>> of repeated applications of a square root if the fractional part can be
>>>> expressed
>>>> as a multiple of a power of 0.5.
>>>>
>>>> I try to explain the algorithm in more detail in the comments in the
>>>> patch
>>>> but, for example:
>>>>
>>>> pow (x, 5.625) is not currently handled, but with this patch will be
>>>> expanded
>>>> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0
>>>> +
>>>> 0.5 + 0.5**3
>>>>
>>>> Negative exponents are handled in either of two ways, depending on the
>>>> exponent value:
>>>> * Using a simple reciprocal.
>>>>     For example:
>>>>     pow (x, -5.625) == 1.0 / pow (x, 5.625)
>>>>       --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>>>>
>>>> * For pow (x, EXP) with negative exponent EXP with integer part INT and
>>>> fractional part FRAC:
>>>> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>>>>     For example:
>>>>     pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
>>>>       --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>>>>
>>>>
>>>> Since hardware square root instructions tend to be expensive, we may
>>>> want to
>>>> reduce the number
>>>> of square roots we are willing to calculate. Since we reuse intermediate
>>>> square root results,
>>>> this boils down to restricting the depth of the square root chains. In
>>>> all
>>>> the examples above
>>>> that depth is 3. I've made this maximum depth parametrisable in
>>>> params.def.
>>>> parameter we can adjust the resolution of this optimisation. So, if it's
>>>> set
>>>> to '4' then we
>>>> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
>>>> including negative
>>>> multiples. Currently, GCC will not try to expand negative multiples of
>>>> anything else than 0.5
>>>>
>>>> I have tried to keep the existing functionality intact and activate this
>>>> only for
>>>> -funsafe-math-optimizations and only when the target has a sqrt
>>>> instruction.
>>>>    An exception to that is pow (x, 0.5) which we prefer to transform to
>>>> sqrt
>>>> even
>>>> when a hardware sqrt is not available, presumably because the library
>>>> function for
>>>> sqrt is usually faster than pow (?).
>>>
>>> Yes.  It's also a safe transform - which you seem to put under
>>> flag_unsafe_math_optimizations only with your patch.
>>>
>>> It would be clearer to just leave the special-case
>>>
>>> -  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
>>> -     unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
>>> -     sqrt(-0) = -0.  */
>>> -  if (sqrtfn
>>> -      && REAL_VALUES_EQUAL (c, dconsthalf)
>>> -      && !HONOR_SIGNED_ZEROS (mode))
>>> -    return build_and_insert_call (gsi, loc, sqrtfn, arg0);
>>>
>>> in as-is.
>>
>> Ok, I'll leave that case explicit.
>>
>>> You also removed the Os constraint which you should put back in.
>>> Basically if !optimize_function_for_speed_p then generate at most
>>> two calls to sqrt (iff the HW has a sqrt instruction).
>>
>> I tried to move that logic into expand_with_sqrts but
>> I'll move it outside it. It seems that this boils down to
>> only 0.25, as any other 2xsqrt chain will also involve a
>> multiply or a divide which we currently avoid.
>>
>>> You fail to add a testcase that checks that the optimization applies.
>>
>> I'll add one to scan the sincos dump.
>> I notice that we don't have a testuite check that the target has
>> a hw sqrt instructions. Would you like me to add one? Or can I make
>> the testcase aarch64-specific?

Would be great to have a testsuite check for this.

>>
>>> Otherwise the idea looks good though there must be a better way
>>> to compute the series than by using real-arithmetic and forcefully
>>> trying out all possibilities...
>>
>> I get that feeling too. What I need is not only a way
>> of figuring out if the fractional part of the exponent can be
>> represented in this way, but also compute the depth of the
>> sqrt chain and the number of multiplies...
>> That being said, the current approach is O(maximum depth) and
>> I don't expect the depth to go much beyond 3 or 4 in practice.
>>
>> Thanks for looking at it!
>> I'll respin the patch.
>
>
> And here it is, with my above comments implemented.
> Bootstrapped on x86_64 and tested on aarch64.
> Full testing on arm and aarch64 ongoing.
>
> Is this ok if testing comes clean?

Ok.

Thanks,
Richard.

> Thanks,
> Kyrill
>
>
>> Kyrill
>>
>>> Richard.
>>>
>>>> Having seen the glibc implementation of a fully IEEE-754-compliant pow
>>>> function, I think we
>>>> would prefer synthesising the pow call whenever we can for -ffast-math.
>>>>
>>>> I have seen this optimisation trigger a few times in SPEC2k6, in
>>>> particular
>>>> in 447.dealII
>>>> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125)
>>>> and
>>>> pow (x, 0.875)
>>>> with square roots, multiplies and, in the case of -0.25, divides.
>>>> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>>>>
>>>> On 481.wrf on aarch64 I saw about a 1% improvement.
>>>> The cycle count on x86_64 was also smaller, but not by a significant
>>>> amount
>>>> (the same calls to
>>>> pow were eliminated).
>>>>
>>>> In general, I think this can shine if multiple expandable calls to pow
>>>> appear together.
>>>> So, for example for code:
>>>> double
>>>> baz (double a)
>>>> {
>>>>     return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) -
>>>> __builtin_pow
>>>> (a, 3.375);
>>>> }
>>>>
>>>> we can generate:
>>>> baz:
>>>>           fsqrt   d3, d0
>>>>           fmul    d4, d0, d0
>>>>           fmov    d5, 1.0e+0
>>>>           fmul    d6, d0, d4
>>>>           fsqrt   d2, d3
>>>>           fmul    d1, d0, d2
>>>>           fsqrt   d0, d2
>>>>           fmul    d3, d3, d2
>>>>           fdiv    d1, d5, d1
>>>>           fmul    d3, d3, d6
>>>>           fmul    d2, d2, d0
>>>>           fmadd   d0, d4, d3, d1
>>>>           fmsub   d0, d6, d2, d0
>>>>           ret
>>>>
>>>> reusing the sqrt results and doing more optimisations rather than the
>>>> current:
>>>> baz:
>>>>           stp     x29, x30, [sp, -48]!
>>>>           fmov    d1, -1.25e+0
>>>>           stp     d8, d9, [sp, 16]
>>>>           fmov    d9, d0
>>>>           str     d10, [sp, 32]
>>>>           bl      pow
>>>>           fmov    d8, d0
>>>>           fmov    d0, d9
>>>>           fmov    d1, 5.75e+0
>>>>           bl      pow
>>>>           fmov    d10, d0
>>>>           fmov    d0, d9
>>>>           fmov    d1, 3.375e+0
>>>>           bl      pow
>>>>           ldr     d10, [sp, 32]
>>>>           fsub    d0, d8, d0
>>>>           ldp     d8, d9, [sp, 16]
>>>>           ldp     x29, x30, [sp], 48
>>>>           ret
>>>>
>>>>
>>>> Of course gcc could already do that if the exponents were to fall in the
>>>> set
>>>> {0.25, 0.75, k+0.5}
>>>> but with this patch that set can be greatly expanded.
>>>>
>>>> I suppose if we're really lucky we might even open up new vectorisation
>>>> opportunities.
>>>> For example:
>>>> void
>>>> vecfoo (float *a, float *b)
>>>> {
>>>>     for (int i = 0; i < N; i++)
>>>>       a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i],
>>>> 3.625);
>>>> }
>>>>
>>>> will now be vectorisable if we have a hardware vector sqrt instruction.
>>>> Though I'm not sure
>>>> how often this would appear in real code.
>>>>
>>>>
>>>> I would like advice on some implementation details:
>>>> - The new function representable_as_half_series_p is used to check if
>>>> the
>>>> fractional
>>>> part of an exponent can be represented as a multiple of a power of 0.5.
>>>> It
>>>> does so
>>>> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there
>>>> is a
>>>> smarter
>>>> way of doing this, considering that REAL_VALUE_TYPE holds the bit
>>>> representation of the
>>>> floating point number?
>>>>
>>>> - Are there any correctness cases that I may have missed? This patch
>>>> gates
>>>> the optimisation
>>>> on -funsafe-math-optimizations, but maybe there are some edge cases that
>>>> I
>>>> missed?
>>>>
>>>> - What should be the default value of the max-pow-sqrt-depth param? In
>>>> this
>>>> patch it's 5 which
>>>> on second thought strikes me as a bit aggressive. To catch exponents
>>>> that
>>>> are multiples of
>>>> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I
>>>> suppose it depends on
>>>> how likely more fine-grained powers are to appear in real code, how
>>>> expensive the C library
>>>> implementation of pow is, and how expensive are the sqrt instructions in
>>>> hardware.
>>>>
>>>>
>>>> Bootstrapped and tested on x86_64, aarch64, arm (pending)
>>>> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase
>>>> that
>>>> might
>>>> be of interest to look at?
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>       * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>>>>       * tree-ssa-math-opts.c: Include params.h
>>>>       (pow_synth_sqrt_info): New struct.
>>>>       (representable_as_half_series_p): New function.
>>>>       (get_fn_chain): Likewise.
>>>>       (print_nested_fn): Likewise.
>>>>       (dump_fractional_sqrt_sequence): Likewise.
>>>>       (dump_integer_part): Likewise.
>>>>       (expand_pow_as_sqrts): Likewise.
>>>>       (gimple_expand_builtin_pow): Use above to attempt to expand
>>>>       pow as series of square roots.  Removed now unused variables.
>>>>
>>>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>       * gcc.dg/pow-sqrt.x: New file.
>>>>       * gcc.dg/pow-sqrt-1.c: New test.
>>>>       * gcc.dg/pow-sqrt-2.c: Likewise.
>>>>       * gcc.dg/pow-sqrt-3.c: Likewise.
>
>