This is the mail archive of the
`gcc-patches@gcc.gnu.org`
mailing list for the GCC project.

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |

Other format: | [Raw text] |

*From*: Kyrill Tkachov <kyrylo dot tkachov at foss dot arm dot com>*To*: Richard Biener <richard dot guenther at gmail dot com>*Cc*: GCC Patches <gcc-patches at gcc dot gnu dot org>*Date*: Fri, 08 May 2015 14:56:54 +0100*Subject*: Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible*Authentication-results*: sourceware.org; auth=none*References*: <5543A3AB dot 1090309 at foss dot arm dot com> <CAFiYyc3uz9-m0g5FB2YPKy5Vef-mgS7PJ5qUtsvw8=0VwzenZQ at mail dot gmail dot com>

On 08/05/15 11:18, Richard Biener wrote:

On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:Hi all, GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x, (int)k + 0.5) using square roots. So, for the above examples it would generate sqrt (x) * sqrt (sqrt (x)), sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it will calculate the reciprocal of that). However, the implementation of these optimisations is done on a bit of an ad-hoc basis with the 0.25, 0.5, 0.75 cases hardcoded. Judging by https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf these are the most commonly used exponents (at least in SPEC ;)) This patch generalises this optimisation into a (hopefully) more robust algorithm. In particular, it expands calls to pow (x, CST) by expanding the integer part of CST using a powi, like it does already, and then expanding the fractional part as a product of repeated applications of a square root if the fractional part can be expressed as a multiple of a power of 0.5. I try to explain the algorithm in more detail in the comments in the patch but, for example: pow (x, 5.625) is not currently handled, but with this patch will be expanded to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 + 0.5 + 0.5**3 Negative exponents are handled in either of two ways, depending on the exponent value: * Using a simple reciprocal. For example: pow (x, -5.625) == 1.0 / pow (x, 5.625) --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x)))) * For pow (x, EXP) with negative exponent EXP with integer part INT and fractional part FRAC: pow (1.0 - FRAC) / powi (ceil (abs (EXP))). For example: pow (x, -5.875) == pow (x, 0.125) / powi (X, 6) --> sqrt (sqrt (sqrt (x))) / (powi (x, 6)) Since hardware square root instructions tend to be expensive, we may want to reduce the number of square roots we are willing to calculate. Since we reuse intermediate square root results, this boils down to restricting the depth of the square root chains. In all the examples above that depth is 3. I've made this maximum depth parametrisable in params.def. By adjusting that parameter we can adjust the resolution of this optimisation. So, if it's set to '4' then we will synthesize every exponent that is a multiple of 0.5**4 == 0.0625, including negative multiples. Currently, GCC will not try to expand negative multiples of anything else than 0.5 I have tried to keep the existing functionality intact and activate this only for -funsafe-math-optimizations and only when the target has a sqrt instruction. An exception to that is pow (x, 0.5) which we prefer to transform to sqrt even when a hardware sqrt is not available, presumably because the library function for sqrt is usually faster than pow (?).Yes. It's also a safe transform - which you seem to put under flag_unsafe_math_optimizations only with your patch. It would be clearer to just leave the special-case - /* Optimize pow(x,0.5) = sqrt(x). This replacement is always safe - unless signed zeros must be maintained. pow(-0,0.5) = +0, while - sqrt(-0) = -0. */ - if (sqrtfn - && REAL_VALUES_EQUAL (c, dconsthalf) - && !HONOR_SIGNED_ZEROS (mode)) - return build_and_insert_call (gsi, loc, sqrtfn, arg0); in as-is.

Ok, I'll leave that case explicit.

You also removed the Os constraint which you should put back in. Basically if !optimize_function_for_speed_p then generate at most two calls to sqrt (iff the HW has a sqrt instruction).

I tried to move that logic into expand_with_sqrts but I'll move it outside it. It seems that this boils down to only 0.25, as any other 2xsqrt chain will also involve a multiply or a divide which we currently avoid.

You fail to add a testcase that checks that the optimization applies.

I'll add one to scan the sincos dump. I notice that we don't have a testuite check that the target has a hw sqrt instructions. Would you like me to add one? Or can I make the testcase aarch64-specific?

Otherwise the idea looks good though there must be a better way to compute the series than by using real-arithmetic and forcefully trying out all possibilities...

I get that feeling too. What I need is not only a way of figuring out if the fractional part of the exponent can be represented in this way, but also compute the depth of the sqrt chain and the number of multiplies... That being said, the current approach is O(maximum depth) and I don't expect the depth to go much beyond 3 or 4 in practice. Thanks for looking at it! I'll respin the patch. Kyrill

Richard.Having seen the glibc implementation of a fully IEEE-754-compliant pow function, I think we would prefer synthesising the pow call whenever we can for -ffast-math. I have seen this optimisation trigger a few times in SPEC2k6, in particular in 447.dealII and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and pow (x, 0.875) with square roots, multiplies and, in the case of -0.25, divides. On 481.wrf I saw it remove a total of 22 out of 322 calls to pow On 481.wrf on aarch64 I saw about a 1% improvement. The cycle count on x86_64 was also smaller, but not by a significant amount (the same calls to pow were eliminated). In general, I think this can shine if multiple expandable calls to pow appear together. So, for example for code: double baz (double a) { return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375); } we can generate: baz: fsqrt d3, d0 fmul d4, d0, d0 fmov d5, 1.0e+0 fmul d6, d0, d4 fsqrt d2, d3 fmul d1, d0, d2 fsqrt d0, d2 fmul d3, d3, d2 fdiv d1, d5, d1 fmul d3, d3, d6 fmul d2, d2, d0 fmadd d0, d4, d3, d1 fmsub d0, d6, d2, d0 ret reusing the sqrt results and doing more optimisations rather than the current: baz: stp x29, x30, [sp, -48]! fmov d1, -1.25e+0 add x29, sp, 0 stp d8, d9, [sp, 16] fmov d9, d0 str d10, [sp, 32] bl pow fmov d8, d0 fmov d0, d9 fmov d1, 5.75e+0 bl pow fmov d10, d0 fmov d0, d9 fmov d1, 3.375e+0 bl pow fadd d8, d8, d10 ldr d10, [sp, 32] fsub d0, d8, d0 ldp d8, d9, [sp, 16] ldp x29, x30, [sp], 48 ret Of course gcc could already do that if the exponents were to fall in the set {0.25, 0.75, k+0.5} but with this patch that set can be greatly expanded. I suppose if we're really lucky we might even open up new vectorisation opportunities. For example: void vecfoo (float *a, float *b) { for (int i = 0; i < N; i++) a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625); } will now be vectorisable if we have a hardware vector sqrt instruction. Though I'm not sure how often this would appear in real code. I would like advice on some implementation details: - The new function representable_as_half_series_p is used to check if the fractional part of an exponent can be represented as a multiple of a power of 0.5. It does so by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a smarter way of doing this, considering that REAL_VALUE_TYPE holds the bit representation of the floating point number? - Are there any correctness cases that I may have missed? This patch gates the optimisation on -funsafe-math-optimizations, but maybe there are some edge cases that I missed? - What should be the default value of the max-pow-sqrt-depth param? In this patch it's 5 which on second thought strikes me as a bit aggressive. To catch exponents that are multiples of 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I suppose it depends on how likely more fine-grained powers are to appear in real code, how expensive the C library implementation of pow is, and how expensive are the sqrt instructions in hardware. Bootstrapped and tested on x86_64, aarch64, arm (pending) SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that might be of interest to look at? Thanks, Kyrill 2015-05-01 Kyrylo Tkachov <kyrylo.tkachov@arm.com> * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param. * tree-ssa-math-opts.c: Include params.h (pow_synth_sqrt_info): New struct. (representable_as_half_series_p): New function. (get_fn_chain): Likewise. (print_nested_fn): Likewise. (dump_fractional_sqrt_sequence): Likewise. (dump_integer_part): Likewise. (expand_pow_as_sqrts): Likewise. (gimple_expand_builtin_pow): Use above to attempt to expand pow as series of square roots. Removed now unused variables. 2015-05-01 Kyrylo Tkachov <kyrylo.tkachov@arm.com> * gcc.dg/pow-sqrt.x: New file. * gcc.dg/pow-sqrt-1.c: New test. * gcc.dg/pow-sqrt-2.c: Likewise. * gcc.dg/pow-sqrt-3.c: Likewise.

**Follow-Ups**:

**References**:**[PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible***From:*Kyrill Tkachov

**Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible***From:*Richard Biener

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |