This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH/RFC], PR 42694/middle-end, make pow faster for some constant exponents


On Tue, Jan 26, 2010 at 8:14 PM, Michael Meissner
<meissner@linux.vnet.ibm.com> wrote:
> On Tue, Jan 26, 2010 at 11:23:01AM +0100, Richard Guenther wrote:
>> On Tue, Jan 26, 2010 at 1:57 AM, Michael Meissner
>> <meissner@linux.vnet.ibm.com> wrote:
>> > I would like to get some comments on this patch that I plan to submit to GCC
>> > 4.6 when it is opened up.
>> >
>> > In looking at the various spec 2006 runs, I noticed one bencmark (bwaves) was
>> > spending a lot of time in the pow function. ?I looked at the source, and saw
>> > that all of the pow's came from the use of x**0.75 in the Fortran source (x**2
>> > having already been optimized out by the compiler).
>> >
>> > This patch changes the compiler so that under -ffast-math:
>>
>> When not optimizing for size only.
>
> Yes. ?Earlier versions of the patch had -Os support, and I can add that back
> in.
>
>> > ? ? pow (x, 0.25) ? ? ?replaced by sqrt (sqrt (x))
>>
>> should be always profitable if an optab for sqrt exists
>
> Yep. ?Note current behavior is to always convert sqrt(sqrt(x)) into
> pow(x,.25) for -ffast-math.

Yes, but just because the expander lacks the back-conversion
(so I think a minimal patch should make sure to at least do
the back-transforms for the cases we turn into pow).

>> > ? ? pow (x, 0.125) ? ? replaced by sqrt (sqrt (sqrt (x)))
>>
>> I'm not sure.
>
> On various x86's and power systems that I tested, 3 sqrts are faster than a
> pow. ?4 sqrts is still a win, though the question is at what point do you worry
> about the accuracy.

Indeed.  It's definitely more accuracy lost than a rounding per
call - the error increases quadratically.  Sill might be common
enough to be worth optimizing.

>> > ? ? pow (x, 0.75) ? ? ?replaced by sqrt (x) * sqrt (sqrt (x))
>>
>> always profitable
>
> Yep.
>
>> > ? ? pow (x, 1./6.) ? ? replaced by cbrt (sqrt (x))
>>
>> I'm not sure.
>
> On the systems I checked, cbrt/sqrt is faster than pow (x, 1/6.), but cbrt/cbrt
> is slower.

I see.  As it is library calls I guess we need to be extra careful (also
because of accuracy issues - cbrt is probably less  used and tested
than pow - it reminds me of the issues with darwin and expanding
sincos as a call to cexp).

>> Did all these happen in bwaves?
>
> bwaves only does x**0.75 and x**2. ?So if we were just wanting to optimize for
> bwaves, we only need 0.75. ?I do think sqrt/sqrt should be done also. ?I don't
> know how frequently people do x**0.125, etc. ?I was just trying to be general.

Did you check what cases XLC expands specially?

>> > And have the compiler stop replacing sqrt (sqrt (x)), sqrt (cbrt (x)), and
>> > cbrt (sqrt (x)) with equivalent pow calls.
>>
>> These are for canonicalization, so please retain them.
>
> The patches do replace sqrt(sqrt(x)) with pow at the tree level
> (i.e. fold_builtin_sqrt), but the rtl expansion (expand_builtin_pow) creates
> the sqrt/sqrt for pow(x,0.25). ?Note, earlier versions of my patches did do the
> conversion at the tree level, and I found the vectorizer was able to vectorize
> sqrt(sqrt(x)) where it wasn't able to vectorize pow(x, 0.25). ?However, I think
> it should be easy to teach the vectorizer to do this transformation.

Yes, it already handles pow(x,0.5) specially - it should be easy to add
the other cases (though I can see that expading things early also makes
sense, like we do bswap and sincos detection before loop optimizers).

So I would say a lot of the expansions we do during expand should be
done before loop optimizers as they also have impact on code size
metrics.  No need to do that in your patch though (but worth to think
about when trying to teach the vectorizer of the cases).

>> For exposing CSE opportunities this kind of transformations should
>> probably be done earlier than expansion while still in gimple
>> (look at tree-ssa-math-opts.c where we for example do some
>> transformations only if there are CSE opportunities with other
>> expressions).
>>
>> > I added switches to control this, but I could just as easily go with --param or
>> > target hooks.
>>
>> I'd say avoid all of them by applying common sense, if at all
>> use target hooks (I suppose a generic builtin_cost that returns
>> values that can be compared with rtx_cost would be useful).
>
> Yep. ?I've gone back and forth, between target hooks, --param values, and
> switches. ?Different machines have different costs (pow is more expensive on
> power hardware than x86) and whether we want to add knobs is one of the reasons
> I posted this.
>
> I think that the following two must be done:
> ? ? ? ?pow (x, 0.25) ? -> sqrt (sqrt (x))
> ? ? ? ?pow (x, 0.75) ? -> sqrt (x) * sqrt (sqrt (x))
>
> I can go either way whether:
> ? ? ? ?pow (x, 1./6.) ?-> cbrt (sqrt (x))

I think avoiding the cases below is good if there are no CSE opportunities
of the intermediate values.  The precision loss may be significant already.
CSE opportunity detection might fit what we do in the cse_sincos pass.

So - the patch is ok with the switches removed and only the above
transformations
unconditionally retained.

Thanks,
Richard.

> ? ? ? ?pow (x, 1./8.) ?-> sqrt (sqrt (sqrt (x)))
> ? ? ? ?pow (x, 1./16.) -> sqrt (sqrt (sqrt (sqrt (x)))
>
> are done or not. ?It would be useful to know if programmers use these values or
> not. ?I was just trying to be general in the code, and allow for people to tune
> it one way or another.
>
>> > I looked at the differences between what the compiler is doing now, and with
>> > the optimization, and I feel they are not out of bounds (in the range of what
>> > the current -ffast-math optimizations does already), but I would welcome
>> > comments of how many sqrts/cbrts should be generated before the error gets out
>> > of hand.
>>
>> We already do unbound expansion of integer powers, so I guess
>> we don't care (with -ffast-math).
>
> Yep, definately -ffast-math only.
>
>> > In general, on x86 and power systems, sqrt is an order of magntude faster than
>> > pow, and cbrt is about 1.5 times faster.
>>
>> Does power have a cbrt instruction or is it the libcall that is faster?
>
> No it does not have a cbrt instruction, but the special purpose cbrt library
> call is faster than the general purpose pow call. ?On power6 with cpu tuned
> libraries, a pow takes about 1.6 times as long as a cbrt. ?On my core2 laptop,
> it is also 1.6 times. ?On other machines, it is a little slower.
>
> --
> Michael Meissner, IBM
> 4 Technology Place Drive, MS 2203A, Westford, MA, 01886, USA
> meissner@linux.vnet.ibm.com
>


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]