This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: GCC and fast evaluation of polynomials
- From: "Timothy J. Wood" <tjw at omnigroup dot com>
- To: Andreas Svrcek-Seiler <svrci at tbi dot univie dot ac dot at>
- Cc: <gcc at gcc dot gnu dot org>
- Date: Mon, 19 Nov 2001 10:23:08 -0800
- Subject: Re: GCC and fast evaluation of polynomials
This formulation of the series approximation to log() completely
serializes the execution of your code (each step cannot be computed
until the previous has finished due to dependence on 'v').
Instead, you can compute two 'halves' of the sum in parallel.
First, rewrite your expression as the sum of one polynomial with
only even powers and one with only odd powers. Then, factor a power of x
out of the odd-power side (since its lowest power is 1), leaving you with
two independent expressions involving only even powers.
Finally, start your two partial sums, one with the zeroth-power
coefficient and one with the coefficient of what was the first power
before the factoring above, and then accumulate the rest of the terms
in parallel.
An example might help here, say we have:
y = a7*x^7 ... + a1*x + a0
Rewrite this as:
y = yEven + yOdd
yEven = a6*x^6 + a4*x^4 ... + a0
yOdd = a7*x^7 + a5*x^5 ... + a1*x
Factor yOdd, leaving:
y = yEven + x * yOdd;
yEven = a6*x^6 + a4*x^4 ... + a0
yOdd = a7*x^6 + a5*x^4 ... + a1
Then, compute yEven and yOdd using Horner's scheme in x^2.
Once you have the two halves, the final sum is just another
multiply-accumulate operation.
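For concreteness, here is a minimal C sketch of that split for the
degree-7 example above (the function name and the explicit a0..a7
arguments are just illustrative; the original log() series would use
the B0..B5 coefficients instead):

  /* y = a7*x^7 + ... + a1*x + a0, evaluated as two independent
     Horner chains in x^2 plus one final multiply-accumulate. */
  double poly7_split(double x,
                     double a0, double a1, double a2, double a3,
                     double a4, double a5, double a6, double a7)
  {
      double x2 = x * x;

      /* Even powers: a6*x^6 + a4*x^4 + a2*x^2 + a0 */
      double yEven = ((a6 * x2 + a4) * x2 + a2) * x2 + a0;

      /* Odd powers with x factored out: a7*x^6 + a5*x^4 + a3*x^2 + a1 */
      double yOdd  = ((a7 * x2 + a5) * x2 + a3) * x2 + a1;

      /* The only synchronization point. */
      return yEven + x * yOdd;
  }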
The advantage to this approach is that you free up the compiler to
overlap the two independent instruction streams (and of course we are
hoping the compiler is smart enough to do this on CPUs where multiple
instructions can be in flight). The only point of synchronization is
at the very end, for the final sum.
This approach could be extended to an x^{3,4,...} scheme if you had a
lot of terms and a CPU that would have FPU stalls with just the
x^2 version above. At that point, though, you should probably be using
some sort of vector unit :)
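Purely as an illustration of what such a wider split might look like
(again, the names here are mine, not from the original mail), a
four-way split in x^4 of the same degree-7 polynomial would be:

  /* Four independent partial sums in x^4, one per exponent mod 4. */
  double poly7_split4(double x,
                      double a0, double a1, double a2, double a3,
                      double a4, double a5, double a6, double a7)
  {
      double x2 = x * x;
      double x3 = x2 * x;
      double x4 = x2 * x2;

      double y0 = a4 * x4 + a0;   /* powers 0 and 4                    */
      double y1 = a5 * x4 + a1;   /* powers 1 and 5, x   factored out  */
      double y2 = a6 * x4 + a2;   /* powers 2 and 6, x^2 factored out  */
      double y3 = a7 * x4 + a3;   /* powers 3 and 7, x^3 factored out  */

      /* Recombine the four chains. */
      return y0 + x * y1 + x2 * y2 + x3 * y3;
  }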
Bringing this back to something relevant to this list: it would be
nice for the compiler to be able to do this in the -ffast-math case, but
I'm not 100% familiar with all the precision-loss issues, so please don't
flame me (duck!)
-tim
On Monday, November 19, 2001, at 04:48 AM, Andreas Svrcek-Seiler wrote:
> Dear developers,
>
> I've got an inner loop
> inside which I evaluate a polynomial.
> If I do it naively according to Horner's scheme
> (replacing a log() call, that's even more time-consuming):
> u2 = u*u;
> v = (((((B5*u2+B4)*u2+B3)*u2+B2)*u2+B1)*u2+B0)*u;
>
> the compiled executable is significantly slower than
> when the source reads
> v = B5*u2 + B4;
> v = v*u2 + B3;
> v = v*u2 + B2;
> v = v*u2 + B1;
> v = v*u2 + B0;
> v *= u;
>
> This behavior is independent of how many optimizations I turn on.
> By the way, the newly released Intel compiler produces even slower
> code ;-)
>
> ... maybe there's room for easy improvement ?
>
> best regards
> andreas
> --
> Wolfgang Andreas Svrcek-Seiler (godzilla)
> svrci@tbi.univie.ac.at
> Tel.: 01-4277-52733
>