This is the mail archive of the mailing list for the GCC project.


Re: GCC and fast evaluation of polynomials

   This formulation of the series approximation to log() completely 
serializes the execution of your code (each step cannot be computed 
until the previous has finished due to dependence on 'v').

   Instead, you can compute two 'halves' of the sum in parallel.

   First, rewrite your expression as the sum of one polynomial with 
only even powers and one with only odd powers.  Then, factor a power out 
of the odd-power side (since its lowest power is 1), leaving you with 
two independent expressions involving only even powers.

   Finally, start your two partial sums: one with the zeroth-power 
coefficient, the other with the coefficient of what was the first-power 
term before the factoring above, and then accumulate the rest of the 
terms in parallel.

   An example might help here, say we have:

   y = a7*x^7 ... + a1*x + a0

   Rewrite this as:

   y = yEven + yOdd
   yEven = a6*x^6 + a4*x^4 ... + a0
   yOdd = a7*x^7 + a5*x^5 ... + a1*x

   Factor x out of yOdd, leaving:

   y = yEven + x * yOdd;
   yEven = a6*x^6 + a4*x^4 ... + a0
   yOdd = a7*x^6 + a5*x^4 ... + a1

   Then, compute yEven and yOdd using Horner's scheme in x^2.

   Once you have the two halves, the final sum is just another 
multiply-accumulate operation.

   The advantage to this approach is that you free up the compiler to 
overlap the two independent instruction streams (and of course we are 
hoping the compiler is smart enough to do this on CPUs where multiple 
instructions can be in flight).  The only point of synchronization is 
the very ending bit for the final sum.

    This approach could be extended to an x^3, x^4, ... scheme if you 
had a lot of terms and a CPU that would have FPU stalls with just the 
x^2 version above.  At that point though, you should probably be using 
some sort of vector unit :)

   Bringing this back to something relevant to this list, it would be 
nice for the compiler to be able to do this in the -ffast-math case, but 
I'm not 100% familiar with all the precision loss issues so please don't 
flame me (duck!)


On Monday, November 19, 2001, at 04:48  AM, Andreas Svrcek-Seiler wrote:

> Dear developers,
> I've got an inner loop
> inside which I evaluate a polynomial
> (replacing a log() call, that's even more time-consuming).
> If I do it naively according to Horner's scheme:
> 	      u2 = u*u;
> 	      v = (((((B5*u2+B4)*u2+B3)*u2+B2)*u2+B1)*u2+B0)*u;
> the compiled executable is significantly slower than
> when the source reads
> 	      v = B5*u2 + B4;
> 	      v = v*u2 + B3;
> 	      v = v*u2 + B2;
> 	      v = v*u2 + B1;
> 	      v = v*u2 + B0;
> 	      v *= u;
> this behavior is independent of how much optimization I turn on.
> By the way, the newly released Intel compiler produces even slower 
> code ;-)
> ... maybe there's room for easy improvement ?
> best regards
> andreas
> --
>             )))))
>             (((((
>            ( O O )
> -------oOOO--(_)--OOOo-----------------------------------------------------
>               o        Wolfgang Andreas Svrcek-Seiler
>               o        (godzilla)
>       .oooO            Tel.:01-4277-52733
>       (   )   Oooo.
> -------\ (----(   )--------------------------------------------------------
>         \_)    ) /
>               (_/
