Re: weird optimization in sin+cos, x86 backend


On Tue, Feb 14, 2012 at 7:54 PM, Christoph Lauter
<christoph.lauter@lip6.fr> wrote:
> Hello,
>
> first of all, let me apologize for my late answer to this very exciting
> email thread.
>
> As pointed out several times, the current libm suffers from several
> disadvantages:
>
> * The current libm code is a mix of code coming from different sources,
> with tables generated by different people. The knowledge of how to maintain
> this code (in the sense of extending it) has largely been lost.
>
> * The current libm is the result of a difficult compromise between speed,
> size, accuracy and architecture-dependent optimization. It is principally
> targeted at general-purpose use, with some optimizations. These
> optimizations, such as the one for speed in sincos, may fit the
> expectations of one group of users and not those of others.
> There seems to be some consensus that in an ideal world, all libm functions
> should exist in different flavors (latency- or throughput-optimized,
> vectorized or not, with different accuracy levels up to correctly rounded,
> etc.).
>
> * Any libm should actually be regularly performance-optimized for the
> different hardware architectures. This is tedious and, as far as I know,
> has not often been done for the GNU libm.
>
> So a re-design of the current libm seems to be desirable.
>
> crlibm has been proposed as a good starting point for such a new and
> optimized libm. As one of the people who contributed to crlibm, I'd say
> that it actually isn't, and that we can do much better:
>
> * crlibm has proven that correct rounding of (univariate) libm functions
> is possible and that the performance penalty is negligible. Correct
> rounding of libm functions yields perfect portability, with bitwise
> comparable results.
> However, correct rounding is not a panacea for all libm problems. There
> is a huge number of libm use cases where people actually do want to
> sacrifice a certain amount of accuracy (and portability) for speed.
> A libm rewrite should, IMHO, address their needs, too.
>
> * Besides all the issues due to intellectual property transfer, crlibm was
> never designed as a one-to-one replacement for the standard libm:
>  - the library must be initialized, an operation which might change the
> state of the FPU on some architectures (x86);
>  - all functions come in 4 flavors, according to the rounding mode: e.g.
> exp_rn(), exp_rd(), exp_ru(), exp_rz(). Hard work would be required to
> make this fit with the more IEEE 754-2008/C99 view of what rounding modes
> are (i.e. modes in the environment); a sketch of such a bridge follows
> the quoted text below;
>  - the quality of the crlibm code has itself suffered from crlibm having
> been written by a number of people (with different coding conventions)
> and with different technologies for handling intermediate formats.
> In one word, crlibm has been a great place to do experiments and to learn
> how to do correct rounding without sacrificing performance. It will be
> hard to clean it up as the starting point for a new libm, though.
>
> The challenge of starting over with libm is hence quite a big one. We
> might also see it as an opportunity, though:
>
> * A well-structured design of the new library might bring better code
> maintainability and, first of all, better code documentation. IMHO, the
> libm sources should come with all the well-documented scripts that were
> used to generate the tables and approximation polynomials.
>
> * Recent advances in automatic code optimization for libms [1] and
> polynomial approximation [2] seem to indicate that a libm could be written
> in a fully parameterized way. That way, it would be possible
>  - to support different accuracy flavors, traded for speed,
>  - to support or neglect flag and errno handling as appropriate,
>  - to auto-tune the code for the different, ever-changing architectures,
>  - to propose latency- and throughput-optimized flavors,
>  - to go for correct rounding where it is possible and needed.
>
> * The new library would come with a freshly-developed test bench (also
> generated by scripts distributed along with the code) and might even be
> proven formally, at least partly.
>
> Hence, by instantiating the parameterization, it might actually be
> possible to fill in all cells of the hypercube accuracy * latency/throughput
> optimization * flag handling * etc. and to let the user choose the
> flavor suiting their needs by setting a couple of flags. A vanilla libm
> for -O0 would be a corollary of that project, of course.
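
To make the rounding-mode point above concrete: crlibm exposes one entry
point per rounding mode, while C99/IEEE 754-2008 expect the mode to be read
from the floating-point environment. Below is a minimal sketch of a bridge
between the two views; the four exp_* prototypes are crlibm's real entry
points, while exp_env() is hypothetical:

  #include <fenv.h>

  /* crlibm's four entry points for exp, one per rounding mode:
     to-nearest, downward, upward, toward zero. */
  double exp_rn(double x);
  double exp_rd(double x);
  double exp_ru(double x);
  double exp_rz(double x);

  /* Hypothetical C99-style wrapper: honour the rounding mode currently
     installed in the floating-point environment by dispatching to the
     matching crlibm flavor. */
  double exp_env(double x)
  {
      switch (fegetround()) {
      case FE_DOWNWARD:   return exp_rd(x);
      case FE_UPWARD:     return exp_ru(x);
      case FE_TOWARDZERO: return exp_rz(x);
      case FE_TONEAREST:
      default:            return exp_rn(x);
      }
  }

Note that fegetround() is not free, so a real libm would presumably want
to avoid this dispatch on hot paths.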

This is a very nice idea for the set of "correct" routines we want to provide.
The OS/CPU-specific glue around the generic code that deals with errno
setting, IEEE special-value handling (Inf/NaN) and FPU state modification
would probably be best kept out of the auto-generated part.

I'd also expect the "optimized" routines to be written in assembly (I'm not
sure how well the code generators can target vector machines).

Or do you expect that if you tune the "accuracy value" down, say, accept
10 eps instead of 4 eps, there will be a noticeable speedup?
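
For illustration, here is a sketch of what such an accuracy knob could look
like for a sin kernel on [-pi/4, pi/4]. Everything in it is hypothetical:
the coefficients are plain Taylor terms for readability, where a generated
libm would use minimax coefficients (e.g. from Sollya's fpminimax) together
with a rigorous error bound. The point is merely that a lower accuracy
target shortens the polynomial and hence the dependency chain; whether that
is noticeable depends on how much argument reduction and special-case
handling dominate the total cost.

  /* Hypothetical accuracy-parameterized sin kernel on [-pi/4, pi/4].
     Coefficients are illustrative Taylor terms, not production-quality
     minimax coefficients. */
  static double kernel_sin(double x)
  {
      double z = x * x;
  #if defined(SIN_FAST)
      /* Lower degree: shorter dependency chain, larger error. */
      double p = 1.0 / 362880.0;           /* 1/9!  */
      p = p * z - 1.0 / 5040.0;            /* 1/7!  */
      p = p * z + 1.0 / 120.0;             /* 1/5!  */
      p = p * z - 1.0 / 6.0;               /* 1/3!  */
  #else
      /* Higher degree: a few more multiply-adds, tighter error bound. */
      double p = -1.0 / 1307674368000.0;   /* 1/15! */
      p = p * z + 1.0 / 6227020800.0;      /* 1/13! */
      p = p * z - 1.0 / 39916800.0;        /* 1/11! */
      p = p * z + 1.0 / 362880.0;          /* 1/9!  */
      p = p * z - 1.0 / 5040.0;            /* 1/7!  */
      p = p * z + 1.0 / 120.0;             /* 1/5!  */
      p = p * z - 1.0 / 6.0;               /* 1/3!  */
  #endif
      return x + (x * z) * p;
  }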

> My colleagues Jean-Michel Muller and Florent de Dinechin at École Normale
> Supérieure de Lyon / INRIA / CNRS, and myself at Université Pierre et Marie
> Curie, Paris Sorbonne, have been thinking of starting such a project for
> quite a while. We've got everything we need to get started; a possible
> Ph.D. student to do the work is even in the pipeline. She could get
> started by September (if we get the financing in place).
>
> As a matter of course, we'd be more than happy to get your input (and even
> guidance w.r.t. copyright management and coding conventions) on and during
> that project.
> If (parts of) GNU or GCC could officially endorse it, we'd be even happier.
> This might also help with getting the financing.

License-wise, the code generators can use the GPL, but the generated code
should be (L)GPL plus an exception like the one we use for libgcc, to
allow static linking and re-optimizing into a proprietary binary using LTO.

Richard.

> Christoph Lauter
>
> also on behalf of Jean-Michel Muller and Florent de Dinechin
>
>
> References:
>
> [1] Automatic Generation of Fast and Certified Code for Polynomial
> Evaluation. Christophe Mouilleron and Guillaume Revy.
> In E. Antelo, D. Hough, and P. Ienne, editors, 20th IEEE Symposium on
> Computer Arithmetic (ARITH'20), pages 233-242, Tuebingen, Germany, 25-27
> July 2011.
>
> [2] Optimizing polynomials for floating-point implementation, Florent de
> Dinechin and Christoph Lauter, in Proceedings of the 8th Conference on Real
> Numbers and Computers, pages 7-16, Santiago de Compostela, Spain, July 2008.
>
>
> Joseph S. Myers wrote on 10/02/2012 14:50:
>
>> On Fri, 10 Feb 2012, Richard Guenther wrote:
>>
>>> I don't buy the argument that inlining math routines (apart from those
>>> we already handle) would improve performance. What will improve
>>> performance is to have separate entry points to the routines
>>> to skip errno handling, NaN/Inf checking or rounding mode selection
>>> when certain compilation flags are set. That as well as a more
>>> sane calling convention for, for example sincos, or in general
>>> on x86_64 (have at least _some_ callee-saved XMM registers).
>>
>> glibc has some extra entry points for -ffinite-math-only in 2.15.
>>
>>> The issue with libm in glibc here is that Drepper absolutely does
>>> not want new ABIs in libm - he believes that for example vectorized
>>> routines do not belong there (nor the SSE calling-convention variants
>>> for i686 I tried to push once).
>>
>> And this fits in with the general principle that glibc's libm is for
>> general-purpose use (balancing speed, size, accuracy etc.) and it makes
>> sense to have the extra variants in a separate library (that won't
>> necessarily be loaded into every program linked with libstdc++ or with a
>> computation of a square root that's not at all performance critical, for
>> example); being conservative about extra interfaces in a basic system
>> library does make sense.
>>
>> One key difference with the -ffinite-math-only entry points is that they
>> generally are just exported names for functions that already existed - the
>> code was already structured to do checks for errors / exceptional values
>> and then call the main function for the no-error case as needed (this
>> layering is sketched at the end of this message).
>>
>
>
> Vincent Lefevre wrote on 13/02/2012 15:59:
>
>> On 2012-02-09 15:49:37 +0000, Andrew Haley wrote:
>>> I'd start with INRIA's crlibm.
>>
>> A point I'd like to correct: GNU MPFR has mainly (> 95%) been
>> developed by researchers and engineers paid by INRIA. But this
>> is not the case for CRlibm. I don't know its copyright status
>> (apparently, mainly ENS Lyon, and the rights have not been
>> transferred to the FSF).
>>
>> Also, from what I've heard, CRlibm is more or less regarded as
>> dead, because there are new tools that do a better job, and new
>> functions could be generated in a few hours. I suppose a member
>> (not me since I don't work on these tools) or ex-member of our
>> team will contact some of you about this.
>
>
> --
> Christoph Lauter
> Maître de conférences - Associate Professor
> Équipe PEQUAN - LIP6 - UPMC Paris 6
> 4, place Jussieu, 75252 Paris Cedex 05, 26-00/301
> Tel.: +33144278029 / +33182521777
> http://www.christoph-lauter.org/
>

