This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: gcc will become the best optimizing x86 compiler


Michael Meissner wrote:
> On Fri, Jul 25, 2008 at 09:08:42AM +0200, Agner Fog wrote:
>> GNU libc could borrow a lot of optimized functions from OpenSolaris, Mac, and other open source projects. They look better than GNU libc, but there is still room for improvement. For example, OpenSolaris does not use XMM registers for strlen, although this is simpler than using general-purpose registers (see my code at www.agner.org/optimize/asmlib.zip).
>
> Note, glibc can only take code that is appropriately licensed and donated to
> the FSF. In addition it must meet the coding standards for glibc.

The Mac/Xnu and OpenSolaris projects have fairly liberal public licenses. If there are legal differences, maybe the copyright owner is open to negotiation. My own code is licensed under the GPL. The fact that I am offering my code to you also means, of course, that I am willing to grant the necessary license.

> Also note that what is fastest for the operation depends on the basic chip
> level (for example, using XMM registers is not faster on current AMD
> platforms).

Indeed. That's why I am talking about CPU dispatching (i.e., different code branches for different CPUs). CPU dispatching can be done with just a single jump instruction:
At the function entry there is an indirect jump through a code pointer to the appropriate version. The code pointer initially points to a CPU dispatcher. The first time the function is called, the dispatcher detects which CPU it is running on, replaces the code pointer with a pointer to the appropriate version, and then jumps through the pointer. The next time the function is called, it follows the pointer directly to the right version.
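
The dispatching scheme described above can be sketched in C roughly as follows. This is only an illustration, not code from asmlib or glibc: all names (my_strlen, strlen_ptr, strlen_dispatch, strlen_generic, strlen_sse2, cpu_has_sse2) are hypothetical, the "SSE2" version is a stand-in that simply calls the generic code, and CPU detection is assumed to go through GCC's __builtin_cpu_supports where that is available. In C the "single indirect jump" becomes a call through a function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; nothing here is taken from glibc or asmlib. */
typedef size_t (*strlen_fn)(const char *);

/* Portable byte-at-a-time version. */
static size_t strlen_generic(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Stand-in for an XMM-based version. A real one would use SSE2
   instructions; this one forwards to the generic code so the sketch
   stays portable. */
static size_t strlen_sse2(const char *s)
{
    return strlen_generic(s);
}

/* Assumption: rely on GCC's x86 CPU-detection builtins when present,
   and report "no SSE2" everywhere else. */
static int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;
#endif
}

static size_t strlen_dispatch(const char *s);

/* The code pointer initially points to the dispatcher. */
static strlen_fn strlen_ptr = strlen_dispatch;

/* Runs on the first call only: picks a version, overwrites the
   pointer, then continues into the chosen version. */
static size_t strlen_dispatch(const char *s)
{
    strlen_ptr = cpu_has_sse2() ? strlen_sse2 : strlen_generic;
    assert(strlen_ptr != strlen_dispatch); /* guard against endless recursion */
    return strlen_ptr(s);
}

/* Public entry point: one indirect call through the pointer. */
size_t my_strlen(const char *s)
{
    return strlen_ptr(s);
}
```

The first call to my_strlen goes through strlen_dispatch; every later call goes straight to the selected version. The sketch ignores thread safety when the pointer is overwritten, which a real library would have to address.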


My memcpy runs faster with XMM registers than with 64-bit x64 registers on AMD K8.
My strlen runs slower with XMM registers than with 64-bit x64 registers on AMD K8.


I expect the XMM versions to run much faster on AMD K10, because it has full 128-bit execution units and data paths, whereas K8 has only 64-bit ones. I have not had the chance to test this on AMD K10 yet.

I believe it is best to optimize for the newest processors, because the processor that is brand new today will become mainstream in a few years.

> Memcpy/memset optimizations were added to glibc 2.8, though when your favorite
> distribution will provide it is a different question:
> http://sourceware.org/ml/libc-alpha/2008-04/msg00050.html

I have libc version 2.7. I can't find version 2.8.

