This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: gcc will become the best optimizing x86 compiler
On Wednesday 30 July 2008 19:14, Agner Fog wrote:
> I agree that the OpenSolaris memcpy is bigger than necessary. However,
> it is necessary to have 16 branches for covering all possible alignments
> modulo 16. This is because, unfortunately, there is no XMM shift
> instruction with a variable count, only with a constant count, so we
> need one branch for each value of the shift count. Since only one of the
> branches is used, it doesn't take much space in the code cache. The
> speed is improved by a factor 4-5 by this 16-branch algorithm, so it is
> certainly worth the extra complexity.
I doubt that large memcpys with odd-byte alignment are anywhere
near typical. malloc and mmap both return well-aligned buffers
(say, 8-byte aligned). Static and on-stack objects are also
at least word-aligned 99% of the time.
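memcpy can just use "relatively simple" code for copies in which
either src or dst is not word-aligned. Only the word-aligned offsets
modulo 16 then need an optimized branch, which cuts the possibilities
down from 16 to 4 (offsets 0, 4, 8, 12 with 4-byte words), or even 2
(offsets 0, 8 with 8-byte words)?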
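
Roughly what I have in mind, as a minimal sketch only. The name
sketch_memcpy and the word_t typedef are made up for illustration and
are not taken from glibc or OpenSolaris; may_alias is a GCC/Clang
extension, and the overlap rules are the same as for memcpy. The
word-aligned case gets the fast loop, and everything else falls
through to the simple byte loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang extension. A word type allowed to alias any object. */
typedef uintptr_t __attribute__((may_alias)) word_t;

/* Hypothetical illustration, not an existing library routine. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Fast path: both pointers are word-aligned, so copy whole words. */
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(word_t) - 1)) == 0) {
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;
        for (; n >= sizeof(word_t); n -= sizeof(word_t))
            *dw++ = *sw++;
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Simple path: misaligned pointers, plus the tail left by the fast path. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

A real implementation would put the SSE code in the aligned branch
only; the point is just that the misaligned case needs no per-offset
branches if we accept that it runs slower.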
--
vda