This is the mail archive of the
mailing list for the GCC project.
Re: powerpc & unaligned block moves with fp registers
- To: <dewar at gnat dot com>, <degger at fhm dot edu>, <kenner at vlsi1 dot ultra dot nyu dot edu>
- Subject: Re: powerpc & unaligned block moves with fp registers
- From: "Tim Prince" <tprince at computer dot org>
- Date: Sat, 10 Nov 2001 07:28:34 -0800
- Cc: <gcc at gcc dot gnu dot org>
- References: <20011110144448.6B166F28C7@nile.gnat.com>
----- Original Message -----
To: <email@example.com>; <firstname.lastname@example.org>
Sent: Saturday, November 10, 2001 6:44 AM
Subject: Re: powerpc & unaligned block moves with fp registers
> <<Slow in the case of misaligned accesses depends on the system; if the
> hardware handles it by splitting up the accesses then it's likely to be
> in the range of a few dozen up to a few hundred cycles. If the accesses
> emerge into the OS because the hardware cannot handle it then
> the overhead is more likely to be in the range from a few hundred up to
> several thousand cycles. It's really hard to give accurate numbers here
> since it depends very much on the CPU and in the latter case also on the
> This is too pessimistic. For example, on Power, the penalty for a misaligned
> access is far less than this.
> Yes, it very much depends on the architecture, but your generalization is
> not accurate (and far too pessimistic) for many cases. I don't have the
> figures for latest chips in the Pentium and Athlon series, but I would
> be very surprised if the penalty is as much as a few dozen cycles (on
> earlier chips it was about one clock).
I don't have the figures either, but typical memory-intensive benchmarks
using 64-bit data on P4 take 60% longer with the standard alignment
specified in coff-i386.c, as compared to when the
DEFAULT_SECTION_ALIGNMENT_POWER is increased. The penalty occurs only on
the memory access which straddles cache boundaries (cache line split), so it
is huge when it occurs. On P-III and early Athlons, the penalty is about
30% on the same tests, with smaller cache lines; on Athlon 1800+, about 50%.
This looks like more than a few dozen cycles, unless you take an average, in
which case it is only a "handful," to be even less precise. On Itanium, a
misaligned access, if processed by a trap handler, takes thousands of cycles.
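
To illustrate the cache line split point, here is a minimal C sketch that checks whether an access crosses a line boundary and counts how often that happens for a run of 64-bit accesses. The helper names (straddles_line, count_splits) and the line sizes used with it are assumptions chosen for illustration; they are not taken from GCC or from any measurement quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of `size` bytes starting at byte address `addr`
   touches two different cache lines, 0 otherwise.
   `line_size` is an assumed parameter (e.g. 32 or 64 bytes). */
static int straddles_line(uintptr_t addr, size_t size, size_t line_size)
{
    assert(size > 0 && line_size > 0);
    /* Compare the line index of the first and last byte of the access. */
    return addr / line_size != (addr + size - 1) / line_size;
}

/* Counts how many of `n` consecutive 8-byte accesses split a cache line
   when the first element sits at byte address `base`. */
static size_t count_splits(uintptr_t base, size_t n, size_t line_size)
{
    size_t splits = 0;
    for (size_t i = 0; i < n; i++)
        splits += (size_t)straddles_line(base + 8 * i, 8, line_size);
    return splits;
}
```

Under these assumptions, count_splits(4, 64, 64) finds 8 splits in 64 accesses (one in eight) for data that is only 4-byte aligned, while count_splits(0, 64, 64) finds none. That ratio is one way to see how a large cost per split can shrink to a small figure once it is averaged over all accesses.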