On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <andi@firstfloor.org> wrote:
Jan Hubicka <hubicka@ucw.cz> writes:
Note that I think Core has similar characteristics - at least for string
operations it fares well with unaligned accesses.
Nehalem and later have very fast unaligned vector loads. There's still
some penalty when they cross cache lines, however.
IIRC the rule of thumb is to use unaligned accesses for 128-bit vectors,
but avoid them for 256-bit vectors, because the cache line crossing
penalty is larger on Sandy Bridge and a crossing is more likely with the
larger vectors.
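
To put a rough number on "more likely with the larger vectors", here is a
minimal sketch in C. It assumes a 64-byte cache line and start addresses
spread uniformly over every byte offset within a line; both are
assumptions for illustration, and count_crossing_offsets is a
hypothetical helper, not anything from GCC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the byte offsets within one cache line at
   which an access of access_size bytes spills into the next line. */
size_t count_crossing_offsets(size_t access_size, size_t line_size)
{
    assert(access_size > 0 && access_size <= line_size);

    size_t crossing = 0;
    for (size_t off = 0; off < line_size; off++) {
        /* The access covers bytes [off, off + access_size). */
        if (off + access_size > line_size)
            crossing++;
    }
    return crossing;
}
```

Under those assumptions a 16-byte (128-bit) access crosses at 15 of the
64 offsets and a 32-byte (256-bit) access at 31 of 64, so roughly twice
as often. Real code only sees element-aligned offsets, which changes the
counts but not the trend.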
Yes, I think the rule was that using the unaligned instruction variants
carries no penalty when the actual access is aligned, but that aligned
accesses are still faster than unaligned accesses. Thus peeling for
alignment _is_ a win.
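
For readers unfamiliar with the term, peeling for alignment can be
sketched by hand roughly as follows. This is only an illustration of the
loop shape, not what the vectorizer actually emits: sum_peeled and
VEC_BYTES are hypothetical names, a 128-bit vector width and 4-byte
floats are assumed, and the main loop is written as scalar code standing
in for one aligned vector load per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16  /* assumed 128-bit vector width */

/* Hypothetical example: sum n floats, peeling scalar iterations so that
   the main loop starts at a 16-byte-aligned address. */
float sum_peeled(const float *a, size_t n)
{
    assert(a != NULL || n == 0);

    float sum = 0.0f;
    size_t i = 0;

    /* Prologue (the "peel"): scalar iterations until &a[i] is aligned. */
    while (i < n && ((uintptr_t)(a + i) % VEC_BYTES) != 0) {
        sum += a[i];
        i++;
    }

    /* Main loop: each iteration covers one aligned 16-byte chunk, which
       a vectorizer could fetch with a single aligned load. */
    for (; i + 4 <= n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];

    /* Epilogue: leftover elements that do not fill a whole chunk. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

The cost is the extra prologue and epilogue code; the gain is that every
access in the hot loop is aligned and so never crosses a cache line.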
I also seem to remember that the story for unaligned stores vs. unaligned
loads is usually different.