[Bug target/68928] AVX loops on unaligned arrays could generate more efficient startup/cleanup code when peeling

peter at cordes dot ca gcc-bugzilla@gcc.gnu.org
Wed Dec 16 21:46:00 GMT 2015


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68928

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
Richard wrote: 
> [...] avoid peeling for alignment on x86_64 and just use unaligned ops

Yeah, that's what clang does, and it may be optimal.  Certainly it's easy, and
it gives optimal performance when buffers *are* in fact aligned, even when the
programmer has neglected to inform the compiler of any guarantee.

However, with vector sizes getting closer to the cache-line size, unaligned
accesses will cross cache lines more of the time.  (E.g. with 64-byte lines, a
32-byte AVX loop over an unaligned buffer will have a cache-line split on every
other iteration.)  Iff we can *cheaply* avoid this, it may be worth it.
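
To put a number on that, here is a minimal sketch of the arithmetic.  It is
only an illustration, assuming 64-byte cache lines and 32-byte (256-bit AVX)
vectors; count_line_splits and peel_bytes are hypothetical helper names, not
anything in GCC.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes: 64-byte cache lines, 32-byte (256-bit AVX) vectors. */
enum { CACHE_LINE = 64, VEC_BYTES = 32 };

/* Count how many of n_vecs consecutive VEC_BYTES-wide accesses, starting at
   byte address addr, touch two different cache lines. */
static size_t count_line_splits(uintptr_t addr, size_t n_vecs)
{
    size_t splits = 0;
    for (size_t i = 0; i < n_vecs; i++) {
        uintptr_t first = addr + i * VEC_BYTES;   /* first byte accessed */
        uintptr_t last  = first + VEC_BYTES - 1;  /* last byte accessed  */
        if (first / CACHE_LINE != last / CACHE_LINE)
            splits++;
    }
    return splits;
}

/* Number of leading bytes a peeled prologue would have to handle so that the
   vector loop starts on a VEC_BYTES boundary (0 if already aligned). */
static size_t peel_bytes(uintptr_t addr)
{
    return (size_t)(-addr & (uintptr_t)(VEC_BYTES - 1));
}
```

For a buffer starting 4 bytes past a cache-line boundary, count_line_splits
reports a split on half of the vector accesses, and peel_bytes says 28 bytes
would have to be peeled to remove them.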

IIRC, all modern x86 / x86-64 CPUs have no penalty for unaligned loads, as long
as they don't actually cross a cache-line boundary.  (True for Intel since
Nehalem.)  Store-forwarding doesn't work well if the stores don't line up with
the loads, though.

