This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/68928] AVX loops on unaligned arrays could generate more efficient startup/cleanup code when peeling
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 16 Dec 2015 21:47:40 +0000
- Subject: [Bug target/68928] AVX loops on unaligned arrays could generate more efficient startup/cleanup code when peeling
- Auto-submitted: auto-generated
- References: <bug-68928-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68928
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
I posted this as a question on Stack Overflow, and got some useful comments (and
had some ideas while writing up a mask-generation answer).
http://stackoverflow.com/questions/34306933/vectorizing-with-unaligned-buffers-using-vmaskmovps-generating-a-mask-from-a-m
Stephen Canon points out that VMASKMOVPS isn't actually useful: you can instead
use unaligned loads/stores for the peeled first/last iterations and do
overlapping work. You just have to make sure you load any data you need before
clobbering it. I posted an answer using that idea, but I'm not sure whether it's
the sort of thing a compiler could decide to use.
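The overlap idea can be sketched in plain C, modeling one 256-bit vector as an
8-float block (the helper names, the VEC constant, and the trivial x += 1
transform are mine, purely for illustration; real code would use __m256
intrinsics). The key point is the load-before-clobber ordering: the unaligned
head and tail vectors are loaded before the aligned middle loop stores anything,
so overlapped lanes get recomputed from the original data and stored twice with
identical values.

```c
#include <stddef.h>
#include <string.h>

enum { VEC = 8 };                     /* elements per "vector" (one 256-bit AVX reg of floats) */

typedef struct { float e[VEC]; } vec; /* scalar stand-in for __m256 */

static vec vload(const float *p)  { vec v; memcpy(v.e, p, sizeof v.e); return v; }
static void vstore(float *p, vec v) { memcpy(p, v.e, sizeof v.e); }
static vec vadd1(vec v) { for (int i = 0; i < VEC; i++) v.e[i] += 1.0f; return v; }

/* In-place x[i] += 1 over n >= VEC elements whose start is `misalign`
 * elements past an alignment boundary.  The unaligned head and tail
 * vectors overlap the aligned middle; both are loaded BEFORE any store
 * happens, so overlapping lanes are recomputed from the original data
 * and each element ends up transformed exactly once. */
static void add1_overlap(float *x, size_t n, size_t misalign)
{
    size_t first = (VEC - misalign % VEC) % VEC;   /* first aligned element */

    vec head = vload(x);              /* load unaligned head before clobbering it */
    vec tail = vload(x + n - VEC);    /* load unaligned tail before clobbering it */

    for (size_t i = first; i + VEC <= n; i += VEC) /* aligned middle loop */
        vstore(x + i, vadd1(vload(x + i)));

    vstore(x, vadd1(head));           /* overlapping stores rewrite the same values */
    vstore(x + n - VEC, vadd1(tail));
}
```

The redundant work in the overlap region is harmless precisely because both
copies of each lane were computed from the same pre-store data.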
For reduction loops where we need to accumulate each element exactly once, a
mask is still useful, but we can use it for ANDPS / ANDNPS instead of VMASKMOV.
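A scalar sketch of that masked reduction (function names and the `skip`
parameter are my own, for illustration): ANDPS is modeled by ANDing an
all-ones/all-zeros dword mask with each lane's bit pattern, so out-of-range
lanes contribute 0.0 and every in-range element enters the accumulator exactly
once.

```c
#include <stdint.h>
#include <string.h>

enum { NLANES = 8 };   /* one 256-bit vector of floats */

/* Scalar model of ANDPS on one lane: AND the float's bit pattern with
 * an all-ones (-1) or all-zeros (0) dword mask. */
static float andps_lane(float x, int32_t m)
{
    uint32_t b;
    memcpy(&b, &x, sizeof b);
    b &= (uint32_t)m;
    float r;
    memcpy(&r, &b, sizeof r);
    return r;
}

/* Peeled first iteration of a sum reduction: the first `skip` lanes
 * fall before the start of the array (the load was rounded down to an
 * aligned address), so they are masked to +0.0 before accumulating. */
static float masked_head_sum(const float *v, int skip)
{
    float acc = 0.0f;
    for (int i = 0; i < NLANES; i++) {
        int32_t m = (i < skip) ? 0 : -1;   /* 0 = drop lane, -1 = keep lane */
        acc += andps_lane(v[i], m);
    }
    return acc;
}
```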
I improved the mask generation to a single AVX2 VPMOVSXBD load (plus 5 or 7
single-uop integer instructions to generate the index from the start/end
address). VPCMPGT isn't needed: just use the index to take the right
window of bytes from memory. This emulates a variable-count VPSLLDQ on a
buffer of all-ones.
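A scalar model of that window trick (the constant's layout and the names are my
assumptions; real code would do one unaligned 8-byte load at the computed
offset and let VPMOVSXBD sign-extend each byte to a dword in a single
instruction):

```c
#include <stdint.h>

enum { VEC = 8 };   /* dword lanes in a 256-bit vector */

/* 16-byte constant: 8 zero bytes then 8 all-ones bytes.  An unaligned
 * 8-byte load at offset (VEC - skip) picks a window whose first `skip`
 * bytes are 0 and whose remaining bytes are -1 -- a variable-count byte
 * shift on a buffer of all-ones, done by indexing instead of VPCMPGT. */
static const int8_t window[2 * VEC] = {
     0,  0,  0,  0,  0,  0,  0,  0,
    -1, -1, -1, -1, -1, -1, -1, -1,
};

/* Scalar model of the VPMOVSXBD load: sign-extend the selected byte
 * window into a per-lane dword mask (0 for i < skip, ~0 otherwise). */
static void make_mask(int skip, int32_t mask[VEC])
{
    for (int i = 0; i < VEC; i++)
        mask[i] = window[(VEC - skip) + i];  /* int8_t -> int32_t sign-extends */
}
```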
This is something gcc could maybe use, but some experimental testing against
plain unaligned loads/stores is probably warranted before spending any time
implementing automatic generation of something this complicated.