[Bug target/97194] optimize vector element set/extract at variable position
amonakov at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon Sep 28 08:55:51 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
FWIW, Peter Cordes provides an overview of available approaches for extraction
depending on vector length and ISA extensions (up to AVX2, not including
AVX-512) in this StackOverflow answer:
https://stackoverflow.com/a/51414330/4755075
TL;DR: generally through store+load; possible alternatives:
128b:
  SSSE3: pshufb (1-byte elements)
  SSSE3: imul+add+pshufb (any element size)
  AVX: vpermilp[sd] (4 or 8-byte elements)
256b:
  AVX2: vpermps (4-byte elements)
In all cases a (v)movd is needed to move the index to a vector register, and
potentially another (v)movd if the result is needed in a general register.
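
For illustration, the imul+add+pshufb entry for 4-byte elements might look
roughly like the sketch below. The helper names are hypothetical, and the
constants reflect one reading of the approach (control bytes 4*idx+0 through
4*idx+3), not code taken from the linked answer. The #else branch is only a
scalar model of pshufb so that the snippet also builds without -mssse3.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSSE3__
#include <tmmintrin.h>
#endif

/* Hypothetical helper: low four pshufb control bytes that select the
   4-byte element idx (0..3), i.e. bytes 4*idx+0 .. 4*idx+3. */
static uint32_t pshufb_control_for_dword(uint32_t idx)
{
    assert(idx < 4);
    return idx * 0x04040404u + 0x03020100u;   /* imul + add */
}

/* Hypothetical helper: extract 32-bit element idx from a 16-byte vector
   without going through a variable-offset memory access. */
static uint32_t extract_dword(const uint8_t vec[16], uint32_t idx)
{
    uint32_t ctrl = pshufb_control_for_dword(idx);
#ifdef __SSSE3__
    __m128i v = _mm_loadu_si128((const __m128i *)vec);
    __m128i c = _mm_cvtsi32_si128((int)ctrl);   /* movd: index -> vector reg */
    __m128i r = _mm_shuffle_epi8(v, c);         /* pshufb */
    return (uint32_t)_mm_cvtsi128_si32(r);      /* movd: result -> general reg */
#else
    /* Scalar model of pshufb, low four result bytes only. */
    uint8_t out[4];
    uint32_t r;
    for (int i = 0; i < 4; i++) {
        uint8_t sel = (uint8_t)(ctrl >> (8 * i));
        out[i] = (sel & 0x80) ? 0 : vec[sel & 0x0F];
    }
    memcpy(&r, out, sizeof r);
    return r;
#endif
}
```

The two movd steps mentioned above show up as _mm_cvtsi32_si128 and
_mm_cvtsi128_si32 in the intrinsic path.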
The basic store+load tactic may look worse latency-wise, but can be better
throughput-wise (especially with multiple extractions from the same vector, as
then the store needs to be done just once, as Peter mentioned).
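
A minimal sketch of the store+load tactic at the source level, using the
GCC/Clang vector_size extension. The function names are hypothetical, and the
compiler is free to generate something other than a literal store and load
here. The second function shows the case of multiple extractions from the same
vector, where a single store can serve both loads.

```c
#include <assert.h>
#include <string.h>

/* 128-bit vector of four floats (GCC/Clang vector extension). */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: extract element idx by spilling the vector to a
   temporary and doing a scalar load at a variable offset. */
static float extract_via_memory(v4sf v, unsigned idx)
{
    float tmp[4];
    assert(idx < 4);
    memcpy(tmp, &v, sizeof tmp);   /* store the whole vector once */
    return tmp[idx];               /* scalar load, variable index */
}

/* Hypothetical helper: two extractions sharing one store. */
static float sum_two_lanes(v4sf v, unsigned i, unsigned j)
{
    float tmp[4];
    assert(i < 4 && j < 4);
    memcpy(tmp, &v, sizeof tmp);   /* single store ... */
    return tmp[i] + tmp[j];        /* ... followed by two independent loads */
}
```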
Why is it important in RTL to do this without referencing the stack?