[Bug target/97194] optimize vector element set/extract at variable position

amonakov at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Mon Sep 28 08:55:51 GMT 2020


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
FWIW, Peter Cordes provides an overview of available approaches for extraction
depending on vector length and ISA extensions (up to AVX2, not including
AVX-512) in this StackOverflow answer:
https://stackoverflow.com/a/51414330/4755075

TL;DR: generally through store+load; possible alternatives:
 128b:
  SSSE3: pshufb          (1-byte elements)
  SSSE3: imul+add+pshufb (any element size)
  AVX: vpermilp[sd] (4 or 8-byte elements)
 256b:
  AVX2: vpermps (4-byte elements)

In all cases a (v)movd is needed to move the index to a vector register, and
potentially another (v)movd if the result is needed in a general register.
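
To make the movd/pshufb/movd sequence concrete, here is a rough sketch of the
imul+add+pshufb variant for 4-byte elements, written with SSE intrinsics. The
function name extract_epi32_pshufb is made up for this illustration, and the
target attribute assumes GCC or Clang on x86; it is not meant as a statement
of what the compiler should emit.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Hypothetical helper: extract 32-bit element idx (0..3) from v
   without going through memory. */
__attribute__((target("ssse3")))
static int32_t extract_epi32_pshufb(__m128i v, unsigned idx)
{
    /* imul + add: build the byte indices 4*idx+0 .. 4*idx+3,
       packed into one 32-bit word. */
    uint32_t ctrl = (idx & 3) * 0x04040404u + 0x03020100u;

    /* movd: move the shuffle control into a vector register. */
    __m128i shuf = _mm_cvtsi32_si128((int)ctrl);

    /* pshufb: the low 4 bytes of the result are the selected element. */
    __m128i r = _mm_shuffle_epi8(v, shuf);

    /* movd: move the low 32 bits back to a general register. */
    return _mm_cvtsi128_si32(r);
}
```

With v = _mm_setr_epi32(10, 20, 30, 40), extract_epi32_pshufb(v, 2) returns 30.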

The basic store+load tactic may look worse latency-wise, but can be better
throughput-wise (especially with multiple extractions from the same vector, as
then the store needs to be done just once, as Peter mentioned).
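
For comparison, a minimal sketch of the store+load tactic using the GCC/Clang
vector extension. The type v4si and the function names are assumptions chosen
for this example; the second function shows two extractions sharing a single
store.

```c
#include <stdint.h>
#include <string.h>

/* 128-bit vector of four 32-bit ints (GCC/Clang vector extension). */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Spill the vector to a stack slot once, then load one element back
   with an ordinary scalar load at a variable offset. */
static int32_t extract_store_load(v4si v, unsigned idx)
{
    int32_t tmp[4];
    memcpy(tmp, &v, sizeof tmp);  /* the store */
    return tmp[idx & 3];          /* the load */
}

/* Two extractions from the same vector: still only one store. */
static int32_t extract_two_and_add(v4si v, unsigned i, unsigned j)
{
    int32_t tmp[4];
    memcpy(tmp, &v, sizeof tmp);
    return tmp[i & 3] + tmp[j & 3];
}
```

Each extra extraction in this form costs only one more scalar load, whereas a
shuffle-based approach repeats the movd and shuffle for every index.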

Why is it important in RTL to do this without referencing the stack?

