[Bug target/97194] optimize vector element set/extract at variable position

Mon Sep 28 08:55:51 GMT 2020

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
FWIW, Peter Cordes provides an overview of available approaches for extraction
depending on vector length and ISA extensions (up to AVX2, not including
AVX-512) in this StackOverflow answer:
https://stackoverflow.com/a/51414330/4755075

TL;DR: generally through store+load; possible alternatives:
 128b:
  SSSE3: pshufb          (1-byte elements)
  SSSE3: imul+add+pshufb (any element size)
  AVX:   vpermilp[sd]    (4- or 8-byte elements)
 256b:
  AVX2:  vpermps         (4-byte elements)

In all cases a (v)movd is needed to move the index to a vector register, and
potentially another (v)movd if the result is needed in a general register.
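
For illustration, here is a rough sketch of the imul+add+pshufb variant for
4-byte elements in a 128-bit vector, written with intrinsics. It assumes an
x86 target and GCC or Clang (for the target attribute). The function names
are made up for this example, and the code is not taken from Peter's answer
or from any GCC patch.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: extract 32-bit element idx (0..3) from v without
   going through memory. */
__attribute__((target("ssse3")))
int32_t extract_epi32_pshufb(__m128i v, unsigned idx)
{
    assert(idx < 4);
    /* imul+add: build the byte selectors 4*idx+0 .. 4*idx+3 in a GPR. */
    uint32_t sel = idx * 0x04040404u + 0x03020100u;
    /* movd: move the selectors to a vector register (upper bytes are zero). */
    __m128i ctrl = _mm_cvtsi32_si128((int)sel);
    /* pshufb: the low four result bytes now hold the selected element. */
    __m128i r = _mm_shuffle_epi8(v, ctrl);
    /* movd: move the result back to a general register. */
    return _mm_cvtsi128_si32(r);
}

/* Hypothetical driver with a fixed test vector {10, 20, 30, 40}. */
__attribute__((target("ssse3")))
int32_t demo_extract_epi32(unsigned idx)
{
    __m128i v = _mm_setr_epi32(10, 20, 30, 40);
    return extract_epi32_pshufb(v, idx);
}
```

With idx = 2 the selector is 0x0B0A0908, so pshufb copies source bytes 8..11
into the low dword of the result.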

The basic store+load tactic may look worse latency-wise, but can be better
throughput-wise (especially with multiple extractions from the same vector, as
then the store needs to be done just once, as Peter mentioned).
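
At the source level, the store+load tactic might be sketched as follows,
using GCC's generic vector extension. The type and function names are invented
for this example, and whether the compiler really emits one store followed by
scalar loads depends on the target and optimization level.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int32_t v4si __attribute__((vector_size(16)));

/* Single extraction: one vector store, then one scalar load. */
int32_t extract_via_stack(v4si v, unsigned idx)
{
    int32_t buf[4];
    assert(idx < 4);
    memcpy(buf, &v, sizeof buf);    /* store the whole vector */
    return buf[idx];                /* load the selected element */
}

/* Several extractions from the same vector: the store is done once and
   each further extraction costs only a scalar load. */
int32_t sum_selected(v4si v, const unsigned *idx, unsigned n)
{
    int32_t buf[4];
    int32_t sum = 0;
    memcpy(buf, &v, sizeof buf);    /* the single vector store */
    for (unsigned i = 0; i < n; i++) {
        assert(idx[i] < 4);
        sum += buf[idx[i]];
    }
    return sum;
}
```

The second function is the case where the throughput argument applies: the
cost of the store is shared by all n extractions.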

Why is it important in RTL to do this without referencing the stack?