[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

Sat Sep 11 07:54:39 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #9 from Peter Cordes <peter at cordes dot ca> ---
Thanks for implementing my idea :)

(In reply to Hongtao.liu from comment #6)
> For elements located above 128bits, it seems always better(?) to use
> valign{d,q}

TL:DR:
 I think we should still use vextracti* / vextractf* when that can get the job
done in a single instruction, especially when the VEX-encoded vextracti/f128
can save a byte of code size for v[4].

Extracts are simpler shuffles that might have better throughput on some future
CPUs, especially the upcoming Zen4, so even without code-size savings we should
use them when possible.  Tiger Lake has a 256-bit shuffle unit on port 1 that
supports some common shuffles (like vpshufb); a future Intel might add
256->128-bit extracts to that.

It might also save a tiny bit of power, allowing on-average higher turbo
clocks.

---

On current CPUs with AVX-512, valignd is about equal to a single vextract, and
better than multiple instruction.  It doesn't really have downsides on current
Intel, since I think Intel has continued to not have int/FP bypass delays for
shuffles.

We don't know yet what AMD's Zen4 implementation of AVX-512 will look like.  If
it's like Zen1 was AVX2 (i.e. if it decodes 512-bit instructions other than
insert/extract into at least 2x 256-bit uops) a lane-crossing shuffle like
valignd probably costs more than 2 uops.  (vpermq is more than 2 uops on
Piledriver/Zen1).  But a 128-bit extract will probably cost just one uop.  (And
especially an extract of the high 256 might be very cheap and low latency, like
vextracti128 on Zen1, so we might prefer vextracti64x4 for v[8].)

So this change is good, but using a vextracti64x2 or vextracti64x4 could be a
useful peephole optimization when byte_offset % 16 == 0.  Or of course
vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
with an EVEX-encoded instruction).

vextractf-whatever allows an FP shuffle on FP data in case some future CPU
cares about that for shuffles.

An extract is a simpler shuffle that might have better throughput on some
future CPU even with full-width execution units.  Some future Intel CPU might
add support for vextract uops to the extra shuffle unit on port 1.  (Which is
available when no 512-bit uops are in flight.)  Currently (Ice Lake / Tiger
Lake) it can only run some common shuffles like vpshufb ymm, but not including
any vextract or valign.  Of course port 1 vector ALUs are shut down when
512-bit uops are in flight, but could be relevant for __m256 vectors on these
hypothetical future CPUs.

When we can get the job done with a single vextract-something, we should use
that instead of valignd.  Otherwise use valignd.

We already check the index for low-128 special cases to use vunpckhqdq vs.
vpshufd (or vpsrldq) or similar FP shuffles.

-----

On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
zero), an extract that only writes a 128-bit register will keep them clean
(even if it reads a ZMM), not needing a VZEROUPPER.  Since VZEROUPPER is only
needed for dirty y/zmm0..15, not with dirty zmm16..31, so a function like

float foo(float *p) {
  some vector stuff that can use high zmm regs;
  return scalar that happens to be from the middle of a vector;
}

could vextract into XMM0, but would need vzeroupper if it used valignd into
ZMM0.

(Also related
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
re reading a ZMM at all and turbo clock).

---

Having known zeros outside the low 128 bits (from writing an xmm instead of
rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
elements that might be subnormal could happen to be an advantage, maybe saving
an FP assist for denormal.  We're unlikely to be able to take advantage of it
to save instructions/uops (like OR instead of blend).  But it's not worse to
use a single extract instruction instead of a single valignd.