This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Fri, 05 May 2017 23:16:36 +0000
Subject: [Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
Auto-submitted: auto-generated
References: <bug-80636-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
> The same possibly applies to all "zero-extending" moves?

Yes, if a  vmovdqa %xmm0,%xmm1  will work, it's the best choice on AMD CPUs,
and doesn't hurt on Intel CPUs.  That covers any case where you need to copy a
register and the upper lane(s) are known to be zero.

If you're copying just to zero the upper lane (because you don't know that the
source reg's upper lane is already zero), you don't have a choice: it has to be
the 128b move.

In general, when all else is equal, use narrower vectors.  (e.g. in a
horizontal sum, the first step should be vextractf128 to reduce down to 128b
vectors.)
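
As a rough illustration of the horizontal-sum point, here is a minimal C sketch using intrinsics.  The function names (hsum8, hsum8_avx, hsum8_scalar) are made up for this example, and it assumes GCC or Clang (for the target attribute and __builtin_cpu_supports).  It narrows to 128b as the very first step, so every later shuffle/add is a 128b operation.  The instruction names in the comments are what these intrinsics typically map to, not a guarantee about any particular compiler's output.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Horizontal sum of 8 floats: reduce 256b -> 128b first, then stay in xmm. */
__attribute__((target("avx")))
static float hsum8_avx(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);         /* low lane, normally free */
    __m128 hi = _mm256_extractf128_ps(v, 1);       /* typically vextractf128 */
    __m128 s  = _mm_add_ps(lo, hi);                /* 4 partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));        /* elements 0,1 hold 2 partial sums */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55)); /* element 0 += element 1 */
    return _mm_cvtss_f32(s);
}
#endif

/* Plain scalar fallback, used when AVX is not available at run time. */
static float hsum8_scalar(const float *p)
{
    float s = 0.0f;
    for (int i = 0; i < 8; i++)
        s += p[i];
    return s;
}

/* Hypothetical dispatcher: picks the AVX version only if the CPU supports it. */
float hsum8(const float *p)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx"))
        return hsum8_avx(p);
#endif
    return hsum8_scalar(p);
}
```

Doing the extract first means only the initial load touches a full 256b register, which is the cheap case on the AMD designs discussed here and costs nothing extra on Intel.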

---

Quoting the Bulldozer section of Agner Fog's microarch.pdf (section 18.10
Bulldozer AVX):

> 128-bit register-to-register moves have zero latency, while 256-bit register-to-register
> moves have a latency of 2 clocks plus a penalty of 2-3 clocks for using a different
> domain (see below) on Bulldozer and Piledriver.

---

On Ryzen: for a 256b register-to-register move, the low 128-bit lane is renamed
with zero latency, but the upper lane needs an execution unit.

Despite this, vectorizing with 256b *is* worth it on Ryzen, because the core is
so wide and decodes double-uop instructions efficiently.  Also, AVX 3-operand
instructions make moves rarer.

---

On Jaguar: 128b moves (with implicit zeroing of the upper lane) are 1 uop, 256b
moves are 2 uops.  128b moves from zeroed registers are eliminated (no
execution port, but still have to decode/issue/retire).

David Kanter's writeup (http://www.realworldtech.com/jaguar/4/) explains that
the PRF has an "is-zero" bit which can be set efficiently.  This is how 128b
moves are able to zero the upper lane of the destination in the rename stage,
without using an extra uop.  The same bit lets xor-zeroing uops avoid needing
an execution port.

References:
- [Bug target/80636] New: AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
  - From: peter at cordes dot ca