This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug target/80636] New: AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Fri, 05 May 2017 00:02:56 +0000
Subject: [Bug target/80636] New: AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
Auto-submitted: auto-generated

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

            Bug ID: 80636
           Summary: AVX / AVX512 register-zeroing should always use AVX
                    128b, not ymm or zmm
           Product: gcc
           Version: 8.0
               URL: http://stackoverflow.com/questions/43713273/is-vxorps-zeroing-on-amd-jaguar-bulldozer-zen-faster-with-xmm-registers-than-ymm
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Currently, gcc compiles _mm256_setzero_ps() to vxorps %ymm0, %ymm0, %ymm0, and
_mm512_setzero_ps() to the zmm equivalent.  The same applies to pd and integer
vectors: the zeroing instruction uses the vector width that matches how the
register is going to be used.

vxorps %xmm0, %xmm0, %xmm0 has the same effect, because VEX-encoded
instructions zero the upper bits of the destination register out to VLMAX.
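
A minimal sketch of how the claim above could be checked from C, assuming GNU
extended inline asm and the `x` operand modifier (which prints the xmm name of
the register holding a 256-bit operand).  The function names are hypothetical
and not part of gcc or of this report.  When built without AVX enabled (no
-mavx), the check is skipped and the function just returns 1.

```c
#include <assert.h>
#include <string.h>

#if defined(__AVX__) && defined(__GNUC__)
#include <immintrin.h>
#endif

/* Returns 1 if xmm-width zeroing left all 256 bits of the register zero.
   Hypothetical helper; returns 1 without testing when AVX is not enabled
   at compile time. */
int xmm_zeroing_clears_ymm(void)
{
#if defined(__AVX__) && defined(__GNUC__)
    __m256 v = _mm256_set1_ps(1.0f);   /* all eight lanes start non-zero */

    /* Zero only through the xmm name of whatever register holds v.
       "+x" keeps v live in a vector register across the asm statement. */
    __asm__("vxorps %x0, %x0, %x0" : "+x"(v));

    float out[8];
    static const float zeros[8] = { 0 };
    _mm256_storeu_ps(out, v);
    return memcmp(out, zeros, sizeof out) == 0;
#else
    return 1;                          /* built without AVX: nothing to check */
#endif
}

/* Convenience wrapper that aborts if the upper lanes were not cleared. */
void check_xmm_zeroing(void)
{
    assert(xmm_zeroing_clears_ymm());
}
```

Building with something like gcc -O2 -mavx and running on an AVX-capable CPU
exercises the real instruction; the upper 128 bits should read back as zero
even though only the xmm name appears in the asm.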

AMD Ryzen decodes the xmm version to 1 micro-op, but the ymm version to 2
micro-ops.  It doesn't detect the zeroing-idiom special case until after the
decoder has split the instruction.  Earlier AMD CPUs (Bulldozer/Jaguar) may be
similar.

---

For zeroing a ZMM register, using a VEX prefix instead of EVEX also saves a
byte or two, if the target register is zmm0-15.  (zmm16-31 of course always
need EVEX.)
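
The "byte or two" can be worked out from the prefix sizes.  The sketch below
assumes the zeroing idiom uses the same register number for the destination
and both sources (e.g. vxorps %xmm8, %xmm8, %xmm8), so registers 8-15 need the
3-byte VEX form to reach the high register in ModRM.rm.  The helper names are
hypothetical, for illustration only.

```c
#include <assert.h>

enum prefix_kind { PREFIX_VEX, PREFIX_EVEX };

/* Length in bytes of a reg,reg,reg vxorps with all three operands equal.
   Returns -1 if the register cannot be encoded with the requested prefix. */
int zeroing_insn_length(int reg, enum prefix_kind kind)
{
    assert(reg >= 0 && reg < 32);
    const int opcode_and_modrm = 2;        /* opcode 0x57 plus ModRM byte */

    if (kind == PREFIX_EVEX)
        return 4 + opcode_and_modrm;       /* EVEX prefix is always 4 bytes */
    if (reg >= 16)
        return -1;                         /* registers 16-31 need EVEX */

    /* 2-byte VEX (C5) for registers 0-7, 3-byte VEX (C4) for 8-15. */
    return (reg < 8 ? 2 : 3) + opcode_and_modrm;
}

/* Bytes saved by zeroing through the VEX xmm form instead of EVEX zmm. */
int bytes_saved_by_vex(int reg)
{
    int vex = zeroing_insn_length(reg, PREFIX_VEX);
    if (vex < 0)
        return 0;                          /* no VEX encoding available */
    return zeroing_insn_length(reg, PREFIX_EVEX) - vex;
}
```

Under these assumptions the saving is 2 bytes for zmm0-7 (4 vs. 6 bytes) and
1 byte for zmm8-15 (5 vs. 6 bytes).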

---

There is no benefit, but also no downside, to using xmm-zeroing on Intel CPUs
that don't split 256b or 512b vector ops.  This change could be made across the
board, without adding any tuning options to control it.

References: 
http://stackoverflow.com/a/43751783/224132 Agner Fog's answer to my SO question
about this.
https://bugs.llvm.org/show_bug.cgi?id=32862  the same issue for clang.

Follow-Ups:
- [Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
  - From: rguenth at gcc dot gnu.org
- [Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
  - From: peter at cordes dot ca
- [Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
  - From: peter at cordes dot ca
