Bug 80636 - AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
Summary: AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL: http://stackoverflow.com/questions/43...
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-05-05 00:02 UTC by Peter Cordes
Modified: 2021-06-04 01:18 UTC
CC: 1 user

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-05-05 00:00:00


Description Peter Cordes 2017-05-05 00:02:56 UTC
Currently, gcc compiles _mm256_setzero_ps() to vxorps %ymm0, %ymm0, %ymm0, and uses a zmm register for _mm512_setzero_ps().  It does the same for pd and integer vectors, using a vector width that matches how the register is going to be used.

vxorps %xmm0, %xmm0, %xmm0 has the same effect, because AVX instructions zero the destination register out to VLMAX.

AMD Ryzen decodes the xmm version to 1 micro-op, but the ymm version to 2 micro-ops.  It doesn't detect the zeroing idiom special-case until after the decoder has split it.  (Earlier AMD CPUs (Bulldozer/Jaguar) may be similar.)
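
A minimal example (function name just for illustration; compile with -O2 -mavx):

#include <immintrin.h>
__m256 zero256() { return _mm256_setzero_ps(); }
// current gcc output:  vxorps %ymm0, %ymm0, %ymm0
// preferred:           vxorps %xmm0, %xmm0, %xmm0   (still zeroes all of ymm0)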

---

For zeroing a ZMM register, using a VEX-encoded xmm instruction instead of an EVEX-encoded one also saves a byte or two, if the target register is zmm0-15.  (zmm16-31 of course always need EVEX.)
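
For example, comparing an EVEX-encoded zeroing instruction against the VEX.128 one (encoding bytes shown only to make the size difference concrete):

        vpxord  %zmm0, %zmm0, %zmm0    # EVEX: 6 bytes (62 f1 7d 48 ef c0)
        vxorps  %xmm0, %xmm0, %xmm0    # VEX.128: 4 bytes (c5 f8 57 c0), still zeroes all of zmm0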

---

There is no benefit, but also no downside, to using xmm-zeroing on Intel CPUs that don't split 256b or 512b vector ops.  This change could be made across the board, without adding any tuning options to control it.

References: 
http://stackoverflow.com/a/43751783/224132 Agner Fog's answer to my SO question about this.
https://bugs.llvm.org/show_bug.cgi?id=32862  the same issue for clang.
Comment 1 Richard Biener 2017-05-05 07:30:31 UTC
Confirmed.  The same possibly applies to all "zero-extending" moves?
Comment 2 Peter Cordes 2017-05-05 23:16:36 UTC
> The same possibly applies to all "zero-extending" moves?

Yes: if a  vmovdqa %xmm0,%xmm1  will work, it's the best choice on AMD CPUs, and it doesn't hurt on Intel CPUs.  That applies to any case where you need to copy a register and the upper lane(s) are known to be zero.

If you're copying precisely in order to zero the upper lane(s) (i.e. you don't know that the source reg's upper lanes are already zero), you don't have a choice.

In general, when all else is equal, use narrower vectors.  (e.g. in a horizontal sum, the first step should be vextractf128 to reduce down to 128b vectors.)
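
For example, a horizontal-sum sketch along those lines (hsum256_ps is just an illustrative helper name, not anything gcc emits):

#include <immintrin.h>

static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);      // low half: no instruction needed
    __m128 hi = _mm256_extractf128_ps(v, 1);    // vextractf128: reduce to 128b right away
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));     // add the high pair onto the low pair
    s = _mm_add_ss(s, _mm_movehdup_ps(s));      // add element 1 onto element 0
    return _mm_cvtss_f32(s);
}

Everything after the first step is 128-bit, so the shuffles and adds stay single-uop even on CPUs that split 256b ops.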

---

Quoting the Bulldozer section of Agner Fog's microarch.pdf (section 18.10 Bulldozer AVX):

> 128-bit register-to-register moves have zero latency, while 256-bit register-to-register
> moves have a latency of 2 clocks plus a penalty of 2-3 clocks for using a different
> domain (see below) on Bulldozer and Piledriver.

---

On Ryzen: for a 256-bit register-to-register move, the low 128-bit lane is renamed with zero latency, but the upper lane needs an execution unit.

Despite this, vectorizing with 256b *is* worth it on Ryzen, because the core is so wide and decodes double-uop instructions efficiently.  Also, AVX 3-operand instructions make moves rarer.

---

On Jaguar: 128b moves (with implicit zeroing of the upper lane) are 1 uop, 256b moves are 2 uops.  128b moves from zeroed registers are eliminated (no execution port, but still have to decode/issue/retire).

David Kanter's writeup (http://www.realworldtech.com/jaguar/4/) explains that the PRF has an "is-zero" bit which can be set efficiently.  This is how 128b moves are able to zero the upper lane of the destination in the rename stage, without using an extra uop.  (It also lets xor-zeroing uops avoid needing an execution port.)
Comment 3 Peter Cordes 2017-05-20 09:29:35 UTC
The point about moves also applies to integer code, since a 64-bit mov requires an extra byte for the REX prefix (unless a REX prefix was already required for r8-r15).

I just noticed a case where gcc uses a 64-bit mov to copy a just-zeroed integer register when setting up for a 16-byte atomic load (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 re: using a narrow load for a single member, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 for a 7.1.0 regression, and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for the store-forwarding stalls from this code with -m32).

// https://godbolt.org/g/xnyI0l
// return the first 8-byte member of a 16-byte atomic object.
#include <atomic>
#include <stdint.h>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;    // non-atomic pointer-to-atomic
    uintptr_t count;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

gcc6.3 -std=gnu++11 -O3 -mcx16 compiles this to

        pushq   %rbx
        xorl    %ecx, %ecx
        xorl    %eax, %eax
        xorl    %edx, %edx
        movq    %rcx, %rbx    ### BAD: should be movl %ecx,%ebx.  Or another xor
        lock cmpxchg16b (%rdi)
        popq    %rbx
        ret

MOVQ is obviously sub-optimal, unless done for padding to avoid NOPs later.
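
For reference, the encoding sizes (AT&T syntax; bytes shown only to make the REX-prefix cost concrete):

        movq    %rcx, %rbx    # 3 bytes (48 89 cb): REX.W needed for the 64-bit operand size
        movl    %ecx, %ebx    # 2 bytes (89 cb): same result here, since %rcx is known zero
        xorl    %ebx, %ebx    # 2 bytes (31 db)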

It's debatable whether %rbx should be zeroed with xorl %ebx,%ebx or movl %ecx,%ebx.

* AMD: copying a zeroed register is always at least as good, sometimes better.
* Intel: xor-zeroing is always best, but on IvB and later copying a zeroed reg is as good most of the time.  (But not in cases where mov %r10d, %ebx would cost a REX and xor %ebx,%ebx wouldn't.)

Unfortunately, -march/-mtune doesn't affect the code-gen either way.  OTOH, there's not much to gain here, and the current strategy of mostly using xor is not horrible for any CPUs.  Just avoiding useless REX prefixes to save code size would be good enough.

But if anyone does care about optimally zeroing multiple registers:

-mtune=bdver1/2/3 should maybe use one xorl and three movl (since integer MOV can run on ports AGU01 as well as EX01, but integer xor-zeroing still takes an execution unit, AFAIK, and can only run on EX01.)  Copying a zeroed register is definitely good for vectors, since vector movdqa is handled at rename with no execution port or latency.

-mtune=znver1 (AMD Ryzen) needs an execution port for integer xor-zeroing (and maybe vector), but integer and vector mov run with no execution port or latency (in the rename stage).  XOR-zeroing one register and copying it (with 32-bit integer or 128-bit vector mov) is clearly optimal.  In http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen3_InstLatX64.txt, mov r32,r32 throughput is 0.2, but integer xor-zeroing throughput is only 0.25.  IDK why vector movdqa throughput isn't 0.2, but the latency data tells us it's handled at rename, which Agner Fog's data confirms.
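
So for the cmpxchg16b setup above, a znver1-tuned sketch would look something like this (register order just for illustration):

        xorl    %ecx, %ecx    # the one xor-zeroing uop (needs an execution port on Ryzen)
        movl    %ecx, %eax    # mov r32,r32: handled at rename, no execution port, zero latency
        movl    %ecx, %edx
        movl    %ecx, %ebx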


-mtune=nehalem and earlier Intel P6-family don't care much: both mov and xor-zeroing use an execution port.  But mov has non-zero latency, so the mov-zeroed registers are ready at the earliest 2 cycles after the xor and mov uops issue.  Also, mov may not preserve the upper-bytes-zeroes property that avoids partial register stalls if you write AL and then read EAX.  Definitely don't MOV a register that was zeroed a long time ago: that will contribute to register-read stalls.  (http://stackoverflow.com/a/41410223/224132).  mov-zeroing is only ok within about 5 cycles of the xor-zeroing.

-mtune=sandybridge should definitely use four XOR-zeroing instructions, because MOV needs an execution unit (and has 1c latency), but xor-zeroing doesn't.   XOR-zeroing also avoids consuming space in the physical register file: http://stackoverflow.com/a/33668295/224132.

-mtune=ivybridge and later Intel shouldn't care most of the time, but xor-zeroing is sometimes better (and never worse):  They can handle integer and SSE MOV instructions in the rename stage with no execution port, the same way they and SnB handle xor-zeroing.  However, mov-zeroing reads more registers, which can be a bottleneck (especially if they're cold?) on HSW/SKL. http://www.agner.org/optimize/blog/read.php?i=415#852.  Apparently mov-elimination isn't perfect, and it sometimes does use an execution port.  IDK when it fails.  Also, a kernel save/restore might leave the zeroed source register no longer in the special zeroed state (pointing to the physical zero-register, so it and its copies don't take up a register-file entry).  So mov-zeroing is likely to be worse in the same cases as Nehalem and earlier: when the source was zeroed a while ago. 


IDK about Silvermont/KNL or Jaguar, except that 64-bit xorq same,same isn't a dependency-breaker on Silvermont/KNL.  Fortunately, gcc always uses 32-bit xor for integer registers.


-mtune=generic might take a balanced approach and zero two or three with XOR (starting with ones that don't need REX prefixes), and use MOVL to copy for the remaining one or two.  Since MOV may help throughput on AMD (by reducing execution-port pressure), and the only significant downside for Intel is on Sandybridge (except for partial-register stuff), it's probably fine to mix in some MOV.
Comment 4 Peter Cordes 2021-06-04 01:18:23 UTC
This seems to be fixed for ZMM vectors in GCC8.  https://gcc.godbolt.org/z/7351be1v4

Seems to have never been a problem for __m256, at least not for 
__m256 zero256(){ return _mm256_setzero_ps(); }
IDK what I was looking at when I originally reported this; maybe just clang, which *did* previously prefer YMM-zeroing.

Some later comments discussed movdqa vs. pxor zeroing choices (and mov vs. xor for integer registers), but the bug title is just about AVX / AVX-512 xor-zeroing, and that seems to be fixed.  So I think this should be closed.