| Summary: | Unnecessary EVEX encoding | | |
|---|---|---|---|
| Product: | gcc | Reporter: | H.J. Lu <hjl.tools> |
| Component: | target | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | RESOLVED DUPLICATE | | |
| Severity: | normal | CC: | marxin, pcordes, ubizjak |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 9.0 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | i386,x86-64 |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | |
Description
H.J. Lu
2019-02-13 22:28:55 UTC

Still present in the pre-10.0.0 trunk as of 20191022. We pessimize vmovdqu/vmovdqa in AVX2 intrinsics and in auto-vectorized code with -march=skylake-avx512 (and -march=native on such machines). Only the VMOVDQU/VMOVDQA load/store/register-copy instructions seem to be affected; we do get the AVX2 VEX vpxor instead of the AVX512VL EVEX vpxord for xor-zeroing and for non-zeroing XOR. (Most other instructions have the same mnemonic for VEX and EVEX, like vpaddd. The affected group also includes FP moves such as VMOVUPS/VMOVUPD.) See https://godbolt.org/z/TEvWiU for an example.

The good options are:

* Use VEX whenever possible instead of AVX512VL, to save code size (a 2- or 3-byte prefix instead of the 4-byte EVEX prefix).
* Avoid the need for vzeroupper by using only x/y/zmm16..31. (There is still a max-turbo penalty, so -mprefer-vector-width=256 remains appropriate for code that doesn't spend a lot of time in vectorized loops.) This might be appropriate for very simple functions or blocks that have only a few SIMD instructions before the next vzeroupper would be needed (e.g. copying or zeroing some memory); it could be competitive on code size as well as saving the 4-uop vzeroupper instruction. VEX instructions can't access x/y/zmm16..31, so this forces an EVEX encoding for everything involving the vector (and rules out using AVX2 and earlier instructions, which may be a problem for KNL, which lacks AVX512VL, unless we narrow to 128 bits in an XMM register).

(Citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly: https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc. It is even safe to do

    vpxor    xmm0, xmm0, xmm0
    vpcmpeqb k0, zmm0, [rdi]

without vzeroupper, although that will reduce max turbo temporarily because it is a 512-bit uop. More frequently useful: zeroing some memory with vpxor xmm zeroing and YMM stores.)

*** This bug has been marked as a duplicate of bug 89229 ***

The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:5358e8f5800daa0012fc9d06705d64bbb21fa07b

commit r10-7054-g5358e8f5800daa0012fc9d06705d64bbb21fa07b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 5 16:45:05 2020 -0800

i386: Properly encode vector registers in vector move

On x86, when AVX and AVX512 are enabled, vector move instructions can be encoded with either 2-byte/3-byte VEX (AVX) or 4-byte EVEX (AVX512):

    0: c5 f9 6f d1          vmovdqa   %xmm1,%xmm2
    4: 62 f1 fd 08 6f d1    vmovdqa64 %xmm1,%xmm2

We prefer VEX encoding over EVEX since VEX is shorter. Also, AVX512F alone supports only 512-bit vector moves; AVX512F plus AVX512VL supports 128-bit and 256-bit vector moves as well. xmm16-xmm31 and ymm16-ymm31 are disallowed in 128-bit and 256-bit modes when AVX512VL is disabled. Mode attributes on x86 vector move patterns indicate target preferences for vector move encoding. For scalar register-to-register moves, we can use 512-bit vector move instructions to move a 32-bit/64-bit scalar if AVX512VL isn't available. With AVX512F and AVX512VL, we should use VEX encoding for 128-bit/256-bit vector moves if the upper 16 vector registers aren't used.

This patch adds a function, ix86_output_ssemov, to generate vector moves:

1. If zmm registers are used, use EVEX encoding.
2. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE or VEX encoding will be generated.
3. If xmm16-xmm31/ymm16-ymm31 registers are used:
   a. With AVX512VL, AVX512VL vector moves will be generated.
   b. Without AVX512VL, an xmm16-xmm31/ymm16-ymm31 register-to-register move will be done with a zmm register move.

There is no need to set the mode attribute to XImode explicitly, since ix86_output_ssemov can properly encode xmm16-xmm31/ymm16-ymm31 registers with and without AVX512VL.

Tested on AVX2 and AVX512 with and without --with-arch=native.

gcc/

    PR target/89229
    PR target/89346
    * config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
    * config/i386/i386.c (ix86_get_ssemov): New function.
    (ix86_output_ssemov): Likewise.
    * config/i386/sse.md (VMOVE:mov<mode>_internal): Call
    ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL check.
    (*movxi_internal_avx512f): Call ix86_output_ssemov for TYPE_SSEMOV.
    (*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
    Remove ext_sse_reg_operand and TARGET_AVX512VL check.
    (*movti_internal): Likewise.
    (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.

gcc/testsuite/

    PR target/89229
    PR target/89346
    * gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
    * gcc.target/i386/pr89229-2a.c: New test.
    * gcc.target/i386/pr89229-2b.c: Likewise.
    * gcc.target/i386/pr89229-2c.c: Likewise.
    * gcc.target/i386/pr89229-3a.c: Likewise.
    * gcc.target/i386/pr89229-3b.c: Likewise.
    * gcc.target/i386/pr89229-3c.c: Likewise.
    * gcc.target/i386/pr89346.c: Likewise.

commit r10-7078-g6733ecaf3fe77871d86bfb36bcda5497ae2aaba7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 8 05:01:03 2020 -0700

gcc.target/i386/pr89229-3c.c: Include "pr89229-3a.c"

    PR target/89229
    PR target/89346
    * gcc.target/i386/pr89229-3c.c: Include "pr89229-3a.c",
    instead of "pr89229-5a.c".