Bug 89346

Summary: Unnecessary EVEX encoding
Product: gcc
Component: target
Status: RESOLVED DUPLICATE
Severity: normal
Priority: P3
Keywords: missed-optimization
Version: 9.0
Target Milestone: ---
Reporter: H.J. Lu <hjl.tools>
Assignee: Not yet assigned to anyone <unassigned>
CC: marxin, pcordes, ubizjak
Target: i386,x86-64

Description H.J. Lu 2019-02-13 22:28:55 UTC
[hjl@gnu-skx-1 gcc]$ cat x.c
#include <immintrin.h>

long long *p;
volatile __m256i yy;

void
foo (void)
{
   _mm256_store_epi64 (p, yy);
}
[hjl@gnu-skx-1 gcc]$ gcc -S -O2 x.c -march=skylake-avx512
[hjl@gnu-skx-1 gcc]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB5168:
	.cfi_startproc
	vmovdqa64	yy(%rip), %ymm0   <<< No need for EVEX.
	movq	p(%rip), %rax
	vmovdqa64	%ymm0, (%rax)     <<< No need for EVEX.
	vzeroupper
	ret
	.cfi_endproc
.LFE5168:
	.size	foo, .-foo
	.comm	yy,32,32
	.comm	p,8,8
	.ident	"GCC: (GNU) 8.2.1 20190209 (Red Hat 8.2.1-8)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 gcc]$
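
For reference, the VEX-encoded output one would expect here (a hand-written sketch, not actual compiler output) is:

    vmovdqa  yy(%rip), %ymm0    # 2- or 3-byte VEX prefix instead of 4-byte EVEX
    movq     p(%rip), %rax
    vmovdqa  %ymm0, (%rax)      # same 256-bit aligned store, shorter encoding
    vzeroupper
    ret

Both forms perform the identical operation; only the instruction encoding (and therefore code size) differs.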
Comment 1 Peter Cordes 2019-10-30 14:28:25 UTC
Still present in pre-10.0.0 trunk 20191022.  We pessimize vmovdqu/a in AVX2 intrinsics and autovectorization with -march=skylake-avx512 (and -march=native on such machines).

It seems only the VMOVDQU/A load/store/register-copy instructions are affected; we correctly get AVX2 VEX vpxor instead of AVX512VL EVEX vpxord for xor-zeroing and for non-zeroing XOR.  (Most other instructions have the same mnemonic for VEX and EVEX, like vpaddd, so the assembler can pick the shorter encoding; this includes FP moves like VMOVUPS/PD.)

(https://godbolt.org/z/TEvWiU for example)
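
A minimal hand-written illustration of that asymmetry (not compiler output; register choices are arbitrary):

    vpxor      %xmm0, %xmm0, %xmm0   # xor-zeroing: GCC already emits the VEX form
    vmovdqa64  %ymm0, (%rax)         # the move is pessimized to EVEX; VEX vmovdqa
                                     # would encode the identical store in fewer bytes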

The good options are: 

* use VEX whenever possible instead of AVX512VL, to save code-size.  (A 2- or 3-byte VEX prefix instead of the 4-byte EVEX prefix.)

* Avoid the need for vzeroupper by using only x/y/zmm16..31.  (Still has a max-turbo penalty so -mprefer-vector-width=256 is still appropriate for code that doesn't spend a lot of time in vectorized loops.)

 This might be appropriate for very simple functions / blocks that only have a few SIMD instructions before the next vzeroupper would be needed.  (e.g. copying or zeroing some memory); could be competitive on code-size as well as saving the 4-uop instruction.

 VEX instructions can't access x/y/zmm16..31, so this forces an EVEX encoding for everything involving the vector (and rules out using AVX2 and earlier instructions, which may be a problem on KNL, which lacks AVX512VL, unless we narrow to 128-bit in an XMM reg).  See the sketch after this list.
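
A minimal sketch of the second option, assuming a hypothetical 32-byte copy with the source address in %rsi and the destination in %rdi:

    copy32:
        vmovdqu64  (%rsi), %ymm16    # EVEX is forced: VEX can't encode ymm16
        vmovdqu64  %ymm16, (%rdi)
        ret                          # no vzeroupper needed: the uppers of
                                     # ymm0..15 are never written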

----

(citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly: https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc - it's even safe to do

    vpxor     xmm0,xmm0,xmm0
    vpcmpeqb  k0, zmm0, [rdi]

without vzeroupper, although that will reduce max turbo *temporarily* because it's a 512-bit uop.)

Or, more frequently useful: zeroing some memory with vpxor xmm zeroing and YMM stores, as in the sketch below.
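
A sketch of that pattern, assuming a hypothetical 64-byte buffer whose address is in %rdi:

    vpxor    %xmm0, %xmm0, %xmm0   # VEX 128-bit xor-zeroing; zero-extends through zmm0
    vmovdqu  %ymm0, (%rdi)         # the 256-bit stores only *read* ymm0 ...
    vmovdqu  %ymm0, 32(%rdi)       # ... so the upper state never becomes dirty
    ret                            # hence no vzeroupper required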
Comment 2 H.J. Lu 2020-01-27 19:01:14 UTC
Dup
Comment 3 H.J. Lu 2020-01-27 19:01:58 UTC
Dup.

*** This bug has been marked as a duplicate of bug 89229 ***
Comment 4 GCC Commits 2020-03-06 00:53:05 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:5358e8f5800daa0012fc9d06705d64bbb21fa07b

commit r10-7054-g5358e8f5800daa0012fc9d06705d64bbb21fa07b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 5 16:45:05 2020 -0800

    i386: Properly encode vector registers in vector move
    
    On x86, when AVX and AVX512 are enabled, vector move instructions can
    be encoded with either 2-byte/3-byte VEX (AVX) or 4-byte EVEX (AVX512):
    
       0:	c5 f9 6f d1          	vmovdqa %xmm1,%xmm2
       4:	62 f1 fd 08 6f d1    	vmovdqa64 %xmm1,%xmm2
    
    We prefer VEX encoding over EVEX since VEX is shorter.  Also AVX512F
    only supports 512-bit vector moves.  AVX512F + AVX512VL supports 128-bit
    and 256-bit vector moves.  xmm16-xmm31 and ymm16-ymm31 are disallowed in
    128-bit and 256-bit modes when AVX512VL is disabled.  Mode attributes on
    x86 vector move patterns indicate target preferences of vector move
    encoding.  For scalar register to register move, we can use 512-bit
    vector move instructions to move 32-bit/64-bit scalar if AVX512VL isn't
    available.  With AVX512F and AVX512VL, we should use VEX encoding for
    128-bit/256-bit vector moves if upper 16 vector registers aren't used.
    This patch adds a function, ix86_output_ssemov, to generate vector moves:
    
    1. If zmm registers are used, use EVEX encoding.
    2. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE or VEX encoding
    will be generated.
    3. If xmm16-xmm31/ymm16-ymm31 registers are used:
       a. With AVX512VL, AVX512VL vector moves will be generated.
       b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
          move will be done with zmm register move.
    
    There is no need to set mode attribute to XImode explicitly since
    ix86_output_ssemov can properly encode xmm16-xmm31/ymm16-ymm31 registers
    with and without AVX512VL.
    
    Tested on AVX2 and AVX512 with and without --with-arch=native.
    
    gcc/
    
    	PR target/89229
    	PR target/89346
    	* config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
    	* config/i386/i386.c (ix86_get_ssemov): New function.
    	(ix86_output_ssemov): Likewise.
    	* config/i386/sse.md (VMOVE:mov<mode>_internal): Call
    	ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
    	check.
    	(*movxi_internal_avx512f): Call ix86_output_ssemov for TYPE_SSEMOV.
    	(*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV.
    	Remove ext_sse_reg_operand and TARGET_AVX512VL check.
    	(*movti_internal): Likewise.
    	(*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV.
    
    gcc/testsuite/
    
    	PR target/89229
    	PR target/89346
    	* gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
    	* gcc.target/i386/pr89229-2a.c: New test.
    	* gcc.target/i386/pr89229-2b.c: Likewise.
    	* gcc.target/i386/pr89229-2c.c: Likewise.
    	* gcc.target/i386/pr89229-3a.c: Likewise.
    	* gcc.target/i386/pr89229-3b.c: Likewise.
    	* gcc.target/i386/pr89229-3c.c: Likewise.
    	* gcc.target/i386/pr89346.c: Likewise.
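
To illustrate the three cases in the commit message above, a hand-assembled sketch (not taken from the patch or its testsuite):

    # Rule 2: no zmm and no xmm16-xmm31/ymm16-ymm31 operands -> SSE/VEX encoding
    vmovdqa    %xmm1, %xmm2      # c5 f9 6f d1 (4 bytes)
    # Rule 3a: upper-16 registers with AVX512VL -> EVEX 128-bit/256-bit move
    vmovdqa64  %xmm17, %xmm18
    # Rule 3b: upper-16 registers without AVX512VL -> whole-register zmm move,
    # since AVX512F alone only supports 512-bit vector moves
    vmovdqa64  %zmm17, %zmm18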
Comment 5 Martin Liška 2020-03-11 12:07:40 UTC
commit r10-7078-g6733ecaf3fe77871d86bfb36bcda5497ae2aaba7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 8 05:01:03 2020 -0700

    gcc.target/i386/pr89229-3c.c: Include "pr89229-3a.c"
    
            PR target/89229
            PR target/89346
            * gcc.target/i386/pr89229-3c.c: Include "pr89229-3a.c", instead
            of "pr89229-5a.c".