This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 04 Oct 2017 01:45:57 +0000
- Subject: [Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.
- Auto-submitted: auto-generated
- References: <bug-82370-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
This doesn't change the performance implications, but I just realized I had the offset load backwards. Instead of

    vpsrlw  $8, (%rsi), %xmm1
    vpand   15(%rsi), %xmm2, %xmm0

this algorithm should use

    vpand   1(%rsi), %xmm2, %xmm0     # ideally with rsi 32B-aligned
    vpsrlw  $8, 16(%rsi), %xmm1
Or (with k1 = 0x5555555555555555):

    vmovdqu8   1(%rsi), %zmm0{k1}{z}   # ALU + load micro-fused
    vmovdqu8  65(%rsi), %zmm1{k1}{z}   # and probably causes CL-split penalties
Like I said, we should probably avoid vmovdqu8 for loads or stores unless we actually use masking. vmovdqu32 or vmovdqu64 is always at least as good. If some future CPU has masked vmovdqu8 without needing an ALU uop, it could be good (but probably only if it also avoids cache-line split penalties).
https://godbolt.org/g/a1U7hf
See also https://github.com/InstLatx64/InstLatx64 for a spreadsheet of Skylake-AVX512 uop->port assignments (it doesn't include masked loads / stores, and it disagrees with IACA for vmovdqu8 zmm stores: IACA says the ZMM version uses an ALU uop even without masking).