This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370

--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
This doesn't change the performance implications, but I just realized I had the
offset-load backwards.  Instead of
        vpsrlw  $8, (%rsi), %xmm1
        vpand   15(%rsi), %xmm2, %xmm0

this algorithm should use
        vpand   1(%rsi), %xmm2, %xmm0     # ideally with rsi 32B-aligned
        vpsrlw  $8, 16(%rsi), %xmm1
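
For reference, a rough C intrinsics version of that corrected pattern.  This is
only a sketch: it assumes the loop is extracting the high byte of each 16-bit
source element (one output byte per input word) with a multiple-of-16 element
count, and the function name plus the final vpackuswb step are illustrative,
not taken from the testcase.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch only: the offset-1 load puts each word's high byte into the low
   byte position, so a vpand with 0x00FF replaces one of the two vpsrlw's. */
void pack_high8_sketch(uint8_t *restrict dst, const uint16_t *restrict src,
                       size_t n)
{
    const __m128i lowbyte = _mm_set1_epi16(0x00FF);
    for (size_t i = 0; i < n; i += 16) {
        const char *p = (const char *)(src + i);
        /* vpand   1(%rsi), %xmm2, %xmm0 */
        __m128i lo = _mm_and_si128(
            _mm_loadu_si128((const __m128i *)(p + 1)), lowbyte);
        /* vpsrlw  $8, 16(%rsi), %xmm1 */
        __m128i hi = _mm_srli_epi16(
            _mm_loadu_si128((const __m128i *)(p + 16)), 8);
        /* both halves now hold the wanted byte in the low byte of each
           word, so an unsigned pack narrows them to 16 output bytes */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
}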

Or (with k1 = 0x5555555555555555)
        vmovdqu8    1(%rsi),  %zmm0{k1}{z}   # ALU + load micro-fused
        vmovdqu8    65(%rsi), %zmm1{k1}{z}   # and probably causes CL-split penalties

Like I said, we should probably avoid vmovdqu8 for loads or stores unless we
actually use masking.  vmovdqu32 or vmovdqu64 is always at least as good.  If some
future CPU implements masked vmovdqu8 without needing an ALU uop, it could be
worth using (but probably only if it also avoids cache-line-split penalties).
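
A similarly rough sketch of that masked-load variant, using zero-masking byte
loads so each word already holds the wanted byte zero-extended.  Assumes
AVX512BW and a multiple-of-64 element count; the vpmovwb narrowing step is just
one possible way to finish, not necessarily what the loop would actually use.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch only: k = 0x5555... keeps one byte per 16-bit word and zeros the
   other, so no shift or vpand is needed -- but on SKX the masked vmovdqu8
   loads cost an ALU uop, and the +1 / +65 offsets likely cause
   cache-line-split penalties, as noted above. */
void pack_high8_maskz_sketch(uint8_t *restrict dst,
                             const uint16_t *restrict src, size_t n)
{
    const __mmask64 keep = 0x5555555555555555ULL;
    for (size_t i = 0; i < n; i += 64) {
        const char *p = (const char *)(src + i);
        /* vmovdqu8    1(%rsi),  %zmm0{k1}{z} */
        __m512i lo = _mm512_maskz_loadu_epi8(keep, p + 1);
        /* vmovdqu8    65(%rsi), %zmm1{k1}{z}
           (the top byte of this load is masked off, so it can't fault even
           though 65+63 reaches one byte past the 128-byte block) */
        __m512i hi = _mm512_maskz_loadu_epi8(keep, p + 65);
        /* narrow each word to its low byte (vpmovwb) */
        _mm256_storeu_si256((__m256i *)(dst + i),
                            _mm512_cvtepi16_epi8(lo));
        _mm256_storeu_si256((__m256i *)(dst + i + 32),
                            _mm512_cvtepi16_epi8(hi));
    }
}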

https://godbolt.org/g/a1U7hf

See also https://github.com/InstLatx64/InstLatx64 for a spreadsheet of
Skylake-AVX512 uop->port assignments.  It doesn't include masked loads /
stores, though, and it disagrees with IACA for vmovdqu8 zmm stores (IACA says
the ZMM version uses an ALU uop even without masking).
