[Bug target/92246] New: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Mon Oct 28 00:10:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246
Bug ID: 92246
Summary: Byte or short array reverse loop auto-vectorized with
3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
typedef short swapt;
void strrev_explicit(swapt *head, long len)
{
    swapt *tail = head + len - 1;
    for( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}
g++ -O3 -march=skylake-avx512
(Compiler-Explorer-Build) 10.0.0 20191022 (experimental)
https://godbolt.org/z/LS34w9
...
.L4:
vmovdqu16 (%rdx), %ymm1
vmovdqu16 (%rax), %ymm0
vmovdqa64 %ymm1, %ymm3 # useless copy
vpermt2w %ymm1, %ymm2, %ymm3
vmovdqu16 %ymm3, (%rax)
vpermt2w %ymm0, %ymm2, %ymm0
addq $32, %rax
vmovdqu16 %ymm0, (%rcx)
subq $32, %rdx
subq $32, %rcx # two tail pointers, PR 92244 is unrelated to this
cmpq %rsi, %rax
jne .L4
vpermt2w ymm is 3 uops on SKX and CannonLake: 2p5 + p015
(https://www.uops.info/table.html)
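In the loop above, both vpermt2w instructions are given the same data in both tables (ymm3 is a copy of ymm1, and ymm0 is both destination and second source), so the table-select bit of each index cannot matter. A minimal scalar sketch of the two shuffles, assuming a reversing index vector of 15..0 (the actual contents of ymm2 are not shown in the asm) and using hypothetical helper names:

```c
#include <stdint.h>

#define N 16  /* number of 16-bit elements in a 256-bit ymm register */

/* Scalar model of vpermw ymm: one table, low 4 index bits pick the element. */
static void model_vpermw(uint16_t dst[N], const uint16_t idx[N],
                         const uint16_t src[N])
{
    uint16_t tmp[N];
    for (int i = 0; i < N; i++)
        tmp[i] = src[idx[i] & (N - 1)];
    for (int i = 0; i < N; i++)
        dst[i] = tmp[i];
}

/* Scalar model of vpermt2w ymm: index bit 4 picks the table
   (0 = old contents of the destination, 1 = second source),
   low 4 bits pick the element. dst is both input and output. */
static void model_vpermt2w(uint16_t dst[N], const uint16_t idx[N],
                           const uint16_t src2[N])
{
    uint16_t tmp[N];
    for (int i = 0; i < N; i++) {
        const uint16_t *table = (idx[i] & N) ? src2 : dst;
        tmp[i] = table[idx[i] & (N - 1)];
    }
    for (int i = 0; i < N; i++)
        dst[i] = tmp[i];
}

/* Returns 1 if both models produce the same reversed vector. */
int reverse_models_agree(void)
{
    uint16_t data[N], idx[N], a[N], b[N];
    for (int i = 0; i < N; i++) {
        data[i] = (uint16_t)(100 + i);
        idx[i] = (uint16_t)(N - 1 - i);  /* assumed index vector: 15..0 */
        b[i] = data[i];                  /* like the vmovdqa64 copy */
    }
    model_vpermw(a, idx, data);
    model_vpermt2w(b, idx, data);        /* same data in both tables */
    for (int i = 0; i < N; i++)
        if (a[i] != b[i] || a[i] != data[N - 1 - i])
            return 0;
    return 1;
}
```

With identical tables, the two-table shuffle degenerates to the one-table shuffle, which is why the single-source instruction is sufficient here.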
Obviously better would be vpermw (%rax), %ymm2, %ymm0.
vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus
a load if we use a memory source. SKX still bottlenecks on 2p5 for vpermw,
so we'd only save the p015 uop there, but in general fewer uops is better.
On CannonLake, though, vpermw runs on p01 + p5 (plus p23 with a memory source).
uops.info doesn't have IceLake-client data yet, but vpermw throughput on IceLake
is 1/clock, vs. 1 per 2 clocks for vpermt2w, so this could double throughput on
CNL and ICL.
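For reference, a hand-written sketch of the single-source shuffle using intrinsics. This is not what GCC emits and not a proposed patch, just an illustration of the desired instruction selection for one 16-element chunk. The function names are hypothetical, and the runtime check uses the GCC/Clang builtin __builtin_cpu_supports so the example can be compiled without -march flags:

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Reverse 16 shorts with one single-source word permute (vpermw).
   Needs AVX512BW + AVX512VL at run time. */
__attribute__((target("avx512bw,avx512vl")))
static void reverse16_vpermw(int16_t *dst, const int16_t *src)
{
    /* setr: first argument is element 0, so element 0 reads src[15]. */
    const __m256i idx = _mm256_setr_epi16(15, 14, 13, 12, 11, 10, 9, 8,
                                          7, 6, 5, 4, 3, 2, 1, 0);
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    v = _mm256_permutexvar_epi16(idx, v);   /* vpermw */
    _mm256_storeu_si256((__m256i *)dst, v);
}
#endif

/* Returns 1 and writes the reversed chunk to dst if the CPU supports
   the required extensions; returns 0 and leaves dst untouched otherwise. */
int try_reverse16(int16_t *dst, const int16_t *src)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx512bw") &&
        __builtin_cpu_supports("avx512vl")) {
        reverse16_vpermw(dst, src);
        return 1;
    }
#endif
    (void)dst;
    (void)src;
    return 0;
}
```

Whether the compiler folds the load into a memory operand of vpermw is up to its register allocation and instruction selection; the sketch only pins down the shuffle itself.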
We have exactly the same problem with AVX512VBMI: vpermt2b is used instead of
vpermb when targeting ICL:
g++ -O3 -march=icelake-client -mprefer-vector-width=512