typedef short swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for ( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

g++ -O3 -march=skylake-avx512
  (Compiler-Explorer-Build) 10.0.0 20191022 (experimental)

https://godbolt.org/z/LS34w9

...
.L4:
        vmovdqu16       (%rdx), %ymm1
        vmovdqu16       (%rax), %ymm0
        vmovdqa64       %ymm1, %ymm3        # useless copy
        vpermt2w        %ymm1, %ymm2, %ymm3
        vmovdqu16       %ymm3, (%rax)
        vpermt2w        %ymm0, %ymm2, %ymm0
        addq    $32, %rax
        vmovdqu16       %ymm0, (%rcx)
        subq    $32, %rdx
        subq    $32, %rcx        # two tail pointers, PR 92244 is unrelated to this
        cmpq    %rsi, %rax
        jne     .L4

vpermt2w ymm is 3 uops on SKX and CannonLake: 2p5 + p015 (https://www.uops.info/table.html)

Obviously better would be vpermw (%rax), %ymm2, %ymm0. Both table operands of each vpermt2w above hold the same data, so a single-source shuffle is enough.

vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus a load if we use a memory source. SKX still bottlenecks on 2p5 for vpermw, losing only the p015 uop, but in general fewer uops is better.

On CannonLake, however, vpermw runs on p01 + p5 (plus p23 with a memory source).

uops.info doesn't have IceLake-client data yet, but vpermw throughput on IceLake is 1/clock, vs. 1 per 2 clocks for vpermt2w, so this could double throughput on CNL and ICL.

We have exactly the same problem with AVX512VBMI vpermt2b over vpermb with ICL:
g++ -O3 -march=icelake-client -mprefer-vector-width=512

And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long. This problem only applies to char and short. Possibly because AVX2 includes vpermd ymm.

----

Apparently CannonLake has 1 uop vpermb but 2 uop vpermw, according to real testing on real hardware by https://uops.info/. Their automated test methods are generally reliable.

That seems to be true for Ice Lake, too, so when AVX512VBMI is available we should be using vpermb any time we might have used vpermw with a compile-time-constant control vector.

(vpermw requires AVX512BW, e.g. SKX and Cascade Lake. vpermb requires AVX512VBMI, only Ice Lake and the mostly aborted CannonLake.)

InstLat provides some confirmation:
https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt
shows vpermb at 3 cycle latency, but vpermw at 4 cycle latency (presumably a chain of 2 uops, 1c and 3c being the standard latencies that exist in recent Intel CPUs). InstLat doesn't document which input the dep chain goes through, so it's not 100% confirmation of only 1 uop. But it's likely that ICL has 1 uop vpermb given that CNL definitely does.

uops.info lists latencies separately from each input to the result, sometimes letting us figure out that e.g. one of the inputs isn't needed until the 2nd uop. That seems to be the case for CannonLake vpermw: latency from one of the inputs is only 3 cycles, from the other 4.
https://www.uops.info/html-lat/CNL/VPERMW_YMM_YMM_YMM-Measurements.html
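
To illustrate why the single-source shuffle suffices for this loop, here is a minimal scalar sketch in C. It models only the data movement of the ymm forms of the two instructions as I understand them from Intel's documentation (vpermw uses the low 4 bits of each index; vpermt2w uses bit 4 to pick between its two tables), not their timing. The names model_vpermw, model_vpermt2w and models_agree_on_reverse are hypothetical helpers invented for this example, not anything in GCC.

```c
#include <stdint.h>
#include <string.h>

#define N 16  /* number of 16-bit elements in a 256-bit ymm register */

/* Hypothetical scalar model of single-source vpermw (ymm form):
   dst[i] = src[idx[i] & 15]. */
void model_vpermw(uint16_t dst[N], const uint16_t idx[N], const uint16_t src[N])
{
    uint16_t tmp[N];
    for (int i = 0; i < N; ++i)
        tmp[i] = src[idx[i] & (N - 1)];
    memcpy(dst, tmp, sizeof tmp);
}

/* Hypothetical scalar model of two-source vpermt2w (ymm form):
   bit 4 of each index selects table b, the low 4 bits select the element.
   The result overwrites table a, as the instruction overwrites its
   destination register. */
void model_vpermt2w(uint16_t a[N], const uint16_t idx[N], const uint16_t b[N])
{
    uint16_t tmp[N];
    for (int i = 0; i < N; ++i) {
        unsigned sel = idx[i] & (2 * N - 1);
        tmp[i] = (sel & N) ? b[sel & (N - 1)] : a[sel & (N - 1)];
    }
    memcpy(a, tmp, sizeof tmp);
}

/* Returns 1 if both models produce the same element-reversed vector
   when vpermt2w is given the same data in both tables, else 0. */
int models_agree_on_reverse(void)
{
    uint16_t src[N], idx[N], one[N], two[N];
    for (int i = 0; i < N; ++i) {
        src[i] = (uint16_t)(1000 + i);   /* arbitrary test data */
        idx[i] = (uint16_t)(N - 1 - i);  /* reversal control vector */
    }
    model_vpermw(one, idx, src);

    memcpy(two, src, sizeof src);        /* mirrors the "useless copy" */
    model_vpermt2w(two, idx, src);

    for (int i = 0; i < N; ++i)
        if (one[i] != two[i] || one[i] != src[N - 1 - i])
            return 0;
    return 1;
}
```

Calling models_agree_on_reverse() from a test harness should return 1. This only shows that the two shuffles are interchangeable when both vpermt2w tables hold the same vector; it says nothing about what GCC's pattern selection would need to change.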