Bug 92246 - Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2019-10-28 00:10 UTC by Peter Cordes
Modified: 2020-01-29 14:47 UTC (History)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-01-29 00:00:00


Description Peter Cordes 2019-10-28 00:10:29 UTC
typedef short swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

g++ -O3 -march=skylake-avx512
  (Compiler-Explorer-Build) 10.0.0 20191022 (experimental)

https://godbolt.org/z/LS34w9

        ...
.L4:
        vmovdqu16       (%rdx), %ymm1
        vmovdqu16       (%rax), %ymm0
        vmovdqa64       %ymm1, %ymm3        # useless copy
        vpermt2w        %ymm1, %ymm2, %ymm3
        vmovdqu16       %ymm3, (%rax)
        vpermt2w        %ymm0, %ymm2, %ymm0
        addq    $32, %rax
        vmovdqu16       %ymm0, (%rcx)
        subq    $32, %rdx
        subq    $32, %rcx       # two tail pointers, PR 92244 is unrelated to this
        cmpq    %rsi, %rax
        jne     .L4

vpermt2w ymm is 3 uops on SKX and CannonLake:  2p5 + p015 (https://www.uops.info/table.html)

Obviously better would be  vpermw (%rax), %ymm2, %ymm0.
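To illustrate why the one-input permute is enough here, the following is a minimal scalar sketch of the lane selection for the 256-bit forms of vpermw and vpermt2w, written from the documented operation of the two instructions. The function names and the 15-i reversal control vector are assumptions for illustration; the actual contents of %ymm2 are not shown in the asm above. When both tables of vpermt2w are the same register, as in GCC's output, the table-select bit of each index cannot change the result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 16 };  /* 16-bit elements in one 256-bit ymm register */

/* Scalar model of vpermw: each result lane picks src[idx & 15]. */
void model_vpermw(uint16_t dst[LANES], const uint16_t idx[LANES],
                  const uint16_t src[LANES])
{
    uint16_t tmp[LANES];
    for (int i = 0; i < LANES; ++i)
        tmp[i] = src[idx[i] & (LANES - 1)];
    memcpy(dst, tmp, sizeof tmp);
}

/* Scalar model of vpermt2w: index bit 4 chooses table 2 over table 1,
   and the low 4 bits pick the lane within the chosen table. */
void model_vpermt2w(uint16_t dst[LANES], const uint16_t idx[LANES],
                    const uint16_t tbl1[LANES], const uint16_t tbl2[LANES])
{
    uint16_t tmp[LANES];
    for (int i = 0; i < LANES; ++i) {
        const uint16_t *tbl = (idx[i] & LANES) ? tbl2 : tbl1;
        tmp[i] = tbl[idx[i] & (LANES - 1)];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Returns 1 if both models reverse a test vector identically. */
int reverse_models_agree(void)
{
    uint16_t src[LANES], idx[LANES], a[LANES], b[LANES];
    for (int i = 0; i < LANES; ++i) {
        src[i] = (uint16_t)(100 + i);
        idx[i] = (uint16_t)(LANES - 1 - i);  /* assumed reversal control */
    }
    model_vpermw(a, idx, src);
    model_vpermt2w(b, idx, src, src);        /* same data as both tables */
    assert(a[0] == 115 && a[LANES - 1] == 100);
    return memcmp(a, b, sizeof a) == 0;
}
```

This only models which element lands in which lane; it says nothing about uop counts or ports, which come from the uops.info measurements.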

vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus a load if we use a memory source.  SKX still bottlenecks on 2p5 for vpermw, losing only the p015 uop, but in general fewer uops is better.

But on CannonLake, vpermw runs on p01 + p5 (plus p23 with a memory source).

uops.info doesn't have IceLake-client data yet, but vpermw throughput on IceLake is 1 per clock, vs. 1 per 2 clocks for vpermt2w, so this could double throughput on CNL and ICL.

We have exactly the same problem with AVX512VBMI on ICL, where GCC uses vpermt2b instead of vpermb:
g++ -O3 -march=icelake-client -mprefer-vector-width=512
Comment 1 Peter Cordes 2019-10-28 02:02:49 UTC
And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long.  This problem only applies to char and short.  Possibly because AVX2 includes vpermd ymm.

----

Apparently CannonLake has 1 uop vpermb but 2 uop vpermw, according to real testing on real hardware by https://uops.info/.  Their automated test methods are generally reliable.

That seems to be true for Ice Lake, too, so when AVX512VBMI is available we should be using vpermb any time we might have used vpermw with a compile-time-constant control vector.
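Replacing vpermw with vpermb needs the constant word control vector rewritten as a byte control vector. A minimal scalar sketch of that conversion, assuming 256-bit vectors and hypothetical helper names (this is not GCC's implementation): destination word j takes both bytes of source word w, so its two byte indices are 2*w and 2*w+1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vpermb on a 256-bit vector: dst[i] = src[idx[i] & 31]. */
void model_vpermb(uint8_t dst[32], const uint8_t idx[32],
                  const uint8_t src[32])
{
    uint8_t tmp[32];
    for (int i = 0; i < 32; ++i)
        tmp[i] = src[idx[i] & 31];
    memcpy(dst, tmp, sizeof tmp);
}

/* Expand a 16-lane word control vector into a 32-lane byte control
   vector: destination word j takes both bytes of source word widx[j]. */
void word_ctrl_to_byte_ctrl(uint8_t bidx[32], const uint16_t widx[16])
{
    for (int j = 0; j < 16; ++j) {
        unsigned w = widx[j] & 15u;
        bidx[2 * j]     = (uint8_t)(2 * w);
        bidx[2 * j + 1] = (uint8_t)(2 * w + 1);
    }
}

/* Returns 1 if the byte permute reproduces a 16-bit element reversal. */
int byte_permute_reverses_words(void)
{
    uint16_t src[16], widx[16], out[16];
    uint8_t bsrc[32], bidx[32], bout[32];
    for (int i = 0; i < 16; ++i) {
        src[i]  = (uint16_t)(0x1100 + i);
        widx[i] = (uint16_t)(15 - i);        /* assumed reversal control */
    }
    word_ctrl_to_byte_ctrl(bidx, widx);
    assert(bidx[0] == 30 && bidx[1] == 31);  /* word 15 -> bytes 30, 31 */
    memcpy(bsrc, src, sizeof bsrc);
    model_vpermb(bout, bidx, bsrc);
    memcpy(out, bout, sizeof out);
    for (int i = 0; i < 16; ++i)
        if (out[i] != src[15 - i])
            return 0;
    return 1;
}
```

Because the control vector is a compile-time constant, this expansion would cost nothing at run time; the byte order within each word is preserved, so the conversion does not depend on endianness.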


(vpermw requires AVX512BW, e.g. SKX and Cascade Lake.  vpermb requires AVX512VBMI, only Ice Lake and the mostly aborted CannonLake.)

InstLat provides some confirmation: https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt  shows vpermb at 3 cycle latency, but vpermw at 4 cycle latency (presumably a chain of 2 uops, 1c and 3c being the standard latencies that exist in recent Intel CPUs).  InstLat doesn't document which input the dep chain goes through, so it's not 100% confirmation of only 1 uop.  But it's likely that ICL has 1 uop vpermb given that CNL definitely does.

uops.info lists latencies separately from each input to the result, sometimes letting us figure out that e.g. one of the inputs isn't needed until the 2nd uop.  Seems to be the case for CannonLake vpermw: latency from one of the inputs is only 3 cycles, the other is 4.  https://www.uops.info/html-lat/CNL/VPERMW_YMM_YMM_YMM-Measurements.html