- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Sun, 30 Apr 2017 06:37:22 +0000
- Subject: [Bug target/80570] New: auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570
Bug ID: 80570
Summary: auto-vectorizing int->double conversion should use
half-width memory operands to avoid shuffles, instead
of load+extract
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
When auto-vectorizing int->double conversion, gcc loads a full-width vector
into a register and then unpacks the upper half to feed (v)cvtdq2pd; e.g. with
AVX, we get a 256b load and then a vextracti128.
It's even worse with an unaligned src pointer with
-mavx256-split-unaligned-load, where it does vinsertf128 -> vextractf128,
without ever doing anything with the full 256b vector!
On Intel SnB-family CPUs, this will bottleneck the loop on port5 throughput,
because VCVTDQ2PD reg -> reg needs a port5 uop as well as a port1 uop. (And
vextracti128 can only run on the shuffle unit on port5).
VCVTDQ2PD with a memory source operand doesn't need the shuffle port at all on
Intel Haswell and later, just the FP-add unit and a load, so it's a much better
choice. (Throughput of one per clock on Sandybridge and Haswell, 2 per clock
on Skylake.) According to Agner Fog's testing it's still 2 fused-domain uops
(or 3 on SnB), though, so I guess the load doesn't micro-fuse.
I'm pretty sure using twice as many half-width memory operands is no worse on
other AVX CPUs either (AMD BD-family, Zen, or KNL) than full-width loads plus
extracting the high half.
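For comparison, here's a minimal intrinsics sketch (mine, not compiler output;
the function name is illustrative) of the codegen this report is asking for:
two half-width loads feeding vcvtdq2pd's ymm, xmm/m128 form directly, with no
extract. It assumes AVX and the same 64-byte-aligned pointers as the test case
below.

#include <immintrin.h>

// Hypothetical illustration: half-width (128b) loads feed
// _mm256_cvtepi32_pd (vcvtdq2pd ymm, xmm/m128), so no vextractf128
// shuffle uop is needed. Assumes dp/ip are 64-byte aligned.
void cvti32f64_halfloads(double *dp, const int *ip) {
  for (int i = 0; i < 10000; i += 8) {
    __m128i lo = _mm_load_si128((const __m128i *)(ip + i));     // 128b load
    __m128i hi = _mm_load_si128((const __m128i *)(ip + i + 4)); // 128b load
    _mm256_store_pd(dp + i,     _mm256_cvtepi32_pd(lo));
    _mm256_store_pd(dp + i + 4, _mm256_cvtepi32_pd(hi));
  }
}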
void cvti32f64_loop(double *dp, int *ip) {
  // ICC avoids the mistake when it doesn't emit a prologue to align the pointers
#ifdef __GNUC__
  dp = __builtin_assume_aligned(dp, 64);
  ip = __builtin_assume_aligned(ip, 64);
#endif
  for (int i=0; i<10000 ; i++) {
    double tmp = ip[i];
    dp[i] = tmp;
  }
}
https://godbolt.org/g/329C3P
gcc.godbolt.org's "gcc7" snapshot: g++ (GCC-Explorer-Build) 8.0.0 20170429
(experimental)
gcc -O3 -march=sandybridge
cvti32f64_loop:
        xorl    %eax, %eax
.L2:
        vmovdqa (%rsi,%rax), %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rdi,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rdi,%rax,2)
        addq    $32, %rax
        cmpq    $40000, %rax
        jne     .L2
        vzeroupper
        ret
gcc does the same thing for -march=haswell, but uses vextracti128. This is
obviously really silly.
For comparison, clang 4.0 -O3 -march=sandybridge -fno-unroll-loops emits:
        xorl    %eax, %eax
.LBB0_1:
        vcvtdq2pd       (%rsi,%rax,4), %ymm0
        vmovaps %ymm0, (%rdi,%rax,8)
        addq    $4, %rax
        cmpq    $10000, %rax            # imm = 0x2710
        jne     .LBB0_1
        vzeroupper
        retq
This should come close to one 256b store per clock (on Haswell), even with
unrolling disabled.
----
With -march=nehalem, gcc gets away with it for this simple not-unrolled loop
(without hurting throughput I think), but only because this strategy
effectively unrolls the loop (doing two stores per add + cmp/jne), and Nehalem
can run shuffles on two execution ports (so the pshufd can run on port1, while
the cvtdq2pd can run on ports 1+5). So it's 10 fused-domain uops per 2 stores
instead of 5 per 1 store. Depending on how the loop buffer handles
non-multiple-of-4 uop counts, this might be a wash. (Of course, with any other
work in the loop, or with unrolling, the memory-operand strategy is much
better).
CVTDQ2PD's memory operand is only 64 bits, so even the non-AVX version doesn't
fault if misaligned.
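To illustrate that (a sketch of mine, not from the report): _mm_loadl_epi64 is
a 64-bit movq load with no alignment requirement, matching cvtdq2pd's m64
source operand; the function name is illustrative.

#include <emmintrin.h>

// Sketch: SSE2 int->double conversion touching only 64 bits of the int
// array per conversion. _mm_loadl_epi64 (movq) can be folded into
// cvtdq2pd's m64 operand and never faults on misalignment.
// Assumes n is even.
void cvti32f64_sse2(double *dp, const int *ip, int n) {
  for (int i = 0; i < n; i += 2) {
    __m128i v = _mm_loadl_epi64((const __m128i *)(ip + i)); // 64-bit load
    _mm_storeu_pd(dp + i, _mm_cvtepi32_pd(v)); // cvtdq2pd xmm, xmm/m64
  }
}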
------
It's even more horrible without aligned pointers, when the sandybridge version
(which splits unaligned 256b loads/stores) uses vinsertf128 to emulate a 256b
load, and then does vextractf128 right away:
inner_loop:     # gcc8 -march=sandybridge without __builtin_assume_aligned
        vmovdqu (%r8,%rax), %xmm0
        vinsertf128     $0x1, 16(%r8,%rax), %ymm0, %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rcx,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rcx,%rax,2)
This is obviously really really bad, and should probably be checked for and
avoided in case there are things other than int->double autovec that could lead
to doing this.
---
With -march=skylake-avx512, gcc does the AVX512 version of the same thing: a
zmm load and then extracting the upper 256b.
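The analogous fix for AVX512 (again my sketch, assuming AVX512F, 64-byte
aligned pointers, and n a multiple of 8; names are illustrative) feeds
vcvtdq2pd's zmm, ymm/m256 form from a 256b memory source:

#include <immintrin.h>

// Sketch: a 256b load feeds _mm512_cvtepi32_pd (vcvtdq2pd zmm, ymm/m256)
// directly, avoiding the zmm load + 256b extract. Assumes 64-byte
// alignment and n % 8 == 0.
void cvti32f64_avx512(double *dp, const int *ip, int n) {
  for (int i = 0; i < n; i += 8) {
    __m256i v = _mm256_load_si256((const __m256i *)(ip + i)); // 256b load
    _mm512_store_pd(dp + i, _mm512_cvtepi32_pd(v));
  }
}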