This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/80571] New: AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Sun, 30 Apr 2017 08:29:54 +0000
- Subject: [Bug target/80571] New: AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571
Bug ID: 80571
Summary: AVX allows multiple vcvtsi2ss/sd (integer ->
float/double) to reuse a single dep-breaking vxorps,
even hoisting it out of loops
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
See also more discussion on a clang bug about what's optimal for scalar int->fp
conversion (https://bugs.llvm.org/show_bug.cgi?id=22024#c11).
The important point is that using vcvtsi2sd/ss with a source vector register
different from the destination register allows reusing the same "safe" source
to avoid false dependencies, so one vxorps can serve multiple conversions (and
can even be hoisted out of loops).
// see https://godbolt.org/g/c9k4SH for gcc8, clang4, icc17, and MSVC CL
void cvti64f64_loop(double *dp, long long *ip) {
    for (int i = 0; i < 1000; i++) {
        dp[i] = ip[i];
        // int64 can't vectorize without AVX512
    }
}
compiles with gcc6.3.1 and gcc8 20170429 -O3 -march=haswell:
cvti64f64_loop:
        xorl    %eax, %eax
.L6:
        vxorpd  %xmm0, %xmm0, %xmm0
        vcvtsi2sdq      (%rsi,%rax), %xmm0, %xmm0
        vmovsd  %xmm0, (%rdi,%rax)
        addq    $8, %rax
        cmpq    $8000, %rax
        jne     .L6
        ret
But it could compile to
        xorl    %eax, %eax
        vxorpd  %xmm1, %xmm1, %xmm1     #### Hoisted out of the loop
.L6:
        vcvtsi2sdq      (%rsi,%rax), %xmm1, %xmm0
        vmovsd  %xmm0, (%rdi,%rax)
        addq    $8, %rax
        cmpq    $8000, %rax
        jne     .L6
        ret
----
This trick requires AVX, of course.
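The hoisted form can also be expressed at the source level with SSE2 intrinsics. This is a minimal sketch (the function name `cvti64f64_loop_hoisted` is mine, not from the bug report), assuming x86-64; note that a compiler is free to re-zero inside the loop, so this only illustrates the intent, not guaranteed codegen:

```c
#include <immintrin.h>

void cvti64f64_loop_hoisted(double *dp, const long long *ip) {
    __m128d zero = _mm_setzero_pd();             // one vxorpd, outside the loop
    for (int i = 0; i < 1000; i++) {
        // vcvtsi2sdq merging into the zeroed reg: no false dependency
        __m128d v = _mm_cvtsi64_sd(zero, ip[i]);
        dp[i] = _mm_cvtsd_f64(v);
    }
}
```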
If the loop needs a vector constant, you can use that instead of a
specially-zeroed register. (The upper elements don't have to be 0.0 for ...ss
instructions, and the x86-64 SysV ABI allows passing scalar float/double args
with garbage (not zeros) in the high bytes. This already happens in some code,
so library implementers already have to avoid assuming that high elements are
always clear, even with hand-written asm).
clang already hoists vxorps out of loops, but only by looking for "dead"
registers, not by reusing a reg holding a constant. (e.g. enabling the
multiplies in the godbolt link so that all the xmm regs are used up gets clang
to put a vxorps in the loop instead of merging into a reg holding one of the
constants.)
Using a constant has the downside that loading it might cache-miss, delaying a
bunch of loop-setup work (and loop setup is where int->float conversion is
probably most common). So maybe it would be best to vxorps-zero a register and
use it as the merge source for int->FP conversion when several other
instructions depend on the conversion result before anything depends on the
constant. The zeroed register can later have a constant (or whatever) loaded
into it, and that register can then serve as the safe source inside the loop.
Or if there's a constant that's about to be used with the result of the int->fp
conversion, it's maybe not bad to load that constant and then use that xmm reg
as the merge-target for a scalar conversion. If OOO execution can do the load
early (and it doesn't miss in cache), then there's no extra latency. Or if it
does miss, then there's only an extra cycle or two of latency for that
dependency chain vs. spending an instruction on vxorpd-zeroing a target to
convert into, separate from loading the constant.
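As an intrinsics sketch of that idea (the function name and the 2.5 scale factor are hypothetical, chosen only to illustrate the register reuse): the register holding a constant that the result needs anyway doubles as the merge target, so no separate xor-zeroing instruction is required.

```c
#include <immintrin.h>

double scale_int64(long long x) {
    __m128d scale = _mm_set_sd(2.5);       // constant load (vmovsd)
    __m128d v = _mm_cvtsi64_sd(scale, x);  // merge into the constant's register
    return _mm_cvtsd_f64(v) * _mm_cvtsd_f64(scale);
}
```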
----
This trick works for non-loops, of course:
void cvti32f32(float *A, float *B, int x, int y) {
*B = y;
*A = x;
}
currently compiles to (-O3 -march=haswell)
cvti32f32:
        vxorps  %xmm0, %xmm0, %xmm0
        vcvtsi2ss       %ecx, %xmm0, %xmm0
        vmovss  %xmm0, (%rsi)
        vxorps  %xmm0, %xmm0, %xmm0
        vcvtsi2ss       %edx, %xmm0, %xmm0
        vmovss  %xmm0, (%rdi)
But could compile (with no loss of safety against false deps) to
cvti32f32:
        vxorps  %xmm1, %xmm1, %xmm1     # reused for both
        vcvtsi2ss       %ecx, %xmm1, %xmm0
        vmovss  %xmm0, (%rsi)
        vcvtsi2ss       %edx, %xmm1, %xmm0      # can go into xmm1 if we want both in regs at once
        vmovss  %xmm0, (%rdi)
        ret
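The same reuse pattern can be written with SSE1 intrinsics (function name `cvti32f32_reuse` is mine; whether a compiler emits exactly the asm above is not guaranteed):

```c
#include <immintrin.h>

void cvti32f32_reuse(float *A, float *B, int x, int y) {
    __m128 zero = _mm_setzero_ps();               // one vxorps for both converts
    *B = _mm_cvtss_f32(_mm_cvtsi32_ss(zero, y));  // cvtsi2ss merging into zero
    *A = _mm_cvtss_f32(_mm_cvtsi32_ss(zero, x));
}
```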
Or, with the same amount of safety and fewer fused-domain uops on Intel
SnB-family CPUs, and not requiring AVX (but of course works fine with AVX):
cvti32f32:
        movd    %ecx, %xmm0     # breaks deps on xmm0
        cvtdq2ps        %xmm0, %xmm0    # 1 uop for port1
        movss   %xmm0, (%rsi)
        movd    %edx, %xmm0
        cvtdq2ps        %xmm0, %xmm0
        movss   %xmm0, (%rdi)
        ret
But this is only good for 32-bit integer -> float, not double. (Because
cvtdq2pd also takes a port5 uop, unlike conversion to ps). The code-size is
also slightly larger, taking 2 instructions per conversion instead of 1 for AVX
when a safe register to merge into is already available.
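For reference, the movd + packed-convert alternative maps onto SSE2 intrinsics like this (function name is mine, a sketch rather than proposed codegen): movd writes the whole register, so there is no false dependency to break, and cvtdq2ps replaces the merging scalar convert.

```c
#include <immintrin.h>

float cvt_i32_f32_movd(int x) {
    __m128i t = _mm_cvtsi32_si128(x);  // movd: writes the full xmm reg
    __m128  f = _mm_cvtepi32_ps(t);    // cvtdq2ps: single p1 uop on SnB-family
    return _mm_cvtss_f32(f);
}
```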
See full analysis of this (uop counts and stuff for Intel SnB-family) at the
bottom of this comment: https://bugs.llvm.org/show_bug.cgi?id=22024#c11