[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Mon Jan 28 21:29:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to H.J. Lu from comment #1)
> But
>
> vxorps %xmm0, %xmm0, %xmm0
> vcvtsd2ss %xmm1, %xmm0, %xmm0
>
> are faster than both.
On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written
asm loop. (I was using NASM to make a static executable that runs a 100M
iteration loop so I could measure with perf). Can you show some asm where this
performs better?
vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination
being an xor-zeroed register. (Either zeroed outside the loop, or inside, or
once per 4 converts with an unrolled loop.)
I can't construct a case where vcvtsd2ss %xmm1, %xmm1, %xmm0 is worse in any
way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst
= middle source. I wasn't mixing it with other instructions other than VXORPS,
but I don't think anything is going to get rid of its 2nd uop, and choosing
both inputs = the same source removes any benefit from dep-breaking the output.
If adding a VXORPS helped, it's probably due to some other side-effect.
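For reference, a test loop of the kind described above could look roughly like the sketch below. It is a reconstruction under assumptions, not the actual file used for the measurements: the file name cvt.asm, the macro names SAME_SRC and XOR_IN_LOOP, and the 0.0 input value are all made up for illustration. NASM uses Intel operand order (destination first), so each convert is annotated with the AT&T form used elsewhere in this comment.

```nasm
; cvt.asm: hypothetical sketch of a 100M-iteration vcvtsd2ss test loop.
; Build (static, no libc):
;   nasm -felf64 [-DSAME_SRC | -DXOR_IN_LOOP] cvt.asm && ld -o cvt cvt.o
; Measure:
;   perf stat ./cvt
global _start
section .text
_start:
    mov     ecx, 100000000          ; loop counter: 100M iterations
    vxorps  xmm0, xmm0, xmm0        ; merge destination, zeroed once outside the loop
    vxorps  xmm1, xmm1, xmm1        ; conversion source (0.0 as a double)
align 32
.loop:
%ifdef SAME_SRC
    ; both inputs are the source register: no dependency on the old xmm0
    vcvtsd2ss xmm0, xmm1, xmm1      ; AT&T: vcvtsd2ss %xmm1, %xmm1, %xmm0
%else
  %ifdef XOR_IN_LOOP
    vxorps  xmm0, xmm0, xmm0        ; dep-breaking zeroing before every convert
  %endif
    ; merges into xmm0, so the result depends on the previous xmm0
    vcvtsd2ss xmm0, xmm0, xmm1      ; AT&T: vcvtsd2ss %xmm1, %xmm0, %xmm0
%endif
    dec     ecx
    jnz     .loop

    xor     edi, edi                ; exit status 0
    mov     eax, 231                ; exit_group on x86-64 Linux
    syscall
```

Building it three ways (no define, -DXOR_IN_LOOP, -DSAME_SRC) and comparing cycle and uop counts from perf is one way to compare the variants discussed here.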
Could the effect you saw have been due to code-gen changes for memory sources,
maybe vxorps + vcvtsd2ss (mem), %xmm0, %xmm0 vs. vmovsd + vcvtsd2ss %xmm1,
%xmm1, %xmm0? (Those should be about equal, but memory-source SS2SD is
cheaper, no port5 uop.)
----
BTW, the false-dependency effect is much more obvious with SS2SD, where the
latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS.
Even without dependency-breaking, repeated
vcvtsd2ss %xmm1, %xmm0, %xmm0
can run at 1 per clock (same as with dep breaking), because the port-5 uop that
merges into the low 32 bits of xmm0, with 1 cycle latency, is the 2nd uop. So the
latency from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle.
With dep-breaking, they both still bottleneck on the port5 uop if you're doing
nothing else.