[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

Mon Jan 28 21:29:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to H.J. Lu from comment #1)
> But
> 
> 	vxorps	%xmm0, %xmm0, %xmm0
> 	vcvtsd2ss	%xmm1, %xmm0, %xmm0
> 
> are faster than both.

On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written
asm loop.  (I was using NASM to make a static executable that runs a 100M
iteration loop so I could measure with perf).  Can you show some asm where this
performs better?

vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination
being an xor-zeroed register.  (Either zeroed outside the loop, or inside, or
once per 4 converts with an unrolled loop.)

I can't construct a case where  vcvtsd2ss %xmm1, %xmm1, %xmm0  is worse in any
way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst
= middle source.  I wasn't mixing it with any instructions other than VXORPS,
but I don't think anything is going to get rid of its 2nd uop, and choosing
both inputs = the same source removes any benefit from dep-breaking the output.
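
A minimal sketch of the two loop bodies being compared, for anyone who wants
to try this.  It is not the original test file: it is written in GAS/AT&T
syntax to match the instructions quoted in this thread (the measurements above
used NASM), and the file name, the VARIANT symbol, and the build/run commands
are assumptions made for illustration.

```asm
# Hypothetical reconstruction, not the original NASM test file.
# Assumed build:  as --defsym VARIANT=1 -o cvt.o cvt.s && ld -o cvt cvt.o
# Assumed run:    perf stat ./cvt
# VARIANT=0 (default): vxorps + merge into the zeroed xmm0, every iteration.
# VARIANT=1:           both sources = xmm1, no vxorps.
        .ifndef VARIANT
        .set    VARIANT, 0
        .endif

        .globl  _start
        .text
_start:
        movl    $100000000, %ecx        # 100M iterations
        .p2align 6
.Lloop:
        .if VARIANT == 0
        vxorps    %xmm0, %xmm0, %xmm0   # dep-breaking zero of the merge target
        vcvtsd2ss %xmm1, %xmm0, %xmm0
        .else
        vcvtsd2ss %xmm1, %xmm1, %xmm0   # merge source = convert source
        .endif
        decl    %ecx
        jnz     .Lloop

        xorl    %edi, %edi              # exit status 0
        movl    $60, %eax               # __NR_exit on x86-64 Linux
        syscall
```

Comparing the cycles and uop counts that perf reports for the two builds
should show whether the VXORPS version ever comes out ahead on a given CPU.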

If adding a VXORPS helped, it's probably due to some other side-effect.

Could the effect you saw have been due to code-gen changes for memory sources,
maybe  vxorps + vcvtsd2ss (mem), %xmm0, %xmm0   vs.  vmovsd + vcvtsd2ss %xmm1,
%xmm1, %xmm0?  (Those should be about equal, but memory-source SS2SD is
cheaper, no port5 uop.)

----

BTW, the false-dependency effect is much more obvious with SS2SD, where the
latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS.

Even without dependency-breaking, repeated

 vcvtsd2ss      %xmm1, %xmm0, %xmm0

can run at 1 per clock (same as with dep breaking), because the port-5 uop that
merges into the low 32 bits of xmm0, with 1 cycle latency, is the 2nd of the
two uops.  So latency from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1
cycle.
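
The loop-carried merge chain described here could be measured with something
like the sketch below.  It is an illustration only, in GAS/AT&T syntax; the
SS2SD symbol, the file name, and the build commands are hypothetical and not
taken from the original test.

```asm
# Hypothetical sketch.  No zeroing anywhere: every convert merges into the
# xmm0 produced by the previous iteration, so xmm0 -> xmm0 is loop-carried.
# Assumed build:  as -o chain.o chain.s && ld -o chain chain.o
# Add --defsym SS2SD=1 to the as command to test vcvtss2sd instead.
        .globl  _start
        .text
_start:
        movl    $100000000, %ecx        # 100M iterations
        .p2align 6
.Lloop:
        .ifdef SS2SD
        vcvtss2sd %xmm1, %xmm0, %xmm0   # case described above as 4c latency
        .else
        vcvtsd2ss %xmm1, %xmm0, %xmm0   # case described above as 1c latency
        .endif
        decl    %ecx
        jnz     .Lloop

        xorl    %edi, %edi              # exit status 0
        movl    $60, %eax               # __NR_exit on x86-64 Linux
        syscall
```

Cycles divided by the iteration count from perf stat gives the per-convert
cost for each build; the loop overhead is a single dec/jnz pair.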

With dep-breaking, they both still bottleneck on the port5 uop if you're doing
nothing else.