[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

peter at cordes dot ca
Sat May 20 04:59:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
On most CPUs, psrldq / movd is optimal for extracting xmm[1] -> int without
SSE4.  On SnB-family, movd runs on port 0 and psrldq runs on port 5, so the
two can execute in parallel.  (And the second movd can run the next cycle.)
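
Spelled out for the full 64-bit extract, that's something like this (a
sketch with SnB-family port notes, not actual compiler output):

    movd    %xmm0, %eax        # xmm0[0] -> eax  (port 0 on SnB)
    psrldq  $4, %xmm0          # byte-shift: xmm0[1] -> xmm0[0]  (port 5)
    movd    %xmm0, %edx        # xmm0[1] -> edx  (port 0, next cycle)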

I'd suggest using movd/psrldq/movd for -mtune=generic.  (Or pshuflw to
copy+shuffle if it's useful not to destroy the value in the xmm reg while
extracting to integer; pshuflw is faster than pshufd on old CPUs and the
same speed on current ones.  See the sketch below.)
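
The copy+shuffle variant would look something like this (untested sketch;
imm8 0xEE replicates words 2,3 of the low qword into words 0,1, i.e. it
puts dword 1 at the bottom):

    movd     %xmm0, %eax           # xmm0[0] -> eax
    pshuflw  $0xEE, %xmm0, %xmm1   # xmm1[0] = xmm0[1]; xmm0 left intact
    movd     %xmm1, %edx           # xmm0[1] -> edx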

But for some CPUs, this is better:

    movd    %xmm0, %eax
    psrlq   $32, %xmm0
    movd    %xmm0, %edx

A 64-bit shift by 32 is much better than PSRLDQ on some CPUs, especially on
SlowShuffle CPUs (where xmm pshufd is slower than 64-bit-granularity
shuffles):

* P4: 2c latency instead of 4, and twice the throughput.
* Pentium M: 2 uops instead of 4.
* Core 2 (Merom/Conroe): 1 uop instead of 2.
* K8/K10: same as PSRLDQ.

