This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
I don't think it's worth anyone's time to implement this in 2017, but using MMX
regs for 64-bit stores/loads would be faster on really old CPUs that split
128-bit vector instructions into two halves, like K8 and Pentium M.  Especially
with -mno-sse2 (e.g. Pentium III compatibility), where movlps has a false
dependency on the old value of the xmm reg, but movq to an MMX reg doesn't.
(Without SSE2 we can't MOVQ or MOVSD into an XMM reg.)

MMX also saves code size: one less prefix byte than the SSE2 integer
instructions.  And it's another set of 8 registers in 32-bit mode.

But Skylake has lower throughput for the MMX versions of some instructions than
for the XMM versions.  And SSE4 instructions like PEXTRD have no MMX versions,
unlike SSSE3 and earlier (e.g. pshufb mm0, mm1 is available, and on Conroe it's
faster than the xmm version).
