[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

Sat Mar 26 23:03:58 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754

--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
Looks good to me, thanks for taking care of this quickly.  Hopefully we can get
this backported to the GCC11 series to limit the damage for people using these
newish intrinsics.  I'd love to recommend them for general use, except for this
GCC problem where some distros have already shipped GCC versions that compile
them without error but in a 100% broken way.

Portable ways to do narrow alignment/aliasing-safe SIMD loads were sorely
lacking; there aren't good workarounds for this, especially for 16-bit loads.
(I still don't know how to portably / safely write code that will compile to a
memory-source PMOVZXBQ across all compilers; Intel's intrinsics API is rather
lacking in some areas and relies on compilers folding loads into memory source
operands.)
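
For reference, a minimal sketch of the usual memcpy-based workaround for
compilers where the new intrinsics are missing or broken.  The helper names
load_u16_zext and load_u32_zext are hypothetical, not part of any header, and
whether this compiles to a single load instruction is up to the compiler:

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_cvtsi32_si128 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 16 bits from any address, zero-extended
   into the low element of a vector.  memcpy makes the access safe for
   any alignment and any effective type of the pointed-to memory. */
static inline __m128i load_u16_zext(const void *p)
{
    uint16_t tmp;
    memcpy(&tmp, p, sizeof tmp);
    return _mm_cvtsi32_si128(tmp);  /* upper 112 bits end up zero */
}

/* Hypothetical helper: same idea for a 32-bit load. */
static inline __m128i load_u32_zext(const void *p)
{
    int32_t tmp;
    memcpy(&tmp, p, sizeof tmp);
    return _mm_cvtsi32_si128(tmp);  /* upper 96 bits end up zero */
}
```

This is correct everywhere, but for the 16-bit case it typically goes through
an integer register (movzx then MOVD) rather than a memory-source PINSRW.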

> So, isn't that a bug in the intrinsic guide instead?

Yes, __m128i _mm_loadu_si16 only really makes sense with SSE2 for PINSRW.  Even
movzx into an integer reg and then MOVD xmm, eax requires SSE2.  With only SSE1
you'd have to movzx / dword store to stack / MOVSS reload.

SSE1 makes *some* sense for _mm_loadu_si32 since it can be implemented with a
single MOVSS if MOVD isn't available.

But we already have SSE1 __m128 _mm_load_ss(const float *) for that.

Except GCC's implementation of _mm_load_ss isn't alignment and strict-aliasing
safe; it derefs the actual float *__P as _mm_set_ss (*__P).  Which I think is a
bug, although I'm not clear what semantics Intel intended for that intrinsic. 
Clang implements it as alignment/aliasing safe with a packed may_alias struct
containing a float.  MSVC always behaves like -fno-strict-aliasing, and I
*think* ICC does, too.
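
To illustrate the packed may_alias struct idea, here is a rough sketch using
GNU C attributes (so GCC and Clang only).  The struct and function names are
hypothetical and this is not copied from either compiler's headers:

```c
#include <xmmintrin.h>  /* SSE1: __m128, _mm_set_ss */

/* GNU C extension: "packed" drops the alignment requirement to 1, and
   "may_alias" lets accesses through this type alias any object. */
struct unaligned_float {
    float v;
} __attribute__((__packed__, __may_alias__));

/* Hypothetical helper: load one float from any address into the low
   element of a vector, zeroing the upper three elements. */
static inline __m128 load_ss_unaligned(const void *p)
{
    float f = ((const struct unaligned_float *)p)->v;
    return _mm_set_ss(f);
}
```

The plain-deref version, _mm_set_ss (*__P), instead promises the compiler that
__P points at a properly aligned float object.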

Perhaps best to follow the crowd and make all narrow load/store intrinsics
alignment and aliasing safe, unless that causes code-gen regressions; users can
write _mm_set_ss( *ptr ) themselves if they want to tell the compiler that it's
a normal C float object.

Was going to report this, but PR84508 is still open and already covers the
relevant ss and sd intrinsics.  That PR points out that Intel specifically
documents them as not requiring alignment, without mentioning aliasing.

----

Speaking of bouncing through a GP-integer reg, GCC unfortunately does that; it
seems to incorrectly think PINSRW xmm, mem, 0 requires -msse4.1, unlike with a
GP register source.  Reported as PR105066 along with related missed
optimizations about folding into a memory source operand for pmovzx/sx.

But that's unrelated to correctness; this bug can be closed unless we're
keeping it open until it's fixed in the GCC11 current stable series.