Bug 40537 - wrong instr. dependency with some SSE intrinsics
Summary: wrong instr. dependency with some SSE intrinsics
Status: RESOLVED DUPLICATE of bug 21920
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.3.2
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Depends on:
Reported: 2009-06-24 08:05 UTC by Gael Guennebaud
Modified: 2009-06-25 09:03 UTC (History)
33 users (show)

See Also:
Host: x86_64-pc-linux
Target: x86_64-pc-linux
Build: x86_64-pc-linux
Known to work:
Known to fail:
Last reconfirmed: 2009-06-24 11:57:04

a complete example showing the problem (444 bytes, text/plain)
2009-06-24 08:07 UTC, Gael Guennebaud

Note You need to log in before you can comment on or make changes to this bug.
Description Gael Guennebaud 2009-06-24 08:05:23 UTC
In order to perform faster loads from an unaligned memory location to a SSE register, a common trick is to replace the default unaligned load instructions (e.g., MOVUPS for floats) by one MOVSD followed by one MOVHPS. Using intrinsics, this can be implemented like this:

inline __m128 ploadu(const float* from) {
  __m128 r;
  r = _mm_castpd_ps(_mm_load_sd((double*)(from)));
  r = _mm_loadh_pi(r, (const __m64*)(from+2));
  return r;

Unfortunately, when optimizations are enabled (-O2), I found that GCC can incorrectly reorder the instructions leading to invalid code. For instance, in that example:

float data[4] = {1, 2, 3, 4};
__attribute__ ((aligned(16))) float aligned_data[4];
_mm_store_ps(aligned_data, ploadu(data));
std::cout << aligned_data[0] << " " << aligned_data[1] << " "
          << aligned_data[2] << " " << aligned_data[3] << "\n";

GCC generates the following ASM:

 movsd 32(%rsp), %xmm0
 movl  $0x40400000, 40(%rsp)
 movl  $0x40800000, 44(%rsp)
 movl  $0x3f800000, 32(%rsp)
 movhps  40(%rsp), %xmm0
 movl  $0x40000000, 36(%rsp)
 movaps  %xmm0, 16(%rsp)

where the MOVSD instruction is executed before the values of the array "data" have been set.

If we use the standard _mm_loadu_ps intrinsics, then the generated ASM is obviously correct:

 movl  $0x3f800000, 32(%rsp)
 movl  $0x40000000, 36(%rsp)
 movl  $0x40400000, 40(%rsp)
 movl  $0x40800000, 44(%rsp)
 movups  32(%rsp), %xmm0
 movaps  %xmm0, 16(%rsp)

Please, see the attachment for a complete example.
Comment 1 Gael Guennebaud 2009-06-24 08:07:39 UTC
Created attachment 18055 [details]
a complete example showing the problem


  g++ -O3 instr_dependency.cpp -o instr_dependency && ./instr_dependency

  g++ -O3 instr_dependency.cpp -o instr_dependency -DFAST && ./instr_dependency
Comment 2 Richard Biener 2009-06-24 09:48:18 UTC
I can't verify it (you do not provide complete source that can be compiled),
but I think this is a duplicate of PR40141 which is fixed in GCC 4.4.
Comment 3 Gael Guennebaud 2009-06-24 10:53:43 UTC
There is a compilable example attached to comment #1.

Furthermore, I can reproduce the problem with gcc 4.1.3, 4.2.4, 4.3.2, and 4.4.0, so I don't think it is a duplicate of PR40141.

FYI, in the meantime I workaround the issue using inline assembly:

inline __m128 ploadu(const float* from)
  __m128 res; 
  asm("movsd  %[from0], %[r]"
     : [r] "=x" (res) : [from0] "m" (*from), [dummy] "m" (*(from+1)) );
  asm("movhps %[from2], %[r]"
     : [r] "+x" (res) : [from2] "m" (*(from+2)), [dummy] "m" (*(from+3)) );
  return res;

but that's not as portable as intrinsics.
Comment 4 Gael Guennebaud 2009-06-24 11:12:44 UTC
some additional info:

- compiling with -fno-strict-aliasing fix the issue, so perhaps this is not a real bug but a feature ?

- on the other hand using the may_alias type attribute for casting does not help:

inline __m128 ploadu(const float*   from) { 
  typedef double __attribute__((may_alias)) doubleA;
  typedef __m128 __attribute__((may_alias)) __m128A;

  __m128 r;
  r = (__m128A)(_mm_load_sd( (const doubleA*)(from) ));
  r = _mm_loadh_pi(r, (const __m64*)(from+2));
  return r;

I guess this is because the "const doubleA*" is then casted to a "const double*" when calling the _mm_load_sd() function.
Comment 5 Richard Biener 2009-06-24 11:23:03 UTC
There are aliasing issues with your code / the intrinsics implementation:

  float data[4] = {1, 2, 3, 4};
  r = _mm_castpd_ps(_mm_load_sd((double*)(from)));

ends up loading from float data via a pointer to double.  That is invalid
unless the intrinsic specifies it should work ok in which case it better
had implemented counter-measures here:

  data[0] ={v} 1.0e+0;
  data[1] ={v} 2.0e+0;
  data[2] ={v} 3.0e+0;
  data[3] ={v} 4.0e+0;
  from.92_14 = (const double *) &data[0];
  D.25303_15 = *from.92_14;
  D.25304_16 = {D.25303_15, 0.0};
  r_17 = VIEW_CONVERT_EXPR<__m128>(D.25304_16);

I would suggest using

  typedef double __attribute__((may_alias)) aliased_double;
  r = _mm_castpd_ps(_mm_set_sd (*(aliased_double *)from));

Comment 6 Richard Biener 2009-06-24 11:25:27 UTC
_mm_load_sd( (const doubleA*)(from) )

does not work because the prototype of _mm_load_sd does not have a type
with the may-alias attribute, so it gets stripped again.
Comment 7 Uroš Bizjak 2009-06-24 11:57:04 UTC
Adding x86 intrinsic expert...
Comment 8 H.J. Lu 2009-06-24 13:09:11 UTC
(In reply to comment #6)
> _mm_load_sd( (const doubleA*)(from) )
> does not work because the prototype of _mm_load_sd does not have a type
> with the may-alias attribute, so it gets stripped again.

I think movsd/movhpd intrinsics with union of double/float or
movlps/movhps intrinsics should be used.
Comment 9 Uroš Bizjak 2009-06-25 09:03:30 UTC

*** This bug has been marked as a duplicate of 21920 ***