Compile the following testcase with "g++ -msse2 -O2" #include <xmmintrin.h> typedef double double_a __attribute__((__may_alias__)); struct V { __m128d data; }; int main() { V a; __m128d b; b = _mm_set_pd(1., 0.); a.data = _mm_set_pd(1., 0.); a.data = _mm_add_pd(a.data, _mm_and_pd(_mm_cmpeq_pd(a.data, _mm_set1_pd(0.)), _mm_set1_pd(2.))); reinterpret_cast<double_a *>(&a.data)[1] += 1.; b = _mm_add_pd(b, _mm_and_pd(_mm_cmpeq_pd(b, _mm_set1_pd(0.)), _mm_set1_pd(1.))); b = _mm_add_pd(b, _mm_and_pd(_mm_cmpeq_pd(b, _mm_set1_pd(1.)), _mm_set1_pd(1.))); if (_mm_movemask_pd(_mm_cmpeq_pd(a.data, b)) != 0x3) { abort(); } return 0; } GCC 4.6.[01] calculate the correct values for a.data[0] and a.data[1] but fail to combine the results correctly. I.e. the resulting shufpd $0x1 is wrong. GCC 4.5.x uses unpacklpd, which gives the correct result, but emits unnecessary stores to the stack.
Confirmed, proposed patch to sse2_loadhpd: Index: sse.md =================================================================== --- sse.md (revision 174102) +++ sse.md (working copy) @@ -4284,7 +4284,7 @@ vmovhpd\t{%2, %1, %0|%0, %1, %2} unpcklpd\t{%2, %0|%0, %2} vunpcklpd\t{%2, %1, %0|%0, %1, %2} - shufpd\t{$1, %1, %0|%0, %1, 1} + shufpd\t{$0, %1, %0|%0, %1, 0} # # #"
I applied the patch to the latest 4.6 snapshot. I confirm that it fixes the bug. Also, there are no regressions in my testsuite. Just for confirmation, the patched sse.md looks like this for me now (starting from line 4952): (define_insn "sse2_loadhpd" [(set (match_operand:V2DF 0 "nonimmediate_operand" "=x,x,x,o,o,o") (vec_concat:V2DF (vec_select:DF (match_operand:V2DF 1 "nonimmediate_operand" " 0,0,x,0,0,0") (parallel [(const_int 0)])) (match_operand:DF 2 "nonimmediate_operand" " m,x,0,x,*f,r")))] "TARGET_SSE2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))" "@ movhpd\t{%2, %0|%0, %2} unpcklpd\t{%2, %0|%0, %2} shufpd\t{$0, %1, %0|%0, %1, 0} # Question, why not use unpcklpd instead of shufpd $0? On older CPUs unpcklpd should be slightly faster than shufpd.
(In reply to comment #2) > I applied the patch to the latest 4.6 snapshot. I confirm that it fixes the > bug. Also, there are no regressions in my testsuite. > > Just for confirmation, the patched sse.md looks like this for me now (starting > from line 4952): > (define_insn "sse2_loadhpd" > [(set (match_operand:V2DF 0 "nonimmediate_operand" "=x,x,x,o,o,o") > (vec_concat:V2DF > (vec_select:DF > (match_operand:V2DF 1 "nonimmediate_operand" " 0,0,x,0,0,0") > (parallel [(const_int 0)])) > (match_operand:DF 2 "nonimmediate_operand" " m,x,0,x,*f,r")))] > "TARGET_SSE2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))" > "@ > movhpd\t{%2, %0|%0, %2} > unpcklpd\t{%2, %0|%0, %2} > shufpd\t{$0, %1, %0|%0, %1, 0} > # > > Question, why not use unpcklpd instead of shufpd $0? On older CPUs unpcklpd > should be slightly faster than shufpd. OTOH, it looks that this alternative is wrong entirely. Unmodified operand can only be passed in lower half (operand 1 in the pattern above). GCC will then generate unpcklpd, as suggested.
(In reply to comment #3) > > I applied the patch to the latest 4.6 snapshot. I confirm that it fixes the > > bug. Also, there are no regressions in my testsuite. > OTOH, it looks that this alternative is wrong entirely. Unmodified operand can > only be passed in lower half (operand 1 in the pattern above). GCC will then > generate unpcklpd, as suggested. Forgot to say, that unpcklpd will be generated with removed referred alternative.
Author: uros Date: Tue May 24 15:31:12 2011 New Revision: 174122 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=174122 Log: PR target/49133 * config/i386/sse.md (sse2_loadhpd): Remove shufpd alternative. testsuite/ChangeLog: PR target/49133 * g++.dg/other/pr49133.C: New test. Added: trunk/gcc/testsuite/g++.dg/other/pr49133.C Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/sse.md trunk/gcc/testsuite/ChangeLog
Author: uros Date: Tue May 24 18:41:31 2011 New Revision: 174131 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=174131 Log: PR target/49133 * config/i386/sse.md (sse2_loadhpd): Remove shufpd alternative. testsuite/ChangeLog: PR target/49133 * g++.dg/other/pr49133.C: New test. Added: branches/gcc-4_6-branch/gcc/testsuite/g++.dg/other/pr49133.C Modified: branches/gcc-4_6-branch/gcc/ChangeLog branches/gcc-4_6-branch/gcc/config/i386/sse.md branches/gcc-4_6-branch/gcc/testsuite/ChangeLog
Fixed.
Author: uros Date: Wed May 25 13:26:42 2011 New Revision: 174195 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=174195 Log: PR target/49133 * config/i386/sse.md (sse2_loadhpd): Remove shufpd alternative. testsuite/ChangeLog: PR target/49133 * g++.dg/other/pr49133.C: New test. Added: branches/gcc-4_5-branch/gcc/testsuite/g++.dg/other/pr49133.C Modified: branches/gcc-4_5-branch/gcc/ChangeLog branches/gcc-4_5-branch/gcc/config/i386/sse.md branches/gcc-4_5-branch/gcc/testsuite/ChangeLog
Author: uros Date: Wed May 25 16:39:22 2011 New Revision: 174215 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=174215 Log: PR target/49133 * config/i386/sse.md (sse2_loadhpd): Remove shufpd alternative. testsuite/ChangeLog: PR target/49133 * g++.dg/other/pr49133.C: New test. Added: branches/gcc-4_4-branch/gcc/testsuite/g++.dg/other/pr49133.C Modified: branches/gcc-4_4-branch/gcc/ChangeLog branches/gcc-4_4-branch/gcc/config/i386/sse.md branches/gcc-4_4-branch/gcc/testsuite/ChangeLog