Hello!

The gcc.target/i386/pr36222-1.c testcase compiles for x86_64-linux-gnu with
"-O2 -mno-sse3 -mtune=corei7" to:

_mm_set_epi32:
        movd    %ecx, %xmm1
        movd    %edx, %xmm4
        movd    %esi, %xmm0
        movd    %edi, %xmm3
        punpckldq       %xmm4, %xmm1
        movdqa  %xmm1, %xmm2
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm0, %xmm2
        movdqa  %xmm2, %xmm0
        ret

However, the 4.8 branch compiles it to:

_mm_set_epi32:
        movd    %esi, %xmm1
        movd    %edi, %xmm2
        movd    %ecx, %xmm0
        movd    %edx, %xmm3
        punpckldq       %xmm2, %xmm1
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm1, %xmm0
        ret
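For reference, a minimal reproducer in the spirit of that testcase (I haven't copied pr36222-1.c verbatim here, so the exact function name and dg directives may differ; this is just a sketch):

        #include <emmintrin.h>

        __m128i
        test (int b, int c, int d, int e)
        {
          /* All four scalars get moved into vector regs and merged;
             good code needs no movdqa register copies.  */
          return _mm_set_epi32 (b, c, d, e);
        }

Compiling with "gcc -O2 -mno-sse3 -mtune=corei7 -S" and grepping the output for movdqa shows the regression.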
Confirmed as RA regression.
One movdqa started appearing with r204212, the second movdqa started appearing with r204752. Vlad, can you please have a look?
(In reply to Jakub Jelinek from comment #2)
> One movdqa started appearing with r204212, the second movdqa started
> appearing with r204752. Vlad, can you please have a look?

It seems the changes triggered a bug in the register move cost calculations. I have a patch to fix it, but I need more time to check its effect on performance. The fix should be ready by the end of the week if everything goes well.
Author: vmakarov
Date: Wed Jan 15 17:32:47 2014
New Revision: 206636

URL: http://gcc.gnu.org/viewcvs?rev=206636&root=gcc&view=rev
Log:
2014-01-15  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/59511
	* ira.c (ira_init_register_move_cost): Use memory costs for
	some cases of register move cost calculations.
	* lra-constraints.c (lra_constraints): Use REG_FREQ_FROM_BB
	instead of BB frequency.
	* lra-coalesce.c (move_freq_compare_func, lra_coalesce): Ditto.
	* lra-assigns.c (find_hard_regno_for): Ditto.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/ira.c
	trunk/gcc/lra-assigns.c
	trunk/gcc/lra-coalesce.c
	trunk/gcc/lra-constraints.c
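To illustrate the ira.c part of the change ("use memory costs for some cases of register move cost calculations"): the idea is that when a plain register-to-register move cost is not meaningful, the cost of going through memory (a store plus a load) is a better estimate. The snippet below is only an illustration of that idea, not the actual ira_init_register_move_cost code:

        /* Illustration only -- not GCC's actual code.  When a direct
           register-register move between the two classes is not usable
           for this mode, approximate the move cost by a spill:
           store to memory plus load back.  */
        static int
        move_cost_sketch (int reg_reg_cost, int store_cost, int load_cost,
                          int direct_move_ok)
        {
          if (direct_move_ok)
            return reg_reg_cost;
          return store_cost + load_cost;  /* go through memory */
        }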
Fixed, thanks.
Created attachment 38629 [details] extra-movdqa-with-gcc5-not-4.9.cpp
I'm seeing the same symptom, affecting gcc4.9 through 5.3. Not present in 6.1. IDK if the cause is the same. (Code from an improvement to the horizontal_add functions in Agner Fog's vector class library.)

#include <immintrin.h>

int hsum16_gccmovdqa (__m128i const a)
{
    __m128i lo = _mm_cvtepi16_epi32(a);      // sign-extended a0, a1, a2, a3
    __m128i hi = _mm_unpackhi_epi64(a,a);    // gcc4.9 through 5.3 wastes a movdqa on this
    hi = _mm_cvtepi16_epi32(hi);
    __m128i sum1 = _mm_add_epi32(lo,hi);     // add sign-extended upper / lower halves

    //return horizontal_add(sum1);           // manually inlined.
    // Shortening the code below can avoid the movdqa
    __m128i shuf = _mm_shuffle_epi32(sum1, 0xEE);
    __m128i sum2 = _mm_add_epi32(shuf,sum1); // 2 sums
    shuf = _mm_shufflelo_epi16(sum2, 0xEE);
    __m128i sum4 = _mm_add_epi32(shuf,sum2);
    return _mm_cvtsi128_si32(sum4);          // 32 bit sum
}

gcc4.9 through gcc5.3 output (-O3 -mtune=generic -msse4.1):

        movdqa  %xmm0, %xmm1
        pmovsxwd        %xmm0, %xmm2
        punpckhqdq      %xmm0, %xmm1
        pmovsxwd        %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        ...

gcc6.1 output:

        pmovsxwd        %xmm0, %xmm1
        punpckhqdq      %xmm0, %xmm0
        pmovsxwd        %xmm0, %xmm0
        paddd   %xmm0, %xmm1
        ...

In a more complicated case, whether this code is inlined or not, there's actually a difference between gcc 4.9 and 5.x: gcc5 has the extra movdqa in more cases. See my attachment, copied from https://godbolt.org/g/e8iQsj
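If anyone wants to reproduce this locally, a small driver like the following (my addition, not part of the attachment) sanity-checks the function while the generated asm can be inspected with -S; build with something like "gcc -O3 -msse4.1":

        /* Test driver (not from the attachment): sum of 1..8 should be 36.  */
        #include <stdio.h>
        #include <immintrin.h>

        int hsum16_gccmovdqa (__m128i const a);   /* the function above */

        int
        main (void)
        {
          __m128i v = _mm_setr_epi16 (1, 2, 3, 4, 5, 6, 7, 8);
          printf ("%d\n", hsum16_gccmovdqa (v));  /* expect 36 */
          return 0;
        }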