#include <stdint.h>

void test(int32_t* input, int32_t* out, unsigned x1, unsigned x2)
{
    unsigned i, j;
    unsigned end = x1;

    for(i = j = 0; i < 1000; i++) {
        int32_t sum = 0;
        end += x2;
        for( ; j < end; j++)
            sum += input[j];
        out[i] = sum;
    }
}

options used: -S -O2 -ftree-vectorize -msse2

GCC 5.2 generates the following code:

...
	movdqa	%xmm0, %xmm1
	movl	8(%esp), %ebx
	psrldq	$8, %xmm1
	paddd	%xmm1, %xmm0
	movdqa	%xmm0, %xmm3
	pshufd	$255, %xmm0, %xmm2
	addl	%ebx, %eax
	cmpl	%ebx, %esi
	pshufd	$85, %xmm0, %xmm1
	punpckhdq	%xmm0, %xmm3
	movd	%xmm2, %ecx
	punpckldq	%xmm3, %xmm1
	movd	%ecx, %xmm2
	punpcklqdq	%xmm2, %xmm1
	paddd	%xmm1, %xmm0
	movd	%xmm0, %ecx
...

while GCC 4.9.2 generates this:

...
	movdqa	%xmm0, %xmm1
	movl	8(%esp), %ebx
	psrldq	$8, %xmm1
	paddd	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
	addl	%ebx, %eax
	cmpl	%ebx, %esi
	psrldq	$4, %xmm1
	paddd	%xmm1, %xmm0
	movd	%xmm0, %ecx
...

For the second step of the reduction, GCC 4.9.2 needs just 1 extra psrldq instruction, while GCC 5.2.0 emits 2 pshufd, 2 movd, 1 punpckhdq, 1 punpckldq and 1 punpcklqdq instructions.

Also, GCC 5.2.0 can generate the same code as GCC 4.9.2, but it requires the -mssse3 option for this. It's strange that -mssse3 is necessary to generate the more efficient SSE2 code.
(In reply to lvqcl.mail from comment #0)

"gcc version 6.0.0 20151121 (experimental)" from dongsheng-daily (mingw-w64) generates the same code as 4.9.2, so this regression appears to be fixed on the 6.x branch.
Hum, on x86_64 I don't see either GCC 4.9 or GCC 5.2 vectorize the function at all, because they fail to analyze the evolution of the data reference for input[j]: the initial value of j in the inner loop is not propagated as zero.

With i?86 I can confirm your observation, but I don't see it fixed on trunk. Note that this boils down to detecting vector shifts among permutes, where (IIRC) some patterns were previously not properly guarded on SSE3 support; a wrong-code bug was fixed conservatively on the GCC 5 branch, while the missing support was only implemented on trunk.

The failure to vectorize on x86_64 isn't a regression.
On i?86 this regressed with r217509, aka part of VEC_RSHIFT_EXPR removal. Guess we'll need to have a look at the i?86 vec perm handling.
Ah, no, the problem is not on the backend side, but in the veclower2 pass. Before that pass, after the replacement of the v >> 64 and v >> 32 shifts, we have:

  vect_sum_15.12_58 = VEC_PERM_EXPR <vect_sum_15.10_57, { 0, 0, 0, 0 }, { 2, 3, 4, 5 }>;
  vect_sum_15.12_59 = vect_sum_15.12_58 + vect_sum_15.10_57;
  vect_sum_15.12_60 = VEC_PERM_EXPR <vect_sum_15.12_59, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }>;
  vect_sum_15.12_61 = vect_sum_15.12_60 + vect_sum_15.12_59;

but veclower2 for some reason decides to lower the latter VEC_PERM_EXPR into:

  _32 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 32>;
  _17 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 64>;
  _23 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 96>;
  vect_sum_15.12_60 = {_32, _17, _23, 0};

The first VEC_PERM_EXPR is kept and generates efficient code. If I manually disable the lowering in the debugger, the code regression is gone.
Created attachment 36811 [details] gcc6-pr68483.patch Untested fix.
(In reply to Richard Biener from comment #2)
> With i?86 I can confirm your observation but I don't see it fixed on trunk.

Sorry, the GCC 6.x compiler that I downloaded was built with the --with-arch=core2 option, so it implicitly enables ssse3. That's why I incorrectly thought that the regression was fixed.
Author: jakub
Date: Tue Nov 24 10:45:52 2015
New Revision: 230797

URL: https://gcc.gnu.org/viewcvs?rev=230797&root=gcc&view=rev
Log:
	PR target/68483
	* tree-vect-generic.c (lower_vec_perm): If VEC_PERM_EXPR is valid
	vec_shr pattern, don't lower it even if can_vec_perm_p returns false.
	* optabs.c (shift_amt_for_vec_perm_mask): Return NULL_RTX whenever
	first is nelt or above.  Don't mask expected with 2 * nelt - 1.

	* gcc.target/i386/pr68483-1.c: New test.
	* gcc.target/i386/pr68483-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr68483-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr68483-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/optabs.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-generic.c
Author: jakub
Date: Tue Nov 24 11:10:45 2015
New Revision: 230799

URL: https://gcc.gnu.org/viewcvs?rev=230799&root=gcc&view=rev
Log:
	PR target/68483
	* tree-vect-generic.c (lower_vec_perm): If VEC_PERM_EXPR is valid
	vec_shr pattern, don't lower it even if can_vec_perm_p returns false.
	* optabs.c (shift_amt_for_vec_perm_mask): Return NULL_RTX whenever
	first is nelt or above.  Don't mask expected with 2 * nelt - 1.

	* gcc.target/i386/pr68483-1.c: New test.
	* gcc.target/i386/pr68483-2.c: New test.

Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.target/i386/pr68483-1.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/i386/pr68483-2.c
Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/optabs.c
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
    branches/gcc-5-branch/gcc/tree-vect-generic.c
Fixed for 5.3+.