Testcase:

void test(int *__restrict__ a, int *__restrict__ b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
  a[8] = 0;
}

produces (at -O3) on AArch64:

test:
        ldp     w4, w3, [x1]
        ldp     w2, w1, [x1, 8]
        stp     w4, w3, [x0]
        stp     w2, w1, [x0, 8]
        stp     wzr, wzr, [x0, 20]
        stp     wzr, wzr, [x0, 28]
        ret

or on x86_64/-mavx:

test:
.LFB0:
        movl    (%rsi), %eax
        movl    $0, 20(%rdi)
        movl    $0, 24(%rdi)
        movl    $0, 28(%rdi)
        movl    $0, 32(%rdi)
        movl    %eax, (%rdi)
        movl    4(%rsi), %eax
        movl    %eax, 4(%rdi)
        movl    8(%rsi), %eax
        movl    %eax, 8(%rdi)
        movl    12(%rsi), %eax
        movl    %eax, 12(%rdi)
        ret

(there is no -fdump-tree-vect dump)

In contrast, the testcase

void test(int *__restrict__ a, int *__restrict__ b)
{
  for (int i = 0; i < 4; i++)
    a[i] = b[i];
  for (int i = 0; i < 4; i++)
    a[i+4] = 0;
}

has its memcpy recognized by ldist and its 'memset' by slp1 (neither of which triggers on the first case), producing (superior) on AArch64:

test:
        movi    v0.4s, 0
        ldp     x2, x3, [x1]
        stp     x2, x3, [x0]
        str     q0, [x0, 16]
        ret

or x86_64:

test:
.LFB0:
        movq    (%rsi), %rax
        movq    8(%rsi), %rdx
        vpxor   %xmm0, %xmm0, %xmm0
        movq    %rax, (%rdi)
        movq    %rdx, 8(%rdi)
        vmovups %xmm0, 16(%rdi)
        ret
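For reference, a scalar sketch of what the loop version amounts to: ldist recognizes the first loop as a memcpy, and slp1 vectorizes the zeroing stores of the second loop, which are semantically a memset. This is an illustration of the equivalent semantics, not taken from any dump, and the function name is made up:

#include <string.h>

void
test_equiv (int *__restrict a, int *__restrict b)
{
  /* ldist turns the first loop into a memcpy of four ints ...  */
  memcpy (a, b, 4 * sizeof (int));
  /* ... and the second loop's stores amount to zeroing the next
     four ints, which slp1 then emits as a single vector store.  */
  memset (a + 4, 0, 4 * sizeof (int));
}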
It's because you don't init a[4]: the stores cover a[0..3] and a[5..8], leaving a gap in the store group, which the vectorizer currently gives up on. Let me fix that.
Author: rguenth
Date: Mon May  4 14:24:49 2015
New Revision: 222765

URL: https://gcc.gnu.org/viewcvs?rev=222765&root=gcc&view=rev
Log:
2015-05-04  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/65965
	* tree-vect-data-refs.c (vect_analyze_data_ref_accesses): Split
	store groups at gaps.

	* gcc.dg/vect/bb-slp-33.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/bb-slp-33.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
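For illustration, "splitting store groups at gaps" can be sketched as follows. This is hypothetical code, not GCC's actual implementation or data structures; the struct and function names are invented:

#include <stddef.h>

/* One store, described by byte offset and size; sorted by offset.  */
struct store { size_t offset, size; };

/* Return the index at which the group must be split: the first store
   that does not begin where the previous one ended.  */
static size_t
split_group_at_gap (const struct store *s, size_t n)
{
  for (size_t i = 1; i < n; i++)
    if (s[i].offset != s[i - 1].offset + s[i - 1].size)
      return i;
  return n;
}

For the testcase above the stores sit at byte offsets 0, 4, 8, 12 and 20, 24, 28, 32 (a[4], bytes 16-19, is never written), so the group splits after the fourth store, and each half can then be analyzed, and vectorized, on its own.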
Fixed for GCC 6.
(In reply to Richard Biener from comment #3)
> Fixed for GCC 6.

Indeed. I note that the same testcase does _not_ SLP/vectorize if I use consecutive indices:

void
test (int *__restrict a, int *__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[4] = 0;
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
}

loop26a.c:6:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
loop26a.c:6:13: note: original stmt *a_4(D) = _3;
loop26a.c:6:13: note: === vect_slp_analyze_data_ref_dependences ===
loop26a.c:6:13: note: === vect_slp_analyze_operations ===
loop26a.c:6:13: note: not vectorized: bad operation in basic block.

Worth another bug?
On Tue, 22 Sep 2015, alalaw01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65965
>
> --- Comment #4 from alalaw01 at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #3)
> > Fixed for GCC 6.
>
> Indeed. I note that the same testcase does _not_ SLP/vectorize if I use
> consecutive indices: [...]
>
> Worth another bug?

The above looks like SLP is trying a vector size of v8si. It _should_ work for v4si. For v8si we indeed can't vectorize this, as we don't support "partial" loads. We could vectorize with masked loads, and IIRC on x86_64 the masked elements can be initialized to 0 or -1, so we can OR in the constant pieces.

Not sure if that's worth another bug; please double-check your vector size first.
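As a hand-written illustration of that masked-load idea (assuming AVX2, compiled with -mavx2; this is not what GCC emits): _mm256_maskload_epi32 zero-fills the masked-off lanes, so the constant zeros of the testcase come for free, and a nonzero constant part could be ORed in afterwards.

#include <immintrin.h>

void
test_masked (int *__restrict a, const int *__restrict b)
{
  /* Load b[0..3] into the low four lanes; the high four lanes are
     masked off and therefore zero-filled.  */
  __m256i mask = _mm256_setr_epi32 (-1, -1, -1, -1, 0, 0, 0, 0);
  __m256i v = _mm256_maskload_epi32 (b, mask);
  /* The constant part here is all zeros, so there is nothing to OR
     in; nonzero constants could be combined with _mm256_or_si256.  */
  _mm256_storeu_si256 ((__m256i *) a, v);
}

This stores all eight lanes of a[0..7] with a single v8si store while reading only the four valid elements of b, which is the "partial load" the comment above says the vectorizer does not yet support.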