GCC -O3 can't vectorize the following simple case.

$ cat test_loop_2.c
int test_loop_2(char *p1, char *p2)
{
  int s = 0;
  for(int i=0; i<4; i++, p1+=4, p2+=4)
  {
    s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
  }
  return s;
}

The vector size is 4*1=4 bytes, which doesn't directly fit into an 8-byte or 16-byte vector, but we can still extend each element to 32 bits and use vector operations on a 4*4=16-byte vector.
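For illustration, the widening idea can be written by hand with the GCC vector extension. This is only a sketch of the intended shape, not compiler output; the function name test_loop_2_widened and the v4si typedef are made up here.

```c
/* Hypothetical hand-written sketch: widen each char to a 32-bit lane
   and accumulate in one 16-byte vector (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

int test_loop_2_widened(const char *p1, const char *p2)
{
  v4si acc = { 0, 0, 0, 0 };
  for (int i = 0; i < 4; i++, p1 += 4, p2 += 4)
    {
      /* Each char is converted to int when the lane is initialized. */
      v4si a = { p1[0], p1[1], p1[2], p1[3] };
      v4si b = { p2[0], p2[1], p2[2], p2[3] };
      acc += a - b;   /* four 32-bit subtractions and additions at once */
    }
  /* Final horizontal reduction of the four lanes. */
  return acc[0] + acc[1] + acc[2] + acc[3];
}
```

The per-lane partial sums are only combined once, after the loop, which is the usual shape of a vectorized reduction.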
Created attachment 44396
vectorization failure

Attached is the -O3 result for aarch64, in which no vectorized code is generated at all.
Confirmed
I'll take this one as part of GCC10.
(In reply to Tamar Christina from comment #3)
> I'll take this one as part of GCC10.

Reconfirmed at Cauldron, where it was also mentioned that this bug is related to bug 65930 and bug 88492.
Actually I have a patch for this (PR 113458 also) which I will be submitting for GCC 15.
With my patch for V4QI, we still don't get the best code:

  vect_perm_even_271 = VEC_PERM_EXPR <vect__1.7_264, vect__1.8_266, { 0, 2, 4, 6 }>;
  vect_perm_even_273 = VEC_PERM_EXPR <vect__1.9_268, vect__1.10_270, { 0, 2, 4, 6 }>;
  vect_perm_even_275 = VEC_PERM_EXPR <vect_perm_even_271, vect_perm_even_273, { 0, 2, 4, 6 }>;

_275={_264[0], _264[2], _268[0], _268[2]} or
VEC_PERM<_264, _268, {0, 2, 4, 6}>

but for some reason we don't reduce it to that perm.

And there are still a lot more PERMs than there should be.
The whole PERM<0,2,1,3> shows up a few times in many other places too.
(In reply to Andrew Pinski from comment #6)
> With my patch for V4QI, we still don't get the best code:
>   vect_perm_even_271 = VEC_PERM_EXPR <vect__1.7_264, vect__1.8_266, { 0, 2,
> 4, 6 }>;
>   vect_perm_even_273 = VEC_PERM_EXPR <vect__1.9_268, vect__1.10_270, { 0, 2,
> 4, 6 }>;
>   vect_perm_even_275 = VEC_PERM_EXPR <vect_perm_even_271,
> vect_perm_even_273, { 0, 2, 4, 6 }>;
>
> _275={_264[0], _264[2], _268[0], _268[2]} or
> VEC_PERM<_264, _268, {0, 2, 4, 6}>
>
> but for some reason we don't reduce it to that perm
>
> And there is still a lot of extra PERMS than there should be.

That's because this loop is not something that can be fixed by using V4QI (we tried before). This loop requires improvements to SCEV and SLP. It's loading 16 sequential bytes, as there's no gap between the p1 and p2 values across iterations, so this loop should be vectorized with V16QI and widening additions.

So I don't think this is related to the other example. I'll take it back, as it requires actual vectorizer work and is part of the things we're trying to address in GCC 15.
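To make the "16 sequential bytes" point concrete: since p1 and p2 each advance by 4 per iteration and each iteration reads 4 bytes, the four iterations together cover bytes 0..15 of each input with no gap. A scalar sketch of the flattened form (the name test_loop_2_flat is made up for this comment, and this is not what the vectorizer emits):

```c
/* Hypothetical flattened equivalent of the original loop: one pass over
   16 contiguous bytes, widening each char to int before subtracting. */
int test_loop_2_flat(const char *p1, const char *p2)
{
  int s = 0;
  for (int i = 0; i < 16; i++)
    s += (int) p1[i] - (int) p2[i];
  return s;
}
```

Written this way, each input is a single 16-byte load, which is the V16QI shape, and the subtraction and accumulation happen in a wider type, which is where the widening additions come in.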