[Bug tree-optimization/104010] [12 regression] short loop no longer vectorized with Neon after r12-3362

Wed Apr 13 10:42:49 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104010

Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #6 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
The reason this wasn't reproducible is that there is a typo in the testcase:
the loop iteration count should be 2, not 4.  The clues are the function name
and the generated assembly code, which both show 2 iterations of the loop.

Changing the test to:

#include <stdint.h>
void test_vcmpeq_s32x2 (int32_t * __restrict__ dest, int32_t *a, int32_t *b)
{
  int i;
  for (i=0; i<2; i++) {
    dest[i] = a[i] == b[i];
  }
}

does indeed show a regression between gcc-11 and trunk.  With gcc-11 the
costing shows:

vect.c:5:13: note: Cost model analysis: 
0x2f0a780 _28 1 times scalar_store costs 1 in body
0x2f0a780 _41 1 times scalar_store costs 1 in body
0x2f0a780 (int) _26 1 times scalar_stmt costs 1 in body
0x2f0a780 (int) _39 1 times scalar_stmt costs 1 in body
0x2f0a780 _23 == _25 1 times scalar_stmt costs 1 in body
0x2f0a780 _36 == _38 1 times scalar_stmt costs 1 in body
0x2f0a780 *a_13(D) 1 times scalar_load costs 1 in body
0x2f0a780 MEM[(int *)a_13(D) + 4B] 1 times scalar_load costs 1 in body
0x2f0a780 *b_14(D) 1 times scalar_load costs 1 in body
0x2f0a780 MEM[(int *)b_14(D) + 4B] 1 times scalar_load costs 1 in body
0x2f0a780 *a_13(D) 1 times unaligned_load (misalign -1) costs 1 in body
0x2f0a780 *b_14(D) 1 times unaligned_load (misalign -1) costs 1 in body
0x2f0a780 _23 == _25 1 times vector_stmt costs 1 in body
0x2f0a780 _26 ? 1 : 0 1 times vector_stmt costs 1 in body
0x2f0a780 <unknown> 1 times vector_load costs 1 in prologue
0x2f0a780 <unknown> 1 times vector_load costs 1 in prologue
0x2f0a780 _28 1 times unaligned_store (misalign -1) costs 1 in body
vect.c:5:13: note: Cost model analysis for part in loop 0:
  Vector cost: 7
  Scalar cost: 10

While trunk shows:

vect.c:5:13: note: Cost model analysis: 
_28 1 times scalar_store costs 1 in body
_41 1 times scalar_store costs 1 in body
(int) _26 1 times scalar_stmt costs 1 in body
(int) _39 1 times scalar_stmt costs 1 in body
*a_13(D) 1 times unaligned_load (misalign -1) costs 1 in body
*b_14(D) 1 times unaligned_load (misalign -1) costs 1 in body
_23 == _25 1 times vector_stmt costs 1 in body
_26 ? 1 : 0 1 times vector_stmt costs 1 in body
node 0x3bc5078 1 times vector_load costs 1 in prologue
node 0x3bc5100 1 times vector_load costs 1 in prologue
_28 1 times unaligned_store (misalign -1) costs 1 in body
vect.c:5:13: note: Cost model analysis for part in loop 0:
  Vector cost: 7
  Scalar cost: 4
vect.c:5:13: missed: not vectorized: vectorization is not profitable.

Now the question is: why has the scalar cost been so dramatically reduced?
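
As a rough cross-check of the two dumps, the sketch below tallies the
per-statement costs by hand.  It assumes that the reported totals are simply
the sum of the "costs" values, and that a candidate is treated as profitable
only when the vector cost is strictly below the scalar cost.  Both assumptions
match the numbers above but are not taken from the vectorizer's code.  The
names cost_entry, total_cost and is_profitable are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One line of the cost dump: "<stmt> 1 times <kind> costs <cost>".
   Hypothetical helper type for this sketch, not a GCC data structure.  */
struct cost_entry {
  const char *stmt;
  const char *kind;
  int cost;
};

#define N_ENTRIES(a) (sizeof (a) / sizeof ((a)[0]))

/* Scalar entries counted by gcc-11, transcribed from the first dump.  */
static const struct cost_entry gcc11_scalar[] = {
  { "_28", "scalar_store", 1 },
  { "_41", "scalar_store", 1 },
  { "(int) _26", "scalar_stmt", 1 },
  { "(int) _39", "scalar_stmt", 1 },
  { "_23 == _25", "scalar_stmt", 1 },
  { "_36 == _38", "scalar_stmt", 1 },
  { "*a_13(D)", "scalar_load", 1 },
  { "MEM[(int *)a_13(D) + 4B]", "scalar_load", 1 },
  { "*b_14(D)", "scalar_load", 1 },
  { "MEM[(int *)b_14(D) + 4B]", "scalar_load", 1 },
};

/* Scalar entries counted by trunk, transcribed from the second dump.
   The two compares and the four loads no longer appear.  */
static const struct cost_entry trunk_scalar[] = {
  { "_28", "scalar_store", 1 },
  { "_41", "scalar_store", 1 },
  { "(int) _26", "scalar_stmt", 1 },
  { "(int) _39", "scalar_stmt", 1 },
};

/* Vector entries; kinds and costs are the same in both dumps.  */
static const struct cost_entry vector_entries[] = {
  { "*a_13(D)", "unaligned_load", 1 },
  { "*b_14(D)", "unaligned_load", 1 },
  { "_23 == _25", "vector_stmt", 1 },
  { "_26 ? 1 : 0", "vector_stmt", 1 },
  { "(prologue)", "vector_load", 1 },
  { "(prologue)", "vector_load", 1 },
  { "_28", "unaligned_store", 1 },
};

/* Sum the "costs" column (assumption: the total is a plain sum).  */
int total_cost (const struct cost_entry *e, size_t n)
{
  int sum = 0;
  assert (e != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    sum += e[i].cost;
  return sum;
}

/* Assumed profitability rule: vector cost strictly below scalar cost.  */
int is_profitable (int vector_cost, int scalar_cost)
{
  return vector_cost < scalar_cost;
}
```

With these tables, total_cost gives 10 for gcc11_scalar, 4 for trunk_scalar
and 7 for vector_entries.  The six scalar entries missing on trunk (two
compares and four loads) account for the whole drop from 10 to 4, which is
what tips the comparison against the unchanged vector cost of 7.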