Bug 86530 - Vectorization failure for a simple loop
Summary: Vectorization failure for a simple loop
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Tamar Christina
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec vectorizer
Reported: 2018-07-16 07:44 UTC by Jiangning Liu
Modified: 2024-02-27 07:52 UTC (History)
CC List: 5 users

See Also:
Host:
Target: arm aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-07-16 00:00:00


Attachments
vectorization failure (475 bytes, text/plain)
2018-07-16 07:46 UTC, Jiangning Liu

Description Jiangning Liu 2018-07-16 07:44:02 UTC
GCC -O3 can't vectorize the following simple case. 

$ cat test_loop_2.c
int test_loop_2(char *p1, char *p2)
{
    int s = 0;
    for(int i=0; i<4; i++, p1+=4, p2+=4)
    {
        s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
    }

    return s;
}

The vector size is 4*1=4 bytes, which doesn't directly fit into an 8-byte or 16-byte vector, but we can still extend each element to 32 bits and use vector operations on a 4*4=16-byte vector.
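
A minimal sketch of the widening idea described above, written by hand with the GCC/Clang `vector_size` extension. It only illustrates the shape of the transformation and is not what the vectorizer would necessarily emit; the names `v4si` and `test_loop_2_widened` are made up for this example, and `test_loop_2` is repeated from the report so the two can be compared.

```c
#include <stdint.h>

/* Four 32-bit lanes: 4*4 = 16 bytes (GCC/Clang vector extension). */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Scalar reference, copied from the report. */
int test_loop_2(char *p1, char *p2)
{
    int s = 0;
    for (int i = 0; i < 4; i++, p1 += 4, p2 += 4)
        s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
    return s;
}

/* Hypothetical hand-widened variant: each 4-byte group is extended to
   four 32-bit lanes, and the differences are accumulated lane-wise. */
int test_loop_2_widened(char *p1, char *p2)
{
    v4si acc = {0, 0, 0, 0};
    for (int i = 0; i < 4; i++, p1 += 4, p2 += 4) {
        v4si a = {p1[0], p1[1], p1[2], p1[3]};  /* char -> int per lane */
        v4si b = {p2[0], p2[1], p2[2], p2[3]};
        acc += a - b;                           /* one 16-byte vector op */
    }
    /* Horizontal reduction of the four partial sums. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Both functions perform the same integer additions in a different order, so they return the same value for any 16-byte inputs.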
Comment 1 Jiangning Liu 2018-07-16 07:46:43 UTC
Created attachment 44396
vectorization failure

Attached is the -O3 result for aarch64, in which no vectorized code is generated at all.
Comment 2 ktkachov 2018-07-16 08:00:27 UTC
Confirmed
Comment 3 Tamar Christina 2019-04-09 18:12:09 UTC
I'll take this one as part of GCC10.
Comment 4 Eric Gallager 2019-09-19 02:56:39 UTC
(In reply to Tamar Christina from comment #3)
> I'll take this one as part of GCC10.

Reconfirmed at Cauldron, where it was also mentioned that this bug is related to bug 65930 and bug 88492.
Comment 5 Andrew Pinski 2024-02-27 07:25:40 UTC
Actually I have a patch for this (PR 113458 also) which I will be submitting for GCC 15.
Comment 6 Andrew Pinski 2024-02-27 07:46:26 UTC
With my patch for V4QI, we still don't get the best code:
  vect_perm_even_271 = VEC_PERM_EXPR <vect__1.7_264, vect__1.8_266, { 0, 2, 4, 6 }>;
  vect_perm_even_273 = VEC_PERM_EXPR <vect__1.9_268, vect__1.10_270, { 0, 2, 4, 6 }>;
  vect_perm_even_275 = VEC_PERM_EXPR <vect_perm_even_271, vect_perm_even_273, { 0, 2, 4, 6 }>;

_275={_264[0], _264[2], _268[0], _268[2]} or
VEC_PERM<_264, _268, {0, 2, 4, 6}>

but for some reason we don't reduce it to that perm

And there are still a lot more PERMs than there should be.
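
The nested permutes in the dump can be checked lane by lane with a small model. The sketch below assumes the usual VEC_PERM_EXPR semantics, where selector index i picks element i of the two inputs concatenated; `vec4`, `vec_perm` and `nested_even` are illustrative names for this example, not GCC internals.

```c
/* A 4-lane vector modelled as a plain struct. */
typedef struct { int lane[4]; } vec4;

/* Model of VEC_PERM_EXPR <v0, v1, sel> for 4-lane vectors (assumed
   semantics): indices 0..3 select from v0, indices 4..7 from v1. */
vec4 vec_perm(vec4 v0, vec4 v1, const int sel[4])
{
    vec4 r;
    for (int i = 0; i < 4; i++) {
        int idx = sel[i] & 7;   /* keep the index inside the 8 lanes */
        r.lane[i] = idx < 4 ? v0.lane[idx] : v1.lane[idx - 4];
    }
    return r;
}

/* The three permutes from the dump, with a, b, c, d standing in for
   vect__1.7_264, vect__1.8_266, vect__1.9_268 and vect__1.10_270. */
vec4 nested_even(vec4 a, vec4 b, vec4 c, vec4 d)
{
    static const int even[4] = {0, 2, 4, 6};
    vec4 ab = vec_perm(a, b, even);   /* vect_perm_even_271 */
    vec4 cd = vec_perm(c, d, even);   /* vect_perm_even_273 */
    return vec_perm(ab, cd, even);    /* vect_perm_even_275 */
}
```

Feeding in four inputs with distinguishable lane values shows which source lanes reach the final vector. Under the assumed semantics that is lane 0 of each of the four inputs, which is worth comparing against any proposed single-permute replacement.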
Comment 7 Andrew Pinski 2024-02-27 07:49:47 UTC
The whole PERM<0,2,1,3> shows up a few times in many other places too.
Comment 8 Tamar Christina 2024-02-27 07:51:53 UTC
(In reply to Andrew Pinski from comment #6)
> With my patch for V4QI, we still don't get the best code:
>   vect_perm_even_271 = VEC_PERM_EXPR <vect__1.7_264, vect__1.8_266, { 0, 2,
> 4, 6 }>;
>   vect_perm_even_273 = VEC_PERM_EXPR <vect__1.9_268, vect__1.10_270, { 0, 2,
> 4, 6 }>;
>   vect_perm_even_275 = VEC_PERM_EXPR <vect_perm_even_271,
> vect_perm_even_273, { 0, 2, 4, 6 }>;
> 
> _275={_264[0], _264[2], _268[0], _268[2]} or
> VEC_PERM<_264, _268, {0, 2, 4, 6}>
> 
> but for some reason we don't reduce it to that perm
> 
> And there is still a lot of extra PERMS than there should be.

That's because this loop is not something that can be fixed by using V4QI (we tried before).

This loop requires improvements to SCEV and SLP. It's loading 16 sequential bytes, as there's no gap between the p1 and p2 values across iterations.

So this loop should be vectorized with V16QI and widening additions. I don't think this is related to the other example.

So I'll take it back, as it requires actual vectorizer work and is part of the things we're trying to address in GCC 15.
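
A rough sketch of the point above: because each iteration reads 4 bytes and then advances both pointers by 4, the loop covers 16 contiguous bytes of each input. The flattened form and the hand-written 16-lane form below are illustrations only, assuming the GCC/Clang `vector_size` extension and `__builtin_convertvector` (GCC 9 or later); `test_loop_2_flat`, `test_loop_2_v16`, `v16qi` and `v16hi` are names chosen for this example, and `test_loop_2` is repeated from the report for comparison.

```c
#include <string.h>

typedef char  v16qi __attribute__((vector_size(16)));  /* 16 x 8-bit  */
typedef short v16hi __attribute__((vector_size(32)));  /* 16 x 16-bit */

/* Scalar reference, copied from the report. */
int test_loop_2(char *p1, char *p2)
{
    int s = 0;
    for (int i = 0; i < 4; i++, p1 += 4, p2 += 4)
        s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
    return s;
}

/* The same computation seen as one pass over 16 sequential bytes. */
int test_loop_2_flat(const char *p1, const char *p2)
{
    int s = 0;
    for (int j = 0; j < 16; j++)
        s += p1[j] - p2[j];
    return s;
}

/* Hypothetical hand-vectorized form: one 16-byte load per input,
   widen every lane to 16 bits, subtract, then reduce. */
int test_loop_2_v16(const char *p1, const char *p2)
{
    v16qi a, b;
    memcpy(&a, p1, sizeof a);   /* unaligned 16-byte load */
    memcpy(&b, p2, sizeof b);
    /* Each difference lies in [-255, 255], so 16-bit lanes suffice. */
    v16hi d = __builtin_convertvector(a, v16hi)
            - __builtin_convertvector(b, v16hi);
    int s = 0;
    for (int j = 0; j < 16; j++)
        s += d[j];              /* horizontal sum in int */
    return s;
}
```

All three functions add up the same 16 byte differences, so they agree on any 16-byte inputs regardless of whether plain `char` is signed or unsigned on the target.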