86530 – Vectorization failure for a simple loop

Status:	ASSIGNED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	9.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Tamar Christina

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	spec vectorizer

Reported:	2018-07-16 07:44 UTC by Jiangning Liu
Modified:	2024-02-27 07:52 UTC
CC List:	5 users

See Also:	65930 88492 113458
Host:
Target:	arm aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed:	2018-07-16 00:00:00

Attachments
vectorization failure (475 bytes, text/plain) 2018-07-16 07:46 UTC, Jiangning Liu

Description Jiangning Liu 2018-07-16 07:44:02 UTC

GCC -O3 can't vectorize the following simple case. 

$ cat test_loop_2.c
int test_loop_2(char *p1, char *p2)
{
    int s = 0;
    for(int i=0; i<4; i++, p1+=4, p2+=4)
    {
        s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
    }

    return s;
}

Each iteration processes only 4*1=4 bytes per pointer, which does not directly fill an 8-byte or 16-byte vector, but we can still extend each element to 32 bits and use vector operations on a 4*4=16-byte vector.
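
A minimal sketch of that idea, written by hand with the GCC/Clang vector extension (`vector_size`). The names `v4si`, `widen4` and `test_loop_2_widened` are hypothetical and only for illustration; this is not what the vectorizer emits, just one way to express "widen four bytes to four 32-bit lanes, then operate on a 16-byte vector":

```c
/* Assumption: a compiler supporting __attribute__((vector_size)),
   e.g. GCC or Clang. All names below are illustrative only. */
typedef int v4si __attribute__((vector_size(16)));

/* Widen four consecutive chars into four 32-bit lanes. */
static inline v4si widen4(const char *p)
{
    return (v4si){ p[0], p[1], p[2], p[3] };
}

int test_loop_2_widened(const char *p1, const char *p2)
{
    v4si acc = { 0, 0, 0, 0 };

    /* Same trip count and pointer stepping as test_loop_2. */
    for (int i = 0; i < 4; i++, p1 += 4, p2 += 4)
        acc += widen4(p1) - widen4(p2);   /* lane-wise 32-bit subtract/add */

    /* Horizontal reduction of the four partial sums. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Since the differences are small and the sum is computed in int, reordering the additions this way should give the same result as the scalar loop.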

Comment 1 Jiangning Liu 2018-07-16 07:46:43 UTC

Created attachment 44396
vectorization failure

Attached is the -O3 result for aarch64, in which no vectorized code is generated at all.

Comment 2 ktkachov 2018-07-16 08:00:27 UTC

Confirmed

Comment 3 Tamar Christina 2019-04-09 18:12:09 UTC

I'll take this one as part of GCC10.

Comment 4 Eric Gallager 2019-09-19 02:56:39 UTC

(In reply to Tamar Christina from comment #3)
> I'll take this one as part of GCC10.

Reconfirmed at Cauldron, where it was also mentioned that this bug is related to bug 65930 and bug 88492.

Comment 5 Andrew Pinski 2024-02-27 07:25:40 UTC

Actually I have a patch for this (and also for PR 113458), which I will be submitting for GCC 15.

Comment 6 Andrew Pinski 2024-02-27 07:46:26 UTC

With my patch for V4QI, we still don't get the best code:
  vect_perm_even_271 = VEC_PERM_EXPR <vect__1.7_264, vect__1.8_266, { 0, 2, 4, 6 }>;
  vect_perm_even_273 = VEC_PERM_EXPR <vect__1.9_268, vect__1.10_270, { 0, 2, 4, 6 }>;
  vect_perm_even_275 = VEC_PERM_EXPR <vect_perm_even_271, vect_perm_even_273, { 0, 2, 4, 6 }>;

_275={_264[0], _264[2], _268[0], _268[2]} or
VEC_PERM<_264, _268, {0, 2, 4, 6}>

but for some reason we don't reduce it to that perm.

And there are still a lot more PERMs than there should be.
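
For readers tracing the lanes, here is a small sketch that models the three VEC_PERM_EXPR statements above on plain arrays. It assumes four-lane vectors and the usual selector meaning (indices 0..3 pick from the first operand, 4..7 from the second); `vec_perm4` and `compose_even_perms` are hypothetical helper names, not GCC functions. Under this model the final vector comes out as lane 0 of each of the four inputs (`_264`, `_266`, `_268`, `_270`), which is worth checking against the two-input form written above when working out why no single perm is produced:

```c
/* Model of a 4-lane VEC_PERM_EXPR <a, b, sel>: selector values 0..3
   index into a, values 4..7 index into b. Illustrative helper only. */
static void vec_perm4(const int a[4], const int b[4],
                      const int sel[4], int out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = sel[i] < 4 ? a[sel[i]] : b[sel[i] - 4];
}

/* Replays the three statements from the dump with the { 0, 2, 4, 6 }
   selector; parameter names mirror the SSA names in the dump. */
void compose_even_perms(const int v264[4], const int v266[4],
                        const int v268[4], const int v270[4],
                        int out[4])
{
    static const int even[4] = { 0, 2, 4, 6 };
    int t271[4], t273[4];

    vec_perm4(v264, v266, even, t271);  /* vect_perm_even_271 */
    vec_perm4(v268, v270, even, t273);  /* vect_perm_even_273 */
    vec_perm4(t271, t273, even, out);   /* vect_perm_even_275 */
}
```

Feeding in distinguishable values such as {10,11,12,13}, {20,21,22,23}, {30,31,32,33} and {40,41,42,43} makes it easy to see which input lane ends up in each output lane.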

Comment 7 Andrew Pinski 2024-02-27 07:49:47 UTC

The whole PERM<0,2,1,3> shows up a few times in many other places too.

Comment 8 Tamar Christina 2024-02-27 07:51:53 UTC

(In reply to Andrew Pinski from comment #6)
> With my patch for V4QI, we still don't get the best code:
>   vect_perm_even_271 = VEC_PERM_EXPR <vect__1.7_264, vect__1.8_266, { 0, 2, 4, 6 }>;
>   vect_perm_even_273 = VEC_PERM_EXPR <vect__1.9_268, vect__1.10_270, { 0, 2, 4, 6 }>;
>   vect_perm_even_275 = VEC_PERM_EXPR <vect_perm_even_271, vect_perm_even_273, { 0, 2, 4, 6 }>;
> 
> _275={_264[0], _264[2], _268[0], _268[2]} or
> VEC_PERM<_264, _268, {0, 2, 4, 6}>
> 
> but for some reason we don't reduce it to that perm.
> 
> And there are still a lot more PERMs than there should be.

That's because this loop is not something that can be fixed by using V4QI (we tried before).

This loop requires improvements to SCEV and SLP. It's loading 16 sequential bytes, as there's no gap between the p1 and p2 values across iterations.

So this loop should be vectorized with V16QI and widening additions, and I don't think it is related to the other example.

So I'll take it back, as it requires actual vectorizer work and is part of the things we're trying to address in GCC 15.
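
To make the "16 sequential bytes" observation concrete, here is a hand-flattened sketch of the same computation. The name `test_loop_2_flat` is hypothetical, and this is only an illustration of the access pattern; no claim is made about which GCC versions or targets vectorize this form:

```c
/* Because p1 and p2 each advance by 4 per iteration and each iteration
   reads bytes 0..3, the four iterations together read bytes 0..15 of
   each pointer with no gaps. Illustrative rewrite only. */
int test_loop_2_flat(const char *p1, const char *p2)
{
    int s = 0;

    /* One pass over 16 contiguous bytes; each difference is widened
       to int before being accumulated. */
    for (int j = 0; j < 16; j++)
        s += p1[j] - p2[j];

    return s;
}
```

Seen this way, the natural vector shape is a single 16-byte load per pointer followed by widening subtracts/adds and a reduction, rather than four separate 4-byte groups.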