Bug 65965 - Straight-line memcpy/memset not vectorized when equivalent loop is
Summary: Straight-line memcpy/memset not vectorized when equivalent loop is
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 5.0
Importance: P3 normal
Target Milestone: 6.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2015-05-01 15:31 UTC by Alan Lawrence
Modified: 2015-09-22 14:05 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-05-04 00:00:00


Description Alan Lawrence 2015-05-01 15:31:12 UTC
Testcase:
void
test(int *__restrict__ a, int *__restrict__ b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
  a[8] = 0;
}
produces (at -O3) on AArch64:
test:
        ldp     w4, w3, [x1]
        ldp     w2, w1, [x1, 8]
        stp     w4, w3, [x0]
        stp     w2, w1, [x0, 8]
        stp     wzr, wzr, [x0, 20]
        stp     wzr, wzr, [x0, 28]
        ret
or on x86_64/-mavx:
test:
.LFB0:
        movl    (%rsi), %eax
        movl    $0, 20(%rdi)
        movl    $0, 24(%rdi)
        movl    $0, 28(%rdi)
        movl    $0, 32(%rdi)
        movl    %eax, (%rdi)
        movl    4(%rsi), %eax
        movl    %eax, 4(%rdi)
        movl    8(%rsi), %eax
        movl    %eax, 8(%rdi)
        movl    12(%rsi), %eax
        movl    %eax, 12(%rdi)
        ret
(and there is nothing in the -fdump-tree-vect dump: the vectorizer never triggers)

In contrast, with the testcase
void
test(int *__restrict__ a, int *__restrict__ b)
{
  for (int i = 0; i < 4; i++) a[i] = b[i];
  for (int i = 0; i < 4; i++) a[i+4] = 0;
}
the memcpy is recognized by ldist and the 'memset' by slp1 (neither of which triggers on the first case), producing (superior) code on AArch64:
test:
        movi    v0.4s, 0
        ldp     x2, x3, [x1]
        stp     x2, x3, [x0]
        str     q0, [x0, 16]
        ret
or x86_64:
test:
.LFB0:
        movq    (%rsi), %rax
        movq    8(%rsi), %rdx
        vpxor   %xmm0, %xmm0, %xmm0
        movq    %rax, (%rdi)
        movq    %rdx, 8(%rdi)
        vmovups %xmm0, 16(%rdi)
        ret
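
In other words, the loop version is effectively treated as the following (a sketch of my reading of the dumps, assuming ldist turns the first loop into a memcpy and slp1 vectorizes the second; test_equiv is a name for illustration only):

#include <string.h>

void
test_equiv (int *__restrict__ a, int *__restrict__ b)
{
  memcpy (a, b, 4 * sizeof (int));      /* first loop, recognized by ldist */
  memset (a + 4, 0, 4 * sizeof (int));  /* second loop, vectorized by slp1 */
}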
Comment 1 Richard Biener 2015-05-04 10:41:57 UTC
It's because you don't init a[4].  Let me fix that.
Comment 2 Richard Biener 2015-05-04 14:25:21 UTC
Author: rguenth
Date: Mon May  4 14:24:49 2015
New Revision: 222765

URL: https://gcc.gnu.org/viewcvs?rev=222765&root=gcc&view=rev
Log:
2015-05-04  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/65965
	* tree-vect-data-refs.c (vect_analyze_data_ref_accesses): Split
	store groups at gaps.

	* gcc.dg/vect/bb-slp-33.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/bb-slp-33.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
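
A much-simplified sketch of the idea behind "split store groups at gaps" (illustration only; split_groups_at_gaps is a hypothetical helper, not the actual vect_analyze_data_ref_accesses code): walk the sorted store offsets and start a new group whenever the next access does not land immediately after the previous one.

#include <stddef.h>

/* Hypothetical illustration: mark in group_start[] where each store
   group begins.  offsets[] holds the n byte offsets of the stores in
   increasing order; elt_size is the access size.  */
static void
split_groups_at_gaps (const size_t *offsets, int n, size_t elt_size,
                      int *group_start)
{
  if (n > 0)
    group_start[0] = 1;
  for (int i = 1; i < n; i++)
    /* A gap, like the uninitialized a[4] in the original testcase,
       ends the current group so each side can be vectorized on its
       own.  */
    group_start[i] = (offsets[i] != offsets[i - 1] + elt_size);
}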
Comment 3 Richard Biener 2015-05-04 14:25:42 UTC
Fixed for GCC 6.
Comment 4 Alan Lawrence 2015-09-22 11:33:32 UTC
(In reply to Richard Biener from comment #3)
> Fixed for GCC 6.

Indeed. I note that the same testcase does _not_ SLP/vectorize if I use consecutive indices:

void
test (int*__restrict a, int*__restrict b)
{
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
    a[3] = b[3];
    a[4] = 0;
    a[5] = 0;
    a[6] = 0;
    a[7] = 0;
}

loop26a.c:6:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
loop26a.c:6:13: note: original stmt *a_4(D) = _3;
loop26a.c:6:13: note: === vect_slp_analyze_data_ref_dependences ===
loop26a.c:6:13: note: === vect_slp_analyze_operations ===
loop26a.c:6:13: note: not vectorized: bad operation in basic block.

Worth another bug?
Comment 5 rguenther@suse.de 2015-09-22 14:05:14 UTC
On Tue, 22 Sep 2015, alalaw01 at gcc dot gnu.org wrote:

> Indeed. I note that the same testcase does _not_ SLP/vectorize if I use
> consecutive indices:
> [...]
> loop26a.c:6:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
> loop26a.c:6:13: note: original stmt *a_4(D) = _3;
> loop26a.c:6:13: note: === vect_slp_analyze_data_ref_dependences ===
> loop26a.c:6:13: note: === vect_slp_analyze_operations ===
> loop26a.c:6:13: note: not vectorized: bad operation in basic block.
> 
> Worth another bug?

The above looks as if SLP is trying a vector size of v8si.  It
_should_ work for v4si.  For v8si we indeed can't vectorize this,
as we don't support "partial" loads.  We could vectorize with
masked loads; IIRC on x86_64 the masked-out elements can be
initialized to 0 or -1, so we could OR in the constant pieces.

Not sure whether that's worth another bug; please double-check your
vector size first.
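
To make the masked-load idea concrete, here is a hand-written AVX2 sketch (test_masked is for illustration only; this is not code GCC emits).  _mm256_maskload_epi32 zeroes the masked-out lanes, so with zero constants there is nothing left to OR in:

#include <immintrin.h>

void
test_masked (int *__restrict a, int *__restrict b)
{
  /* Load b[0..3] into lanes 0..3; lanes 4..7 are masked off and read
     as 0, which happens to match the constant stores a[4..7] = 0.
     Non-zero constants would be ORed in afterwards.  */
  __m256i mask = _mm256_setr_epi32 (-1, -1, -1, -1, 0, 0, 0, 0);
  __m256i v = _mm256_maskload_epi32 (b, mask);
  _mm256_storeu_si256 ((__m256i *) a, v);  /* one 8-lane store */
}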