This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] Fix PR56118
- From: Richard Biener <rguenther at suse dot de>
- To: Alan Lawrence <alan dot lawrence at arm dot com>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Mon, 16 Nov 2015 09:39:14 +0100 (CET)
- Subject: Re: [PATCH] Fix PR56118
- Authentication-results: sourceware.org; auth=none
- References: <alpine dot LSU dot 2 dot 11 dot 1511101033150 dot 10078 at zhemvz dot fhfr dot qr> <56462679 dot 5040908 at arm dot com>
On Fri, 13 Nov 2015, Alan Lawrence wrote:
> On 10/11/15 09:34, Richard Biener wrote:
> >
> > The following fixes PR56118 by adjusting the cost model handling of
> > basic-block vectorization to favor the vectorized version in case
> > estimated cost is the same as the estimated cost of the scalar
> > version. This makes sense because we over-estimate the vectorized
> > cost in several places.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
> >
> > Richard.
> >
> > 2015-11-10 Richard Biener <rguenther@suse.de>
> >
> > PR tree-optimization/56118
> > * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Make equal
> > cost favor vectorized version.
> >
> > * gcc.target/i386/pr56118.c: New testcase.
>
> On AArch64 and ARM targets, this causes PASS->FAIL of
>
> gcc.dg/vect/bb-slp-32.c scan-tree-dump slp2 "vectorization is not profitable"
> gcc.dg/vect/bb-slp-32.c -flto -ffat-lto-objects scan-tree-dump slp2
> "vectorization is not profitable"
>
> ....that sounds like a good thing ;)
It depends ;) You may want to look at the generated code with/without
vectorization and decide for yourselves. The testcase is
int foo (int *p)
{
int x[4];
int tem0, tem1, tem2, tem3;
tem0 = p[0] + 1;
x[0] = tem0;
tem1 = p[1] + 2;
x[1] = tem1;
tem2 = p[2] + 3;
x[2] = tem2;
tem3 = p[3] + 4;
x[3] = tem3;
bar (x);
return tem0 + tem1 + tem2 + tem3;
}
which was added to cover the situation where we vectorize the store
to x[] but have to keep the scalar computations for tem[0-3] for
the final reduction. The scalar cost for this kernel is 3*4
while the vector cost is unaligned load + aligned load + vector op
+ aligned vector store. We compensate for the out-of-kernel uses
by making the scalar cost just the stores (4). Now if all of the
vector cost pieces are 1 then we have 4 vs. 4 here. On x86_64
the unaligned load cost is 2 and thus vectorization is deemed
non-profitable.
In reality this depends on the actual constants used in the plus
as that tells you whether the constant is free for the scalar
plus or not (for vectors it almost always comes from the constant
pool).
On x86_64 the assembler difference is
movl (%rdi), %eax
movdqu (%rdi), %xmm0
leal 1(%rax), %r13d
movl 4(%rdi), %eax
paddd .LC0(%rip), %xmm0
movaps %xmm0, (%rsp)
leal 2(%rax), %r12d
movl 8(%rdi), %eax
addl %r13d, %r12d
leal 3(%rax), %ebp
movl 12(%rdi), %eax
movq %rsp, %rdi
addl %r12d, %ebp
leal 4(%rax), %ebx
call bar
leal 0(%rbp,%rbx), %eax
vs.
movl (%rdi), %eax
movl 12(%rdi), %ebx
leal 1(%rax), %r13d
movl 4(%rdi), %eax
addl $4, %ebx
movl %ebx, 12(%rsp)
movl %r13d, (%rsp)
leal 2(%rax), %r12d
movl 8(%rdi), %eax
movq %rsp, %rdi
movl %r12d, 4(%rsp)
leal 3(%rax), %ebp
movl %ebp, 8(%rsp)
call bar
leal 0(%r13,%r12), %eax
addl %ebp, %eax
addl %ebx, %eax
clearly the testcase could need adjustment to be not so on the
edge of the individual targets. I'm considering changing it.
The testcase also shows the lack of reduction vectorization in BBs
(I have partial finished patches for this but got distracted...).
>, so I imagine the xfail directive may
> just need updating. The test also looks to be failing on powerpc64 (according
> to https://gcc.gnu.org/ml/gcc-testresults/2015-11/msg01327.html).
I'll try making the testcase more complicated instead.
Richard.
> Regards, Alan
>
>
--
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)