[Bug tree-optimization/59544] New: Vectorizing store with negative step
bmei at broadcom dot com
gcc-bugzilla@gcc.gnu.org
Wed Dec 18 12:18:00 GMT 2013
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544
Bug ID: 59544
Summary: Vectorizing store with negative step
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com
Created attachment 31467
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31467&action=edit
The patch against r206016
I was looking at some loops that LLVM can vectorize but GCC cannot. One
such type is a loop with a negative-step store.
void test1(short * __restrict__ x, short * __restrict__ y,
           short * __restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--) {
    x[i] = y[127-i] + z[127-i];
  }
}
I don't know why GCC implements negative step only for loads, not for
stores. I implemented a patch (attached) that is very similar to the
existing code in vectorizable_load.
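Conceptually, the transform is the mirror image of the negative-step
load case: load a contiguous vector from y and z, do the lanewise add,
reverse the lane order, and issue one contiguous store at the low end of
the destination block. A minimal scalar sketch of what the vectorized
loop computes (VF = 8 shorts per 128-bit vector; test1_blocked is just
an illustrative name, not code from the patch):

void test1_blocked(short * __restrict__ x, short * __restrict__ y,
                   short * __restrict__ z)
{
  int j, k;
  short v[8];
  for (j = 0; j < 128; j += 8) {
    for (k = 0; k < 8; k++)        /* contiguous vector load + add   */
      v[k] = y[j + k] + z[j + k];
    for (k = 0; k < 8; k++)        /* reversed lanes, one contiguous */
      x[120 - j + k] = v[7 - k];   /* store of x[120-j] .. x[127-j]  */
  }
}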
~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
Without patch:
test1:
.LFB0:
        addq    $254, %rdi
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movzwl  (%rsi,%rax), %ecx
        subq    $2, %rdi
        addw    (%rdx,%rax), %cx
        addq    $2, %rax
        movw    %cx, 2(%rdi)
        cmpq    $256, %rax
        jne     .L2
        rep; ret
With patch:
test1:
.LFB0:
        vmovdqa .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovdqu (%rsi,%rax), %xmm0
        movq    %rax, %rcx
        negq    %rcx
        vpaddw  (%rdx,%rax), %xmm0, %xmm0
        vpshufb %xmm1, %xmm0, %xmm0
        addq    $16, %rax
        cmpq    $256, %rax
        vmovups %xmm0, 240(%rdi,%rcx)
        jne     .L2
        rep; ret
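For illustration, here is a hand-written intrinsics version of roughly
what the patched inner loop does (the vmovdqu/vpaddw/vpshufb/vmovups
sequence above). test1_intrin and rev are my own names and this is only
a sketch of the shuffle-and-store pattern, not output of the patch; it
uses the SSSE3 byte shuffle, so compile with -mssse3 or higher:

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

void test1_intrin(short * __restrict__ x, short * __restrict__ y,
                  short * __restrict__ z)
{
  /* Byte-shuffle control that reverses the eight 16-bit lanes. */
  const __m128i rev = _mm_set_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                   9, 8, 11, 10, 13, 12, 15, 14);
  int i;
  for (i = 0; i < 128; i += 8) {
    __m128i a = _mm_loadu_si128((const __m128i *) (y + i));
    __m128i b = _mm_loadu_si128((const __m128i *) (z + i));
    __m128i s = _mm_add_epi16(a, b);   /* lanewise add       */
    s = _mm_shuffle_epi8(s, rev);      /* reverse lane order */
    /* Negative-step store: each 16-byte vector lands 16 bytes below
       the previous one, covering x[120-i] .. x[127-i].  */
    _mm_storeu_si128((__m128i *) (x + 120 - i), s);
  }
}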
Performance is definitely improved here. The patch bootstraps on
x86_64-unknown-linux-gnu and shows no additional testsuite regressions
on my machine.
For reference, LLVM appears to use different instructions and generates
slightly worse code. I am not very familiar with x86 assembly; the patch
was originally written for our private port.
test1:                                  # @test1
        .cfi_startproc
# BB#0:                                 # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2:                                 # %for.end
        ret