Vectorization for store with negative step
Richard Biener
richard.guenther@gmail.com
Wed Dec 18 11:47:00 GMT 2013
On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Thanks, Richard. I will file a bug report and prepare a complete patch. For the perm_mask_for_reverse function, should I move it before vectorizable_store or add a declaration?
Move it.
Richard.
>
> Bingfeng
> -----Original Message-----
> From: Richard Biener [mailto:richard.guenther@gmail.com]
> Sent: 18 December 2013 11:26
> To: Bingfeng Mei
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: Vectorization for store with negative step
>
> On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
>> Hi,
>> I was looking at some loops that LLVM can vectorize but GCC cannot. One such type is a loop containing a store with a negative step.
>>
>> void test1(short * __restrict__ x, short * __restrict__ y, short * __restrict__ z)
>> {
>>   int i;
>>   for (i=127; i>=0; i--) {
>>     x[i] = y[127-i] + z[127-i];
>>   }
>> }
>>
>> I don't know why GCC implements negative step only for loads and not for stores. I implemented a patch that closely follows the corresponding code in vectorizable_load.
>>
>> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>>
>> Without patch:
>> test1:
>> .LFB0:
>> addq $254, %rdi
>> xorl %eax, %eax
>> .p2align 4,,10
>> .p2align 3
>> .L2:
>> movzwl (%rsi,%rax), %ecx
>> subq $2, %rdi
>> addw (%rdx,%rax), %cx
>> addq $2, %rax
>> movw %cx, 2(%rdi)
>> cmpq $256, %rax
>> jne .L2
>> rep; ret
>>
>> With patch:
>> test1:
>> .LFB0:
>> vmovdqa .LC0(%rip), %xmm1
>> xorl %eax, %eax
>> .p2align 4,,10
>> .p2align 3
>> .L2:
>> vmovdqu (%rsi,%rax), %xmm0
>> movq %rax, %rcx
>> negq %rcx
>> vpaddw (%rdx,%rax), %xmm0, %xmm0
>> vpshufb %xmm1, %xmm0, %xmm0
>> addq $16, %rax
>> cmpq $256, %rax
>> vmovups %xmm0, 240(%rdi,%rcx)
>> jne .L2
>> rep; ret
>>
>> Performance is definitely improved here. The patch bootstraps on x86_64-unknown-linux-gnu and shows no additional regressions on my machine.
>>
>> For reference, LLVM seems to use different instructions and generates slightly worse code. I am not very familiar with x86 assembly code. The patch was originally written for our private port.
>> test1: # @test1
>> .cfi_startproc
>> # BB#0: # %entry
>> addq $240, %rdi
>> xorl %eax, %eax
>> .align 16, 0x90
>> .LBB0_1: # %vector.body
>> # =>This Inner Loop Header: Depth=1
>> movdqu (%rsi,%rax,2), %xmm0
>> movdqu (%rdx,%rax,2), %xmm1
>> paddw %xmm0, %xmm1
>> shufpd $1, %xmm1, %xmm1 # xmm1 = xmm1[1,0]
>> pshuflw $27, %xmm1, %xmm0 # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>> pshufhw $27, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>> movdqu %xmm0, (%rdi)
>> addq $8, %rax
>> addq $-16, %rdi
>> cmpq $128, %rax
>> jne .LBB0_1
>> # BB#2: # %for.end
>> ret
>>
>> Any comment?
>
> Looks good to me. One of the various TODOs in vectorizable_store I presume.
>
> It needs a testcase and, at this stage, a bug report that it fixes.
>
> Thanks,
> Richard.
>
>> Bingfeng Mei
>> Broadcom UK
>>
>>
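
For readers following the thread, the transformation under discussion can be sketched in plain C. This is only an illustration of what the "With patch" assembly appears to do (add a forward-loaded block, reverse its lanes as vpshufb does, then store the block at a descending address); it is not the GCC patch itself. The names test1_scalar, test1_blocked, N and VF are hypothetical and do not appear in the original message, and VF = 8 assumes 128-bit vectors of 16-bit shorts.

```c
#include <string.h>

enum { N = 128, VF = 8 };  /* assumption: 8 shorts per 128-bit vector */

/* Scalar reference: the loop from the original message. */
void test1_scalar(short *restrict x, const short *restrict y,
                  const short *restrict z)
{
    for (int i = N - 1; i >= 0; i--)
        x[i] = (short)(y[N - 1 - i] + z[N - 1 - i]);
}

/* Block-wise emulation of a vectorized store with negative step. */
void test1_blocked(short *restrict x, const short *restrict y,
                   const short *restrict z)
{
    for (int j = 0; j < N; j += VF) {
        short sum[VF], rev[VF];

        /* Forward loads and element-wise add (one vector add). */
        for (int k = 0; k < VF; k++)
            sum[k] = (short)(y[j + k] + z[j + k]);

        /* Reverse the lanes (the permute that vpshufb performs). */
        for (int k = 0; k < VF; k++)
            rev[k] = sum[VF - 1 - k];

        /* One contiguous store; the destination moves down by VF
           elements each iteration, starting at x[N - VF]. */
        memcpy(&x[N - VF - j], rev, sizeof rev);
    }
}
```

The starting offset N - VF = 120 elements (240 bytes) corresponds to the 240(%rdi,%rcx) store address in the patched assembly, where %rcx holds the negated byte offset.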