Vectorization for store with negative step

Richard Biener <richard.guenther@gmail.com>
Wed Dec 18 11:47:00 GMT 2013


On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Thanks, Richard. I will file a bug report and prepare a complete patch. For the perm_mask_for_reverse function, should I move it before vectorizable_store or add a declaration?

Move it.
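
In miniature, the two options look like this (just a sketch of the file layout, with bodies and full parameter lists elided; this is not the actual tree-vect-stmts.c contents):

    /* Option taken: move the definition above its new caller.  */
    static tree
    perm_mask_for_reverse (tree vectype)
    {
      /* ... build the element-reversing permutation mask ... */
    }

    static bool
    vectorizable_store (/* ... */)
    {
      /* ... can now call perm_mask_for_reverse directly ... */
    }

    /* Alternative: leave the definition where it is and add a forward
       declaration near the top of the file instead.  */
    static tree perm_mask_for_reverse (tree);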

Richard.

>
> Bingfeng
> -----Original Message-----
> From: Richard Biener [mailto:richard.guenther@gmail.com]
> Sent: 18 December 2013 11:26
> To: Bingfeng Mei
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: Vectorization for store with negative step
>
> On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
>> Hi,
>> I was looking at some loops that can be vectorized by LLVM, but not by GCC. One such type of loop is a store with a negative step.
>>
>> void test1(short * __restrict__ x, short * __restrict__ y, short * __restrict__ z)
>> {
>>     int i;
>>     for (i=127; i>=0; i--) {
>>         x[i] = y[127-i] + z[127-i];
>>     }
>> }
>>
>> I don't know why GCC only implements negative step for loads, but not for stores. I implemented a patch, very similar to the code in vectorizable_load.
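>>
>> Roughly, the vectorized loop amounts to the following hand-written sketch using the GNU C vector extensions: add eight shorts at a time, reverse the lanes with a constant permutation (the same reversal that perm_mask_for_reverse builds inside the vectorizer), and store each block at a descending address. This is only an illustration, not the patch itself; the function name test1_vec and the hard-coded trip count are made up for the example.
>>
>> typedef short v8hi __attribute__ ((vector_size (16)));
>>
>> void test1_vec (short * __restrict__ x, short * __restrict__ y, short * __restrict__ z)
>> {
>>     /* Constant mask that reverses the order of the eight 16-bit lanes.  */
>>     const v8hi rev = { 7, 6, 5, 4, 3, 2, 1, 0 };
>>     int j;
>>     for (j = 0; j < 128; j += 8) {
>>         v8hi a, b, s;
>>         /* memcpy keeps the accesses valid for unaligned pointers.  */
>>         __builtin_memcpy (&a, y + j, sizeof a);
>>         __builtin_memcpy (&b, z + j, sizeof b);
>>         s = a + b;
>>         s = __builtin_shuffle (s, rev);                /* reverse the lanes */
>>         __builtin_memcpy (x + 120 - j, &s, sizeof s);  /* store at descending address */
>>     }
>> }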
>>
>> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>>
>> Without patch:
>> test1:
>> .LFB0:
>>         addq    $254, %rdi
>>         xorl    %eax, %eax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         movzwl  (%rsi,%rax), %ecx
>>         subq    $2, %rdi
>>         addw    (%rdx,%rax), %cx
>>         addq    $2, %rax
>>         movw    %cx, 2(%rdi)
>>         cmpq    $256, %rax
>>         jne     .L2
>>         rep; ret
>>
>> With patch:
>> test1:
>> .LFB0:
>>         vmovdqa .LC0(%rip), %xmm1
>>         xorl    %eax, %eax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         vmovdqu (%rsi,%rax), %xmm0
>>         movq    %rax, %rcx
>>         negq    %rcx
>>         vpaddw  (%rdx,%rax), %xmm0, %xmm0
>>         vpshufb %xmm1, %xmm0, %xmm0
>>         addq    $16, %rax
>>         cmpq    $256, %rax
>>         vmovups %xmm0, 240(%rdi,%rcx)
>>         jne     .L2
>>         rep; ret
>>
>> Performance is definitely improved here. The patch bootstraps on x86_64-unknown-linux-gnu and shows no additional regressions on my machine.
>>
>> For reference, LLVM seems to use different instructions and generates slightly worse code. I am not so familiar with x86 assembly code. The patch was originally written for our private port.
>> test1:                                  # @test1
>>         .cfi_startproc
>> # BB#0:                                 # %entry
>>         addq    $240, %rdi
>>         xorl    %eax, %eax
>>         .align  16, 0x90
>> .LBB0_1:                                # %vector.body
>>                                         # =>This Inner Loop Header: Depth=1
>>         movdqu  (%rsi,%rax,2), %xmm0
>>         movdqu  (%rdx,%rax,2), %xmm1
>>         paddw   %xmm0, %xmm1
>>         shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
>>         pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>>         pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>>         movdqu  %xmm0, (%rdi)
>>         addq    $8, %rax
>>         addq    $-16, %rdi
>>         cmpq    $128, %rax
>>         jne     .LBB0_1
>> # BB#2:                                 # %for.end
>>         ret
>>
>> Any comment?
>
> Looks good to me.  One of the various TODOs in vectorizable_store I presume.
>
> Needs a testcase, and at this stage a bug report that is fixed by it.
>
> Thanks,
> Richard.
>
>> Bingfeng Mei
>> Broadcom UK
>>
>>


