[Bug tree-optimization/59544] New: Vectorizing store with negative step

bmei at broadcom dot com gcc-bugzilla@gcc.gnu.org
Wed Dec 18 12:18:00 GMT 2013


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544

            Bug ID: 59544
           Summary: Vectorizing store with negative step
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bmei at broadcom dot com

Created attachment 31467
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31467&action=edit
The patch against r206016

I was looking at some loops that can be vectorized by LLVM but not by GCC. One
such loop has a store with a negative step.

void test1(short * __restrict__ x, short * __restrict__ y,
           short * __restrict__ z)
{
    int i;
    for (i = 127; i >= 0; i--) {
        x[i] = y[127-i] + z[127-i];
    }
}

I don't know why GCC only implements negative-step accesses for loads but not
for stores. I implemented a patch (attached), very similar to the existing code
in vectorizable_load.
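To illustrate the transformation (this is only a hand-written sketch of what
the vectorized loop does, not the patch itself; it assumes SSSE3 intrinsics and
a 128-bit vector width, and test1_sketch is just an illustrative name):

#include <tmmintrin.h>

void test1_sketch(short *__restrict__ x, short *__restrict__ y,
                  short *__restrict__ z)
{
    /* Byte shuffle mask that reverses the order of eight 16-bit lanes. */
    const __m128i rev = _mm_set_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                     9, 8, 11, 10, 13, 12, 15, 14);
    int i;
    for (i = 0; i < 128; i += 8) {
        __m128i vy  = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i vz  = _mm_loadu_si128((const __m128i *)(z + i));
        __m128i sum = _mm_add_epi16(vy, vz);
        sum = _mm_shuffle_epi8(sum, rev);                 /* reverse lanes */
        _mm_storeu_si128((__m128i *)(x + 120 - i), sum);  /* store backwards */
    }
}

That is, the sums are computed with forward loads, the eight 16-bit lanes are
reversed, and the result is stored at a decreasing address, which matches the
vpshufb plus backwards vmovups in the generated code below.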

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
    addq    $254, %rdi
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L2:
    movzwl    (%rsi,%rax), %ecx
    subq    $2, %rdi
    addw    (%rdx,%rax), %cx
    addq    $2, %rax
    movw    %cx, 2(%rdi)
    cmpq    $256, %rax
    jne    .L2
    rep; ret

With patch:
test1:
.LFB0:
    vmovdqa    .LC0(%rip), %xmm1
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L2:
    vmovdqu    (%rsi,%rax), %xmm0
    movq    %rax, %rcx
    negq    %rcx
    vpaddw    (%rdx,%rax), %xmm0, %xmm0
    vpshufb    %xmm1, %xmm0, %xmm0
    addq    $16, %rax
    cmpq    $256, %rax
    vmovups    %xmm0, 240(%rdi,%rcx)
    jne    .L2
    rep; ret
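
If I read this correctly, .LC0 should hold the vpshufb control mask that
reverses the eight 16-bit lanes before the backwards store, presumably
something like (bytes listed lowest address first):

/* Guess at the contents of .LC0: each byte pair selects one 16-bit lane of
   the source vector, in reverse lane order. */
static const unsigned char lc0_mask[16] = {
    14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1
};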

Performance is definitely improved here. The patch bootstraps on
x86_64-unknown-linux-gnu and shows no additional test-suite regressions on my
machine.

For reference, LLVM uses different instructions and seems to generate slightly
worse code, though I am not very familiar with x86 assembly. The patch was
originally written for our private port.
test1:                                  # @test1
        .cfi_startproc
# BB#0:                                 # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2:                                 # %for.end
        ret
