Vectorization for store with negative step

I was looking at some loops that LLVM can vectorize but GCC cannot. One such case is a loop whose store has a negative step:

void test1(short * __restrict__ x, short * __restrict__ y, short * __restrict__ z)
{
    int i;
    /* x is written from the top down while y and z are read from the bottom up */
    for (i=127; i>=0; i--) {
        x[i] = y[127-i] + z[127-i];
    }
}

I don't know why GCC implements negative step only for loads and not for stores. I implemented a patch that closely follows the existing code in vectorizable_load.

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
	addq	$254, %rdi
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	movzwl	(%rsi,%rax), %ecx
	subq	$2, %rdi
	addw	(%rdx,%rax), %cx
	addq	$2, %rax
	movw	%cx, 2(%rdi)
	cmpq	$256, %rax
	jne	.L2
	rep; ret

With patch:
	vmovdqa	.LC0(%rip), %xmm1
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	vmovdqu	(%rsi,%rax), %xmm0
	movq	%rax, %rcx
	negq	%rcx
	vpaddw	(%rdx,%rax), %xmm0, %xmm0
	vpshufb	%xmm1, %xmm0, %xmm0
	addq	$16, %rax
	cmpq	$256, %rax
	vmovups	%xmm0, 240(%rdi,%rcx)
	jne	.L2
	rep; ret
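
To make the patched loop easier to follow, here is a minimal C sketch of what it computes, written as plain scalar code. It is only an emulation under my reading of the assembly above, not code from the patch: the names test1_scalar, test1_blocked, N and VF are hypothetical, and VF = 8 assumes a 128-bit vector of shorts. Each iteration adds eight elements in forward order, reverses the lanes (the job of vpshufb), and writes them with one contiguous store at a descending address (240(%rdi,%rcx) in the listing).

```c
enum { N = 128, VF = 8 };   /* VF: shorts per 128-bit vector (assumption) */

/* Scalar reference: the original negative-step loop. */
void test1_scalar(short * __restrict__ x, const short * __restrict__ y,
                  const short * __restrict__ z)
{
    for (int i = N - 1; i >= 0; i--)
        x[i] = y[N - 1 - i] + z[N - 1 - i];
}

/* Blocked emulation of the vectorized loop. */
void test1_blocked(short * __restrict__ x, const short * __restrict__ y,
                   const short * __restrict__ z)
{
    for (int j = 0; j < N; j += VF) {
        short sum[VF], rev[VF];

        for (int k = 0; k < VF; k++)      /* forward vector loads and add */
            sum[k] = y[j + k] + z[j + k];

        for (int k = 0; k < VF; k++)      /* lane reversal, one shuffle */
            rev[k] = sum[VF - 1 - k];

        short *dst = x + (N - VF - j);    /* block start moves downward */
        for (int k = 0; k < VF; k++)      /* one contiguous vector store */
            dst[k] = rev[k];
    }
}
```

Calling both functions on the same y and z should leave identical contents in the two output arrays.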

Performance is definitely improved here: the loop now processes eight shorts per iteration instead of one. The patch bootstraps on x86_64-unknown-linux-gnu and causes no additional regressions on my machine.

For reference, LLVM uses different instructions and seems to generate slightly worse code, although I am not very familiar with x86 assembly. The patch was originally written for our private port.
test1:                                  # @test1
# BB#0:                                 # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2:                                 # %for.end
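
As far as I can tell from the comments in the listing, LLVM reverses the eight shorts with three shuffles (shufpd, pshuflw, pshufhw) where the GCC output uses a single vpshufb. The C sketch below emulates those three steps on an array of eight shorts to show that together they amount to a full reversal. The helper names are hypothetical and the code only models the lane movement, not the actual instructions.

```c
#include <string.h>

/* shufpd $1 with both operands the same register: swap the 64-bit halves. */
void swap_halves(short v[8])
{
    short t[4];
    memcpy(t, v, sizeof t);
    memcpy(v, v + 4, sizeof t);
    memcpy(v + 4, t, sizeof t);
}

/* pshuflw/pshufhw on one group of four words: output word k takes the
   input word selected by bits [2k+1:2k] of the immediate. */
void shuffle4(short w[4], unsigned imm)
{
    short t[4];
    for (int k = 0; k < 4; k++)
        t[k] = w[(imm >> (2 * k)) & 3];
    memcpy(w, t, sizeof t);
}

/* The three-step sequence from the LLVM listing. */
void reverse8_three_step(short v[8])
{
    swap_halves(v);       /* shufpd  $1  */
    shuffle4(v, 27);      /* pshuflw $27: reverse the low four words  */
    shuffle4(v + 4, 27);  /* pshufhw $27: reverse the high four words */
}
```

Applied to the lanes 0..7 this yields 7..0, the same permutation the single byte shuffle performs in the GCC version.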

Any comments?

Bingfeng Mei
Broadcom UK

