Bug 59544 - Vectorizing store with negative step
Summary: Vectorizing store with negative step
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2013-12-18 12:18 UTC by Bingfeng Mei
Modified: 2013-12-30 11:07 UTC

Attachments
The patch against r206016 (1.74 KB, patch)
2013-12-18 12:18 UTC, Bingfeng Mei

Description Bingfeng Mei 2013-12-18 12:18:03 UTC
Created attachment 31467 [details]
The patch against r206016

I was looking at some loops that can be vectorized by LLVM, but not by GCC. One such type is a loop whose store has a negative step.

void test1(short * __restrict__ x, short * __restrict__ y, short * __restrict__ z)
{
    int i;
    for (i=127; i>=0; i--) {
        x[i] = y[127-i] + z[127-i];
    }
}

I don't know why GCC implements a negative step only for loads and not for stores. I implemented a patch (attached) that is very similar to the code in vectorizable_load.
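In essence, the transformation mirrors the load case: compute the vector as usual, reverse its lanes with a permutation mask (perm_mask_for_reverse), and issue one store at the lowest address the scalar stores would have touched. A minimal hand-written sketch of the same technique using SSSE3 intrinsics (my illustration, not code from the patch; compile with -mssse3 or later):

#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Hand-vectorized sketch of test1 (illustration only): compute
   eight 16-bit sums, reverse the lane order, and store the vector
   at the lowest of the eight addresses the scalar loop writes.  */
void test1_sketch(short * __restrict__ x, short * __restrict__ y,
                  short * __restrict__ z)
{
    /* Byte shuffle mask that reverses eight 16-bit lanes.  */
    const __m128i rev = _mm_set_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                     9, 8, 11, 10, 13, 12, 15, 14);
    int j;
    for (j = 0; j < 128; j += 8) {
        __m128i a = _mm_loadu_si128((const __m128i *) (y + j));
        __m128i b = _mm_loadu_si128((const __m128i *) (z + j));
        __m128i s = _mm_add_epi16(a, b);
        s = _mm_shuffle_epi8(s, rev);  /* reverse lane order */
        /* Scalar iterations write x[127-j] down to x[120-j];
           this one store covers exactly that range.  */
        _mm_storeu_si128((__m128i *) (x + 120 - j), s);
    }
}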

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
	addq	$254, %rdi
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	movzwl	(%rsi,%rax), %ecx
	subq	$2, %rdi
	addw	(%rdx,%rax), %cx
	addq	$2, %rax
	movw	%cx, 2(%rdi)
	cmpq	$256, %rax
	jne	.L2
	rep; ret

With patch:
test1:
.LFB0:
	vmovdqa	.LC0(%rip), %xmm1
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	vmovdqu	(%rsi,%rax), %xmm0
	movq	%rax, %rcx
	negq	%rcx
	vpaddw	(%rdx,%rax), %xmm0, %xmm0
	vpshufb	%xmm1, %xmm0, %xmm0
	addq	$16, %rax
	cmpq	$256, %rax
	vmovups	%xmm0, 240(%rdi,%rcx)
	jne	.L2
	rep; ret

Performance is clearly improved here. The patch bootstraps on x86_64-unknown-linux-gnu and shows no additional test-suite regressions on my machine.

For reference, LLVM seems to use different instructions and produces slightly worse code. I am not very familiar with x86 assembly code; the patch was originally written for our private port.
test1:                                  # @test1
        .cfi_startproc
# BB#0:                                 # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2:                                 # %for.end
        ret
Comment 1 meibf 2013-12-20 13:46:03 UTC
Author: meibf
Date: Fri Dec 20 13:46:01 2013
New Revision: 206148

URL: http://gcc.gnu.org/viewcvs?rev=206148&root=gcc&view=rev
Log:
2013-12-20  Bingfeng Mei  <bmei@broadcom.com>

	PR tree-optimization/59544
	* tree-vect-stmts.c (perm_mask_for_reverse): Move before
	vectorizable_store. 
	(vectorizable_store): Handle negative step.

	* gcc.target/i386/pr59544.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr59544.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
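
For reference, a plausible shape for the new target test (a hypothetical sketch; the actual committed gcc.target/i386/pr59544.c may differ) is simply the loop above plus a check that the vectorizer reports success:

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -mavx -fdump-tree-vect-details" } */

void test1 (short * __restrict__ x, short * __restrict__ y,
            short * __restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--)
    x[i] = y[127-i] + z[127-i];
}

/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */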
Comment 2 Bingfeng Mei 2013-12-30 11:07:23 UTC
Patch checked in at r206148. It triggered PR 59569, which is fixed by a separate patch (r206179).