Bug 59544 - Vectorizing store with negative step
Summary: Vectorizing store with negative step
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2013-12-18 12:18 UTC by Bingfeng Mei
Modified: 2013-12-30 11:07 UTC

Attachments
The patch against r206016 (1.74 KB, patch)
2013-12-18 12:18 UTC, Bingfeng Mei

Description Bingfeng Mei 2013-12-18 12:18:03 UTC
Created attachment 31467 [details]
The patch against r206016

I was looking at some loops that can be vectorized by LLVM, but not by GCC. One such type is a loop whose store has a negative step.

void test1(short * __restrict__ x, short * __restrict__ y, short * __restrict__ z)
{
    int i;
    for (i=127; i>=0; i--) {
        x[i] = y[127-i] + z[127-i];
    }
}

I don't know why GCC implements a negative step only for loads and not for stores. I implemented a patch (attached) that is very similar to the code in vectorizable_load.
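In essence, the transformation mirrors the load case: compute the vector as usual, reverse its lanes with a permutation mask (perm_mask_for_reverse), and issue one store at the lowest address the scalar stores would have touched. A minimal hand-written sketch of the same technique using SSSE3 intrinsics (my illustration, not code from the patch; compile with -mssse3 or later):

#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Hand-vectorized sketch of test1 (illustration only): compute
   eight 16-bit sums, reverse the lane order, and store the vector
   at the lowest of the eight addresses the scalar loop writes.  */
void test1_sketch(short * __restrict__ x, short * __restrict__ y,
                  short * __restrict__ z)
{
    /* Byte shuffle mask that reverses eight 16-bit lanes.  */
    const __m128i rev = _mm_set_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                     9, 8, 11, 10, 13, 12, 15, 14);
    int j;
    for (j = 0; j < 128; j += 8) {
        __m128i a = _mm_loadu_si128((const __m128i *) (y + j));
        __m128i b = _mm_loadu_si128((const __m128i *) (z + j));
        __m128i s = _mm_add_epi16(a, b);
        s = _mm_shuffle_epi8(s, rev);  /* reverse lane order */
        /* Scalar iterations write x[127-j] down to x[120-j];
           this one store covers exactly that range.  */
        _mm_storeu_si128((__m128i *) (x + 120 - j), s);
    }
}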

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
	addq	$254, %rdi
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	movzwl	(%rsi,%rax), %ecx
	subq	$2, %rdi
	addw	(%rdx,%rax), %cx
	addq	$2, %rax
	movw	%cx, 2(%rdi)
	cmpq	$256, %rax
	jne	.L2
	rep; ret

With patch:
test1:
.LFB0:
	vmovdqa	.LC0(%rip), %xmm1
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	vmovdqu	(%rsi,%rax), %xmm0
	movq	%rax, %rcx
	negq	%rcx
	vpaddw	(%rdx,%rax), %xmm0, %xmm0
	vpshufb	%xmm1, %xmm0, %xmm0
	addq	$16, %rax
	cmpq	$256, %rax
	vmovups	%xmm0, 240(%rdi,%rcx)
	jne	.L2
	rep; ret

Performance is clearly improved here. The patch bootstraps on x86_64-unknown-linux-gnu and shows no additional test-suite regressions on my machine.

For reference, LLVM seems to use different instructions and produces slightly worse code. I am not very familiar with x86 assembly code; the patch was originally written for our private port.
test1:                                  # @test1
        .cfi_startproc
# BB#0:                                 # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2:                                 # %for.end
        ret
Comment 1 meibf 2013-12-20 13:46:03 UTC
Author: meibf
Date: Fri Dec 20 13:46:01 2013
New Revision: 206148

URL: http://gcc.gnu.org/viewcvs?rev=206148&root=gcc&view=rev
Log:
2013-12-20  Bingfeng Mei  <bmei@broadcom.com>

	PR tree-optimization/59544
	* tree-vect-stmts.c (perm_mask_for_reverse): Move before
	vectorizable_store. 
	(vectorizable_store): Handle negative step.

	* gcc.target/i386/pr59544.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr59544.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
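
For reference, a plausible shape for the new target test (a hypothetical sketch; the actual committed gcc.target/i386/pr59544.c may differ) is simply the loop above plus a check that the vectorizer reports success:

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -mavx -fdump-tree-vect-details" } */

void test1 (short * __restrict__ x, short * __restrict__ y,
            short * __restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--)
    x[i] = y[127-i] + z[127-i];
}

/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */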
Comment 2 Bingfeng Mei 2013-12-30 11:07:23 UTC
Patch checked in at r206148. It triggered PR 59569, which is fixed by a separate patch (r206179).