Bug 68483 - [5/6 Regression] gcc 5.2: suboptimal code compared to 4.9
Summary: [5/6 Regression] gcc 5.2: suboptimal code compared to 4.9
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 5.2.0
Importance: P3 normal
Target Milestone: 5.3
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2015-11-22 12:59 UTC by lvqcl.mail
Modified: 2016-08-10 05:34 UTC (History)

See Also:
Host:
Target: i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-11-23 00:00:00


Attachments
gcc6-pr68483.patch (1.37 KB, patch)
2015-11-23 13:41 UTC, Jakub Jelinek

Description lvqcl.mail 2015-11-22 12:59:21 UTC
#include <stdint.h>

void test(int32_t* input, int32_t* out, unsigned x1, unsigned x2)
{
	unsigned i, j;
	unsigned end = x1;

	for(i = j = 0; i < 1000; i++) {
		int32_t sum = 0;
		end += x2;
		for( ; j < end; j++)
			sum += input[j];
		out[i] = sum;
	}
}

options used: -S -O2 -ftree-vectorize -msse2
GCC 5.2 generates the following code:
...
	movdqa	%xmm0, %xmm1
	movl	8(%esp), %ebx
	psrldq	$8, %xmm1
	paddd	%xmm1, %xmm0
	movdqa	%xmm0, %xmm3
	pshufd	$255, %xmm0, %xmm2
	addl	%ebx, %eax
	cmpl	%ebx, %esi
	pshufd	$85, %xmm0, %xmm1
	punpckhdq	%xmm0, %xmm3
	movd	%xmm2, %ecx
	punpckldq	%xmm3, %xmm1
	movd	%ecx, %xmm2
	punpcklqdq	%xmm2, %xmm1
	paddd	%xmm1, %xmm0
	movd	%xmm0, %ecx
...

while GCC 4.9.2 generates this:
...
	movdqa	%xmm0, %xmm1
	movl	8(%esp), %ebx
	psrldq	$8, %xmm1
	paddd	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
	addl	%ebx, %eax
	cmpl	%ebx, %esi
	psrldq	$4, %xmm1
	paddd	%xmm1, %xmm0
	movd	%xmm0, %ecx
...

For the last reduction step:
GCC 4.9.2: 1 psrldq instruction.
GCC 5.2.0: 2 pshufd, 2 movd, 1 punpckhdq, 1 punpckldq, 1 punpcklqdq instructions.

Also, GCC 5.2.0 can generate the same code as GCC 4.9.2, but only when the -mssse3 option is given. It's strange that -mssse3 is necessary to generate more efficient SSE2 code.
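
For reference, the shorter 4.9.2 sequence (psrldq $8, paddd, psrldq $4, paddd, movd) is the usual two-step SSE2 horizontal add. A minimal sketch with intrinsics is given here; the function name hsum4 and the scalar fallback are illustrative assumptions and are not taken from the report or from GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: sums the four 32-bit lanes of v. */
int32_t hsum4(const int32_t v[4])
{
    assert(v != NULL);
#if defined(__SSE2__)
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    /* psrldq $8 + paddd: lane 0 = v0 + v2, lane 1 = v1 + v3 */
    x = _mm_add_epi32(x, _mm_srli_si128(x, 8));
    /* psrldq $4 + paddd: lane 0 = v0 + v1 + v2 + v3 */
    x = _mm_add_epi32(x, _mm_srli_si128(x, 4));
    /* movd: extract lane 0 */
    return _mm_cvtsi128_si32(x);
#else
    /* Scalar fallback; unsigned arithmetic wraps like paddd does. */
    uint32_t s = (uint32_t)v[0] + (uint32_t)v[1]
               + (uint32_t)v[2] + (uint32_t)v[3];
    return (int32_t)s;
#endif
}
```

Each _mm_srli_si128 is a whole-register byte shift, which is exactly the operation the 5.2 output replaces with the pshufd/punpck sequence for the second step.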
Comment 1 lvqcl.mail 2015-11-22 20:48:43 UTC
(In reply to lvqcl.mail from comment #0)
"gcc version 6.0.0 20151121 (experimental)" from dongsheng-daily (mingw-w64)
generates the same code as 4.9.2. So this regression was fixed in the 6.x branch.
Comment 2 Richard Biener 2015-11-23 09:02:01 UTC
Hum, on x86_64 I don't see either GCC 4.9 or GCC 5.2 vectorize the function at all: they fail to analyze the evolution of the dataref for input[j] because the initial j of the inner loop is not propagated as zero.

With i?86 I can confirm your observation but I don't see it fixed on trunk.

Note that this boils down to vector shift detection of permutes where (IIRC)
some patterns were not properly guarded on SSE3 support previously and a
wrong-code bug was fixed conservatively on the GCC 5 branch while missing
support was only implemented on trunk.

The failure to vectorize on x86_64 isn't a regression.
Comment 3 Jakub Jelinek 2015-11-23 10:14:28 UTC
On i?86 this regressed with r217509, aka part of VEC_RSHIFT_EXPR removal.
Guess we'll need to have a look at the i?86 vec perm handling.
Comment 4 Jakub Jelinek 2015-11-23 10:34:07 UTC
Ah, no, the problem is not on the backend side, but during veclower2 pass.
Before that pass, after the replacement of the v >> 64 and v >> 32 shifts, we have:
  vect_sum_15.12_58 = VEC_PERM_EXPR <vect_sum_15.10_57, { 0, 0, 0, 0 }, { 2, 3, 4, 5 }>;
  vect_sum_15.12_59 = vect_sum_15.12_58 + vect_sum_15.10_57;
  vect_sum_15.12_60 = VEC_PERM_EXPR <vect_sum_15.12_59, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }>;
  vect_sum_15.12_61 = vect_sum_15.12_60 + vect_sum_15.12_59;
but veclower2 for some reason decides to lower the latter VEC_PERM_EXPR into:
  _32 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 32>;
  _17 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 64>;
  _23 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 96>;
  vect_sum_15.12_60 = {_32, _17, _23, 0};
The first VEC_PERM_EXPR is kept and generates efficient code.  If I manually disable the lowering in the debugger, the code regression is gone.
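
To make the GIMPLE above easier to follow, here is a small portable C model of it. It is only a sketch: vec_perm and reduce_add are hypothetical names, NELT is fixed at 4 to match the V4SI case, and the model asserts on out-of-range selector values rather than reproducing every detail of VEC_PERM_EXPR.

```c
#include <assert.h>
#include <stdint.h>

#define NELT 4

/* Model of VEC_PERM_EXPR <a, b, sel>: lane i of the result is element
   sel[i] of the 2*NELT-element concatenation of a and b.
   out must not alias a or b. */
void vec_perm(const uint32_t a[NELT], const uint32_t b[NELT],
              const unsigned sel[NELT], uint32_t out[NELT])
{
    for (unsigned i = 0; i < NELT; i++) {
        assert(sel[i] < 2 * NELT);
        out[i] = sel[i] < NELT ? a[sel[i]] : b[sel[i] - NELT];
    }
}

/* The reduction epilogue from the dump: two permutes with a zero
   vector, each followed by a lane-wise add. */
uint32_t reduce_add(const uint32_t v[NELT])
{
    static const uint32_t zero[NELT] = {0, 0, 0, 0};
    static const unsigned shr2[NELT] = {2, 3, 4, 5}; /* v >> 64 */
    static const unsigned shr1[NELT] = {1, 2, 3, 4}; /* v >> 32 */
    uint32_t t[NELT], s[NELT];

    vec_perm(v, zero, shr2, t);
    for (unsigned i = 0; i < NELT; i++)
        s[i] = t[i] + v[i];
    vec_perm(s, zero, shr1, t);
    for (unsigned i = 0; i < NELT; i++)
        s[i] += t[i];
    return s[0]; /* lane 0 holds the full sum */
}
```

With the selector { 1, 2, 3, 4 } and a zero second operand, vec_perm yields { s[1], s[2], s[3], 0 }, which is the same value as the element-by-element constructor that veclower2 emits; the difference is only in how cheaply the backend can expand it.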
Comment 5 Jakub Jelinek 2015-11-23 13:41:25 UTC
Created attachment 36811
gcc6-pr68483.patch

Untested fix.
Comment 6 lvqcl.mail 2015-11-23 19:07:18 UTC
(In reply to Richard Biener from comment #2)
> With i?86 I can confirm your observation but I don't see it fixed on trunk.

Sorry, the GCC 6.x compiler that I downloaded was built with the --with-arch=core2 option, so it implicitly enables SSSE3. That's why I incorrectly thought that the regression was fixed.
Comment 7 Jakub Jelinek 2015-11-24 10:46:24 UTC
Author: jakub
Date: Tue Nov 24 10:45:52 2015
New Revision: 230797

URL: https://gcc.gnu.org/viewcvs?rev=230797&root=gcc&view=rev
Log:
	PR target/68483
	* tree-vect-generic.c (lower_vec_perm): If VEC_PERM_EXPR
	is valid vec_shr pattern, don't lower it even if can_vec_perm_p
	returns false.
	* optabs.c (shift_amt_for_vec_perm_mask): Return NULL_RTX
	whenever first is nelt or above.  Don't mask expected with
	2 * nelt - 1.

	* gcc.target/i386/pr68483-1.c: New test.
	* gcc.target/i386/pr68483-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr68483-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr68483-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/optabs.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-generic.c
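
The ChangeLog entry above describes recognising a permute mask as a vec_shr pattern. The following is a simplified, standalone sketch of that kind of check, not the code from optabs.c: vec_shr_amount is a hypothetical name, the mask is a plain array instead of an RTX, and the check is the strict form (every index must be exactly first + i), on the assumption that the second permute operand is an all-zero vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if sel[0..nelt-1] describes a whole-vector
   right shift by some number of elements, return that number;
   otherwise return -1. */
int vec_shr_amount(const unsigned *sel, size_t nelt)
{
    assert(sel != NULL && nelt > 0);

    unsigned first = sel[0];
    /* A first index at nelt or above selects only from the second
       operand, so it is not treated as a shift of the first one. */
    if (first >= nelt)
        return -1;

    /* Remaining indices must be consecutive, with no wrap-around
       masking applied to the expected value. */
    for (size_t i = 1; i < nelt; i++)
        if (sel[i] != first + (unsigned)i)
            return -1;

    return (int)first;
}
```

For the two masks in comment 4, { 2, 3, 4, 5 } gives 2 and { 1, 2, 3, 4 } gives 1, while a mask such as { 4, 5, 6, 7 } is rejected.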
Comment 8 Jakub Jelinek 2015-11-24 11:11:17 UTC
Author: jakub
Date: Tue Nov 24 11:10:45 2015
New Revision: 230799

URL: https://gcc.gnu.org/viewcvs?rev=230799&root=gcc&view=rev
Log:
	PR target/68483
	* tree-vect-generic.c (lower_vec_perm): If VEC_PERM_EXPR
	is valid vec_shr pattern, don't lower it even if can_vec_perm_p
	returns false.
	* optabs.c (shift_amt_for_vec_perm_mask): Return NULL_RTX
	whenever first is nelt or above.  Don't mask expected with
	2 * nelt - 1.

	* gcc.target/i386/pr68483-1.c: New test.
	* gcc.target/i386/pr68483-2.c: New test.

Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.target/i386/pr68483-1.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/i386/pr68483-2.c
Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/optabs.c
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
    branches/gcc-5-branch/gcc/tree-vect-generic.c
Comment 9 Jakub Jelinek 2015-11-24 11:16:24 UTC
Fixed for 5.3+.