68483 – [5/6 Regression] gcc 5.2: suboptimal code compared to 4.9

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	5.2.0

Importance:	P3 normal
Target Milestone:	5.3
Assignee:	Jakub Jelinek

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	vectorizer

Reported:	2015-11-22 12:59 UTC by lvqcl.mail
Modified:	2016-08-10 05:34 UTC
CC List:	1 user

See Also:
Host:
Target:	i?86--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2015-11-23 00:00:00

Attachments
gcc6-pr68483.patch (1.37 KB, patch) 2015-11-23 13:41 UTC, Jakub Jelinek

Description lvqcl.mail 2015-11-22 12:59:21 UTC

#include <stdint.h>

void test(int32_t* input, int32_t* out, unsigned x1, unsigned x2)
{
	unsigned i, j;
	unsigned end = x1;

	for(i = j = 0; i < 1000; i++) {
		int32_t sum = 0;
		end += x2;
		for( ; j < end; j++)
			sum += input[j];
		out[i] = sum;
	}
}

options used: -S -O2 -ftree-vectorize -msse2
GCC 5.2 generates the following code:
...
	movdqa	%xmm0, %xmm1
	movl	8(%esp), %ebx
	psrldq	$8, %xmm1
	paddd	%xmm1, %xmm0
	movdqa	%xmm0, %xmm3
	pshufd	$255, %xmm0, %xmm2
	addl	%ebx, %eax
	cmpl	%ebx, %esi
	pshufd	$85, %xmm0, %xmm1
	punpckhdq	%xmm0, %xmm3
	movd	%xmm2, %ecx
	punpckldq	%xmm3, %xmm1
	movd	%ecx, %xmm2
	punpcklqdq	%xmm2, %xmm1
	paddd	%xmm1, %xmm0
	movd	%xmm0, %ecx
...

while GCC 4.9.2 generates this:
...
	movdqa	%xmm0, %xmm1
	movl	8(%esp), %ebx
	psrldq	$8, %xmm1
	paddd	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
	addl	%ebx, %eax
	cmpl	%ebx, %esi
	psrldq	$4, %xmm1
	paddd	%xmm1, %xmm0
	movd	%xmm0, %ecx
...

GCC 4.9.2: 1 psrldq instruction
GCC 5.2.0: 2 pshufd, 2 movd, 1 punpckhdq, 1 punpckldq, 1 punpcklqdq instructions.

Also, GCC 5.2.0 can generate the same code as GCC 4.9.2, but only when the -mssse3 option is given. It's strange that -mssse3 is necessary to generate more efficient SSE2 code.
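
For readers unfamiliar with the sequence, here is a minimal sketch in portable C of what the shorter GCC 4.9.2 reduction computes. It is only a model: the helper names (shift_right_lanes, add_lanes, horizontal_sum) are hypothetical and the arrays stand in for an XMM register holding four 32-bit lanes.

```c
#include <stdint.h>
#include <string.h>

/* Model of psrldq on four 32-bit lanes: shift the whole vector right by
   `lanes` elements, filling the vacated upper lanes with zeros. */
static void shift_right_lanes(const int32_t in[4], unsigned lanes, int32_t out[4])
{
    for (unsigned i = 0; i < 4; i++)
        out[i] = (i + lanes < 4) ? in[i + lanes] : 0;
}

/* Model of paddd: lane-wise 32-bit addition with wraparound. */
static void add_lanes(int32_t acc[4], const int32_t other[4])
{
    for (unsigned i = 0; i < 4; i++)
        acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)other[i]);
}

/* Horizontal sum of four lanes using two shift+add steps. */
int32_t horizontal_sum(const int32_t v[4])
{
    int32_t acc[4], tmp[4];
    memcpy(acc, v, sizeof acc);
    shift_right_lanes(acc, 2, tmp);  /* psrldq $8 */
    add_lanes(acc, tmp);             /* paddd     */
    shift_right_lanes(acc, 1, tmp);  /* psrldq $4 */
    add_lanes(acc, tmp);             /* paddd     */
    return acc[0];                   /* movd      */
}
```

After the second add, lane 0 holds the sum of all four input lanes; the other lanes hold partial sums that are discarded. The GCC 5.2 output computes the same value, but builds the second shifted vector element by element instead of using a single psrldq.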

Comment 1 lvqcl.mail 2015-11-22 20:48:43 UTC

(In reply to lvqcl.mail from comment #0)
"gcc version 6.0.0 20151121 (experimental)" from dongsheng-daily (mingw-w64)
generates the same code as 4.9.2. So this regression was fixed in the 6.x branch.

Comment 2 Richard Biener 2015-11-23 09:02:01 UTC

Hum, on x86_64 I don't see either GCC 4.9 or GCC 5.2 vectorize the function at all: they fail to analyze the evolution of the dataref for input[j], because the initial value of j in the inner loop is not propagated as zero.

With i?86 I can confirm your observation but I don't see it fixed on trunk.

Note that this boils down to vector shift detection of permutes where (IIRC)
some patterns were not properly guarded on SSE3 support previously and a
wrong-code bug was fixed conservatively on the GCC 5 branch while missing
support was only implemented on trunk.

The failure to vectorize on x86_64 isn't a regression.

Comment 3 Jakub Jelinek 2015-11-23 10:14:28 UTC

On i?86 this regressed with r217509, aka part of VEC_RSHIFT_EXPR removal.
Guess we'll need to have a look at the i?86 vec perm handling.

Comment 4 Jakub Jelinek 2015-11-23 10:34:07 UTC

Ah, no, the problem is not on the backend side, but in the veclower2 pass.
Before that pass, after the replacement of the v >> 64 and v >> 32 shifts, we have:
  vect_sum_15.12_58 = VEC_PERM_EXPR <vect_sum_15.10_57, { 0, 0, 0, 0 }, { 2, 3, 4, 5 }>;
  vect_sum_15.12_59 = vect_sum_15.12_58 + vect_sum_15.10_57;
  vect_sum_15.12_60 = VEC_PERM_EXPR <vect_sum_15.12_59, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }>;
  vect_sum_15.12_61 = vect_sum_15.12_60 + vect_sum_15.12_59;
but veclower2 for some reason decides to lower the latter VEC_PERM_EXPR into:
  _32 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 32>;
  _17 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 64>;
  _23 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 96>;
  vect_sum_15.12_60 = {_32, _17, _23, 0};
The first VEC_PERM_EXPR is kept and generates efficient code.  If I manually disable the lowering in the debugger, the code regression is gone.
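
To illustrate why the two forms are equivalent, here is a rough sketch in plain C that models a two-operand VEC_PERM_EXPR on four 32-bit elements and compares it with the element-wise form veclower2 produced. The function names and the NELT macro are hypothetical, and the model assumes selector indices are taken modulo 2 * NELT, as for GCC's two-operand __builtin_shuffle.

```c
#include <stdint.h>

#define NELT 4

/* Model of VEC_PERM_EXPR <a, b, sel>: each result element is picked from
   the concatenation of a and b by the corresponding selector index. */
void vec_perm(const int32_t a[NELT], const int32_t b[NELT],
              const unsigned sel[NELT], int32_t out[NELT])
{
    for (unsigned i = 0; i < NELT; i++) {
        unsigned idx = sel[i] % (2 * NELT);
        out[i] = idx < NELT ? a[idx] : b[idx - NELT];
    }
}

/* The lowered form for selector { 1, 2, 3, 4 } with a zero second
   operand: three element extracts plus a constant zero. */
void lowered_shift_by_one(const int32_t a[NELT], int32_t out[NELT])
{
    out[0] = a[1];  /* BIT_FIELD_REF <a, 32, 32> */
    out[1] = a[2];  /* BIT_FIELD_REF <a, 32, 64> */
    out[2] = a[3];  /* BIT_FIELD_REF <a, 32, 96> */
    out[3] = 0;     /* element taken from the zero vector */
}
```

With a zero second operand, selector { 1, 2, 3, 4 } is a whole-vector shift right by one element and { 2, 3, 4, 5 } a shift by two. Both forms give the same values; the difference is only that the lowered one is no longer recognizable as a single vector shift.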

Comment 5 Jakub Jelinek 2015-11-23 13:41:25 UTC

Created attachment 36811
gcc6-pr68483.patch

Untested fix.

Comment 6 lvqcl.mail 2015-11-23 19:07:18 UTC

(In reply to Richard Biener from comment #2)
> With i?86 I can confirm your observation but I don't see it fixed on trunk.

Sorry, the GCC 6.x compiler that I downloaded was built with the --with-arch=core2 option, so it implicitly enables SSSE3. That's why I incorrectly thought that the regression was fixed.

Comment 7 Jakub Jelinek 2015-11-24 10:46:24 UTC

Author: jakub
Date: Tue Nov 24 10:45:52 2015
New Revision: 230797

URL: https://gcc.gnu.org/viewcvs?rev=230797&root=gcc&view=rev
Log:
	PR target/68483
	* tree-vect-generic.c (lower_vec_perm): If VEC_PERM_EXPR
	is valid vec_shr pattern, don't lower it even if can_vec_perm_p
	returns false.
	* optabs.c (shift_amt_for_vec_perm_mask): Return NULL_RTX
	whenever first is nelt or above.  Don't mask expected with
	2 * nelt - 1.

	* gcc.target/i386/pr68483-1.c: New test.
	* gcc.target/i386/pr68483-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr68483-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr68483-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/optabs.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-generic.c
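
The check described in the log for shift_amt_for_vec_perm_mask can be sketched roughly as follows. This is a simplified standalone model, not the code from optabs.c: the function name, the plain unsigned array standing in for the selector, and the return convention (shift in elements, or -1) are all assumptions made for illustration. It also assumes the second permute operand is all zeros, so every index of nelt or above selects an equivalent zero element.

```c
/* Hypothetical model: return the shift amount in elements if sel describes
   a whole-vector right shift that fills with zeros, or -1 otherwise.
   Assumes nelt >= 1 and an all-zero second permute operand. */
int shift_elems_for_perm_mask(const unsigned *sel, unsigned nelt)
{
    unsigned first = sel[0];

    /* If the first index already points into the second operand,
       this is not treated as a shift of the first operand. */
    if (first >= nelt)
        return -1;

    for (unsigned i = 1; i < nelt; i++) {
        unsigned expected = first + i;  /* not wrapped modulo 2 * nelt */
        unsigned got = sel[i];

        /* All indices into the zero operand are equivalent. */
        if (expected > nelt)
            expected = nelt;
        if (got > nelt)
            got = nelt;
        if (got != expected)
            return -1;
    }
    return (int)first;
}
```

Under these assumptions, { 1, 2, 3, 4 } and { 2, 3, 4, 5 } are accepted as shifts by one and two elements, while { 4, 5, 6, 7 } and { 1, 2, 3, 0 } are rejected.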

Comment 8 Jakub Jelinek 2015-11-24 11:11:17 UTC

Author: jakub
Date: Tue Nov 24 11:10:45 2015
New Revision: 230799

URL: https://gcc.gnu.org/viewcvs?rev=230799&root=gcc&view=rev
Log:
	PR target/68483
	* tree-vect-generic.c (lower_vec_perm): If VEC_PERM_EXPR
	is valid vec_shr pattern, don't lower it even if can_vec_perm_p
	returns false.
	* optabs.c (shift_amt_for_vec_perm_mask): Return NULL_RTX
	whenever first is nelt or above.  Don't mask expected with
	2 * nelt - 1.

	* gcc.target/i386/pr68483-1.c: New test.
	* gcc.target/i386/pr68483-2.c: New test.

Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.target/i386/pr68483-1.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/i386/pr68483-2.c
Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/optabs.c
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
    branches/gcc-5-branch/gcc/tree-vect-generic.c

Comment 9 Jakub Jelinek 2015-11-24 11:16:24 UTC

Fixed for 5.3+.