Bug 71488 - [6 Regression] Wrong code for vector comparisons with ivybridge and westmere targets
Summary: [6 Regression] Wrong code for vector comparisons with ivybridge and westmere ...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 7.0
: P2 normal
Target Milestone: 7.0
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on: 71655
Blocks: vectorizer yarpgen
  Show dependency treegraph
 
Reported: 2016-06-10 08:37 UTC by Anton Mitrokhin
Modified: 2021-11-01 23:07 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail: 6.1.0
Last reconfirmed: 2016-06-13 00:00:00


Attachments
Reproducer (508 bytes, text/x-csrc)
2016-06-10 08:37 UTC, Anton Mitrokhin
Details
Enable masks comparison using patterns (2.61 KB, patch)
2016-06-15 11:51 UTC, Ilya Enkovich
Details | Diff
Enable masks comparison with no new patterns (1.38 KB, patch)
2016-06-15 13:42 UTC, Ilya Enkovich
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Anton Mitrokhin 2016-06-10 08:37:48 UTC
Created attachment 38673 [details]
Reproducer

It looks like the result on -O3 is incorrect

Reproduce:
> g++ -std=c++11 -static-libgcc -static-libstdc++ -O0 -march=westmere -o out crash.cpp
 
Output: -5386832788153059297

> g++ -std=c++11 -static-libgcc -static-libstdc++ -O3 -march=westmere -o out crash.cpp
Output: -5386832788153059298

> g++ -std=c++11 -static-libgcc -static-libstdc++ -O0 -march=ivybridge -o out crash.cpp
Output: -5386832788153059297

> g++ -std=c++11 -static-libgcc -static-libstdc++ -O3 -march=ivybridge -o out crash.cpp
Output: -5386832788153059298


> gcc -v:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/export/users/amitrokh/gcc_trunk/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /export/users/gnutester/stability/svn/trunk/configure --with-arch=corei7 --with-cpu=corei7 --enable-clocale=gnu --with-system-zlib --enable-shared --with-demangler-in-ld --enable-cloog-backend=isl --with-fpmath=sse --with-pkgversion=Revision=237240/svn-rev:237241/ --prefix=/export/users/gnutester/stability/work/trunk/64/install --enable-languages=c,c++,fortran,java,lto
Thread model: posix
gcc version 7.0.0 20160608 (experimental) (Revision=237240/svn-rev:237241/)
Comment 1 Uroš Bizjak 2016-06-13 19:04:14 UTC
Following minimized case will show the problem:

--cut here--
int var_4 = 1;
long long var_9 = 0;

int main() {

  std::valarray<std::valarray<long long>> v10;

  v10.resize(1);
  v10[0].resize(4);

  for (int i = 0; i < 4; i++)
    v10[0][i] = ((var_9 == 0) > unsigned (var_4 == 0)) + (var_9 == 0);

  std::cout << v10[0][0] << "\n";
}
--cut here--

This test should be compiled with "-std=c++11 -O3 -march=westmere" to obtain wrong result:

$ ./a.out
1

The correct result can be obtained by adding -fno-tree-vectorize to compile flags:

./a.out
2

Looking at the asm dump, the problematic loop is:

.L22:
	movddup	var_9(%rip), %xmm0
	pxor	%xmm1, %xmm1
(1)	pcmpeqq	%xmm1, %xmm0
	salq	$63, %rax
	movdqa	.LC0(%rip), %xmm2
	sarq	$63, %rax
	movq	%rax, %xmm1
(2)	movdqa	%xmm0, %xmm3
	punpcklqdq	%xmm1, %xmm1
	pand	%xmm2, %xmm0
	shufps	$136, %xmm0, %xmm0
(3)	pcmpgtq	%xmm1, %xmm3
	movdqa	%xmm3, %xmm1
	pand	%xmm2, %xmm1
	shufps	$136, %xmm1, %xmm1
	paddd	%xmm1, %xmm0
	pmovsxdq	%xmm0, %xmm1
	psrldq	$8, %xmm0
	pmovsxdq	%xmm0, %xmm0
	movups	%xmm1, (%rdx)
	movups	%xmm0, 16(%rdx)

At insn (1), vector (0xf...f,0xf...f) is generated as a result of comparison of vector (var_9,var_9) with vector (0,0). However, this result goes through insn (2) directly to insn (3) as its input argument. This is certainly wrong, the result of the comparison should be masked with (0x0...1,0x0...1).

The problem already exists at RTL expand time. The corresponding insn sequence is:

;; mask__3.59_48 = vect_cst__51 == { 0, 0 };

(insn 117 116 118 (set (reg:V2DI 179)
        (vec_duplicate:V2DI (reg:DI 108 [ var_9.0_50 ]))) crash.cpp:29 4210 {*vec_dupv2di}
     (nil))

(insn 118 117 119 (set (reg:V2DI 180)
        (const_vector:V2DI [
                (const_int 0 [0])
                (const_int 0 [0])
            ])) crash.cpp:29 -1
     (nil))

(insn 119 118 120 (set (reg:V2DI 181)
        (eq:V2DI (reg:V2DI 179)
            (reg:V2DI 180))) crash.cpp:29 -1
     (nil))

(insn 120 119 0 (set (reg:V2DI 106 [ mask__3.59 ])
        (reg:V2DI 181)) crash.cpp:29 -1
     (nil))

;; vect_patt_111.61_79 = VEC_COND_EXPR <mask__3.59_48 > vect_cst__63, { 1, 1 }, { 0, 0 }>;

(insn 121 120 122 (set (reg:V2DI 182)
        (vec_duplicate:V2DI (reg:DI 117 [ _64 ]))) 4210 {*vec_dupv2di}
     (nil))

(insn 122 121 123 (set (reg:V2DI 183)
        (mem/u/c:V2DI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [5  S16 A128])) -1
     (expr_list:REG_EQUAL (const_vector:V2DI [
                (const_int 1 [0x1])
                (const_int 1 [0x1])
            ])
        (nil)))

(insn 123 122 124 (set (reg:V2DI 184)
        (gt:V2DI (reg:V2DI 106 [ mask__3.59 ])
            (reg:V2DI 182))) -1
     (nil))

(insn 124 123 0 (set (reg:V2DI 119 [ vect_patt_111.61 ])
        (and:V2DI (reg:V2DI 184)
            (reg:V2DI 183))) -1
     (nil))

Please note how the result of comparison from (insn 119) enters directly a foolow up comparison (insn 123). It looks to me that (insn 120) needs to be AND insn, as is the case with comparison (insn 123) and its corresponding (insn 124).

Confirmed as a middle-end problem.
Comment 2 Marc Glisse 2016-06-13 21:42:02 UTC
Independently of the wrong code issue, we are generating pretty bad code on Uros' testcase. It is full of operator delete(0) and operator new(0). The first one we could drop, but the second one is forced by the C++ standard to allocate at least one byte (or throw). It probably comes from the copy constructor of valarray. And even when I help with the usual:
  __attribute__((returns_nonnull)) __typeof__(malloc) malloc;
  inline void* operator new(std::size_t n){return malloc(n);}
  inline void operator delete(void*p)noexcept{free(p);}
the .optimized dump still has things like
  MEM[(struct valarray *)_61]._M_data = _45;
  _46 = MEM[(struct valarray *)_61]._M_data;
because of how late other optimizations happened. Quite a common occurrence with C++ code :-(
Comment 3 Uroš Bizjak 2016-06-14 09:01:36 UTC
(In reply to Uroš Bizjak from comment #1)

> The problem already exists at RTL expand time. The corresponding insn
> sequence is:
> 
> ;; mask__3.59_48 = vect_cst__51 == { 0, 0 };

Richi, shouldn't this be a VEC_COND_EXPR ?
Comment 4 rguenther@suse.de 2016-06-14 09:29:23 UTC
On Tue, 14 Jun 2016, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71488
> 
> Uroš Bizjak <ubizjak at gmail dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |ienkovich at gcc dot gnu.org,
>                    |                            |rguenther at suse dot de
> 
> --- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Uroš Bizjak from comment #1)
> 
> > The problem already exists at RTL expand time. The corresponding insn
> > sequence is:
> > 
> > ;; mask__3.59_48 = vect_cst__51 == { 0, 0 };
> 
> Richi, shouldn't this be a VEC_COND_EXPR ?

I think we're supposed to handle this fine now.  Ilja should know this
better though.
Comment 5 Ilya Enkovich 2016-06-14 14:16:33 UTC
What seems suspicious to me is how we vectorize boolean comparison.  Before vectorization we have (_3, _5, _6 are bool):

  _3 = var_9.0_2 == 0;
  _6 = _3 > _5;

vectorized code:

  mask__3.59_62 = vect_cst__47 == vect_cst__66;
  mask__6.60_49 = mask__3.59_48 > vect_cst__63;

Boolean comparison is transformed into signed integer vector (according to mask mode) comparison.  So in scalar code 'true' (1) > 'false (0) but in vector code 'true' (-1) < 'false' (0).
Comment 6 Ilya Enkovich 2016-06-14 14:30:01 UTC
I think we should disable vectorization of boolean comparison (except '==' and '!=') and handle them applying patterns.  E.g. a > b ==> a & !b, a >= b ==> a | !b etc.  Bitwise operations are better because work for both vector and scalar masks.
Comment 7 rguenther@suse.de 2016-06-15 08:51:52 UTC
On Tue, 14 Jun 2016, ienkovich at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71488
> 
> --- Comment #6 from Ilya Enkovich <ienkovich at gcc dot gnu.org> ---
> I think we should disable vectorization of boolean comparison (except '==' and
> '!=') and handle them applying patterns.  E.g. a > b ==> a & !b, a >= b ==> a |
> !b etc.  Bitwise operations are better because work for both vector and scalar
> masks.

Sounds reasonable though less patterns are IMHO better (patterns should
be only used when combining N > 1 scalar stmts to M >= 1 new scalar stmts,
single scalar stmts can be handled in complex ways in the vectorizable_*
functions).
Comment 8 Ilya Enkovich 2016-06-15 11:51:31 UTC
Created attachment 38703 [details]
Enable masks comparison using patterns

(In reply to rguenther@suse.de from comment #7)
>
> Sounds reasonable though less patterns are IMHO better (patterns should
> be only used when combining N > 1 scalar stmts to M >= 1 new scalar stmts,
> single scalar stmts can be handled in complex ways in the vectorizable_*
> functions).

This can be handled in vectorizable_comparison but it would make it much more complex and in fact it would mean a copy of a part of vectorizable_operation into vectorizable_comparison.

I attach a pattern variant.  I can also prepare vectorizable_comparison enhancement variant to compare and choose.
Comment 9 Ilya Enkovich 2016-06-15 13:42:36 UTC
Created attachment 38704 [details]
Enable masks comparison with no new patterns

Here is a version without additional patterns (mask conversion pattern is slightly extended).  It appeared to be smaller than I thought.
Comment 10 rguenther@suse.de 2016-06-15 13:49:21 UTC
On Wed, 15 Jun 2016, ienkovich at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71488
> 
> --- Comment #9 from Ilya Enkovich <ienkovich at gcc dot gnu.org> ---
> Created attachment 38704 [details]
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38704&action=edit
> Enable masks comparison with no new patterns
> 
> Here is a version without additional patterns (mask conversion pattern is
> slightly extended).  It appeared to be smaller than I thought.

I like this more.  Please also adjust the cost accordingly.
I suppose by passing
 (1 + bitop1 != NOP_EXPR + bitop2 != NOP_EXPR) * ncopies
to vect_model_simple_cost.

Thanks,
Richard.
Comment 11 Ilya Enkovich 2016-06-22 14:06:27 UTC
Author: ienkovich
Date: Wed Jun 22 14:05:55 2016
New Revision: 237706

URL: https://gcc.gnu.org/viewcvs?rev=237706&root=gcc&view=rev
Log:
gcc/

	PR middle-end/71488
	* tree-vect-patterns.c (vect_recog_mask_conversion_pattern): Support
	comparison of boolean vectors.
	* tree-vect-stmts.c (vectorizable_comparison): Vectorize comparison
	of boolean vectors using bitwise operations.

gcc/testsuite/

	PR middle-end/71488
	* g++.dg/pr71488.C: New test.
	* gcc.dg/vect/vect-bool-cmp.c: New test.

Added:
    trunk/gcc/testsuite/g++.dg/pr71488.C
    trunk/gcc/testsuite/gcc.dg/vect/vect-bool-cmp.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-patterns.c
    trunk/gcc/tree-vect-stmts.c
Comment 12 uros 2016-06-23 16:59:26 UTC
Author: uros
Date: Thu Jun 23 16:58:54 2016
New Revision: 237738

URL: https://gcc.gnu.org/viewcvs?rev=237738&root=gcc&view=rev
Log:
	PR tree-optimization/71488
	* gcc.target/i386/i386.exp (check_effective_target_sse4): Move to ...
	* lib/target-supports.exp: ... here.
	(check_sse4_hw_available): New procedure.
	(check_effective_target_sse4_runtime): Ditto.
	* g++.dg/pr71488.C (dg-additional-options): Use -msse4 instead of
	-march=westmere for sse4_runtime targets.
	* gcc.dg/vect/vect-bool-cmp.c: Include "tree-vect.h".
	(dg-additional-options): Use for sse4_runtime targets.
	(main): Call check_vect ().
	(dg-final): Perform scan only for sse4_runtime targets.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/g++.dg/pr71488.C
    trunk/gcc/testsuite/gcc.dg/vect/vect-bool-cmp.c
    trunk/gcc/testsuite/gcc.target/i386/i386.exp
    trunk/gcc/testsuite/lib/target-supports.exp
Comment 13 Uroš Bizjak 2016-06-23 18:27:07 UTC
Vectorizer PRs should be filled having tree-optimization component field.
Comment 14 Jakub Jelinek 2016-06-27 16:54:47 UTC
Caused PR71655.
Comment 15 Richard Biener 2016-08-22 08:12:35 UTC
GCC 6.2 is being released, adjusting target milestone.
Comment 16 Richard Biener 2016-08-22 08:13:38 UTC
GCC 6.2 is being released, adjusting target milestone.
Comment 17 Jakub Jelinek 2016-12-21 10:58:08 UTC
GCC 6.3 is being released, adjusting target milestone.
Comment 18 Richard Biener 2017-07-04 08:47:53 UTC
GCC 6.4 is being released, adjusting target milestone.
Comment 19 Jakub Jelinek 2018-10-26 10:54:05 UTC
GCC 6 branch is being closed, fixed in 7.x.