Created attachment 38673 [details] Reproducer It looks like the result on -O3 is incorrect Reproduce: > g++ -std=c++11 -static-libgcc -static-libstdc++ -O0 -march=westmere -o out crash.cpp Output: -5386832788153059297 > g++ -std=c++11 -static-libgcc -static-libstdc++ -O3 -march=westmere -o out crash.cpp Output: -5386832788153059298 > g++ -std=c++11 -static-libgcc -static-libstdc++ -O0 -march=ivybridge -o out crash.cpp Output: -5386832788153059297 > g++ -std=c++11 -static-libgcc -static-libstdc++ -O3 -march=ivybridge -o out crash.cpp Output: -5386832788153059298 > gcc -v: Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/export/users/amitrokh/gcc_trunk/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /export/users/gnutester/stability/svn/trunk/configure --with-arch=corei7 --with-cpu=corei7 --enable-clocale=gnu --with-system-zlib --enable-shared --with-demangler-in-ld --enable-cloog-backend=isl --with-fpmath=sse --with-pkgversion=Revision=237240/svn-rev:237241/ --prefix=/export/users/gnutester/stability/work/trunk/64/install --enable-languages=c,c++,fortran,java,lto Thread model: posix gcc version 7.0.0 20160608 (experimental) (Revision=237240/svn-rev:237241/)
Following minimized case will show the problem: --cut here-- int var_4 = 1; long long var_9 = 0; int main() { std::valarray<std::valarray<long long>> v10; v10.resize(1); v10[0].resize(4); for (int i = 0; i < 4; i++) v10[0][i] = ((var_9 == 0) > unsigned (var_4 == 0)) + (var_9 == 0); std::cout << v10[0][0] << "\n"; } --cut here-- This test should be compiled with "-std=c++11 -O3 -march=westmere" to obtain wrong result: $ ./a.out 1 The correct result can be obtained by adding -fno-tree-vectorize to compile flags: ./a.out 2 Looking at the asm dump, the problematic loop is: .L22: movddup var_9(%rip), %xmm0 pxor %xmm1, %xmm1 (1) pcmpeqq %xmm1, %xmm0 salq $63, %rax movdqa .LC0(%rip), %xmm2 sarq $63, %rax movq %rax, %xmm1 (2) movdqa %xmm0, %xmm3 punpcklqdq %xmm1, %xmm1 pand %xmm2, %xmm0 shufps $136, %xmm0, %xmm0 (3) pcmpgtq %xmm1, %xmm3 movdqa %xmm3, %xmm1 pand %xmm2, %xmm1 shufps $136, %xmm1, %xmm1 paddd %xmm1, %xmm0 pmovsxdq %xmm0, %xmm1 psrldq $8, %xmm0 pmovsxdq %xmm0, %xmm0 movups %xmm1, (%rdx) movups %xmm0, 16(%rdx) At insn (1), vector (0xf...f,0xf...f) is generated as a result of comparison of vector (var_9,var_9) with vector (0,0). However, this result goes through insn (2) directly to insn (3) as its input argument. This is certainly wrong, the result of the comparison should be masked with (0x0...1,0x0...1). The problem already exists at RTL expand time. The corresponding insn sequence is: ;; mask__3.59_48 = vect_cst__51 == { 0, 0 }; (insn 117 116 118 (set (reg:V2DI 179) (vec_duplicate:V2DI (reg:DI 108 [ var_9.0_50 ]))) crash.cpp:29 4210 {*vec_dupv2di} (nil)) (insn 118 117 119 (set (reg:V2DI 180) (const_vector:V2DI [ (const_int 0 [0]) (const_int 0 [0]) ])) crash.cpp:29 -1 (nil)) (insn 119 118 120 (set (reg:V2DI 181) (eq:V2DI (reg:V2DI 179) (reg:V2DI 180))) crash.cpp:29 -1 (nil)) (insn 120 119 0 (set (reg:V2DI 106 [ mask__3.59 ]) (reg:V2DI 181)) crash.cpp:29 -1 (nil)) ;; vect_patt_111.61_79 = VEC_COND_EXPR <mask__3.59_48 > vect_cst__63, { 1, 1 }, { 0, 0 }>; (insn 121 120 122 (set (reg:V2DI 182) (vec_duplicate:V2DI (reg:DI 117 [ _64 ]))) 4210 {*vec_dupv2di} (nil)) (insn 122 121 123 (set (reg:V2DI 183) (mem/u/c:V2DI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [5 S16 A128])) -1 (expr_list:REG_EQUAL (const_vector:V2DI [ (const_int 1 [0x1]) (const_int 1 [0x1]) ]) (nil))) (insn 123 122 124 (set (reg:V2DI 184) (gt:V2DI (reg:V2DI 106 [ mask__3.59 ]) (reg:V2DI 182))) -1 (nil)) (insn 124 123 0 (set (reg:V2DI 119 [ vect_patt_111.61 ]) (and:V2DI (reg:V2DI 184) (reg:V2DI 183))) -1 (nil)) Please note how the result of comparison from (insn 119) enters directly a foolow up comparison (insn 123). It looks to me that (insn 120) needs to be AND insn, as is the case with comparison (insn 123) and its corresponding (insn 124). Confirmed as a middle-end problem.
Independently of the wrong code issue, we are generating pretty bad code on Uros' testcase. It is full of operator delete(0) and operator new(0). The first one we could drop, but the second one is forced by the C++ standard to allocate at least one byte (or throw). It probably comes from the copy constructor of valarray. And even when I help with the usual: __attribute__((returns_nonnull)) __typeof__(malloc) malloc; inline void* operator new(std::size_t n){return malloc(n);} inline void operator delete(void*p)noexcept{free(p);} the .optimized dump still has things like MEM[(struct valarray *)_61]._M_data = _45; _46 = MEM[(struct valarray *)_61]._M_data; because of how late other optimizations happened. Quite a common occurrence with C++ code :-(
(In reply to Uroš Bizjak from comment #1) > The problem already exists at RTL expand time. The corresponding insn > sequence is: > > ;; mask__3.59_48 = vect_cst__51 == { 0, 0 }; Richi, shouldn't this be a VEC_COND_EXPR ?
On Tue, 14 Jun 2016, ubizjak at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71488 > > Uroš Bizjak <ubizjak at gmail dot com> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |ienkovich at gcc dot gnu.org, > | |rguenther at suse dot de > > --- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> --- > (In reply to Uroš Bizjak from comment #1) > > > The problem already exists at RTL expand time. The corresponding insn > > sequence is: > > > > ;; mask__3.59_48 = vect_cst__51 == { 0, 0 }; > > Richi, shouldn't this be a VEC_COND_EXPR ? I think we're supposed to handle this fine now. Ilja should know this better though.
What seems suspicious to me is how we vectorize boolean comparison. Before vectorization we have (_3, _5, _6 are bool): _3 = var_9.0_2 == 0; _6 = _3 > _5; vectorized code: mask__3.59_62 = vect_cst__47 == vect_cst__66; mask__6.60_49 = mask__3.59_48 > vect_cst__63; Boolean comparison is transformed into signed integer vector (according to mask mode) comparison. So in scalar code 'true' (1) > 'false (0) but in vector code 'true' (-1) < 'false' (0).
I think we should disable vectorization of boolean comparison (except '==' and '!=') and handle them applying patterns. E.g. a > b ==> a & !b, a >= b ==> a | !b etc. Bitwise operations are better because work for both vector and scalar masks.
On Tue, 14 Jun 2016, ienkovich at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71488 > > --- Comment #6 from Ilya Enkovich <ienkovich at gcc dot gnu.org> --- > I think we should disable vectorization of boolean comparison (except '==' and > '!=') and handle them applying patterns. E.g. a > b ==> a & !b, a >= b ==> a | > !b etc. Bitwise operations are better because work for both vector and scalar > masks. Sounds reasonable though less patterns are IMHO better (patterns should be only used when combining N > 1 scalar stmts to M >= 1 new scalar stmts, single scalar stmts can be handled in complex ways in the vectorizable_* functions).
Created attachment 38703 [details] Enable masks comparison using patterns (In reply to rguenther@suse.de from comment #7) > > Sounds reasonable though less patterns are IMHO better (patterns should > be only used when combining N > 1 scalar stmts to M >= 1 new scalar stmts, > single scalar stmts can be handled in complex ways in the vectorizable_* > functions). This can be handled in vectorizable_comparison but it would make it much more complex and in fact it would mean a copy of a part of vectorizable_operation into vectorizable_comparison. I attach a pattern variant. I can also prepare vectorizable_comparison enhancement variant to compare and choose.
Created attachment 38704 [details] Enable masks comparison with no new patterns Here is a version without additional patterns (mask conversion pattern is slightly extended). It appeared to be smaller than I thought.
On Wed, 15 Jun 2016, ienkovich at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71488 > > --- Comment #9 from Ilya Enkovich <ienkovich at gcc dot gnu.org> --- > Created attachment 38704 [details] > --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38704&action=edit > Enable masks comparison with no new patterns > > Here is a version without additional patterns (mask conversion pattern is > slightly extended). It appeared to be smaller than I thought. I like this more. Please also adjust the cost accordingly. I suppose by passing (1 + bitop1 != NOP_EXPR + bitop2 != NOP_EXPR) * ncopies to vect_model_simple_cost. Thanks, Richard.
Author: ienkovich Date: Wed Jun 22 14:05:55 2016 New Revision: 237706 URL: https://gcc.gnu.org/viewcvs?rev=237706&root=gcc&view=rev Log: gcc/ PR middle-end/71488 * tree-vect-patterns.c (vect_recog_mask_conversion_pattern): Support comparison of boolean vectors. * tree-vect-stmts.c (vectorizable_comparison): Vectorize comparison of boolean vectors using bitwise operations. gcc/testsuite/ PR middle-end/71488 * g++.dg/pr71488.C: New test. * gcc.dg/vect/vect-bool-cmp.c: New test. Added: trunk/gcc/testsuite/g++.dg/pr71488.C trunk/gcc/testsuite/gcc.dg/vect/vect-bool-cmp.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-vect-patterns.c trunk/gcc/tree-vect-stmts.c
Author: uros Date: Thu Jun 23 16:58:54 2016 New Revision: 237738 URL: https://gcc.gnu.org/viewcvs?rev=237738&root=gcc&view=rev Log: PR tree-optimization/71488 * gcc.target/i386/i386.exp (check_effective_target_sse4): Move to ... * lib/target-supports.exp: ... here. (check_sse4_hw_available): New procedure. (check_effective_target_sse4_runtime): Ditto. * g++.dg/pr71488.C (dg-additional-options): Use -msse4 instead of -march=westmere for sse4_runtime targets. * gcc.dg/vect/vect-bool-cmp.c: Include "tree-vect.h". (dg-additional-options): Use for sse4_runtime targets. (main): Call check_vect (). (dg-final): Perform scan only for sse4_runtime targets. Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/g++.dg/pr71488.C trunk/gcc/testsuite/gcc.dg/vect/vect-bool-cmp.c trunk/gcc/testsuite/gcc.target/i386/i386.exp trunk/gcc/testsuite/lib/target-supports.exp
Vectorizer PRs should be filled having tree-optimization component field.
Caused PR71655.
GCC 6.2 is being released, adjusting target milestone.
GCC 6.3 is being released, adjusting target milestone.
GCC 6.4 is being released, adjusting target milestone.
GCC 6 branch is being closed, fixed in 7.x.