Created attachment 37985 [details] Reproducer. Test case produces incorrect result with -O3 -march=skylake-avx512. It gives different result for every launch. Everything works fine with -O2 option. Output: > g++ repr.cpp -o out -O3 -march=skylake-avx512; sde -skx -- ./out 12033857864019976279 > g++ repr.cpp -o out -O2 -march=skylake-avx512; sde -skx -- ./out 5785906989299578598 GCC version: > g++ -v Using built-in specs. COLLECT_GCC=g++ COLLECT_LTO_WRAPPER=/export/users/vlivinsk/gcc-trunk/bin/libexec/gcc/x86_64-pc-linux-gnu/6.0.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /export/users/vlivinsk/gcc-trunk/gcc/configure --with-arch=corei7 --with-cpu=corei7 --enable-clocale=gnu --with-system-zlib --enable-shared --with-demangler-in-ld --enable-cloog-backend=isl --with-fpmath=sse --enable-checking=release --enable-languages=c,c++,lto --with-gmp=/export/users/vlivinsk/gcc-trunk/gmp-6.1.0/bin --with-mpfr=/export/users/vlivinsk/gcc-trunk/mpfr-3.1.3/bin --with-mpc=/export/users/vlivinsk/gcc-trunk/mpc-1.0.3/bin --prefix=/export/users/vlivinsk/gcc-trunk/bin Thread model: posix gcc version 6.0.0 20160315 (experimental) (Revision=234226) Test: #include <iostream> void hash(unsigned long long int &seed, unsigned long long int const &v) { seed ^= v + 0x9e3779b9 + (seed<<6) + (seed>>2); } unsigned int a [100]; signed char b [100]; signed char c [100]; void init () { for (int i = 0; i < 100; ++i) { a [i] = 1000L; b [i] = 10; c [i] = 5; } } void foo () { for (int i = 0; i < 100; ++i) b [i] = (!b [i] ^ (a [i] >= b [i])) + c [i]; } unsigned long long int checksum () { unsigned long long int seed = 0ULL; for (int i = 0; i < 100; ++i) hash(seed, b [i]); return seed; } int main () { init (); foo (); std::cout << checksum () << std::endl; return 0; }

Confirm. The problem is that vect_patt_50.39_93 = VEC_COND_EXPR <mask_patt_51.38_90, { 1, .. , 1 }, { 0, .. , 0 }>; vect__17.44_98 = vect_patt_50.39_93 + vect__16.43_97; is replaced with: _1 = VIEW_CONVERT_EXPR<vector(64) unsigned char>(mask_patt_51.38_90); vect__17.44_98 = vect__16.43_97 - _1; which works for vector masks but doesn't work for scalar masks.

Here is a responsible match.pd pattern: /* A + (B vcmp C ? 1 : 0) -> A - (B vcmp C), since vector comparisons return all-1 or all-0 results. */ /* ??? We could instead convert all instances of the vec_cond to negate, but that isn't necessarily a win on its own. */ (simplify (plus:c @3 (view_convert? (vec_cond @0 integer_each_onep@1 integer_zerop@2))) (if (VECTOR_TYPE_P (type) && TYPE_VECTOR_SUBPARTS (type) == TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)) && (TYPE_MODE (TREE_TYPE (type)) == TYPE_MODE (TREE_TYPE (TREE_TYPE (@0))))) (minus @3 (view_convert @0)))) I propose this patch: diff --git a/gcc/match.pd b/gcc/match.pd index 112deb3..7245ff4 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1759,6 +1759,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (simplify (plus:c @3 (view_convert? (vec_cond @0 integer_each_onep@1 integer_zerop@2))) (if (VECTOR_TYPE_P (type) + && VECTOR_MODE_P (TYPE_MODE (TREE_TYPE (@0))) && TYPE_VECTOR_SUBPARTS (type) == TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)) && (TYPE_MODE (TREE_TYPE (type)) == TYPE_MODE (TREE_TYPE (TREE_TYPE (@0))))) @@ -1768,6 +1769,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (simplify (minus @3 (view_convert? (vec_cond @0 integer_each_onep@1 integer_zerop@2))) (if (VECTOR_TYPE_P (type) + && VECTOR_MODE_P (TYPE_MODE (TREE_TYPE (@0))) && TYPE_VECTOR_SUBPARTS (type) == TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)) && (TYPE_MODE (TREE_TYPE (type)) == TYPE_MODE (TREE_TYPE (TREE_TYPE (@0)))))

On Wed, 16 Mar 2016, ienkovich at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70251 > > --- Comment #2 from Ilya Enkovich <ienkovich at gcc dot gnu.org> --- > Here is a responsible match.pd pattern: > > /* A + (B vcmp C ? 1 : 0) -> A - (B vcmp C), since vector comparisons > return all-1 or all-0 results. */ > /* ??? We could instead convert all instances of the vec_cond to negate, > but that isn't necessarily a win on its own. */ > (simplify > (plus:c @3 (view_convert? (vec_cond @0 integer_each_onep@1 integer_zerop@2))) > (if (VECTOR_TYPE_P (type) > && TYPE_VECTOR_SUBPARTS (type) == TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)) > && (TYPE_MODE (TREE_TYPE (type)) > == TYPE_MODE (TREE_TYPE (TREE_TYPE (@0))))) > (minus @3 (view_convert @0)))) > > I propose this patch: > > diff --git a/gcc/match.pd b/gcc/match.pd > index 112deb3..7245ff4 100644 > --- a/gcc/match.pd > +++ b/gcc/match.pd > @@ -1759,6 +1759,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (simplify > (plus:c @3 (view_convert? (vec_cond @0 integer_each_onep@1 integer_zerop@2))) > (if (VECTOR_TYPE_P (type) > + && VECTOR_MODE_P (TYPE_MODE (TREE_TYPE (@0))) > && TYPE_VECTOR_SUBPARTS (type) == TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)) > && (TYPE_MODE (TREE_TYPE (type)) > == TYPE_MODE (TREE_TYPE (TREE_TYPE (@0))))) > @@ -1768,6 +1769,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (simplify > (minus @3 (view_convert? (vec_cond @0 integer_each_onep@1 integer_zerop@2))) > (if (VECTOR_TYPE_P (type) > + && VECTOR_MODE_P (TYPE_MODE (TREE_TYPE (@0))) > && TYPE_VECTOR_SUBPARTS (type) == TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)) > && (TYPE_MODE (TREE_TYPE (type)) > == TYPE_MODE (TREE_TYPE (TREE_TYPE (@0))))) Looks good to me.

(In reply to Ilya Enkovich from comment #2) > Here is a responsible match.pd pattern: > > /* A + (B vcmp C ? 1 : 0) -> A - (B vcmp C), since vector comparisons > return all-1 or all-0 results. */ Since we don't simplify (x<y)?-1:0 to (x<y) anymore for vectors (on any platform), it seems weird to keep this optimization at all. Maybe it should be instead: A + (B vcmp C ? 1 : 0) -> A - (B vcmp C ? -1 : 0) ?

On Wed, 16 Mar 2016, glisse at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70251 > > --- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> --- > (In reply to Ilya Enkovich from comment #2) > > Here is a responsible match.pd pattern: > > > > /* A + (B vcmp C ? 1 : 0) -> A - (B vcmp C), since vector comparisons > > return all-1 or all-0 results. */ > > Since we don't simplify (x<y)?-1:0 to (x<y) anymore for vectors (on any > platform), it seems weird to keep this optimization at all. Maybe it should be > instead: > > A + (B vcmp C ? 1 : 0) -> A - (B vcmp C ? -1 : 0) > > ? I think that would be an odd transform. But yes, The pattern is probably dead now and can as well be removed...

(In reply to rguenther@suse.de from comment #5) > I think that would be an odd transform. But yes, The pattern is probably > dead now and can as well be removed... This test proves pattern is not dead and still can be applied for all non-AVX-512 targets. I don't think all targets make such optimizations so why not to keep it?

On Thu, 17 Mar 2016, ienkovich at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70251 > > --- Comment #6 from Ilya Enkovich <ienkovich at gcc dot gnu.org> --- > (In reply to rguenther@suse.de from comment #5) > > I think that would be an odd transform. But yes, The pattern is probably > > dead now and can as well be removed... > > This test proves pattern is not dead and still can be applied for all > non-AVX-512 targets. I don't think all targets make such optimizations so why > not to keep it? Looks like it was added for gcc.target/aarch64/vect-add-sub-cond.c so we indeed should keep it.

Fixed by r234283.

I don't find the current situation very consistent. We seem to consider that in gimple, the output of a vector comparison can only be used directly in a vec_cond_expr (possibly also {and,ior,xor,bit_not}_expr?), except in the very specific case where we apply this transformation a+(b?1:0) -> a-VCE(b), where we only have a view_convert_expr. Unless we start transforming x?-1:0 into view_convert_expr(x) under the same condition (VECTOR_MODE_P(...)), it doesn't make sense to me. I am also not convinced that VECTOR_MODE_P is the right test. If you are using a long vector (say of 2048 bits), the mode will never be a vector mode, so at best the transformation is delayed until after vector lowering. By the way, nothing seems to combine the following into a single statement: c_3 = VEC_COND_EXPR <x_1(D) < y_2(D), { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }>; _4 = VEC_COND_EXPR <c_3 != { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }>; (it does get combined in forwprop4 after the vector operation is lowered to scalar operations (yes, we are missing the code to use sub-vectors here)) From a code generation point of view: typedef int vec __attribute__((vector_size(32))); vec f(vec x,vec y,vec z){ vec zero={}; vec one=zero+1; vec c=x<y; return z+(c?one:zero); } -mavx2 generates vpcmpgtd %ymm0, %ymm1, %ymm1 vpsubd %ymm1, %ymm2, %ymm0 while -march=skylake-avx512 gives the inferior vpcmpgtd %ymm0, %ymm1, %ymm1 vpandd .LC0(%rip), %ymm1, %ymm0 vpaddd %ymm2, %ymm0, %ymm0 (In reply to rguenther@suse.de from comment #5) > On Wed, 16 Mar 2016, glisse at gcc dot gnu.org wrote: > > A + (B vcmp C ? 1 : 0) -> A - (B vcmp C ? -1 : 0) > I think that would be an odd transform. As far as I understand, since the bool vector changes, this is currently the proper gimple syntax for this transformation, with the added bonus that it doesn't need to be disabled for avx512. (I got "incorrect sharing of tree nodes" on the first argument of vec_cond_expr when I wanted to try it out, the match-simplify machinery should probably do the unshare_expr automatically, especially since gimplify now wants to keep the comparison inside the vec_cond_expr)

On Sat, 19 Mar 2016, glisse at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70251 > > --- Comment #9 from Marc Glisse <glisse at gcc dot gnu.org> --- > I don't find the current situation very consistent. We seem to consider that in > gimple, the output of a vector comparison can only be used directly in a > vec_cond_expr (possibly also {and,ior,xor,bit_not}_expr?), except in the very > specific case where we apply this transformation a+(b?1:0) -> a-VCE(b), where > we only have a view_convert_expr. True. OTOH VIEW_CONVERT_EXPR is quite a special beast. > Unless we start transforming x?-1:0 into > view_convert_expr(x) under the same condition (VECTOR_MODE_P(...)), it doesn't > make sense to me. I think that would be a valid transform. > I am also not convinced that VECTOR_MODE_P is the right test. > If you are using a long vector (say of 2048 bits), the mode will never be a > vector mode, so at best the transformation is delayed until after vector > lowering. True, I was also thinking that a TYPE_SIZE check would be better to verify the validity of the V_C_E. > By the way, nothing seems to combine the following into a single > statement: > c_3 = VEC_COND_EXPR <x_1(D) < y_2(D), { -1, -1, -1, -1, -1, -1, -1, -1, -1, > -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, > -1, -1, -1 }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }>; > _4 = VEC_COND_EXPR <c_3 != { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 1, 1, 1, 1, 1, 1, 1, 1, > 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }, { 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0 }>; > (it does get combined in forwprop4 after the vector operation is lowered to > scalar operations (yes, we are missing the code to use sub-vectors here)) This means we are missing a simplifier for (for cnd (cond vec_cond) (for cmp (ne eq) (cmp (cnd @0 CONSTANT_CLASS_P@1 CONSTANT_CLASS_P@2) ... possibly also for the scalar COND_EXPR case. > From a code generation point of view: > > typedef int vec __attribute__((vector_size(32))); > vec f(vec x,vec y,vec z){ > vec zero={}; > vec one=zero+1; > vec c=x<y; > return z+(c?one:zero); > } > > -mavx2 generates > > vpcmpgtd %ymm0, %ymm1, %ymm1 > vpsubd %ymm1, %ymm2, %ymm0 > > while -march=skylake-avx512 gives the inferior > > vpcmpgtd %ymm0, %ymm1, %ymm1 > vpandd .LC0(%rip), %ymm1, %ymm0 > vpaddd %ymm2, %ymm0, %ymm0 > > > > (In reply to rguenther@suse.de from comment #5) > > On Wed, 16 Mar 2016, glisse at gcc dot gnu.org wrote: > > > A + (B vcmp C ? 1 : 0) -> A - (B vcmp C ? -1 : 0) > > I think that would be an odd transform. > > As far as I understand, since the bool vector changes, this is currently the > proper gimple syntax for this transformation, with the added bonus that it > doesn't need to be disabled for avx512. Sure, but nowhere else we transform a + into a - operation(?) This is what I mean with "odd transform". We'd get this motivated only by (B vcmp C ? -1 : 0) being "canonical" for target expansion, right? > (I got "incorrect sharing of tree nodes" on the first argument of vec_cond_expr > when I wanted to try it out, the match-simplify machinery should probably do > the unshare_expr automatically, especially since gimplify now wants to keep the > comparison inside the vec_cond_expr) Ah, yes, that's sth we need to fix. Care to share the pattern/testcase that triggers this? The place to fix seems a bit non-obvious. Richard.

(In reply to rguenther@suse.de from comment #10) > > Unless we start transforming x?-1:0 into > > view_convert_expr(x) under the same condition (VECTOR_MODE_P(...)), it doesn't > > make sense to me. > > I think that would be a valid transform. Ah, ok. I thought it was a deliberate choice (maybe to keep things consistent between AVX512 and the rest). Vector lowering might also gain a few interesting cases. > > (In reply to rguenther@suse.de from comment #5) > > > On Wed, 16 Mar 2016, glisse at gcc dot gnu.org wrote: > > > > A + (B vcmp C ? 1 : 0) -> A - (B vcmp C ? -1 : 0) > > > I think that would be an odd transform. > > > > As far as I understand, since the bool vector changes, this is currently the > > proper gimple syntax for this transformation, with the added bonus that it > > doesn't need to be disabled for avx512. > > Sure, but nowhere else we transform a + into a - operation(?) This > is what I mean with "odd transform". It is indeed (and we are already performing it). > We'd get this motivated only > by (B vcmp C ? -1 : 0) being "canonical" for target expansion, right? Yes (canonical and cheaper). > > (I got "incorrect sharing of tree nodes" on the first argument of vec_cond_expr > > when I wanted to try it out, the match-simplify machinery should probably do > > the unshare_expr automatically, especially since gimplify now wants to keep the > > comparison inside the vec_cond_expr) > > Ah, yes, that's sth we need to fix. Care to share the pattern/testcase > that triggers this? The place to fix seems a bit non-obvious. typedef int vec __attribute__((vector_size(64))); vec f(vec x,vec y,vec z){ vec zero={}; vec one=zero+1; vec c=x<y; return z+(c?one:zero); } g++ a.c -c -O -w -Wno-psabi a.c: In function 'vec f(vec, vec, vec)': a.c:7:1: error: incorrect sharing of tree nodes } ^ x_3(D) < y_4(D) _1 = VEC_COND_EXPR <x_3(D) < y_4(D), { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }>; a.c:7:1: internal compiler error: verify_gimple failed 0x10af447 verify_gimple_in_cfg(function*, bool) /home/glisse/trunk/gcc/tree-cfg.c:5125 0xf494d0 execute_function_todo /home/glisse/trunk/gcc/passes.c:1958 0xf485a3 do_per_function /home/glisse/trunk/gcc/passes.c:1645 0xf496ac execute_todo /home/glisse/trunk/gcc/passes.c:2010 --- gcc/match.pd (revision 234381) +++ gcc/match.pd (working copy) @@ -1759,11 +1759,10 @@ (simplify (plus:c @3 (view_convert? (vec_cond @0 integer_each_onep@1 integer_zerop@2))) (if (VECTOR_TYPE_P (type) - && VECTOR_MODE_P (TYPE_MODE (TREE_TYPE (@0))) && TYPE_VECTOR_SUBPARTS (type) == TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)) && (TYPE_MODE (TREE_TYPE (type)) == TYPE_MODE (TREE_TYPE (TREE_TYPE (@0))))) - (minus @3 (view_convert @0)))) + (minus @3 (view_convert (vec_cond @0 (negate @1) @2))))) /* ... likewise A - (B vcmp C ? 1 : 0) -> A + (B vcmp C). */ (simplify

Thanks - I'm testing a fix for that and will adjust the match.pd patterns accordingly (without adding a cmp ? -1 : 0 -> VIEW_CONVERT_EXPR pattern). I verified the original aarch64 testcase still works after the change.

Author: rguenth Date: Tue Mar 22 14:38:42 2016 New Revision: 234405 URL: https://gcc.gnu.org/viewcvs?rev=234405&root=gcc&view=rev Log: 2016-03-22 Richard Biener <rguenther@suse.de> PR middle-end/70251 * genmatch.c (gen_transform): Adjust last parameter to a three-state int... (capture::gen_transform): ... to change behavior when substituting a condition into cond or not-cond expr context. (dt_simplify::gen_1): Adjust. * gimple-match-head.c: Include gimplify.h for unshare_expr. * match.pd (A + (B vcmp C ? 1 : 0) -> A - (B vcmp C)): Revert last change and instead change to A + (B vcmp C ? 1 : 0) -> A - (B vcmp C ? -1 : 0). (A - (B vcmp C ? 1 : 0) -> A + (B vcmp C)): Likewise. * g++.dg/torture/pr70251.C: New testcase. Added: trunk/gcc/testsuite/g++.dg/torture/pr70251.C Modified: trunk/gcc/ChangeLog trunk/gcc/genmatch.c trunk/gcc/gimple-match-head.c trunk/gcc/match.pd trunk/gcc/testsuite/ChangeLog

Author: rguenth Date: Wed Mar 23 13:40:50 2016 New Revision: 234427 URL: https://gcc.gnu.org/viewcvs?rev=234427&root=gcc&view=rev Log: 2016-03-23 Richard Biener <rguenther@suse.de> PR middle-end/70251 * match.pd (A + (B vcmp C ? 1 : 0) -> A - (B vcmp C)): Adjust mode compatibility check. (A - (B vcmp C ? 1 : 0) -> A + (B vcmp C)): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/match.pd