Regression starting in GCC 9: when the shift amounts are compile-time constants, GCC constant-folds the AVX2 _mm_sllv family of intrinsics as if every element were shifted by the first shift amount, instead of shifting each element by its own amount.

#include <immintrin.h>
#include <stdio.h>

// force -O0
__attribute__((__optimize__("-O0")))
void unoptimized()
{
    __m128i vals = _mm_set1_epi32(0xffffffff);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n",
           _mm_extract_epi32(shifted, 0), _mm_extract_epi32(shifted, 1),
           _mm_extract_epi32(shifted, 2), _mm_extract_epi32(shifted, 3));
}

// force -O3
__attribute__((__optimize__("-O3")))
void optimized()
{
    __m128i vals = _mm_set1_epi32(0xffffffff);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n",
           _mm_extract_epi32(shifted, 0), _mm_extract_epi32(shifted, 1),
           _mm_extract_epi32(shifted, 2), _mm_extract_epi32(shifted, 3));
}

int main()
{
    printf("Without optimizations (correct result):\t");
    unoptimized();
    printf("With optimizations (incorrect result):\t");
    optimized();
}

I would expect this code to emit the following:

Without optimizations (correct result): ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):  ffff0000 80000000 00000000 fffffff8

Clang and GCC releases before 9 produce exactly that. However, on GCC 9.1 and later I get this output:

Without optimizations (correct result): ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):  ffff0000 ffff0000 ffff0000 ffff0000

Godbolt link: https://gcc.godbolt.org/z/oC3Psp
From the .ccp1 dump:

Visiting statement:
_17 = VIEW_CONVERT_EXPR<vector(4) int>(shifts_13);
which is likely CONSTANT
Match-and-simplified VIEW_CONVERT_EXPR<vector(4) int>(shifts_13) to { 16, 31, -34, 3 }
Lattice value changed to CONSTANT { 16, 31, -34, 3 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again

Visiting statement:
_18 = VIEW_CONVERT_EXPR<vector(4) int>(vals_11);
which is likely CONSTANT
Match-and-simplified VIEW_CONVERT_EXPR<vector(4) int>(vals_11) to { -1, -1, -1, -1 }
Lattice value changed to CONSTANT { -1, -1, -1, -1 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again

Visiting statement:
_19 = __builtin_ia32_psllv4si (_18, _17);
which is likely CONSTANT
Lattice value changed to CONSTANT { -65536, -65536, -65536, -65536 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again
The bug is most likely inside ix86_fold_builtin.
I think I found the culprit commit. Haven't set up a GCC build tree yet, though.

https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b
(In reply to Devin Hussey from comment #3)
> I think I found the culprit commit.
>
> Haven't set up a GCC build tree yet, though.
>
> https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b

g:6a03477e85e1b097ed6c0b86c76436de575aef04 for the new git.
Finally got GCC to build after it was throwing a fit. I can confirm that the regression was introduced by that commit: g:28a8a768ebef5e31f950013f1b48b14c008b4b3b works correctly, g:6a03477e85e1b097ed6c0b86c76436de575aef04 does not.
I'll have a look.
Created attachment 47717 [details]
gcc10-pr93418.patch

Untested fix.
Seems to work.

~ $ ~/gcc-test/bin/x86_64-pc-cygwin-gcc.exe -mavx2 -O3 _mm_sllv_bug.c
~ $ ./a.exe
Without optimizations (correct result): ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):  ffff0000 80000000 00000000 fffffff8
~ $

And checking the assembly, the shifts are constant-propagated. The provided test file also passes.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:bff948aa337807260344c83ac9079d6386410094

commit r10-6273-gbff948aa337807260344c83ac9079d6386410094
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jan 28 08:46:23 2020 +0100

    i386: Fix ix86_fold_builtin shift folding [PR93418]

    The following testcase is miscompiled, because the variable shift left
    operand, { -1, -1, -1, -1 } is represented as a VECTOR_CST with
    VECTOR_CST_NPATTERNS 1 and VECTOR_CST_NELTS_PER_PATTERN 1, so when we
    call builder.new_unary_operation, builder.encoded_nelts () will be just
    1 and thus we encode the resulting vector as if all the elements were
    the same.

    For non-masked is_vshift, we could perhaps call
    builder.new_binary_operation (TREE_TYPE (args[0]), args[0], args[1],
    false), but then there are masked shifts, for non-is_vshift we could
    perhaps call it too but with args[2] instead of args[1], but there is
    no builder.new_ternary_operation.  All this stuff is primarily for
    aarch64 anyway, on x86 we don't have any variable length vectors, and
    it is not a big deal to compute all elements and just let
    builder.finalize () find the most efficient VECTOR_CST representation
    of the vector.  So, instead of doing too much, this just keeps using
    new_unary_operation only if only one VECTOR_CST is involved (i.e.
    non-masked shift by constant) and for the rest just compute all elts.

    2020-01-28  Jakub Jelinek  <jakub@redhat.com>

            PR target/93418
            * config/i386/i386.c (ix86_fold_builtin) <do_shift>: If mask is not
            -1 or is_vshift is true, use new_vector with number of elts npatterns
            rather than new_unary_operation.

            * gcc.target/i386/avx2-pr93418.c: New test.
Fixed on the trunk so far.
The releases/gcc-9 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:764e831291a2e510978ca7be0bffb55589a5a0b6

commit r9-8214-g764e831291a2e510978ca7be0bffb55589a5a0b6
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jan 28 08:46:23 2020 +0100

    i386: Fix ix86_fold_builtin shift folding [PR93418]

    The following testcase is miscompiled, because the variable shift left
    operand, { -1, -1, -1, -1 } is represented as a VECTOR_CST with
    VECTOR_CST_NPATTERNS 1 and VECTOR_CST_NELTS_PER_PATTERN 1, so when we
    call builder.new_unary_operation, builder.encoded_nelts () will be just
    1 and thus we encode the resulting vector as if all the elements were
    the same.

    For non-masked is_vshift, we could perhaps call
    builder.new_binary_operation (TREE_TYPE (args[0]), args[0], args[1],
    false), but then there are masked shifts, for non-is_vshift we could
    perhaps call it too but with args[2] instead of args[1], but there is
    no builder.new_ternary_operation.  All this stuff is primarily for
    aarch64 anyway, on x86 we don't have any variable length vectors, and
    it is not a big deal to compute all elements and just let
    builder.finalize () find the most efficient VECTOR_CST representation
    of the vector.  So, instead of doing too much, this just keeps using
    new_unary_operation only if only one VECTOR_CST is involved (i.e.
    non-masked shift by constant) and for the rest just compute all elts.

    2020-01-28  Jakub Jelinek  <jakub@redhat.com>

            PR target/93418
            * config/i386/i386.c (ix86_fold_builtin) <do_shift>: If mask is not
            -1 or is_vshift is true, use new_vector with number of elts npatterns
            rather than new_unary_operation.

            * gcc.target/i386/avx2-pr93418.c: New test.
Fixed for 9.3+ too.