In the following test case, all three functions should compile to just `ret`:

#include <x86intrin.h>

__m128i f(__m128i x) {
  x = _mm_sll_epi64(x, __m128i());
  x = _mm_sll_epi32(x, __m128i());
  x = _mm_sll_epi16(x, __m128i());
  x = _mm_srl_epi64(x, __m128i());
  x = _mm_srl_epi32(x, __m128i());
  x = _mm_srl_epi16(x, __m128i());
  x = _mm_sra_epi64(x, __m128i());
  x = _mm_sra_epi32(x, __m128i());
  x = _mm_sra_epi16(x, __m128i());
  x = _mm_slli_epi64(x, 0);
  x = _mm_slli_epi32(x, 0);
  x = _mm_slli_epi16(x, 0);
  x = _mm_srli_epi64(x, 0);
  x = _mm_srli_epi32(x, 0);
  x = _mm_srli_epi16(x, 0);
  x = _mm_srai_epi64(x, 0);
  x = _mm_srai_epi32(x, 0);
  x = _mm_srai_epi16(x, 0);
  return x;
}

__m256i f(__m256i x) {
  x = _mm256_sll_epi64(x, __m128i());
  x = _mm256_sll_epi32(x, __m128i());
  x = _mm256_sll_epi16(x, __m128i());
  x = _mm256_srl_epi64(x, __m128i());
  x = _mm256_srl_epi32(x, __m128i());
  x = _mm256_srl_epi16(x, __m128i());
  x = _mm256_sra_epi64(x, __m128i());
  x = _mm256_sra_epi32(x, __m128i());
  x = _mm256_sra_epi16(x, __m128i());
  x = _mm256_slli_epi64(x, 0);
  x = _mm256_slli_epi32(x, 0);
  x = _mm256_slli_epi16(x, 0);
  x = _mm256_srli_epi64(x, 0);
  x = _mm256_srli_epi32(x, 0);
  x = _mm256_srli_epi16(x, 0);
  x = _mm256_srai_epi64(x, 0);
  x = _mm256_srai_epi32(x, 0);
  x = _mm256_srai_epi16(x, 0);
  return x;
}

__m512i f(__m512i x) {
  x = _mm512_sll_epi64(x, __m128i());
  x = _mm512_sll_epi32(x, __m128i());
  x = _mm512_sll_epi16(x, __m128i());
  x = _mm512_srl_epi64(x, __m128i());
  x = _mm512_srl_epi32(x, __m128i());
  x = _mm512_srl_epi16(x, __m128i());
  x = _mm512_sra_epi64(x, __m128i());
  x = _mm512_sra_epi32(x, __m128i());
  x = _mm512_sra_epi16(x, __m128i());
  x = _mm512_slli_epi64(x, 0);
  x = _mm512_slli_epi32(x, 0);
  x = _mm512_slli_epi16(x, 0);
  x = _mm512_srli_epi64(x, 0);
  x = _mm512_srli_epi32(x, 0);
  x = _mm512_srli_epi16(x, 0);
  x = _mm512_srai_epi64(x, 0);
  x = _mm512_srai_epi32(x, 0);
  x = _mm512_srai_epi16(x, 0);
  return x;
}
Created attachment 43898 [details]
idea for a partial solution

Constant propagation works when using the built-in shift operators, so at least for the shifts taking a "const int" shift argument this is a possible fix. However, the patched intrinsics now accept non-immediate shift arguments, which changes the interface (in an unsafe way).
The builtins have different behavior from the C shifts on vectors, where shift counts < 0 or >= the element bitsize are simply undefined. What the i386 backend could do (and should do) is extend ix86_fold_builtin and/or ix86_gimple_fold_builtin to optimize these builtins. For the shifts in particular: if the shift count (treated as unsigned) is equal to or larger than the element size, return a zero vector; if it is 0, return the first operand; and if the first argument is a VECTOR_CST and the second argument is an INTEGER_CST, compute the result into a VECTOR_CST. Similarly for a VECTOR_CST second operand: handle the VECTOR_CST, VECTOR_CST case, the case where all elements of the latter are equal, and the case where all are >= the element size. Many other MD builtins should be folded too.
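The folding rules described above can be sketched as a plain scalar model, without intrinsics. This is only an illustration of the hardware semantics the folder has to preserve (the function names here are invented, not GCC code): the count is taken as unsigned, and out-of-range counts are well defined, unlike plain C shifts.

```c
#include <stdint.h>

/* Hypothetical scalar model of the folding rules for 32-bit elements.
   Logical left shift: a count >= the element bitsize yields zero, a
   count of 0 is the identity, anything else is an ordinary shift.  */
uint32_t model_sll_epi32 (uint32_t elt, uint64_t count)
{
  if (count >= 32)
    return 0;          /* count >= element bitsize -> zero vector elt */
  if (count == 0)
    return elt;        /* shift by zero -> first operand unchanged */
  return elt << count;
}

/* Arithmetic right shift: an out-of-range count fills the element
   entirely with copies of the sign bit instead of being undefined.  */
int32_t model_sra_epi32 (int32_t elt, uint64_t count)
{
  if (count >= 32)
    return elt < 0 ? -1 : 0;  /* all sign bits */
  return elt >> count;        /* arithmetic shift on GCC targets */
}
```

With constant arguments, a folder applying these rules can replace the whole builtin call with a VECTOR_CST built from such per-element results.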
Can we just return the first operand if the shift count is zero from the intrinsic code in the header file, and generate the proper builtin if it is non-zero? It won't work at -O0, but it is an optimization after all.
If you mean adding

  if (__builtin_constant_p (__B) && __B == 0) return __A;

and similar to all the various intrinsics, then that is not the right thing to do: it would make the headers much larger, and even the IL much larger for the common case where people aren't using a 0 constant. Doing this in the fold target hooks is easier and more effective. Note, Matthias has filed a bunch of other PRs for similar issues recently; perhaps we want some tracker PR for those. The general rule is that foldings which result in VECTOR_CST or other constants should probably go into ix86_fold_builtin, and other foldings should go into ix86_gimple_fold_builtin.
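The guard quoted above can be seen in isolation with a plain scalar stand-in (the function name here is invented, and this is not actual intrinsics code). The guard is semantically harmless either way: with inlining and a literal argument, __builtin_constant_p (b) evaluates to 1 and the no-op shift folds away; for runtime values it evaluates to 0 and the regular path is taken. The objection is the cost, since the check would have to be duplicated into every intrinsic and survives into the IL whenever the count is not a compile-time constant.

```c
/* Hypothetical sketch of the header-based guard under discussion,
   applied to a plain scalar shift instead of a vector intrinsic.  */
static inline int shift_or_identity (int a, int b)
{
  /* If b is known at compile time to be 0, return the first operand
     directly so the shift can be optimized away entirely.  */
  if (__builtin_constant_p (b) && b == 0)
    return a;
  return a << b;
}
```

Note that __builtin_constant_p is a GCC extension; at -O0, or when b only becomes constant after inlining that does not happen, the guard simply compiles to the ordinary shift path.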
Created attachment 44082 [details]
gcc9-pr85323.patch

Untested WIP patch. Still need to add testsuite coverage for the case when both the first and second arguments are constant.
Author: jakub
Date: Thu May 17 09:54:36 2018
New Revision: 260311

URL: https://gcc.gnu.org/viewcvs?rev=260311&root=gcc&view=rev

Log:
	PR target/85323
	* config/i386/i386.c: Include tree-vector-builder.h.
	(ix86_vector_shift_count): New function.
	(ix86_fold_builtin): Fold shift builtins by scalar count.
	(ix86_gimple_fold_builtin): Likewise.

	* gcc.target/i386/pr85323-1.c: New test.
	* gcc.target/i386/pr85323-2.c: New test.
	* gcc.target/i386/pr85323-3.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-3.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Author: jakub
Date: Thu May 17 10:01:33 2018
New Revision: 260312

URL: https://gcc.gnu.org/viewcvs?rev=260312&root=gcc&view=rev

Log:
	PR target/85323
	* config/i386/i386.c (ix86_fold_builtin): Fold shift builtins by
	vector.
	(ix86_gimple_fold_builtin): Likewise.

	* gcc.target/i386/pr85323-4.c: New test.
	* gcc.target/i386/pr85323-5.c: New test.
	* gcc.target/i386/pr85323-6.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-4.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-5.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-6.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Author: jakub
Date: Thu May 17 10:07:12 2018
New Revision: 260313

URL: https://gcc.gnu.org/viewcvs?rev=260313&root=gcc&view=rev

Log:
	PR target/85323
	* config/i386/i386.c (ix86_fold_builtin): Handle masked shifts
	even if the mask is not all ones.

	* gcc.target/i386/pr85323-7.c: New test.
	* gcc.target/i386/pr85323-8.c: New test.
	* gcc.target/i386/pr85323-9.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-7.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-8.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Should be fixed for GCC 9+.