Bug 85323 - SSE/AVX/AVX512 shift by 0 not optimized away
Summary: SSE/AVX/AVX512 shift by 0 not optimized away
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0.1
Importance: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-04-10 13:04 UTC by Matthias Kretz (Vir)
Modified: 2020-01-24 21:49 UTC
CC: 4 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-04-10 00:00:00


Attachments
idea for a partial solution (1.78 KB, patch)
2018-04-10 13:08 UTC, Matthias Kretz (Vir)
gcc9-pr85323.patch (3.38 KB, patch)
2018-05-07 17:51 UTC, Jakub Jelinek

Description Matthias Kretz (Vir) 2018-04-10 13:04:59 UTC
In the following test case, all three functions should compile to just `ret`:

#include <x86intrin.h>

__m128i f(__m128i x) {
    x = _mm_sll_epi64(x, __m128i());
    x = _mm_sll_epi32(x, __m128i());
    x = _mm_sll_epi16(x, __m128i());
    x = _mm_srl_epi64(x, __m128i());
    x = _mm_srl_epi32(x, __m128i());
    x = _mm_srl_epi16(x, __m128i());
    x = _mm_sra_epi64(x, __m128i());
    x = _mm_sra_epi32(x, __m128i());
    x = _mm_sra_epi16(x, __m128i());
    x = _mm_slli_epi64(x, 0);
    x = _mm_slli_epi32(x, 0);
    x = _mm_slli_epi16(x, 0);
    x = _mm_srli_epi64(x, 0);
    x = _mm_srli_epi32(x, 0);
    x = _mm_srli_epi16(x, 0);
    x = _mm_srai_epi64(x, 0);
    x = _mm_srai_epi32(x, 0);
    x = _mm_srai_epi16(x, 0);
    return x;
}

__m256i f(__m256i x) {
    x = _mm256_sll_epi64(x, __m128i());
    x = _mm256_sll_epi32(x, __m128i());
    x = _mm256_sll_epi16(x, __m128i());
    x = _mm256_srl_epi64(x, __m128i());
    x = _mm256_srl_epi32(x, __m128i());
    x = _mm256_srl_epi16(x, __m128i());
    x = _mm256_sra_epi64(x, __m128i());
    x = _mm256_sra_epi32(x, __m128i());
    x = _mm256_sra_epi16(x, __m128i());
    x = _mm256_slli_epi64(x, 0);
    x = _mm256_slli_epi32(x, 0);
    x = _mm256_slli_epi16(x, 0);
    x = _mm256_srli_epi64(x, 0);
    x = _mm256_srli_epi32(x, 0);
    x = _mm256_srli_epi16(x, 0);
    x = _mm256_srai_epi64(x, 0);
    x = _mm256_srai_epi32(x, 0);
    x = _mm256_srai_epi16(x, 0);
    return x;
}

__m512i f(__m512i x) {
    x = _mm512_sll_epi64(x, __m128i());
    x = _mm512_sll_epi32(x, __m128i());
    x = _mm512_sll_epi16(x, __m128i());
    x = _mm512_srl_epi64(x, __m128i());
    x = _mm512_srl_epi32(x, __m128i());
    x = _mm512_srl_epi16(x, __m128i());
    x = _mm512_sra_epi64(x, __m128i());
    x = _mm512_sra_epi32(x, __m128i());
    x = _mm512_sra_epi16(x, __m128i());
    x = _mm512_slli_epi64(x, 0);
    x = _mm512_slli_epi32(x, 0);
    x = _mm512_slli_epi16(x, 0);
    x = _mm512_srli_epi64(x, 0);
    x = _mm512_srli_epi32(x, 0);
    x = _mm512_srli_epi16(x, 0);
    x = _mm512_srai_epi64(x, 0);
    x = _mm512_srai_epi32(x, 0);
    x = _mm512_srai_epi16(x, 0);
    return x;
}
Comment 1 Matthias Kretz (Vir) 2018-04-10 13:08:47 UTC
Created attachment 43898 [details]
idea for a partial solution

Constant propagation works when the intrinsics are implemented using the built-in shift operators. At least for the shifts taking a "const int" shift argument, this is a possible fix. However, the patched intrinsics then also accept non-immediate shift arguments, which changes the interface (in an unsafe way).
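
(Editorial sketch of that approach, for illustration only; my_slli_epi32 and my_v4si are made-up names, and this is not the attached patch.) Implementing an immediate-count intrinsic on top of GCC's generic vector shift is what lets constant propagation drop a shift by 0, but it also means the count no longer has to be an immediate and out-of-range counts become undefined:

#include <x86intrin.h>

typedef int my_v4si __attribute__ ((__vector_size__ (16)));

/* Hypothetical header-style rewrite: the generic vector shift lets the
   middle end fold a shift by 0 away, but unlike the builtin it accepts
   non-immediate counts and is undefined for counts >= 32.  */
static __inline__ __m128i
my_slli_epi32 (__m128i a, const int count)
{
  return (__m128i) ((my_v4si) a << count);
}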
Comment 2 Jakub Jelinek 2018-04-10 14:27:25 UTC
The builtins have different behavior from the C shifts on vectors, where shift counts < 0 or >= element bitsize are just undefined.
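
(Editorial illustration of that difference; defined_for_any_count and ub_for_large_count are made-up names.) The hardware-count intrinsic is well defined for any count and produces an all-zero vector once the count reaches the element width, while the generic vector shift is undefined there:

#include <x86intrin.h>

typedef int v4si __attribute__ ((__vector_size__ (16)));

__m128i defined_for_any_count (__m128i x, int n)
{
  /* PSLLD with a count >= 32 yields an all-zero vector; a negative n is
     just a huge unsigned count here and also yields zero.  */
  return _mm_sll_epi32 (x, _mm_cvtsi32_si128 (n));
}

v4si ub_for_large_count (v4si x, int n)
{
  /* The generic vector shift is undefined for n < 0 or n >= 32, so it is
     not a drop-in replacement for the builtin.  */
  return x << n;
}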

What the i386 backend could do (and should do) is extend ix86_fold_builtin and/or ix86_gimple_fold_builtin to optimize these builtins.

For the shifts in particular: if the shift count (treated as unsigned) is equal to the element size or larger, return a zero vector; if it is 0, return the first operand; and if the first argument is a VECTOR_CST and the second argument an INTEGER_CST, compute the result into a VECTOR_CST.  Similarly for a VECTOR_CST second operand: handle the VECTOR_CST, VECTOR_CST case, or the cases where all elements of the latter are equal, or all are >= the element size.  Many other MD builtins should be folded too.
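
(A standalone editorial model of those rules for a logical shift with a constant scalar count; this is not GCC's actual ix86_fold_builtin code.)

/* Decision taken per call: the count is treated as unsigned, 0 is the
   identity, and a count >= the element width folds to an all-zero vector
   (for logical shifts).  Constant first operands would additionally be
   folded into a VECTOR_CST, which this model does not show.  */
enum fold_result { FOLD_TO_FIRST_OPERAND, FOLD_TO_ZERO_VECTOR, NO_FOLD };

static enum fold_result
model_shift_fold (unsigned long long count, unsigned elt_bits)
{
  if (count == 0)
    return FOLD_TO_FIRST_OPERAND;
  if (count >= elt_bits)
    return FOLD_TO_ZERO_VECTOR;
  return NO_FOLD;
}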
Comment 3 Julia Koval 2018-04-11 11:26:35 UTC
Can we just return the first operand if the count is zero, right in the intrinsic code in the header file, and generate the proper builtin if it is not zero? It won't work at -O0, but it is an optimization after all.
Comment 4 Jakub Jelinek 2018-04-11 11:48:05 UTC
If you mean adding if (__builtin_constant_p (__B) && __B == 0) return __A; and the like to all the various intrinsics, then that is not the right thing to do: it will make the headers much larger, and even the IL much larger, for the common case where people aren't using a 0 constant.
Doing this in the fold target hooks is easier and more effective.

Note, Matthias has recently filed a bunch of other PRs for similar issues; perhaps we want a tracker PR for those.
The general rule is that foldings which result in VECTOR_CST or other constants should probably go into ix86_fold_builtin, and other foldings should go into ix86_gimple_fold_builtin.
Comment 5 Jakub Jelinek 2018-05-07 17:51:07 UTC
Created attachment 44082 [details]
gcc9-pr85323.patch

Untested WIP patch.  Still need to add testsuite coverage for the case when both first and second arguments are constant.
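
(Editorial illustration of such a case, not taken from the patch or the eventual tests.) When both the vector and the count are compile-time constants, the whole call should fold to a constant vector:

#include <x86intrin.h>

/* Both operands are constants, so the call should fold to the constant
   vector { 40, 40, 40, 40 } and no shift instruction should be emitted.  */
__m128i both_constant (void)
{
  return _mm_slli_epi32 (_mm_set1_epi32 (5), 3);
}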
Comment 6 Jakub Jelinek 2018-05-17 09:55:08 UTC
Author: jakub
Date: Thu May 17 09:54:36 2018
New Revision: 260311

URL: https://gcc.gnu.org/viewcvs?rev=260311&root=gcc&view=rev
Log:
	PR target/85323
	* config/i386/i386.c: Include tree-vector-builder.h.
	(ix86_vector_shift_count): New function.
	(ix86_fold_builtin): Fold shift builtins by scalar count.
	(ix86_gimple_fold_builtin): Likewise.

	* gcc.target/i386/pr85323-1.c: New test.
	* gcc.target/i386/pr85323-2.c: New test.
	* gcc.target/i386/pr85323-3.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-3.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 7 Jakub Jelinek 2018-05-17 10:02:05 UTC
Author: jakub
Date: Thu May 17 10:01:33 2018
New Revision: 260312

URL: https://gcc.gnu.org/viewcvs?rev=260312&root=gcc&view=rev
Log:
	PR target/85323
	* config/i386/i386.c (ix86_fold_builtin): Fold shift builtins by
	vector.
	(ix86_gimple_fold_builtin): Likewise.

	* gcc.target/i386/pr85323-4.c: New test.
	* gcc.target/i386/pr85323-5.c: New test.
	* gcc.target/i386/pr85323-6.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-4.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-5.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-6.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 8 Jakub Jelinek 2018-05-17 10:07:44 UTC
Author: jakub
Date: Thu May 17 10:07:12 2018
New Revision: 260313

URL: https://gcc.gnu.org/viewcvs?rev=260313&root=gcc&view=rev
Log:
	PR target/85323
	* config/i386/i386.c (ix86_fold_builtin): Handle masked shifts
	even if the mask is not all ones.

	* gcc.target/i386/pr85323-7.c: New test.
	* gcc.target/i386/pr85323-8.c: New test.
	* gcc.target/i386/pr85323-9.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-7.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-8.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 9 Jakub Jelinek 2019-11-14 12:36:36 UTC
Should be fixed for GCC 9+.