Bug 85323 - SSE/AVX/AVX512 shift by 0 not optimized away
Summary: SSE/AVX/AVX512 shift by 0 not optimized away
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0.1
Importance: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-04-10 13:04 UTC by Matthias Kretz (Vir)
Modified: 2020-01-24 21:49 UTC
CC: 4 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-04-10 00:00:00


Attachments
idea for a partial solution (1.78 KB, patch)
2018-04-10 13:08 UTC, Matthias Kretz (Vir)
gcc9-pr85323.patch (3.38 KB, patch)
2018-05-07 17:51 UTC, Jakub Jelinek

Description Matthias Kretz (Vir) 2018-04-10 13:04:59 UTC
In the following test case, all three functions should compile to just `ret`:

#include <x86intrin.h>

__m128i f(__m128i x) {
    x = _mm_sll_epi64(x, __m128i());
    x = _mm_sll_epi32(x, __m128i());
    x = _mm_sll_epi16(x, __m128i());
    x = _mm_srl_epi64(x, __m128i());
    x = _mm_srl_epi32(x, __m128i());
    x = _mm_srl_epi16(x, __m128i());
    x = _mm_sra_epi64(x, __m128i());
    x = _mm_sra_epi32(x, __m128i());
    x = _mm_sra_epi16(x, __m128i());
    x = _mm_slli_epi64(x, 0);
    x = _mm_slli_epi32(x, 0);
    x = _mm_slli_epi16(x, 0);
    x = _mm_srli_epi64(x, 0);
    x = _mm_srli_epi32(x, 0);
    x = _mm_srli_epi16(x, 0);
    x = _mm_srai_epi64(x, 0);
    x = _mm_srai_epi32(x, 0);
    x = _mm_srai_epi16(x, 0);
    return x;
}

__m256i f(__m256i x) {
    x = _mm256_sll_epi64(x, __m128i());
    x = _mm256_sll_epi32(x, __m128i());
    x = _mm256_sll_epi16(x, __m128i());
    x = _mm256_srl_epi64(x, __m128i());
    x = _mm256_srl_epi32(x, __m128i());
    x = _mm256_srl_epi16(x, __m128i());
    x = _mm256_sra_epi64(x, __m128i());
    x = _mm256_sra_epi32(x, __m128i());
    x = _mm256_sra_epi16(x, __m128i());
    x = _mm256_slli_epi64(x, 0);
    x = _mm256_slli_epi32(x, 0);
    x = _mm256_slli_epi16(x, 0);
    x = _mm256_srli_epi64(x, 0);
    x = _mm256_srli_epi32(x, 0);
    x = _mm256_srli_epi16(x, 0);
    x = _mm256_srai_epi64(x, 0);
    x = _mm256_srai_epi32(x, 0);
    x = _mm256_srai_epi16(x, 0);
    return x;
}

__m512i f(__m512i x) {
    x = _mm512_sll_epi64(x, __m128i());
    x = _mm512_sll_epi32(x, __m128i());
    x = _mm512_sll_epi16(x, __m128i());
    x = _mm512_srl_epi64(x, __m128i());
    x = _mm512_srl_epi32(x, __m128i());
    x = _mm512_srl_epi16(x, __m128i());
    x = _mm512_sra_epi64(x, __m128i());
    x = _mm512_sra_epi32(x, __m128i());
    x = _mm512_sra_epi16(x, __m128i());
    x = _mm512_slli_epi64(x, 0);
    x = _mm512_slli_epi32(x, 0);
    x = _mm512_slli_epi16(x, 0);
    x = _mm512_srli_epi64(x, 0);
    x = _mm512_srli_epi32(x, 0);
    x = _mm512_srli_epi16(x, 0);
    x = _mm512_srai_epi64(x, 0);
    x = _mm512_srai_epi32(x, 0);
    x = _mm512_srai_epi16(x, 0);
    return x;
}
Comment 1 Matthias Kretz (Vir) 2018-04-10 13:08:47 UTC
Created attachment 43898 [details]
idea for a partial solution

Constant propagation works when the intrinsics are implemented using the built-in shift operators. At least for the shifts taking a "const int" shift argument, this is a possible fix. However, the patched intrinsics then also accept non-immediate shift arguments, which changes the interface (in an unsafe way).
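
(Editorial sketch of that approach, for illustration only; my_slli_epi32 and my_v4si are made-up names, and this is not the attached patch.) Implementing an immediate-count intrinsic on top of GCC's generic vector shift is what lets constant propagation drop a shift by 0, but it also means the count no longer has to be an immediate and out-of-range counts become undefined:

#include <x86intrin.h>

typedef int my_v4si __attribute__ ((__vector_size__ (16)));

/* Hypothetical header-style rewrite: the generic vector shift lets the
   middle end fold a shift by 0 away, but unlike the builtin it accepts
   non-immediate counts and is undefined for counts >= 32.  */
static __inline__ __m128i
my_slli_epi32 (__m128i a, const int count)
{
  return (__m128i) ((my_v4si) a << count);
}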
Comment 2 Jakub Jelinek 2018-04-10 14:27:25 UTC
The builtins have different behavior from the C shifts on vectors, where shift counts < 0 or >= element bitsize are just undefined.
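
(Editorial illustration of that difference; defined_for_any_count and ub_for_large_count are made-up names.) The hardware-count intrinsic is well defined for any count and produces an all-zero vector once the count reaches the element width, while the generic vector shift is undefined there:

#include <x86intrin.h>

typedef int v4si __attribute__ ((__vector_size__ (16)));

__m128i defined_for_any_count (__m128i x, int n)
{
  /* PSLLD with a count >= 32 yields an all-zero vector; a negative n is
     just a huge unsigned count here and also yields zero.  */
  return _mm_sll_epi32 (x, _mm_cvtsi32_si128 (n));
}

v4si ub_for_large_count (v4si x, int n)
{
  /* The generic vector shift is undefined for n < 0 or n >= 32, so it is
     not a drop-in replacement for the builtin.  */
  return x << n;
}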

What the i386 backend could do (and should do) is extend ix86_fold_builtin and/or ix86_gimple_fold_builtin to optimize these builtins.

For the shifts in particular: if the shift count (treated as unsigned) is equal to the element size or larger, return a zero vector; if it is 0, return the first operand; and if the first argument is a VECTOR_CST and the second argument an INTEGER_CST, compute the result into a VECTOR_CST.  Similarly for a VECTOR_CST second operand: handle the VECTOR_CST, VECTOR_CST case, or the cases where all elements of the latter are equal, or all are >= the element size.  Many other MD builtins should be folded too.
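
(A standalone editorial model of those rules for a logical shift with a constant scalar count; this is not GCC's actual ix86_fold_builtin code.)

/* Decision taken per call: the count is treated as unsigned, 0 is the
   identity, and a count >= the element width folds to an all-zero vector
   (for logical shifts).  Constant first operands would additionally be
   folded into a VECTOR_CST, which this model does not show.  */
enum fold_result { FOLD_TO_FIRST_OPERAND, FOLD_TO_ZERO_VECTOR, NO_FOLD };

static enum fold_result
model_shift_fold (unsigned long long count, unsigned elt_bits)
{
  if (count == 0)
    return FOLD_TO_FIRST_OPERAND;
  if (count >= elt_bits)
    return FOLD_TO_ZERO_VECTOR;
  return NO_FOLD;
}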
Comment 3 Julia Koval 2018-04-11 11:26:35 UTC
Can we just return the first operand if the count is zero, right in the intrinsic code in the header file, and generate the proper builtin if it is not zero? It won't work at -O0, but it is an optimization after all.
Comment 4 Jakub Jelinek 2018-04-11 11:48:05 UTC
If you mean adding if (__builtin_constant_p (__B) && __B == 0) return __A; and the like to all the various intrinsics, then that is not the right thing to do: it will make the headers much larger, and even the IL much larger, for the common case where people aren't using a 0 constant.
Doing this in the fold target hooks is easier and more effective.

Note, Matthias has recently filed a bunch of other PRs for similar issues; perhaps we want a tracker PR for those.
The general rule is that foldings which result in VECTOR_CST or other constants should probably go into ix86_fold_builtin, and other foldings should go into ix86_gimple_fold_builtin.
Comment 5 Jakub Jelinek 2018-05-07 17:51:07 UTC
Created attachment 44082 [details]
gcc9-pr85323.patch

Untested WIP patch.  Still need to add testsuite coverage for the case when both first and second arguments are constant.
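
(Editorial illustration of such a case, not taken from the patch or the eventual tests.) When both the vector and the count are compile-time constants, the whole call should fold to a constant vector:

#include <x86intrin.h>

/* Both operands are constants, so the call should fold to the constant
   vector { 40, 40, 40, 40 } and no shift instruction should be emitted.  */
__m128i both_constant (void)
{
  return _mm_slli_epi32 (_mm_set1_epi32 (5), 3);
}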
Comment 6 Jakub Jelinek 2018-05-17 09:55:08 UTC
Author: jakub
Date: Thu May 17 09:54:36 2018
New Revision: 260311

URL: https://gcc.gnu.org/viewcvs?rev=260311&root=gcc&view=rev
Log:
	PR target/85323
	* config/i386/i386.c: Include tree-vector-builder.h.
	(ix86_vector_shift_count): New function.
	(ix86_fold_builtin): Fold shift builtins by scalar count.
	(ix86_gimple_fold_builtin): Likewise.

	* gcc.target/i386/pr85323-1.c: New test.
	* gcc.target/i386/pr85323-2.c: New test.
	* gcc.target/i386/pr85323-3.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-3.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 7 Jakub Jelinek 2018-05-17 10:02:05 UTC
Author: jakub
Date: Thu May 17 10:01:33 2018
New Revision: 260312

URL: https://gcc.gnu.org/viewcvs?rev=260312&root=gcc&view=rev
Log:
	PR target/85323
	* config/i386/i386.c (ix86_fold_builtin): Fold shift builtins by
	vector.
	(ix86_gimple_fold_builtin): Likewise.

	* gcc.target/i386/pr85323-4.c: New test.
	* gcc.target/i386/pr85323-5.c: New test.
	* gcc.target/i386/pr85323-6.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-4.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-5.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-6.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 8 Jakub Jelinek 2018-05-17 10:07:44 UTC
Author: jakub
Date: Thu May 17 10:07:12 2018
New Revision: 260313

URL: https://gcc.gnu.org/viewcvs?rev=260313&root=gcc&view=rev
Log:
	PR target/85323
	* config/i386/i386.c (ix86_fold_builtin): Handle masked shifts
	even if the mask is not all ones.

	* gcc.target/i386/pr85323-7.c: New test.
	* gcc.target/i386/pr85323-8.c: New test.
	* gcc.target/i386/pr85323-9.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr85323-7.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-8.c
    trunk/gcc/testsuite/gcc.target/i386/pr85323-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 9 Jakub Jelinek 2019-11-14 12:36:36 UTC
Should be fixed for GCC 9+.