Bug 93594 - Missed optimization with _mm256_set/setr_m128i intrinsics
Summary: Missed optimization with _mm256_set/setr_m128i intrinsics
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 9.2.1
Importance: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
Depends on:
Blocks: 93613
Reported: 2020-02-05 13:04 UTC by andysem
Modified: 2020-02-07 08:30 UTC (History)
CC: 0 users

See Also:
Known to work:
Known to fail:
Last reconfirmed: 2020-02-05 00:00:00

gcc10-pr93594.patch (1.16 KB, patch)
2020-02-05 17:11 UTC, Jakub Jelinek
gcc10-pr93594-2.patch (1.18 KB, patch)
2020-02-06 12:51 UTC, Jakub Jelinek

Description andysem 2020-02-05 13:04:45 UTC
When the _mm256_set_m128i/_mm256_setr_m128i intrinsics are used to zero the upper half of the resulting register, gcc generates an unnecessary vinserti128 instruction where a single vmovdqa would be enough. The compiler is able to recognize the "_mm256_insertf128_si256(_mm256_setzero_si256(), low, 0)" pattern but not "_mm256_insertf128_si256(_mm256_castsi128_si256(low), _mm_setzero_si128(), 1)".

You can see code generated for the different pieces of code here: https://gcc.godbolt.org/z/ZMwtPq

Note that clang is able to recognize all versions and generates optimal code in all cases.

For convenience, here is the test code:

#include <immintrin.h>

__m256i cvt_setr(__m128i low)
{
    return _mm256_setr_m128i(low, _mm_setzero_si128());
}

__m256i cvt_set(__m128i low)
{
    return _mm256_set_m128i(_mm_setzero_si128(), low);
}

__m256i cvt_insert(__m128i low)
{
    return _mm256_insertf128_si256(_mm256_setzero_si256(), low, 0);
}

__m256i cvt_insert_v2(__m128i low)
{
    return _mm256_insertf128_si256(_mm256_castsi128_si256(low), _mm_setzero_si128(), 1);
}

$ g++ -O3 -mavx2
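As an aside (not part of the original report): newer compilers also provide an intrinsic that states the intended zero-extension directly, so no insert/concat pattern has to be recognized at all. A minimal sketch, assuming GCC 10+ or a recent clang, which provide _mm256_zextsi128_si256:

```c
#include <immintrin.h>

/* Sketch, assuming a compiler that provides _mm256_zextsi128_si256
   (GCC 10+, clang 5+): zero-extend the low 128 bits to 256 bits.
   The target attribute lets this compile without a global -mavx2. */
__attribute__((target("avx2")))
__m256i cvt_zext(__m128i low)
{
    return _mm256_zextsi128_si256(low);
}
```

With this form the compiler should not need to pattern-match an insert or concat sequence and can emit the single vmovdqa %xmm0, %xmm0 directly.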
Comment 1 Jakub Jelinek 2020-02-05 17:11:45 UTC
Created attachment 47786

Untested fix.
Comment 2 andysem 2020-02-05 19:14:30 UTC
Another test case:

__m256i cvt_permute(__m128i low)
{
    return _mm256_permute2x128_si256(_mm256_castsi128_si256(low), _mm256_castsi128_si256(low), 0x80);
}

Comment 3 andysem 2020-02-05 19:49:41 UTC
...and probably other permute variants involving zeroed input registers, e.g.:

__m256i cvt_permute_zero_v1(__m128i low)
{
    return _mm256_permute2x128_si256(_mm256_setzero_si256(), _mm256_castsi128_si256(low), 0x02);
}

__m256i cvt_permute_zero_v2(__m128i low)
{
    return _mm256_permute2x128_si256(_mm256_castsi128_si256(low), _mm256_setzero_si256(), 0x20);
}

Comment 4 Marc Glisse 2020-02-06 08:54:43 UTC
The versions involving _mm256_cast* may be related to PR50829 and others (UNSPEC hiding the semantics of the operation).
Comment 5 CVS Commits 2020-02-06 10:10:20 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:


commit r10-6472-g3f740c67dbb90177aa71d3c60ef9b0fd2f44dbd9
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Feb 6 11:08:59 2020 +0100

    i386: Improve avx* vector concatenation [PR93594]
    The following testcase shows that for _mm256_set*_m128i and similar
    intrinsics, we sometimes generate bad code.  All 4 routines are expressing
    the same thing, a 128-bit vector zero padded to a 256-bit vector, but only
    the 3rd one actually emits the desired vmovdqa %xmm0, %xmm0 insn; the
    others emit vpxor %xmm1, %xmm1, %xmm1; vinserti128 $0x1, %xmm1, %ymm0, %ymm0.
    The problem is that the cast builtins use UNSPEC_CAST, which is simplified
    after reload using a splitter, but during combine it prevents optimizations.
    We do have avx_vec_concat* patterns that generate efficient code, both for
    this low part + zero concatenation special case and for other cases too, so
    the following define_insn_and_split just recognizes avx_vec_concat made of a
    low half of a cast and some other reg.
    2020-02-06  Jakub Jelinek  <jakub@redhat.com>
    	PR target/93594
    	* config/i386/predicates.md (avx_identity_operand): New predicate.
    	* config/i386/sse.md (*avx_vec_concat<mode>_1): New
    	define_insn_and_split.
    	* gcc.target/i386/avx2-pr93594.c: New test.
Comment 6 Jakub Jelinek 2020-02-06 12:51:44 UTC
Created attachment 47789

Actually, thinking about it some more, rather than having a special pattern to handle UNSPEC_CAST, if we change the cast patterns so that they are a vec_concat of the operand and UNSPEC_CAST that then represents just the uninitialized higher part, simplify-rtx.c is able to deal with it on its own.
Comment 7 Jakub Jelinek 2020-02-06 13:38:58 UTC
The _mm256_permute2x128_si256 issues are similar, but really unrelated and IMHO should be tracked in a separate PR.  The problem there is that the pattern we use doesn't really describe what the instruction does: it uses an UNSPEC_VPERMTI, which obviously can't be simplified by the generic code.  The reason is mainly that the instruction isn't just a two-source permutation, but essentially a three-source permutation, with the third source being 0.
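To make that concrete, here is a scalar model (my sketch based on the documented vperm2i128 semantics, not GCC code): each 128-bit result lane is picked by a 2-bit selector from the four source lanes, or forced to zero by the lane's zero bit, i.e. effectively a permutation over the three sources {src1, src2, 0}:

```c
#include <stdint.h>

/* Scalar model of _mm256_permute2x128_si256(a, b, imm) semantics
   (a sketch based on the documented behaviour, not GCC's pattern).
   A 256-bit vector is two 128-bit lanes, each stored as two uint64_t. */
typedef struct { uint64_t q[4]; } v256; /* q[0..1] low lane, q[2..3] high lane */

static void pick_lane(const v256 *a, const v256 *b, int sel, int zero,
                      uint64_t out[2])
{
    if (zero) {            /* the "third source": a constant-zero lane */
        out[0] = out[1] = 0;
        return;
    }
    const v256 *src = (sel & 2) ? b : a;   /* selector bit 1 picks the vector */
    int base = (sel & 1) ? 2 : 0;          /* selector bit 0 picks low/high lane */
    out[0] = src->q[base];
    out[1] = src->q[base + 1];
}

static v256 perm2x128(v256 a, v256 b, int imm)
{
    v256 r;
    pick_lane(&a, &b, imm & 3, imm & 0x08, &r.q[0]);        /* result low lane  */
    pick_lane(&a, &b, (imm >> 4) & 3, imm & 0x80, &r.q[2]); /* result high lane */
    return r;
}
```

For the cvt_permute testcase above, imm 0x80 selects a's low lane for the result's low lane and zeroes the high lane via bit 7, which is exactly the zero-padding operation; the implicit zero source is what the two-input UNSPEC_VPERMTI pattern cannot express.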
Comment 8 Jakub Jelinek 2020-02-06 14:42:58 UTC
_mm256_permute2x128_si256 issue moved to separate PR93613.
Comment 9 Marc Glisse 2020-02-06 14:58:57 UTC
(In reply to Jakub Jelinek from comment #6)
> if we change the cast patterns so that they are a
> vec_concat of the operand and UNSPEC_CAST that then represents just the
> uninitialized higher part, simplify-rtx.c is able to deal with it on its own.

Yes, I like that better as well. The name UNSPEC_CAST looks a bit strange though, it could be renamed UNSPEC_UNDEF for instance? If someone tries to extract the high part of such a vector, I expect simplification yields just the unspec, which doesn't have a matching pattern, so the simplification is cancelled? Any connection with _mm_undefined_si128?
(random dump of thoughts, ignore most of it)
Comment 10 CVS Commits 2020-02-07 08:30:54 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:


commit r10-6499-gf82617f229b336d856c18313339b14657e05c129
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Fri Feb 7 09:28:39 2020 +0100

    i386: Better patch to improve avx* vector concatenation [PR93594]
    After thinking some more on this, we can do better; rather than having to
    add a new prereload splitter pattern to catch all other cases where it might
    be beneficial to fold first part of an UNSPEC_CAST back to the unspec
    operand, this patch reverts the *.md changes I've made yesterday and instead
    tweaks the patterns, so that simplify-rtx.c can optimize those on its own.
    Instead of the whole SET_SRC being an UNSPEC through which simplify-rtx.c
    obviously can't optimize anything, this represents those patterns through a
    VEC_CONCAT (or two nested ones for the 128-bit -> 512-bit casts) with the
    operand as the low part of it and UNSPEC representing just the high part of
    it (the undefined, to be ignored, bits).  While richi suggested using,
    already in GIMPLE, an SSA_NAME default definition for those (i.e. a
    clearly uninitialized use), I'd say that the uninit pass would warn about
    those; but more importantly, in RTL it would probably force zero
    initialization of that or the use of an uninitialized pseudo, all of which
    is hard to match in a pattern, so I think an UNSPEC is better for that.
    2020-02-07  Jakub Jelinek  <jakub@redhat.com>
    	PR target/93594
    	* config/i386/predicates.md (avx_identity_operand): Remove.
    	* config/i386/sse.md (*avx_vec_concat<mode>_1): Remove.
    	(avx_<castmode><avxsizesuffix>_<castmode>,
    	avx512f_<castmode><avxsizesuffix>_256<castmode>): Change patterns to
    	a VEC_CONCAT of the operand and UNSPEC_CAST.
    	(avx512f_<castmode><avxsizesuffix>_<castmode>): Change pattern to
    	a VEC_CONCAT of VEC_CONCAT of the operand and UNSPEC_CAST with