Consider the following testcase:

#include <immintrin.h>

static inline __m256i __attribute__((always_inline))
my_add(__m256i a0, __m256i b0)
{
    __m128i a1 = _mm256_extractf128_si256(a0, 1);
    __m128i b1 = _mm256_extractf128_si256(b0, 1);
    __m256i r = _mm256_castsi128_si256(
        _mm_add_epi32(_mm256_castsi256_si128(a0), _mm256_castsi256_si128(b0)));
    r = _mm256_insertf128_si256(r, _mm_add_epi32(a1, b1), 1);
    return r;
}

extern int DATA[];

void use_insert_extract()
{
    __m256i x = _mm256_loadu_si256((__m256i*)&DATA[0]);
    __m256i y = _mm256_loadu_si256((__m256i*)&DATA[1]);
    x = my_add(x, y);
    x = my_add(x, y);
    _mm256_storeu_si256((__m256i*)&DATA[0], x);
}

int main()
{
    return DATA[1];
}

Compiled with "g++ -mavx -O3 -Wall -S" one gets the following output:

        vmovdqu      DATA(%rip), %ymm1
        pushq        %rbp
        vmovdqu      DATA+4(%rip), %ymm0
        vextractf128 $0x1, %ymm1, %xmm3
        vmovdqa      %xmm1, %xmm2
        movq         %rsp, %rbp
        vmovdqa      %xmm0, %xmm1
        vextractf128 $0x1, %ymm0, %xmm0
        vpaddd       %xmm1, %xmm2, %xmm2
        vpaddd       %xmm0, %xmm3, %xmm3
        vinsertf128  $0x1, %xmm3, %ymm2, %ymm2
        vextractf128 $0x1, %ymm2, %xmm3
        vpaddd       %xmm1, %xmm2, %xmm1
        vpaddd       %xmm0, %xmm3, %xmm0
        vinsertf128  $0x1, %xmm0, %ymm1, %ymm0
        vmovdqu      %ymm0, DATA(%rip)

ICC 11.1 compiles the same source ("-xavx -O3 -Wall -S") to:

        vmovdqu      DATA(%rip), %ymm1
        vmovdqu      4+DATA(%rip), %ymm0
        vextractf128 $1, %ymm1, %xmm2
        vextractf128 $1, %ymm0, %xmm6
        vpaddd       %xmm0, %xmm1, %xmm3
        vpaddd       %xmm6, %xmm2, %xmm5
        vpaddd       %xmm0, %xmm3, %xmm4
        vpaddd       %xmm6, %xmm5, %xmm7
        vinsertf128  $1, %xmm7, %ymm4, %ymm8
        vmovdqu      %ymm8, DATA(%rip)

Note especially the extract right after the insert in the GCC output, which happens because of the double application of my_add. This kind of optimization (which ICC is able to apply here) is important because AVX introduces 256-bit vector registers, but arithmetic/logic/comparison operations on integers remain the 128-bit SSE variants. Thus, if you want to handle integers in YMM registers, you will find a lot of vinsertf128 and vextractf128 operations.
This is probably due to missing combiner patterns in sse.md.
The problem is UNSPEC_CAST. There is no good way to model it.
Well, for one, you could have a splitter for the case where which_alternative == 0, so that register renaming can do its magic. Also, what does UNSPEC_CAST really do? From the looks of it, it is just a move, which you could use a splitter on, at least after reload.
You can cast 256-bit to 128-bit to get the lower 128 bits. You can also cast 128-bit to 256-bit with the upper 128 bits undefined. If I use a union, it will always generate 2 moves via memory.
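To illustrate the two approaches being contrasted, here is a small sketch (not from the PR; function names are made up for illustration). The union version forces the value through a stack slot, which is where the extra memory moves come from; the cast intrinsic is just a register view. The target attribute is used so the snippet compiles without -mavx; running it still requires an AVX-capable CPU:

```c
#include <immintrin.h>
#include <string.h>

/* Union route: the 256-bit value is spilled to memory and the low
   half is reloaded, producing two moves via memory. */
__attribute__((target("avx")))
static __m128i low_via_union(__m256i v)
{
    union { __m256i y; __m128i x[2]; } u;
    u.y = v;
    return u.x[0];
}

/* Cast route: the low 128 bits of a ymm register are simply the
   corresponding xmm register, so this should cost no instruction. */
__attribute__((target("avx")))
static __m128i low_via_cast(__m256i v)
{
    return _mm256_castsi256_si128(v);
}

/* Helper verifying both routes return the same low half. */
__attribute__((target("avx")))
static int low_halves_match(void)
{
    __m256i v = _mm256_set_epi32(8, 7, 6, 5, 4, 3, 2, 1);
    __m128i a = low_via_union(v);
    __m128i b = low_via_cast(v);
    int lo[4];
    memcpy(lo, &a, sizeof lo);
    return memcmp(&a, &b, sizeof a) == 0 && lo[0] == 1 && lo[3] == 4;
}
```

Both functions compute the same value; only the code the compiler emits differs.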
(In reply to comment #4)
> You can cast 256bit to 128bit to get the lower 128bit.

This can be represented using vec_select, and then later turned into a move using a split (after reload).

> You can also cast 128bit to 256bit with upper 128bit undefined.

Still use an UNSPEC, but use a define_insn_and_split which does the splitting (after reload) to turn it into a move, since it is a move after all (the registers overlap). This should improve code generation. Also penalize the non-matching 0 operand case in both insns.
(In reply to comment #4)
> You can also cast 128bit to 256bit with upper 128bit undefined.

If you cast from xmm to ymm after a 128-bit instruction encoded with the VEX prefix, then the upper 128 bits are actually guaranteed to be zero. If the SSE instruction does not use the VEX prefix, then the upper 128 bits are not modified. Thus there is never really an undefined state. That might be useful information for other optimizations?

> If I use union, it will always generate 2 moves via memory.

Yes, I noticed that unions are not a good choice for performance-critical code. They result in far more memory moves than necessary. BTW, ICC also generates memory moves when the testcase is implemented with unions.

PS: Thanks a lot for looking into this!
Created attachment 20934 [details]
A patch to split cast

Here is a patch to split casts. But it doesn't remove the redundant vinsertf128/vextractf128. I am not sure which pass can optimize setting/extracting the higher elements of a vector.
Can we use subreg instead of vec_select?
(In reply to comment #8)
> Can we use subreg instead of vec_select?

Kinda. You need to do triple subregs: first to an integer mode, then to a smaller integer mode, and then to the other vector mode; subregs on vector modes are only valid between modes of the same size. I tried doing this for another target and it did not really work, so I ended up using vec_select instead and penalizing the non-matching constraint case.
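For readers unfamiliar with the machine-description side, a heavily hedged sketch of what such a vec_select-based pattern could look like in sse.md (this is illustrative only — the pattern name, modes, and constraints are invented and do not reflect the actual GCC pattern):

```lisp
;; Illustrative sketch: model "take the low 128 bits of a 256-bit reg"
;; as a vec_select, then split it to a plain move after reload, since
;; the low xmm register physically overlaps the ymm register.
(define_insn_and_split "*avx_low128_sketch"
  [(set (match_operand:V4SI 0 "register_operand" "=x")
        (vec_select:V4SI
          (match_operand:V8SI 1 "register_operand" "x")
          (parallel [(const_int 0) (const_int 1)
                     (const_int 2) (const_int 3)])))]
  "TARGET_AVX"
  "#"
  "&& reload_completed"
  [(set (match_dup 0) (match_dup 1))]
  "operands[1] = gen_lowpart (V4SImode, operands[1]);")
```

The point of splitting only after reload is that, once hard registers are assigned, the "cast" is visibly either a no-op (same register) or a single register-to-register move, which later passes can clean up.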
Here is a small testcase:

[hjl@gnu-6 44551]$ cat c.s
	.file	"c.c"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB798:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	vinsertf128	$0x1, %xmm1, %ymm0, %ymm0
	movq	%rsp, %rbp
	.cfi_offset 6, -16
	.cfi_def_cfa_register 6
	vextractf128	$0x1, %ymm0, %xmm0
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE798:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 4.6.0 20100625 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-6 44551]$

The optimized code is:

	vmovaps	%xmm1, %xmm0
	ret
Testcase is:

[hjl@gnu-6 44551]$ cat c.c
#include <immintrin.h>

__m128i
foo (__m256i x, __m128i y)
{
  __m256i r = _mm256_insertf128_si256(x, y, 1);
  __m128i a = _mm256_extractf128_si256(r, 1);
  return a;
}
[hjl@gnu-6 44551]$ make c.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -mavx -O2 -S c.c
Hmm, maybe this patch: http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00373.html would help with the testcase in comment #11? I'll have to try and resurrect it.
Created attachment 32915 [details]
simplify vec_select(vec_concat)

A simpler/safer version of the patch linked in comment #12 (untested). It optimizes the example in comment #11, but fails to optimize the original testcase because simplify-rtx operations are only done on single-use operands, and I don't know where in the RTL optimizers we can apply transformations without this constraint.
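The simplification in question is the algebraic identity that selecting one half of a concatenation yields the corresponding original operand. A minimal executable model in Python (purely illustrative — this is not GCC code, and the function names are invented):

```python
def vec_concat(a, b):
    """Concatenate two 'vector registers' (lists of lanes)."""
    return a + b

def vec_select(v, indices):
    """Select the given lanes from a vector."""
    return [v[i] for i in indices]

def simplify_select_of_concat(a, b, indices):
    """Model of the vec_select(vec_concat(a, b)) simplification:
    if the selected lanes are exactly one operand of the
    concatenation, return that operand directly instead of
    materializing the wide value (the insert/extract pair)."""
    if indices == list(range(len(a))):
        return a                      # low half: first operand
    if indices == list(range(len(a), len(a) + len(b))):
        return b                      # high half: second operand
    return vec_select(vec_concat(a, b), indices)
```

In the comment-#11 testcase, vinsertf128 builds the vec_concat and vextractf128 is the vec_select of its high half, so the pair collapses to the inserted xmm operand — a single move.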
Author: glisse
Date: Sat Jul 26 09:00:31 2014
New Revision: 213076

URL: https://gcc.gnu.org/viewcvs?rev=213076&root=gcc&view=rev
Log:
2014-07-26  Marc Glisse  <marc.glisse@inria.fr>

	PR target/44551
gcc/
	* simplify-rtx.c (simplify_binary_operation_1) <VEC_SELECT>:
	Optimize inverse of a VEC_CONCAT.
gcc/testsuite/
	* gcc.target/i386/pr44551-1.c: New file.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr44551-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/simplify-rtx.c
    trunk/gcc/testsuite/ChangeLog
The testcase crashes the current trunk.

$ gcc-trunk -v
Using built-in specs.
COLLECT_GCC=gcc-trunk
COLLECT_LTO_WRAPPER=/home/absozero/trunk/root-gcc/libexec/gcc/x86_64-pc-linux-gnu/6.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc/configure --prefix=/home/absozero/trunk/root-gcc --enable-languages=c,c++ --disable-werror --enable-multilib
Thread model: posix
gcc version 6.0.0 20160211 (experimental) [trunk revision 233345] (GCC)

$ gcc-trunk pr44551-1.c
pr44551-1.c: In function ‘foo’:
pr44551-1.c:7:1: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
 foo (__m256i x, __m128i y)
 ^~~
pr44551-1.c:7:1: warning: AVX vector argument without AVX enabled changes the ABI [-Wpsabi]
In file included from /home/absozero/trunk/root-gcc/lib/gcc/x86_64-pc-linux-gnu/6.0.0/include/immintrin.h:41:0,
                 from pr44551-1.c:4:
pr44551-1.c:9:15: error: ‘__builtin_ia32_vinsertf128_si256’ needs isa option -m32
   __m256i r = _mm256_insertf128_si256(x, y, 1);
               ^
pr44551-1.c:9:15: internal compiler error: in emit_move_insn, at expr.c:3546
0x851c6f emit_move_insn(rtx_def*, rtx_def*)
	../../gcc/gcc/expr.c:3545
0x85859d store_expr_with_bounds(tree_node*, rtx_def*, int, bool, bool, tree_node*)
	../../gcc/gcc/expr.c:5583
0x859b88 expand_assignment(tree_node*, tree_node*, bool)
	../../gcc/gcc/expr.c:5175
0x74dd8a expand_call_stmt
	../../gcc/gcc/cfgexpand.c:2646
0x74dd8a expand_gimple_stmt_1
	../../gcc/gcc/cfgexpand.c:3536
0x74dd8a expand_gimple_stmt
	../../gcc/gcc/cfgexpand.c:3702
0x74fbe8 expand_gimple_basic_block
	../../gcc/gcc/cfgexpand.c:5708
0x755bf6 execute
	../../gcc/gcc/cfgexpand.c:6323
Please submit a full bug report, with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
I don't see the ICE, thus closing.
(In reply to Martin Liška from comment #16)
> I don't see the ICE, thus closing.

Martin, the missed optimization doesn't seem fixed (is it?). The ICE was a side remark in one comment, not the topic of this PR.
FWIW, the issue is resolved on trunk. GCC 8.2 still has the missed optimization: https://godbolt.org/z/hbgIIi
(In reply to Matthias Kretz from comment #18)
> FWIW, the issue is resolved on trunk. GCC8.2 still has the missed
> optimization: https://godbolt.org/z/hbgIIi

If I use exactly the testcase from the original description, the results don't look as nice. Is that not an issue?
The original issue I meant to report is fixed. There are many more missed optimizations in the original example, though. E.g. https://godbolt.org/z/7P1o3O should compile to:

use_insert_extract():
        vmovdqu DATA+4(%rip), %xmm2
        vmovdqu DATA+20(%rip), %xmm4
        vpaddd  DATA(%rip), %xmm2, %xmm0
        vpaddd  DATA+16(%rip), %xmm4, %xmm1
        vpaddd  %xmm2, %xmm0, %xmm0
        vpaddd  %xmm4, %xmm1, %xmm1
        vmovups %xmm0, DATA(%rip)
        vmovups %xmm1, DATA+16(%rip)
        ret
(In reply to Matthias Kretz from comment #20)
> The original issue I meant to report is fixed. There are many more missed
> optimizations in the original example, though.

OK, your choice whether to close it (especially if the other missed optimizations are already tracked in other bugs) or leave it open.
Let's close it. It's much better to track different issues in different bugs.