Bug 44551 - [missed optimization] AVX vextractf128 after vinsertf128
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.5.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2010-06-15 22:31 UTC by Matthias Kretz (Vir)
Modified: 2018-11-20 08:11 UTC
CC List: 4 users

Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build: x86_64-unknown-linux-gnu
Last reconfirmed: 2010-06-16 09:02:24


Attachments
A patch to split cast (1.71 KB, patch)
2010-06-17 22:01 UTC, H.J. Lu
simplify vec_select(vec_concat) (694 bytes, patch)
2014-06-10 17:02 UTC, Marc Glisse

Description Matthias Kretz (Vir) 2010-06-15 22:31:28 UTC
Consider the following testcase:

#include <immintrin.h>

static inline __m256i __attribute__((always_inline))
my_add(__m256i a0, __m256i b0)
{
    __m128i a1 = _mm256_extractf128_si256(a0, 1);
    __m128i b1 = _mm256_extractf128_si256(b0, 1);
    __m256i r  = _mm256_castsi128_si256(_mm_add_epi32(_mm256_castsi256_si128(a0), _mm256_castsi256_si128(b0)));
    r = _mm256_insertf128_si256(r, _mm_add_epi32(a1, b1), 1);
    return r;
}

extern int DATA[];

void use_insert_extract()
{
    __m256i x = _mm256_loadu_si256((__m256i*)&DATA[0]);
    __m256i y = _mm256_loadu_si256((__m256i*)&DATA[1]);
    x = my_add(x, y);
    x = my_add(x, y);
    _mm256_storeu_si256((__m256i*)&DATA[0], x);
}

int main()
{
    return DATA[1];
}

Compiling with "g++ -mavx -O3 -Wall -S", one gets the following output:
        vmovdqu DATA(%rip), %ymm1
        pushq   %rbp
        vmovdqu DATA+4(%rip), %ymm0
        vextractf128    $0x1, %ymm1, %xmm3
        vmovdqa %xmm1, %xmm2
        movq    %rsp, %rbp
        vmovdqa %xmm0, %xmm1
        vextractf128    $0x1, %ymm0, %xmm0
        vpaddd  %xmm1, %xmm2, %xmm2
        vpaddd  %xmm0, %xmm3, %xmm3
        vinsertf128     $0x1, %xmm3, %ymm2, %ymm2
        vextractf128    $0x1, %ymm2, %xmm3
        vpaddd  %xmm1, %xmm2, %xmm1
        vpaddd  %xmm0, %xmm3, %xmm0
        vinsertf128     $0x1, %xmm0, %ymm1, %ymm0
        vmovdqu %ymm0, DATA(%rip)

ICC 11.1 compiles the same source ("-xavx -O3 -Wall -S") to:
        vmovdqu   DATA(%rip), %ymm1
        vmovdqu   4+DATA(%rip), %ymm0
        vextractf128 $1, %ymm1, %xmm2
        vextractf128 $1, %ymm0, %xmm6
        vpaddd    %xmm0, %xmm1, %xmm3
        vpaddd    %xmm6, %xmm2, %xmm5
        vpaddd    %xmm0, %xmm3, %xmm4
        vpaddd    %xmm6, %xmm5, %xmm7
        vinsertf128 $1, %xmm7, %ymm4, %ymm8
        vmovdqu   %ymm8, DATA(%rip)

Note especially the extract right after the insert, which happens because of the double application of my_add. This kind of optimization (which ICC applies here) is important because AVX introduces 256-bit vector registers, but arithmetic/logic/comparison operations on integers remain the 128-bit SSE variants. Thus, if you want to handle integers in YMM registers, you end up with a lot of vinsertf128 and vextractf128 operations.
Comment 1 Richard Biener 2010-06-16 09:02:24 UTC
This is probably missing combiner patterns in sse.md.
Comment 2 H.J. Lu 2010-06-16 19:50:10 UTC
The problem is UNSPEC_CAST.  There is no good way to model it.
Comment 3 Andrew Pinski 2010-06-16 20:00:52 UTC
Well, for one, you could have a splitter for the case where which_alternative == 0 so that a register rename can do its magic.

Also, what does UNSPEC_CAST really do? From the looks of it, it is just a move, which you could use a splitter on, at least after reload.
Comment 4 H.J. Lu 2010-06-16 20:42:15 UTC
You can cast 256-bit to 128-bit to get the lower 128 bits. You can also cast 128-bit to 256-bit with the upper 128 bits undefined. If I use a union, it will always generate two moves via memory.
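
For reference, the union-based cast H.J. Lu mentions could look like the following (a minimal sketch, not code from this report; the helper name is hypothetical). Because a different union member is read than was written, compilers of this era typically route the value through a stack slot, hence the two memory moves:

#include <immintrin.h>

// Hypothetical union-based 256-bit -> 128-bit "cast".  Reading v128[0]
// after writing v256 usually forces the value through memory: one store
// of the YMM register, one load of the low XMM half.
static inline __m128i low128_via_union(__m256i v)
{
    union {
        __m256i v256;
        __m128i v128[2];
    } u;
    u.v256 = v;
    return u.v128[0];  // lower 128 bits
}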
Comment 5 Andrew Pinski 2010-06-16 20:46:05 UTC
(In reply to comment #4)
> You can cast 256-bit to 128-bit to get the lower 128 bits.

This way can be represented using vec_select, and then later on turned into a move using a split (after reload).

> You can also cast 128-bit to 256-bit with the upper 128 bits undefined.
Still use an UNSPEC, but use define_insn_and_split, which does the splitting (after reload) to turn it into a move, since it is a move after all (the registers overlap).

This should improve code generation. Also, penalize the non-matching operand-0 case in both insns.
Comment 6 Matthias Kretz (Vir) 2010-06-16 21:21:30 UTC
(In reply to comment #4)
> You can also cast 128-bit to 256-bit with the upper 128 bits undefined.
If you cast from xmm to ymm after a 128-bit instruction encoded with the VEX prefix, then the upper 128 bits are actually guaranteed to be zero. If the SSE instruction does not use the VEX prefix, the upper 128 bits are not modified. Thus there is never really an undefined state. That might be useful information for other optimizations?

> If I use a union, it will always generate two moves via memory.
Yes, I noticed that unions are not a good choice for performance-critical code. They result in far more memory moves than necessary. BTW, ICC also generates memory moves when the testcase is implemented with unions.

PS: Thanks a lot for looking into this!
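
To illustrate the VEX zeroing guarantee Matthias describes, here is a small sketch (not from the thread). It assumes the newer _mm256_zextsi128_si256 intrinsic, which did not exist in 2010 but exposes exactly these "upper half known to be zero" semantics:

#include <immintrin.h>

// The VEX-encoded vpaddd already zeroes bits 255:128 of its destination,
// so zero-extending the 128-bit result back to 256 bits should cost no
// extra instruction.  (_mm256_zextsi128_si256 is a later addition to the
// intrinsics headers; its availability here is an assumption.)
static inline __m256i add_low_halves(__m256i a, __m256i b)
{
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    return _mm256_zextsi128_si256(lo);  // upper 128 bits guaranteed zero
}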
Comment 7 H.J. Lu 2010-06-17 22:01:17 UTC
Created attachment 20934: A patch to split cast

Here is a patch to split the cast, but it doesn't remove the redundant vinsertf128/vextractf128. I am not sure which pass can optimize setting/extracting the higher elements of a vector.
Comment 8 H.J. Lu 2010-06-18 00:46:24 UTC
Can we use subreg instead of vec_select?
Comment 9 Andrew Pinski 2010-06-18 00:49:56 UTC
(In reply to comment #8)
> Can we use subreg instead of vec_select?

Kinda; you need to do triple subregs: first to an integer mode, then to a smaller integer mode, and then to the other vector mode. Subregs on vector types are only valid between modes of the same size. I tried doing this for another target and it did not really work, so I ended up using vec_select instead and penalizing the non-matching constraint case.
Comment 10 H.J. Lu 2010-06-28 19:17:10 UTC
Here is the generated code for a small testcase (source in comment #11):

[hjl@gnu-6 44551]$ cat c.s
	.file	"c.c"
	.text
	.p2align 4,,15
.globl foo
	.type	foo, @function
foo:
.LFB798:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	vinsertf128	$0x1, %xmm1, %ymm0, %ymm0
	movq	%rsp, %rbp
	.cfi_offset 6, -16
	.cfi_def_cfa_register 6
	vextractf128	$0x1, %ymm0, %xmm0
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE798:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 4.6.0 20100625 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-6 44551]$ 

The optimized code is simply a move of y into the result register (extracting lane 1 right after inserting y into lane 1 just yields y):

        vmovaps   %xmm1, %xmm0
        ret         
Comment 11 H.J. Lu 2010-06-28 19:17:52 UTC
Testcase is

[hjl@gnu-6 44551]$ cat c.c
#include <immintrin.h>

__m128i
foo (__m256i x, __m128i y)
{
  __m256i r = _mm256_insertf128_si256(x, y, 1);
  __m128i a = _mm256_extractf128_si256(r, 1);
  return a;
}
[hjl@gnu-6 44551]$ make c.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -mavx -O2 -S c.c
Comment 12 Marc Glisse 2012-12-01 22:38:15 UTC
Hmm, maybe this patch:

http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00373.html

would help with the testcase in comment #11? I'll have to try to resurrect it.
Comment 13 Marc Glisse 2014-06-10 17:02:04 UTC
Created attachment 32915: simplify vec_select(vec_concat)

A simpler/safer version of the patch linked in comment #12 (untested). It optimizes the example in comment #11, but fails to optimize the original testcase because simplify-rtx operations are only done on single-use operands, and I don't know where in the RTL optimizers we can apply transformations without this constraint.
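
The rule being added is simple to state: a vec_select whose indices pick exactly one operand of a vec_concat, in order, is just that operand. A toy, self-contained model of the index check (a sketch of the idea, not GCC's simplify-rtx code):

#include <cstddef>
#include <cstdio>
#include <vector>

// Toy model: vec_concat(a, b) has n_a + n_b elements; a vec_select of it
// simplifies to b when its indices are exactly n_a, n_a+1, ..., n_a+n_b-1.
static bool selects_operand_b(const std::vector<std::size_t> &sel,
                              std::size_t n_a, std::size_t n_b)
{
    if (sel.size() != n_b)
        return false;
    for (std::size_t i = 0; i < n_b; ++i)
        if (sel[i] != n_a + i)
            return false;
    return true;
}

int main()
{
    // foo() from comment #11, viewed as 8 ints: the insert builds
    // vec_concat(low128(x), y), and the extract of lane 1 selects
    // indices 4..7 -- exactly operand y, so the pair folds to a move.
    std::vector<std::size_t> sel = {4, 5, 6, 7};
    std::printf("folds to b: %s\n",
                selects_operand_b(sel, 4, 4) ? "yes" : "no");
    return 0;
}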
Comment 14 Marc Glisse 2014-07-26 09:01:05 UTC
Author: glisse
Date: Sat Jul 26 09:00:31 2014
New Revision: 213076

URL: https://gcc.gnu.org/viewcvs?rev=213076&root=gcc&view=rev
Log:
2014-07-26  Marc Glisse  <marc.glisse@inria.fr>

	PR target/44551
gcc/
	* simplify-rtx.c (simplify_binary_operation_1) <VEC_SELECT>:
	Optimize inverse of a VEC_CONCAT.
gcc/testsuite/
	* gcc.target/i386/pr44551-1.c: New file.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr44551-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/simplify-rtx.c
    trunk/gcc/testsuite/ChangeLog
Comment 15 Qirun Zhang 2016-02-11 23:42:01 UTC
This is a regression: compiling the committed testcase pr44551-1.c (without -mavx) ICEs the current trunk.


$ gcc-trunk -v
Using built-in specs.
COLLECT_GCC=gcc-trunk
COLLECT_LTO_WRAPPER=/home/absozero/trunk/root-gcc/libexec/gcc/x86_64-pc-linux-gnu/6.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc/configure --prefix=/home/absozero/trunk/root-gcc --enable-languages=c,c++ --disable-werror --enable-multilib
Thread model: posix
gcc version 6.0.0 20160211 (experimental) [trunk revision 233345] (GCC) 


$ gcc-trunk pr44551-1.c 
pr44551-1.c: In function ‘foo’:
pr44551-1.c:7:1: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
 foo (__m256i x, __m128i y)
 ^~~
pr44551-1.c:7:1: warning: AVX vector argument without AVX enabled changes the ABI [-Wpsabi]
In file included from /home/absozero/trunk/root-gcc/lib/gcc/x86_64-pc-linux-gnu/6.0.0/include/immintrin.h:41:0,
                 from pr44551-1.c:4:
pr44551-1.c:9:15: error: ‘__builtin_ia32_vinsertf128_si256’ needs isa option -m32
   __m256i r = _mm256_insertf128_si256(x, y, 1);
               ^
pr44551-1.c:9:15: internal compiler error: in emit_move_insn, at expr.c:3546
0x851c6f emit_move_insn(rtx_def*, rtx_def*)
	../../gcc/gcc/expr.c:3545
0x85859d store_expr_with_bounds(tree_node*, rtx_def*, int, bool, bool, tree_node*)
	../../gcc/gcc/expr.c:5583
0x859b88 expand_assignment(tree_node*, tree_node*, bool)
	../../gcc/gcc/expr.c:5175
0x74dd8a expand_call_stmt
	../../gcc/gcc/cfgexpand.c:2646
0x74dd8a expand_gimple_stmt_1
	../../gcc/gcc/cfgexpand.c:3536
0x74dd8a expand_gimple_stmt
	../../gcc/gcc/cfgexpand.c:3702
0x74fbe8 expand_gimple_basic_block
	../../gcc/gcc/cfgexpand.c:5708
0x755bf6 execute
	../../gcc/gcc/cfgexpand.c:6323
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
Comment 16 Martin Liška 2018-11-19 11:55:54 UTC
I don't see the ICE, thus closing.
Comment 17 Marc Glisse 2018-11-19 12:16:48 UTC
(In reply to Martin Liška from comment #16)
> I don't see the ICE, thus closing.

Martin, the missed optimization doesn't seem fixed (is it?). The ICE was a side remark in one comment but not the topic of this PR.
Comment 18 Matthias Kretz (Vir) 2018-11-19 15:00:34 UTC
FWIW, the issue is resolved on trunk. GCC 8.2 still has the missed optimization: https://godbolt.org/z/hbgIIi
Comment 19 Marc Glisse 2018-11-19 15:12:02 UTC
(In reply to Matthias Kretz from comment #18)
> FWIW, the issue is resolved on trunk. GCC8.2 still has the missed
> optimization: https://godbolt.org/z/hbgIIi

If I use exactly the testcase from the original description, the results don't look as nice. Is that not an issue?
Comment 20 Matthias Kretz (Vir) 2018-11-19 16:33:14 UTC
The original issue I meant to report is fixed. There are many more missed optimizations in the original example, though.

That is, https://godbolt.org/z/7P1o3O should compile to:
use_insert_extract():
  vmovdqu DATA+4(%rip), %xmm2
  vmovdqu DATA+20(%rip), %xmm4
  vpaddd DATA(%rip), %xmm2, %xmm0
  vpaddd DATA+16(%rip), %xmm4, %xmm1
  vpaddd %xmm2, %xmm0, %xmm0
  vpaddd %xmm4, %xmm1, %xmm1
  vmovups %xmm0, DATA(%rip)
  vmovups %xmm1, DATA+16(%rip)
  ret
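
For comparison, that ideal code corresponds to this hand-split version of the original testcase (a sketch; the DATA offsets follow the overlapping loads in the description, so DATA+16/DATA+20 are the high halves of x and y):

#include <immintrin.h>

extern int DATA[];

// Hand-split equivalent of use_insert_extract(): the two 128-bit halves
// are independent, so x = (x + y) + y can be computed per half with no
// vinsertf128/vextractf128 at all.
void use_insert_extract_split()
{
    __m128i x_lo = _mm_loadu_si128((__m128i*)&DATA[0]);
    __m128i x_hi = _mm_loadu_si128((__m128i*)&DATA[4]);
    __m128i y_lo = _mm_loadu_si128((__m128i*)&DATA[1]);
    __m128i y_hi = _mm_loadu_si128((__m128i*)&DATA[5]);
    x_lo = _mm_add_epi32(_mm_add_epi32(x_lo, y_lo), y_lo);
    x_hi = _mm_add_epi32(_mm_add_epi32(x_hi, y_hi), y_hi);
    _mm_storeu_si128((__m128i*)&DATA[0], x_lo);
    _mm_storeu_si128((__m128i*)&DATA[4], x_hi);
}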
Comment 21 Marc Glisse 2018-11-19 20:11:30 UTC
(In reply to Matthias Kretz from comment #20)
> The original issue I meant to report is fixed. There are many more missed
> optimizations in the original example, though.

OK, it's your choice whether to close it (especially if the other missed optimizations are already tracked in other bugs) or leave it open.
Comment 22 Richard Biener 2018-11-20 08:11:20 UTC
Let's close it.  It's much better to track different issues in different bugs.