typedef double v2df __attribute__((vector_size(16)));

v2df move_sd(v2df a, v2df b)
{
    v2df result = a;
    result[1] = b[1];
    return result;
}

With `-O3 -msse4.1`, LLVM gives:

move_sd(double __vector(2), double __vector(2)): # @move_sd(double __vector(2), double __vector(2))
        blendps xmm0, xmm1, 12          # xmm0 = xmm0[0,1],xmm1[2,3]
        ret

GCC gives:

move_sd(double __vector(2), double __vector(2)):
        unpckhpd xmm1, xmm1
        unpcklpd xmm0, xmm1
        ret
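For reference, the blend LLVM picks is what a hand-written SSE4.1 intrinsic version maps to directly (a sketch, not part of the original report; `move_sd_intrin` is a hypothetical name):

#include <immintrin.h>

__m128d move_sd_intrin(__m128d a, __m128d b)
{
    /* Immediate 2 sets mask bit 1: take lane 0 from a, lane 1 from b.
       With SSE4.1 this compiles to a single blendpd/blendps.  */
    return _mm_blend_pd(a, b, 2);
}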
Similar to PR94864. I'll note that x86 might fare better if on GIMPLE, instead of

  _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
  result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
  return result_4;

we had a VEC_PERM, but IIRC for two-element vectors this regressed some cases. Note that for this case the IL looks like the above from the start, so pattern-matching an insert of an element from another vector into a permute might be a possibility as well.
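A sketch of the permute form for the v2df case above (assuming the usual concat-style mask, where indices 0-1 select from the first operand and 2-3 from the second):

  _1 = VEC_PERM_EXPR <a_2(D), b_3(D), { 0, 3 }>;
  return _1;

Index 0 keeps a[0] and index 3 selects b[1], which is exactly the insert-of-extract above expressed as a single permute.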
Missing match.pd patterns also include a no-op combination, insertion of an extracted element at the same position:

  (simplify
   (bit_insert @0 (BIT_FIELD_REF @0 @size @pos) @pos)
   (if (size matches)
    @0))

in addition to the requested

  (simplify
   (bit_insert @0 (BIT_FIELD_REF @1 @rsize @rpos) @ipos)
   (if (@0 and @1 are vectors compatible for a vec_perm)
    (vec_perm @0 @1 { shuffle-mask })))
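To make the no-op case concrete, a minimal reproducer in the style of the original report (a hypothetical test; `noop_insert` is an invented name):

typedef double v2df __attribute__((vector_size(16)));

v2df noop_insert(v2df a)
{
    v2df result = a;
    result[1] = a[1];   /* extract at lane 1, reinsert at lane 1 */
    return result;
}

With the first pattern above, this should fold to simply returning `a`.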
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:27de9aa152141e7f3ee66372647d0f2cd94c4b90

commit r14-3381-g27de9aa152141e7f3ee66372647d0f2cd94c4b90
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jul 12 15:01:47 2023 +0200

    tree-optimization/94864 - vector insert of vector extract simplification

    The PRs ask for optimizing of

      _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
      result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;

    to a vector permutation.  The following implements this as match.pd
    pattern, improving code generation on x86_64.

    On the RTL level we face the issue that backend patterns inconsistently
    use vec_merge and vec_select of vec_concat to represent permutes.

    I think using a (supported) permute is almost always better than an
    extract plus insert, maybe excluding the case we extract element zero
    and that's aliased to a register that can be used directly for
    insertion (not sure how to query that).

    The patch FAILs one case in gcc.target/i386/avx512fp16-vmovsh-1a.c
    where we now expand from

      __A_28 = VEC_PERM_EXPR <x2.8_9, x1.9_10, { 0, 9, 10, 11, 12, 13, 14, 15 }>;

    instead of

      _28 = BIT_FIELD_REF <x2.8_9, 16, 0>;
      __A_29 = BIT_INSERT_EXPR <x1.9_10, _28, 0>;

    producing a vpblendw instruction instead of the expected vmovsh.
    That's either a missed vec_perm_const expansion optimization or even
    better, an improvement - Zen4 for example has 4 ports to execute
    vpblendw but only 3 for executing vmovsh and both instructions have
    the same size.  The patch XFAILs the sub-testcase.

            PR tree-optimization/94864
            PR tree-optimization/94865
            PR tree-optimization/93080
            * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
            for vector insertion from vector extraction.
            * gcc.target/i386/pr94864.c: New testcase.
            * gcc.target/i386/pr94865.c: Likewise.
            * gcc.target/i386/avx512fp16-vmovsh-1a.c: XFAIL.
            * gcc.dg/tree-ssa/forwprop-40.c: Likewise.
            * gcc.dg/tree-ssa/forwprop-41.c: Likewise.
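The new pr94864.c testcase presumably pairs the original reproducer with an assembler scan along these lines (a sketch; the dg- directives and the scanned mnemonic are assumptions, not copied from the commit):

/* { dg-do compile } */
/* { dg-options "-O2 -msse4.1" } */

typedef double v2df __attribute__((vector_size(16)));

v2df move_sd(v2df a, v2df b)
{
    v2df result = a;
    result[1] = b[1];
    return result;
}

/* The insert-of-extract should now become a single blend rather than
   the unpckhpd+unpcklpd pair from the original report.  */
/* { dg-final { scan-assembler-not "unpckhpd" } } */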
Fixed.