cat test.c
typedef int v16si __attribute__((vector_size (64)));
typedef long long v8di __attribute__((vector_size (64)));

void
bar_s32_s64 (v8di * dst, v16si src)
{
  long long tem[8];
  tem[0] = src[0];
  tem[1] = src[1];
  tem[2] = src[2];
  tem[3] = src[3];
  tem[4] = src[4];
  tem[5] = src[5];
  tem[6] = src[6];
  tem[7] = src[7];
  dst[0] = *(v8di *) tem;
}

gcc -O3 -march=skylake-avx512 will fail to vectorize the case after my
r12-2549 because I've increased the vec_construct cost for SKX/CLX.

Here's the dump for slp2:

<bb 2> [local count: 1073741824]:
_1 = BIT_FIELD_REF <src_18(D), 32, 0>;
_2 = (long long int) _1;
_3 = BIT_FIELD_REF <src_18(D), 32, 32>;
_4 = (long long int) _3;
_5 = BIT_FIELD_REF <src_18(D), 32, 64>;
_6 = (long long int) _5;
_7 = BIT_FIELD_REF <src_18(D), 32, 96>;
_8 = (long long int) _7;
_9 = BIT_FIELD_REF <src_18(D), 32, 128>;
_10 = (long long int) _9;
_11 = BIT_FIELD_REF <src_18(D), 32, 160>;
_12 = (long long int) _11;
_13 = BIT_FIELD_REF <src_18(D), 32, 192>;
_14 = (long long int) _13;
_15 = BIT_FIELD_REF <src_18(D), 32, 224>;
_31 = {_1, _3, _5, _7, _9, _11, _13, _15};
vect__2.4_32 = (vector(8) long long int) _31;
_16 = (long long int) _15;
MEM <vector(8) long long int> [(long long int *)&tem] = vect__2.4_32;
_17 = MEM[(v8di *)&tem];
*dst_28(D) = _17;
tem ={v} {CLOBBER};
return;

But actually there's no need for a vec_construct from each element; it can
be optimized to

<bb 2> [local count: 1073741824]:
_2 = BIT_FIELD_REF <src_18(D), 256, 0>;
vect__2.4_32 = (vector(8) long long int) _2;
*dst_28(D) = vect__2.4_32;
return;

So if slp2 could realize this optimization and categorize the vec_construct
cost more accurately, we could avoid this regression.
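(For reference, the whole function is just a sign-extending conversion of
the low half of src.  A hand-written equivalent, assuming GCC's
__builtin_convertvector (available since GCC 9) and a hypothetical v8si
typedef, would be:

typedef int v8si __attribute__((vector_size (32)));

void
bar_s32_s64_byhand (v8di * dst, v16si src)
{
  /* Take the low 8 lanes of the v16si and sign-extend each to 64 bits.  */
  v8si lo = { src[0], src[1], src[2], src[3],
              src[4], src[5], src[6], src[7] };
  *dst = __builtin_convertvector (lo, v8di);
}

which is what the single vpmovsxdq + store shown below amounts to.)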
The basic-block vectorizer is currently limited as to what "existing"
vectors it recognizes.  In this testcase we're accessing only the lowpart
of 'src', something we cannot yet model in vectorizable_slp_permutation.

The specific case isn't hard to fix; we'd get

<bb 2> [local count: 1073741824]:
_31 = VIEW_CONVERT_EXPR<vector(8) int>(src_18(D));
vect__2.4_33 = [vec_unpack_lo_expr] _31;
vect__2.4_34 = [vec_unpack_hi_expr] _31;
MEM <vector(4) long long int> [(long long int *)&tem] = vect__2.4_33;
MEM <vector(4) long long int> [(long long int *)&tem + 32B] = vect__2.4_34;
_17 = MEM[(v8di *)&tem];
*dst_28(D) = _17;
tem ={v} {CLOBBER};
return;

so we then fail to elide the temporary, producing

bar_s32_s64:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        vpmovsxdq       %xmm0, %ymm1
        vextracti128    $0x1, %ymm0, %xmm0
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        andq    $-64, %rsp
        subq    $8, %rsp
        vpmovsxdq       %xmm0, %ymm0
        vmovdqa %ymm1, -56(%rsp)
        vmovdqa %ymm0, -24(%rsp)
        vmovdqa64       -56(%rsp), %zmm2
        vmovdqa64       %zmm2, (%rdi)
        leave
        .cfi_def_cfa 7, 8
        ret

It looks like there's either no V8SI->V8DI conversion optab or we choose
V4DI as the preferred vector mode for some other reason.
> It looks like there's either no V8SI->V8DI conversion optab or we choose
> V4DI as the preferred vector mode for some other reason.

We have one; we just need to add -mprefer-vector-width=512, then we'll get

bar_s32_s64:
        vpmovsxdq       zmm0, ymm0
        vmovdqa64       ZMMWORD PTR [rdi], zmm0
        ret
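(For completeness, since the above is Intel-syntax output, the full
invocation would presumably be

gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -masm=intel -S test.c

i.e. only the vector-width preference needs to change relative to the
original report.)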
Some pending enhancements would allow us to use VEC_PERM_EXPR with input
modes that differ from the output mode and thus would make implementing
this easier.  vectorizable_slp_permutation doesn't yet support that,
though.  For the special case of a contiguous permutation we can of course
also vectorize it as a BIT_FIELD_REF, as sketched below.
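(To illustrate, with lane indices matching this testcase, the lowpart
select could be expressed either way; both are GIMPLE-style pseudo-IL, not
actual dump output, and the mixed-mode VEC_PERM_EXPR form is the
not-yet-supported one:

/* v16si inputs, v8si output - needs the pending VEC_PERM_EXPR relaxation.  */
_lo = VEC_PERM_EXPR <src_18(D), src_18(D), { 0, 1, 2, 3, 4, 5, 6, 7 }>;

/* Contiguous lanes: a plain lowpart BIT_FIELD_REF, 256 bits at offset 0.  */
_lo = BIT_FIELD_REF <src_18(D), 256, 0>;

)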
Guess we need to extend the backend hook to handle different input and
output modes.
(In reply to Hongtao.liu from comment #4)
> Guess we need to extend the backend hook to handle different input and
> output modes.

Yes.  Alternatively, as said, some special cases could be handled directly.
For example v16si -> v8si could be handled by VEC_PERM <lowpart, highpart,
{..}> without any extra magic (but IIRC we don't have a way to query target
support for specific BIT_FIELD_REFs, which we'd use for getting at the
lowpart or highpart, and if not available those would fall back to memory).
And contiguous permutes could be directly emitted as BIT_FIELD_REFs (in
some cases).

I have a half-way patch that does the preparatory work but leaves
vectorizable_slp_permutation unchanged, so we immediately fail there due to

  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
    {
      if (!vect_maybe_update_slp_op_vectype (child, vectype)
          || !types_compatible_p (SLP_TREE_VECTYPE (child), vectype))
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "Unsupported lane permutation\n");
          return false;

The comment above that says

  /* ??? We currently only support all same vector input and output types
     while the SLP IL should really do a concat + select and thus accept
     arbitrary mismatches.  */

so it was designed to handle more; it just wasn't necessary to implement
it yet...
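(A sketch of that v16si -> v8si lowering in GIMPLE-style pseudo-IL, with an
even-lane extract chosen just for illustration:

lo_1 = BIT_FIELD_REF <src_18(D), 256, 0>;     /* lowpart v8si half */
hi_2 = BIT_FIELD_REF <src_18(D), 256, 256>;   /* highpart v8si half */
_3 = VEC_PERM_EXPR <lo_1, hi_2, { 0, 2, 4, 6, 8, 10, 12, 14 }>;

where both inputs and the output are already the same v8si mode, so no
extra target support is needed beyond the BIT_FIELD_REFs themselves.)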
Created attachment 53031 [details]
patch

This works now - the support for enhanced vec_perm_const is still not
complete on trunk (it claims all is OK ...) so it will ICE for testcases
that would require this.  But lowpart extracts and concats (untested)
should work.  I'll extend coverage once the dependences are on trunk.

For the testcase at hand we now generate

bar_s32_s64:
.LFB0:
        .cfi_startproc
        vpmovsxdq       %ymm0, %zmm0
        vmovdqa64       %zmm0, (%rdi)
        ret

with AVX512.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:08afab6f8642f58f702010ec196dce3b00955627

commit r13-926-g08afab6f8642f58f702010ec196dce3b00955627
Author: Richard Biener <rguenther@suse.de>
Date:   Tue May 31 09:37:05 2022 +0200

    tree-optimization/101668 - relax SLP of existing vectors

    This relaxes the conditions on SLPing extracts from existing vectors
    leveraging the relaxed VEC_PERM conditions on the input vs output
    vector type compatibility.  It also handles lowpart extracts and
    concats without VEC_PERMs now.

    2022-05-25  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/101668
            * tree-vect-slp.cc (vect_build_slp_tree_1): Allow
            BIT_FIELD_REFs for vector types with compatible lane types.
            (vect_build_slp_tree_2): Deal with this.
            (vect_add_slp_permutation): Adjust.  Emit lowpart/concat
            special cases without VEC_PERM.
            (vectorizable_slp_permutation): Select the operand vector
            type and relax requirements.  Handle identity permutes
            with mismatching operand types.
            * optabs-query.cc (can_vec_perm_const_p): Only allow
            variable permutes for op_mode == mode.

            * gcc.target/i386/pr101668.c: New testcase.
            * gcc.dg/vect/bb-slp-pr101668.c: Likewise.
Fixed (for the lowpart of an existing vector).  The vectorizer should now
also handle other things like extract-even, but the target needs to
support this in its vec_perm_const handling, which now allows different
input/output modes.
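As a source-level example of such a mismatched-mode permute (hypothetical,
and requiring GCC >= 12 for __builtin_shufflevector):

typedef int v16si __attribute__((vector_size (64)));
typedef int v8si __attribute__((vector_size (32)));

/* Extract the even lanes of a v16si into a v8si; whether this becomes a
   single permute now depends on the target's vec_perm_const hook
   accepting the V16SI input / V8SI output mode pair.  */
v8si
extract_even (v16si x)
{
  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
}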