clang implements __builtin_convertvector to simplify conversions between different vector types. In contrast to bitcasts, which are supported through C casts, this builtin converts element-wise according to the standard type conversion rules.

Documentation for the builtin: https://clang.llvm.org/docs/LanguageExtensions.html#langext-builtin-convertvector

Related to PR85048.
Confirmed. __builtin_convertvector is used to express generic vector type-conversion operations. The input vector and the output vector type must have the same number of elements.

Syntax:
  __builtin_convertvector(src_vec, dst_vec_type)
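For illustration, a minimal example of the element-wise semantics (hypothetical code, not taken from the clang documentation or any testsuite):

typedef float f32x4 __attribute__((vector_size(16)));
typedef int i32x4 __attribute__((vector_size(16)));

i32x4 to_int(f32x4 v)
{
    /* Each float lane is converted to int by the usual C conversion
       rules (truncation toward zero); a C cast on the vector would
       instead just reinterpret the bits.  */
    return __builtin_convertvector(v, i32x4);
}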
Dup of PR61731.
*** Bug 61731 has been marked as a duplicate of this bug. ***
Created attachment 45319 [details]
gcc9-pr85052.patch

Untested implementation. Some further work is needed to improve code generation for the narrowing or widening conversions.
Thank you Jakub! For reference, here's a tested x86 library implementation covering all conversions and the different ISA extensions:
https://github.com/mattkretz/gcc/blob/mkretz/simd/libstdc%2B%2B-v3/include/experimental/bits/simd_x86_conversions.h

(I have not looked at the patch yet to see whether I understand enough of the implementation to optimize conversions myself.)
The patch seems to be working.

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in)
{
    return __builtin_convertvector(in, u64x2);
}

It doesn't generate the best code, but it isn't bad.

x86_64, SSE4.1:
cvt:
        movq    %xmm0, %rax
        movd    %eax, %xmm0
        shrq    $32, %rax
        pinsrq  $1, %rax, %xmm0
        ret

x86_64, SSE2:
cvt:
        movq    %xmm0, %rax
        movd    %eax, %xmm0
        shrq    $32, %rax
        movq    %rax, %xmm1
        punpcklqdq      %xmm1, %xmm0
        ret

ARMv7a NEON:
cvt:
        sub     sp, sp, #16
        mov     r3, #0
        str     r3, [sp, #4]
        str     r3, [sp, #12]
        add     r3, sp, #8
        vst1.32 {d0[0]}, [sp]
        vst1.32 {d0[1]}, [r3]
        vld1.64 {d0-d1}, [sp:64]
        add     sp, sp, #16
        bx      lr

I haven't built the others yet.

The correct code would be this ([signed|unsigned]):
cvt:
        vmovl.[s|u]32   q0, d0
        bx      lr

I am testing other targets now.

For reference, this is what clang generates for other targets:

aarch64:
cvt:
        [s|u]shll       v0.2d, v0.2s, #0
        ret

sse4.1/avx:
cvt:
        [v]pmov[s|z]xdq xmm0, xmm0
        ret

sse2:
signed_cvt:
        pxor    xmm1, xmm1
        pcmpgtd xmm1, xmm0
        punpckldq       xmm0, xmm1   # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        ret
unsigned_cvt:
        xorps   xmm1, xmm1
        unpcklps        xmm0, xmm1   # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        ret
Wait, silly me, this isn't about optimizations, this is about patterns.

It does the same thing it was doing for this code:

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in)
{
    return (u64x2) { (unsigned long long)in[0], (unsigned long long)in[1] };
}
Note, in the meantime I've posted a newer version of the patch that should handle the 2x narrowing and 2x widening cases better; see https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00129.html
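For concreteness, this is the kind of 2x narrowing conversion meant here (a hypothetical example, not from the patch's testcases):

typedef unsigned long long u64x2 __attribute__((vector_size(16)));
typedef unsigned u32x2 __attribute__((vector_size(8)));

u32x2 cvt_down(u64x2 in)
{
    /* 2x narrowing: each 64-bit element is truncated to 32 bits.  */
    return __builtin_convertvector(in, u32x2);
}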
(In reply to Devin Hussey from comment #7)
> Wait, silly me, this isn't about optimizations, this is about patterns.

Regarding optimizations, PR85048 is a first step (it lists all x86 single-instruction SIMD conversions). I also linked my library implementation in #5, which provides optimizations for all cases on x86.
Author: jakub
Date: Mon Jan 7 08:49:08 2019
New Revision: 267632

URL: https://gcc.gnu.org/viewcvs?rev=267632&root=gcc&view=rev
Log:
	PR c++/85052
	* tree-vect-generic.c: Include insn-config.h and recog.h.
	(expand_vector_piecewise): Add defaulted ret_type argument,
	if non-NULL, use that in preference to type for the result type.
	(expand_vector_parallel): Formatting fix.
	(do_vec_conversion, do_vec_narrowing_conversion,
	expand_vector_conversion): New functions.
	(expand_vector_operations_1): Call expand_vector_conversion
	for VEC_CONVERT ifn calls.
	* internal-fn.def (VEC_CONVERT): New internal function.
	* internal-fn.c (expand_VEC_CONVERT): New function.
	* fold-const-call.c (fold_const_vec_convert): New function.
	(fold_const_call): Use it for CFN_VEC_CONVERT.
	* doc/extend.texi (__builtin_convertvector): Document.
c-family/
	* c-common.h (enum rid): Add RID_BUILTIN_CONVERTVECTOR.
	(c_build_vec_convert): Declare.
	* c-common.c (c_build_vec_convert): New function.
c/
	* c-parser.c (c_parser_postfix_expression): Parse
	__builtin_convertvector.
cp/
	* cp-tree.h (cp_build_vec_convert): Declare.
	* parser.c (cp_parser_postfix_expression): Parse
	__builtin_convertvector.
	* constexpr.c: Include fold-const-call.h.
	(cxx_eval_internal_function): Handle IFN_VEC_CONVERT.
	(potential_constant_expression_1): Likewise.
	* semantics.c (cp_build_vec_convert): New function.
	* pt.c (tsubst_copy_and_build): Handle CALL_EXPR to
	IFN_VEC_CONVERT.
testsuite/
	* c-c++-common/builtin-convertvector-1.c: New test.
	* c-c++-common/torture/builtin-convertvector-1.c: New test.
	* g++.dg/ext/builtin-convertvector-1.C: New test.
	* g++.dg/cpp0x/constexpr-builtin4.C: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/builtin-convertvector-1.c
    trunk/gcc/testsuite/c-c++-common/torture/builtin-convertvector-1.c
    trunk/gcc/testsuite/g++.dg/cpp0x/constexpr-builtin4.C
    trunk/gcc/testsuite/g++.dg/ext/builtin-convertvector-1.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/c-family/ChangeLog
    trunk/gcc/c-family/c-common.c
    trunk/gcc/c-family/c-common.h
    trunk/gcc/c/ChangeLog
    trunk/gcc/c/c-parser.c
    trunk/gcc/cp/ChangeLog
    trunk/gcc/cp/constexpr.c
    trunk/gcc/cp/cp-tree.h
    trunk/gcc/cp/parser.c
    trunk/gcc/cp/pt.c
    trunk/gcc/cp/semantics.c
    trunk/gcc/doc/extend.texi
    trunk/gcc/fold-const-call.c
    trunk/gcc/internal-fn.c
    trunk/gcc/internal-fn.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-generic.c
Implemented on the trunk now. The 4x/8x narrowing/widening conversions will need further work to handle them efficiently, though for 8x conversions we are, e.g. on x86, already outside of the realm of natively supported vectors (we don't really want MMX, and for 1024-bit and wider generic vectors we don't always emit the best code).
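To make the 8x case concrete, a hypothetical example (types picked for illustration only):

typedef unsigned long long u64x8 __attribute__((vector_size(64)));
typedef unsigned char u8x8 __attribute__((vector_size(8)));

u8x8 cvt_down8(u64x8 in)
{
    /* 8x narrowing: a 512-bit source vector shrinks to a 64-bit result,
       so at least one side is not a natively supported vector size on
       most x86 targets without AVX-512.  */
    return __builtin_convertvector(in, u8x8);
}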
(In reply to Jakub Jelinek from comment #11)
> [...] though for 8x conversions we are, e.g. on x86, already outside of the
> realm of natively supported vectors (we don't really want MMX, and for
> 1024-bit and wider generic vectors we don't always emit the best code).

Thinking creatively: consider constants stored as (u)char arrays (for bandwidth optimization) and converted to double or (u)llong when used. I'd want to use a half-SSE load plus a subsequent conversion to an AVX-512 vector (e.g. vpmovsxbq + vcvtqq2pd), or even a full SSE load with one shift and two conversions to AVX-512. The motivation is similar for the reverse direction (though a lot less likely to be used in practice, I believe; hmm, maybe AI applications can prove that expectation wrong). But we should track optimizations in their own issues.
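A sketch of the usage pattern I have in mind (hypothetical names and constants; the codegen noted in the comment is an assumption about what an AVX-512 target could do, not what GCC emits today):

#include <string.h>

typedef signed char s8x8 __attribute__((vector_size(8)));
typedef double f64x8 __attribute__((vector_size(64)));

/* Coefficients stored as bytes to save memory bandwidth.  */
static const signed char coeffs[8] = { 1, -2, 3, -4, 5, -6, 7, -8 };

f64x8 load_coeffs(void)
{
    s8x8 v;
    memcpy(&v, coeffs, sizeof v);   /* half-SSE-sized (64-bit) load */
    /* Ideally this becomes vpmovsxbq + vcvtqq2pd on an AVX-512 target.  */
    return __builtin_convertvector(v, f64x8);
}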