Bug 85052 - Implement support for clang's __builtin_convertvector
Summary: Implement support for clang's __builtin_convertvector
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: c++
Version: 8.0.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords:
Duplicates: 61731
Depends on:
Blocks: 88601 88670
 
Reported: 2018-03-23 13:08 UTC by Matthias Kretz (Vir)
Modified: 2019-01-16 21:56 UTC
CC: 6 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-03-23 00:00:00


Attachments
gcc9-pr85052.patch (7.97 KB, patch)
2019-01-02 18:03 UTC, Jakub Jelinek

Description Matthias Kretz (Vir) 2018-03-23 13:08:40 UTC
clang implements __builtin_convertvector to simplify conversions between different vector builtins. In contrast to bitcasts, which are available through C-style casts between vector types, this builtin converts element-wise according to the standard type conversion rules.

Documentation for the builtin: https://clang.llvm.org/docs/LanguageExtensions.html#langext-builtin-convertvector

Related to PR85048.
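
For illustration, a minimal sketch of the difference (the type and function names are just examples, not from the report):

typedef float f32x4 __attribute__((vector_size(16)));
typedef int i32x4 __attribute__((vector_size(16)));

i32x4 bits(f32x4 v)
{
    return (i32x4)v;                           /* bitcast: reinterprets the bytes */
}

i32x4 vals(f32x4 v)
{
    return __builtin_convertvector(v, i32x4);  /* element-wise float -> int conversion */
}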
Comment 1 Richard Biener 2018-03-23 14:19:38 UTC
Confirmed.

__builtin_convertvector is used to express generic vector type-conversion operations. The input vector and the output vector type must have the same number of elements.

Syntax:
  __builtin_convertvector(src_vec, dst_vec_type)
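
For example (illustrative type names), both vectors here have four elements even though their byte sizes differ, so the conversion is valid:

typedef short i16x4 __attribute__((vector_size(8)));
typedef double f64x4 __attribute__((vector_size(32)));

f64x4 widen(i16x4 v)
{
    return __builtin_convertvector(v, f64x4);  /* four elements in, four elements out */
}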
Comment 2 Marc Glisse 2018-03-30 09:35:41 UTC
Dup of PR61731.
Comment 3 Jan Hubicka 2018-12-26 16:50:27 UTC
*** Bug 61731 has been marked as a duplicate of this bug. ***
Comment 4 Jakub Jelinek 2019-01-02 18:03:23 UTC
Created attachment 45319
gcc9-pr85052.patch

Untested implementation.  Some further work is needed to improve code generation for the narrowing or widening conversions.
Comment 5 Matthias Kretz (Vir) 2019-01-03 07:59:17 UTC
Thank you, Jakub! For reference, here's a tested x86 library implementation covering all conversions across the different ISA extensions:

https://github.com/mattkretz/gcc/blob/mkretz/simd/libstdc%2B%2B-v3/include/experimental/bits/simd_x86_conversions.h

(I have not looked at the patch yet to see whether I understand enough of the implementation to optimize conversions myself.)
Comment 6 Devin Hussey 2019-01-05 18:28:28 UTC
The patch seems to be working.

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in)
{
    return __builtin_convertvector(in, u64x2);
}

It doesn't generate the best code, but it isn't bad.

x86_64, SSE4.1:

cvt:
	movq	%xmm0, %rax
	movd	%eax, %xmm0
	shrq	$32, %rax
	pinsrq	$1, %rax, %xmm0
	ret

x86_64, SSE2:

cvt:
	movq	%xmm0, %rax
	movd	%eax, %xmm0
	shrq	$32, %rax
	movq	%rax, %xmm1
	punpcklqdq	%xmm1, %xmm0
	ret

ARMv7a NEON:

cvt:
	sub	sp, sp, #16
	mov	r3, #0
	str	r3, [sp, #4]
	str	r3, [sp, #12]
	add	r3, sp, #8
	vst1.32	{d0[0]}, [sp]
	vst1.32	{d0[1]}, [r3]
	vld1.64	{d0-d1}, [sp:64]
	add	sp, sp, #16
	bx	lr

I haven't built the others yet.

The optimal code would be this ([signed|unsigned]):

cvt:
    vmovl.[s|u]32    q0, d0
    bx lr

I am testing other targets now.

For reference, this is what clang generates for other targets:

aarch64:

cvt:
        [s|u]shll   v0.2d, v0.2s, #0
        ret

sse4.1/avx:

cvt:
        [v]pmov[s|z]xdq        xmm0, xmm0
        ret

sse2:

signed_cvt:
        pxor    xmm1, xmm1
        pcmpgtd xmm1, xmm0
        punpckldq       xmm0, xmm1      # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        ret

unsigned_cvt:
        xorps   xmm1, xmm1
        unpcklps        xmm0, xmm1      # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        ret
Comment 7 Devin Hussey 2019-01-05 18:36:08 UTC
Wait, silly me, this isn't about optimizations, this is about patterns.

It generates the same code it was already generating for this:

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in)
{
    return (u64x2) { (unsigned long long)in[0], (unsigned long long)in[1] };
}
Comment 8 Jakub Jelinek 2019-01-05 18:36:59 UTC
Note, in the meantime I've posted a newer version of the patch that should handle the 2x narrowing and 2x widening cases better; see https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00129.html
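
To illustrate the 2x cases the updated patch targets (a hedged sketch; the type names are chosen for the example):

typedef long long i64x4 __attribute__((vector_size(32)));
typedef int i32x4 __attribute__((vector_size(16)));

i32x4 narrow(i64x4 v)
{
    return __builtin_convertvector(v, i32x4);  /* 2x narrowing: 64-bit -> 32-bit lanes */
}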
Comment 9 Matthias Kretz (Vir) 2019-01-05 23:03:45 UTC
(In reply to Devin Hussey from comment #7)
> Wait, silly me, this isn't about optimizations, this is about patterns.

Regarding optimizations, PR85048 is a first step (it lists all x86 single-instruction SIMD conversions). I also linked my library implementation in comment #5, which provides optimizations for all cases on x86.
Comment 10 Jakub Jelinek 2019-01-07 08:49:40 UTC
Author: jakub
Date: Mon Jan  7 08:49:08 2019
New Revision: 267632

URL: https://gcc.gnu.org/viewcvs?rev=267632&root=gcc&view=rev
Log:
	PR c++/85052
	* tree-vect-generic.c: Include insn-config.h and recog.h.
	(expand_vector_piecewise): Add defaulted ret_type argument,
	if non-NULL, use that in preference to type for the result type.
	(expand_vector_parallel): Formatting fix.
	(do_vec_conversion, do_vec_narrowing_conversion,
	expand_vector_conversion): New functions.
	(expand_vector_operations_1): Call expand_vector_conversion
	for VEC_CONVERT ifn calls.
	* internal-fn.def (VEC_CONVERT): New internal function.
	* internal-fn.c (expand_VEC_CONVERT): New function.
	* fold-const-call.c (fold_const_vec_convert): New function.
	(fold_const_call): Use it for CFN_VEC_CONVERT.
	* doc/extend.texi (__builtin_convertvector): Document.
c-family/
	* c-common.h (enum rid): Add RID_BUILTIN_CONVERTVECTOR.
	(c_build_vec_convert): Declare.
	* c-common.c (c_build_vec_convert): New function.
c/
	* c-parser.c (c_parser_postfix_expression): Parse
	__builtin_convertvector.
cp/
	* cp-tree.h (cp_build_vec_convert): Declare.
	* parser.c (cp_parser_postfix_expression): Parse
	__builtin_convertvector.
	* constexpr.c: Include fold-const-call.h.
	(cxx_eval_internal_function): Handle IFN_VEC_CONVERT.
	(potential_constant_expression_1): Likewise.
	* semantics.c (cp_build_vec_convert): New function.
	* pt.c (tsubst_copy_and_build): Handle CALL_EXPR to
	IFN_VEC_CONVERT.
testsuite/
	* c-c++-common/builtin-convertvector-1.c: New test.
	* c-c++-common/torture/builtin-convertvector-1.c: New test.
	* g++.dg/ext/builtin-convertvector-1.C: New test.
	* g++.dg/cpp0x/constexpr-builtin4.C: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/builtin-convertvector-1.c
    trunk/gcc/testsuite/c-c++-common/torture/builtin-convertvector-1.c
    trunk/gcc/testsuite/g++.dg/cpp0x/constexpr-builtin4.C
    trunk/gcc/testsuite/g++.dg/ext/builtin-convertvector-1.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/c-family/ChangeLog
    trunk/gcc/c-family/c-common.c
    trunk/gcc/c-family/c-common.h
    trunk/gcc/c/ChangeLog
    trunk/gcc/c/c-parser.c
    trunk/gcc/cp/ChangeLog
    trunk/gcc/cp/constexpr.c
    trunk/gcc/cp/cp-tree.h
    trunk/gcc/cp/parser.c
    trunk/gcc/cp/pt.c
    trunk/gcc/cp/semantics.c
    trunk/gcc/doc/extend.texi
    trunk/gcc/fold-const-call.c
    trunk/gcc/internal-fn.c
    trunk/gcc/internal-fn.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-generic.c
Comment 11 Jakub Jelinek 2019-01-07 10:01:45 UTC
Implemented on the trunk now.  The 4x/8x narrowing/widening conversions will need further work to be handled efficiently, though for 8x conversions we are, e.g. on x86, already outside the realm of natively supported vectors (we don't really want MMX, and for 1024-bit and wider generic vectors we don't always emit the best code).
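
For example, an 8x widening conversion already pairs an MMX-sized (64-bit) source with a 512-bit result on x86 (illustrative types only):

typedef unsigned char u8x8 __attribute__((vector_size(8)));
typedef unsigned long long u64x8 __attribute__((vector_size(64)));

u64x8 widen8x(u8x8 v)
{
    return __builtin_convertvector(v, u64x8);  /* 8x widening: 8-bit -> 64-bit lanes */
}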
Comment 12 Matthias Kretz (Vir) 2019-01-07 12:05:56 UTC
(In reply to Jakub Jelinek from comment #11)
> [...] though for 8x conversions we
> are e.g. on x86 already outside of the realm of natively supported vectors
> (we don't really want MMX and for 1024 bit and wider generic vectors we
> don't always emit best code).

Thinking creatively: consider constants stored as (u)char arrays (for bandwidth optimization) and converted to double or (u)llong when used. I'd want to use a half-SSE load plus a subsequent conversion to an AVX-512 vector (e.g. vpmovsxbq + vcvtqq2pd), or even a full SSE load plus one shift and two conversions to AVX-512.
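
A sketch of that use case (the names, constants, and memcpy-based load are illustrative only):

typedef signed char i8x8 __attribute__((vector_size(8)));
typedef double f64x8 __attribute__((vector_size(64)));

static const signed char coeffs[8] = { 1, -2, 3, -4, 5, -6, 7, -8 };

f64x8 load_coeffs(void)
{
    i8x8 c;
    __builtin_memcpy(&c, coeffs, sizeof(c));   /* half-SSE-sized (64-bit) load */
    return __builtin_convertvector(c, f64x8);  /* ideally vpmovsxbq + vcvtqq2pd */
}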

Similar motivation applies in the reverse direction. (Though it's a lot less likely to be used in practice, I believe. Hmm, maybe AI applications can prove that expectation wrong.)

But we should track optimizations in their own issues.