Summary: | SSE intrinsics hard to use without redundant temporaries appearing | |
---|---|---|---
Product: | gcc | Reporter: | Tim Day <timday>
Component: | tree-optimization | Assignee: | Not yet assigned to anyone <unassigned>
Status: | NEW | |
Severity: | enhancement | CC: | gcc-bugs, jakub, liuhongt, rguenth, timday, uros
Priority: | P3 | Keywords: | missed-optimization
Version: | 4.1.2 | |
Target Milestone: | --- | |
Host: | | Target: | x86_64-*-*, i?86-*-*
Build: | | Known to work: |
Known to fail: | | Last reconfirmed: | 2016-05-10 00:00:00
Bug Depends on: | 28367 | |
Bug Blocks: | | |
Attachments:
- Result of gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp
- More concise demonstration of the v4sf->float->v4sf issue
- Uninclude version of the "More concise demonstration of the v4sf->float->v4sf issue"
- uninclude of the original testcase
Description
Tim Day
2006-11-07 22:22:55 UTC
Created attachment 12566 [details]
Result of gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp
This is the .ii file output from
gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp
Most of it is the result of the .cpp's sole direct include, #include <xmmintrin.h>, which appeared immediately before the "class vector4f" declaration.
Looks like this is mostly caused by:

```
union { __v4sf vecf; __m128 rawf; float val[4]; } _rep;
```

I will look at this issue more later tonight when I get home from work.

I've just tried an alternative version (will upload later) replacing the union with a single __v4sf _rep, and implementing the [] operators using e.g. (reinterpret_cast<const float*>(&_rep))[i]; However, the code generated by the two transform implementations remains the same (20 and 32 instructions anyway; I haven't checked the details yet). Maybe that's not surprising, as it's just moving the problem around.

The big difference between the two methods is perhaps primarily that the bad one involves a __v4sf->float->__v4sf conversion, while the good one uses __v4sf throughout via the mul_compN methods. I'll try to prepare a more concise test case based on the premise that bad handling of __v4sf <-> float is the real issue.

Created attachment 12573 [details]
More concise demonstration of the v4sf->float->v4sf issue.
The attached code (no classes or unions, just a few inline functions), obtained from
gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer v4sf.cpp
compiles transform_good to 18 instructions and transform_bad to 33. However, it's not really surprising that a round-trip through stack temporaries is required when pointer arithmetic is being used to extract a float from a __v4sf. I've no idea whether it's realistic to hope this could ever be optimised away. Alternatively, it would be very nice if the builtin vector types simply provided a [] operator, or if there were some intrinsics for extracting floats from a __v4sf.
(In the meantime, in the original vector4f class, remaining in the __v4sf domain by having the const operator[] return a suitably type-wrapped __v4sf "filled" with the specified component seems to be a promising direction).
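The redundancy can be seen even in a scalar model of the two approaches: extracting lane N to a float and re-broadcasting it (the fill-of-component pattern) computes exactly the same vector as broadcasting lane N directly in the vector domain, so the extract/re-insert round-trip is pure overhead. A minimal sketch modelling the __v4sf lanes with a plain float array (the struct and function names here are illustrative, not from the testcase):

```c
#include <assert.h>
#include <string.h>

/* Model a __v4sf as four float lanes. */
typedef struct { float lane[4]; } v4;

/* "Bad" path: extract lane n to a scalar, then broadcast the scalar. */
static v4 fill_from_component(v4 v, int n)
{
    float f = v.lane[n];            /* __v4sf -> float */
    v4 r = { { f, f, f, f } };      /* float -> __v4sf */
    return r;
}

/* "Good" path: broadcast lane n without leaving the vector domain,
   which is what shufps with immediate (n<<6)|(n<<4)|(n<<2)|n does. */
static v4 component_fill(v4 v, int n)
{
    v4 r = { { v.lane[n], v.lane[n], v.lane[n], v.lane[n] } };
    return r;
}
```

Since the two routines are extensionally equal, an optimiser that recognised the pattern could rewrite one into the other and drop the stack temporaries.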
This is mostly PR 28367. There are most likely other issues, like some of the SSE intrinsics not being declared as pure/const.

So what is missing here is avoiding 'v' for

```
_26 = BIT_FIELD_REF <v_13(D), 32, 0>;
BIT_FIELD_REF <v, 32, 0> = _26;
v.1_24 = v;
_25 = __builtin_ia32_shufps (v.1_24, v.1_24, 0);
v ={v} {CLOBBER};
```

which can be done with a new BIT_FIELD_EXPR like so:

```
v_24 = BIT_FIELD_EXPR <v_n(D), _26, 32, 0>;
```

So I have it down to an x86 combine issue:

```
;; v_28 = BIT_FIELD_INSERT <v_27(D), _30, 0 (32 bits)>;
(insn 7 6 8 (set (reg:SF 116)
        (vec_select:SF (reg/v:V4SF 115 [ v ])
            (parallel [
                    (const_int 0 [0])
                ]))) t.c:5 -1
     (nil))
(insn 8 7 9 (set (reg:V4SF 117)
        (reg/v:V4SF 109 [ v ])) t.c:11 -1
     (nil))
(insn 9 8 10 (set (reg:V4SF 117)
        (vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
            (reg:V4SF 117)
            (const_int 1 [0x1]))) t.c:11 -1
     (nil))
(insn 10 9 0 (set (reg/v:V4SF 110 [ v ])
        (reg:V4SF 117)) t.c:11 -1
     (nil))
```

that's from what vec_set_optab produces.

```
;; _29 = __builtin_ia32_shufps (v_28, v_28, 0);
(insn 11 10 12 (set (reg:V4SF 119)
        (reg/v:V4SF 110 [ v ])) t.c:12 -1
     (nil))
(insn 12 11 13 (set (reg:V4SF 120)
        (reg/v:V4SF 110 [ v ])) t.c:12 -1
     (nil))
(insn 13 12 14 (set (reg:V4SF 118)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 119)
                (reg:V4SF 120))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:12 -1
     (nil))
(insn 14 13 0 (set (reg:V4SF 111 [ _29 ])
        (reg:V4SF 118)) t.c:12 -1
     (nil))
```

and that's the shuffle.
And after combine we have

```
(insn 7 4 53 2 (set (reg:SF 116)
        (vec_select:SF (reg/v:V4SF 115 [ v ])
            (parallel [
                    (const_int 0 [0])
                ]))) t.c:5 2423 {*vec_extractv4sf_0}
     (nil))
(insn 9 53 13 2 (set (reg:V4SF 117 [ v ])
        (vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
            (const_vector:V4SF [
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                ])
            (const_int 1 [0x1]))) t.c:11 2420 {vec_setv4sf_0}
     (expr_list:REG_DEAD (reg:SF 116)
        (nil)))
(insn 13 9 15 2 (set (reg:V4SF 118)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 117 [ v ])
                (reg:V4SF 117 [ v ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:12 2405 {sse_shufps_v4sf}
     (expr_list:REG_DEAD (reg:V4SF 117 [ v ])
        (nil)))
```

which combine doesn't manage to get down to

```
(insn 9 4 13 2 (set (reg:V4SF 104)
        (vec_select:V4SF (vec_concat:V8SF (reg/v:V4SF 103 [ v ])
                (reg/v:V4SF 103 [ v ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:18 2405 {sse_shufps_v4sf}
     (nil))
```

The testcase was the following.

```cpp
#include <emmintrin.h>

template <int N> inline float component(__v4sf v)
{
  return (reinterpret_cast<const float*>(&v))[N];
}

inline __v4sf fill(float f)
{
  __v4sf v;
  *(reinterpret_cast<float*>(&v)) = f;
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v), 0));
}

template <int N> inline __v4sf component_fill(__v4sf v)
{
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v),
                                          ((((N) << 6) | ((N) << 4) | ((N) << 2) | (N)))));
}

__v4sf transform_bad(__v4sf m[4], __v4sf v)
{
  return m[0]*fill(component<0>(v))
        +m[1]*fill(component<1>(v))
        +m[2]*fill(component<2>(v))
        +m[3]*fill(component<3>(v));
}

__v4sf transform_good(__v4sf m[4], __v4sf v)
{
  return m[0]*component_fill<0>(v)
        +m[1]*component_fill<1>(v)
        +m[2]*component_fill<2>(v)
        +m[3]*component_fill<3>(v);
}
```

So the remaining piece may be that of the init-regs issue.
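The mask arithmetic in component_fill packs the same 2-bit lane index into all four selector fields of the shufps immediate, so a splat of lane N uses the immediate 0x00, 0x55, 0xAA or 0xFF; this is why a later comment shows `__builtin_ia32_shufps (..., 85)` for a lane-1 splat (85 == 0x55). A quick check of the encoding:

```c
#include <assert.h>

/* Build the shufps immediate that replicates lane n into all four
   result lanes, exactly as component_fill<N> does in the testcase. */
static int splat_mask(int n)
{
    return (n << 6) | (n << 4) | (n << 2) | n;
}
```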
We have

```
vf_24 = BIT_INSERT_EXPR <vf_23(D), _26, 0 (32 bits)>;
```

which leaves the upper elements undefined, but init-regs forces them to zero.

Another issue is that in

```
_26 = BIT_FIELD_REF <v_13(D), 32, 32>;
vf_24 = BIT_INSERT_EXPR <vf_23(D), _26, 0 (32 bits)>;
_25 = __builtin_ia32_shufps (vf_24, vf_24, 0);
```

the shufps is not exposed to gimple optimizations and thus we can't simplify it in any way. Only the backend knows that it could be simplified to

```
_25 = __builtin_ia32_shufps (vf_13(D), vf_13(D), 85);
```

so the backend might want to "expand" __builtin_ia32_shufps to a VEC_PERM_EXPR in its target-specific builtin folding hook (making sure the reverse works well enough, obviously).

Uros, see comment#8 - would that be acceptable? The other alternative is to try using __builtin_shuffle[2] in the intrinsic headers, but that might be somewhat difficult.

(In reply to Richard Biener from comment #9)
> Uros, see comment#8 - would that be acceptable? The other alternative is to
> try using __builtin_shuffle[2] in the intrinsic headers but that might be
> somewhat difficult.

I have added Jakub to CC; he is the expert in various permutation approaches for the x86 target.
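The proposed expansion would turn the opaque builtin into a VEC_PERM_EXPR, whose semantics are simple enough to model in scalar code: each selector picks a lane from the concatenation of the two input vectors (0-3 from the first, 4-7 from the second). A sketch of those semantics (illustrative names, not compiler code), with `shufps (v, v, 0)` reducing to a lane-0 splat as in the dumps above:

```c
#include <assert.h>

typedef struct { float lane[4]; } v4;

/* Scalar model of a two-input VEC_PERM_EXPR on 4-lane vectors:
   selector values 0-3 index into a, 4-7 index into b. */
static v4 vec_perm(v4 a, v4 b, const int sel[4])
{
    v4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = sel[i] < 4 ? a.lane[sel[i]] : b.lane[sel[i] - 4];
    return r;
}
```

Once the builtin is expressed this way, generic match.pd-style simplifications can reason about the permutation instead of treating it as a black box.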
Like

```diff
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 236441)
+++ gcc/config/i386/i386.c      (working copy)
@@ -37745,6 +37745,23 @@ ix86_fold_builtin (tree fndecl, int n_ar
           gcc_assert (n_args == 1);
           return fold_builtin_cpu (fndecl, args);
         }
+      if (fn_code == IX86_BUILTIN_SHUFPS
+         && n_args == 3
+         && TREE_CODE (args[2]) == INTEGER_CST)
+       {
+         tree mask[4];
+         tree mtype = build_vector_type (integer_type_node, 4);
+         mask[0] = build_int_cst (integer_type_node,
+                                  TREE_INT_CST_LOW (args[2]) & 3);
+         mask[1] = build_int_cst (integer_type_node,
+                                  (TREE_INT_CST_LOW (args[2]) >> 2) & 3);
+         mask[2] = build_int_cst (integer_type_node,
+                                  ((TREE_INT_CST_LOW (args[2]) >> 4) & 3) + 4);
+         mask[3] = build_int_cst (integer_type_node,
+                                  ((TREE_INT_CST_LOW (args[2]) >> 6) & 3) + 4);
+         return fold_build3 (VEC_PERM_EXPR, TREE_TYPE (TREE_TYPE (fndecl)),
+                             args[0], args[1], build_vector (mtype, mask));
+       }
     }
 #ifdef SUBTARGET_FOLD_BUILTIN
```

given the plethora of shuffling intrinsics this might be quite tedious work...

(In reply to Richard Biener from comment #11)
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c      (revision 236441)
> +++ gcc/config/i386/i386.c      (working copy)
> ...
> given the plethora of shuffling intrinsics this might be quite tedious
> work...

The builtins aren't guaranteed to be usable directly, only the intrinsics are, so if we want to do the above, we should just kill those builtins instead and use __builtin_shuffle directly in the headers (plus of course each time verify that we get the corresponding or better insn sequence).

On Thu, 19 May 2016, jakub at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
>
> --- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #11)
>
> > Index: gcc/config/i386/i386.c
> > ===================================================================
> > --- gcc/config/i386/i386.c (revision 236441)
> > +++ gcc/config/i386/i386.c (working copy)
> ...
> > given the plethora of shuffling intrinsics this might be quite tedious
> > work...
>
> The builtins aren't guaranteed to be usable directly, only the intrinsics are,
> so if we want to do the above, we should just kill those builtins instead and
> use __builtin_shuffle directly in the headers (plus of course each time verify
> that we get the corresponding or better insn sequence).
Yes, but that will result in sth like

```cpp
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
                                      __artificial__))
_mm_shuffle_ps (__m128 __A, __m128 __B, int const __mask)
{
  return (__m128) __builtin_shuffle2 ((__v4sf)__A, (__v4sf)__B,
                                      (__v4si) { __mask & 3, (__mask >> 2) & 3,
                                                 ((__mask >> 4) & 3) + 4,
                                                 ((__mask >> 6) & 3) + 4 });
}
```
(not sure if we still need the !__OPTIMIZE__ path or what we should do for
that in general in the above context - once !__OPTIMIZE__ would no
longer constant-fold or so)
But if this would be the preferred way of addressing this, that's clearly
better than "folding" the stuff back.
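Both the i386.c fold above and this header sketch decode the 8-bit shufps immediate the same way: two 2-bit fields select the two low result lanes from the first operand, and two more select the two high result lanes from the second operand (hence the +4 bias into the concatenated index space). A plain-C check of that decoding (the helper name is illustrative):

```c
#include <assert.h>

/* Decode a shufps immediate into four VEC_PERM_EXPR-style indices,
   mirroring the proposed ix86_fold_builtin change: result lanes 0-1
   come from the first operand, lanes 2-3 from the second (biased by 4). */
static void decode_shufps(int mask, int idx[4])
{
    idx[0] = mask & 3;
    idx[1] = (mask >> 2) & 3;
    idx[2] = ((mask >> 4) & 3) + 4;
    idx[3] = ((mask >> 6) & 3) + 4;
}
```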
Author: rguenth
Date: Fri May 20 09:17:16 2016
New Revision: 236501
URL: https://gcc.gnu.org/viewcvs?rev=236501&root=gcc&view=rev

Log:
2016-05-20  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/29756
	* tree.def (BIT_INSERT_EXPR): New tcc_expression tree code.
	* expr.c (expand_expr_real_2): Handle BIT_INSERT_EXPR.
	* fold-const.c (operand_equal_p): Likewise.
	(fold_ternary_loc): Add constant folding of BIT_INSERT_EXPR.
	* gimplify.c (gimplify_expr): Handle BIT_INSERT_EXPR.
	* tree-inline.c (estimate_operator_cost): Likewise.
	* tree-pretty-print.c (dump_generic_node): Likewise.
	* tree-ssa-operands.c (get_expr_operands): Likewise.
	* cfgexpand.c (expand_debug_expr): Likewise.
	* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
	* gimple.c (get_gimple_rhs_num_ops): Handle BIT_INSERT_EXPR.
	* tree-cfg.c (verify_gimple_assign_ternary): Verify BIT_INSERT_EXPR.
	* tree-ssa.c (non_rewritable_lvalue_p): We can rewrite vector
	inserts using BIT_FIELD_REF or MEM_REF on the lhs.
	(execute_update_addresses_taken): Do it.

	* gcc.dg/tree-ssa/vector-6.c: New testcase.

Added:
	trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c
Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/cfgexpand.c
	trunk/gcc/expr.c
	trunk/gcc/fold-const.c
	trunk/gcc/gimple-pretty-print.c
	trunk/gcc/gimple.c
	trunk/gcc/gimplify.c
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/tree-cfg.c
	trunk/gcc/tree-inline.c
	trunk/gcc/tree-pretty-print.c
	trunk/gcc/tree-ssa-operands.c
	trunk/gcc/tree-ssa.c
	trunk/gcc/tree.def

Author: jakub
Date: Fri May 20 11:55:58 2016
New Revision: 236505
URL: https://gcc.gnu.org/viewcvs?rev=236505&root=gcc&view=rev

Log:
	PR tree-optimization/29756
	gcc.dg/tree-ssa/vector-6.c: Add -Wno-psabi -w to dg-options.
	Add -msse2 for x86 and -maltivec for powerpc.  Use
	scan-tree-dump-times only on selected targets where V4SImode
	vectors are known to be supported.

Modified:
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c

Ok, not mine anymore (for the intrinsics rewrite). Target bug now.
Confirmed on the trunk; this is no longer a target issue: __builtin_ia32_shufps is lowered into a VEC_PERM_EXPR now:

```
_1 = MEM[(const float &)v_7(D) + 12];
MEM[(float &)&D.5891] = _1;
_24 = D.5891._rep.vecf;
_25 = VEC_PERM_EXPR <_24, _24, { 0, 0, 0, 0 }>;
```

Good:

```
<bb 2> [local count: 1073741824]:
_1 = *m_12(D);
_14 = VEC_PERM_EXPR <v_13(D), v_13(D), { 0, 0, 0, 0 }>;
_2 = _1 * _14;
_3 = MEM[(__v4sf *)m_12(D) + 16B];
_15 = VEC_PERM_EXPR <v_13(D), v_13(D), { 1, 1, 1, 1 }>;
_4 = _3 * _15;
_5 = _2 + _4;
_6 = MEM[(__v4sf *)m_12(D) + 32B];
_16 = VEC_PERM_EXPR <v_13(D), v_13(D), { 2, 2, 2, 2 }>;
_7 = _6 * _16;
_8 = _5 + _7;
_9 = MEM[(__v4sf *)m_12(D) + 48B];
_17 = VEC_PERM_EXPR <v_13(D), v_13(D), { 3, 3, 3, 3 }>;
_10 = _9 * _17;
_18 = _8 + _10;
return _18;
```

Bad:

```
<bb 2> [local count: 1073741824]:
_1 = *m_12(D);
_30 = BIT_FIELD_REF <v_13(D), 32, 0>;
v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
_29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;
_2 = _1 * _29;
_3 = MEM[(__v4sf *)m_12(D) + 16B];
_26 = BIT_FIELD_REF <v_13(D), 32, 32>;
v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
_25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;
_4 = _3 * _25;
_5 = _2 + _4;
_6 = MEM[(__v4sf *)m_12(D) + 32B];
_14 = BIT_FIELD_REF <v_13(D), 32, 64>;
v_16 = BIT_INSERT_EXPR <v_17(D), _14, 0>;
_15 = VEC_PERM_EXPR <v_16, v_16, { 0, 0, 0, 0 }>;
_7 = _6 * _15;
_8 = _5 + _7;
_9 = MEM[(__v4sf *)m_12(D) + 48B];
_18 = BIT_FIELD_REF <v_13(D), 32, 96>;
v_20 = BIT_INSERT_EXPR <v_21(D), _18, 0>;
_19 = VEC_PERM_EXPR <v_20, v_20, { 0, 0, 0, 0 }>;
_10 = _9 * _19;
_22 = _8 + _10;
return _22;
```

So what's missing is converting the extract element, insert at 0 & splat into a splat of element N.

```
_30 = BIT_FIELD_REF <v_13(D), 32, 0>;
v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
_29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;
```

shows a missing no-op (an insert into a default-def at position 0, from an extract from the same position, can simply return the vector we extract from).
```
_26 = BIT_FIELD_REF <v_13(D), 32, 32>;
v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
_25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;
```

is a bit more complicated - the VEC_PERM_EXPR indices should be modified based on the fact that we only pick the just-inserted elements, and those were extracted from another (compatible) vector.

Created attachment 57941 [details]
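The two missing simplifications can also be stated operationally: extracting lane N, inserting it at position 0 of some vector, and then splatting lane 0 is just a splat of lane N of the original vector — and for N == 0 into a default-def the insert is a no-op entirely. A scalar sketch of the fold (illustrative names, modelling lanes as a float array):

```c
#include <assert.h>
#include <string.h>

typedef struct { float lane[4]; } v4;

/* What the "bad" GIMPLE does for lane n. */
static v4 extract_insert_splat(v4 v, v4 scratch, int n)
{
    float f = v.lane[n];          /* BIT_FIELD_REF <v, 32, 32*n>       */
    scratch.lane[0] = f;          /* BIT_INSERT_EXPR <scratch, f, 0>   */
    v4 r = { { scratch.lane[0], scratch.lane[0],
               scratch.lane[0], scratch.lane[0] } };
    return r;                     /* VEC_PERM_EXPR <s, s, {0,0,0,0}>   */
}

/* The fold: a direct splat of lane n of the original vector. */
static v4 splat(v4 v, int n)
{
    v4 r = { { v.lane[n], v.lane[n], v.lane[n], v.lane[n] } };
    return r;
}
```

The scratch vector's other lanes never reach the result, which is exactly why the permutation indices can be rewritten to reference the original vector directly.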
Uninclude version of the "More concise demonstration of the v4sf->float->v4sf issue"
Unincluded version so it is easier to test with newer compilers.
Looks fixed on the trunk; will look into what fixed it in a few minutes. transform_bad no longer has BIT_INSERT_EXPR.

r14-3381-g27de9aa152141e combined with

r13-3212-gb88adba751da63
r13-3271-g786e4c024f9416

fixed the "More concise demonstration of the v4sf->float->v4sf issue" testcase.

Created attachment 57942 [details]
uninclude of the original testcase
The original testcase still has an issue though. We get:

```
_1 = MEM[(const float &)v_7(D) + 12];
MEM[(float &)&D.6716] = _1;
_24 = D.6716._rep.vecf;
_25 = VEC_PERM_EXPR <_24, _24, { 0, 0, 0, 0 }>;
```
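As an aside on the original wish for a [] operator: modern GCC and Clang allow subscripting generic vector values directly, so the component access that originally required the union or reinterpret_cast can be written as v[N], staying out of memory entirely. A minimal sketch using the generic vector_size extension (not the xmmintrin.h types from the testcase; the function name is illustrative):

```c
#include <assert.h>

/* Generic GCC/Clang vector type; values of this type can be
   subscripted like arrays in modern compilers. */
typedef float v4sf __attribute__((vector_size(16)));

/* Broadcast lane n without any union or pointer reinterpretation. */
static v4sf splat_lane(v4sf v, int n)
{
    float f = v[n];               /* direct element access */
    return (v4sf){ f, f, f, f };
}
```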