I've been adapting an old code's simple 4-float vector class to use SSE via the intrinsic functions. It seems quite hard to avoid the generated assembly being diluted by apparently redundant spills of intermediate results to the stack. In the assembly produced from the file to be attached, compare the code generated for matrix44f::transform_good and matrix44f::transform_bad. The former is 20 instructions and apparently optimal. However, it was only arrived at by prodding the latter version of the function (which does exactly the same thing, expressed more naturally, but results in 32 instructions) until the stack temporaries went away. It would be nice if both versions of the function generated optimal code, and there doesn't seem to be any particular reason they shouldn't. Both versions' assembly contains the same expected numbers of shuffle, multiply and add instructions; the excess all seems to involve extra stack temporaries. [I'm not sure what the "triplet" codes on this form are. I'm using the gcc in Debian Etch; gcc --version shows "gcc (GCC) 4.1.2 20060901 (prerelease) (Debian 4.1.1-13)"; the platform is a Pentium 3. Sorry if "inline-asm" is a completely inappropriate component to assign this to.]
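For context, the shape of the computation — a 4x4 matrix times vector built from per-component splats and multiply-adds — can be sketched with GCC's generic vector extensions. This is an illustrative stand-in for the attached vector4f/matrix44f code, not the code itself; all names here are invented:

```cpp
#include <cassert>

// Illustrative stand-in for the attached vector4f/matrix44f code, using
// GCC's generic vector extensions instead of raw SSE intrinsics.
typedef float v4sf __attribute__ ((vector_size (16)));

// Splat component N of v across all four lanes (what shufps computes
// with an immediate of N * 0x55).
template <int N>
static inline v4sf splat (v4sf v)
{
  v4sf r = { v[N], v[N], v[N], v[N] };
  return r;
}

// result = m[0]*v[0] + m[1]*v[1] + m[2]*v[2] + m[3]*v[3]
static inline v4sf transform (const v4sf m[4], v4sf v)
{
  return m[0] * splat<0> (v) + m[1] * splat<1> (v)
       + m[2] * splat<2> (v) + m[3] * splat<3> (v);
}
```

With the identity matrix this returns its input unchanged, which makes a convenient sanity check.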
Created attachment 12566 [details]
Result of gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp

This is the .ii file output from gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp. Most of it is the result of the .cpp's sole direct include, #include <xmmintrin.h>, which appeared immediately before the "class vector4f" declaration.
Looks like this is mostly caused by:

  union
  {
    __v4sf vecf;
    __m128 rawf;
    float val[4];
  } _rep;

I will have more of a look at this issue later tonight when I get home from work.
I've just tried an alternative version (will upload later) replacing the union with a single __v4sf _rep, and implementing the [] operators using e.g. (reinterpret_cast<const float*>(&_rep))[i]. However, the code generated for the two transform implementations remains the same (20 and 32 instructions anyway; I haven't checked the details yet). Maybe not surprising, as it's just moving the problem around. The big difference between the two methods is perhaps primarily that the bad one involves a __v4sf -> float -> __v4sf conversion, while the good one stays in __v4sf throughout by using the mul_compN methods. I'll try to prepare a more concise test case based on the premise that bad handling of __v4sf <-> float is the real issue.
Created attachment 12573 [details]
More concise demonstration of the v4sf->float->v4sf issue.

The attached code (no classes or unions, just a few inline functions), compiled with gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer v4sf.cpp, compiles transform_good to 18 instructions and transform_bad to 33. However, it's not really surprising that a round-trip through stack temporaries is required when pointer arithmetic is being used to extract a float from a __v4sf. I've no idea whether it's realistic to hope this could ever be optimised away. Alternatively, it would be very nice if the builtin vector types simply provided a [] operator, or if there were some intrinsics for extracting floats from a __v4sf. (In the meantime, in the original vector4f class, remaining in the __v4sf domain by having the const operator[] return a suitably type-wrapped __v4sf "filled" with the specified component seems to be a promising direction.)
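As it happens, later GCC releases grew exactly that: since GCC 4.6 the generic vector types can be subscripted with [] directly, so a component can be read (or written) without unions or pointer casts. A minimal sketch, independent of the attached class:

```cpp
#include <cassert>

typedef float v4sf __attribute__ ((vector_size (16)));

// Read component n directly: no reinterpret_cast, no union, no explicit
// trip through a stack temporary at the source level.
static inline float component (v4sf v, int n)
{
  return v[n];  // vector subscripting, supported since GCC 4.6
}

// Build a vector with all four lanes equal to f.
static inline v4sf fill (float f)
{
  v4sf r = { f, f, f, f };
  return r;
}
```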
This is mostly PR 28367. There are most likely other issues like some of the SSE intrinsics not being declared as pure/const.
So what is missing here is avoiding 'v' for

  _26 = BIT_FIELD_REF <v_13(D), 32, 0>;
  BIT_FIELD_REF <v, 32, 0> = _26;
  v.1_24 = v;
  _25 = __builtin_ia32_shufps (v.1_24, v.1_24, 0);
  v ={v} {CLOBBER};

which can be done with a new BIT_FIELD_EXPR like so:

  v_24 = BIT_FIELD_EXPR <v_n(D), _26, 32, 0>;
So I have it down to an x86 combine issue:

;; v_28 = BIT_FIELD_INSERT <v_27(D), _30, 0 (32 bits)>;

(insn 7 6 8 (set (reg:SF 116)
        (vec_select:SF (reg/v:V4SF 115 [ v ])
            (parallel [
                    (const_int 0 [0])
                ]))) t.c:5 -1
     (nil))

(insn 8 7 9 (set (reg:V4SF 117)
        (reg/v:V4SF 109 [ v ])) t.c:11 -1
     (nil))

(insn 9 8 10 (set (reg:V4SF 117)
        (vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
            (reg:V4SF 117)
            (const_int 1 [0x1]))) t.c:11 -1
     (nil))

(insn 10 9 0 (set (reg/v:V4SF 110 [ v ])
        (reg:V4SF 117)) t.c:11 -1
     (nil))

that's from what vec_set_optab produces

;; _29 = __builtin_ia32_shufps (v_28, v_28, 0);

(insn 11 10 12 (set (reg:V4SF 119)
        (reg/v:V4SF 110 [ v ])) t.c:12 -1
     (nil))

(insn 12 11 13 (set (reg:V4SF 120)
        (reg/v:V4SF 110 [ v ])) t.c:12 -1
     (nil))

(insn 13 12 14 (set (reg:V4SF 118)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 119)
                (reg:V4SF 120))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:12 -1
     (nil))

(insn 14 13 0 (set (reg:V4SF 111 [ _29 ])
        (reg:V4SF 118)) t.c:12 -1
     (nil))

and that's the shuffle.
And after combine we have

(insn 7 4 53 2 (set (reg:SF 116)
        (vec_select:SF (reg/v:V4SF 115 [ v ])
            (parallel [
                    (const_int 0 [0])
                ]))) t.c:5 2423 {*vec_extractv4sf_0}
     (nil))

(insn 9 53 13 2 (set (reg:V4SF 117 [ v ])
        (vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
            (const_vector:V4SF [
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                ])
            (const_int 1 [0x1]))) t.c:11 2420 {vec_setv4sf_0}
     (expr_list:REG_DEAD (reg:SF 116)
        (nil)))

(insn 13 9 15 2 (set (reg:V4SF 118)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 117 [ v ])
                (reg:V4SF 117 [ v ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:12 2405 {sse_shufps_v4sf}
     (expr_list:REG_DEAD (reg:V4SF 117 [ v ])
        (nil)))

which combine doesn't manage to get down to

(insn 9 4 13 2 (set (reg:V4SF 104)
        (vec_select:V4SF (vec_concat:V8SF (reg/v:V4SF 103 [ v ])
                (reg/v:V4SF 103 [ v ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:18 2405 {sse_shufps_v4sf}
     (nil))

The testcase was the following.

#include <emmintrin.h>

template <int N> inline float component(__v4sf v)
{
  return (reinterpret_cast<const float*>(&v))[N];
}

inline __v4sf fill(float f)
{
  __v4sf v;
  *(reinterpret_cast<float*>(&v)) = f;
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v), 0));
}

template <int N> inline __v4sf component_fill(__v4sf v)
{
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v),
                                          ((((N) << 6) | ((N) << 4) | ((N) << 2) | (N)))));
}

__v4sf transform_bad(__v4sf m[4], __v4sf v)
{
  return m[0]*fill(component<0>(v))
        +m[1]*fill(component<1>(v))
        +m[2]*fill(component<2>(v))
        +m[3]*fill(component<3>(v));
}

__v4sf transform_good(__v4sf m[4], __v4sf v)
{
  return m[0]*component_fill<0>(v)
        +m[1]*component_fill<1>(v)
        +m[2]*component_fill<2>(v)
        +m[3]*component_fill<3>(v);
}
So the remaining piece may be that of the init-regs issue. We have

  vf_24 = BIT_INSERT_EXPR <vf_23(D), _26, 0 (32 bits)>;

which leaves the upper elements undefined, but init-regs forces them to zero.

Another issue is that in

  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  vf_24 = BIT_INSERT_EXPR <vf_23(D), _26, 0 (32 bits)>;
  _25 = __builtin_ia32_shufps (vf_24, vf_24, 0);

the shufps is not exposed to gimple optimizations and thus we can't simplify it in any way. Only the backend knows that it could be simplified to

  _25 = __builtin_ia32_shufps (vf_13(D), vf_13(D), 85);

so the backend might want to "expand" __builtin_ia32_shufps to a VEC_PERM_EXPR in its target-specific builtin folding hook (making sure the reverse works well enough, obviously).
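The shufps immediate maps onto VEC_PERM_EXPR indices mechanically: each bit-pair of the immediate selects a source lane, the low two pairs from the first operand and the high two from the second (numbered 4-7 in the concatenated index space). A small sketch of that decoding, assuming the standard shufps lane encoding:

```cpp
#include <cassert>

// Decode an 8-bit shufps immediate into four VEC_PERM_EXPR-style indices
// over the concatenation of the two source vectors (lanes of the second
// operand are numbered 4..7).
static void shufps_to_perm (unsigned imm, int idx[4])
{
  idx[0] = imm & 3;               // lane taken from first operand
  idx[1] = (imm >> 2) & 3;        // lane taken from first operand
  idx[2] = ((imm >> 4) & 3) + 4;  // lane taken from second operand
  idx[3] = ((imm >> 6) & 3) + 4;  // lane taken from second operand
}
```

With both operands the same vector, immediate 85 (0b01010101) decodes to { 1, 1, 5, 5 }, i.e. a splat of element 1 — the simplified form mentioned above.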
Uros, see comment#8 - would that be acceptable? The other alternative is to try using __builtin_shuffle[2] in the intrinsic headers but that might be somewhat difficult.
(In reply to Richard Biener from comment #9)
> Uros, see comment#8 - would that be acceptable? The other alternative is to
> try using __builtin_shuffle[2] in the intrinsic headers but that might be
> somewhat difficult.

I have added Jakub to CC, he is the expert in various permutation approaches for the x86 target.
Like

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 236441)
+++ gcc/config/i386/i386.c      (working copy)
@@ -37745,6 +37745,23 @@ ix86_fold_builtin (tree fndecl, int n_ar
       gcc_assert (n_args == 1);
       return fold_builtin_cpu (fndecl, args);
     }
+      if (fn_code == IX86_BUILTIN_SHUFPS
+          && n_args == 3
+          && TREE_CODE (args[2]) == INTEGER_CST)
+        {
+          tree mask[4];
+          tree mtype = build_vector_type (integer_type_node, 4);
+          mask[0] = build_int_cst (integer_type_node,
+                                   TREE_INT_CST_LOW (args[2]) & 3);
+          mask[1] = build_int_cst (integer_type_node,
+                                   (TREE_INT_CST_LOW (args[2]) >> 2) & 3);
+          mask[2] = build_int_cst (integer_type_node,
+                                   ((TREE_INT_CST_LOW (args[2]) >> 4) & 3) + 4);
+          mask[3] = build_int_cst (integer_type_node,
+                                   ((TREE_INT_CST_LOW (args[2]) >> 6) & 3) + 4);
+          return fold_build3 (VEC_PERM_EXPR, TREE_TYPE (TREE_TYPE (fndecl)),
+                              args[0], args[1], build_vector (mtype, mask));
+        }
     }
 #ifdef SUBTARGET_FOLD_BUILTIN

given the plethora of shuffling intrinsics this might be quite tedious work...
(In reply to Richard Biener from comment #11)
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c (revision 236441)
> +++ gcc/config/i386/i386.c (working copy)
...
> given the plethora of shuffling intrinsics this might be quite tedious
> work...

The builtins aren't guaranteed to be usable directly, only the intrinsics are, so if we want to do the above, we should just kill those builtins instead and use __builtin_shuffle directly in the headers (plus of course each time verify that we get the corresponding or better insn sequence).
On Thu, 19 May 2016, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
>
> --- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #11)
> > Index: gcc/config/i386/i386.c
> > ===================================================================
> > --- gcc/config/i386/i386.c (revision 236441)
> > +++ gcc/config/i386/i386.c (working copy)
> ...
> > given the plethora of shuffling intrinsics this might be quite tedious
> > work...
>
> The builtins aren't guaranteed to be usable directly, only the intrinsics are,
> so if we want to do the above, we should just kill those builtins instead and
> use __builtin_shuffle directly in the headers (plus of course each time verify
> that we get the corresponding or better insn sequence).

Yes, but that will result in sth like

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_shuffle_ps (__m128 __A, __m128 __B, int const __mask)
{
  return (__m128) __builtin_shuffle ((__v4sf)__A, (__v4sf)__B,
                                     (__v4si) { __mask & 3,
                                                (__mask >> 2) & 3,
                                                ((__mask >> 4) & 3) + 4,
                                                ((__mask >> 6) & 3) + 4 });
}

(not sure if we still need the !__OPTIMIZE__ path or what we should do for that in general in the above context - once !__OPTIMIZE__ would no longer constant-fold or so)

But if this would be the preferred way of addressing this, that's clearly better than "folding" the stuff back.
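A self-contained, compilable variant of that header sketch, using the two-input form of __builtin_shuffle (GCC-specific; shuffle_ps here is an invented stand-in name for _mm_shuffle_ps, and the mask vector spells out the same index arithmetic as the fold proposed in comment #11):

```cpp
#include <cassert>

typedef float v4sf __attribute__ ((vector_size (16)));
typedef int   v4si __attribute__ ((vector_size (16)));

// Stand-in for _mm_shuffle_ps written via __builtin_shuffle instead of
// __builtin_ia32_shufps.  Mask indices 0-3 pick lanes of a, 4-7 lanes
// of b; note m need not be a compile-time constant in this form.
static inline v4sf shuffle_ps (v4sf a, v4sf b, int m)
{
  v4si mask = { m & 3, (m >> 2) & 3,
                ((m >> 4) & 3) + 4, ((m >> 6) & 3) + 4 };
  return __builtin_shuffle (a, b, mask);
}
```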
Author: rguenth
Date: Fri May 20 09:17:16 2016
New Revision: 236501

URL: https://gcc.gnu.org/viewcvs?rev=236501&root=gcc&view=rev
Log:
2016-05-20  Richard Guenther  <rguenther@suse.de>

        PR tree-optimization/29756
        * tree.def (BIT_INSERT_EXPR): New tcc_expression tree code.
        * expr.c (expand_expr_real_2): Handle BIT_INSERT_EXPR.
        * fold-const.c (operand_equal_p): Likewise.
        (fold_ternary_loc): Add constant folding of BIT_INSERT_EXPR.
        * gimplify.c (gimplify_expr): Handle BIT_INSERT_EXPR.
        * tree-inline.c (estimate_operator_cost): Likewise.
        * tree-pretty-print.c (dump_generic_node): Likewise.
        * tree-ssa-operands.c (get_expr_operands): Likewise.
        * cfgexpand.c (expand_debug_expr): Likewise.
        * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
        * gimple.c (get_gimple_rhs_num_ops): Handle BIT_INSERT_EXPR.
        * tree-cfg.c (verify_gimple_assign_ternary): Verify BIT_INSERT_EXPR.
        * tree-ssa.c (non_rewritable_lvalue_p): We can rewrite
        vector inserts using BIT_FIELD_REF or MEM_REF on the lhs.
        (execute_update_addresses_taken): Do it.

        * gcc.dg/tree-ssa/vector-6.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
    trunk/gcc/expr.c
    trunk/gcc/fold-const.c
    trunk/gcc/gimple-pretty-print.c
    trunk/gcc/gimple.c
    trunk/gcc/gimplify.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-cfg.c
    trunk/gcc/tree-inline.c
    trunk/gcc/tree-pretty-print.c
    trunk/gcc/tree-ssa-operands.c
    trunk/gcc/tree-ssa.c
    trunk/gcc/tree.def
Author: jakub
Date: Fri May 20 11:55:58 2016
New Revision: 236505

URL: https://gcc.gnu.org/viewcvs?rev=236505&root=gcc&view=rev
Log:
        PR tree-optimization/29756
        * gcc.dg/tree-ssa/vector-6.c: Add -Wno-psabi -w to dg-options.
        Add -msse2 for x86 and -maltivec for powerpc.  Use
        scan-tree-dump-times only on selected targets where V4SImode
        vectors are known to be supported.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c
OK, not mine anymore (my part was the intrinsics rewrite). This is a target bug now.
Confirmed on the trunk; this is no longer a target issue, as __builtin_ia32_shufps is lowered into a VEC_PERM_EXPR now:

  _1 = MEM[(const float &)v_7(D) + 12];
  MEM[(float &)&D.5891] = _1;
  _24 = D.5891._rep.vecf;
  _25 = VEC_PERM_EXPR <_24, _24, { 0, 0, 0, 0 }>;
Good:

  <bb 2> [local count: 1073741824]:
  _1 = *m_12(D);
  _14 = VEC_PERM_EXPR <v_13(D), v_13(D), { 0, 0, 0, 0 }>;
  _2 = _1 * _14;
  _3 = MEM[(__v4sf *)m_12(D) + 16B];
  _15 = VEC_PERM_EXPR <v_13(D), v_13(D), { 1, 1, 1, 1 }>;
  _4 = _3 * _15;
  _5 = _2 + _4;
  _6 = MEM[(__v4sf *)m_12(D) + 32B];
  _16 = VEC_PERM_EXPR <v_13(D), v_13(D), { 2, 2, 2, 2 }>;
  _7 = _6 * _16;
  _8 = _5 + _7;
  _9 = MEM[(__v4sf *)m_12(D) + 48B];
  _17 = VEC_PERM_EXPR <v_13(D), v_13(D), { 3, 3, 3, 3 }>;
  _10 = _9 * _17;
  _18 = _8 + _10;
  return _18;

Bad:

  <bb 2> [local count: 1073741824]:
  _1 = *m_12(D);
  _30 = BIT_FIELD_REF <v_13(D), 32, 0>;
  v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
  _29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;
  _2 = _1 * _29;
  _3 = MEM[(__v4sf *)m_12(D) + 16B];
  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
  _25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;
  _4 = _3 * _25;
  _5 = _2 + _4;
  _6 = MEM[(__v4sf *)m_12(D) + 32B];
  _14 = BIT_FIELD_REF <v_13(D), 32, 64>;
  v_16 = BIT_INSERT_EXPR <v_17(D), _14, 0>;
  _15 = VEC_PERM_EXPR <v_16, v_16, { 0, 0, 0, 0 }>;
  _7 = _6 * _15;
  _8 = _5 + _7;
  _9 = MEM[(__v4sf *)m_12(D) + 48B];
  _18 = BIT_FIELD_REF <v_13(D), 32, 96>;
  v_20 = BIT_INSERT_EXPR <v_21(D), _18, 0>;
  _19 = VEC_PERM_EXPR <v_20, v_20, { 0, 0, 0, 0 }>;
  _10 = _9 * _19;
  _22 = _8 + _10;
  return _22;

So what's missing is converting the "extract element, insert at 0 & splat" sequence into a splat of element N.

  _30 = BIT_FIELD_REF <v_13(D), 32, 0>;
  v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
  _29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;

shows a missing no-op (an insert into a default-def at position 0 from an extract at the same position can simply return the vector we extract from).

  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
  _25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;

is a bit more complicated - the VEC_PERM_EXPR indices should be modified based on the fact that we only pick the just-inserted element, which was extracted from another (compatible) vector.
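The desired rewrite is easy to check for value-equivalence: extracting element n, inserting it at lane 0 of a fresh vector and splatting lane 0 gives the same result as splatting lane n of the original vector directly. A sketch using generic vectors (the zero-initialized temporary stands in for the undefined default-def, whose other lanes never matter):

```cpp
#include <cassert>

typedef float v4sf __attribute__ ((vector_size (16)));

// The "bad" form: extract v[n], insert at lane 0, splat lane 0.
static v4sf splat_via_insert (v4sf v, int n)
{
  v4sf t = { 0, 0, 0, 0 };  // stands in for the undefined default-def
  t[0] = v[n];
  v4sf r = { t[0], t[0], t[0], t[0] };
  return r;
}

// The simplified form: splat lane n directly.
static v4sf splat_direct (v4sf v, int n)
{
  v4sf r = { v[n], v[n], v[n], v[n] };
  return r;
}
```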
Created attachment 57941 [details]
Unincluded version of the "More concise demonstration of the v4sf->float->v4sf issue"

Unincluded so it is easier to test with newer compilers.
Looks fixed on the trunk; will look into what fixed it in a few minutes. transform_bad no longer has a BIT_INSERT_EXPR.
r14-3381-g27de9aa152141e combined with

r13-3212-gb88adba751da63
r13-3271-g786e4c024f9416

fixed the "More concise demonstration of the v4sf->float->v4sf issue" testcase.
Created attachment 57942 [details]
Unincluded version of the original testcase
The original testcase still has an issue though. We get:
```
_1 = MEM[(const float &)v_7(D) + 12];
MEM[(float &)&D.6716] = _1;
_24 = D.6716._rep.vecf;
_25 = VEC_PERM_EXPR <_24, _24, { 0, 0, 0, 0 }>;
```