Bug 29756 - SSE intrinsics hard to use without redundant temporaries appearing

Product:          gcc
Component:        tree-optimization
Version:          4.1.2
Status:           NEW
Severity:         enhancement
Priority:         P3
Keywords:         missed-optimization
Reporter:         Tim Day <timday>
Assignee:         Not yet assigned to anyone <unassigned>
CC:               gcc-bugs, jakub, liuhongt, rguenth, timday, uros
Target:           x86_64-*-*, i?86-*-*
Target Milestone: ---
Last reconfirmed: 2016-05-10 00:00:00
Bug Depends on:   28367

Attachments:
  - Result of gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp
  - More concise demonstration of the v4sf->float->v4sf issue.
  - Uninclude version of the "More concise demonstration of the v4sf->float->v4sf issue"
  - uninclude of the original testcase

Description Tim Day 2006-11-07 22:22:55 UTC
I've been adapting some old codes' simple 4-float vector class to use SSE by use of the intrinsic functions.  It seems to be quite hard to avoid the generated assembly code being rather diluted by apparently redundant spills of intermediate results to the stack.

On inspecting the assembly produced from the file to be attached, compare the code generated for matrix44f::transform_good and matrix44f::transform_bad.
The former is 20 instructions and apparently optimal.  However, it was only arrived at by prodding the latter version of the function (which does exactly the same thing, expressed more naturally, but results in 32 instructions) until the stack temporaries went away.  It would be nice if both versions of the function generated optimal code, and there doesn't seem to be any particular reason why they shouldn't.

Both versions' assembly contains the same expected numbers of shuffle, multiply and add instructions; the excess seems to consist entirely of extra stack temporaries.

[I'm not sure what the "triplet" codes on this form are.
I'm using the gcc in Debian Etch; gcc --version shows "gcc (GCC) 4.1.2 20060901 (prerelease) (Debian 4.1.1-13)"; the platform is a Pentium 3.  Sorry if the "inline-asm" component is a completely inappropriate thing to assign this to.]
Comment 1 Tim Day 2006-11-07 22:26:59 UTC
Created attachment 12566 [details]
Result of gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp

This is the .ii file output from
gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp
Most of it is the result of the .cpp's sole direct include: #include <xmmintrin.h>, which was immediately before the "class vector4f" declaration.
Comment 2 Andrew Pinski 2006-11-07 22:31:26 UTC
Looks like this is mostly caused by:
  union
  {
    __v4sf vecf;
    __m128 rawf;
    float val[4];
  } _rep;

I will have a closer look at this issue later tonight when I get home from work.
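
For reference, a union of this kind typically gets used roughly as sketched below; reading a component through the float member forces the whole vector value through a stack slot.  This is only an illustrative sketch (the attachment's actual vector4f is not reproduced in this report; all names here are assumed):

  #include <xmmintrin.h>

  // Hypothetical union-backed wrapper, illustrating why per-component access
  // forces the vector through memory: the float member can only be read after
  // the __m128 value has been stored to the union's stack slot.
  class vector4f
  {
    union
    {
      __v4sf vecf;
      __m128 rawf;
      float  val[4];
    } _rep;

  public:
    explicit vector4f (__m128 r) { _rep.rawf = r; }

    // Reading one lane through the float member spills the whole vector.
    float operator[] (unsigned i) const { return _rep.val[i]; }

    operator __m128 () const { return _rep.rawf; }
  };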

Comment 3 Tim Day 2006-11-08 10:01:11 UTC
I've just tried an alternative version (will upload later) replacing the union with a single
  __v4sf _rep,
and implementing the [] operators using e.g.
  (reinterpret_cast<const float*>(&_rep))[i];
However, the code generated by the two transform implementations remains the same (20 and 32 instructions anyway; I haven't checked the details yet).
Maybe not surprising as it's just moving the problem around.

The big difference between the two methods is perhaps primarily that the bad one involves a __v4sf->float->__v4sf conversion, while the good one uses __v4sf throughout by using the mul_compN methods.  I'll try to prepare a more concise test case based on the premise that bad handling of __v4sf <-> float is the real issue.
Comment 4 Tim Day 2006-11-08 22:18:28 UTC
Created attachment 12573 [details]
More concise demonstration of the v4sf->float->v4sf issue.

The attached code (no classes or unions, just a few inline functions), obtained from
  gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer v4sf.cpp
compiles transform_good to 18 instructions and transform_bad to 33.  However, it's not really surprising that a round-trip through stack temporaries is required when pointer arithmetic is being used to extract a float from a __v4sf.  I've no idea whether it's realistic to hope this could ever be optimised away.  Alternatively, it would be very nice if the builtin vector types simply provided a [] operator, or if there were some intrinsics for extracting floats from a __v4sf.

(In the meantime, in the original vector4f class, remaining in the __v4sf domain by having the const operator[] return a suitably type-wrapped __v4sf "filled" with the specified component seems to be a promising direction).
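
A hedged sketch of that direction, building on the component_fill idiom in the attached testcase (the proxy type and names below are assumed, not taken from the original class): the accessor returns a small wrapper holding the selected component broadcast to all four lanes, so code that consumes it can stay in the SSE domain.

  #include <xmmintrin.h>

  // Hypothetical proxy: one component of a __v4sf, broadcast to every lane.
  // Converting it back to __v4sf never round-trips through a scalar float.
  struct component_ref
  {
    __v4sf splat;
    operator __v4sf () const { return splat; }
  };

  // Compile-time index keeps the shufps mask an immediate constant.
  template <int N> inline component_ref component (__v4sf v)
  {
    component_ref r;
    r.splat = (__v4sf) __builtin_ia32_shufps (v, v,
                                              (N << 6) | (N << 4) | (N << 2) | N);
    return r;
  }

  // Usage sketch (hypothetical): __v4sf scaled = (__v4sf) component<2> (v) * m2;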
Comment 5 Andrew Pinski 2006-11-14 01:15:11 UTC
This is mostly PR 28367.  There are most likely other issues like some of the SSE intrinsics not being declared as pure/const.
Comment 6 Richard Biener 2016-05-10 08:34:19 UTC
So what is missing here is avoiding 'v' for

  _26 = BIT_FIELD_REF <v_13(D), 32, 0>;
  BIT_FIELD_REF <v, 32, 0> = _26;
  v.1_24 = v;
  _25 = __builtin_ia32_shufps (v.1_24, v.1_24, 0);
  v ={v} {CLOBBER};

which can be done with a new BIT_FIELD_EXPR like so:

  v_24 = BIT_FIELD_EXPR <v_n(D), _26, 32, 0>;
Comment 7 Richard Biener 2016-05-12 13:39:44 UTC
So I have it down to an x86 combine issue:

;; v_28 = BIT_FIELD_INSERT <v_27(D), _30, 0 (32 bits)>;

(insn 7 6 8 (set (reg:SF 116)
        (vec_select:SF (reg/v:V4SF 115 [ v ])
            (parallel [
                    (const_int 0 [0])
                ]))) t.c:5 -1
     (nil))

(insn 8 7 9 (set (reg:V4SF 117)
        (reg/v:V4SF 109 [ v ])) t.c:11 -1
     (nil))

(insn 9 8 10 (set (reg:V4SF 117)
        (vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
            (reg:V4SF 117)
            (const_int 1 [0x1]))) t.c:11 -1
     (nil))

(insn 10 9 0 (set (reg/v:V4SF 110 [ v ])
        (reg:V4SF 117)) t.c:11 -1
     (nil))

that's from what vec_set_optab produces

;; _29 = __builtin_ia32_shufps (v_28, v_28, 0);

(insn 11 10 12 (set (reg:V4SF 119)
        (reg/v:V4SF 110 [ v ])) t.c:12 -1
     (nil))

(insn 12 11 13 (set (reg:V4SF 120)
        (reg/v:V4SF 110 [ v ])) t.c:12 -1
     (nil))

(insn 13 12 14 (set (reg:V4SF 118)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 119)
                (reg:V4SF 120))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:12 -1
     (nil))

(insn 14 13 0 (set (reg:V4SF 111 [ _29 ])
        (reg:V4SF 118)) t.c:12 -1
     (nil))

and that's the shuffle.  And after combine we have

(insn 7 4 53 2 (set (reg:SF 116)
        (vec_select:SF (reg/v:V4SF 115 [ v ])
            (parallel [
                    (const_int 0 [0])
                ]))) t.c:5 2423 {*vec_extractv4sf_0}
     (nil))
(insn 9 53 13 2 (set (reg:V4SF 117 [ v ])
        (vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
            (const_vector:V4SF [
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                ])
            (const_int 1 [0x1]))) t.c:11 2420 {vec_setv4sf_0}
     (expr_list:REG_DEAD (reg:SF 116)
        (nil)))
(insn 13 9 15 2 (set (reg:V4SF 118)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 117 [ v ])
                (reg:V4SF 117 [ v ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:12 2405 {sse_shufps_v4sf}
     (expr_list:REG_DEAD (reg:V4SF 117 [ v ])
        (nil)))

which combine doesn't manage to get down to

(insn 9 4 13 2 (set (reg:V4SF 104)
        (vec_select:V4SF (vec_concat:V8SF (reg/v:V4SF 103 [ v ])
                (reg/v:V4SF 103 [ v ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:18 2405 {sse_shufps_v4sf}
     (nil))



The testcase was the following.

#include <emmintrin.h>

template <int N> inline float component(__v4sf v)
{
  return (reinterpret_cast<const float*>(&v))[N];
}

inline __v4sf fill(float f)
{
  __v4sf v;
  *(reinterpret_cast<float*>(&v))=f;
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v), 0));
}

template <int N> inline __v4sf component_fill(__v4sf v)
{
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v), 
                  ((((N) << 6) | ((N) << 4) | ((N) << 2) | (N)))));
}

__v4sf transform_bad(__v4sf m[4],__v4sf v)
{
  return m[0]*fill(component<0>(v))
      +m[1]*fill(component<1>(v))
      +m[2]*fill(component<2>(v))
      +m[3]*fill(component<3>(v));
}

__v4sf transform_good(__v4sf m[4],__v4sf v)
{
  return m[0]*component_fill<0>(v)
      +m[1]*component_fill<1>(v)
      +m[2]*component_fill<2>(v)
      +m[3]*component_fill<3>(v);
}
Comment 8 Richard Biener 2016-05-19 09:52:21 UTC
So the remaining piece may be the init-regs issue.  We have

  vf_24 = BIT_INSERT_EXPR <vf_23(D), _26, 0 (32 bits)>;

which leaves the upper elements undefined, but init-regs forces them to zero.
Another issue is that in

  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  vf_24 = BIT_INSERT_EXPR <vf_23(D), _26, 0 (32 bits)>;
  _25 = __builtin_ia32_shufps (vf_24, vf_24, 0);

the shufps is not exposed to gimple optimizations and thus we can't simplify
it in any way.  Only the backend knows that it could be simplified to

  _25 = __builtin_ia32_shufps (vf_13(D), vf_13(D), 85);

so the backend might want to "expand" __builtin_ia32_shufps to a VEC_PERM_EXPR
in its target-specific builtin folding hook (making sure the reverse works
well enough, obviously).
Comment 9 Richard Biener 2016-05-19 09:55:59 UTC
Uros, see comment#8 - would that be acceptable?  The other alternative is to
try using __builtin_shuffle[2] in the intrinsic headers but that might be
somewhat difficult.
Comment 10 Uroš Bizjak 2016-05-19 09:59:09 UTC
(In reply to Richard Biener from comment #9)
> Uros, see comment#8 - would that be acceptable?  The other alternative is to
> try using __builtin_shuffle[2] in the intrinsic headers but that might be
> somewhat difficult.

I have added Jakub to CC, he is the expert in various permutation approaches for x86 target.
Comment 11 Richard Biener 2016-05-19 10:06:29 UTC
Like

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 236441)
+++ gcc/config/i386/i386.c      (working copy)
@@ -37745,6 +37745,23 @@ ix86_fold_builtin (tree fndecl, int n_ar
          gcc_assert (n_args == 1);
           return fold_builtin_cpu (fndecl, args);
        }
+      if (fn_code == IX86_BUILTIN_SHUFPS
+         && n_args == 3
+         && TREE_CODE (args[2]) == INTEGER_CST)
+       {
+         tree mask[4];
+         tree mtype = build_vector_type (integer_type_node, 4);
+         mask[0] = build_int_cst (integer_type_node,
+                                  TREE_INT_CST_LOW (args[2]) & 3);
+         mask[1] = build_int_cst (integer_type_node,
+                                  (TREE_INT_CST_LOW (args[2]) >> 2) & 3);
+         mask[2] = build_int_cst (integer_type_node,
+                                  ((TREE_INT_CST_LOW (args[2]) >> 4) & 3) + 4);
+         mask[3] = build_int_cst (integer_type_node,
+                                  ((TREE_INT_CST_LOW (args[2]) >> 6) & 3) + 4);
+         return fold_build3 (VEC_PERM_EXPR, TREE_TYPE (TREE_TYPE (fndecl)),
+                             args[0], args[1], build_vector (mtype, mask));
+       }
     }
 
 #ifdef SUBTARGET_FOLD_BUILTIN


given the plethora of shuffling intrinsics this might be quite tedious work...
Comment 12 Jakub Jelinek 2016-05-19 10:22:01 UTC
(In reply to Richard Biener from comment #11)

> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c      (revision 236441)
> +++ gcc/config/i386/i386.c      (working copy)
...
> given the plethora of shuffling intrinsics this might be quite tedious
> work...

The builtins aren't guaranteed to be usable directly, only the intrinsics are, so if we want to do the above, we should just kill those builtins instead and use __builtin_shuffle directly in the headers (plus of course each time verify that we get the corresponding or better insn sequence).
Comment 13 rguenther@suse.de 2016-05-19 10:28:06 UTC
On Thu, 19 May 2016, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
> 
> --- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #11)
> 
> > Index: gcc/config/i386/i386.c
> > ===================================================================
> > --- gcc/config/i386/i386.c      (revision 236441)
> > +++ gcc/config/i386/i386.c      (working copy)
> ...
> > given the plethora of shuffling intrinsics this might be quite tedious
> > work...
> 
> The builtins aren't guaranteed to be usable directly, only the intrinsics are,
> so if we want to do the above, we should just kill those builtins instead and
> use __builtin_shuffle directly in the headers (plus of course each time verify
> that we get the corresponding or better insn sequence).

Yes, but that will result in something like

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
_mm_shuffle_ps (__m128 __A, __m128 __B, int const __mask)
{
  return (__m128) __builtin_shuffle2 ((__v4sf)__A, (__v4sf)__B,
		(__v4si) { __mask & 3, (__mask >> 2) & 3,
                           ((__mask >> 4) & 3) + 4, ((__mask >> 6) & 3) + 4 });
}

(not sure if we still need the !__OPTIMIZE__ path, or what we should do for
that in general in the above context - without __OPTIMIZE__ the mask would
presumably no longer constant-fold)

But if this would be the preferred way of addressing this, that's clearly
better than "folding" the stuff back.
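
For reference, the !__OPTIMIZE__ path being referred to is the header's existing macro fallback for _mm_shuffle_ps, which looks roughly like the following (paraphrased from xmmintrin.h, not quoted verbatim):

  #ifdef __OPTIMIZE__
  extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
  _mm_shuffle_ps (__m128 __A, __m128 __B, int const __mask)
  {
    return (__m128) __builtin_ia32_shufps ((__v4sf)__A, (__v4sf)__B, __mask);
  }
  #else
  #define _mm_shuffle_ps(A, B, MASK) \
    ((__m128) __builtin_ia32_shufps ((__v4sf)(__m128)(A), (__v4sf)(__m128)(B), (int)(MASK)))
  #endif

Without __OPTIMIZE__ the inline wrapper is not guaranteed to fold its arguments, which is why the header falls back to the macro form there.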
Comment 14 Richard Biener 2016-05-20 09:17:48 UTC
Author: rguenth
Date: Fri May 20 09:17:16 2016
New Revision: 236501

URL: https://gcc.gnu.org/viewcvs?rev=236501&root=gcc&view=rev
Log:
2016-05-20  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/29756
	* tree.def (BIT_INSERT_EXPR): New tcc_expression tree code.
	* expr.c (expand_expr_real_2): Handle BIT_INSERT_EXPR.
	* fold-const.c (operand_equal_p): Likewise.
	(fold_ternary_loc): Add constant folding of BIT_INSERT_EXPR.
	* gimplify.c (gimplify_expr): Handle BIT_INSERT_EXPR.
	* tree-inline.c (estimate_operator_cost): Likewise.
	* tree-pretty-print.c (dump_generic_node): Likewise.
	* tree-ssa-operands.c (get_expr_operands): Likewise.
	* cfgexpand.c (expand_debug_expr): Likewise.
	* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
	* gimple.c (get_gimple_rhs_num_ops): Handle BIT_INSERT_EXPR.
	* tree-cfg.c (verify_gimple_assign_ternary): Verify BIT_INSERT_EXPR.

	* tree-ssa.c (non_rewritable_lvalue_p): We can rewrite
	vector inserts using BIT_FIELD_REF or MEM_REF on the lhs.
	(execute_update_addresses_taken): Do it.

	* gcc.dg/tree-ssa/vector-6.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
    trunk/gcc/expr.c
    trunk/gcc/fold-const.c
    trunk/gcc/gimple-pretty-print.c
    trunk/gcc/gimple.c
    trunk/gcc/gimplify.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-cfg.c
    trunk/gcc/tree-inline.c
    trunk/gcc/tree-pretty-print.c
    trunk/gcc/tree-ssa-operands.c
    trunk/gcc/tree-ssa.c
    trunk/gcc/tree.def
Comment 15 Jakub Jelinek 2016-05-20 11:56:30 UTC
Author: jakub
Date: Fri May 20 11:55:58 2016
New Revision: 236505

URL: https://gcc.gnu.org/viewcvs?rev=236505&root=gcc&view=rev
Log:
	PR tree-optimization/29756
	gcc.dg/tree-ssa/vector-6.c: Add -Wno-psabi -w to dg-options.
	Add -msse2 for x86 and -maltivec for powerpc.  Use scan-tree-dump-times
	only on selected targets where V4SImode vectors are known to be
	supported.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c
Comment 16 Richard Biener 2016-05-24 13:31:07 UTC
Ok, not mine anymore (for the intrinsics rewrite).  Target bug now.
Comment 17 Andrew Pinski 2021-12-06 23:36:27 UTC
Confirmed on the trunk; this is no longer a target issue:

__builtin_ia32_shufps is lowered into a VEC_PERM_EXPR now:
  _1 = MEM[(const float &)v_7(D) + 12];
  MEM[(float &)&D.5891] = _1;
  _24 = D.5891._rep.vecf;
  _25 = VEC_PERM_EXPR <_24, _24, { 0, 0, 0, 0 }>;
Comment 18 Richard Biener 2021-12-07 08:22:56 UTC
Good:

  <bb 2> [local count: 1073741824]:
  _1 = *m_12(D);
  _14 = VEC_PERM_EXPR <v_13(D), v_13(D), { 0, 0, 0, 0 }>;
  _2 = _1 * _14;
  _3 = MEM[(__v4sf *)m_12(D) + 16B];
  _15 = VEC_PERM_EXPR <v_13(D), v_13(D), { 1, 1, 1, 1 }>;
  _4 = _3 * _15;
  _5 = _2 + _4;
  _6 = MEM[(__v4sf *)m_12(D) + 32B];
  _16 = VEC_PERM_EXPR <v_13(D), v_13(D), { 2, 2, 2, 2 }>;
  _7 = _6 * _16;
  _8 = _5 + _7;
  _9 = MEM[(__v4sf *)m_12(D) + 48B];
  _17 = VEC_PERM_EXPR <v_13(D), v_13(D), { 3, 3, 3, 3 }>;
  _10 = _9 * _17;
  _18 = _8 + _10;
  return _18;

Bad:

  <bb 2> [local count: 1073741824]:
  _1 = *m_12(D);
  _30 = BIT_FIELD_REF <v_13(D), 32, 0>;
  v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
  _29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;
  _2 = _1 * _29;
  _3 = MEM[(__v4sf *)m_12(D) + 16B];
  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
  _25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;
  _4 = _3 * _25;
  _5 = _2 + _4;
  _6 = MEM[(__v4sf *)m_12(D) + 32B];
  _14 = BIT_FIELD_REF <v_13(D), 32, 64>;
  v_16 = BIT_INSERT_EXPR <v_17(D), _14, 0>;
  _15 = VEC_PERM_EXPR <v_16, v_16, { 0, 0, 0, 0 }>;
  _7 = _6 * _15;
  _8 = _5 + _7;
  _9 = MEM[(__v4sf *)m_12(D) + 48B];
  _18 = BIT_FIELD_REF <v_13(D), 32, 96>;
  v_20 = BIT_INSERT_EXPR <v_21(D), _18, 0>;
  _19 = VEC_PERM_EXPR <v_20, v_20, { 0, 0, 0, 0 }>;
  _10 = _9 * _19;
  _22 = _8 + _10;
  return _22;

So what's missing is converting the extract-element, insert-at-0 & splat sequence
into a splat of element N.

  _30 = BIT_FIELD_REF <v_13(D), 32, 0>;
  v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
  _29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;

This shows a missing no-op simplification (an insert into a default-def at
position 0 from an extract at the same position can simply return the vector
we extract from).

  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
  _25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;

is a bit more complicated - the VEC_PERM_EXPR indices should be modified
based on the fact that we only pick the just-inserted elements, and those
were extracted from another (compatible) vector.
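
In other words, both groups should collapse to the splat forms the good dump already produces directly from v_13(D) (using the result names from the bad dump above):

  _29 = VEC_PERM_EXPR <v_13(D), v_13(D), { 0, 0, 0, 0 }>;
  _25 = VEC_PERM_EXPR <v_13(D), v_13(D), { 1, 1, 1, 1 }>;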
Comment 19 Andrew Pinski 2024-04-14 01:11:41 UTC
Created attachment 57941 [details]
Uninclude version of the "More concise demonstration of the v4sf->float->v4sf issue"

Uninclude version so it is easier to test with newer compilers.
Comment 20 Andrew Pinski 2024-04-14 01:16:08 UTC
Looks fixed on the trunk, will look into what fixed it in a few minutes.

transform_bad no longer has a BIT_INSERT_EXPR.
Comment 21 Andrew Pinski 2024-04-14 01:29:05 UTC
r14-3381-g27de9aa152141e combined with
r13-3212-gb88adba751da63
r13-3271-g786e4c024f9416

Fixed the "More concise demonstration of the v4sf->float->v4sf issue" testcase.
Comment 22 Andrew Pinski 2024-04-14 01:31:18 UTC
Created attachment 57942 [details]
uninclude of the original testcase
Comment 23 Andrew Pinski 2024-04-14 01:33:06 UTC
The original testcase still has an issue though.
We get:
```
  _1 = MEM[(const float &)v_7(D) + 12];
  MEM[(float &)&D.6716] = _1;
  _24 = D.6716._rep.vecf;
  _25 = VEC_PERM_EXPR <_24, _24, { 0, 0, 0, 0 }>;
```