[Bug tree-optimization/29756] SSE intrinsics hard to use without redundant temporaries appearing

rguenth at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Dec 7 08:22:56 GMT 2021


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
Good:

  <bb 2> [local count: 1073741824]:
  _1 = *m_12(D);
  _14 = VEC_PERM_EXPR <v_13(D), v_13(D), { 0, 0, 0, 0 }>;
  _2 = _1 * _14;
  _3 = MEM[(__v4sf *)m_12(D) + 16B];
  _15 = VEC_PERM_EXPR <v_13(D), v_13(D), { 1, 1, 1, 1 }>;
  _4 = _3 * _15;
  _5 = _2 + _4;
  _6 = MEM[(__v4sf *)m_12(D) + 32B];
  _16 = VEC_PERM_EXPR <v_13(D), v_13(D), { 2, 2, 2, 2 }>;
  _7 = _6 * _16;
  _8 = _5 + _7;
  _9 = MEM[(__v4sf *)m_12(D) + 48B];
  _17 = VEC_PERM_EXPR <v_13(D), v_13(D), { 3, 3, 3, 3 }>;
  _10 = _9 * _17;
  _18 = _8 + _10;
  return _18;

Bad:

  <bb 2> [local count: 1073741824]:
  _1 = *m_12(D);
  _30 = BIT_FIELD_REF <v_13(D), 32, 0>;
  v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
  _29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;
  _2 = _1 * _29;
  _3 = MEM[(__v4sf *)m_12(D) + 16B];
  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
  _25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;
  _4 = _3 * _25;
  _5 = _2 + _4;
  _6 = MEM[(__v4sf *)m_12(D) + 32B];
  _14 = BIT_FIELD_REF <v_13(D), 32, 64>;
  v_16 = BIT_INSERT_EXPR <v_17(D), _14, 0>;
  _15 = VEC_PERM_EXPR <v_16, v_16, { 0, 0, 0, 0 }>;
  _7 = _6 * _15;
  _8 = _5 + _7;
  _9 = MEM[(__v4sf *)m_12(D) + 48B];
  _18 = BIT_FIELD_REF <v_13(D), 32, 96>;
  v_20 = BIT_INSERT_EXPR <v_21(D), _18, 0>;
  _19 = VEC_PERM_EXPR <v_20, v_20, { 0, 0, 0, 0 }>;
  _10 = _9 * _19;
  _22 = _8 + _10;
  return _22;

So what's missing is folding the sequence "extract element N, insert at
position 0, splat element 0" into a single "splat element N".

  _30 = BIT_FIELD_REF <v_13(D), 32, 0>;
  v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
  _29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;

shows a missing no-op simplification: inserting into a default-def (undefined)
vector at position 0 a value extracted from the same position can simply
return the vector we extract from.
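
To make the no-op argument concrete, here is a minimal sketch in plain C that models the three GIMPLE operations on a four-lane float vector. The type `v4sf_model` and all function names are hypothetical stand-ins, not GCC internals, and the undefined default-def is modelled by an arbitrary `garbage` argument on the assumption that any contents are acceptable for an undefined value.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for __v4sf: four 32-bit float lanes. */
typedef struct { float lane[4]; } v4sf_model;

/* Models BIT_FIELD_REF <v, 32, 32 * n>: read lane n. */
float extract_lane(v4sf_model v, int n) { return v.lane[n]; }

/* Models BIT_INSERT_EXPR <dst, x, 0>: overwrite lane 0, keep the rest. */
v4sf_model insert_lane0(v4sf_model dst, float x) { dst.lane[0] = x; return dst; }

/* Models VEC_PERM_EXPR <v, v, { 0, 0, 0, 0 }>: broadcast lane 0. */
v4sf_model splat_lane0(v4sf_model v)
{
    v4sf_model r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = v.lane[0];
    return r;
}

/* Asserts that extract(0) + insert(0) + splat(0) equals splat(0) on the
   source vector, whatever the undefined destination happened to contain. */
void check_noop_insert(v4sf_model v, v4sf_model garbage)
{
    v4sf_model slow = splat_lane0(insert_lane0(garbage, extract_lane(v, 0)));
    v4sf_model fast = splat_lane0(v);
    assert(memcmp(&slow, &fast, sizeof slow) == 0);
}
```

In this model the intermediate vector differs from `v` in lanes 1 to 3, so replacing it with `v` is only justified because those lanes come from an undefined value.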

  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
  _25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;

is a bit more complicated: the VEC_PERM_EXPR indices should be rewritten
based on the fact that the permute only picks the just-inserted element and
that element was extracted from another (compatible) vector.
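
The index rewrite can be sketched as a small selector transformation. This is an illustrative model in plain C, not the actual GCC match pattern: `rewrite_perm_selector`, `NLANES` and the parameter names are hypothetical, and it assumes both permute inputs are the same vector, so selector values can be reduced modulo the lane count.

```c
#include <assert.h>

enum { NLANES = 4 };

/* Given the selector of VEC_PERM_EXPR <w, w, sel>, where w was built by
   inserting lane `source_lane` of another vector v into lane
   `inserted_lane` of an undefined vector, try to produce a selector for
   VEC_PERM_EXPR <v, v, new_sel> instead.
   Returns 1 and fills new_sel on success; returns 0 and leaves new_sel
   untouched if any result lane reads something other than the inserted
   lane. */
int rewrite_perm_selector(const unsigned sel[NLANES],
                          unsigned inserted_lane,
                          unsigned source_lane,
                          unsigned new_sel[NLANES])
{
    /* First pass: every selected lane must be the inserted one. */
    for (int i = 0; i < NLANES; i++)
        if (sel[i] % NLANES != inserted_lane)
            return 0;

    /* Second pass: redirect each index to the lane it was extracted from. */
    for (int i = 0; i < NLANES; i++)
        new_sel[i] = source_lane;
    return 1;
}
```

For the sequence above, the BIT_FIELD_REF offset of 32 bits corresponds to source lane 1 and the insert position is lane 0, so the selector { 0, 0, 0, 0 } on v_24 becomes { 1, 1, 1, 1 } on v_13(D), which matches the "Good" dump.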

