Bug 85283

Summary:	Generates 20 lines of assembly while only one assembly instruction is enough.
Product:	gcc	Reporter:	mcccs
Component:	tree-optimization	Assignee:	Not yet assigned to anyone <unassigned>
Status:	RESOLVED FIXED
Severity:	normal	Keywords:	missed-optimization
Priority:	P3
Version:	8.0.1
Target Milestone:	11.0
Host:	x86_64-linux-gnu	Target:	x86_64-linux-gnu
Build:	x86_64-linux-gnu	Known to work:
Known to fail:		Last reconfirmed:	2018-04-09 00:00:00
Bug Depends on:
Bug Blocks:	53947

Description mcccs 2018-04-08 07:56:45 UTC

GCC version: trunk/20180407 (also older versions)
Target: x86_64-linux-gnu
Compile options: -Ofast -mavx2 -mfma -Wall -Wextra -Wpedantic

Build options: --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap --enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran --enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto --enable-plugins --enable-threads=posix --with-pkgversion=GCC-Explorer-Build 

The exact code (no #includes):
typedef struct {
  float x, y;
} Vec2;

Vec2 vec2_add(Vec2 a, Vec2 b) {
  Vec2 out = {a.x + b.x, 
              a.y + b.y};
  return out;
}

Produced assembly with line numbers:

1 vec2_add:
2  vmovq rcx, xmm0
3  vmovq rsi, xmm1
...
21 vmovq xmm0, QWORD PTR [rsp-24]
22 ret

Expected assembly (as compiled by Clang 6.0 with -Ofast -mavx2 -mfma):

1 vec2_add: # @vec2_add
2   vaddps xmm0, xmm1, xmm0
3   ret

(Yes, only three lines)


(These can be experimented with here: https://godbolt.org/g/tTwusV)

See also (for other inefficiencies): https://godbolt.org/g/AtWNgf

Comment 1 Richard Biener 2018-04-09 08:11:27 UTC

This isn't handled by basic-block vectorization because there are no stores
and CONSTRUCTORs are not SLP "seeds".  IIRC there are duplicates.
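
To make the "no stores" point concrete, here is a minimal sketch of a variant whose body does contain stores. The name vec2_add_store and the pointer-based signature are hypothetical and not part of the original report. The sketch only illustrates the shape of code that has adjacent stores for the basic-block vectorizer to start from; it makes no claim about what any particular GCC version generates for it.

```c
typedef struct {
  float x, y;
} Vec2;

/* Hypothetical variant (not from the report): the result is written
   through a pointer, so the function body ends in two adjacent float
   stores instead of building a struct value to return. */
void vec2_add_store(Vec2 *out, const Vec2 *a, const Vec2 *b) {
  out->x = a->x + b->x;
  out->y = a->y + b->y;
}
```

Comparing the output of this variant with the original vec2_add under the same options is one way to see whether the missing stores are what blocks vectorization on a given compiler.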

Comment 2 Richard Biener 2018-10-30 10:53:24 UTC

We can vectorize a variant with doubles, but that results in awful code because the ABI isn't known.  The float variant now looks like the following before vectorization:

  _1 = a.x;
  _2 = b.x;
  _3 = _1 + _2;
  _4 = a.y;
  _5 = b.y;
  _6 = _4 + _5;
  MEM[(struct  *)&D.1915] = _3;
  MEM[(struct  *)&D.1915 + 4B] = _6;
  return D.1915;

Here the issue is again that we do not know the ABI details; in addition,
MMX is disabled and the vectorizer expects 4 floats for vectorization
(that is, it cannot vectorize using partial vector regs, since the ABI
may specify, for example, that the upper half of %xmm0 is zero).
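
The exact "variant with doubles" is not shown in this comment. A plausible reading, given here as an assumption, is the original test case with float replaced by double; the names Vec2d and vec2d_add are hypothetical.

```c
/* Assumed shape of the double variant: two doubles fill a full
   128-bit vector, unlike the two floats of the original test case. */
typedef struct {
  double x, y;
} Vec2d;

Vec2d vec2d_add(Vec2d a, Vec2d b) {
  Vec2d out = {a.x + b.x,
               a.y + b.y};
  return out;
}
```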

Comment 3 Andrew Pinski 2021-11-28 06:57:43 UTC

Fixed in GCC 11, where the x86_64 target emulates 2-float vectors inside SSE registers.
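
For readers who want to see what a 2-float vector looks like at the source level, the following sketch writes the same addition with the GNU C vector_size extension (supported by GCC and Clang). It is only an illustration: the names v2sf and vec2_add_v2sf are hypothetical, and this is not a description of how the fix works internally.

```c
#include <string.h>

typedef struct {
  float x, y;
} Vec2;

/* GNU C vector extension: an 8-byte vector holding two floats. */
typedef float v2sf __attribute__((vector_size(8)));

/* Hypothetical helper: same result as vec2_add, expressed as a single
   2-element vector addition.  memcpy is used to move between the
   struct and the vector type without aliasing problems. */
Vec2 vec2_add_v2sf(Vec2 a, Vec2 b) {
  v2sf va, vb, vr;
  Vec2 out;
  memcpy(&va, &a, sizeof va);
  memcpy(&vb, &b, sizeof vb);
  vr = va + vb; /* element-wise addition of both lanes */
  memcpy(&out, &vr, sizeof out);
  return out;
}
```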