Bug 85283

Summary: Generates 20 lines of assembly while only one assembly instruction is enough.
Product: gcc Reporter: mcccs
Component: tree-optimization Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED    
Severity: normal Keywords: missed-optimization
Priority: P3    
Version: 8.0.1   
Target Milestone: 11.0   
Host: x86_64-linux-gnu Target: x86_64-linux-gnu
Build: x86_64-linux-gnu Known to work:
Known to fail: Last reconfirmed: 2018-04-09 00:00:00
Bug Depends on:    
Bug Blocks: 53947    

Description mcccs 2018-04-08 07:56:45 UTC
GCC version: trunk/20180407 (also older versions)
Target: x86_64-linux-gnu
Compile options: -Ofast -mavx2 -mfma -Wall -Wextra -Wpedantic

Build options: --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap --enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran --enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto --enable-plugins --enable-threads=posix --with-pkgversion=GCC-Explorer-Build 

The exact code (no #includes):
typedef struct {
  float x, y;
} Vec2;

Vec2 vec2_add(Vec2 a, Vec2 b) {
  Vec2 out = {a.x + b.x, 
              a.y + b.y};
  return out;
}

Produced assembly with line numbers:

1 vec2_add:
2  vmovq rcx, xmm0
3  vmovq rsi, xmm1
...
21 vmovq xmm0, QWORD PTR [rsp-24]
22 ret

Expected assembly (as compiled by Clang 6.0 with -Ofast -mavx2 -mfma):

1 vec2_add: # @vec2_add
2   vaddps xmm0, xmm1, xmm0
3   ret

(Yes, only three lines)

^^^^^^

(These can be experimented with here: https://godbolt.org/g/tTwusV)

See also (for other inefficiencies): https://godbolt.org/g/AtWNgf
Comment 1 Richard Biener 2018-04-09 08:11:27 UTC
This isn't handled by basic-block vectorization because there are no stores
and CONSTRUCTORs are not SLP "seeds".  IIRC there are duplicates.
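
For contrast, here is a minimal sketch of a variant that does contain stores, which is the shape basic-block SLP starts from. The name vec2_add_store and the pointer-based signature are hypothetical and not taken from the report, and no claim is made about which GCC versions actually vectorize it.

```c
typedef struct {
  float x, y;
} Vec2;

/* Hypothetical variant: the result is written through a pointer, so the
   function body ends in two adjacent stores (out->x, out->y) instead of
   building the return value from a CONSTRUCTOR. "restrict" tells the
   compiler the three objects do not overlap. */
void vec2_add_store(Vec2 *restrict out,
                    const Vec2 *restrict a,
                    const Vec2 *restrict b) {
  out->x = a->x + b->x;
  out->y = a->y + b->y;
}
```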
Comment 2 Richard Biener 2018-10-30 10:53:24 UTC
We can vectorize a variant with doubles but that results in awful code because the ABI isn't known.  The float variant now looks like the following before
vectorization:

  _1 = a.x;
  _2 = b.x;
  _3 = _1 + _2;
  _4 = a.y;
  _5 = b.y;
  _6 = _4 + _5;
  MEM[(struct  *)&D.1915] = _3;
  MEM[(struct  *)&D.1915 + 4B] = _6;
  return D.1915;

Here the issue is again that we do not know the ABI details; in addition, MMX
is disabled and the vectorizer expects 4 floats for vectorization
(that is, it cannot vectorize using partial vector regs - the ABI may
specify the upper half of %xmm0 is zero for example).
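
The "variant with doubles" is not shown in this report. Presumably it is the same function with double fields, along the lines of the sketch below; the names Vec2d and vec2d_add are assumptions made for illustration.

```c
/* Assumed shape of the double variant: two doubles fill a full 128-bit
   vector, so the vectorizer has a complete V2DF to work with. */
typedef struct {
  double x, y;
} Vec2d;

Vec2d vec2d_add(Vec2d a, Vec2d b) {
  Vec2d out = {a.x + b.x,
               a.y + b.y};
  return out;
}
```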
Comment 3 Andrew Pinski 2021-11-28 06:57:43 UTC
Fixed in GCC 11, where the x86_64 target emulates 2-float vectors inside SSE registers.
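
As a rough illustration of what a 2-float vector add looks like at the source level, the sketch below spells it out with the GCC vector extension (vector_size attribute, also accepted by Clang). The names v2sf and vec2_add_v2sf are hypothetical; this is not what the vectorizer emits internally, only a hand-written equivalent of the original function.

```c
#include <string.h>

typedef struct {
  float x, y;
} Vec2;

/* 8-byte vector of two floats (GCC vector extension). */
typedef float v2sf __attribute__((vector_size(8)));

Vec2 vec2_add_v2sf(Vec2 a, Vec2 b) {
  v2sf va, vb, vr;
  Vec2 out;
  /* memcpy moves the bits between the struct and the vector type
     without violating aliasing rules. */
  memcpy(&va, &a, sizeof va);
  memcpy(&vb, &b, sizeof vb);
  vr = va + vb; /* element-wise add of both lanes */
  memcpy(&out, &vr, sizeof out);
  return out;
}
```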