GCC version: trunk/20180407 (also older versions)
Target: x86_64-linux-gnu
Compile options: -Ofast -mavx2 -mfma -Wall -Wextra -Wpedantic
Build options: --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap --enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran --enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto --enable-plugins --enable-threads=posix --with-pkgversion=GCC-Explorer-Build

The exact code (no #includes):

    typedef struct {
        float x, y;
    } Vec2;

    Vec2 vec2_add(Vec2 a, Vec2 b)
    {
        Vec2 out = {a.x + b.x, a.y + b.y};
        return out;
    }

Produced assembly with line numbers:

    1  vec2_add:
    2          vmovq   rcx, xmm0
    3          vmovq   rsi, xmm1
    ...
    21         vmovq   xmm0, QWORD PTR [rsp-24]
    22         ret

Expected assembly (as compiled by Clang 6.0 with -Ofast -mavx2 -mfma):

    1  vec2_add:                       # @vec2_add
    2          vaddps  xmm0, xmm1, xmm0
    3          ret

(Yes, only three lines.)

(These can be experimented with here: https://godbolt.org/g/tTwusV)

See also (for other inefficiencies): https://godbolt.org/g/AtWNgf
This isn't handled by basic-block vectorization because there are no stores, and CONSTRUCTORs are not SLP "seeds". IIRC there are duplicates of this bug.
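To illustrate what a store "seed" means here: a hypothetical variant that writes its result through a pointer gives the basic-block (SLP) vectorizer a group of adjacent stores to start from, which it can follow back through the additions to the loads. The sketch below assumes a four-float type (Vec4 and vec4_add_store are illustrative names, not from the report) so that a full 128-bit SSE vector is available; the code GCC actually generates for it was not checked here.

```c
/* Hypothetical four-float type; not part of the original report. */
typedef struct {
    float x, y, z, w;
} Vec4;

/* The four adjacent stores to out->x .. out->w form a store group,
   which is the kind of starting point ("seed") SLP discovery uses.
   The returned-by-value Vec2 in the report has no such stores. */
void vec4_add_store(Vec4 *restrict out, const Vec4 *a, const Vec4 *b)
{
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
    out->w = a->w + b->w;
}
```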
We can vectorize a variant with doubles, but that results in awful code because the ABI isn't known to the vectorizer. The float variant now looks like the following before vectorization:

    _1 = a.x;
    _2 = b.x;
    _3 = _1 + _2;
    _4 = a.y;
    _5 = b.y;
    _6 = _4 + _5;
    MEM[(struct *)&D.1915] = _3;
    MEM[(struct *)&D.1915 + 4B] = _6;
    return D.1915;

Here the issue is again that we do not know the ABI details. In addition, MMX is disabled and the vectorizer expects 4 floats for vectorization; that is, it cannot vectorize using partial vector registers (the ABI may, for example, specify that the upper half of %xmm0 is zero).
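The comment does not show the double variant. A minimal sketch, assuming it simply swaps float for double (Vec2d and vec2d_add are hypothetical names), would be:

```c
/* Hypothetical double-precision counterpart of Vec2: 16 bytes, i.e. the
   width of one 128-bit SSE register. */
typedef struct {
    double x, y;
} Vec2d;

Vec2d vec2d_add(Vec2d a, Vec2d b)
{
    Vec2d out = {a.x + b.x, a.y + b.y};
    return out;
}
```

Under the x86-64 System V ABI this struct is classified as two SSE eightbytes, so each argument arrives split across two registers (a.x in %xmm0, a.y in %xmm1, b.x in %xmm2, b.y in %xmm3) and the result is returned in %xmm0/%xmm1. A vectorized add therefore has to pack the halves together and unpack the result again, which is presumably where the awful code comes from.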
Fixed in GCC 11, where the x86_64 target emulates 2-float vectors inside SSE registers.
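For comparison, the same operation can be written directly with GCC's generic vector extension, where an 8-byte float vector is exactly the 2-float vector referred to above. This is a minimal sketch: v2sf and vec2_add_v are illustrative names not taken from the thread, and the assembly GCC produces for it was not checked here.

```c
/* GNU C vector extension: two floats packed into 8 bytes. */
typedef float v2sf __attribute__((vector_size(8)));

v2sf vec2_add_v(v2sf a, v2sf b)
{
    return a + b; /* element-wise addition of both lanes */
}
```

Unlike the struct version, the vector type tells the compiler up front that both lanes belong to one value, so no SLP discovery is needed to find the parallelism.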