I already mentioned this example, but I don't think it is in any PR:

  typedef double vec __attribute__((vector_size(4*sizeof(double))));
  void f(vec *x) { *x += *x + *x; }

Compiled with -S -O3 -msse4, this produces 4 add insns (normal) and 36 mov insns, which is a bit much... For comparison, it should be equivalent to the following code, which generates only 6 mov insns:

  typedef double vec __attribute__((vector_size(2*sizeof(double))));
  void f(vec *x) { x[0] += x[0] + x[0]; x[1] += x[1] + x[1]; }

One minor enhancement would be to have fold_ternary handle BIT_FIELD_REF of a CONSTRUCTOR of vectors (I think this is already tracked elsewhere, though I couldn't find it). But the main issue is the copying of these fake vectors. Their fake "registers" live in memory, and copying between them (4 times 2 movs going through rax in DImode; I assume that is faster than going through xmm registers?) isn't optimized away. In this example, the content of *x is first copied to a fake register. Then the V2DF halves are extracted, added, and put back in memory. That fake register is then copied to a new fake register, the V2DF halves are taken from it, added to the V2DF halves that were still around, and stored to memory. And that is finally copied to the memory location *x. I don't know how this should be improved. Maybe the vector lowering pass should go even further, turn the first program into the second one, and not leave any extra-long vectors for the back-end to handle? It doesn't seem easy to optimize in the back-end; by then it is too late. Or maybe something can be done at expansion time?
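A minimal sketch of the BIT_FIELD_REF-of-CONSTRUCTOR situation at the source level (a hypothetical illustration, not code from this report): after lowering, the extra-long vector is built as a CONSTRUCTOR of two V2DF halves, and extracting one half again is a BIT_FIELD_REF that should fold straight back to that half instead of round-tripping through a stack slot.

  typedef double vec2 __attribute__((vector_size(2*sizeof(double))));
  typedef double vec4 __attribute__((vector_size(4*sizeof(double))));

  // Build a wide vector and immediately extract its low half again.
  // Ideally this compiles to a plain move of 'lo', with no stack traffic.
  vec2 low_half(vec2 lo, vec2 hi)
  {
    vec4 wide = { lo[0], lo[1], hi[0], hi[1] };
    vec2 r = { wide[0], wide[1] };
    return r;
  }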
The first copy is PR 52436. The second copy has a patch posted here: http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00900.html

The last copy would require turning:

  gimple_assign <constructor, _5, {_15, _18}, NULL, NULL>
  gimple_assign <ssa_name, *x_2(D), _5, NULL, NULL>

into:

  gimple_assign <ssa_name, *x_2(D), _15, NULL, NULL>
  gimple_assign <ssa_name, MEM[(vec *)x_2(D) + 16B], _18, NULL, NULL>

(not sure if endianness matters here), which could maybe more easily be done by splitting the memory write (when the vector type is not supported) into a suitable number of bit_field_ref extractions and memory writes, and relying on forwprop4 to simplify the bit_field_refs of the constructor.
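At the source level, the intended split store amounts to something like the following hand-written sketch (a hypothetical helper, assuming a 16-byte supported vector type; this is not the actual patch):

  typedef double vec2 __attribute__((vector_size(2*sizeof(double)), may_alias));
  typedef double vec4 __attribute__((vector_size(4*sizeof(double))));

  // Store a wide vector built from two supported-width halves as two
  // narrow stores instead of one unsupported V4DF store, so the halves
  // never need to be glued together in a fake register first.
  // (may_alias lets the narrow stores legally touch the vec4 object.)
  void store_split(vec4 *x, vec2 lo, vec2 hi)
  {
    vec2 *p = (vec2 *)x;
    p[0] = lo;   // bytes 0..15
    p[1] = hi;   // bytes 16..31, modulo the endianness question above
  }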
Author: glisse
Date: Wed Nov 28 10:11:27 2012
New Revision: 193884

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193884
Log:
2012-11-28  Marc Glisse  <marc.glisse@inria.fr>

        PR middle-end/55266
        * fold-const.c (fold_ternary_loc) [BIT_FIELD_REF]: Handle
        CONSTRUCTOR with vector elements.
        * tree-ssa-propagate.c (valid_gimple_rhs_p): Handle CONSTRUCTOR
        and BIT_FIELD_REF.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/fold-const.c
        trunk/gcc/tree-ssa-propagate.c
The other issue is that no DCE pass runs after forwprop4.
I still see problems when calling inline functions. It seems that the code to satisfy the "calling ABI" is generated anyhow. Take the example below and compare the code generated for "dotd1" with that for "dotd2": dotd2 has a "storm" of moves before the reduction.

  c++ -std=c++11 -Ofast -march=corei7 -S conversions.cc -fabi-version=0

The AVX version is better, except for dotd4 (actually dotd1 seems to be left as is).

  typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
  typedef double __attribute__( ( vector_size( 32 ) ) ) float64x4_t;

  inline float64x4_t convert(float32x4_t f) {
    return float64x4_t{f[0],f[1],f[2],f[3]};
  }

  float dotf(float32x4_t x, float32x4_t y) {
    float ret=0;
    for (int i=0;i!=4;++i) ret+=x[i]*y[i];
    return ret;
  }

  inline double dotd(float64x4_t x, float64x4_t y) {
    double ret=0;
    for (int i=0;i!=4;++i) ret+=x[i]*y[i];
    return ret;
  }

  float dotd1(float32x4_t x, float32x4_t y) {
    float64x4_t dx,dy;
    for (int i=0;i!=4;++i) { dx[i]=x[i]; dy[i]=y[i]; }
    double ret=0;
    for (int i=0;i!=4;++i) ret+=dx[i]*dy[i];
    return ret;
  }

  float dotd2(float32x4_t x, float32x4_t y) {
    float64x4_t dx=convert(x);
    float64x4_t dy=convert(y);
    return dotd(dx,dy);
  }

  float dotd3(float32x4_t x, float32x4_t y) {
    float64x4_t dx{x[0],x[1],x[2],x[3]};
    float64x4_t dy{y[0],y[1],y[2],y[3]};
    double ret=0;
    for (int i=0;i!=4;++i) ret+=dx[i]*dy[i];
    return ret;
  }

  float dotd4(float32x4_t x, float32x4_t y) {
    float64x4_t dx,dy;
    for (int i=0;i!=4;++i) { dx[i]=x[i]; dy[i]=y[i]; }
    return dotd(dx,dy);
  }
The latter is because 'convert' leaves us with

  _1 = BIT_FIELD_REF <x_32(D), 32, 0>;
  _2 = (double) _1;
  _3 = BIT_FIELD_REF <x_32(D), 32, 32>;
  _4 = (double) _3;
  _5 = BIT_FIELD_REF <x_32(D), 32, 64>;
  _6 = (double) _5;
  _7 = BIT_FIELD_REF <x_32(D), 32, 96>;
  _8 = (double) _7;
  _9 = {_2, _4, _6, _8};

rather than

  vect__1.83_46 = x;
  vect__2.84_47 = [vec_unpack_lo_expr] vect__1.83_46;
  vect__2.84_48 = [vec_unpack_hi_expr] vect__1.83_46;
  MEM[(vector(4) double *)&dx] = vect__2.84_47;
  MEM[(vector(4) double *)&dx + 16B] = vect__2.84_48;

(which is in itself not optimal, because it is not in SSA form). This means generic vector support lacks widening/shortening operations, so you have to jump through hoops with things like 'convert'. And SLP vectorization doesn't "vectorize" with vector CONSTRUCTORs as the root (a possible enhancement, I think).

For the original testcase it's a duplicate of PR65832, as we get

  <bb 2>:
  _1 = *x_5(D);
  _7 = BIT_FIELD_REF <_1, 128, 0>;
  _9 = _7 + _7;
  _10 = BIT_FIELD_REF <_1, 128, 128>;
  _12 = _10 + _10;
  _14 = _7 + _9;
  _16 = _10 + _12;
  _3 = {_14, _16};
  *x_5(D) = _3;

Without fixing PR65832 this can be improved by "combining" the loads with the extracts and the CONSTRUCTOR with the store. I have done something similar for COMPLEX_EXPR in tree-ssa-forwprop.c ... (not that I am very proud of that - heh).
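For comparison, a sketch of element-wise versus whole-vector widening (hypothetical helpers; __builtin_convertvector only exists in newer GCC releases and is not part of this testcase):

  typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
  typedef double __attribute__( ( vector_size( 32 ) ) ) float64x4_t;

  // Element-wise widening, as in the testcase's 'convert': each lane is
  // extracted and converted separately, yielding the BIT_FIELD_REF /
  // CONSTRUCTOR sequence quoted above.
  inline float64x4_t widen_elementwise(float32x4_t f) {
    return float64x4_t{f[0], f[1], f[2], f[3]};
  }

  // Whole-vector widening, the operation generic vector support lacks here;
  // newer GCC (9 and later) spells it __builtin_convertvector.
  inline float64x4_t widen_whole(float32x4_t f) {
    return __builtin_convertvector(f, float64x4_t);
  }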
The original issue is fixed:

  f:
  .LFB0:
          .cfi_startproc
          movapd  (%rdi), %xmm2
          movapd  16(%rdi), %xmm1
          movapd  %xmm2, %xmm0
          addpd   %xmm2, %xmm0
          addpd   %xmm2, %xmm0
          movaps  %xmm0, (%rdi)
          movapd  %xmm1, %xmm0
          addpd   %xmm1, %xmm0
          addpd   %xmm1, %xmm0
          movaps  %xmm0, 16(%rdi)
          ret

The issue in comment #4 is fixed as well, I think:

  _Z5dotd1Dv4_fS_:
  .LFB3:
          .cfi_startproc
          movaps  %xmm1, %xmm3
          pxor    %xmm2, %xmm2
          movhlps %xmm0, %xmm2
          cvtps2pd        %xmm0, %xmm0
          cvtps2pd        %xmm2, %xmm1
          pxor    %xmm2, %xmm2
          movhlps %xmm3, %xmm2
          cvtps2pd        %xmm3, %xmm3
          cvtps2pd        %xmm2, %xmm2
          mulpd   %xmm3, %xmm0
          mulpd   %xmm2, %xmm1
          addpd   %xmm0, %xmm1
          movapd  %xmm1, %xmm0
          unpckhpd        %xmm1, %xmm0
          addpd   %xmm1, %xmm0
          cvtsd2ss        %xmm0, %xmm0
          ret
          .cfi_endproc
  .LFE3:
          .size   _Z5dotd1Dv4_fS_, .-_Z5dotd1Dv4_fS_
          .p2align 4
          .globl  _Z5dotd2Dv4_fS_
          .type   _Z5dotd2Dv4_fS_, @function
  _Z5dotd2Dv4_fS_:
  .LFB4:
          .cfi_startproc
          movaps  %xmm1, %xmm3
          cvtps2pd        %xmm0, %xmm4
          pxor    %xmm2, %xmm2
          movhlps %xmm0, %xmm2
          pxor    %xmm0, %xmm0
          movhlps %xmm3, %xmm0
          cvtps2pd        %xmm2, %xmm2
          cvtps2pd        %xmm1, %xmm1
          cvtps2pd        %xmm0, %xmm0
          mulpd   %xmm4, %xmm1
          mulpd   %xmm0, %xmm2
          addpd   %xmm2, %xmm1
          movapd  %xmm1, %xmm0
          unpckhpd        %xmm1, %xmm0
          addpd   %xmm1, %xmm0
          cvtsd2ss        %xmm0, %xmm0
          ret