Are there any plans to try to vectorize parts of code like:

struct A { double x, y, z; };
struct B { struct A a, b; };
struct C { struct A c; double d; };

__attribute__((noinline, noclone)) int
foo (const struct C *u, struct B v)
{
  double a, b, c, d;
  a = v.b.x * v.b.x + v.b.y * v.b.y + v.b.z * v.b.z;
  b = 2.0 * v.b.x * (v.a.x - u->c.x)
      + 2.0 * v.b.y * (v.a.y - u->c.y)
      + 2.0 * v.b.z * (v.a.z - u->c.z);
  c = u->c.x * u->c.x + u->c.y * u->c.y + u->c.z * u->c.z
      + v.a.x * v.a.x + v.a.y * v.a.y + v.a.z * v.a.z
      + 2.0 * (-u->c.x * v.a.x - u->c.y * v.a.y - u->c.z * v.a.z)
      - u->d * u->d;
  if ((d = b * b - 4.0 * a * c) < 0.0)
    return 0;
  return d;
}

int
main (void)
{
  int i, j;
  struct C c = { { 1.0, 1.0, 1.0 }, 1.0 };
  struct B b = { { 1.0, 1.0, 1.0 }, { 1.0, 1.0, 1.0 } };
  for (i = 0; i < 100000000; i++)
    {
      asm volatile ("" : : "r" (&c), "r" (&b) : "memory");
      j = foo (&c, b);
      asm volatile ("" : : "r" (j));
    }
  return 0;
}

(This is the hot spot from the c-ray benchmark; the actual function is larger, but at least according to callgrind the early return on < 0.0 is taken in most cases, and as the function is large and called from multiple spots, it isn't inlined.)  I'd say (though I haven't tried to code it by hand using intrinsics) that by doing many of the multiplications/additions in parallel (especially with AVX) there could be significant speedups (-O3 -ffast-math).
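To make the opportunity concrete, here is a minimal hand-vectorized sketch of what such an intrinsics version could look like, reusing the struct definitions above. It assumes AVX and the reassociation already implied by -ffast-math; note that the c expression algebraically simplifies to |v.a - u->c|^2 - u->d^2, which the sketch exploits. foo_avx and the padded temporaries are illustrative names, not the vectorizer's actual output:

#include <immintrin.h>

__attribute__((noinline, noclone)) int
foo_avx (const struct C *u, struct B v)  /* hypothetical; compile with -mavx */
{
  /* Pad the three components to four lanes so full-width loads are safe.  */
  double dir[4] = { v.b.x, v.b.y, v.b.z, 0.0 };
  double org[4] = { v.a.x, v.a.y, v.a.z, 0.0 };
  double cen[4] = { u->c.x, u->c.y, u->c.z, 0.0 };

  __m256d vd  = _mm256_loadu_pd (dir);
  __m256d voc = _mm256_sub_pd (_mm256_loadu_pd (org), _mm256_loadu_pd (cen));

  __m256d va = _mm256_mul_pd (vd, vd);    /* lanes of a                */
  __m256d vb = _mm256_mul_pd (vd, voc);   /* lanes of b / 2.0          */
  __m256d vc = _mm256_mul_pd (voc, voc);  /* lanes of |v.a - u->c|^2   */

  /* Horizontal reductions over the three live lanes.  */
  double ta[4], tb[4], tc[4];
  _mm256_storeu_pd (ta, va);
  _mm256_storeu_pd (tb, vb);
  _mm256_storeu_pd (tc, vc);
  double a = ta[0] + ta[1] + ta[2];
  double b = 2.0 * (tb[0] + tb[1] + tb[2]);
  double c = tc[0] + tc[1] + tc[2] - u->d * u->d;

  double d = b * b - 4.0 * a * c;
  return d < 0.0 ? 0 : d;  /* same implicit double-to-int return as foo */
}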
This code requires SLP to originate from loads, which seems to be a bit more complicated than the currently implemented use-def scan (it will also need to reduce/extract scalars from the vectors at the end of the vector computation). I don't see any major obstacles here, but I don't currently plan to work on this. Another required feature is the ability to work on groups bigger than the vectorization factor, i.e., combining two of the statements in this example and leaving the third one scalar.

Ira
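For illustration, a minimal sketch of both missing pieces at the source level, assuming SSE2 (vector(2) double): two lanes of a three-element group are combined into one vector statement, the third stays scalar, and the vector part is reduced/extracted back to a scalar at the end. The function name is made up for the example:

#include <emmintrin.h>

static double
sum_of_squares_3 (const double *p)        /* p[0..2] = x, y, z */
{
  __m128d xy = _mm_loadu_pd (p);          /* { x, y }: two lanes grouped */
  __m128d sq = _mm_mul_pd (xy, xy);       /* { x*x, y*y }                */
  /* Reduce the vector back to a scalar: extract the high lane and add.  */
  __m128d hi = _mm_unpackhi_pd (sq, sq);  /* { y*y, y*y }                */
  double head = _mm_cvtsd_f64 (_mm_add_sd (sq, hi));
  return head + p[2] * p[2];              /* third element left scalar   */
}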
Confirmed.
So currently we indeed miss the "sinks":

t.i:29:6: note: === vect_analyze_data_ref_accesses ===
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 8B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 16B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 24B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 32B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 40B]
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->c.y
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->c.z
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->d
t.i:29:6: note: Detected interleaving load of size 6 starting with v$a$x_48 = MEM[(struct B *)&v];
t.i:29:6: note: Detected interleaving load of size 4 starting with _5 = u_43(D)->c.x;
t.i:29:6: note: not vectorized: no grouped stores in basic block.

Two classes of sinks are currently missing: reductions and vector CONSTRUCTORs. Parts of the testcase might be handled with reduction support.
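For reference, a minimal sketch of what the two missing sink kinds look like at the source level (the function names are illustrative, and the v2df typedef uses GCC's vector_size extension):

/* Reduction sink: BB SLP would have to be rooted at the summation that
   collapses the vector lanes back to a scalar.  */
double
reduction_sink (const double *a, const double *b)
{
  return a[0] * b[0] + a[1] * b[1];
}

/* CONSTRUCTOR sink: BB SLP would have to be rooted at the point where
   independent lane computations are packed into one vector value.  */
typedef double v2df __attribute__ ((vector_size (16)));

v2df
constructor_sink (const double *a, const double *b)
{
  return (v2df) { a[0] + b[0], a[1] + b[1] };
}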
We're almost there:

t2.c:22:5: note: Starting SLP discovery for
t2.c:22:5: note:   powmult_4 = v$b$z_53 * v$b$z_53;
t2.c:22:5: note:   powmult_1 = v$b$x_51 * v$b$x_51;
t2.c:22:5: note:   powmult_2 = v$b$y_52 * v$b$y_52;

but:

t2.c:22:5: note: vectype: vector(2) double
t2.c:22:5: note: nunits = 2
t2.c:22:5: missed: Build SLP failed: unrolling required in basic block SLP

and for reductions we do not try to split the group.
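A reduced form of the blocker, as I read the dump (dot3 is a made-up name): the discovery group has three multiplications but the vectype has only two lanes, so with group size 3 > nunits 2 the group would need to be split into one vector(2) pair plus a scalar tail, which is not attempted for reductions today:

/* Group of three squarings feeding a reduction; with V2DF the desired
   lowering is one vector multiply for p[0]/p[1] plus a scalar p[2] tail,
   as in the intrinsics sketch earlier in this report.  */
double
dot3 (const double *p)
{
  double a = p[0] * p[0];
  double b = p[1] * p[1];
  double c = p[2] * p[2];
  return a + b + c;
}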