Take:

void f(float *restrict a, float * restrict b, float * restrict c, float t)
{
  int i = 0 ;
  a[i] = b[i]/t;
  a[i+1] = b[i+1]/t;
  a[i+2] = c[i]/t;
  a[i+3] = c[i+1]/t;
}

Right now we do SLP once (at -O3) and produce:

f:
        dup     v2.2s, v0.s[0]
        ldr     d1, [x1]
        ldr     d0, [x2]
        fdiv    v1.2s, v1.2s, v2.2s
        fdiv    v0.2s, v0.2s, v2.2s
        stp     d1, d0, [x0]
        ret

But it might be better to do:

f:
        dup     v2.4s, v0.s[0]
        ldr     d0, [x1]
        ldr     d1, [x2]
        ins     v0.2d[1], v1.2d[0]
        fdiv    v0.4s, v0.4s, v2.4s
        str     q0, [x0]
        ret

Mainly because the divide is usually not pipelined, so one 4-lane fdiv should be cheaper than two 2-lane ones.
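To make the proposed sequence concrete, here is a minimal sketch of the same transformation written by hand with GCC/Clang generic vector extensions. The names v4sf, f_scalar, f_merged and check_variants_agree are made up for illustration, and the comments mapping statements to instructions are only the intended correspondence, not a promise about what the compiler will emit.

```c
#include <assert.h>
#include <string.h>

/* Four floats in one 128-bit vector (GCC/Clang extension).  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Scalar reference: the function from the report with i == 0.  */
static void
f_scalar (float *restrict a, float *restrict b, float *restrict c, float t)
{
  a[0] = b[0] / t;
  a[1] = b[1] / t;
  a[2] = c[0] / t;
  a[3] = c[1] / t;
}

/* Hand-merged variant: two 64-bit loads are concatenated into one
   128-bit vector so that only a single 4-lane divide is needed.  */
static void
f_merged (float *restrict a, float *restrict b, float *restrict c, float t)
{
  v4sf den = { t, t, t, t };                   /* like dup v2.4s  */
  v4sf num;

  memcpy (&num, b, 2 * sizeof (float));        /* low half from b  */
  memcpy ((char *) &num + 2 * sizeof (float),
          c, 2 * sizeof (float));              /* high half from c  */

  v4sf quot = num / den;                       /* one 4-lane divide  */
  memcpy (a, &quot, sizeof quot);              /* one 128-bit store  */
}

/* Check that both variants produce identical results.  */
static void
check_variants_agree (void)
{
  float b[2] = { 1.0f, 2.0f };
  float c[2] = { 3.0f, 4.0f };
  float x[4], y[4];

  f_scalar (x, b, c, 2.0f);
  f_merged (y, b, c, 2.0f);
  assert (memcmp (x, y, sizeof x) == 0);
}
```

Calling check_variants_agree () compares the two versions on a small input. The point of the sketch is only that the two load chains (b and c) end up in one vector before the divide.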

I think what is missing is merging of two "vectors", aka, permutations of different load chains:

      /* Grouped store or load. */
      if (STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
        {
          if (REFERENCE_CLASS_P (lhs))
            {
              /* Store. */
              ;
            }
          else
            {
              /* Load. */
              first_load = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt));
              if (prev_first_load)
                {
                  /* Check that there are no loads from different interleaving
                     chains in the same node. */
                  if (prev_first_load != first_load)
                    {
                      if (dump_enabled_p ())
                        {
                          dump_printf_loc (MSG_MISSED_OPTIMIZATION,
                                           vect_location,
                                           "Build SLP failed: different "
                                           "interleaving chains in one node ");
                          dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
                                            stmt, 0);
                        }
                      /* Mismatch. */
                      continue;

This is because we do not have a suitable way to represent those at the moment. So we split the store group and get the two-element vectorization.

As we don't have a good intermediate representation for SLP at the moment, we can't really perform post-detection "optimization" on the SLP tree.

unified autovect to the rescue...
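As a rough illustration of why the four-element group fails and the two-element groups succeed, here is a toy model of that check. It is not GCC code: struct toy_stmt, toy_same_load_chain and toy_pick_group_size are hypothetical names, chains are modelled as plain integer ids, and the "halve the group until every node passes" policy is an assumption made for the sketch rather than a description of how the vectorizer actually splits groups.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: one scalar statement of the store group, tagged with the
   id of the load chain it reads from (0 for b, 1 for c in the report).  */
struct toy_stmt
{
  int load_chain;
};

/* Return 1 if statements [first, first + n) all load from one chain,
   mirroring the prev_first_load != first_load mismatch test.  */
static int
toy_same_load_chain (const struct toy_stmt *stmts, size_t first, size_t n)
{
  int prev_first_load = -1;   /* -1 means "no load seen yet"  */

  for (size_t i = first; i < first + n; i++)
    {
      int first_load = stmts[i].load_chain;
      if (prev_first_load != -1 && prev_first_load != first_load)
        return 0;             /* different chains in one node  */
      prev_first_load = first_load;
    }
  return 1;
}

/* Assumed policy: halve the group size until every resulting node
   passes the check; return the size that worked (1 if none did).  */
static size_t
toy_pick_group_size (const struct toy_stmt *stmts, size_t n)
{
  for (size_t size = n; size > 1; size /= 2)
    {
      int ok = 1;
      for (size_t first = 0; first + size <= n; first += size)
        if (!toy_same_load_chain (stmts, first, size))
          ok = 0;
      if (ok)
        return size;
    }
  return 1;
}

/* The testcase from the report: a[0..1] read b, a[2..3] read c.  */
static void
toy_demo (void)
{
  struct toy_stmt stmts[4] = { { 0 }, { 0 }, { 1 }, { 1 } };

  assert (!toy_same_load_chain (stmts, 0, 4));   /* 4-wide node fails  */
  assert (toy_pick_group_size (stmts, 4) == 2);  /* 2-wide nodes pass  */
}
```

In this model the only way to get the four-wide result would be a node that can say "lanes 0-1 come from chain 0, lanes 2-3 from chain 1", which is the representation the comment above says is missing.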