void f(float *restrict a, float *restrict b, float *restrict c, float t)
{
  int i = 0;
  a[i] = b[i]/t;
  a[i+1] = b[i+1]/t;
  a[i+2] = c[i]/t;
  a[i+3] = c[i+1]/t;
}
Right now we do SLP once (at -O3) and produce:
dup v2.2s, v0.s[0]
ldr d1, [x1]
ldr d0, [x2]
fdiv v1.2s, v1.2s, v2.2s
fdiv v0.2s, v0.2s, v2.2s
stp d1, d0, [x0]
But it might be better to do:
dup v2.4s, v0.s[0]
ldr d0, [x1]
ldr d1, [x2]
ins v0.d[1], v1.d[0]
fdiv v0.4s, v0.4s, v2.4s
str q0, [x0]
Mainly because division is usually not pipelined, so two 2-lane fdivs execute back to back and one 4-lane fdiv should be cheaper.
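For reference, the desired code shape can be written by hand with the GCC/Clang generic vector extension. This is only a sketch: `f_merged` and `v4sf` are names made up here, and the asm in the comments is what I would hope to get, not verified compiler output.

```c
#include <string.h>

/* Four packed floats; GCC/Clang generic vector extension. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical hand-merged variant of f(): the two 2-element load
   chains are combined into one 4-lane vector so only one divide is
   needed.  */
void f_merged (float *restrict a, float *restrict b, float *restrict c,
               float t)
{
  v4sf num;
  v4sf den = { t, t, t, t };                    /* hoped for: dup v2.4s, v0.s[0] */

  /* Low half from b[0..1], high half from c[0..1].  */
  memcpy (&num, b, 2 * sizeof (float));         /* hoped for: ldr d0, [x1] */
  memcpy ((char *) &num + 2 * sizeof (float),
          c, 2 * sizeof (float));               /* hoped for: ldr d1, [x2]; ins */

  num = num / den;                              /* one 4-lane divide */

  memcpy (a, &num, sizeof num);                 /* hoped for: str q0, [x0] */
}
```

The result should match f() lane for lane, since each lane is still an independent IEEE division; only the grouping of the operations changes.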
I think what is missing is merging of two "vectors", i.e. permutations of different load chains:
      /* Grouped store or load.  */
      if (STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
        {
          if (REFERENCE_CLASS_P (lhs))
            {
              /* Store.  */
              ...
            }
          else
            {
              /* Load.  */
              first_load = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt));
              ...
              /* Check that there are no loads from different interleaving
                 chains in the same node.  */
              if (prev_first_load != first_load)
                {
                  if (dump_enabled_p ())
                    {
                      ...
                        "Build SLP failed: different "
                        "interleaving chains in one node ");
                      dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
                                        ...);
                    }
                  /* Mismatch.  */
                  ...
                }
            }
        }
This is because we do not have a suitable way to represent those at the
moment. So we split the store group and get the two-element vectorization.
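To make the effect of that check concrete, here is a toy model in C. It is not GCC code: `toy_stmt`, `toy_same_chain` and `toy_slp_group_size` are invented for illustration, and the halving loop is only a rough stand-in for how the store group ends up split.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one scalar statement "a[k] = chain[offset] / t".
   chain_id identifies the interleaving chain the load belongs to.  */
struct toy_stmt
{
  int chain_id;
  int offset;
};

/* True if every load in stmts[0..n) comes from the same chain; this
   mirrors the "different interleaving chains in one node" test.  */
bool toy_same_chain (const struct toy_stmt *stmts, size_t n)
{
  for (size_t i = 1; i < n; i++)
    if (stmts[i].chain_id != stmts[0].chain_id)
      return false;
  return true;
}

/* Halve the store group until every sub-group passes the check and
   return the resulting group size (1 means no vectorization).  */
size_t toy_slp_group_size (const struct toy_stmt *stmts, size_t n)
{
  for (size_t size = n; size > 1; size /= 2)
    {
      bool ok = true;
      for (size_t start = 0; start + size <= n; start += size)
        if (!toy_same_chain (stmts + start, size))
          ok = false;
      if (ok)
        return size;
    }
  return 1;
}
```

For the testcase above the four statements load from chains b, b, c, c, so the 4-wide group is rejected and the model settles on a group size of 2, which corresponds to the two 2-lane fdivs we currently emit.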
As we don't have a good intermediate representation for SLP at the moment
we can't really perform post-detection "optimization" on the SLP tree.
unified autovect to the rescue...