Take:

void f (float *restrict a, float *restrict b, float *restrict c, float t)
{
  int i = 0;
  a[i]   = b[i]   / t;
  a[i+1] = b[i+1] / t;
  a[i+2] = c[i]   / t;
  a[i+3] = c[i+1] / t;
}

Right now we do SLP once (at -O3) and produce:

f:
        dup     v2.2s, v0.s[0]
        ldr     d1, [x1]
        ldr     d0, [x2]
        fdiv    v1.2s, v1.2s, v2.2s
        fdiv    v0.2s, v0.2s, v2.2s
        stp     d1, d0, [x0]
        ret

But it might be better to do:

f:
        dup     v2.4s, v0.s[0]
        ldr     d0, [x1]
        ldr     d1, [x2]
        ins     v0.d[1], v1.d[0]
        fdiv    v0.4s, v0.4s, v2.4s
        str     q0, [x0]
        ret

mainly because two fdivs are usually not pipelined.
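For illustration, here is a hand-written NEON intrinsics sketch of that preferred sequence (not compiler output; the name f_merged is made up, and it assumes aarch64 with <arm_neon.h>):

#include <arm_neon.h>

/* Sketch: merge the two 64-bit load chains into one 128-bit vector
   and divide once, mirroring the desired assembly above.  */
void f_merged (float *restrict a, float *restrict b,
               float *restrict c, float t)
{
  float32x2_t lo = vld1_f32 (b);            /* ldr d0, [x1]         */
  float32x2_t hi = vld1_f32 (c);            /* ldr d1, [x2]         */
  float32x4_t v  = vcombine_f32 (lo, hi);   /* ins v0.d[1], v1.d[0] */
  float32x4_t tv = vdupq_n_f32 (t);         /* dup v2.4s, v0.s[0]   */
  vst1q_f32 (a, vdivq_f32 (v, tv));         /* fdiv + str q0        */
}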

I think what is missing is merging of two "vectors", aka permutations of different load chains.  The SLP build code currently rejects this case:

          /* Grouped store or load.  */
          if (STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
            {
              if (REFERENCE_CLASS_P (lhs))
                {
                  /* Store.  */
                  ;
                }
              else
                {
                  /* Load.  */
                  first_load = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt));
                  if (prev_first_load)
                    {
                      /* Check that there are no loads from different
                         interleaving chains in the same node.  */
                      if (prev_first_load != first_load)
                        {
                          if (dump_enabled_p ())
                            {
                              dump_printf_loc (MSG_MISSED_OPTIMIZATION,
                                               vect_location,
                                               "Build SLP failed: different "
                                               "interleaving chains in one node ");
                              dump_gimple_stmt (MSG_MISSED_OPTIMIZATION,
                                                TDF_SLIM, stmt, 0);
                            }
                          /* Mismatch.  */
                          continue;

This is because we do not have a suitable way to represent those at the moment, so we split the store group and get the two-element vectorization.  As we don't have a good intermediate representation for SLP at the moment we can't really perform post-detection "optimization" on the SLP tree.  Unified autovect to the rescue...
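To make the split concrete, the two-element vectorization we fall back to computes something like the following (again a hand-written sketch, hypothetical name f_split, aarch64 intrinsics):

#include <arm_neon.h>

/* Sketch: the store group split into two independent two-element
   chains, each with its own 64-bit divide.  */
void f_split (float *restrict a, float *restrict b,
              float *restrict c, float t)
{
  float32x2_t tv = vdup_n_f32 (t);                /* dup v2.2s, v0.s[0] */
  vst1_f32 (a,     vdiv_f32 (vld1_f32 (b), tv));  /* first load chain   */
  vst1_f32 (a + 2, vdiv_f32 (vld1_f32 (c), tv));  /* second load chain  */
}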

We do better now:

        ldp     s1, s3, [x1]
        dup     v0.4s, v0.s[0]
        ldr     s2, [x2, 4]
        ins     v1.s[1], v3.s[0]
        ld1     {v1.s}[2], [x2]
        ins     v1.s[3], v2.s[0]
        fdiv    v1.4s, v1.4s, v0.4s
        str     q1, [x0]

generated from this GIMPLE, which builds the vector from four scalar loads:

  _19 = {t_12(D), t_12(D), t_12(D), t_12(D)};
  _1 = *b_9(D);
  _3 = MEM[(float *)b_9(D) + 4B];
  _5 = *c_15(D);
  _7 = MEM[(float *)c_15(D) + 4B];
  _20 = {_1, _3, _5, _7};
  vect__2.3_18 = _20 / _19;
  MEM <vector(4) float> [(float *)a_11(D)] = vect__2.3_18;

But we still don't do the merging of the loads.
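What the missing merge would look like at the GIMPLE level is two vector loads feeding a permute instead of four scalar loads feeding a CONSTRUCTOR; a sketch using GNU C vector extensions (f_want is a made-up name; __builtin_shufflevector requires a recent GCC or clang):

typedef float v2sf __attribute__ ((vector_size (8)));
typedef float v4sf __attribute__ ((vector_size (16)));

/* Sketch: two 64-bit vector loads combined by a permute, which is
   what we'd like the SLP tree to be able to represent.  */
void f_want (float *restrict a, float *restrict b,
             float *restrict c, float t)
{
  v2sf lo, hi;
  __builtin_memcpy (&lo, b, sizeof lo);   /* vector load from chain 1 */
  __builtin_memcpy (&hi, c, sizeof hi);   /* vector load from chain 2 */
  v4sf v  = __builtin_shufflevector (lo, hi, 0, 1, 2, 3);
  v4sf tv = { t, t, t, t };
  v4sf r  = v / tv;
  __builtin_memcpy (a, &r, sizeof r);     /* vector store */
}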

I'm not sure it's "better" now.  While we merge the store, we now do not vectorize the loads at all.  Basically it's a trade-off between giving up when doing discovery on the whole store group and building sub-chains from scalars.  We do have heuristics in place that try to anticipate whether splitting the group would succeed - if you change 'float' to 'double' we'll go the splitting way, ending up with

        movupd  (%rsi), %xmm1
        unpcklpd        %xmm0, %xmm0
        divpd   %xmm0, %xmm1
        movups  %xmm1, (%rdi)
        movupd  (%rdx), %xmm1
        divpd   %xmm0, %xmm1
        movups  %xmm1, 16(%rdi)

which is quite optimal.  But for float we have

        movss   4(%rdx), %xmm1
        movss   (%rdx), %xmm2
        shufps  $0, %xmm0, %xmm0
        movss   4(%rsi), %xmm3
        unpcklps        %xmm1, %xmm2
        movss   (%rsi), %xmm1
        unpcklps        %xmm3, %xmm1
        movlhps %xmm2, %xmm1
        divps   %xmm0, %xmm1
        movups  %xmm1, (%rdi)

which might not be too bad - it avoids store-to-load-forwarding (STLF) issues compared to doing two 8-byte loads and combining them.  The take-away is that SLP could be better at handling the case of running into different interleaving chains (or into non-grouped loads).
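For reference, the 'double' variant mentioned above is just the test case with the element type changed:

void f (double *restrict a, double *restrict b, double *restrict c, double t)
{
  int i = 0;
  a[i]   = b[i]   / t;    /* with double, the group-splitting */
  a[i+1] = b[i+1] / t;    /* heuristic fires and we get the   */
  a[i+2] = c[i]   / t;    /* two-divpd sequence above         */
  a[i+3] = c[i+1] / t;
}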