Take a two-level loop nest:

    void foo(int8_t *__restrict__ A,
             int8_t *__restrict__ B,
             int32_t *__restrict__ sum,
             int n, int m)
    {
      for (int i = 0; i < n; ++i) {
        int8_t a = A[i];

        for (int j = 0; j < m; j++) {
          int8_t b = B[T_FN(j) + i];

          sum[j] += a * b;
        }
      }
    }

Suppose T_FN() is some pure mathematical function. Although GCC can vectorize the inner loop independently of the outer one when T_FN() has a simple form, the result is far from optimal. If we consider the loop nest as a whole and unroll the outer loop by an appropriate VF (for example, VF=8 for a 128-bit vectorization width), the accumulate statement in the inner loop fits the more compact dot-product pattern (the leftover epilogue loop is omitted):

    for (int i = 0; i < n; i += 8) {
      <vector(8) int8_t> v_a = LOAD<vector(8) int8_t>(&A[i]);

      for (int j = 0; j < m; j++) {
        <vector(8) int8_t> v_b = LOAD<vector(8) int8_t>(&B[T_FN(j) + i]);

        sum[j] += DOT_PROD(v_a * v_b);
      }
    }
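To make the proposed transformation concrete, here is a minimal scalar sketch in plain C of the original loop nest next to the outer-unrolled form, with the dot product written as an explicit loop over the 8 lanes. The names t_fn, ROW_STRIDE, foo_scalar and foo_outer_dot are assumptions for illustration only; in particular, T_FN(j) = j * ROW_STRIDE is just one possible pure function, and the sketch says nothing about what GCC actually emits.

```c
#include <assert.h>
#include <stdint.h>

#define VF 8           /* lanes per outer-loop step, as in the example above */
#define ROW_STRIDE 64  /* assumption: T_FN(j) = j * ROW_STRIDE */

/* Hypothetical stand-in for T_FN(): pure, depends only on j. */
static inline int t_fn(int j) { return j * ROW_STRIDE; }

/* The original scalar loop nest. */
void foo_scalar(const int8_t *restrict A, const int8_t *restrict B,
                int32_t *restrict sum, int n, int m)
{
    for (int i = 0; i < n; ++i) {
        int8_t a = A[i];
        for (int j = 0; j < m; j++) {
            int8_t b = B[t_fn(j) + i];
            sum[j] += a * b;
        }
    }
}

/* Outer loop stepped by VF; the k loop plays the role of DOT_PROD
   over one 8-element slice of A and the matching slice of B. */
void foo_outer_dot(const int8_t *restrict A, const int8_t *restrict B,
                   int32_t *restrict sum, int n, int m)
{
    assert(n % VF == 0);  /* the epilogue loop is omitted, as above */
    for (int i = 0; i < n; i += VF) {
        for (int j = 0; j < m; j++) {
            int32_t dot = 0;
            for (int k = 0; k < VF; k++)
                dot += A[i + k] * B[t_fn(j) + i + k];
            sum[j] += dot;  /* one accumulate per (i-block, j) pair */
        }
    }
}
```

Both functions add the same products into each sum[j]; only the grouping of the additions differs, which is harmless for integer arithmetic as long as nothing overflows.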
One of the implementation reasons is that we do not support niter peeling for the outer loop at this point. The other reason is that the access to sum[j] in the inner loop conflicts across different iterations of the outer loop, and we do not currently handle this situation specially. Then there's the issue that with a constant outer bound we apply unroll-and-jam, and that confuses us with a duplicate store to sum[j]. So the biggest roadblock is the set of restrictions on grouped accesses for outer-loop vectorization in vect_analyze_data_ref_access, which would have to be relaxed incrementally. IIRC the actual issues are subtle.
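As a rough illustration of the duplicate-store issue, the following C sketch shows what unroll-and-jam by a factor of 2 does to the loop nest at source level when the outer bound is a constant. The names N, ROW_STRIDE, foo_const and foo_jammed are hypothetical, T_FN(j) is assumed to be j * ROW_STRIDE, and the unroll factor is picked for readability; the form GCC produces internally may well differ.

```c
#include <stdint.h>

#define N 4            /* assumption: constant outer trip count */
#define ROW_STRIDE 64  /* assumption: T_FN(j) = j * ROW_STRIDE */

/* Loop nest with a constant outer bound. */
void foo_const(const int8_t *restrict A, const int8_t *restrict B,
               int32_t *restrict sum, int m)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < m; j++)
            sum[j] += A[i] * B[j * ROW_STRIDE + i];
}

/* Outer loop unrolled by 2 and the two inner-loop copies fused (jammed).
   The fused inner body now stores to sum[j] twice per iteration. */
void foo_jammed(const int8_t *restrict A, const int8_t *restrict B,
                int32_t *restrict sum, int m)
{
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < m; j++) {
            sum[j] += A[i]     * B[j * ROW_STRIDE + i];      /* first store  */
            sum[j] += A[i + 1] * B[j * ROW_STRIDE + i + 1];  /* second store */
        }
}
```

The jammed form computes the same result, since for each j the updates still happen in increasing i order, but the inner loop body now contains two stores to the same location.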