Consider the following code that uses views to implement a two dimensional `iota`: #include <array> #include <ranges> template<std::integral T> std::ranges::range auto iota2D(T xbound, T ybound) { auto fn = [=](T idx) { return std::tuple{idx / ybound, idx % ybound}; }; return std::views::iota(T{0}, xbound * ybound) | std::views::transform(fn); } constexpr std::size_t N = 20; std::array<std::array<int, N>, N> data; __attribute__((noinline)) void init1() { for (auto i : std::views::iota(size_t{}, N)) { for (auto j : std::views::iota(size_t{}, N)) { data[i][j] = 123; } } } __attribute__((noinline)) void init2() { for (auto [i, j] : iota2D(N,N)) { data[i][j] = 123; } } Using gcc 13.2 with -O3, we see that the code using a nested loop is nicely vectorized: init1(): movdqa xmm0, XMMWORD PTR .LC0[rip] mov eax, OFFSET FLAT:data .L2: movaps XMMWORD PTR [rax], xmm0 add rax, 80 movaps XMMWORD PTR [rax-64], xmm0 movaps XMMWORD PTR [rax-48], xmm0 movaps XMMWORD PTR [rax-32], xmm0 movaps XMMWORD PTR [rax-16], xmm0 cmp rax, OFFSET FLAT:data+1600 jne .L2 ret The code using iota2D is not vectorized: init2(): xor eax, eax .L6: mov DWORD PTR data[0+rax*4], 123 add rax, 1 cmp rax, 400 jne .L6 ret Although GCC 13 produces much higher quality assembly than previous versions, it fails to vectorize the loop.
Confirmed. Creating dr for MEM <struct array> [(value_type &)&data]._M_elems[_7]._M_elems[_8] analyze_innermost: t.C:23:24: missed: failed: evolution of offset is not affine. we have void init2 () { long unsigned int SR.100; long unsigned int ivtmp_1; long unsigned int _4; long unsigned int ivtmp_6; long unsigned int _7; long unsigned int _8; <bb 2> [local count: 10737416]: <bb 3> [local count: 1063004409]: # SR.100_13 = PHI <_4(5), 0(2)> # ivtmp_6 = PHI <ivtmp_1(5), 400(2)> _7 = SR.100_13 / 20; _8 = SR.100_13 % 20; MEM <struct array> [(value_type &)&data]._M_elems[_7]._M_elems[_8] = 123; _4 = SR.100_13 + 1; ivtmp_1 = ivtmp_6 - 1; if (ivtmp_1 != 0) goto <bb 5>; [99.00%] else goto <bb 4>; [1.00%] <bb 5> [local count: 1052374367]: goto <bb 3>; [100.00%] <bb 4> [local count: 10737416]: return; and the issue is the use of division/modulo for the array accesses, that is, that this C++ idiom results in a "flattened" loop instead of a nest of level two. GCC doesn't support "un-flattening" this. In general, when we can see that the number of iterations is so that unrolling the loop by VF will not cross dimensions we might be able to apply regular loop vectorization but we currently do not have code doing that.