Bug 111894 - Missed vectorization opportunity
Summary: Missed vectorization opportunity
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 13.2.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2023-10-20 13:57 UTC by Mark Bourgeault
Modified: 2023-10-23 09:00 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-10-23 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Bourgeault 2023-10-20 13:57:39 UTC
Consider the following code that uses views to implement a two dimensional `iota`:

    #include <array>
    #include <ranges>
    
    template<std::integral T>
    std::ranges::range auto iota2D(T xbound, T ybound) {
        auto fn = [=](T idx) { return std::tuple{idx / ybound, idx % ybound}; };
        return std::views::iota(T{0}, xbound * ybound) | std::views::transform(fn);
    }
    
    constexpr std::size_t N = 20;
    std::array<std::array<int, N>, N> data;
    
    __attribute__((noinline)) void init1() {
        for (auto i : std::views::iota(size_t{}, N)) {
            for (auto j : std::views::iota(size_t{}, N)) {
                data[i][j] = 123;
            }
        }
    }
    
    __attribute__((noinline)) void init2() {
        for (auto [i, j] : iota2D(N,N)) {
            data[i][j] = 123;
        }
    }

Using gcc 13.2 with -O3, we see that the code using a nested loop is nicely vectorized:

    init1():
            movdqa  xmm0, XMMWORD PTR .LC0[rip]
            mov     eax, OFFSET FLAT:data
    .L2:
            movaps  XMMWORD PTR [rax], xmm0
            add     rax, 80
            movaps  XMMWORD PTR [rax-64], xmm0
            movaps  XMMWORD PTR [rax-48], xmm0
            movaps  XMMWORD PTR [rax-32], xmm0
            movaps  XMMWORD PTR [rax-16], xmm0
            cmp     rax, OFFSET FLAT:data+1600
            jne     .L2
            ret

The code using iota2D is not vectorized:

    init2():
            xor     eax, eax
    .L6:
            mov     DWORD PTR data[0+rax*4], 123
            add     rax, 1
            cmp     rax, 400
            jne     .L6
            ret

Although GCC 13 produces much higher quality assembly than previous versions, it fails to vectorize the loop.
Comment 1 Richard Biener 2023-10-23 09:00:03 UTC
Confirmed.

Creating dr for MEM <struct array> [(value_type &)&data]._M_elems[_7]._M_elems[_8]
analyze_innermost: t.C:23:24: missed:  failed: evolution of offset is not affine.

we have

void init2 ()
{
  long unsigned int SR.100;
  long unsigned int ivtmp_1;
  long unsigned int _4;
  long unsigned int ivtmp_6;
  long unsigned int _7;
  long unsigned int _8;

  <bb 2> [local count: 10737416]:

  <bb 3> [local count: 1063004409]:
  # SR.100_13 = PHI <_4(5), 0(2)>
  # ivtmp_6 = PHI <ivtmp_1(5), 400(2)>
  _7 = SR.100_13 / 20;
  _8 = SR.100_13 % 20;
  MEM <struct array> [(value_type &)&data]._M_elems[_7]._M_elems[_8] = 123;
  _4 = SR.100_13 + 1;
  ivtmp_1 = ivtmp_6 - 1;
  if (ivtmp_1 != 0)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 5> [local count: 1052374367]:
  goto <bb 3>; [100.00%]

  <bb 4> [local count: 10737416]:
  return;

and the issue is the use of division/modulo for the array accesses, that is,
that this C++ idiom results in a "flattened" loop instead of a nest of
level two.

GCC doesn't support "un-flattening" this.  In general, when we can
see that the number of iterations is so that unrolling the loop
by VF will not cross dimensions we might be able to apply regular loop
vectorization but we currently do not have code doing that.