Bug 117738 - Failure to recognize dot-product pattern in inner loop
Summary: Failure to recognize dot-product pattern in inner loop
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2024-11-22 14:31 UTC by Feng Xue
Modified: 2024-11-25 07:56 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-11-25 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Feng Xue 2024-11-22 14:31:58 UTC
Take a two-level loop-nest:

  void foo(int8_t *__restrict__ A, int8_t *__restrict__ B, int32_t *__restrict__ sum, int n, int m)
  {
    for (int i = 0; i < n; ++i) {
      int8_t a = A[i];

      for (int j = 0; j < m; j++) {
        int8_t b = B[T_FN(j) + i];

        sum[j] += a * b;
      }
    }
  }

Suppose T_FN() is some kind of pure mathematical function. Now although gcc could vectorize inner loop independent of the outer one regarding simple form of T_FN(), the result is basically far from optimal. If we consider loop-nest as a whole, and unroll the outer loop by an appropriate VF(for example, let VF=8 for 128 bit-vectorization width), we could make accumulate statement of the inner loop fit into more compact dot-product pattern as: (leftover epilog loop is omitted)

    for (int i = 0; i < n; i += 8) {
      <vector(8) int8_t> v_a = LOAD<vector(8) int8_t>(&A[i]);

      for (int j = 0; j < m; j++) {
        <vector(8) int8_t> v_b = LOAD<vector(8) int8_t>(&B[T_FN(j) + i]); 

        sum[j] += DOT_PROD(v_a * v_b);
      }
    }
Comment 1 Richard Biener 2024-11-25 07:45:09 UTC
One of the implementation reasons is that we do not support niter peeling for the outer loop at this point.  The other reason is that sum[j] in the inner
loop conflicts with different iterations in the outer loop - we do not
currently handle this situation specially.
Then there's the issue that with constant outer bound
we apply unroll-and-jam and that confuses us with a duplicate store to
sum[j].

So the biggest roadblock is to incrementally relax the restrictions
on grouped accesses for outer loop vect in vect_analyze_data_ref_access.
IIRC the actual issues are subtle.