Bug 58902 - small matrix multiplication non vectorized
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.9.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2013-10-28 10:39 UTC by vincenzo Innocente
Modified: 2025-01-04 13:25 UTC

Description vincenzo Innocente 2013-10-28 10:39:55 UTC
In the following example, matmul and matmul2 are not vectorized, while the manually unrolled version (matmul3) is.
c++ -std=c++11 -Ofast -S m3x10.cc -march=corei7-avx -fopt-info-vec-all
gcc version 4.9.0 20131011 (experimental) [trunk revision 203426] (GCC) 

cat m3x10.cc
const int nrow=3;
alignas(32) double tmp[nrow][10];
alignas(32) double param[nrow];
alignas(32) double frame[10];

void matmul() {
  for (int j=0; j<nrow; ++j)
    for (int i=0; i<10; ++i)
      param[j] += tmp[j][i]*frame[i];
}

void matmul2() {
  for (int j=0; j<nrow; ++j) {
    double s=0;
    for (int i=0; i<10; ++i)
      s += tmp[j][i]*frame[i];
    param[j] = s;
  }
}

void matmul3() {
  for (int i=0; i<10; ++i) {
    param[0] += tmp[0][i]*frame[i];
    param[1] += tmp[1][i]*frame[i];
    param[2] += tmp[2][i]*frame[i];
  }
}

double vmul0() {
  double s=0;
  for (int i=0; i<10; ++i)
    s += tmp[0][i]*frame[i];
  return s;
}

double vmul1() {
  double s=0;
  for (int i=0; i<10; ++i)
    s += tmp[1][i]*frame[i];
  return s;
}
Comment 1 Tibor Győri 2025-01-04 13:25:39 UTC
I tested this with trunk (future GCC 15). To me it looks like the tree vectorizer still does not understand the loop nest and/or judges the vectorization to be unprofitable. However, both loops are fully unrolled and then end up getting at least somewhat vectorized by the SLP vectorizer.

The latest Clang appears to behave similarly: it fully unrolls the loops and then applies SLP vectorization.
The final version of the Intel Classic compiler also seems to favor this approach.
See https://godbolt.org/z/boWxWbYWz
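
For readers less familiar with the "fully unroll, then SLP" pattern, here is a rough hand-written sketch of the shape such code takes for a single row's dot product. It is not the actual output of GCC, Clang, or the Intel compiler. The names kCols, dot_loop, dot_unrolled, dot_variants_agree and assert_dot_variants_agree, as well as the grouping into four lanes plus a two-element tail, are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cmath>

constexpr int kCols = 10;  // same row length as in the report (hypothetical name)

// Reference: the plain reduction loop, in the style of vmul0 from the report.
double dot_loop(const double* row, const double* vec) {
    double s = 0;
    for (int i = 0; i < kCols; ++i)
        s += row[i] * vec[i];
    return s;
}

// Illustrative picture of "fully unroll, then SLP": the ten products are
// written as straight-line code and grouped into four independent partial
// sums, which is the kind of pattern an SLP vectorizer can pack into
// 4-wide vector operations. The grouping is an assumption, not compiler output.
// Regrouping changes the floating-point summation order, which a compiler may
// only do under flags such as -Ofast or -ffast-math.
double dot_unrolled(const double* row, const double* vec) {
    double lane0 = row[0] * vec[0] + row[4] * vec[4];
    double lane1 = row[1] * vec[1] + row[5] * vec[5];
    double lane2 = row[2] * vec[2] + row[6] * vec[6];
    double lane3 = row[3] * vec[3] + row[7] * vec[7];
    double tail  = row[8] * vec[8] + row[9] * vec[9];  // leftover elements
    return (lane0 + lane1) + (lane2 + lane3) + tail;
}

// Compares both variants on arbitrary test data, allowing for rounding
// differences caused by the changed summation order.
bool dot_variants_agree() {
    double row[kCols], vec[kCols];
    for (int i = 0; i < kCols; ++i) {
        row[i] = 0.5 * (i + 1);
        vec[i] = 1.0 / (i + 2);
    }
    return std::fabs(dot_loop(row, vec) - dot_unrolled(row, vec)) < 1e-12;
}

// Convenience wrapper that aborts if the two variants disagree.
void assert_dot_variants_agree() {
    assert(dot_variants_agree());
}
```

Calling assert_dot_variants_agree() from a small main is enough to check that the two forms compute the same value up to rounding. Whether a given compiler actually emits vector instructions for either form still has to be checked in the generated assembly or the -fopt-info-vec output.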

Would you agree that this issue has been resolved?