I have this simple double loop (used in a benchmark). The inner loop (sloop) is not vectorized when invoked inside the longer loop (dloop).

c++ -Ofast -c vectdloop.cc -ftree-vectorizer-verbose=7

vectdloop.cc:9: note: Profitability threshold = 6
vectdloop.cc:9: note: Profitability threshold is 6 loop iterations.
vectdloop.cc:9: note: LOOP VECTORIZED.
vectdloop.cc:7: note: vectorized 1 loops in function.
vectdloop.cc:20: note: not vectorized: unexpected loop form.
vectdloop.cc:16: note: vectorized 0 loops in function.

#include<cmath>

inline float fn(float x) { return 2.f*x+std::sqrt(x); }

void sloop(float * __restrict__ s, float const * __restrict__ xx) {
  const int ls=16;
  for (int j=0; j < ls; ++j) {
    s[j] = fn(xx[j]);
  }
}

int dloop(float yyy) {
  int niter = 100000;
  float x = 0.5f;
  yyy=0;
  const int ls=16;
  for (int i=0; i < niter; ++i) {
    float s[ls];
    float xx[ls];
    for (int j=0; j < ls; ++j) xx[j] =x+(5*(j&1));
    sloop(s,xx);
    // for (int j=0; j < ls; ++j) s[j] = fn(xx[j]);
    x += 1e-6f;
    for (int j=0; j < ls; ++j) yyy+=s[j];
  }
  if (yyy == 2.32132323232f) niter--;
  return niter;
}
All inner loops are completely unrolled, which eliminates the s array. We then end up with a single loop with two reductions, which cannot be vectorized right now.
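To make the comment above concrete, here is a rough hand-written sketch of what the outer loop looks like once the three inner loops are fully unrolled and the s and xx arrays disappear. It is not compiler output: the function name outer_after_unrolling and the temporaries even and odd are hypothetical, and the 16 unrolled additions are written as a short loop only to keep the listing small. What remains is one loop carrying two scalars, x and yyy, from one iteration to the next.

```cpp
#include <cassert>
#include <cmath>

inline float fn(float x) { return 2.f * x + std::sqrt(x); }

// Hypothetical hand-written approximation of dloop's outer loop after the
// inner loops have been fully unrolled. Not taken from a compiler dump.
float outer_after_unrolling(int niter) {
  assert(niter >= 0);
  float x = 0.5f;   // first loop-carried scalar
  float yyy = 0.f;  // second loop-carried scalar (the accumulated sum)
  for (int i = 0; i < niter; ++i) {
    // xx[j] = x + 5*(j&1) takes only two distinct values per iteration,
    // so s[j] is fn(x) for even j and fn(x + 5) for odd j.
    const float even = fn(x);
    const float odd = fn(x + 5.f);
    // Stands in for the 16 unrolled "yyy += s[j]" statements.
    for (int k = 0; k < 8; ++k) {
      yyy += even;
      yyy += odd;
    }
    x += 1e-6f;
  }
  return yyy;
}
```

Compiling a variant like this with the same flags as in the report may be a convenient way to check whether the vectorizer diagnostic depends on the call to sloop or only on the shape of the outer loop.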
There is no limitation on the number of reductions in vectorization. The problem here is a non-empty latch block. There are several existing PRs for similar problems: PR 33447, PR 28643.
Ira
Fixed in GCC 8.