In the following example I suspect that some sort of loop merging at -O3 prevents the optimization of the second inner loop in "bar".
Compare

c++ -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=7 -c vectHist.cpp -ffast-math
c++ -Wall -O3 -ftree-vectorize -ftree-vectorizer-verbose=7 -c vectHist.cpp -ffast-math

What I do not understand is that if (following the man page) I compare -O2 and -O3 with

gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
> -fgcse-after-reload [enabled]
> -finline-functions [enabled]
> -fipa-cp-clone [enabled]
> -fpredictive-commoning [enabled]
> -ftree-loop-distribute-patterns [enabled]
> -ftree-vectorize [enabled]
> -funswitch-loops [enabled]

and add all of these options to -O2 by hand, I still get

c++ -std=gnu++0x -DNDEBUG -Wall -O2 -ftree-vectorize -msse4 -fvisibility-inlines-hidden -ftree-vectorizer-verbose=2 --param vect-max-version-for-alias-checks=30 -funsafe-loop-optimizations -ftree-loop-distribution -ftree-loop-if-convert-stores -fipa-pta -Wunsafe-loop-optimizations -fgcse-sm -fgcse-las -c vectHist.cpp -ffast-math -funswitch-loops -ftree-loop-distribute-patterns -fpredictive-commoning -finline-functions -fipa-cp-clone -fgcse-after-reload

vectHist.cpp:17: note: not vectorized: data ref analysis failed x_5 = co[D.4986_4];
vectHist.cpp:16: note: vectorized 0 loops in function.
vectHist.cpp:35: note: not vectorized: data ref analysis failed D.4977_30 = hist[D.4976_29];
vectHist.cpp:33: note: LOOP VECTORIZED.
vectHist.cpp:31: note: not vectorized: data ref analysis failed D.4957_13 = co[D.4956_12];
vectHist.cpp:25: note: vectorized 1 loops in function.
while changing just -O2 to -O3 (which at this point should not really have any effect, as I added all the options by hand) does not vectorize:

c++ -std=gnu++0x -DNDEBUG -Wall -O3 -mavx -ftree-vectorize -msse4 -fvisibility-inlines-hidden -ftree-vectorizer-verbose=2 --param vect-max-version-for-alias-checks=30 -funsafe-loop-optimizations -ftree-loop-distribution -ftree-loop-if-convert-stores -fipa-pta -Wunsafe-loop-optimizations -fgcse-sm -fgcse-las -c vectHist.cpp -ffast-math -funswitch-loops -ftree-loop-distribute-patterns -fpredictive-commoning -finline-functions -fipa-cp-clone -fgcse-after-reload

vectHist.cpp:17: note: not vectorized: data ref analysis failed x_5 = co[D.5125_4];
vectHist.cpp:17: note: not vectorized: data ref analysis failed x_5 = co[D.5125_4];
vectHist.cpp:16: note: vectorized 0 loops in function.
vectHist.cpp:30: note: not vectorized: data ref analysis failed D.5096_55 = co[D.5095_54];
vectHist.cpp:30: note: not vectorized: data ref analysis failed D.5096_55 = co[D.5095_54];
vectHist.cpp:25: note: vectorized 0 loops in function.
Note how it does not report anything about the loops at lines 31, 33 and 35.

---------------------------
// a classroom example
#include<cmath>

const int N=1024;

float __attribute__ ((aligned(16))) a[N];
float __attribute__ ((aligned(16))) b[N];
float __attribute__ ((aligned(16))) c[N];
float __attribute__ ((aligned(16))) d[N];
int __attribute__ ((aligned(16))) k[N];

float __attribute__ ((aligned(16))) co[12];
float __attribute__ ((aligned(16))) hist[100];

// do not expect GCC to vectorize (yet)
void foo() {
  for (int i=0; i!=N; ++i) {
    float x = co[k[i]];
    float y = a[i]/std::sqrt(x*b[i]);
    ++hist[int(y)];
  }
}

// let's give it a hand: split the loop so that the "heavy duty one" vectorizes
void bar() {
  const int S=8;
  int loops = N/S;
  float x[S];
  float y[S];
  for (int j=0; j!=loops; ++j) {
    for (int i=0; i!=S; ++i)
      x[i] = co[k[j+i]];
    for (int i=0; i!=S; ++i) // this should vectorize
      y[i] = a[j+i]/std::sqrt(x[i]*b[j+i]);
    for (int i=0; i!=S; ++i)
      ++hist[int(y[i])];
  }
}
It may be a duplicate of my own PR49730, as

void bar2(int jj) {
  const int S=8;
  float x[S];
  float y[S];
  int j = jj*S;
  for (int i=0; i!=S; ++i)
    x[i] = co[k[j+i]];
  for (int i=0; i!=S; ++i) // this should vectorize
    y[i] = a[j+i]/std::sqrt(x[i]*b[j+i]);
  for (int i=0; i!=S; ++i)
    ++hist[int(y[i])];
}

vectorizes at -O3.
(Of course, in the example I submitted previously the outer loop should read

  for (int jj=0; jj!=loops; ++jj) {
    int j = jj*S;
)
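For reference, here is a minimal sketch of the hand-split loop with the corrected outer index (j = jj*S), checked against the fused loop. The names fill_inputs, clear_hist and same_histogram, the constant NBINS, and the input data are hypothetical, chosen only so that every y lands exactly halfway between two integers and the binning does not depend on rounding. It says nothing about what the vectorizer does with either version.

```cpp
#include <cassert>
#include <cmath>

const int N = 1024;     // array length, as in the report
const int S = 8;        // block length of the hand-split version
const int NBINS = 100;  // histogram size, as in the report

float a[N], b[N];
int k[N];
float co[12];
float hist[NBINS];

// Single fused loop, as in foo() from the report.
void foo() {
  for (int i = 0; i != N; ++i) {
    float x = co[k[i]];
    float y = a[i] / std::sqrt(x * b[i]);
    int bin = int(y);
    assert(bin >= 0 && bin < NBINS);  // guard the histogram index
    ++hist[bin];
  }
}

// Hand-split version with the corrected outer loop:
// block jj covers elements j .. j+S-1, where j = jj*S.
void bar() {
  const int loops = N / S;
  float x[S];
  float y[S];
  for (int jj = 0; jj != loops; ++jj) {
    const int j = jj * S;
    for (int i = 0; i != S; ++i) x[i] = co[k[j + i]];
    for (int i = 0; i != S; ++i) y[i] = a[j + i] / std::sqrt(x[i] * b[j + i]);
    for (int i = 0; i != S; ++i) {
      int bin = int(y[i]);
      assert(bin >= 0 && bin < NBINS);
      ++hist[bin];
    }
  }
}

// Hypothetical inputs (not from the report): co*b is always a power of 4,
// so the square root is an exact power of 2 and y equals (i % NBINS) + 0.5.
void fill_inputs() {
  for (int i = 0; i != 12; ++i) co[i] = float(1 << (2 * (i % 3)));  // 1, 4, 16
  for (int i = 0; i != N; ++i) {
    k[i] = i % 12;
    b[i] = float(1 << (2 * (i % 4)));  // 1, 4, 16, 64
    float root = std::sqrt(co[k[i]] * b[i]);
    a[i] = root * (float(i % NBINS) + 0.5f);
  }
}

void clear_hist() {
  for (int i = 0; i != NBINS; ++i) hist[i] = 0.0f;
}

// Returns true when foo() and bar() fill the histogram identically.
bool same_histogram() {
  float ref[NBINS];
  fill_inputs();
  clear_hist();
  foo();
  for (int i = 0; i != NBINS; ++i) ref[i] = hist[i];
  clear_hist();
  bar();
  for (int i = 0; i != NBINS; ++i)
    if (ref[i] != hist[i]) return false;
  return true;
}
```

Calling same_histogram() from a small driver is enough to confirm that the split only reorders the work. With the uncorrected index (j+i with j counting blocks) the two versions would read different elements.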
The loop is likely completely unrolled; you can disable that with --param max-completely-peel-times=1. I think scalar-code vectorization does not handle this right now because the temporary arrays that would help it have store-motion applied (and should be later optimized away, but are not).
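As a per-loop alternative to the global --param, newer GCC releases (8 and later, so not the compiler discussed in this report) document "#pragma GCC unroll n", where the values 0 and 1 block unrolling of the loop that follows. The sketch below applies it to the "heavy duty" inner loop; the function name heavy_block and its pointer interface are hypothetical. Whether the vectorizer then picks the loop up depends on the compiler version and flags and has not been checked here. Compilers that do not know the pragma will ignore it, possibly with a warning.

```cpp
#include <cassert>
#include <cmath>

const int S = 8;  // block length, as in the report

// Hypothetical helper: the second inner loop of bar(), operating on one
// block of S elements. x holds the values gathered from co[] beforehand.
void heavy_block(const float* a, const float* b, const float* x, float* y) {
  assert(a && b && x && y);
  // Ask GCC (8 or later) not to unroll the loop that follows.
#pragma GCC unroll 1
  for (int i = 0; i != S; ++i)
    y[i] = a[i] / std::sqrt(x[i] * b[i]);
}
```

Unlike the --param, the pragma affects only the annotated loop, so other loops in the translation unit keep the default unrolling behaviour.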
Thanks Richard,
--param max-completely-peel-times=1 does the trick and, in my real-life example, does not have any adverse effect elsewhere, while it speeds up the loop as expected.
More generally, do you think that GCC will ever be able to transform things like foo into bar by itself?
(In reply to comment #3)
> Thanks Richard,
> --param max-completely-peel-times=1
> does the trick and, in my real life example, does not have any adverse effect
> elsewhere
> while it speeds up the loop as expected.
> More in general,
> Do you think that GCC will ever be able to transform things like foo into bar
> by itself?

I hope so ;) The graphite framework is supposed to provide us with this kind of feature.