Vectorizer fails to handle this:
----------------------------------------------------
#define align(x) __attribute__((align(x)))
typedef float align(16) MATRIX[3][3];

void RotateMatrix(MATRIX ret, MATRIX a, MATRIX b)
{
  int i, j;

  for (j = 0; j < 3; j++)
    for (i = 0; i < 3; i++)
      ret[j][i] = a[j][0] * b[0][i]
                + a[j][1] * b[1][i]
                + a[j][2] * b[2][i];
}
----------------------------------------------------
loop at bench.cc:33: not vectorized: unsupported scalar cycle.
loop at bench.cc:33: bad scalar cycle.
Confirmed. ICC can do this but does not, because it is not very efficient to do so.
We now get:
t3.c:9: note: not vectorized: can't determine dependence between: (*D.1338_16)[0] and (*D.1336_10)[i_53]
Oh, the issue here is that a, b, and ret could all point to the same array, because the parameter type is (float[3])*, i.e. arraryptr in:
typedef float array[3];
typedef array *arraryptr;
If we change ret, a, and b to be global variables, then the loop could be vectorized except for:
t.c:11: note: not vectorized: iteration count too small.
t.c:11: note: bad operation or unsupported loop bound.
t.c:11: note: vectorized 0 loops in function.
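To make the "global variables" variant concrete, here is a minimal sketch of what it might look like. The names g_ret, g_a, g_b, RotateMatrixGlobal and rotate_identity_check are hypothetical and not from the original test case, and the align attribute is left out for brevity. Because the three arrays are distinct objects, the compiler can see that they do not overlap; the 3-iteration trip count is unchanged, which is what the "iteration count too small" note is about.

```c
#include <assert.h>

typedef float MATRIX[3][3];

/* Hypothetical globals replacing the ret/a/b parameters of comment #0.
   As distinct objects they cannot alias each other. */
MATRIX g_ret, g_a, g_b;

void RotateMatrixGlobal(void)
{
  int i, j;

  for (j = 0; j < 3; j++)
    for (i = 0; i < 3; i++)
      g_ret[j][i] = g_a[j][0] * g_b[0][i]
                  + g_a[j][1] * g_b[1][i]
                  + g_a[j][2] * g_b[2][i];
}

/* Illustrative self-check (not part of the bug report):
   multiplying by the identity must give g_b back unchanged. */
void rotate_identity_check(void)
{
  int i, j;

  for (j = 0; j < 3; j++)
    for (i = 0; i < 3; i++) {
      g_a[j][i] = (i == j) ? 1.0f : 0.0f;
      g_b[j][i] = (float)(3 * j + i + 1);
    }

  RotateMatrixGlobal();

  for (j = 0; j < 3; j++)
    for (i = 0; i < 3; i++)
      assert(g_ret[j][i] == g_b[j][i]);
}
```

This only illustrates the aliasing point; whether a given GCC version vectorizes it still depends on the trip-count and cost-model checks quoted above.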
Test case of comment #0 is not vectorized in recent GCC:
 1 #define align(x) __attribute__((align(x)))
 2 typedef float align(16) MATRIX[3][3];
 3
 4 void RotateMatrix(MATRIX ret, MATRIX a, MATRIX b)
 5 {
 6   int i, j;
 7
 8   for (j = 0; j < 3; j++)
 9     for (i = 0; i < 3; i++)
10       ret[j][i] = a[j][0] * b[0][i]
11                 + a[j][1] * b[1][i]
12                 + a[j][2] * b[2][i];
13 }
t.c:8: note: not vectorized: loop contains function calls or data references that cannot be analyzed
t.c:8: note: bad data references.
t.c:4: note: vectorized 0 loops in function.
"GCC: (GNU) 4.6.0 20110312 (experimental) [trunk revision 170907]"
The initial testcase is probably a bad example (3x3 matrix). The following testcase is borrowed from Polyhedron rnflow and is vectorized by ICC but not by GCC (the ICC variant is 15% faster):
      function trs2a2 (j, k, u, d, m)
      real, dimension (1:m,1:m) :: trs2a2
      real, dimension (1:m,1:m) :: u, d
      integer, intent (in) :: j, k, m
      real (kind = selected_real_kind (10,50)) :: dtmp
      trs2a2 = 0.0
      do iclw1 = j, k - 1
         do iclw2 = j, k - 1
            dtmp = 0.0d0
            do iclww = j, k - 1
               dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2)
            enddo
            trs2a2 (iclw1, iclw2) = dtmp
         enddo
      enddo
      return
      end function trs2a2
The reason GCC cannot vectorize this is that the load from U has a non-constant stride, so vectorization would need to load two scalars and build up a vector (ICC does that). If the stride were constant but not a power of two, GCC would reject that as well, probably so as not to confuse the interleaving code. Data dependence analysis also rejects non-constant strides. A further complication (for the cost model) is that the accumulator is of type double while the data is of type float. ICC uses only half of the float vectors here to handle mixed float/double type loops (but it still unrolls the loop).
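As an illustration of "load two scalars and build up a vector", here is a hand-written C sketch of the inner reduction. It is not what GCC or ICC actually emits. The function name strided_dot and its parameters are hypothetical, the 2-lane width is just taken from the "two scalars" wording, and plain arrays stand in for vector registers. The strided operand plays the role of u (iclw1, iclww), whose stride is the run-time value m in column-major storage; the contiguous operand plays the role of d (iclww, iclw2).

```c
#include <assert.h>
#include <stddef.h>

/* Sum of u[k * stride] * d[k] for k in [0, n), accumulated in double.
   stride is only known at run time, like m in the Fortran test case. */
double strided_dot(const float *u, size_t stride, const float *d, size_t n)
{
  double acc[2] = { 0.0, 0.0 };   /* two accumulator lanes */
  double tail = 0.0;
  size_t k = 0;

  assert(stride > 0);

  for (; k + 2 <= n; k += 2) {
    /* Strided operand: two scalar loads assembled into one 2-lane "vector". */
    float uv[2] = { u[k * stride], u[(k + 1) * stride] };
    /* Contiguous operand: a real vectorizer would use one vector load. */
    float dv[2] = { d[k], d[k + 1] };

    /* float multiply, then widen to double before accumulating,
       mirroring the float data / double accumulator mix. */
    acc[0] += (double)(uv[0] * dv[0]);
    acc[1] += (double)(uv[1] * dv[1]);
  }

  /* Scalar epilogue for an odd trip count. */
  if (k < n)
    tail = (double)(u[k * stride] * d[k]);

  /* Final reduction across the lanes. */
  return acc[0] + acc[1] + tail;
}
```

Note that the per-lane accumulation changes the order of the floating-point additions compared with the scalar loop, so a compiler generally needs relaxed floating-point semantics before it may do this on its own.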
Author: matz
Date: Tue Apr 17 13:54:26 2012
New Revision: 186530

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=186530
Log:
        PR tree-optimization/18437
        * tree-vectorizer.h (_stmt_vec_info.stride_load_p): New member.
        (STMT_VINFO_STRIDE_LOAD_P): New accessor.
        (vect_check_strided_load): Declare.
        * tree-vect-data-refs.c (vect_check_strided_load): New function.
        (vect_analyze_data_refs): Use it to accept strided loads.
        * tree-vect-stmts.c (vectorizable_load): Ditto and handle them.

testsuite/
        * gfortran.dg/vect/rnflow-trs2a2.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/vect/rnflow-trs2a2.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-stmts.c
    trunk/gcc/tree-vectorizer.h
Author: rguenth
Date: Wed May 9 12:59:46 2012
New Revision: 187330

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187330
Log:
2012-05-09  Richard Guenther  <rguenther@suse.de>

        PR tree-optimization/18437
        * gfortran.dg/vect/rnflow-trs2a2.f90: Move ...
        * gfortran.dg/vect/fast-math-rnflow-trs2a2.f90: ... here.

Added:
    trunk/gcc/testsuite/gfortran.dg/vect/fast-math-rnflow-trs2a2.f90
      - copied unchanged from r187329, trunk/gcc/testsuite/gfortran.dg/vect/rnflow-trs2a2.f90
Removed:
    trunk/gcc/testsuite/gfortran.dg/vect/rnflow-trs2a2.f90
Modified:
    trunk/gcc/testsuite/ChangeLog
Link to vectorizer missed-optimization meta-bug.
For the original testcase in comment #0, with `-O3 -fno-vect-cost-model` GCC can vectorize it on aarch64 but not on x86_64.
(In reply to Andrew Pinski from comment #9)
> For the original testcase in comment #0, with `-O3 -fno-vect-cost-model` GCC
> can vectorize it on aarch64 but not on x86_64.

I should say starting in GCC 6.