The following test case does not vectorize:

  integer, parameter :: n = 1000000
  integer :: i, j, k
  real(8) :: pi, sum1, sum2, theta, phi, sini, cosi, dotp
  real(8) :: a(3), b(9,3), c(3)
  pi = acos(-1.0d0)
  theta = pi/9.0d0
  phi = pi/4.5d0
  do k = 1, 9
     b(k,1) = 0.5d0*cos(k*phi)*sin(k*theta)
     b(k,2) = 0.5d0*sin(k*phi)*sin(k*theta)
     b(k,3) = 0.5d0*cos(k*theta)
  end do
  theta = pi/real(n,kind=8)
  sum2 = 0.0
  do i = 1, n
     sini = sin(i*theta)
     cosi = cos(i*theta)
     phi = pi/4.5d0
     sum1 = 0.0d0
     do j = 1, 9
        c(1) = 0.5d0*cos(j*phi)*sini
        c(2) = 0.5d0*sin(j*phi)*sini
        c(3) = 0.5d0*cosi
        do k = 1, 9
           a(1) = b(k,1) - c(1)
           a(2) = b(k,2) - c(2)
           a(3) = b(k,3) - c(3)
           ! a = b(k,:) - c
           dotp = a(1)*a(1) + a(2)*a(2) + a(3)*a(3)
           ! dotp = dot_product(a,a)
           sum1 = sum1 + dotp
        end do
     end do
     sum2 = sum2 + sum1/81.0d0
  end do
  print *, 3.0d0*sum2/(4.0d0*pi*real(n,kind=8))
end

[ibook-dhum] bug/timing% gfc -O3 -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorizer-verbose=2 test_vect.f90
test_vect.f90:20: note: not vectorized: unsupported data-type complex(kind=8)
test_vect.f90:8: note: not vectorized: unsupported data-type complex(kind=8)
test_vect.f90:1: note: vectorized 0 loops in function.

while it vectorizes for

        do k = 1, 9
           ! a(1) = b(k,1) - c(1)
           ! a(2) = b(k,2) - c(2)
           ! a(3) = b(k,3) - c(3)
           a = b(k,:) - c
           dotp = a(1)*a(1) + a(2)*a(2) + a(3)*a(3)
           ! dotp = dot_product(a,a)
           sum1 = sum1 + dotp
        end do

[ibook-dhum] bug/timing% gfc -O3 -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorizer-verbose=2 test_vect.f90
test_vect.f90:24: note: LOOP VECTORIZED.
test_vect.f90:8: note: not vectorized: unsupported data-type complex(kind=8)
test_vect.f90:1: note: vectorized 1 loops in function.

(See http://gcc.gnu.org/ml/gcc-patches/2008-05/msg00033.html and PR34265.)

Note that neither variant vectorizes on powerpc-apple-darwin9.
With `a = b(k,:) - c` manually unrolled, the loop over k is unrolled by the early loop unrolling pass, which exposes the unvectorizable calls to sin/cos (respectively, the complex temporaries introduced by the sincos pass) to the vectorizer, which then punts. The early unroller at -O3 is only limited by the maximum final loop size and the trip count (400 and 8), and the unroller estimates

Loop 4 iterates 8 times.
Loop size: 40
Estimated size after unrolling: 216

SLP also doesn't handle vectorization of register operations but needs memory source and destination operands(?). Likewise, SLP shouldn't be confused by unvectorizable data types?

On x86_64 you can reproduce the missed vectorization with -O3 -ffast-math.

<bb 7>:
  # ivtmp.40_261 = PHI <9(6), ivtmp.40_240(8)>
  # sum1_5 = PHI <0.0(6), sum1_90(8)>
  # j_2 = PHI <1(6), j_94(8)>
  D.1032_55 = (real(kind=8)) j_2;
  D.1033_56 = D.1032_55 * 6.9813170079773179121929160828585736453533172607421875e-1;
  sincostmp.16_28 = __builtin_cexpi (D.1033_56);
  D.1034_57 = REALPART_EXPR <sincostmp.16_28>;
  D.1035_58 = sini_48 * D.1034_57;
  D.1036_59 = D.1035_58 * 5.0e-1;
  D.1037_62 = IMAGPART_EXPR <sincostmp.16_28>;
  D.1038_63 = sini_48 * D.1037_62;
  D.1039_64 = D.1038_63 * 5.0e-1;
  D.1044_128 = pretmp.30_150 - D.1036_59;
  D.1047_132 = pretmp.30_154 - D.1039_64;
  D.1052_137 = __builtin_pow (D.1044_128, 2.0e+0);
  D.1054_138 = __builtin_pow (D.1047_132, 2.0e+0);
  D.1044_149 = pretmp.30_168 - D.1036_59;
  D.1047_153 = pretmp.30_172 - D.1039_64;
  D.1052_158 = __builtin_pow (D.1044_149, 2.0e+0);
  D.1054_159 = __builtin_pow (D.1047_153, 2.0e+0);
  D.1044_170 = pretmp.30_188 - D.1036_59;
  D.1047_174 = pretmp.30_192 - D.1039_64;
  D.1052_179 = __builtin_pow (D.1044_170, 2.0e+0);
  D.1054_180 = __builtin_pow (D.1047_174, 2.0e+0);
  D.1044_191 = pretmp.30_206 - D.1036_59;
  D.1047_195 = pretmp.30_210 - D.1039_64;
  D.1052_200 = __builtin_pow (D.1044_191, 2.0e+0);
  D.1054_201 = __builtin_pow (D.1047_195, 2.0e+0);
  D.1044_212 = pretmp.30_218 - D.1036_59;
  D.1047_216 = pretmp.30_230 - D.1039_64;
  D.1052_221 = __builtin_pow (D.1044_212, 2.0e+0);
  D.1054_222 = __builtin_pow (D.1047_216, 2.0e+0);
  D.1044_233 = pretmp.30_238 - D.1036_59;
  D.1047_237 = pretmp.30_248 - D.1039_64;
  D.1052_242 = __builtin_pow (D.1044_233, 2.0e+0);
  D.1054_243 = __builtin_pow (D.1047_237, 2.0e+0);
  D.1044_254 = pretmp.30_256 - D.1036_59;
  D.1047_258 = pretmp.30_260 - D.1039_64;
  D.1052_263 = __builtin_pow (D.1044_254, 2.0e+0);
  D.1054_264 = __builtin_pow (D.1047_258, 2.0e+0);
  D.1044_275 = pretmp.30_276 - D.1036_59;
  D.1047_279 = pretmp.30_280 - D.1039_64;
  D.1052_284 = __builtin_pow (D.1044_275, 2.0e+0);
  D.1054_285 = __builtin_pow (D.1047_279, 2.0e+0);
  D.1044_71 = pretmp.30_68 - D.1036_59;
  D.1047_76 = pretmp.30_73 - D.1039_64;
  D.1052_83 = __builtin_pow (D.1044_71, 2.0e+0);
  D.1054_85 = __builtin_pow (D.1047_76, 2.0e+0);
  D.1055_86 = D.1054_85 + D.1052_83;
  dotp_89 = D.1055_86 + pretmp.33_294;
  dotp_288 = dotp_89 + D.1052_137;
  D.1055_286 = dotp_288 + D.1054_138;
  sum1_289 = pretmp.33_249 + D.1055_286;
  dotp_267 = sum1_289 + D.1052_158;
  D.1055_265 = dotp_267 + D.1054_159;
  sum1_268 = pretmp.33_228 + D.1055_265;
  dotp_246 = sum1_268 + D.1052_179;
  D.1055_244 = dotp_246 + D.1054_180;
  sum1_247 = pretmp.33_207 + D.1055_244;
  dotp_225 = sum1_247 + D.1052_200;
  D.1055_223 = dotp_225 + D.1054_201;
  sum1_226 = pretmp.33_186 + D.1055_223;
  dotp_204 = sum1_226 + D.1052_221;
  D.1055_202 = dotp_204 + D.1054_222;
  sum1_205 = pretmp.33_165 + D.1055_202;
  dotp_183 = sum1_205 + D.1052_242;
  D.1055_181 = dotp_183 + D.1054_243;
  sum1_184 = pretmp.33_144 + D.1055_181;
  dotp_162 = sum1_184 + D.1052_263;
  D.1055_160 = dotp_162 + D.1054_264;
  sum1_163 = pretmp.33_123 + D.1055_160;
  dotp_141 = sum1_163 + D.1052_284;
  D.1055_139 = dotp_141 + D.1054_285;
  sum1_142 = D.1055_139 + pretmp.33_292;
  sum1_90 = sum1_142 + sum1_5;
  j_94 = j_2 + 1;
  ivtmp.40_240 = ivtmp.40_261 - 1;
  if (ivtmp.40_240 == 0)
    goto <bb 9>;
  else
    goto <bb 8>;
The vectorizer doesn't know how to vectorize __builtin_cexpi or {REAL,IMAG}PART_EXPR either.

IMHO, rather than somehow tweaking the early unroller, the vectorizer should know how to deal with complex types.
(In reply to comment #1)
> SLP also doesn't handle vectorization of register operations but needs
> memory source and destination operands(?).

Right.

> Likewise SLP shouldn't be confused by unvectorizable data types?

SLP does get confused by unvectorizable data types.

Ira
(In reply to comment #2)
> The vectorizer doesn't know to vectorize __builtin_cexpi or
> {REAL,IMAG}PART_EXPR either.
>
> IMHO rather than somehow tweaking the early unroller the vectorizer should
> know how to deal with complex types.

There already exists a missed-optimization PR for vectorization of {REAL,IMAG}PART_EXPR: PR35252.

Ira
I just noticed today that the vectorization of the variant induct.v2.f90 depends on the -m64 flag:

[ibook-dhum] source/dir_indu% gfc -m64 -O3 -ffast-math -funroll-loops -ftree-vectorizer-verbose=2 indu.v2.f90
...
indu.v2.f90:2322: note: not vectorized: unsupported use in stmt.
indu.v2.f90:2245: note: not vectorized: unsupported unaligned store.
indu.v2.f90:2244: note: vectorizing stmts using SLP.
indu.v2.f90:2244: note: LOOP VECTORIZED.
indu.v2.f90:2146: note: not vectorized: unsupported use in stmt.
indu.v2.f90:2069: note: not vectorized: unsupported unaligned store.
indu.v2.f90:2068: note: vectorizing stmts using SLP.
indu.v2.f90:2068: note: LOOP VECTORIZED.
indu.v2.f90:1976: note: not vectorized: complicated access pattern.
indu.v2.f90:1875: note: vectorized 2 loops in function.
indu.v2.f90:1816: note: not vectorized: unsupported use in stmt.
indu.v2.f90:1771: note: not vectorized: unsupported unaligned store.
indu.v2.f90:1770: note: vectorizing stmts using SLP.
indu.v2.f90:1770: note: LOOP VECTORIZED.
indu.v2.f90:1682: note: not vectorized: unsupported use in stmt.
indu.v2.f90:1633: note: not vectorized: unsupported unaligned store.
indu.v2.f90:1632: note: vectorizing stmts using SLP.
indu.v2.f90:1632: note: LOOP VECTORIZED.
indu.v2.f90:1543: note: not vectorized: complicated access pattern.
indu.v2.f90:1441: note: vectorized 2 loops in function.
...

[ibook-dhum] source/dir_indu% gfc -O3 -ffast-math -funroll-loops -ftree-vectorizer-verbose=2 indu.v2.f90
...
indu.v2.f90:2334: note: LOOP VECTORIZED.
indu.v2.f90:2245: note: not vectorized: unsupported unaligned store.
indu.v2.f90:2244: note: vectorizing stmts using SLP.
indu.v2.f90:2244: note: LOOP VECTORIZED.
indu.v2.f90:2158: note: LOOP VECTORIZED.
indu.v2.f90:2069: note: not vectorized: unsupported unaligned store.
indu.v2.f90:2068: note: vectorizing stmts using SLP.
indu.v2.f90:2068: note: LOOP VECTORIZED.
indu.v2.f90:1976: note: not vectorized: complicated access pattern.
indu.v2.f90:1875: note: vectorized 4 loops in function.
indu.v2.f90:1825: note: LOOP VECTORIZED.
indu.v2.f90:1771: note: not vectorized: unsupported unaligned store.
indu.v2.f90:1770: note: vectorizing stmts using SLP.
indu.v2.f90:1770: note: LOOP VECTORIZED.
indu.v2.f90:1691: note: LOOP VECTORIZED.
indu.v2.f90:1633: note: not vectorized: unsupported unaligned store.
indu.v2.f90:1632: note: vectorizing stmts using SLP.
indu.v2.f90:1632: note: LOOP VECTORIZED.
indu.v2.f90:1543: note: not vectorized: complicated access pattern.
indu.v2.f90:1441: note: vectorized 4 loops in function.
...

Where the nested loop vectorized without -m64 at line 1691 is:

...
do j = 1, 9
   c_vector(3) = 0.5_longreal * h_coil * z1gauss(j)
   !
   ! rotate coil vector into the global coordinate system and translate it
   !
   rot_c_vector(1) = rot_i_vector(1) + rotate_coil(1,3) * c_vector(3)
   rot_c_vector(2) = rot_i_vector(2) + rotate_coil(2,3) * c_vector(3)
   rot_c_vector(3) = rot_i_vector(3) + rotate_coil(3,3) * c_vector(3)
   !
   do k = 1, 9   ! <==== line 1691
      !
      ! rotate quad vector into the global coordinate system
      !
      rot_q_vector(1) = rot_q1_vector(k,1) - rot_c_vector(1)
      rot_q_vector(2) = rot_q1_vector(k,2) - rot_c_vector(2)
      rot_q_vector(3) = rot_q1_vector(k,3) - rot_c_vector(3)
      !
      ! compute and add in quadrature term
      !
      numerator = dotp * w1gauss(j) * w2gauss(k)
      dotp2 = rot_q_vector(1)*rot_q_vector(1) + rot_q_vector(2)*rot_q_vector(2) + &
              rot_q_vector(3)*rot_q_vector(3)
      denominator = sqrt(dotp2)
      l12_lower = l12_lower + numerator/denominator
   end do
end do
...
I had a closer look at the test case in the original bug report, and it appears that the vectorizer fails when it attempts to vectorize __builtin_cexpi().

Regarding the different behavior of the vectorizer with -m32 and -m64 described in comment #5: it appears that with -m32 the inner loop is not unrolled and thus is vectorized later, while with -m64 the inner loop gets completely unrolled, so there is no chance to vectorize it later.
Regarding the missing SLP vectorization of reductions, I have opened a new bug report: PR37027.

Regarding the non-vectorized call to __builtin_cexpi(), we already have bug report PR22226 about missing support for generic vector calls to the IBM MASSV and Intel VML vector libraries. Thus, I propose to close this PR as a duplicate of PR22226.
*** This bug has been marked as a duplicate of 22226 ***