The test case below is derived from the "air.f90" benchmark of Polyhedron. What I see is that vectorization makes "air" run about 16% faster with ICC than with GCC, but I am not sure whether all of that comes from vectorization alone. While analysing the assembly differences, I found that GCC does not vectorize the case below, whereas ICC does.

(Snip)
      DIMENSION NPX(30) , NPY(30)
      COMMON /XD1 / MXPy , NDX
      COMMON /XD2 / MXPx
      MXPx = 0
      DO i = 1 , NDX
         MXPx = MXPx + NPX(i)+1
      ENDDO
!
      END
(Snip)

Machine: x86_64-unknown-linux-gnu
GCC revision: 183151
ICC revision: 12.1.0.233 Build 2

gcc -Ofast -march=corei7-avx -limf -lsvml -L /tool/intel/lib/intel64/ -mveclibabi=svml pattern1.f90 -ftree-vectorizer-verbose=2 -S

Analyzing loop at pattern1.f90:5
5: not vectorized: unsupported use in stmt.
5: not vectorized: unsupported use in stmt.
pattern1.f90:9: note: vectorized 0 loops in function.

ifort -march=corei7-avx -O3 -limf -lsvml -L /tool/intel/lib/intel64/ pattern1.f90 -vec-report -S -fsource-asm

pattern1.f90(5): (col. 7) remark: LOOP WAS VECTORIZED.

For the expression

      MXPx = MXPx + NPX(i)+1

ICC converts the constant "1" into a vector packet:

.L_2il0floatpacket.0:
        .long   0x00000001,0x00000001,0x00000001,0x00000001

With that, the whole expression becomes vectorizable. The vectorized portion of the loop in ICC's assembly looks like this:

        vmovdqu   .L_2il0floatpacket.0(%rip), %xmm0
..B1.5:                                 # Preds ..B1.5 ..B1.4
        vpaddd    _unnamed_main$_$NPX.0.1(,%rax,4), %xmm0, %xmm2   #6.10
        addq      $4, %rax                                         #5.7
        vpaddd    %xmm2, %xmm1, %xmm1                              #6.22
        cmpq      %rdx, %rax                                       #5.7
        jb        ..B1.5        # Prob 96%                         #5.7

Please share your thoughts on this, and on a possible vectorization improvement in GCC for this pattern.
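For reference, here is a self-contained variant of pattern1.f90 that reproduces the report (a sketch: the PROGRAM wrapper and the initializations of NDX and NPX are my assumptions, not part of the original codelet):

      PROGRAM pattern1
      DIMENSION NPX(30) , NPY(30)
      COMMON /XD1 / MXPy , NDX
      COMMON /XD2 / MXPx
      NDX = 30                       ! assumed value, for illustration
      DO i = 1 , NDX
         NPX(i) = i                  ! assumed data, for illustration
      ENDDO
      MXPx = 0
      DO i = 1 , NDX                 ! the reduction GCC fails to vectorize
         MXPx = MXPx + NPX(i) + 1
      ENDDO
      PRINT * , MXPx
      END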
I confirm that the loop in the codelet is not vectorized (it is if the "+1" is removed from the loop body and accounted for afterwards as "MXPx = MXPx + NDX"; see the sketch at the end of this comment).

> but I am not sure if all that comes from vectorization alone.

Making the above change lets the loop vectorize, but does not improve the timing. One "property" of the air.f90 test is that it has several nested loops with "bad" nesting: the inner loop runs over the second, slow-varying index, so the accesses are not unit-stride in Fortran's column-major layout. I don't know if this was done deliberately to test compilers, but if I interchange the loops manually as in

--- air.f90	2009-08-28 14:22:26.000000000 +0200
+++ air_v1.f90	2005-11-09 17:33:12.000000000 +0100
@@ -400,8 +400,8 @@
 !
 ! COMPUTE THE FLUX TERMS
 !
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
 !
 ! compute vanleer fluxes
 !
@@ -657,8 +657,8 @@
       ENDDO
 !
 ! COMPUTE THE FLUX TERMS
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
 !
 ! compute vanleer fluxes
 !
@@ -838,8 +838,8 @@
 ! FIND THE LOCAL TIME STEPS
 !
       dt = 100
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
             as = DSQRT(P(i,j)/RHO(i,j)*GMA)
             rdltx = RHO(i,j)*DABS(U(i,j))*ddx(i,j)/xmu(i,j)
             rdlty = RHO(i,j)*DABS(V(i,j))*ddy(i,j)/xmu(i,j)
@@ -880,13 +880,13 @@
       DO iy = 1 , NDY
          maxy = maxy + NPY(iy) + 1
          dtd = 100.0
-         DO i = minx , maxx
-            DO j = miny , maxy
+         DO j = miny , maxy
+            DO i = minx , maxx
                IF ( dtt(i,j).LE.dtd ) dtd = dtt(i,j)
             ENDDO
          ENDDO
-         DO i = minx , maxx
-            DO j = miny , maxy
+         DO j = miny , maxy
+            DO i = minx , maxx
                dtt(i,j) = dtd
             ENDDO
          ENDDO
@@ -958,8 +958,8 @@
       con2 = 0.0
       con3 = 0.0
       con4 = 0.0
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
             con1 = con1 + DABS(u1(i,j)-u1o(i,j))/dtt(i,j)
             con2 = con2 + DABS(u2(i,j)-u2o(i,j))/dtt(i,j)
             con3 = con3 + DABS(u3(i,j)-u3o(i,j))/dtt(i,j)

the timing goes from

[macbook] lin/test% time a.out > /dev/null
7.233u 0.023s 0:07.25 100.0%	0+0k 0+8io 0pf+0w

to

[macbook] lin/test% time a.out > /dev/null
6.353u 0.021s 0:06.37 100.0%	0+0k 0+8io 0pf+0w

I have made a few attempts to get gfortran to do these loop interchanges using Graphite, without success so far.
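For clarity, the workaround mentioned at the top of this comment looks roughly like this (my reconstruction; the exact edit is not shown above): the loop-invariant "+1" is hoisted out and accumulated once after the loop, leaving a plain sum reduction that the vectorizer already recognizes:

      MXPx = 0
      DO i = 1 , NDX
         MXPx = MXPx + NPX(i)        ! plain reduction: vectorized
      ENDDO
      MXPx = MXPx + NDX              ! account for the NDX copies of "+1"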
We don't recognize this as a reduction because we look for

  a1 = phi < a0, a2 >
  a3 = ...
  a2 = operation (a3, a1)

and here we have

  a1 = phi < a0, a2 >
  a3 = ...
  a4 = operation (a3, a1)
  a2 = a4 + 1
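Concretely, for this codelet the statement MXPx = MXPx + NPX(i) + 1 associates left to right, so it lowers to something like the second shape above (an illustrative mapping in the same notation, not an actual GIMPLE dump):

  a1 = phi < MXPx_init, a2 >    ; running sum
  a3 = NPX[i]
  a4 = a1 + a3                  ; MXPx + NPX(i)
  a2 = a4 + 1                   ; the trailing "+1"

The PHI result a1 no longer feeds the last statement of the chain directly, which is why the pattern fails to match.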
(In reply to comment #2)
> We don't recognize this as a reduction because we look for
> 
>   a1 = phi < a0, a2 >
>   a3 = ...
>   a2 = operation (a3, a1)
> 
> and here we have
> 
>   a1 = phi < a0, a2 >
>   a3 = ...
>   a4 = operation (a3, a1)
>   a2 = a4 + 1

It seems reassociation should "fix" this.  IIRC we had some special code in there that was supposed to handle this.
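At the source level, the reassociation suggested here would amount to regrouping the constant with the loaded value, so that the reduction variable is again a direct operand of the final addition (a sketch; whether explicit parentheses in the Fortran source survive into GIMPLE in this form is an assumption on my part):

      DO i = 1 , NDX
         MXPx = MXPx + (NPX(i)+1)    ! a3 = NPX(i)+1 ; a2 = a1 + a3
      ENDDO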
Link to vectorizer missed-optimization meta-bug.
(In reply to Richard Biener from comment #3)
> (In reply to comment #2)
> > We don't recognize this as a reduction because we look for
> > 
> >   a1 = phi < a0, a2 >
> >   a3 = ...
> >   a2 = operation (a3, a1)
> > 
> > and here we have
> > 
> >   a1 = phi < a0, a2 >
> >   a3 = ...
> >   a4 = operation (a3, a1)
> >   a2 = a4 + 1
> 
> It seems reassociation should "fix" this.  IIRC we had some special code
> in there that was supposed to handle this.

We do reassociate the equivalent function in C, for example

float x[30];

float bar ()
{
  float s = 0;
  for (int i = 0; i < 30; i++)
    {
      s += x[i];
      s += 1;
    }
  return s;
}

which vectorizes just fine.
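For comparison, a direct Fortran transcription of that C loop would be the sketch below (my translation, assuming default REAL types). Note that the two separate statements lower to the same add chain as the original single expression MXPx = MXPx + NPX(i) + 1, so the shapes being compared really are equivalent:

      REAL :: x(30) , s
      s = 0
      DO i = 1 , 30
         s = s + x(i)
         s = s + 1
      ENDDO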
Long fixed.