410.bwaves in shell_lam.f has a lot of arrays with inner dimension 5 operated on in loops that are either unrolled by early unrolling or manually unrolled in source. All but one loop in shell_lam.f are not vectorized. One reason is that basic-block vectorization gives up if it sees interleaving size that is not a multiple of a supported vectorization factor.

Testcase:

double a[1024], b[1024];

void foo (int k)
{
  int j;
  a[k*5 + 0] = a[k*5 + 0] + b[k*5 + 0];
  a[k*5 + 1] = a[k*5 + 1] + b[k*5 + 1];
  a[k*5 + 2] = a[k*5 + 2] + b[k*5 + 2];
  a[k*5 + 3] = a[k*5 + 3] + b[k*5 + 3];
  a[k*5 + 4] = a[k*5 + 4] + b[k*5 + 4];
}

taken from the last loop in shell_lam.f which has its innermost loop unrolled (and loop SLP refuses to vectorize as well, see separate bug). For the above we get:

t.c:6: note: === vect_analyze_data_ref_accesses ===
t.c:6: note: Detected interleaving of size 5
t.c:6: note: Detected interleaving of size 5
t.c:6: note: Detected interleaving of size 5
t.c:6: note: Vectorizing an unaligned access.
t.c:6: note: Vectorizing an unaligned access.
t.c:6: note: Vectorizing an unaligned access.
t.c:6: note: === vect_analyze_slp ===
t.c:6: note: get vectype with 2 units of type double
t.c:6: note: vectype: vector(2) double
t.c:6: note: Build SLP failed: unrolling required in basic block SLP
t.c:6: note: Failed to SLP the basic block.
t.c:6: note: not vectorized: failed to find SLP opportunities in basic block.

but of course we could simply vectorize with an interleaving size of 4 leaving the excess operations unvectorized (with optimization opportunity if we can pick a properly sized and aligned set of accesses).
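To make the suggested split concrete, here is a hand-written sketch of roughly what the testcase could look like if four of the five accesses were handled as two vector(2) double operations and the fifth stayed scalar. This is only an illustration, not what the vectorizer emits: the name foo_split is hypothetical, the v2df typedef relies on the GCC/Clang vector_size extension, and memcpy stands in for unaligned vector loads and stores.

```c
#include <string.h>

double a[1024], b[1024];

/* Two doubles per vector, via the GCC/Clang vector_size extension.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical hand-split version of foo: elements 0..3 of the group
   of 5 are processed as two 2-lane vector adds, element 4 is left as
   a scalar add.  */
void foo_split (int k)
{
  double *pa = &a[k*5];
  const double *pb = &b[k*5];
  v2df va, vb;
  int i;

  for (i = 0; i < 4; i += 2)
    {
      memcpy (&va, pa + i, sizeof va);   /* unaligned vector load */
      memcpy (&vb, pb + i, sizeof vb);   /* unaligned vector load */
      va = va + vb;                      /* one add covers two elements */
      memcpy (pa + i, &va, sizeof va);   /* unaligned vector store */
    }

  /* The excess fifth operation stays unvectorized.  */
  pa[4] = pa[4] + pb[4];
}
```

Which four of the five elements go into the vector part is a free choice here; the sketch simply takes the first four and does not try to pick an aligned subset.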

The loop that remains after fixing PR49957 in 410.bwaves is the following, which loop SLP does not handle (well, I'm not exactly sure) because

t.f:18: note: ==> examining statement: t1_62 = *q_61(D)[D.1645_60];

t.f:18: note: num. args = 4 (not unary/binary/ternary op).
t.f:18: note: vect_is_simple_use: operand *q_61(D)[D.1645_60]
t.f:18: note: not ssa-name.
t.f:18: note: use not simple.
t.f:18: note: no array mode for V2DF[5]
t.f:18: note: the size of the group of strided accesses is not a power of 2
t.f:18: note: not vectorized: relevant stmt not supported: t1_62 = *q_61(D)[D.1645_60];

t.f:18: note: bad operation or unsupported loop bound.
t.f:1: note: vectorized 0 loops in function.

probably the issue that we can't handle this kind of "invariants" in the SLP group? Thus, the SLP group should be q(2,..), q(3,...) ... q(5, ...) which is size 4, q(1,..) should be treated as invariant.

      subroutine shell(nx,ny,nz,q,dt,cfl,dx,dy,dz)
      implicit none
      integer nx,ny,nz,n,i,j,k
      real*8 cfl,dx,dy,dz,dt
      real*8 gm,Re,Pr,cfll,t1,t2,t3,t4,t5,t6,t7,t8,mu
      real*8 q(5,nx,ny,nz)
C This particular problem is periodic only
      cfll=0.1d0+(n-1.0d0)*cfl/20.0d0
      if (cfll.ge.cfl) cfll=cfl
      t8=0.0d0
      do k=1,nz
        do j=1,ny
          do i=1,nx
            t1=q(1,i,j,k)
            t2=q(2,i,j,k)/t1
            t3=q(3,i,j,k)/t1
            t4=q(4,i,j,k)/t1
            t5=(gm-1.0d0)*(q(5,i,j,k)-0.5d0*t1*(t2*t2+t3*t3+t4*t4))
            t6=dSQRT(gm*t5/t1)
            mu=gm*Pr*(gm*t5/t1)**0.75d0*2.0d0/Re/t1
            t7=((dabs(t2)+t6)/dx+mu/dx**2)**2 +
     1         ((dabs(t3)+t6)/dy+mu/dy**2)**2 +
     2         ((dabs(t4)+t6)/dz+mu/dz**2)**2
            t7=DSQRT(t7)
            t8=max(t8,t7)
          enddo
        enddo
      enddo
      dt=cfll / t8
      return
      end

(In reply to comment #0)
> but of course we could simply vectorize with an interleaving size of 4
> leaving the excess operations unvectorized (with optimization opportunity
> if we can pick a properly sized and aligned set of accesses).

Right. I even had a patch for this some time ago. I can try to bring it to life.

Ira

(In reply to comment #1)
> The loop that remains after fixing PR49957 in 410.bwaves is the following,
> which loop SLP does not handle (well, I'm not exactly sure) because
>
> t.f:18: note: ==> examining statement: t1_62 = *q_61(D)[D.1645_60];
>
> t.f:18: note: num. args = 4 (not unary/binary/ternary op).
> t.f:18: note: vect_is_simple_use: operand *q_61(D)[D.1645_60]
> t.f:18: note: not ssa-name.
> t.f:18: note: use not simple.
> t.f:18: note: no array mode for V2DF[5]
> t.f:18: note: the size of the group of strided accesses is not a power of 2
> t.f:18: note: not vectorized: relevant stmt not supported: t1_62 =
> *q_61(D)[D.1645_60];
>
> t.f:18: note: bad operation or unsupported loop bound.
> t.f:1: note: vectorized 0 loops in function.
>
> probably the issue that we can't handle this kind of "invariants" in the
> SLP group? Thus, the SLP group should be q(2,..), q(3,...) ... q(5, ...)
> which is size 4, q(1,..) should be treated as invariant.
>

This loop is not SLPed because there is no SLP opportunity here besides the loads. The only isomorphism after that is

      t2=q(2,i,j,k)/t1
      t3=q(3,i,j,k)/t1
      t4=q(4,i,j,k)/t1

and somewhat here

      t7=((dabs(t2)+t6)/dx+mu/dx**2)**2 +
     1   ((dabs(t3)+t6)/dy+mu/dy**2)**2 +
     2   ((dabs(t4)+t6)/dz+mu/dz**2)**2

but these are groups of 3.

Moreover, the current implementation starts building SLP tree from a group of strided stores, or a group of reductions, or a reduction chain. None of these exist here. But, again, even if we could start from a group of loads, it wouldn't help us much here anyway.

Ira
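For readers less familiar with the SLP seeds mentioned above, the following small C sketch contrasts the two loop shapes. It is an illustration only: the arrays, N, and both function names are hypothetical, and nothing here claims how any particular GCC version treats these loops. The first loop has a group of adjacent (strided) stores of the kind Ira names as a starting point; the second has the shape of the bwaves loop, with grouped loads feeding a single max reduction and no stores at all.

```c
#define N 256

double in[2*N], out[2*N];

/* Two adjacent stores per iteration form a store group of size 2,
   the kind of group an SLP tree can be built from.  */
void store_group (void)
{
  for (int i = 0; i < N; i++)
    {
      out[2*i]     = in[2*i]     * 2.0;
      out[2*i + 1] = in[2*i + 1] * 2.0;
    }
}

/* Same grouped loads, but the only result is one max reduction:
   no group of stores, no group of reductions, no reduction chain.  */
double load_group_only (void)
{
  double m = 0.0;
  for (int i = 0; i < N; i++)
    {
      double t1 = in[2*i];           /* plays the role of q(1,...) */
      double t2 = in[2*i + 1] / t1;  /* plays the role of q(2,...)/t1 */
      m = t2 > m ? t2 : m;
    }
  return m;
}
```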